Global Register Allocation
Current main memory chips have access times on the order of 60ns to 70ns.
When one adds the time it takes for a memory request to pass from the
processor through the system bus and then the memory controllers and
decode logic, the memory access time can increase to 100ns or more.
A processor that runs with a 100MHz clock has a cycle time of 10ns.
If we assume that an addition can be executed in a single processor cycle,
then an ADD instruction that takes one of its operands from main memory
might spend 100ns waiting for that operand and only 10ns doing the addition.
The overall time required to complete a program would then be determined
almost entirely by the memory access time; increasing the processor speed
would have very little effect.
Hardware designers mitigate this problem by providing
caches
on the processor chip to buffer memory references.
Although the cache does an excellent job of reducing the cost of memory
operations, it is vital for the compiler to make best use of the available
register set to shorten execution time still further.
We have already seen how a non-optimizing compiler allocates registers
within a computation to minimize the number of registers used.
This strategy is appropriate when the target machine has few registers; in
the case of a machine with many registers (like the MIPS processor assumed
in the earlier example), an optimizing compiler must take a broader view
of the problem.
Simple example
Let's return to the
example illustrating competent code generation
using only local information:
int
GCD(int x, int y)
{ while (!(x == y))
if (x > y) x = x - y; else y = y - x;
return x;
}
The final MIPS assembly code generated for
this program was:
main: sub $sp,$sp,4
sw $fp,($sp)
move $fp,$sp
sub $sp,$sp,8
li $2,5
syscall
sw $2,-4($fp)
li $2,5
syscall
sw $2,-8($fp)
j L1
L2:
lw $4,-8($fp)
lw $5,-4($fp)
ble $5,$4,L3
lw $4,-8($fp)
lw $5,-4($fp)
sub $4,$5,$4
sw $4,-4($fp)
j L4
L3:
lw $4,-4($fp)
lw $5,-8($fp)
sub $4,$5,$4
sw $4,-8($fp)
L4:
L1:
lw $4,-8($fp)
lw $5,-4($fp)
bne $5,$4,L2
lw $4,-4($fp)
j L5
L5:
li $2,1
syscall
move $sp,$fp
lw $fp,($sp)
add $sp,$sp,4
j $ra
[Discuss-
- Every operation involves an access to memory for each operand
- Every arithmetic operation involves an access to memory for the result
- This sort of structure is not atypical -- most assignments
have no more than one operator (think about your own programs!)
]
The result of global register allocation
It's clear by inspection that we could implement the program using
only two registers (one for x and the other for y):
main: sub $sp,$sp,4
sw $fp,($sp)
move $fp,$sp
sub $sp,$sp,8
li $2,5
syscall
add $4,$2,0
li $2,5
syscall
add $5,$2,0
j L1
L2:
ble $4,$5,L3
sub $4,$4,$5
j L4
L3:
sub $5,$5,$4
sw $4,-8($fp)
L4:
L1:
bne $4,$5,L2
j L5
L5:
li $2,1
syscall
move $sp,$fp
lw $fp,($sp)
add $sp,$sp,4
j $ra
[Discuss-
- Certain registers have fixed uses and are not allocated
- Use of add to move the content of a register
- Reservation of storage for the activation record, even though it isn't
used
]
Restrictions on allocation
We can only allocate a variable to a register if it fits into a register
and if we have control over all accesses of that variable.
A rule of thumb is to consider only simple local variables whose addresses
are never computed.
[Discuss-
- Registers don't have addresses
- Aliasing: reference the same entity with several names
- Non-local variables may be allocated registers over a region where no
other accesses are possible
]
Allocation philosophy
There are two basic philosophies underlying register allocation:
- Variables have ``home'' memory locations, but we try to keep them in
registers for as much of their lifetimes as possible
- Variables are kept in registers, but from time to time it may be
necessary to ``spill'' them to memory
[Discuss-
- The ``home'' memory location is more in tune with the model that we
need for dealing with variables shared between procedures or aliased
- The ``spill'' approach is more reasonable when allocating registers for
value numbers, which don't have a natural ``home''
- We can always use the ``home'' model, only allocating an actual
``home'' when a spill occurs but then using it for all spills of
that entity
- It's possible to restrict the lifetime of a memory location in the
``spill'' approach, whereas the ``home'' approach usually leads to a
lifetime equal to the procedure activation
]
Economics
Allocating a variable to a register needs to make sense in terms of the
program's execution time.
We can characterize the costs for a particular situation in terms of three
parameters:
- LODSAVE
- The amount of execution time saved for a reference if the entity
referenced is in a register
- STRSAVE
- The amount of execution time saved for a definition if the entity
defined is in a register
- MOVCOST
- The cost of moving a value between a register and memory (if the costs
of loads and stores are different, that must be accounted for)
For more information about these parameters, see
The Priority-Based Coloring Approach to Register Allocation,
ACM Transactions on Programming Languages and Systems 12 (1990) 501-536.
[Discuss-
- Here ``definition'' is taken in the FORTRAN sense of an operation that
sets a value
- LODSAVE and STRSAVE only characterize the costs in a single instruction
execution
]
Register allocation as graph coloring
Global register allocation considers a complete procedure and a set of
entities to be allocated among k registers such that two entities
that exist at the same time are allocated to different registers.
This problem can be interpreted as one of coloring an interference
graph:
- Each node of the interference graph corresponds to one entity
- There is an (undirected) edge between two nodes if the corresponding
entities exist at the same time
Adjacent nodes must be given distinct colors;
the colors assigned to the nodes are the registers.
[Discuss-
- Determining the chromatic number (minimum number of colors
needed to color a graph so that adjacent nodes have different colors) is
NP-complete
- Our problem is to find a coloring with a given number of
colors (the number of allocatable registers)
- Basic strategy to find k-coloring: delete nodes with fewer than k
neighbors, color in reverse order of deletion
(Register Allocation via Coloring,
Computer Languages 6 (1981) 47-57)
- Problem: Diamond graph can be colored with two colors but no node has
fewer than 2 neighbors
- Deleting a node arbitrarily works...
(Improvements to Graph Coloring Register Allocation,
ACM Transactions on Programming Languages and Systems 16 (1994) 428-455)
]
Register allocation algorithm
- Find the set of units making up the
live range for each entity.
- Construct the interference graph with one node
for each live range
- Delete nodes with fewer than k neighbors, pushing them
onto a stack.
If nodes remain, select one,
delete it from the graph and push it onto the stack.
- Pop nodes from the stack, allocating them to registers so that a node
does not get allocated the same register as any of its neighbors.
If this cannot be done, split the live range.
- If any live range was split in step (4), go to step (1).
[Discuss-
- Details of the algorithm depend on the engineering of the compiler
- We are free to choose the entities to be allocated
- We are free to choose the units for expressing a live range
- Basic blocks
- Program points
- Selection of the nodes with k neighbors is determined by economics
- Method of splitting is heuristic
- In practice, two iterations suffice for almost all problems
]
Register allocation data structure
Each live range has four properties:
- Set of basic blocks constituting the live range
- List of live units making up this live range
- List of neighboring ranges in the interference graph
- Length of the list of neighboring ranges
- Set of registers that must not be used for this live range
A live unit is the intersection of a live range and a basic block.
It summarizes the properties of the basic block that are important for this
live range:
- Number of loads for the entity
- Number of stores of the entity
- Whether the first access is a use or a def
- Whether the entity must be loaded on entry
- Whether the entity must be stored on exit
Each basic block also has a list of the live units making it up, so the
information for a live entity can be accessed either from its live range
or its basic block.
[Discuss-
- The last two properties of a live unit are not strictly necessary --
they could be computed on the fly
- An entity has a home in memory, so it must be loaded when its first
access is a use and a predecessor of the basic block is outside the live
range
- If the entity is changed and is live on exit to a successor not in the
live range, then it must be stored in its memory home on exit
]
Compute the set of basic blocks constituting
each live range
This is the first step of the
global register allocation algorithm.
We know what the entities that we want to allocate are, so we can begin by
creating a live range object for each.
The set of basic blocks belonging to the live range
are then determined by the following function:
LiveRange(x) =
{b | for some p in b,
(used(p,x) or
def(p,x) or
(Live(p,x) and Reaching(p,x)))}
As the basic blocks making up the live range are determined,
live unit information should be computed for them and
added to the live unit list for the live range.
[Discuss-
- The live range consists of the places where the variable is referenced
and the places where it must be preserved
- If a variable is not referenced in a basic block, it must be preserved
only if a definition reaching that point is used later
- used(p,x) is true if variable x is used at program
point p
- def(p,x) is true if variable x is changed at program
point p
- We assume that if a variable is both used and assigned at a program
point, the use precedes the definition (i.e. multiple assignments are the
most complex program points)
- If a source language allows assignments within an expression, and
also allows uses of the variable assigned, then the execution
order of those operations must be fixed by the language definition
- When the order of two computations is fixed by the language definition,
the translator will invariably render them as distinct program points
because the decision about their order cannot be deferred
- Live(p,x) is a monotone data flow analysis problem:
- Reaching(p,x) is a monotone data flow analysis problem:
- L = ({Reaching, NotReaching}, Reaching, NotReaching, meet)
- The meet operation is applied over the predecessors of a node,
and is defined by the following table:
| meet | Reaching | NotReaching
|
|---|
| Reaching
| Reaching
| NotReaching
|
|---|
| NotReaching
| NotReaching
| NotReaching
|
|---|
- The set F contains two functions,
one 0-ary (constant) and one 1-ary:
- The mapping from a program point p to an element of F
is:
def(p,x) ? f0 : f1
]
The GCD example
For this example, we'll start with the
abstract target tree and interpret the
lw and sw nodes as markers for variable uses and
variable definitions respectively:
Mips(
Entry(8),
ReadInt(x),
ReadInt(y),
j(L1),
Label(L2),
CJump(ble(lw(x),lw(y)),L3),
IgnoreInt(sw(x,sub(lw(x),lw(y)))),
j(L4),
Label(L3),
IgnoreInt(sw(y,sub(lw(y),lw(x)))),
Label(L4),
Label(L1),
CJump(bne(lw(x),lw(y)),L2),
ReturnInt(lw(x)),
j(L5),
Label(L5),
PrintInt(),
Exit())
The entities assigned to registers will be the variables and the units
expressing a live range will be basic blocks.
| Block | Pred | Succ | used(x) | used(y) | def(x) | def(y) | Content
|
|---|
| 1
| Entry
| 5
| false
| false
| true
| false
| ReadInt(x)
|
| false
| false
| false
| true
| ReadInt(y)
|
| 2
| 5
| 3,4
| true
| true
| false
| false
| CJump(ble(lw(x),lw(y)),L3)
|
| 3
| 2
| 5
| true
| true
| true
| false
| IgnoreInt(sw(x,sub(lw(x),lw(y))))
|
| 4
| 2
| 5
| true
| true
| false
| true
| IgnoreInt(sw(y,sub(lw(y),lw(x))))
|
| 5
| 1,3,4
| 2,6
| true
| true
| false
| false
| CJump(bne(lw(x),lw(y)),L2)
|
| 6
| 5
| Exit
| true
| false
| false
| false
| ReturnInt(lw(x))
|
[Discuss-
- The live range for variable x is the set of all blocks,
the live range for variable y is the set {1, 2, 3, 4, 5}
- Live and Reaching don't affect this computation
]
Construct the interference graph
This is the second step of the
global register allocation algorithm.
At this point each live range object has the complete
set of basic blocks available, so all we need to do is
to go through and intersect each pair of sets.
If the intersection is not empty, an interference arc is added by placing
each live range on the others' list of
interfering live ranges.
Select a k-neighbor node to delete
This may be required during the third step of the
global register allocation algorithm.
A priority
(The Priority-Based Coloring Approach to Register Allocation,
ACM Transactions on Programming Languages and Systems 12 (1990) 501-536)
can be attached to each live range, showing which of them is estimated to
give the best results when allocated to a register.
When we need to choose a k-neighbor node to delete from the graph and push
onto the stack, we should select the live range with the lowest
priority.
The savings we get when a variable is in a register during a basic block
is determined from the machine-dependent parameters:
SavingsLiveUnit = LODSAVE*uses + STRSAVE*defs - MOVCOST*n
The savings must be weighted by the frequency, since we're interested in
minimizing the execution time:
WeightLiveUnit = 5Loop nesting depthLiveUnit
The priority is then the normalized, weighted savings over all live units
in the live range:
PrLiveRange =
(sum over LiveUnit in LiveRange
(SavingsLiveUnit * WeightLiveUnit)
) / NumberOfLiveUnitsLiveRange
[Discuss-
- The earlier a live range is pushed onto the stack, the later it will be
assigned a register.
Thus we want to push the live range for which it is least important to have
a register.
- The value of n by which MOVCOST is multiplied will be
0, 1 or 2:
- 2 if the value must be loaded at the beginning of the basic block and
stored at the end
- 1 if only one of these operations is needed
- 0 if neither is needed
- The depth of a basic block can be provided as additional information by
the translator of a language containing loop constructs.
Alternatively, the optimizer can analyze the control flow graph for nested
strongly-connected regions.
]
Splitting a live range
This may be required during the fourth step of the
global register allocation algorithm.
Initially, a single register is allocated to an entity during the whole of
its existence.
Perhaps fewer conflicts would result if its lifetime were split into two
distinct pieces and different registers used in each piece.
When splitting a live range, each piece will ideally start with a
definition of the entity (so that there will be no need for an extra load),
and will consist of a set of contiguously-executed basic blocks.
The following algorithm is taken from
The Priority-Based Coloring Approach to Register Allocation,
ACM Transactions on Programming Languages and Systems 12 (1990) 501-536:
- If there is a live unit in the range to be split that has free registers
and begins with a definition of the entity corresponding to the range,
select it and go to step 3.
- If there is a live unit in the range to be split that has free
registers, select it and go to step 3.
Otherwise discard this live range as a candidate for register allocation.
- Initialize the set of registers used to the set of registers used in
the basic block of the selected live unit.
- Select successor live units, breadth-first.
For each selection, check that the union of the set of registers used and
the set of registers used in the basic block has fewer than k members.
Go to step 5 when no such selection is possible.
- Remove the live units selected in steps 1, 2 and 4 from the live range.
The selected live units constitute one of the new live ranges, the live
units remaining in the original live range constitute the other.
Update the information in a live unit description
about whether the entity must be loaded on entry and/or stored on exit.
[Discuss-
- If some ranges cannot be split, but others can, then those that can't
could be passed along to the next iteration as candidates
rather than being discarded
]
Allocating registers to values instead of variables
The set of variables corresponding to live ranges was constrained by
what kind of variable it was.
Different constraints apply to the selection of values as candidates for
live ranges:
- Values appearing as the arguments or result of a phi function represent
different states of a single variable, and should therefore be combined
into a single live range.
- Except for values appearing in phi functions, only values with more
than one use should be considered.
[Discuss-
- Combination of values in phi functions is a disjoint UNION-FIND problem
(The Design and Analysis of Computer Algorithms,
Addison-Wesley, ISBN 0-206-00029-6, 1975, 124-139)
- Effective split of local and global register allocation
- Values not appearing in phi functions are local to a single expression
if they are only used once
]
Rematerialization
One problem with allocating registers to values instead of variables is
that the global value numbering process will identify a large number of
simple addressing computations as being redundant.
Depending upon the circumstances, it may be cheaper to redo these
computations (``rematerialize their values'') than it is to save and
restore their values.
For a more complete discussion, see Rematerialization,
SIGPLAN Notices 27, 7 (July, 1992) 311-321.
[Discuss-
- Rematerialization is usually limited to simple computations whose
operands are available everywhere
- Some such computations may show negative savings in many live units
]