Code Motion
Global value numbering enables us to decide whether a computation at a
given program point is redundant or not, and therefore whether to eliminate
it.
Another way of reducing the number of computations a program carries out is
to move them from frequently-executed parts of the program to parts that
are executed less often.
There are, of course, constraints on such moves that must be checked by
checking global relationships among computations.
Example
Embedding our array calculation in a loop illustrates the problems and
potential profits:
i = l1 + 1;
do {
j = l2 + 1;
do {
k = l3 + 1;
do {
a[i,j,k] = ( a[i-1,j,k] + a[i+1,j,k] +
a[i,j-1,k] + a[i,j+1,k] +
a[i,j,k-1] + a[i,j,k+1] ) / 6.0;
k++;
} while (k < u3);
j++;
} while (j < u2);
i++;
} while (i < u1);
Recall that the array assignment computed
i*C1 + j*C2 + k*C3
seven times.
[Discuss-
- Apologies for the shameful mixture of languages
- All of the recomputations are eliminated by global value numbering
- i*C1 + j*C2 + k*C3 still
appears in the innermost loop
- i*C1 really only needs to be computed once for each
trip around the outer loop
- Thus if the array were 50 x 50 x 50, i*C1 would
be computed 125,000 times instead of 50 times
]
Control flow graph and dominator tree
The dominance frontier of each dominator tree node is given in square
brackets beside the node:
Here are the tuples, grouped into basic blocks numbered according to the
diagram.
Only the ones we've already seen from the array assignment are shown,
and all program point numbers are omitted.
Moreover, the value numbers of the original tuples have been preserved and
new ones assigned to the new tuples.
Finally, i, j and k without subscripts represent
the initial variables; subscripted occurrences represent renamings
due to assignments within the for-loops:
1:
V11:=l1+1
i1:=V11
Jump(2)
2:
i2:=phi(i1,i3);
j1:=phi(j,j4);
k1:=phi(k,k5);
V21:=l2+1
j2:=V21
Jump(3)
3:
j3:=phi(j2,j4);
k2:=phi(k1,k5);
V31:=l3+1
k3:=V31
Jump(4)
4:
k4:=phi(k3,k5);
V1:=i2
V2:=V1*C1
V3:=j3
V4:=V3*C2
V5:=V2+V4
V6:=k4
V7:=V6*C3
V8:=V5+V7
V9:=V8-C01
V10:=a[V9]
...
V41:=V6+1
k5:=V41
V42:=V41<u3
JumpT(4,V42)
JumpF(5,V42)
5:
V51:=V3+1
j4:=V51
V52:=V51<u2
JumpT(3,V52)
JumpF(6,V52)
6:
V61:=V1+1
i3=V61
V62:=V61<u1
JumpT(2,V62)
JumpF(exit,V62)
[Discuss-
- Note that numbers like l1+1 are constants -- the
addition can be done at compile time
- Linkage among the basic blocks is explicit
]
Constraints on code motion
In general, one cannot move a computation
- backwards past the computation of any of its operands
- forwards past any computation using its value
- onto a path along which it would not have been computed
by the original program.
[Discuss-
- It may be possible to move the computation of an operand backward,
or the use of a result forward, and thus gain more freedom
- Moving a computation where it wouldn't have been done is usually a
``pessimization'' of the program
- If a computation that could lead to an error is moved onto a path along
which it would not have been computed by the original program, the
optimized program can fail when the unoptimized one does not
- It is not necessary to preserve the failure position, so we can move
computations that may fail
- Optimizations might delete a computation that fails, creating
an optimized program that runs where the unoptimized one fails.
- On some machines, speculative execution of code that will not
fail is useful -- it covers time required to access memory for a new
address
]
Example
Consider V1 and V2, which are
computed in block 4:
- Every path from blocks 1, 2 and 3 to the exit passes through block 2,
so the computation of V1 and V2
could be moved to any of these blocks without moving it onto a path along
which it might not be computed
- One of the operands of V1 is computed in block 2,
so V1 and V2 cannot be moved into
block 1
Thus the lumped computation could be carried out in any of the three
starred blocks.
[Discuss-
- ``Earliest'' placement is in block 2
- If the computation is done in block 2, then it need not be done in
block 4
- ``Code motion'' is really insert and delete
]
A more complex example
This example is taken from
Lazy code motion,
ACM SIGPLAN Notices, 27 (July, 1992) 224-234.
[Discuss-
- Find the earliest computation points for v (2 and 5)
- Note that this results in the fewest computations of
v
- Discuss the lifetime of v and the concept of register
pressure
- Note that the same number of computations of v will be
executed if computations are done in 2, 7, 14 and 15
- Reducing register pressure may avoid spills, so the extra computations
may cost nothing
]
Critical edges
A critical edge is an edge in the flow graph leading from a node
with multiple successors to a node with multiple predecessors.
Critical edges can cause problems for code motion.
Consider the implementation of
if (i != 0)
while (j > (k / i)) {
... /* No changes made to i or k here */
j--;
}
1:
V11:=i
V12:=V11!=0
JumpT(exit,V12)
JumpF(2,V12)
2:
j1:=phi(j,j2)
V21:=k
V22:=V21/V11
V23:=j1
V24:=V23>V22
JumpT(3,V24)
JumpF(exit,V24)
3:
...
V32:=j1-1
j2:=V32
Jump(2)
Both V21 and V22 are constant
within the loop, yet neither can be moved to block 1 because that would
move them onto a path along which they would not have been computed
by the original program.
[Discuss-
- Introducing the computation into block 1 would allow a division by 0
]
Removing critical edges
The problem can be easily resolved: Break the critical edge by introducing
an empty basic block:
[Discuss-
- The constant calculations in the loop exist on all paths from block 0
to the exit
- Moving them to block 0 gets them out of the loop
- We assume that critical edges will always be broken in this
way
- Simple transformation of the control flow graph before any further
processing is done
]
Pre-tested loops
The original example of the array calculation could have been written using
for statements instead of do-while:
for (i = l1 + 1; i < u1; i++)
for (j = l2 + 1; j < u2; j++)
for (k = l3 + 1; k < u3; k++)
a[i,j,k] = ( a[i-1,j,k] + a[i+1,j,k] +
a[i,j-1,k] + a[i,j+1,k] +
a[i,j,k-1] + a[i,j,k+1] ) / 6.0;
Here is the definition of a for statement:
for (i = l1 + 1; i < u1; i++) S;
i = l1 + 1;
while (i < u1) {
S;
i++;
}
[Discuss-
- Test at the beginning of the loop
- Initialization and test must be separate basic blocks
- Increment may belong to the same basic block as stuff from S
]
Control flow graph and dominator tree
The dominance frontier of each dominator tree node is given in square
brackets beside the node:
Here are the tuples, grouped into basic blocks numbered according to the
diagram:
1:
V11:=l1+1
i1:=V11
Jump(2)
2:
i2:=phi(i1,i3)
j1:=phi(j,j3)
V21:=i2
V22:=V21<u3
JumpT(3,V22)
JumpF(exit,V22)
3:
V31:=l2+1
j2:=V31
Jump(4)
4:
j3:=phi(j2,j4)
k1:=phi(k,k3)
V41:=j3
V42:=V41<u3
JumpT(5,V42)
JumpF(9,V42)
5:
V51:=l3+1
k2:=V51
Jump(6)
6:
k3:=phi(k2,k4)
V61:=k3
V62:=V61<u3
JumpT(7,V62)
JumpF(8,V62)
7:
V1:=i2
V2:=V1*C1
...
V71:=V6+1
k4:=V71
Jump(6)
8:
V81:=V41+1
j4:=V81
Jump(4)
9:
V61:=V21+1
i3=V61
Jump(2)
[Discuss-
- The computations V1 and V2
cannot be moved in this diagram because any move would place them on a path
along which they would not have been computed by the original program
- There are no critical edges
- Problem is that a pre-tested loop may not be executed at all
]
Landing pads
A partial solution to the problem is for the translator to modify the
implementation of the for statement by duplicating the
test:
[Discuss-
- This implementation produces a critical edge, from the first instance
of the test to the body
- The critical edge removal will insert an empty basic block on this edge
- The inserted block will act as a ``landing pad'' for calculations that
are constant within the loop
- Only the innermost loop will be affected, because the pre-test
of the innermost loop blocks motion out of the landing pad
]