Data Flow Analysis

We've seen how computations can be moved to other parts of the program to reduce redundant work at run time. These moves were relatively straightforward for a human to carry out, but we need the optimizer to do them without human intervention. Data flow analysis is the general set of techniques a compiler uses to gather the information that justifies such transformations.

A typical data flow analysis problem

For a more complete discussion of this problem, see Optimal Code Motion: Theory and Practice, ACM Transactions on Programming Languages and Systems, 16 (1994) 1117-1155.

One of the constraints placed on code motion is not to move a computation to a path where it would not have been executed in the original program. To check this constraint, the compiler might compute a predicate d-safe(p,v), which is true if a computation of v at program point p does not introduce a new value on a terminating path starting at p.

Evaluating d-safe(p,v)

From the intended meaning of d-safe(p,v) we can infer conditions under which it must hold. The local facts about what happens to a value at a program point can be captured by two local predicates:
used(p,v):	v appears in the tuple at p
through(p,v):	no operand of v appears before :=
		  in the tuple at p
Using these predicates, our inferred conditions can be described by:
d-safe(p,v) =
  used(p,v) or
    (through(p,v) and
     for all successors p' of p, d-safe(p',v))
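
As a concrete illustration, here is a minimal sketch of the two local predicates, assuming a tuple is represented as a (result, op, left, right) quadruple and a value v as an (op, left, right) triple; the representation is hypothetical, not part of the notes.

def used(tup, v):
    """True if the value v appears (is computed) in the tuple."""
    result, op, left, right = tup
    return (op, left, right) == v

def through(tup, v):
    """True if the tuple does not assign to any operand of v,
    so the value v is transparent through this point."""
    result, _, _, _ = tup
    _, left, right = v
    return result not in (left, right)

# Example: the value a+b at the tuples "t := a + b" and "a := a + 1"
v = ('+', 'a', 'b')
print(used(('t', '+', 'a', 'b'), v), through(('t', '+', 'a', 'b'), v))  # True True
print(used(('a', '+', 'a', '1'), v), through(('a', '+', 'a', '1'), v))  # False False
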
Let's try computing d-safe(p,v) for the following graph:

Solution algorithm

The description above provides a function for computing the value of d-safe(p,v) at a point, given the values at successor points; it does not give an algorithm for computing the values at all points in the graph. Here's an algorithm:
  1. Put all p onto a work list, and set all values of d-safe(p,v) to ``true''
  2. Take an element off of the work list and compute its value
  3. If the new value differs from the old, put all p depending on this value onto the work list if they are not there already
  4. If the work list is not empty, go to step 2
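
The following is a minimal sketch of this work list iteration applied to d-safe for a single value v, assuming the flow graph is given as successor lists and the local predicates as Boolean tables; the graph and tables are hypothetical.

def solve_dsafe(nodes, succ, used, through):
    # Step 1: everything starts on the work list, all values ``true''.
    dsafe = {p: True for p in nodes}
    worklist = list(nodes)
    # d-safe(p) depends on p's successors, so when d-safe(p) changes we
    # must revisit every q that has p as a successor.
    pred = {p: [q for q in nodes if p in succ[q]] for p in nodes}
    while worklist:                        # Step 4
        p = worklist.pop()                 # Step 2
        new = used[p] or (through[p] and all(dsafe[s] for s in succ[p]))
        if new != dsafe[p]:                # Step 3
            dsafe[p] = new
            for q in pred[p]:
                if q not in worklist:
                    worklist.append(q)
    return dsafe

# Tiny hypothetical graph 1 -> 2 -> 3: v is used only at node 3, and an
# operand of v is assigned at node 2.
nodes = [1, 2, 3]
succ = {1: [2], 2: [3], 3: []}
used = {1: False, 2: False, 3: True}
through = {1: True, 2: False, 3: True}
print(solve_dsafe(nodes, succ, used, through))   # {1: False, 2: False, 3: True}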

Earliest computation point

The earliest program point, p, at which value v can be computed is one for which d-safe(p,v) is true, but where there is some path from the start node to p along which there is no earlier point p' for which d-safe(p',v) is true.

Recall that the earliest placement of a computation minimized the code size. To determine this placement, the compiler might compute a predicate earliest(p,v), which is true at all nodes reached from the start node via paths along which there is no earlier point p' for which d-safe(p',v) is true. The earliest placement would then be at a point for which both d-safe(p,v) and earliest(p,v) were true.

Evaluating earliest(p,v)

Using the local predicates, we can say:
earliest(p,v) =
  there exists a predecessor p' of p such that
    (not d-safe(p',v) and earliest(p',v))
We can compute this predicate by using the same algorithm as before, but the value at a point now depends on its predecessors rather than its successors, the initial values must be ``false'' rather than ``true'', and the start node (which has no predecessors) is taken to be ``true'' in accordance with the definition above.
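
Here is a minimal sketch of that parameterization, again for a single value v, assuming d-safe has already been computed; the graph, tables, and the treatment of the start node are hypothetical.

def solve_earliest(nodes, pred, dsafe, start):
    earliest = {p: False for p in nodes}
    earliest[start] = True                 # boundary value at the start node
    worklist = list(nodes)
    succ = {p: [q for q in nodes if p in pred[q]] for p in nodes}
    while worklist:
        p = worklist.pop()
        if p == start:
            continue                       # the boundary value stays fixed
        new = any((not dsafe[q]) and earliest[q] for q in pred[p])
        if new != earliest[p]:
            earliest[p] = new
            for q in succ[p]:              # successors depend on this value
                if q not in worklist:
                    worklist.append(q)
    return earliest

# Same tiny graph as before (1 -> 2 -> 3), with d-safe true only at node 3.
nodes = [1, 2, 3]
pred = {1: [], 2: [1], 3: [2]}
dsafe = {1: False, 2: False, 3: True}
print(solve_earliest(nodes, pred, dsafe, start=1))  # {1: True, 2: True, 3: True}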

Delay

We argued that, although the earliest computation point theoretically yielded the minimum code size, there might not be enough registers to hold all of the values. This would require spill code, which would increase both the execution time and the code size.

A computation can be delayed to a node without execution-time cost so long as it could be delayed to every predecessor of that node and none of those predecessors uses the value computed. In terms of predicates:

delay(p,v) =
  (d-safe(p,v) and earliest(p,v)) or
  for all predecessors, p', of p
    (not used(p',v) and delay(p',v))
We can compute this predicate by using our general-purpose algorithm, but the value at a point again depends on its predecessors rather than its successors.
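
As a minimal sketch, the right-hand side of this equation can be written as a function of the values at the predecessors of p and handed to the same kind of work list iteration shown earlier; the predicate tables and graph below are hypothetical.

def delay_at(p, pred, dsafe, earliest, used, delay):
    anchor = dsafe[p] and earliest[p]
    carried = all((not used[q]) and delay[q] for q in pred[p])
    return anchor or carried

# Example: straight-line graph 1 -> 2 -> 3 with the anchor (d-safe and
# earliest) at node 1 and v used at node 2, so the computation can be
# delayed to node 2 but not past it.
pred = {1: [], 2: [1], 3: [2]}
dsafe = {1: True, 2: True, 3: False}
earliest = {1: True, 2: False, 3: False}
used = {1: False, 2: True, 3: False}
delay = {1: True, 2: True, 3: False}       # a solution consistent with the equation
for p in (1, 2, 3):
    assert delay_at(p, pred, dsafe, earliest, used, delay) == delay[p]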

Latest placement

The final placement that we decided on in our example is determined by the node beyond which we cannot delay a computation without execution-time cost:
latest(p,v) =
  delay(p,v) and
    (used(p,v) or
     for some successor, p', of p, not delay(p',v))
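
Because latest depends only on the already-computed delay values and the successors of p, it needs no further iteration; here is a minimal sketch, continuing the hypothetical example above.

def latest_at(p, succ, used, delay):
    return delay[p] and (used[p] or any(not delay[q] for q in succ[p]))

succ = {1: [2], 2: [3], 3: []}
used = {1: False, 2: True, 3: False}
delay = {1: True, 2: True, 3: False}
print([p for p in (1, 2, 3) if latest_at(p, succ, used, delay)])   # [2]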

Characteristics of data flow problems

The examples of data flow problems that we've seen are solved by Boolean functions that combine local predicates at a program point with the values of the same predicate at that point's successors or predecessors. Even though they appear to be quite different, all were actually solved by the same algorithm with slightly different parameterizations. We would like to capture this essential similarity in a model, which we can then use to recognize and solve other data flow problems.

Monotone data flow analysis

The model we'll use is the monotone analysis framework, which consists of a set A of values with a partial order <= and a meet operation, together with a set of monotone functions on A describing the effect of each program point. A function f is monotone if, for all a, b in A, a<=b implies f(a)<=f(b)

See A unified approach to path problems, Journal of the ACM 28 (July, 1981) 577-593, for an explanation of the theory behind this class of problems.

Example

Let's see how the problem of d-safety fits the monotone analysis framework:
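
One way to cast it, sketched below: the value set is {false, true} ordered false <= true, the combining operation over successors is ``and'', and the point p contributes the transfer function f(x) = used(p,v) or (through(p,v) and x). The check simply verifies that every such function is monotone; the encoding is illustrative, not the only possible one.

from itertools import product

def leq(a, b):             # the order on {False, True}: False <= True
    return (not a) or b

def transfer(used_p, through_p):
    return lambda x: used_p or (through_p and x)

# Every combination of the local predicates yields a monotone function.
for used_p, through_p in product([False, True], repeat=2):
    f = transfer(used_p, through_p)
    for a, b in product([False, True], repeat=2):
        if leq(a, b):
            assert leq(f(a), f(b)), "transfer function is not monotone"
print("all transfer functions are monotone")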

Proving termination

Here is a general method for proving that any iteration terminates: find a condition B that holds whenever another iteration will be performed, and an integer function t that is bounded below (say, positive) whenever B holds and that strictly decreases on every iteration. Existence of a t with these properties guarantees termination (see The Science of Programming, Springer-Verlag, 1981).

Example:

while (fgetc(...) != EOF) {...}

B: We haven't gotten to the end of the file
t: Number of characters remaining in the file
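
A minimal sketch of the same argument in executable form, with hypothetical data: B is the loop condition, and t is an integer that is positive whenever B holds and strictly decreases on every iteration, so the loop must terminate.

def drain(items):
    processed = []
    while items:                      # B: there are items left to process
        t_before = len(items)         # t: number of items remaining
        processed.append(items.pop())
        assert len(items) < t_before  # every iteration strictly decreases t
    return processed

print(drain([3, 1, 4, 1, 5]))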

Proof of termination in a monotone analysis framework

Assume we use the work list algorithm:
B: The work list is not empty
t: (sum of argument levels)*(total number of values) +
   (size of work list)
``Argument level'' is the index of the row or column of the function's table, with the origin at the lower right corner. Each iteration removes one element from the work list; if the recomputed value is unchanged, t decreases by one, and if it changes, the value moves down in the order, so the sum of argument levels decreases by at least one and the first term shrinks by at least the total number of values, which more than offsets the elements added back to the work list.

Global constant propagation

We can compute constant values at compile time, so it is useful to be able to track them as they are assigned to variables and then used later. This requires that we be able to mark every value as one of:
  1. Unknown: nothing has been learned about it yet
  2. Constant: it always has one particular value, which is known
  3. Not constant: it may have different values on different executions

[Discuss-

]

Constant propagation as monotone data flow analysis

For a more complete discussion, see Combining analyses, combining optimizations, ACM Transactions on Programming Languages and Systems 17 (March, 1995) 181-196.
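
As a sketch of how this fits the framework, here is a minimal model of the usual three-level ordering for a single variable's value (unknown, a particular constant, not constant) and its meet operation, which combines the facts arriving along converging paths; the names and encoding are hypothetical and not taken from the paper cited above.

UNKNOWN = 'unknown'            # top: nothing learned yet
NOT_CONSTANT = 'not constant'  # bottom: conflicting values
# any other value stands for that particular constant

def meet(a, b):
    """Combine the facts known on two converging paths."""
    if a == UNKNOWN:
        return b
    if b == UNKNOWN:
        return a
    if a == b:
        return a               # the same constant on both paths
    return NOT_CONSTANT        # different constants, or already not constant

print(meet(3, 3))              # 3
print(meet(3, 4))              # not constant
print(meet(UNKNOWN, 7))        # 7
print(meet(NOT_CONSTANT, 7))   # not constant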