Source Program Trees

Purpose

To illustrate the tree representation of typical source language semantics, to examine scanner and parser modules that build a source program tree, and to develop an implementation strategy that will serve as the basis for your approach to the next assignment:

Due dates

You must complete the PEP tasks (individually), and then hand in the answers to the questions in this exercise (as a group) at the beginning of class on Monday, September 14. This assignment is worth 30 points.

Background

Section A.3 of Waite defines an abstraction that is useful in a compiler translating a subset of C. Figure A.8(b) shows the explicit information captured by the original abstraction from the program text in Figure A.8(a). Implicit information, like the type of each expression, can be obtained through computation over the tree and stored at Expression nodes, thus making it explicit.

The important question to answer about a design like that given in Section A.3 of Waite is whether it supports the computations required by the compiler. In order to answer this question, one needs to examine the meanings of all of the components of the tree and determine whether the information needed for the computations is present in an appropriate form. An initial design can be made on the basis of experience, but as the implementation of the computation evolves it may become necessary to adjust that design.

One of the applicability criteria for the Visitor design pattern used to implement computations over a tree is that the classes defining the structure rarely change. Thus an iterative implementation process that alters the design of the abstraction as the computation is implemented increases the cost. In order to keep this cost under control, the designer must seek ways to either reduce the number of iterations or reduce the cost of each. The number of iterations is minimized by improving the quality of the original design -- say by employing highly-experienced designers. Iteration cost is reduced by generating code from specifications.

The basic symbols of the abstraction, which appear as terminal symbols in the grammar, must be recognized in the sequence of input characters. Waite's Chapter 3 discusses the issues involved in this process. Specifications for the literal terminal symbols can be extracted from the grammar, but the designer of the language must specify the structure of the nonliteral terminal symbols.

Most grammars describing abstract syntax trees are ambiguous (see Section 4.3 of Waite). Thus they cannot be used unchanged to specify the process of building the tree from a sequence of basic symbols. The usual process is to derive a parsing grammar from the tree grammar by adding rules and using several different names in those rules where one name is used in the tree grammar. Waite gives an example of this process in Section 4.3.

Once a parsing grammar and a set of basic symbol specifications are available, any of a number of tools can be used to generate the scanner (also known as a ``lexer'') and parser. These components can also be written by hand using the techniques Waite presents in Chapters 3 and 5.

The requirement that the compiler be able to unambiguously scan and parse input text may place some restrictions on the form of the tree abstraction. Thus, as the implementation of the scanner and parser progress, it may be necessary to change the tree structure. In order to contain the costs of this iteration, designers often defer the actual implementation of the abstraction until after the scanner and parser have been completed. This is a viable strategy, but it requires a formal specification of the abstraction to be available as a target during scanner and parser development. Changes to such a formal specification are much cheaper than changes to a set of class definitions.

If the scanner and parser are developed without a firm definition of the target abstraction, there is a strong tendency to lump constructs together that look the same without careful consideration of their semantics. Such ``simplifications'' always cause serious trouble later on if the semantics of the two constructs differ significantly. Thus the design process should always involve early specification of an abstraction based on the semantics.

Capturing the design is critical when the implementation involves more than one person, when the total implementation time is more than a few hours, or when the implementation task is novel or particularly complicated. While there is still great debate about the best way to capture a design, Gamma's case study on pages 33-77 is one reasonable possibility.

Gamma's starting point is an overview of what is covered by the editor design, followed by five pages about how a document would be represented by the editor. This material includes a discussion of a set of key design decisions and the rationale for each. Examples are provided to drive home how these decisions will influence the implementation. Design decisions about the implementation of a number of key functions of the editor follow the material on the representation of a document. Given this material, an editor implementation team is much more likely to create a working editor more quickly than the same team starting with just the assignment and a stack of blank paper.

Task

Answer the following questions.

  1. Waite gives a description of an identifier in Section A.3.1.1.1.
    1. What information about an identifier is explicit in the original abstraction, and what information is made explicit by computation over the tree?
    2. What structural properties of the tree would that computation use?

  2. A computation is necessary at each dyadic expression (Section A.3.2.8) to determine the type of the expression. What information required by this computation is implicit and what is explicit in the abstraction summarized in Table A.2?

  3. Table A.2 abstracts each dyadic expression by a distinct construct according to the operator (AndExp,..., StarExp). Another possible abstraction for C-- would abstract any dyadic expression by a common construct:

    ConstructSignature
    Dyadic(Expression, Operator, Expression): Expression
    AndOp(): Operator
    ......
    StarOp(): Operator

    This kind of abstraction might be appropriate if the same basic computations were used for all dyadic expressions, and those computations were simply parameterized by the operator.

    1. How many distinct constructs does this abstraction have relative to an abstraction like that of Table A.2?
    2. If this abstraction were used, what information required by the type computation would be implicit and what would be explicit in the original abstraction of the dyadic expression?

Constructing Tree Abstractions from Text shows how the abstract trees described in Section 1.3 of Appel's book can be constructed from input text. Familiarize yourself with both the hand-coded solution and the formal specification. Implementations are available in a browsable directory and as a single tar file. Obtain copies of these solutions via the web, and run both of them.

The hand-coded solution was tested using gcc version 2.7.2. Since it uses the standard template library it mayor may not work with other versions of C++. You may make changes to accommodate your C++ compiler, but you must retain the basic structure. You may also produce a Java version of the implementation if you wish.

The specifications were tested using Eli version 4.2. You may also convert the specifications to the input formats of any other scanner and parser generator if you wish. In that case you may invoke the constructor routines of either the C++ implementation or the Eli-generated abstraction module.

Answer the following questions about your experiences.

  1. What are the sizes of the source code for the two solutions? Do you think that the complexities of the two solutions are proportional to their size in lines (in other words, is the size of the source code an appropriate measure of complexity)?

  2. Compare the sizes of the executable programs obtained from the two solutions. Can you draw any conclusions about the results? If so, what are they; if not, why can't you?

This assignment required you to study some properties of C--, and to examine two versions of a structural analyzer for a simpler language. In the next, you must implement a structural analyzer for C-using tools and techniques of your choice. The technical issues involved in that implementation should be reasonably clear at this point, but how can you effectively divide the effort among the members of your team? Write a brief description of the tasks to be carried out by each team member for the next assignment, giving a rationale for your partitioning. This description forms a part of your group's project plan, but it should be submitted by electronic mail to lrc@sei.cmu.edu rather than being entered into the PEP system.

Complete your personal postmortem for this assignment and your personal project plan for the next assignment, and submit them via the web before class on the due date of this assignment. Your group must hand in one set of answers to the questions at the beginning of class on the due date.


Instructor
Revision 1.5 (1998/08/31 20:07:52)