Parse Trees & Abstract Syntax Trees
Visitor pattern, expression problem, and AST design
Overview
This lecture covers what parse trees (abstract syntax trees) are, how they’re structured in code, and two competing approaches to organizing logic around them: the visitor pattern vs. methods directly on nodes. It also introduces the expression problem - a fundamental design trade-off in software engineering.
1. From Tokens to Parse Trees
The compiler pipeline so far:
Source Text -> Tokenizer -> Stream of Tokens -> Parser -> Parse Tree
- Tokenization (covered previously): turning a stream of characters into a linear series of tokens.
- Parsing (this section): feeding tokens into a parser to produce a parse tree (a.k.a. abstract syntax tree or syntax tree).
Terminology
| Term | Usage |
|---|---|
| Parse Tree | Informal / practical term |
| Abstract Syntax Tree (AST) | Formal term |
| Syntax Tree | General-purpose term |
There are subtle theoretical differences between these, but for a practical compilers course they are used interchangeably.
2. What Makes a Tree?
- A tree has a root node.
- Each node has a collection of children.
- Children may have children of their own (recursive structure).
- In most parse trees, a child also has a pointer to its parent (enables walking the tree in both directions).
Trees and Recursion
- Lists / arrays -> favor iteration
- Trees -> favor recursion
Many compiler algorithms are recursive walks over the parse tree.
Example: Expression Tree
(-)
/ \
(+) 4
/ \
1 2
The minus operator is the root; its children are the plus sub-expression and the literal 4.
3. Modeling Parse Trees in Java (CatScript Approach)
The ParseElement Base Class
- Abstract base class that all AST nodes extend.
- Contains:
List<ParseElement> children- child nodesToken start- the first token this element spansToken end- the last token this element spans- Parse error information
Start and End Tokens
Each parse element maps back to a span of tokens in the source:
| Construct | Start Token | End Token |
|---|---|---|
for loop |
for keyword |
} (closing brace) |
if statement |
if keyword |
} (closing brace) |
This connection between the parse tree and the underlying source text is maintained via the start/end token references.
Class Hierarchy Example
ParseElement (abstract)
└── Expression (abstract)
└── AdditiveExpression
- leftHandSide: Expression
- rightHandSide: Expression
For 1 + 1:
- An
AdditiveExpressionobject holds:leftHandSide-> integer literal expression (1)rightHandSide-> integer literal expression (1)
4. Code Generation Approach (Crafting Interpreters Book)
The book Crafting Interpreters (whose language is called Lox) uses a code generator to produce AST classes:
How It Works
- Define classes as simple strings:
"Binary : Expr left, Token operator, Expr right" - A code generator parses these strings (split on
:for class name, split on,for fields) - The
defineTypefunction builds Java source code:- Class declaration extending base class
- Constructor with field parameters
- Field assignments
- Field declarations
Why Use Code Generation?
- Avoids writing repetitive boilerplate (constructors, field assignments, field declarations) for every AST node type.
- This is metaprogramming - writing programs that write programs.
Downsides
- Java is bad at metaprogramming in general.
- The generated classes become black boxes - if you regenerate, any hand-written logic inside them gets overwritten.
- This forces you to keep logic outside the AST nodes (leading to the visitor pattern).
Other forms of metaprogramming: C++ macros, Rust macros.
5. The Expression Problem
Definition
Given n operations and m types, you must choose one trade-off.
In the lecture’s example: 3 operations (interpret, resolve, analyze) x 4 types (binary, grouping, literal, unary).
| Add New Type | Add New Operation | |
|---|---|---|
| OOP approach | Easy (just add a class, implement methods) | Hard (must modify every existing class) |
| Functional approach | Hard (must update every function) | Easy (just add a new function) |
The Functional Approach
In functional programming, instead of objects with methods, you write top-level functions that use elaborate if / pattern-matching statements to handle each type. Adding a new function is easy (standalone code), but adding a new type means updating every existing function.
The Trade-off
There is no solution that makes both adding new types and adding new operations easy without modifying existing code.
In Practice
“With modern IDEs, I can add a new function to a class hierarchy very easily. Just add it to the base class, hit alt-enter, implement on subclasses. It’s really not a hard problem.”
6. The Visitor Pattern
From the book: “The visitor pattern is the most widely misunderstood pattern in all of design patterns, which is really saying something when you look at the software architecture excesses of the past couple decades.”
Structure
- Base class defines an
accept(Visitor v)method. - Visitor interface defines a
visitmethod for every concrete subclass. - Each concrete subclass implements
acceptby callingvisitor.visit(this).
Example: Vehicle Visitor
// Base class
abstract class Vehicle {
abstract void accept(VehicleVisitor visitor);
}
// Visitor interface
interface VehicleVisitor {
void visit(Car car);
void visit(Truck truck);
}
// Concrete class
class Truck extends Vehicle {
void accept(VehicleVisitor visitor) {
visitor.visit(this); // dispatches to visit(Truck)
}
}
// Concrete visitor
class VehiclePrinter implements VehicleVisitor {
void visit(Car car) { print("Found a car"); }
void visit(Truck truck) { print("Found a truck"); }
}
Key mechanism: this inside Truck.accept() is of type Truck, so visitor.visit(this) calls visit(Truck) - achieving double dispatch.
Simpler Alternatives
instanceofchecks:if (vehicle instanceof Car) { ... } else if (vehicle instanceof Truck) { ... }- Put the method directly on the class:
vehicle.print()
Why the Book Uses Visitors
- Because the AST classes are code-generated, you can’t add logic to them (it would be overwritten).
- Visitors let you define operations externally to the generated classes.
7. The CatScript Approach: Methods on Nodes
Instead of visitors, put operations directly on the AST node classes:
class AdditiveExpression extends Expression {
Expression leftHandSide;
Expression rightHandSide;
// Evaluation logic - right here on the node
Object evaluate(CatScriptRuntime runtime) {
Object lhs = leftHandSide.evaluate(runtime);
Object rhs = rightHandSide.evaluate(runtime);
if (isAddition) return (int)lhs + (int)rhs;
else return (int)lhs - (int)rhs;
}
// Transpilation - also right here
void transpile(StringBuilder sb) { ... }
// Compilation - also right here
void compile(ByteCodeGenerator gen) { ... }
}
Four Operations on the Parse Tree
| Operation | Purpose |
|---|---|
| Validate | Type checking / verification |
| Evaluate (Eval) | Interpret the program directly |
| Transpile | Translate to another language |
| Compile | Generate bytecode |
Advantages
- Dramatically simpler codebase - no visitors, no callbacks
- Everything about a node type is in one place (high cohesion)
- Easy to read, debug, and understand
- Copy-paste with slight modifications is preferred over complex abstractions
The “Violation”
This violates separation of concerns - the AST knows about compilation, evaluation, etc. Carson argues this is a worthwhile trade-off:
“I don’t care when people say you’re not separating your concerns. I think there are more important things like simplicity, making it easy to debug, simplicity, locality of behavior, simplicity, easy reading, simplicity, and also simplicity.”
8. Software Design Philosophy
The Developer Experience Curve (attributed to Lea Verou)
How priorities shift over a programming career:
| Concern | Early Career | ~5 Years | ~10 Years | Senior |
|---|---|---|---|---|
| “Does it work?” | Highest priority | Less important | Even less | Assumed |
| Code readability | Some | Low (focus on expressiveness) | Growing | Highest priority |
| DRY (Don’t Repeat Yourself) | Low (don’t know how) | Growing | Peak (no repeated code!) | Declining - copy-paste is OK |
Key Takeaways
- Fight hard to keep code simple.
- Copy-paste with slight modifications > super complicated architecture with no repetition.
- Three similar lines of code is better than a premature abstraction.
- The visitor pattern and code generation add complexity that may not be justified.
- Choose simplicity and readability over the “right” architecture.
- The instructor’s ideal: the “grug brain” parser - plain Java classes with methods, no visitors, no code generators.
“Copy and paste with slight modifications is better than a super complicated architecture that has no repetition but is impossible to understand.”
Summary
| Topic | Key Point |
|---|---|
| Parse Trees | Objects with children arranged as a tree; nodes have start/end tokens |
| Code Generation | Metaprogramming to avoid boilerplate; forces logic outside the AST |
| Expression Problem | Fundamental trade-off: easy new types OR easy new operations, not both |
| Visitor Pattern | Double-dispatch mechanism; powerful but adds significant complexity |
| CatScript Approach | Handwritten classes, logic on nodes, no visitors - simplicity wins |
| Design Philosophy | Readability and simplicity > DRY > architectural purity |