Compilers

Parse Trees & Abstract Syntax Trees

Visitor pattern, expression problem, and AST design

Overview

This lecture covers what parse trees (abstract syntax trees) are, how they’re structured in code, and two competing approaches to organizing logic around them: the visitor pattern vs. methods directly on nodes. It also introduces the expression problem - a fundamental design trade-off in software engineering.


1. From Tokens to Parse Trees

The compiler pipeline so far:

Source Text  ->  Tokenizer  ->  Stream of Tokens  ->  Parser  ->  Parse Tree
  • Tokenization (covered previously): turning a stream of characters into a linear series of tokens.
  • Parsing (this section): feeding tokens into a parser to produce a parse tree (a.k.a. abstract syntax tree or syntax tree).

Terminology

Term Usage
Parse Tree Informal / practical term
Abstract Syntax Tree (AST) Formal term
Syntax Tree General-purpose term

There are subtle theoretical differences between these, but for a practical compilers course they are used interchangeably.


2. What Makes a Tree?

  • A tree has a root node.
  • Each node has a collection of children.
  • Children may have children of their own (recursive structure).
  • In most parse trees, a child also has a pointer to its parent (enables walking the tree in both directions).

Trees and Recursion

  • Lists / arrays -> favor iteration
  • Trees -> favor recursion

Many compiler algorithms are recursive walks over the parse tree.

Example: Expression Tree

        (-)
       /   \
     (+)    4
    /   \
   1     2

The minus operator is the root; its children are the plus sub-expression and the literal 4.


3. Modeling Parse Trees in Java (CatScript Approach)

The ParseElement Base Class

  • Abstract base class that all AST nodes extend.
  • Contains:
    • List<ParseElement> children - child nodes
    • Token start - the first token this element spans
    • Token end - the last token this element spans
    • Parse error information

Start and End Tokens

Each parse element maps back to a span of tokens in the source:

Construct Start Token End Token
for loop for keyword } (closing brace)
if statement if keyword } (closing brace)

This connection between the parse tree and the underlying source text is maintained via the start/end token references.

Class Hierarchy Example

ParseElement (abstract)
  └── Expression (abstract)
        └── AdditiveExpression
              - leftHandSide: Expression
              - rightHandSide: Expression

For 1 + 1:

  • An AdditiveExpression object holds:
    • leftHandSide -> integer literal expression (1)
    • rightHandSide -> integer literal expression (1)

4. Code Generation Approach (Crafting Interpreters Book)

The book Crafting Interpreters (whose language is called Lox) uses a code generator to produce AST classes:

How It Works

  1. Define classes as simple strings: "Binary : Expr left, Token operator, Expr right"
  2. A code generator parses these strings (split on : for class name, split on , for fields)
  3. The defineType function builds Java source code:
    • Class declaration extending base class
    • Constructor with field parameters
    • Field assignments
    • Field declarations

Why Use Code Generation?

  • Avoids writing repetitive boilerplate (constructors, field assignments, field declarations) for every AST node type.
  • This is metaprogramming - writing programs that write programs.

Downsides

  • Java is bad at metaprogramming in general.
  • The generated classes become black boxes - if you regenerate, any hand-written logic inside them gets overwritten.
  • This forces you to keep logic outside the AST nodes (leading to the visitor pattern).

Other forms of metaprogramming: C++ macros, Rust macros.


5. The Expression Problem

Definition

Given n operations and m types, you must choose one trade-off.

In the lecture’s example: 3 operations (interpret, resolve, analyze) x 4 types (binary, grouping, literal, unary).

  Add New Type Add New Operation
OOP approach Easy (just add a class, implement methods) Hard (must modify every existing class)
Functional approach Hard (must update every function) Easy (just add a new function)

The Functional Approach

In functional programming, instead of objects with methods, you write top-level functions that use elaborate if / pattern-matching statements to handle each type. Adding a new function is easy (standalone code), but adding a new type means updating every existing function.

The Trade-off

There is no solution that makes both adding new types and adding new operations easy without modifying existing code.

In Practice

“With modern IDEs, I can add a new function to a class hierarchy very easily. Just add it to the base class, hit alt-enter, implement on subclasses. It’s really not a hard problem.”


6. The Visitor Pattern

From the book: “The visitor pattern is the most widely misunderstood pattern in all of design patterns, which is really saying something when you look at the software architecture excesses of the past couple decades.”

Structure

  1. Base class defines an accept(Visitor v) method.
  2. Visitor interface defines a visit method for every concrete subclass.
  3. Each concrete subclass implements accept by calling visitor.visit(this).

Example: Vehicle Visitor

// Base class
abstract class Vehicle {
    abstract void accept(VehicleVisitor visitor);
}

// Visitor interface
interface VehicleVisitor {
    void visit(Car car);
    void visit(Truck truck);
}

// Concrete class
class Truck extends Vehicle {
    void accept(VehicleVisitor visitor) {
        visitor.visit(this);  // dispatches to visit(Truck)
    }
}

// Concrete visitor
class VehiclePrinter implements VehicleVisitor {
    void visit(Car car)   { print("Found a car"); }
    void visit(Truck truck) { print("Found a truck"); }
}

Key mechanism: this inside Truck.accept() is of type Truck, so visitor.visit(this) calls visit(Truck) - achieving double dispatch.

Simpler Alternatives

  1. instanceof checks: if (vehicle instanceof Car) { ... } else if (vehicle instanceof Truck) { ... }
  2. Put the method directly on the class: vehicle.print()

Why the Book Uses Visitors

  • Because the AST classes are code-generated, you can’t add logic to them (it would be overwritten).
  • Visitors let you define operations externally to the generated classes.

7. The CatScript Approach: Methods on Nodes

Instead of visitors, put operations directly on the AST node classes:

class AdditiveExpression extends Expression {
    Expression leftHandSide;
    Expression rightHandSide;

    // Evaluation logic  -  right here on the node
    Object evaluate(CatScriptRuntime runtime) {
        Object lhs = leftHandSide.evaluate(runtime);
        Object rhs = rightHandSide.evaluate(runtime);
        if (isAddition) return (int)lhs + (int)rhs;
        else return (int)lhs - (int)rhs;
    }

    // Transpilation  -  also right here
    void transpile(StringBuilder sb) { ... }

    // Compilation  -  also right here
    void compile(ByteCodeGenerator gen) { ... }
}

Four Operations on the Parse Tree

Operation Purpose
Validate Type checking / verification
Evaluate (Eval) Interpret the program directly
Transpile Translate to another language
Compile Generate bytecode

Advantages

  • Dramatically simpler codebase - no visitors, no callbacks
  • Everything about a node type is in one place (high cohesion)
  • Easy to read, debug, and understand
  • Copy-paste with slight modifications is preferred over complex abstractions

The “Violation”

This violates separation of concerns - the AST knows about compilation, evaluation, etc. Carson argues this is a worthwhile trade-off:

“I don’t care when people say you’re not separating your concerns. I think there are more important things like simplicity, making it easy to debug, simplicity, locality of behavior, simplicity, easy reading, simplicity, and also simplicity.”


8. Software Design Philosophy

The Developer Experience Curve (attributed to Lea Verou)

How priorities shift over a programming career:

Concern Early Career ~5 Years ~10 Years Senior
“Does it work?” Highest priority Less important Even less Assumed
Code readability Some Low (focus on expressiveness) Growing Highest priority
DRY (Don’t Repeat Yourself) Low (don’t know how) Growing Peak (no repeated code!) Declining - copy-paste is OK

Key Takeaways

  • Fight hard to keep code simple.
  • Copy-paste with slight modifications > super complicated architecture with no repetition.
  • Three similar lines of code is better than a premature abstraction.
  • The visitor pattern and code generation add complexity that may not be justified.
  • Choose simplicity and readability over the “right” architecture.
  • The instructor’s ideal: the “grug brain” parser - plain Java classes with methods, no visitors, no code generators.

“Copy and paste with slight modifications is better than a super complicated architecture that has no repetition but is impossible to understand.”


Summary

Topic Key Point
Parse Trees Objects with children arranged as a tree; nodes have start/end tokens
Code Generation Metaprogramming to avoid boilerplate; forces logic outside the AST
Expression Problem Fundamental trade-off: easy new types OR easy new operations, not both
Visitor Pattern Double-dispatch mechanism; powerful but adds significant complexity
CatScript Approach Handwritten classes, logic on nodes, no visitors - simplicity wins
Design Philosophy Readability and simplicity > DRY > architectural purity