Compilers

Recursive Descent Parsing

Turning a list of tokens into a parse tree

Expressions vs statements

A fundamental distinction in most programming languages:

             Expression                        Statement
Definition   Produces a value                  Performs an action
Examples     5 + 3, x > 10, a * 5 + 3          var x = 10, print("hello"), if (x) { ... }
Returns      A value (number, boolean, etc.)   Nothing (in most languages)

Some languages (C, Java, JavaScript) treat assignment as an expression that yields the assigned value; others (Python, Go) treat it as a statement.


The Lox expression grammar

A simple but capable grammar for expressions:

expression -> literal | unary | binary | grouping ;

literal    -> NUMBER | STRING | "true" | "false" | "nil" ;
grouping   -> "(" expression ")" ;
unary      -> ( "-" | "!" ) expression ;
binary     -> expression operator expression ;

operator   -> "==" | "!=" | "<" | "<=" | ">" | ">="
           | "+"  | "-"  | "*" | "/" ;

This supports numbers, strings, booleans, nil, unary negation/not, binary arithmetic and comparison, and parenthesized grouping - a surprisingly capable expression language from a small grammar.
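The four alternatives in this grammar map naturally onto tree node classes. A minimal sketch in Java (the class and field names here are illustrative, not from any particular implementation):

```java
// Minimal AST sketch for the four expression forms in the grammar above.
// Names are illustrative; a real parser would carry Token objects, not strings.
abstract class Expr {}

class Literal extends Expr {            // NUMBER | STRING | "true" | "false" | "nil"
    final Object value;
    Literal(Object value) { this.value = value; }
}

class Unary extends Expr {              // ( "-" | "!" ) expression
    final String operator;
    final Expr operand;
    Unary(String operator, Expr operand) { this.operator = operator; this.operand = operand; }
}

class Binary extends Expr {             // expression operator expression
    final Expr left;
    final String operator;
    final Expr right;
    Binary(Expr left, String operator, Expr right) {
        this.left = left; this.operator = operator; this.right = right;
    }
}

class Grouping extends Expr {           // "(" expression ")"
    final Expr inner;
    Grouping(Expr inner) { this.inner = inner; }
}
```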


The ambiguity problem

This grammar is ambiguous: a single token sequence can produce multiple valid parse trees with different meanings.

Example: 6 / 3 - 1

Parse tree 1 - division wraps subtraction (incorrect):

      (/)
     /   \
    6    (-)
        /   \
       3     1

Result: 6 / (3 - 1) = 6 / 2 = 3

Parse tree 2 - subtraction wraps division (correct):

      (-)
     /   \
   (/)    1
  /   \
 6     3

Result: (6 / 3) - 1 = 2 - 1 = 1

Both are valid derivations from the grammar, but they produce different results. We need a way to enforce the correct interpretation.
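A quick way to see that the two trees really mean different things is to evaluate them directly. A small Java sketch (the Node/Num/Bin types are made up for this demo, not the chapter's AST):

```java
// Evaluates the two parse trees for "6 / 3 - 1" to show they disagree.
class AmbiguityDemo {
    interface Node { double eval(); }
    record Num(double v) implements Node { public double eval() { return v; } }
    record Bin(char op, Node l, Node r) implements Node {
        public double eval() {
            return op == '/' ? l.eval() / r.eval() : l.eval() - r.eval();
        }
    }

    public static void main(String[] args) {
        // Tree 1: division at the root -> 6 / (3 - 1)
        Node wrong = new Bin('/', new Num(6), new Bin('-', new Num(3), new Num(1)));
        // Tree 2: subtraction at the root -> (6 / 3) - 1
        Node right = new Bin('-', new Bin('/', new Num(6), new Num(3)), new Num(1));
        System.out.println(wrong.eval());  // 3.0
        System.out.println(right.eval());  // 1.0
    }
}
```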


Operator precedence

Operator precedence (or binding precedence) determines which operator is evaluated first in a sequence of tokens.

Higher-precedence operators bind more tightly to their operands - they “grab” their arguments before lower-precedence operators do.

Standard precedence levels (lowest to highest)

Precedence    Name                      Operators    Associativity
1 (lowest)    Equality                  == !=        Left
2             Comparison                > >= < <=    Left
3             Term (additive)           + -          Left
4             Factor (multiplicative)   * /          Left
5 (highest)   Unary                     - !          Right

Counterintuitive fact: The operator with the lowest precedence ends up at the top of the parse tree. Higher-precedence operators sink lower into the tree because they bind first and become children.


Associativity

When the same operator appears multiple times, associativity determines grouping.

Left associativity

Most binary operators are left-associative - the leftmost operation binds first:

4 - 3 - 2  ->  (4 - 3) - 2  =  -1      (ok)
               4 - (3 - 2)  =   3      (no)

This matters for non-commutative operations like subtraction and division.

Right associativity

Unary operators and the power operator (^) are right-associative - the rightmost operation binds first:

!!true  ->  !(!true)  ->  !false  ->  true

Infix, prefix, and postfix

The ambiguity problem arises specifically from infix operators - operators placed between operands (1 + 2).

Notation        Operator position   Example   Ambiguity?
Infix           Between operands    1 + 2     Yes
Prefix (Lisp)   Before operands     (+ 1 2)   No
Postfix (RPN)   After operands      1 2 +     No

Prefix and postfix notations eliminate ambiguity because the order of operations is explicit. Most programming languages use infix because it’s natural for humans, even though it requires precedence rules.
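To see why explicit ordering removes the need for precedence rules, here is a minimal stack-based RPN evaluator in Java - a hypothetical helper for this demo, not part of the parser built in this chapter:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A postfix (RPN) evaluator needs no precedence rules: each operator
// applies to the two most recently computed values, so the order of
// operations is fully determined by the token order itself.
class Rpn {
    static double eval(String input) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String tok : input.split("\\s+")) {
            switch (tok) {
                case "+", "-", "*", "/" -> {
                    double b = stack.pop(), a = stack.pop();  // note the order
                    stack.push(switch (tok) {
                        case "+" -> a + b;
                        case "-" -> a - b;
                        case "*" -> a * b;
                        default  -> a / b;
                    });
                }
                default -> stack.push(Double.parseDouble(tok));
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        System.out.println(eval("6 3 / 1 -"));  // 1.0 -- same as (6 / 3) - 1
    }
}
```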


Encoding precedence in the grammar

The fix: split each precedence level into its own production rule. Each level can only reference the next level down on its right-hand side, preventing lower-precedence operators from being consumed where higher-precedence ones are expected.

expression -> equality ;

equality   -> comparison ( ( "!=" | "==" ) comparison )* ;
comparison -> term ( ( ">" | ">=" | "<" | "<=" ) term )* ;
term       -> factor ( ( "-" | "+" ) factor )* ;
factor     -> unary ( ( "/" | "*" ) unary )* ;

unary      -> ( "!" | "-" ) unary
           | primary ;

primary    -> NUMBER | STRING | "true" | "false" | "nil"
           | "(" expression ")" ;

How this prevents the wrong parse of 6 / 3 - 1

  1. Start at expression -> equality -> comparison -> term
  2. term looks for a factor on the left - finds 6 / 3 via the factor rule
  3. factor matches 6, consumes /, then looks for a unary on the right
  4. The right-hand side of factor must be a unary - it cannot be a term (additive expression)
  5. So 3 - 1 cannot be consumed as the right side of /. Only 3 is consumed.
  6. Control returns to term, which matches - and then 1

This is the crux: the right-hand side of a factor expression cannot be an additive expression - it must come from a higher-precedence (more tightly binding) level, i.e. the next rule down the grammar cascade.

Why everything “passes through”

Since the * repetition is optional (zero or more), each level can simply delegate to the next level down without matching any operators. This is how a simple number like 1 works:

expression -> equality -> comparison -> term -> factor -> unary -> primary -> 1

It passes through every level until it reaches primary, where it matches as a number literal. This is necessary - there must be a path from expression to a bare number.
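The delegation chain can be sketched directly as code: each method does nothing but call the next level down, so parsing the single token 1 falls straight through to primary(). Tokens are simplified to strings here; the method names mirror the grammar rules.

```java
import java.util.List;

// Pass-through sketch: with no operators present, every level simply
// delegates downward until primary() matches the bare number.
class PassThrough {
    private final List<String> tokens;
    private int pos = 0;
    PassThrough(List<String> tokens) { this.tokens = tokens; }

    double expression() { return equality(); }
    double equality()   { return comparison(); }  // no == or != here: delegate
    double comparison() { return term(); }
    double term()       { return factor(); }
    double factor()     { return unary(); }
    double unary()      { return primary(); }
    double primary()    { return Double.parseDouble(tokens.get(pos++)); }

    public static void main(String[] args) {
        System.out.println(new PassThrough(List.of("1")).expression());  // 1.0
    }
}
```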

How associativity is encoded

Left associativity comes from the * repetition - the leftmost match happens first:

term -> factor ( ( "-" | "+" ) factor )* ;

For 1 + 2 + 3: match 1, then + 2, then + 3 - the left side accumulates.

Right associativity comes from recursion to the same rule on the right:

unary -> ( "!" | "-" ) unary | primary ;

The unary on the right-hand side refers back to itself, making the rightmost operation evaluate first.


The recursive descent algorithm

The core idea is strikingly simple:

For each production in your grammar, create a function named after it. Inside that function, look at the right-hand side of the production and call the corresponding functions.

This works because the grammar structure mirrors the call stack. Each precedence level calls the next level down, and the recursion naturally builds the tree.
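As a self-contained illustration, here is a two-level slice of the grammar (term and factor, with the unary level elided) written as one method per rule. It evaluates on the fly instead of building tree nodes, purely to keep the sketch short; tokens are pre-split strings rather than Token objects:

```java
import java.util.List;

// Minimal recursive descent evaluator for the term/factor levels only.
// Each method corresponds to one production and calls the level below it.
class MiniParser {
    private final List<String> tokens;
    private int pos = 0;
    MiniParser(List<String> tokens) { this.tokens = tokens; }

    // term -> factor ( ( "-" | "+" ) factor )* ;
    double term() {
        double lhs = factor();
        while (pos < tokens.size()
                && (tokens.get(pos).equals("+") || tokens.get(pos).equals("-"))) {
            String op = tokens.get(pos++);
            double rhs = factor();               // RHS can only be a factor
            lhs = op.equals("+") ? lhs + rhs : lhs - rhs;
        }
        return lhs;
    }

    // factor -> primary ( ( "/" | "*" ) primary )* ;   (unary level elided)
    double factor() {
        double lhs = primary();
        while (pos < tokens.size()
                && (tokens.get(pos).equals("*") || tokens.get(pos).equals("/"))) {
            String op = tokens.get(pos++);
            double rhs = primary();
            lhs = op.equals("*") ? lhs * rhs : lhs / rhs;
        }
        return lhs;
    }

    double primary() { return Double.parseDouble(tokens.get(pos++)); }

    public static void main(String[] args) {
        // "6 / 3 - 1" parses as (6 / 3) - 1, the correct tree
        System.out.println(new MiniParser(List.of("6", "/", "3", "-", "1")).term());  // 1.0
    }
}
```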


Implementation: parsing equality expressions

Given the grammar rule:

equality -> comparison ( ( "!=" | "==" ) comparison )* ;

The recursive descent implementation in Java:

Expression parseEqualityExpression() {
    Expression lhs = parseComparisonExpression();

    if (match(BANG_EQUAL, EQUAL_EQUAL)) {
        Token operator = consumeToken();
        Expression rhs = parseComparisonExpression();

        EqualityExpression expr = new EqualityExpression();
        expr.setOperator(operator);
        expr.setLeftHandSide(lhs);
        expr.setRightHandSide(rhs);
        return expr;
    }

    return lhs;
}

How it works

  1. Call parseComparisonExpression() for the left-hand side
  2. Check if the current token is != or == using match()
  3. If yes: consume the operator, parse the right-hand side, build and return an EqualityExpression
  4. If no: just return the left-hand side (pass-through case)

Key parser methods

These are analogous to the tokenizer’s character-level methods, but operate on token types:

Method           Purpose
match(types...)  Return true if the current token matches any of the given types
consumeToken()   Remove and return the current token from the front of the token stream

This is where the token list is converted into a tree - match() inspects tokens, consumeToken() removes them from the stream, and the parsed structure is assembled into expression objects with children.
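One plausible shape for these two helpers, with strings standing in for token types (the real parser presumably works on richer Token objects):

```java
import java.util.List;

// Sketch of the two parser primitives. match() only inspects the current
// token; consumeToken() is what actually advances through the stream.
class TokenStream {
    private final List<String> tokens;
    private int pos = 0;
    TokenStream(List<String> tokens) { this.tokens = tokens; }

    // True if the current token matches any of the given types; consumes nothing.
    boolean match(String... types) {
        if (pos >= tokens.size()) return false;
        for (String type : types)
            if (tokens.get(pos).equals(type)) return true;
        return false;
    }

    // Remove and return the current token from the front of the stream.
    String consumeToken() { return tokens.get(pos++); }
}
```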


Handling multiple operators (the * repetition)

The simple if version above only matches one equality operator. To handle 1 == 2 == false, we need to match multiple. A naive fix - recursively calling parseEqualityExpression() on the right-hand side - produces right associativity, which is incorrect for equality.

The “loop left, recurse right” mnemonic

The fix for left-associative operators is to use a while loop instead of recursion. Each iteration takes the current left-hand side and slides it under a new operator node:

Expression parseAdditiveExpression() {
    Expression lhs = parseMultiplicativeExpression();

    while (match(PLUS, MINUS)) {
        AdditiveExpression expr = new AdditiveExpression();
        expr.setOperator(consumeToken());
        expr.setLeftHandSide(lhs);
        Expression rhs = parseMultiplicativeExpression();
        expr.setRightHandSide(rhs);
        lhs = expr;  // slide the result into the left-hand side
    }

    return lhs;
}

For right-associative operators (unary, exponentiation), use recursion - the function calls itself for the right-hand operand, which naturally causes the rightmost operator to bind first.
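A runnable sketch of that recursion for the unary rule, evaluating !!true directly rather than building nodes (tokens simplified to strings):

```java
import java.util.List;

// unary -> ( "!" | "-" ) unary | primary ;
// The method recurses into itself for its operand, so the rightmost
// operator binds first: "!!true" evaluates as !(!true).
class UnaryDemo {
    private final List<String> tokens;
    private int pos = 0;
    UnaryDemo(List<String> tokens) { this.tokens = tokens; }

    boolean unary() {
        if (tokens.get(pos).equals("!")) {
            pos++;
            return !unary();          // recurse right
        }
        return Boolean.parseBoolean(tokens.get(pos++));  // primary (boolean only)
    }

    public static void main(String[] args) {
        System.out.println(new UnaryDemo(List.of("!", "!", "true")).unary());  // true
    }
}
```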

Loop left, recurse right - left-associative operators use a loop; right-associative operators use recursion.

Every binary expression parsing function follows this same pattern. See Actual Parsing for a full walkthrough with debugging examples.


Why recursive descent?

Most universities teach shift-reduce parsing with parser generators. Recursive descent is different:

             Shift-Reduce                      Recursive Descent
Complexity   Very complex algorithm            Simple, intuitive code
Written by   Parser generators (yacc, bison)   Written by hand
Debugging    Opaque - hard to step through     Easy to step through in a debugger
Teaching     Standard academic approach        Rarely taught (historically)

“Once you get your head around this and start stepping through it in the debugger, it’s incredible how easy it is and how well it works.”


Summary

Concept                     Key point
Expressions vs statements   Expressions produce values; statements perform actions
Ambiguity                   Flat grammars allow multiple parse trees for the same tokens
Operator precedence         Higher-precedence operators bind more tightly (and sit lower in the tree)
Associativity               Left-associative: leftmost binds first; right-associative: rightmost binds first
Precedence in grammar       Split each precedence level into its own production; each references only the next level down
Recursive descent           One function per grammar rule; each function calls the next level down
Pass-through                Every level can delegate downward, letting bare values like 1 traverse the entire grammar

Further reading