Recursive Descent Parsing
Turning a list of tokens into a parse tree
Expressions vs statements
A fundamental distinction in most programming languages:
| Expression | Statement | |
|---|---|---|
| Definition | Produces a value | Performs an action |
| Examples | 5 + 3, x > 10, a * 5 + 3 |
var x = 10, print("hello"), if (x) { ... } |
| Returns | A value (number, boolean, etc.) | Nothing (in most languages) |
Some languages treat assignment as an expression. Most don’t.
The Lox expression grammar
A simple but capable grammar for expressions:
expression -> literal | unary | binary | grouping ;
literal -> NUMBER | STRING | "true" | "false" | "nil" ;
grouping -> "(" expression ")" ;
unary -> ( "-" | "!" ) expression ;
binary -> expression operator expression ;
operator -> "==" | "!=" | "<" | "<=" | ">" | ">="
| "+" | "-" | "*" | "/" ;
This supports numbers, strings, booleans, null, unary negation/not, binary arithmetic and comparison, and parenthesized grouping - a surprisingly capable expression language from a small grammar.
The ambiguity problem
This grammar is ambiguous: a single token sequence can produce multiple valid parse trees with different meanings.
Example: 6 / 3 - 1
Parse tree 1 - division wraps subtraction (incorrect):
(/)
/ \
6 (-)
/ \
3 1
Result: 6 / (3 - 1) = 6 / 2 = 3
Parse tree 2 - subtraction wraps division (correct):
(-)
/ \
(/) 1
/ \
6 3
Result: (6 / 3) - 1 = 2 - 1 = 1
Both are valid derivations from the grammar, but they produce different results. We need a way to enforce the correct interpretation.
Operator precedence
Operator precedence (or binding precedence) determines which operator is evaluated first in a sequence of tokens.
Higher-precedence operators bind more tightly to their operands - they “grab” their arguments before lower-precedence operators do.
Standard precedence levels (lowest to highest)
| Precedence | Name | Operators | Associativity |
|---|---|---|---|
| 1 (lowest) | Equality | == != |
Left |
| 2 | Comparison | > >= < <= |
Left |
| 3 | Term (additive) | + - |
Left |
| 4 | Factor (multiplicative) | * / |
Left |
| 5 (highest) | Unary | - ! |
Right |
Counterintuitive fact: The operator with the lowest precedence ends up at the top of the parse tree. Higher-precedence operators sink lower into the tree because they bind first and become children.
Associativity
When the same operator appears multiple times, associativity determines grouping.
Left associativity
Most binary operators are left-associative - the leftmost operation binds first:
4 - 3 - 2 -> (4 - 3) - 2 = -1 (ok)
4 - (3 - 2) = 3 (no)
This matters for non-commutative operations like subtraction and division.
Right associativity
Unary operators and the power operator (^) are right-associative - the rightmost operation binds first:
!!true -> !(!true) -> !false -> true
Infix, prefix, and postfix
The ambiguity problem arises specifically from infix operators - operators placed between operands (1 + 2).
| Notation | Operator position | Example | Ambiguity? |
|---|---|---|---|
| Infix | Between operands | 1 + 2 |
Yes |
| Prefix (Lisp) | Before operands | (+ 1 2) |
No |
| Postfix (RPN) | After operands | 1 2 + |
No |
Prefix and postfix notations eliminate ambiguity because the order of operations is explicit. Most programming languages use infix because it’s natural for humans, even though it requires precedence rules.
Encoding precedence in the grammar
The fix: split each precedence level into its own production rule. Each level can only reference the next level down on its right-hand side, preventing lower-precedence operators from being consumed where higher-precedence ones are expected.
expression -> equality ;
equality -> comparison ( ( "!=" | "==" ) comparison )* ;
comparison -> term ( ( ">" | ">=" | "<" | "<=" ) term )* ;
term -> factor ( ( "-" | "+" ) factor )* ;
factor -> unary ( ( "/" | "*" ) unary )* ;
unary -> ( "!" | "-" ) unary
| primary ;
primary -> NUMBER | STRING | "true" | "false" | "nil"
| "(" expression ")" ;
How this prevents the wrong parse of 6 / 3 - 1
- Start at
expression->equality->comparison->term termlooks for afactoron the left - finds6 / 3via thefactorrulefactormatches6, consumes/, then looks for aunaryon the right- The right-hand side of
factormust be aunary- it cannot be aterm(additive expression) - So
3 - 1cannot be consumed as the right side of/. Only3is consumed. - Control returns to
term, which matches-and then1
This is the crux: the right-hand side of a factor expression cannot be an additive expression - it must be something at a lower (more powerful) precedence level.
Why everything “passes through”
Since the * repetition is optional (zero or more), each level can simply delegate to the next level down without matching any operators. This is how a simple number like 1 works:
expression -> equality -> comparison -> term -> factor -> unary -> primary -> 1
It passes through every level until it reaches primary, where it matches as a number literal. This is necessary - there must be a path from expression to a bare number.
How associativity is encoded
Left associativity comes from the * repetition - the leftmost match happens first:
term -> factor ( ( "-" | "+" ) factor )* ;
For 1 + 2 + 3: match 1, then + 2, then + 3 - the left side accumulates.
Right associativity comes from recursion to the same rule on the right:
unary -> ( "!" | "-" ) unary | primary ;
The unary on the right-hand side refers back to itself, making the rightmost operation evaluate first.
The recursive descent algorithm
The core idea is strikingly simple:
For each production in your grammar, create a function named after it. Inside that function, look at the right-hand side of the production and call the corresponding functions.
This works because the grammar structure mirrors the call stack. Each precedence level calls the next level down, and the recursion naturally builds the tree.
Implementation: parsing equality expressions
Given the grammar rule:
equality -> comparison ( ( "!=" | "==" ) comparison )* ;
The recursive descent implementation in Java:
Expression parseEqualityExpression() {
Expression lhs = parseComparisonExpression();
if (match(BANG_EQUAL, EQUAL_EQUAL)) {
Token operator = consumeToken();
Expression rhs = parseComparisonExpression();
EqualityExpression expr = new EqualityExpression();
expr.setOperator(operator);
expr.setLeftHandSide(lhs);
expr.setRightHandSide(rhs);
return expr;
}
return lhs;
}
How it works
- Call
parseComparisonExpression()for the left-hand side - Check if the current token is
!=or==usingmatch() - If yes: consume the operator, parse the right-hand side, build and return an
EqualityExpression - If no: just return the left-hand side (pass-through case)
Key parser methods
These are analogous to the tokenizer’s character-level methods, but operate on token types:
| Method | Purpose |
|---|---|
match(types...) |
Return true if the current token matches any of the given types |
consumeToken() |
Remove and return the current token from the front of the token stream |
This is where the token list is converted into a tree - match() inspects tokens, consumeToken() removes them from the stream, and the parsed structure is assembled into expression objects with children.
Handling multiple operators (the * repetition)
The simple if version above only matches one equality operator. To handle 1 == 2 == false, we need to match multiple. A naive fix - recursively calling parseEqualityExpression() on the right-hand side - produces right associativity, which is incorrect for equality.
The “loop left, recurse right” mnemonic
The fix for left-associative operators is to use a while loop instead of recursion. Each iteration takes the current left-hand side and slides it under a new operator node:
Expression parseAdditiveExpression() {
Expression lhs = parseUnaryExpression();
while (match(PLUS, MINUS)) {
AdditiveExpression expr = new AdditiveExpression();
expr.setOperator(consumeToken());
expr.setLeftHandSide(lhs);
Expression rhs = parseUnaryExpression();
expr.setRightHandSide(rhs);
lhs = expr; // slide the result into the left-hand side
}
return lhs;
}
For right-associative operators (unary, exponentiation), use recursion - the function calls itself for the right-hand operand, which naturally causes the rightmost operator to bind first.
Loop left, recurse right - left-associative operators use a loop; right-associative operators use recursion.
Every binary expression parsing function follows this same pattern. See Actual Parsing for a full walkthrough with debugging examples.
Why recursive descent?
Most universities teach shift-reduce parsing with parser generators. Recursive descent is different:
| Shift-Reduce | Recursive Descent | |
|---|---|---|
| Complexity | Very complex algorithm | Simple, intuitive code |
| Written by | Parser generators (yacc, bison) | Written by hand |
| Debugging | Opaque - hard to step through | Easy to step through in a debugger |
| Teaching | Standard academic approach | Rarely taught (historically) |
“Once you get your head around this and start stepping through it in the debugger, it’s incredible how easy it is and how well it works.”
Summary
| Concept | Key point |
|---|---|
| Expressions vs statements | Expressions produce values; statements perform actions |
| Ambiguity | Flat grammars allow multiple parse trees for the same tokens |
| Operator precedence | Higher-precedence operators bind more tightly (appear lower in the tree) |
| Associativity | Left-associative = leftmost binds first; right-associative = rightmost binds first |
| Precedence in grammar | Split each precedence level into its own production; each references only the next level down |
| Recursive descent | One function per grammar rule; each function calls the next level down |
| Pass-through | Every level can delegate downward, letting bare values like 1 traverse the entire grammar |