Parsing Expressions
Primary expressions, list literals, ambiguity, and index expressions
Primary expressions in full
Primary expressions are at the bottom of the grammar - highest precedence, lowest in the recursive descent call chain. They are the base cases that terminate the recursive descent.
Simple literals
Each literal type follows the same pattern - only the token type and AST node class differ:
if (match(INTEGER)) {
Token t = consumeToken();
IntegerLiteralExpression expr = new IntegerLiteralExpression();
expr.setToken(t);
expr.setValue(Integer.parseInt(t.getStringValue()));
return expr;
}
| Token type | AST node | Value extraction |
|---|---|---|
INTEGER |
IntegerLiteralExpression |
Integer.parseInt(...) |
STRING |
StringLiteralExpression |
Use string value directly |
TRUE |
BooleanLiteralExpression |
Set value to true |
FALSE |
BooleanLiteralExpression |
Set value to false |
NULL |
NullLiteralExpression |
No value needed |
For boolean literals, TRUE and FALSE both produce a BooleanLiteralExpression - one with value true, the other with false.
Single-token vs multi-token expressions
| Expression type | Start/End tokens | Method to use |
|---|---|---|
| Single-token (literals) | Same token for both | setToken(token) |
| Multi-token (binary, list, etc.) | Different start and end | setStart(token) + setEnd(token) |
Parser helper methods
Three key methods available in the parser, each with a distinct purpose:
match(TokenType... types)
Returns true if the current token matches any of the given types. Does not consume the token. Use when you need to keep the token (e.g., to set it as a start token).
matchAndConsume(TokenType type)
Returns true and consumes the token if it matches. Use for syntactic delimiters you don’t need to keep - commas, colons, equals signs.
require(TokenType type, ParseElement element)
Consumes and returns the token if it matches. If the token is not there, adds a parse error to the given element. Use for tokens that must be present according to the grammar (closing brackets, parentheses, etc.).
| Method | Consumes? | Error on miss? | Use case |
|---|---|---|---|
match |
No | No | Check before deciding what to parse |
matchAndConsume |
Yes | No | Discard syntactic delimiters (commas, colons) |
require |
Yes | Yes | Mandatory syntax (closing brackets, parens) |
List literal parsing
list_literal -> '[', expression, { ',', expression }, ']' ;
A list literal is an open bracket, one or more comma-separated expressions, and a close bracket.
The do-while pattern
List literals (and argument lists) naturally use a do-while loop - parse one expression first, then keep parsing while there are commas:
if (match(LEFT_BRACKET)) {
Token start = consumeToken();
ListLiteralExpression list = new ListLiteralExpression();
list.setStart(start);
List<Expression> values = new ArrayList<>();
// Handle empty list: [ ]
if (!match(RIGHT_BRACKET)) {
do {
values.add(parseExpression());
} while (matchAndConsume(COMMA));
}
list.setValues(values);
list.setEnd(require(RIGHT_BRACKET, list));
return list;
}
Key details:
matchAndConsume(COMMA)- commas are syntactic separators; we don’t need to keep themrequire(RIGHT_BRACKET, list)- the closing bracket is mandatory; if missing, a parse error is attached to the list node- The empty list
[]is handled by checking forRIGHT_BRACKETbefore entering the loop - Child expressions are collected into a list and set on the AST node via
setValues()
Grammatical ambiguity: identifiers vs function calls
Two primary expressions start with the same token type (IDENTIFIER):
primary_expression -> IDENTIFIER (* variable reference *)
| IDENTIFIER, '(', argument_list, ')' (* function call *)
Both x (variable reference) and x(1, 2) (function call) begin with an identifier token. This is grammatical ambiguity - we can’t tell which production to use from the first token alone.
Resolution: one token of lookahead
After consuming the identifier, check the next token:
if (match(IDENTIFIER)) {
Token name = consumeToken();
if (match(LEFT_PAREN)) {
// Function call - parse argument list
FunctionCallExpression call = new FunctionCallExpression();
// ... parse arguments (same do-while/comma pattern as list literals)
return call;
} else {
// Simple identifier expression
IdentifierExpression expr = new IdentifierExpression();
expr.setToken(name);
return expr;
}
}
If the next token is (, it’s a function call. Otherwise, it’s a variable reference. This makes CatScript an LL(1) grammar - a left-to-right parser with one token of lookahead.
Function argument lists reuse the list pattern
argument_list -> expression, { ',', expression } ;
The argument list grammar is nearly identical to list literals. The parsing code is the same do-while/comma pattern - just with parentheses as delimiters instead of brackets.
Index expressions
index_expression -> primary_expression, { '[', expression, ']' } ;
An index expression like x[10] indexes into a list. It sits between unary_expression and primary_expression in the precedence chain:
parseUnaryExpression -> parseIndexExpression -> parsePrimaryExpression
Implementation
Expression parseIndexExpression() {
Expression lhs = parsePrimaryExpression();
while (match(LEFT_BRACKET)) {
consumeToken();
IndexExpression expr = new IndexExpression();
expr.setLeftHandSide(lhs);
expr.setRightHandSide(parseExpression());
require(RIGHT_BRACKET, expr);
lhs = expr; // enables chaining: x[0][1]
}
return lhs;
}
The while loop enables chained indexing like x[0][1]. Each iteration wraps the previous result as the left-hand side.
The same token means different things
A [ in a primary expression position starts a list literal:
[1, 2, 3]
A [ after a primary expression has been parsed starts an index expression:
x[0]
Recursive descent resolves this naturally - the parser knows where it is in the grammar and interprets tokens accordingly. No special logic needed.
Why hand-written parsers dominate production compilers
Recursive descent parsers are preferred in production because:
- Excellent error messages - you always know exactly where you are in the grammar, so when something unexpected appears, you can report precisely what was expected
- Easy to debug - step through with a standard debugger, inspect the token stream and AST at any point
- No external tooling - no parser generator (yacc, bison, ANTLR) to configure and maintain
- Full control - handle ambiguity, error recovery, and special cases with plain code
GCC, Clang, Go, Rust, V8 (JavaScript), and TypeScript all use hand-written recursive descent parsers.
Summary
| Concept | Key point |
|---|---|
| Simple literals | Match token type -> consume -> construct node -> return |
match / matchAndConsume / require |
Check without consuming / consume and discard / consume or error |
| List literals | Do-while loop over comma-separated expressions, require closing bracket |
| Ambiguity | Identifier could be variable ref or function call - check next token |
| LL(1) | One token of lookahead resolves all ambiguities in CatScript |
| Index expressions | while loop after primary for [expr], enables chaining |
| Same token, different meaning | [ means list literal in primary position, index after an expression |
| Production parsers | Hand-written recursive descent preferred for error reporting and debuggability |