Compilers

Parsing Expressions

Primary expressions, list literals, ambiguity, and index expressions

Primary expressions in full

Primary expressions are at the bottom of the grammar - highest precedence, lowest in the recursive descent call chain. They are the base cases that terminate the recursive descent.

Simple literals

Each literal type follows the same pattern - only the token type and AST node class differ:

if (match(INTEGER)) {
    Token t = consumeToken();
    IntegerLiteralExpression expr = new IntegerLiteralExpression();
    expr.setToken(t);
    expr.setValue(Integer.parseInt(t.getStringValue()));
    return expr;
}
Token type AST node Value extraction
INTEGER IntegerLiteralExpression Integer.parseInt(...)
STRING StringLiteralExpression Use string value directly
TRUE BooleanLiteralExpression Set value to true
FALSE BooleanLiteralExpression Set value to false
NULL NullLiteralExpression No value needed

For boolean literals, TRUE and FALSE both produce a BooleanLiteralExpression - one with value true, the other with false.

Single-token vs multi-token expressions

Expression type Start/End tokens Method to use
Single-token (literals) Same token for both setToken(token)
Multi-token (binary, list, etc.) Different start and end setStart(token) + setEnd(token)

Parser helper methods

Three key methods available in the parser, each with a distinct purpose:

match(TokenType... types)

Returns true if the current token matches any of the given types. Does not consume the token. Use when you need to keep the token (e.g., to set it as a start token).

matchAndConsume(TokenType type)

Returns true and consumes the token if it matches. Use for syntactic delimiters you don’t need to keep - commas, colons, equals signs.

require(TokenType type, ParseElement element)

Consumes and returns the token if it matches. If the token is not there, adds a parse error to the given element. Use for tokens that must be present according to the grammar (closing brackets, parentheses, etc.).

Method Consumes? Error on miss? Use case
match No No Check before deciding what to parse
matchAndConsume Yes No Discard syntactic delimiters (commas, colons)
require Yes Yes Mandatory syntax (closing brackets, parens)

List literal parsing

list_literal -> '[', expression, { ',', expression }, ']' ;

A list literal is an open bracket, one or more comma-separated expressions, and a close bracket.

The do-while pattern

List literals (and argument lists) naturally use a do-while loop - parse one expression first, then keep parsing while there are commas:

if (match(LEFT_BRACKET)) {
    Token start = consumeToken();
    ListLiteralExpression list = new ListLiteralExpression();
    list.setStart(start);
    List<Expression> values = new ArrayList<>();

    // Handle empty list: [  ]
    if (!match(RIGHT_BRACKET)) {
        do {
            values.add(parseExpression());
        } while (matchAndConsume(COMMA));
    }

    list.setValues(values);
    list.setEnd(require(RIGHT_BRACKET, list));
    return list;
}

Key details:

  • matchAndConsume(COMMA) - commas are syntactic separators; we don’t need to keep them
  • require(RIGHT_BRACKET, list) - the closing bracket is mandatory; if missing, a parse error is attached to the list node
  • The empty list [] is handled by checking for RIGHT_BRACKET before entering the loop
  • Child expressions are collected into a list and set on the AST node via setValues()

Grammatical ambiguity: identifiers vs function calls

Two primary expressions start with the same token type (IDENTIFIER):

primary_expression -> IDENTIFIER                                (* variable reference *)
                   | IDENTIFIER, '(', argument_list, ')'       (* function call *)

Both x (variable reference) and x(1, 2) (function call) begin with an identifier token. This is grammatical ambiguity - we can’t tell which production to use from the first token alone.

Resolution: one token of lookahead

After consuming the identifier, check the next token:

if (match(IDENTIFIER)) {
    Token name = consumeToken();

    if (match(LEFT_PAREN)) {
        // Function call  -  parse argument list
        FunctionCallExpression call = new FunctionCallExpression();
        // ... parse arguments (same do-while/comma pattern as list literals)
        return call;
    } else {
        // Simple identifier expression
        IdentifierExpression expr = new IdentifierExpression();
        expr.setToken(name);
        return expr;
    }
}

If the next token is (, it’s a function call. Otherwise, it’s a variable reference. This makes CatScript an LL(1) grammar - a left-to-right parser with one token of lookahead.

Function argument lists reuse the list pattern

argument_list -> expression, { ',', expression } ;

The argument list grammar is nearly identical to list literals. The parsing code is the same do-while/comma pattern - just with parentheses as delimiters instead of brackets.


Index expressions

index_expression -> primary_expression, { '[', expression, ']' } ;

An index expression like x[10] indexes into a list. It sits between unary_expression and primary_expression in the precedence chain:

parseUnaryExpression -> parseIndexExpression -> parsePrimaryExpression

Implementation

Expression parseIndexExpression() {
    Expression lhs = parsePrimaryExpression();

    while (match(LEFT_BRACKET)) {
        consumeToken();
        IndexExpression expr = new IndexExpression();
        expr.setLeftHandSide(lhs);
        expr.setRightHandSide(parseExpression());
        require(RIGHT_BRACKET, expr);
        lhs = expr;  // enables chaining: x[0][1]
    }

    return lhs;
}

The while loop enables chained indexing like x[0][1]. Each iteration wraps the previous result as the left-hand side.

The same token means different things

A [ in a primary expression position starts a list literal:

[1, 2, 3]

A [ after a primary expression has been parsed starts an index expression:

x[0]

Recursive descent resolves this naturally - the parser knows where it is in the grammar and interprets tokens accordingly. No special logic needed.


Why hand-written parsers dominate production compilers

Recursive descent parsers are preferred in production because:

  1. Excellent error messages - you always know exactly where you are in the grammar, so when something unexpected appears, you can report precisely what was expected
  2. Easy to debug - step through with a standard debugger, inspect the token stream and AST at any point
  3. No external tooling - no parser generator (yacc, bison, ANTLR) to configure and maintain
  4. Full control - handle ambiguity, error recovery, and special cases with plain code

GCC, Clang, Go, Rust, V8 (JavaScript), and TypeScript all use hand-written recursive descent parsers.


Summary

Concept Key point
Simple literals Match token type -> consume -> construct node -> return
match / matchAndConsume / require Check without consuming / consume and discard / consume or error
List literals Do-while loop over comma-separated expressions, require closing bracket
Ambiguity Identifier could be variable ref or function call - check next token
LL(1) One token of lookahead resolves all ambiguities in CatScript
Index expressions while loop after primary for [expr], enables chaining
Same token, different meaning [ means list literal in primary position, index after an expression
Production parsers Hand-written recursive descent preferred for error reporting and debuggability

Further reading