Parsing Expressions

Primary expressions, list literals, ambiguity, and index expressions

Primary expressions in full

Primary expressions are at the bottom of the grammar - highest precedence, lowest in the recursive descent call chain. They are the base cases that terminate the recursive descent.

Simple literals

Each literal type follows the same pattern - only the token type and AST node class differ:

if (match(INTEGER)) {
    Token t = consumeToken();
    IntegerLiteralExpression expr = new IntegerLiteralExpression();
    expr.setToken(t);
    expr.setValue(Integer.parseInt(t.getStringValue()));
    return expr;
}

Token type	AST node	Value extraction
`INTEGER`	`IntegerLiteralExpression`	`Integer.parseInt(...)`
`STRING`	`StringLiteralExpression`	Use string value directly
`TRUE`	`BooleanLiteralExpression`	Set value to `true`
`FALSE`	`BooleanLiteralExpression`	Set value to `false`
`NULL`	`NullLiteralExpression`	No value needed

For boolean literals, TRUE and FALSE both produce a BooleanLiteralExpression - one with value true, the other with false.

Single-token vs multi-token expressions

Expression type	Start/End tokens	Method to use
Single-token (literals)	Same token for both	`setToken(token)`
Multi-token (binary, list, etc.)	Different start and end	`setStart(token)` + `setEnd(token)`

Parser helper methods

Three key methods available in the parser, each with a distinct purpose:

`match(TokenType... types)`

Returns true if the current token matches any of the given types. Does not consume the token. Use when you need to keep the token (e.g., to set it as a start token).

`matchAndConsume(TokenType type)`

Returns true and consumes the token if it matches. Use for syntactic delimiters you don’t need to keep - commas, colons, equals signs.

`require(TokenType type, ParseElement element)`

Consumes and returns the token if it matches. If the token is not there, adds a parse error to the given element. Use for tokens that must be present according to the grammar (closing brackets, parentheses, etc.).

Method	Consumes?	Error on miss?	Use case
`match`	No	No	Check before deciding what to parse
`matchAndConsume`	Yes	No	Discard syntactic delimiters (commas, colons)
`require`	Yes	Yes	Mandatory syntax (closing brackets, parens)

List literal parsing

list_literal -> '[', expression, { ',', expression }, ']' ;

A list literal is an open bracket, one or more comma-separated expressions, and a close bracket.

The do-while pattern

List literals (and argument lists) naturally use a do-while loop - parse one expression first, then keep parsing while there are commas:

if (match(LEFT_BRACKET)) {
    Token start = consumeToken();
    ListLiteralExpression list = new ListLiteralExpression();
    list.setStart(start);
    List<Expression> values = new ArrayList<>();

    // Handle empty list: [  ]
    if (!match(RIGHT_BRACKET)) {
        do {
            values.add(parseExpression());
        } while (matchAndConsume(COMMA));
    }

    list.setValues(values);
    list.setEnd(require(RIGHT_BRACKET, list));
    return list;
}

Key details:

matchAndConsume(COMMA) - commas are syntactic separators; we don’t need to keep them
require(RIGHT_BRACKET, list) - the closing bracket is mandatory; if missing, a parse error is attached to the list node
The empty list [] is handled by checking for RIGHT_BRACKET before entering the loop
Child expressions are collected into a list and set on the AST node via setValues()

Grammatical ambiguity: identifiers vs function calls

Two primary expressions start with the same token type (IDENTIFIER):

primary_expression -> IDENTIFIER                                (* variable reference *)
                   | IDENTIFIER, '(', argument_list, ')'       (* function call *)

Both x (variable reference) and x(1, 2) (function call) begin with an identifier token. This is grammatical ambiguity - we can’t tell which production to use from the first token alone.

Resolution: one token of lookahead

After consuming the identifier, check the next token:

if (match(IDENTIFIER)) {
    Token name = consumeToken();

    if (match(LEFT_PAREN)) {
        // Function call  -  parse argument list
        FunctionCallExpression call = new FunctionCallExpression();
        // ... parse arguments (same do-while/comma pattern as list literals)
        return call;
    } else {
        // Simple identifier expression
        IdentifierExpression expr = new IdentifierExpression();
        expr.setToken(name);
        return expr;
    }
}

If the next token is (, it’s a function call. Otherwise, it’s a variable reference. This makes CatScript an LL(1) grammar - a left-to-right parser with one token of lookahead.

Function argument lists reuse the list pattern

argument_list -> expression, { ',', expression } ;

The argument list grammar is nearly identical to list literals. The parsing code is the same do-while/comma pattern - just with parentheses as delimiters instead of brackets.

Index expressions

index_expression -> primary_expression, { '[', expression, ']' } ;

An index expression like x[10] indexes into a list. It sits between unary_expression and primary_expression in the precedence chain:

parseUnaryExpression -> parseIndexExpression -> parsePrimaryExpression

Implementation

Expression parseIndexExpression() {
    Expression lhs = parsePrimaryExpression();

    while (match(LEFT_BRACKET)) {
        consumeToken();
        IndexExpression expr = new IndexExpression();
        expr.setLeftHandSide(lhs);
        expr.setRightHandSide(parseExpression());
        require(RIGHT_BRACKET, expr);
        lhs = expr;  // enables chaining: x[0][1]
    }

    return lhs;
}

The while loop enables chained indexing like x[0][1]. Each iteration wraps the previous result as the left-hand side.

The same token means different things

A [ in a primary expression position starts a list literal:

[1, 2, 3]

A [ after a primary expression has been parsed starts an index expression:

x[0]

Recursive descent resolves this naturally - the parser knows where it is in the grammar and interprets tokens accordingly. No special logic needed.

Why hand-written parsers dominate production compilers

Recursive descent parsers are preferred in production because:

Excellent error messages - you always know exactly where you are in the grammar, so when something unexpected appears, you can report precisely what was expected
Easy to debug - step through with a standard debugger, inspect the token stream and AST at any point
No external tooling - no parser generator (yacc, bison, ANTLR) to configure and maintain
Full control - handle ambiguity, error recovery, and special cases with plain code

GCC, Clang, Go, Rust, V8 (JavaScript), and TypeScript all use hand-written recursive descent parsers.

Summary

Concept	Key point
Simple literals	Match token type -> consume -> construct node -> return
`match` / `matchAndConsume` / `require`	Check without consuming / consume and discard / consume or error
List literals	Do-while loop over comma-separated expressions, `require` closing bracket
Ambiguity	Identifier could be variable ref or function call - check next token
LL(1)	One token of lookahead resolves all ambiguities in CatScript
Index expressions	`while` loop after primary for `[expr]`, enables chaining
Same token, different meaning	`[` means list literal in primary position, index after an expression
Production parsers	Hand-written recursive descent preferred for error reporting and debuggability