Compilers

Parsing Functions & Type Literals

Function calls, return statements, function declarations, and type expressions

What’s left to parse

The remaining CatScript grammar productions are:

  • Function call statements (and resolving the identifier ambiguity)
  • Return statements
  • Function declarations
  • Type expressions (type literals)

The identifier ambiguity

Both assignment statements and function call statements start with an identifier:

Tokens Statement type
x = 10 Assignment statement
x(1, 2) Function call statement
x[3] = 1 Index assignment statement

Resolution: create a single method parseAssignmentOrFunctionCallStatement() that consumes the identifier, then checks the next token:

  • = -> assignment statement
  • ( -> function call statement
  • [ -> index assignment statement

Function call statements

A function call statement wraps a function call expression and discards any return value. The function exists for its side effects.

The challenge: the identifier has already been consumed at the statement level, but parseFunctionCallExpression() also expects to match an identifier.

Solution - refactor parseFunctionCallExpression() to accept the identifier as a parameter:

Expression parseFunctionCallExpression(Token identifier) {
    // normal parsing logic, but skip consuming the identifier
}

This lets the same function call parsing logic work in both expression and statement positions.


Left factoring

An alternative approach is to left factor the grammar so only one production starts with an identifier:

identifier_statement -> IDENTIFIER, ( assignment_tail | function_call_tail | index_assignment_tail ) ;

This removes the ambiguity at the grammar level. In practice, the lookahead approach is simpler for recursive descent parsers - just check the next token.


Function call arguments

Arguments follow the same comma-separated list pattern seen in list literals:

consume(LEFT_PAREN);
if (!match(RIGHT_PAREN)) {
    do {
        args.add(parseExpression());
    } while (matchAndConsume(COMMA));
}
require(RIGHT_PAREN, expr);

Return statements

return_statement -> 'return', [ expression ] ;

The expression is optional because void functions can have a bare return.

The trick: look ahead at the next token without consuming it. If the next token is a RIGHT_BRACE or EOF, don’t parse an expression - there isn’t one. Otherwise, parse the expression.

if (match(RETURN)) {
    ReturnStatement stmt = new ReturnStatement();
    stmt.setStart(consumeToken());
    if (!match(RIGHT_BRACE) && !match(EOF)) {
        stmt.setExpression(parseExpression());
    }
    return stmt;
}

This is a practical hack - no theory behind it, just syntactic knowledge of CatScript.


Function declarations

Function declarations are the largest grammar production:

function_definition -> 'function', IDENTIFIER, '(', parameter_list, ')',
                       [ ':', type_expression ], '{', { statement }, '}' ;

parameter_list -> [ parameter, { ',', parameter } ] ;

parameter -> IDENTIFIER, [ ':', type_expression ] ;

Parsing steps:

  1. Match and consume function keyword (strong start token)
  2. Require an IDENTIFIER (function name)
  3. Require (
  4. Parse parameter list (zero or more parameters separated by commas)
  5. Require )
  6. Optionally match : followed by a type expression (return type)
  7. Require {
  8. Parse statements in a loop while not }
  9. Require }

Parameters without an explicit type annotation default to object.


Function declarations are top-level only

program -> { program_statement } ;
program_statement -> function_definition | statement ;
statement -> for_statement | if_statement | ... ;
Nesting Allowed?
for inside for Yes
for inside function Yes
function inside for No

Function definitions live under program_statement, not statement. This means functions can only be defined at the top level - they cannot be nested inside loops or conditionals. Statements nest arbitrarily, but function declarations cannot appear inside statement bodies.


Type expressions (type literals)

type_expression -> 'int' | 'string' | 'bool' | 'object'
                 | 'list', [ '<', type_expression, '>' ] ;

The base types (int, string, bool, object) are identifiers, not keywords.

When parseTypeLiteral() is called, a type is required - it always follows a colon in the grammar, so there must be an identifier present.

TypeLiteral parseTypeLiteral() {
    TypeLiteral typeLiteral = new TypeLiteral();
    Token rootType = require(IDENTIFIER, typeLiteral);

    if (rootType.getStringValue().equals("int")) {
        typeLiteral.setType(CatScriptType.INT);
    } else if (rootType.getStringValue().equals("list")) {
        if (matchAndConsume(LESS)) {
            TypeLiteral componentType = parseTypeLiteral();  // recursive!
            require(GREATER, typeLiteral);
            typeLiteral.setType(componentType.getType().getListType());
        }
    }
    // ... string, bool, object cases
    return typeLiteral;
}

The recursion handles nested list types: list<list<int>> calls parseTypeLiteral() recursively, composing list types from the inside out.


Summary

Concept Key point
Identifier ambiguity Assignment, function call, and index assignment all start with IDENTIFIER - resolve with one-token lookahead
Function call refactoring Extract parseFunctionCallExpression(Token identifier) to share logic between expression and statement positions
Left factoring Alternative grammar rewrite that removes ambiguity, but harder to read
Return statements Check for RIGHT_BRACE or EOF to decide if there’s an expression
Function declarations Largest production, but same patterns - top-level only
Type literals Recursive for list types; base types are identifiers, not keywords