Compilers

Parsing Statements

Print, var, for, while, if, and assignment statements

Expressions vs statements

  Expressions Statements
Produce A value Side effects
Action Evaluate Execute
Examples 1 + 1, foo(x), x > 10 print(x), var x = 10, for (...)
Children Other expressions Expressions and other statements

Statements form the core structure of imperative programming languages. They control flow (if, for, while), declare variables (var), and trigger side effects (print). Their arguments are typically expressions.


The CatScript statement grammar

statement -> for_statement
          | while_statement
          | if_statement
          | print_statement
          | variable_statement
          | assignment_statement
          | function_call_statement ;

Statement parsing strategy

Most statements start with a keyword - a strong, unambiguous start symbol:

Statement Start token
for for keyword
while while keyword
if if keyword
print print keyword
var var keyword
assignment IDENTIFIER (no keyword)
function call IDENTIFIER (no keyword)

The null-return convention

Each statement type gets its own parseXxxStatement() function. If the current token doesn’t match, the function returns null:

Statement parseStatement() {
    Statement printStmt = parsePrintStatement();
    if (printStmt != null) return printStmt;

    Statement varStmt = parseVarStatement();
    if (varStmt != null) return varStmt;

    Statement forStmt = parseForStatement();
    if (forStmt != null) return forStmt;

    // ... and so on for each statement type
}

This gives a clean, regular structure - easy to extend by duplicating and modifying a block.


print_statement -> 'print', '(', expression, ')' ;
Statement parsePrintStatement() {
    if (match(PRINT)) {
        PrintStatement stmt = new PrintStatement();
        stmt.setStart(consumeToken());
        require(LEFT_PAREN, stmt);
        stmt.setExpression(parseExpression());
        stmt.setEnd(require(RIGHT_PAREN, stmt));
        return stmt;
    }
    return null;
}

The expression between the parentheses can be anything - an additive expression, a function call, a concatenation. We just call parseExpression() and whatever it is gets consumed and becomes a child of the print statement.


Variable statement

variable_statement -> 'var', IDENTIFIER, [ ':', type_expression ], '=', expression ;
Statement parseVarStatement() {
    if (match(VAR)) {
        VariableStatement stmt = new VariableStatement();
        stmt.setStart(consumeToken());

        Token identifier = require(IDENTIFIER, stmt);
        stmt.setVariableName(identifier.getStringValue());

        if (matchAndConsume(COLON)) {
            TypeLiteral type = parseTypeExpression();
            stmt.setExplicitType(type.getType());
        }

        require(EQUALS, stmt);
        Expression expr = parseExpression();
        stmt.setExpression(expr);
        stmt.setEnd(expr.getEnd());
        return stmt;
    }
    return null;
}

Key details:

  • The identifier is required - require(IDENTIFIER, stmt) adds a parse error if it’s missing
  • The type annotation (: int) is optional - use matchAndConsume(COLON) to check
  • The colon and equals are syntactic delimiters - consumed but not stored as AST fields
  • The end token is whatever the expression ends with (we don’t know or care what it is)

For statement

for_statement -> 'for', '(', IDENTIFIER, 'in', expression, ')',
                '{', { statement }, '}' ;

Parsing a for loop:

  1. Match and consume the for keyword (set as start token)
  2. Require (
  3. Require an IDENTIFIER (the loop variable)
  4. Require the in keyword
  5. Parse an expression (the iterable)
  6. Require )
  7. Require {
  8. Parse statements in a loop while the next token is not }
  9. Require } (set as end token)

Statement bodies: the while-not-close-brace pattern

require(LEFT_BRACE, stmt);
List<Statement> body = new ArrayList<>();
while (!match(RIGHT_BRACE)) {
    body.add(parseStatement());
}
stmt.setStatements(body);
stmt.setEnd(require(RIGHT_BRACE, stmt));

This pattern - while the next token is not a closing brace, keep parsing statements - recurs in every statement that has a body (for, while, if, else).


While statement

while_statement -> 'while', '(', expression, ')', '{', { statement }, '}' ;

Nearly identical structure to for. Match the while keyword, require parenthesized expression, require braced body.


If statement

if_statement -> 'if', '(', expression, ')', '{', { statement }, '}',
               [ 'else', ( if_statement | '{', { statement }, '}' ) ] ;

After parsing the condition and body:

  1. Check for else
  2. If the token after else is if -> recursively call parseIfStatement() (handles else if chains)
  3. Otherwise -> parse another braced statement body (the terminal else block)

The dangling else problem

In C/Java, if statements don’t require braces:

if (val1)
    if (val2) foo();
    else bar();  // which if does this else belong to?

Without braces, it’s ambiguous whether else binds to the inner or outer if. C resolves this by convention (bind to the closest if), but it’s a source of bugs.

CatScript avoids this entirely by requiring braces for all if/else bodies. The grammar makes it unambiguous.


Assignment statement

assignment_statement -> IDENTIFIER, '=', expression ;

Unlike other statements, assignments don’t start with a keyword - they start with an identifier:

if (match(IDENTIFIER)) {
    Token name = consumeToken();
    require(EQUALS, stmt);
    Expression expr = parseExpression();
    // ...
    return assignmentStmt;
}

Assignment vs function call: another ambiguity

Two statement types start with IDENTIFIER:

Tokens Statement type
x = 10 Assignment
foo(1, 2) Function call statement

Resolution: after consuming the identifier, check the next token.

  • Next token is = -> assignment statement
  • Next token is ( -> function call statement (wraps a function call expression)

This is the same kind of one-token lookahead we saw with identifier expressions vs function call expressions at the expression level.

Function call statements

A function call statement wraps a function call expression and discards the return value. In Java, this is like calling doSomething() without assigning the result - the expression produces a value, but it’s used in a statement position for its side effects.


Statements containing statements

Statements can be nested:

for (x in list) {
    if (x > 10) {
        print(x)
    }
}

The parse tree:

ForStatement
├── identifier: "x"
├── expression: IdentifierExpression("list")
└── body:
    └── IfStatement
        ├── condition: ComparisonExpression(x > 10)
        └── body:
            └── PrintStatement
                └── expression: IdentifierExpression("x")

This nesting happens naturally - parseStatement() calls itself recursively through statement body parsing.


Summary

Concept Key point
Expressions vs statements Expressions produce values; statements have side effects and control flow
Keyword-started statements for, while, if, print, var all start with a keyword - easy to distinguish
Null-return convention Each parseXxxStatement() returns null if no match, enabling a chain of attempts
Statement bodies while (!match(RIGHT_BRACE)) { parseStatement(); }
Dangling else CatScript requires braces, avoiding the ambiguity entirely
Assignment ambiguity IDENTIFIER could start assignment or function call - check next token
Nested statements Statements can contain other statements; parse tree reflects the nesting