Compilers

CatScript Compiler - A Tour

Based on Crafting Interpreters and work by Carson Gross

What this guide covers

This short, hands-on guide explains the core compiler stages used in the Cat-script compiler:

  • Scanning / Tokenization - turn raw characters into tokens.
  • Parsing - turn tokens into an abstract syntax tree (AST).
  • Semantic analysis - check the AST for meaning and well-formedness.
  • Intermediate representations & bytecode - compile the AST to a lower-level form.
  • Runtime / memory management - how values are stored and managed at runtime.
  • Code generation & optimization - produce target code or optimized bytecode.

Tokenization (scanning / lexing)

Input: 1 + 1 Output: a stream of tokens: ["1"] ["+"] ["1"]

Position Token Type
0 1 Number
1 + Operator
2 1 Number

Tokenizers collapse characters into meaningful units (numbers, identifiers, operators, punctuation). They often attach extra metadata (line/column) to each token for error messages.


Parsing

We use recursive-descent parsing (a top-down hand-written parser) driven by a grammar you will see later. The parser consumes tokens and builds a parse tree (AST) that represents program structure.

Tokens:

"1"  "+"  "1"

Parse tree (AST):

     (+)
    /   \
  "1"   "1"

The grammar defines which token sequences form valid expressions and statements, and the parser implements that grammar - typically as a collection of mutually recursive functions (e.g., parseExpression, parseTerm, parseFactor).


Semantic analysis

Parsing only checks form (is this syntactically valid?). Semantic analysis checks meaning:

  • Are operations applied to compatible types?
  • Are identifiers declared before use (symbol table checks)?
  • Do function calls match parameter counts/types?

Example question: if the right side of a + b is a list, does + mean concatenation or numeric addition? Semantic analysis answers that and emits errors or inserts conversions/overloads as appropriate.


Evaluation (interpreting)

One way to run code is to directly evaluate the AST. The evaluator walks the tree and produces values.

Example in pseudocode:

tokens = ["1", "+", "1"]
parse_tree = parse(tokens)
result = evaluate(parse_tree)
print(result) // prints 2

The evaluator must handle numeric operations, variable lookups, control flow, function calls, etc., using a runtime environment (stack frames, global environment, etc.).


Bytecode generation

Compiling to bytecode turns expressions into a small set of stack-machine instructions. Example for 1 + 1:

BIPUSH 1
BIPUSH 1
IADD

What these do:

  • BIPUSH 1 - push the constant 1 onto the stack.
  • BIPUSH 1 - push another 1.
  • IADD - pop two values, add them, push the result.

This stack-machine sequence is analogous to simple assembly using registers:

mov w0, 1
mov w1, 1
add w0, w0, w1
  • Move 1 into register w0.
  • Move 1 into register w1.
  • Add w0 and w1, store the result in w0.

Stack bytecode is compact and easy to generate; register code is closer to real CPU instructions and can be more efficient but requires register allocation.


Further reading

  • Read Crafting Interpreters for an excellent, in-depth walkthrough.
  • Look at Carson Gross’s materials for implementation patterns and examples.