CatScript Compiler - A Tour
Based on Crafting Interpreters and work by Carson Gross
What this guide covers
This short, hands-on guide explains the core compiler stages used in the Cat-script compiler:
- Scanning / Tokenization - turn raw characters into tokens.
- Parsing - turn tokens into an abstract syntax tree (AST).
- Semantic analysis - check the AST for meaning and well-formedness.
- Intermediate representations & bytecode - compile the AST to a lower-level form.
- Runtime / memory management - how values are stored and managed at runtime.
- Code generation & optimization - produce target code or optimized bytecode.
Tokenization (scanning / lexing)
Input: 1 + 1
Output: a stream of tokens: ["1"] ["+"] ["1"]
| Position | Token | Type |
|---|---|---|
| 0 | 1 |
Number |
| 1 | + |
Operator |
| 2 | 1 |
Number |
Tokenizers collapse characters into meaningful units (numbers, identifiers, operators, punctuation). They often attach extra metadata (line/column) to each token for error messages.
Parsing
We use recursive-descent parsing (a top-down hand-written parser) driven by a grammar you will see later. The parser consumes tokens and builds a parse tree (AST) that represents program structure.
Tokens:
"1" "+" "1"
Parse tree (AST):
(+)
/ \
"1" "1"
The grammar defines which token sequences form valid expressions and statements, and the parser implements that grammar - typically as a collection of mutually recursive functions (e.g., parseExpression, parseTerm, parseFactor).
Semantic analysis
Parsing only checks form (is this syntactically valid?). Semantic analysis checks meaning:
- Are operations applied to compatible types?
- Are identifiers declared before use (symbol table checks)?
- Do function calls match parameter counts/types?
Example question: if the right side of a + b is a list, does + mean concatenation or numeric addition? Semantic analysis answers that and emits errors or inserts conversions/overloads as appropriate.
Evaluation (interpreting)
One way to run code is to directly evaluate the AST. The evaluator walks the tree and produces values.
Example in pseudocode:
tokens = ["1", "+", "1"]
parse_tree = parse(tokens)
result = evaluate(parse_tree)
print(result) // prints 2
The evaluator must handle numeric operations, variable lookups, control flow, function calls, etc., using a runtime environment (stack frames, global environment, etc.).
Bytecode generation
Compiling to bytecode turns expressions into a small set of stack-machine instructions. Example for 1 + 1:
BIPUSH 1
BIPUSH 1
IADD
What these do:
BIPUSH 1- push the constant1onto the stack.BIPUSH 1- push another1.IADD- pop two values, add them, push the result.
This stack-machine sequence is analogous to simple assembly using registers:
mov w0, 1
mov w1, 1
add w0, w0, w1
- Move
1into registerw0. - Move
1into registerw1. - Add
w0andw1, store the result inw0.
Stack bytecode is compact and easy to generate; register code is closer to real CPU instructions and can be more efficient but requires register allocation.
Further reading
- Read Crafting Interpreters for an excellent, in-depth walkthrough.
- Look at Carson Gross’s materials for implementation patterns and examples.