CatScript Compiler - A Tour

Based on Crafting Interpreters and work by Carson Gross

What this guide covers

This short, hands-on guide explains the core compiler stages used in the Cat-script compiler:

Scanning / Tokenization - turn raw characters into tokens.
Parsing - turn tokens into an abstract syntax tree (AST).
Semantic analysis - check the AST for meaning and well-formedness.
Intermediate representations & bytecode - compile the AST to a lower-level form.
Runtime / memory management - how values are stored and managed at runtime.
Code generation & optimization - produce target code or optimized bytecode.

Tokenization (scanning / lexing)

Input: 1 + 1 Output: a stream of tokens: ["1"] ["+"] ["1"]

Position	Token	Type
0	`1`	Number
1	`+`	Operator
2	`1`	Number

Tokenizers collapse characters into meaningful units (numbers, identifiers, operators, punctuation). They often attach extra metadata (line/column) to each token for error messages.

Parsing

We use recursive-descent parsing (a top-down hand-written parser) driven by a grammar you will see later. The parser consumes tokens and builds a parse tree (AST) that represents program structure.

Tokens:

"1"  "+"  "1"

Parse tree (AST):

     (+)
    /   \
  "1"   "1"

The grammar defines which token sequences form valid expressions and statements, and the parser implements that grammar - typically as a collection of mutually recursive functions (e.g., parseExpression, parseTerm, parseFactor).

Semantic analysis

Parsing only checks form (is this syntactically valid?). Semantic analysis checks meaning:

Are operations applied to compatible types?
Are identifiers declared before use (symbol table checks)?
Do function calls match parameter counts/types?

Example question: if the right side of a + b is a list, does + mean concatenation or numeric addition? Semantic analysis answers that and emits errors or inserts conversions/overloads as appropriate.

Evaluation (interpreting)

One way to run code is to directly evaluate the AST. The evaluator walks the tree and produces values.

Example in pseudocode:

tokens = ["1", "+", "1"]
parse_tree = parse(tokens)
result = evaluate(parse_tree)
print(result) // prints 2

The evaluator must handle numeric operations, variable lookups, control flow, function calls, etc., using a runtime environment (stack frames, global environment, etc.).

Bytecode generation

Compiling to bytecode turns expressions into a small set of stack-machine instructions. Example for 1 + 1:

BIPUSH 1
BIPUSH 1
IADD

What these do:

BIPUSH 1 - push the constant 1 onto the stack.
BIPUSH 1 - push another 1.
IADD - pop two values, add them, push the result.

This stack-machine sequence is analogous to simple assembly using registers:

mov w0, 1
mov w1, 1
add w0, w0, w1

Move 1 into register w0.
Move 1 into register w1.
Add w0 and w1, store the result in w0.

Stack bytecode is compact and easy to generate; register code is closer to real CPU instructions and can be more efficient but requires register allocation.