Parse Trees and Context-Free Grammars

From tokens to tree structures

Why parse trees

Consider the tokens 1 + 2 * 3 - 4. What do they mean?

Parse trees are powerful because they lend themselves to recursion - specifically the algorithm used in the CatScript parser: recursive descent parsing.

Evaluating a parse tree

Evaluation works naturally on tree structures:

Node	Operation	Result
`2`	Evaluate literal	2
`3`	Evaluate literal	3
`(*)`	Evaluate left (2) and right (3), multiply	6
`1`	Evaluate literal	1
`(+)`	Evaluate left (1) and right (6), add	7
`4`	Evaluate literal	4
`(-)`	Evaluate left (7) and right (4), subtract	3

This recursive evaluation is an intuitive design that naturally respects operator precedence.

Context-free grammars

So how do we transform an array of tokens into a tree?

With a Context-Free Grammar (CFG).

A context-free grammar is a formal grammar that specifies the structure of a language:

Alphabet - The set of all symbols used in the grammar
Productions - Rules that define how symbols expand
Start symbol - The top-level rule where parsing begins

Terminals vs non-terminals

The key distinction in any grammar:

	Terminals	Non-terminals
What they are	Actual tokens/characters	Abstract categories
Can expand?	No - they are “terminal”	Yes - they have production rules
In parse tree	Leaf nodes	Internal nodes
Examples	`"if"`, `"+"`, `123`, `"bacon"`	`expression`, `statement`, `breakfast`
Notation	Quoted strings or UPPERCASE	lowercase names

Why the name “terminal”? These symbols are the endpoints where derivation terminates - there’s nothing more to expand. Non-terminals keep expanding until only terminals remain.

In the breakfast grammar:

Terminals: "really", "crispy", "bacon", "toast", "with", "on the side"
Non-terminals: breakfast, protein, crispiness, bread, cooked

CFGs can describe infinite languages through recursion:

really long -> really really long -> really really really long -> ...

We define rules using Extended Backus-Naur Form (EBNF) called productions.

Key insight:

Producing a valid string = starting from grammar rules, generating text
Parsing = going from a string back to a grammar (the reverse)

EBNF notation

Usage	Notation	Alternative	Meaning
definition	`=`		Defines a rule
concatenation	`,`		Sequence of elements
termination	`;`	`.`	End of rule
alternation	`\\|`	`/` or `!`	One or the other
optional	`[ ... ]`	`(/ ... /)`	Zero or one
repetition	`{ ... }`	`(: ... :)`	Zero or more
grouping	`( ... )`		Group elements
terminal string	`" ... "`	`' ... '`	Literal text
comment	`(* ... *)`		Documentation
special sequence	`? ... ?`		Implementation-defined
exception	`-`		Exclude from set

See Extended Backus-Naur Form on Wikipedia for more details.

The breakfast grammar

Here’s a simple grammar for describing breakfast orders:

breakfast  -> protein "with" breakfast "on the side" ;
breakfast  -> protein ;
breakfast  -> bread ;

protein    -> crispiness "crispy" "bacon" ;
protein    -> "sausage" ;
protein    -> cooked "eggs" ;

crispiness -> "really" ;
crispiness -> "really" crispiness ;

cooked     -> "scrambled" ;
cooked     -> "poached" ;
cooked     -> "fried" ;

bread      -> "toast" ;
bread      -> "biscuits" ;
bread      -> "English muffin" ;

This grammar demonstrates:

Recursion - crispiness can contain another crispiness, and breakfast can contain another breakfast
Alternation - breakfast can be a protein, bread, or a combination with “on the side”
Terminals - Quoted strings like "toast" and "bacon" are literal tokens

Parsing example

Let’s parse a string that exercises recursion and alternatives:

really really crispy bacon with toast on the side

Step 1: Identify the top-level rule

The full sentence matches:

breakfast -> protein "with" breakfast "on the side"

So the root of the parse tree is breakfast.

Step 2: Expand the left `protein`

protein -> crispiness "crispy" "bacon"

Now expand crispiness (note the recursion):

crispiness -> "really" crispiness
crispiness -> "really"

This produces really really. The subtree:

protein
 ├─ crispiness
 │   ├─ "really"
 │   └─ crispiness
 │       └─ "really"
 ├─ "crispy"
 └─ "bacon"

Step 3: Expand the nested `breakfast`

The inner breakfast is simply:

breakfast -> bread
bread -> "toast"

Step 4: Full parse tree

breakfast
├─ protein
│  ├─ crispiness
│  │  ├─ "really"
│  │  └─ crispiness
│  │     └─ "really"
│  ├─ "crispy"
│  └─ "bacon"
├─ "with"
├─ breakfast
│  └─ bread
│     └─ "toast"
└─ "on the side"

CatScript implementation

Here’s how a CatScript for-loop maps to its grammar rule:

Source code:

for (s in strings) {
    print(s)
    print(s)
}

Grammar rule:

for_statement = 'for', '(', IDENTIFIER, 'in', expression ')',
                '{', { statement }, '}';

Code	Grammar element
`for`	`'for'` terminal
`(`	`'('` terminal
`s`	`IDENTIFIER`
`in`	`'in'` terminal
`strings`	`expression`
`)`	`')'` terminal
`{`	`'{'` terminal
`print(s)`	`statement` (first)
`print(s)`	`statement` (second, via `{ statement }` repetition)
`}`	`'}'` terminal

The full CatScript grammar can be found in CatScript Overview.