Compilers

Parse Trees and Context-Free Grammars

From tokens to tree structures

Why parse trees

Consider the tokens 1 + 2 * 3 - 4. What do they mean?

        (-)
       /   \
     (+)    4
    /   \
   1    (*)
        / \
       2   3

Parse trees are powerful because they lend themselves to recursion - specifically the algorithm used in the CatScript parser: recursive descent parsing.


Evaluating a parse tree

Evaluation works naturally on tree structures:

Node Operation Result
2 Evaluate literal 2
3 Evaluate literal 3
(*) Evaluate left (2) and right (3), multiply 6
1 Evaluate literal 1
(+) Evaluate left (1) and right (6), add 7
4 Evaluate literal 4
(-) Evaluate left (7) and right (4), subtract 3

This recursive evaluation is an intuitive design that naturally respects operator precedence.


Context-free grammars

So how do we transform an array of tokens into a tree?

With a Context-Free Grammar (CFG).

A context-free grammar is a formal grammar that specifies the structure of a language:

  • Alphabet - The set of all symbols used in the grammar
  • Productions - Rules that define how symbols expand
  • Start symbol - The top-level rule where parsing begins

Terminals vs non-terminals

The key distinction in any grammar:

  Terminals Non-terminals
What they are Actual tokens/characters Abstract categories
Can expand? No - they are “terminal” Yes - they have production rules
In parse tree Leaf nodes Internal nodes
Examples "if", "+", 123, "bacon" expression, statement, breakfast
Notation Quoted strings or UPPERCASE lowercase names

Why the name “terminal”? These symbols are the endpoints where derivation terminates - there’s nothing more to expand. Non-terminals keep expanding until only terminals remain.

In the breakfast grammar:

  • Terminals: "really", "crispy", "bacon", "toast", "with", "on the side"
  • Non-terminals: breakfast, protein, crispiness, bread, cooked

CFGs can describe infinite languages through recursion:

really long -> really really long -> really really really long -> ...

We define rules using Extended Backus-Naur Form (EBNF) called productions.

Key insight:

  • Producing a valid string = starting from grammar rules, generating text
  • Parsing = going from a string back to a grammar (the reverse)

EBNF notation

Usage Notation Alternative Meaning
definition =   Defines a rule
concatenation ,   Sequence of elements
termination ; . End of rule
alternation \| / or ! One or the other
optional [ ... ] (/ ... /) Zero or one
repetition { ... } (: ... :) Zero or more
grouping ( ... )   Group elements
terminal string " ... " ' ... ' Literal text
comment (* ... *)   Documentation
special sequence ? ... ?   Implementation-defined
exception -   Exclude from set

See Extended Backus-Naur Form on Wikipedia for more details.


The breakfast grammar

Here’s a simple grammar for describing breakfast orders:

breakfast  -> protein "with" breakfast "on the side" ;
breakfast  -> protein ;
breakfast  -> bread ;

protein    -> crispiness "crispy" "bacon" ;
protein    -> "sausage" ;
protein    -> cooked "eggs" ;

crispiness -> "really" ;
crispiness -> "really" crispiness ;

cooked     -> "scrambled" ;
cooked     -> "poached" ;
cooked     -> "fried" ;

bread      -> "toast" ;
bread      -> "biscuits" ;
bread      -> "English muffin" ;

This grammar demonstrates:

  • Recursion - crispiness can contain another crispiness, and breakfast can contain another breakfast
  • Alternation - breakfast can be a protein, bread, or a combination with “on the side”
  • Terminals - Quoted strings like "toast" and "bacon" are literal tokens

Parsing example

Let’s parse a string that exercises recursion and alternatives:

really really crispy bacon with toast on the side

Step 1: Identify the top-level rule

The full sentence matches:

breakfast -> protein "with" breakfast "on the side"

So the root of the parse tree is breakfast.

Step 2: Expand the left protein

protein -> crispiness "crispy" "bacon"

Now expand crispiness (note the recursion):

crispiness -> "really" crispiness
crispiness -> "really"

This produces really really. The subtree:

protein
 ├─ crispiness
 │   ├─ "really"
 │   └─ crispiness
 │       └─ "really"
 ├─ "crispy"
 └─ "bacon"

Step 3: Expand the nested breakfast

The inner breakfast is simply:

breakfast -> bread
bread -> "toast"

Step 4: Full parse tree

breakfast
├─ protein
│  ├─ crispiness
│  │  ├─ "really"
│  │  └─ crispiness
│  │     └─ "really"
│  ├─ "crispy"
│  └─ "bacon"
├─ "with"
├─ breakfast
│  └─ bread
│     └─ "toast"
└─ "on the side"

CatScript implementation

Here’s how a CatScript for-loop maps to its grammar rule:

Source code:

for (s in strings) {
    print(s)
    print(s)
}

Grammar rule:

for_statement = 'for', '(', IDENTIFIER, 'in', expression ')',
                '{', { statement }, '}';
Code Grammar element
for 'for' terminal
( '(' terminal
s IDENTIFIER
in 'in' terminal
strings expression
) ')' terminal
{ '{' terminal
print(s) statement (first)
print(s) statement (second, via { statement } repetition)
} '}' terminal

The full CatScript grammar can be found in CatScript Overview.