Parse Trees and Context-Free Grammars
From tokens to tree structures
Why parse trees
Consider the tokens 1 + 2 * 3 - 4. What do they mean?
(-)
/ \
(+) 4
/ \
1 (*)
/ \
2 3
Parse trees are powerful because they lend themselves to recursion - specifically the algorithm used in the CatScript parser: recursive descent parsing.
Evaluating a parse tree
Evaluation works naturally on tree structures:
| Node | Operation | Result |
|---|---|---|
2 |
Evaluate literal | 2 |
3 |
Evaluate literal | 3 |
(*) |
Evaluate left (2) and right (3), multiply | 6 |
1 |
Evaluate literal | 1 |
(+) |
Evaluate left (1) and right (6), add | 7 |
4 |
Evaluate literal | 4 |
(-) |
Evaluate left (7) and right (4), subtract | 3 |
This recursive evaluation is an intuitive design that naturally respects operator precedence.
Context-free grammars
So how do we transform an array of tokens into a tree?
With a Context-Free Grammar (CFG).
A context-free grammar is a formal grammar that specifies the structure of a language:
- Alphabet - The set of all symbols used in the grammar
- Productions - Rules that define how symbols expand
- Start symbol - The top-level rule where parsing begins
Terminals vs non-terminals
The key distinction in any grammar:
| Terminals | Non-terminals | |
|---|---|---|
| What they are | Actual tokens/characters | Abstract categories |
| Can expand? | No - they are “terminal” | Yes - they have production rules |
| In parse tree | Leaf nodes | Internal nodes |
| Examples | "if", "+", 123, "bacon" |
expression, statement, breakfast |
| Notation | Quoted strings or UPPERCASE | lowercase names |
Why the name “terminal”? These symbols are the endpoints where derivation terminates - there’s nothing more to expand. Non-terminals keep expanding until only terminals remain.
In the breakfast grammar:
- Terminals:
"really","crispy","bacon","toast","with","on the side" - Non-terminals:
breakfast,protein,crispiness,bread,cooked
CFGs can describe infinite languages through recursion:
really long -> really really long -> really really really long -> ...
We define rules using Extended Backus-Naur Form (EBNF) called productions.
Key insight:
- Producing a valid string = starting from grammar rules, generating text
- Parsing = going from a string back to a grammar (the reverse)
EBNF notation
| Usage | Notation | Alternative | Meaning |
|---|---|---|---|
| definition | = |
Defines a rule | |
| concatenation | , |
Sequence of elements | |
| termination | ; |
. |
End of rule |
| alternation | \| |
/ or ! |
One or the other |
| optional | [ ... ] |
(/ ... /) |
Zero or one |
| repetition | { ... } |
(: ... :) |
Zero or more |
| grouping | ( ... ) |
Group elements | |
| terminal string | " ... " |
' ... ' |
Literal text |
| comment | (* ... *) |
Documentation | |
| special sequence | ? ... ? |
Implementation-defined | |
| exception | - |
Exclude from set |
See Extended Backus-Naur Form on Wikipedia for more details.
The breakfast grammar
Here’s a simple grammar for describing breakfast orders:
breakfast -> protein "with" breakfast "on the side" ;
breakfast -> protein ;
breakfast -> bread ;
protein -> crispiness "crispy" "bacon" ;
protein -> "sausage" ;
protein -> cooked "eggs" ;
crispiness -> "really" ;
crispiness -> "really" crispiness ;
cooked -> "scrambled" ;
cooked -> "poached" ;
cooked -> "fried" ;
bread -> "toast" ;
bread -> "biscuits" ;
bread -> "English muffin" ;
This grammar demonstrates:
- Recursion -
crispinesscan contain anothercrispiness, andbreakfastcan contain anotherbreakfast - Alternation -
breakfastcan be aprotein,bread, or a combination with “on the side” - Terminals - Quoted strings like
"toast"and"bacon"are literal tokens
Parsing example
Let’s parse a string that exercises recursion and alternatives:
really really crispy bacon with toast on the side
Step 1: Identify the top-level rule
The full sentence matches:
breakfast -> protein "with" breakfast "on the side"
So the root of the parse tree is breakfast.
Step 2: Expand the left protein
protein -> crispiness "crispy" "bacon"
Now expand crispiness (note the recursion):
crispiness -> "really" crispiness
crispiness -> "really"
This produces really really. The subtree:
protein
├─ crispiness
│ ├─ "really"
│ └─ crispiness
│ └─ "really"
├─ "crispy"
└─ "bacon"
Step 3: Expand the nested breakfast
The inner breakfast is simply:
breakfast -> bread
bread -> "toast"
Step 4: Full parse tree
breakfast
├─ protein
│ ├─ crispiness
│ │ ├─ "really"
│ │ └─ crispiness
│ │ └─ "really"
│ ├─ "crispy"
│ └─ "bacon"
├─ "with"
├─ breakfast
│ └─ bread
│ └─ "toast"
└─ "on the side"
CatScript implementation
Here’s how a CatScript for-loop maps to its grammar rule:
Source code:
for (s in strings) {
print(s)
print(s)
}
Grammar rule:
for_statement = 'for', '(', IDENTIFIER, 'in', expression ')',
'{', { statement }, '}';
| Code | Grammar element |
|---|---|
for |
'for' terminal |
( |
'(' terminal |
s |
IDENTIFIER |
in |
'in' terminal |
strings |
expression |
) |
')' terminal |
{ |
'{' terminal |
print(s) |
statement (first) |
print(s) |
statement (second, via { statement } repetition) |
} |
'}' terminal |
The full CatScript grammar can be found in CatScript Overview.