Regular Expressions
A primer on regex for compiler construction
Why learn regular expressions
We don’t actually use regular expressions for the CatScript compiler, but many compilers do use them, so it’s useful to know.
- They’re terse (maybe not a good thing)
- From Automata Theory (finite state machines)
- Tools exist to generate lexical scanners from regex patterns
“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”
- Jamie Zawinski
Limitations
Regular expressions are non-recursive (no stack/memory) and cannot recognize certain languages:
- Balanced parentheses - no way to count an arbitrary number of opening parens and match them with closing parens
- Nested structures - HTML, JSON, programming languages with nested blocks
A push-down automaton (parser with a stack) can recognize these languages, which is why we need a separate parsing phase.
Tools
Useful tools for learning and testing regular expressions:
- regex101.com - Interactive regex tester with explanation
- ihateregex.io - Common regex patterns with visual explanations
Literal characters
Most characters match themselves literally.
| Pattern | Matches | Example input | Result |
|---|---|---|---|
cat |
The literal string “cat” | "the cat sat" |
cat |
123 |
The literal string “123” | "order 123" |
123 |
hello |
The literal string “hello” | "say hello" |
hello |
Special characters
These characters have special meaning and must be escaped with \ to match literally.
| Character | Meaning | To match literally |
|---|---|---|
. |
Any single character | \. |
* |
Zero or more | \* |
+ |
One or more | \+ |
? |
Zero or one | \? |
^ |
Start of string | \^ |
$ |
End of string | \$ |
\| |
Alternation (or) | \\| |
() |
Grouping | \(\) |
[] |
Character class | \[\] |
Examples:
| Pattern | Matches | Example |
|---|---|---|
a.b |
“a”, any char, “b” | acb, a1b, a b |
a\.b |
Literal “a.b” | a.b |
\$100 |
Literal “$100” | $100 |
Character classes
Match any single character from a set.
| Pattern | Matches |
|---|---|
[abc] |
Any of: a, b, or c |
[a-z] |
Any lowercase letter |
[A-Z] |
Any uppercase letter |
[0-9] |
Any digit |
[a-zA-Z] |
Any letter |
[^abc] |
Any character except a, b, or c |
[a-zA-Z_] |
Letters or underscore |
Shorthand classes:
| Pattern | Equivalent | Matches |
|---|---|---|
\d |
[0-9] |
Any digit |
\D |
[^0-9] |
Any non-digit |
\w |
[a-zA-Z0-9_] |
Word character |
\W |
[^a-zA-Z0-9_] |
Non-word character |
\s |
[ \t\n\r] |
Whitespace |
\S |
[^ \t\n\r] |
Non-whitespace |
Examples:
| Pattern | Matches |
|---|---|
[aeiou] |
Any vowel: a, e, i, o, u |
[0-9][0-9] |
Two-digit number: 42, 99 |
\d\d\d-\d\d\d\d |
Phone format: 555-1234 |
Anchors
Match positions, not characters.
| Pattern | Matches |
|---|---|
^ |
Start of string/line |
$ |
End of string/line |
\b |
Word boundary |
\B |
Non-word boundary |
Examples:
| Pattern | Input | Matches? |
|---|---|---|
^hello |
"hello world" |
Yes - starts with “hello” |
^hello |
"say hello" |
No - doesn’t start with “hello” |
world$ |
"hello world" |
Yes - ends with “world” |
\bcat\b |
"the cat sat" |
Yes - “cat” as whole word |
\bcat\b |
"category" |
No - “cat” is part of word |
Optional
The ? quantifier matches zero or one occurrence.
| Pattern | Matches |
|---|---|
colou?r |
color or colour |
https? |
http or https |
-?\d+ |
Optional negative sign: 42 or -42 |
Examples:
| Pattern | Matches | Doesn’t match |
|---|---|---|
Nov(ember)? |
Nov, November |
Novem |
\d{3}-?\d{4} |
555-1234, 5551234 |
555--1234 |
Repetition
Quantifiers specify how many times a pattern repeats.
| Pattern | Meaning |
|---|---|
* |
Zero or more |
+ |
One or more |
{n} |
Exactly n times |
{n,} |
n or more times |
{n,m} |
Between n and m times |
Greedy vs Non-Greedy:
By default, quantifiers are greedy (match as much as possible). Add ? for non-greedy (match as little as possible).
| Pattern | Behavior | Input "<b>bold</b>" |
|---|---|---|
<.*> |
Greedy | <b>bold</b> (entire string) |
<.*?> |
Non-greedy | <b> (first tag only) |
Examples:
| Pattern | Matches |
|---|---|
a+ |
One or more a’s: a, aa, aaa |
\d{3} |
Exactly 3 digits: 123, 999 |
\d{2,4} |
2 to 4 digits: 12, 123, 1234 |
[a-z]+ |
One or more lowercase letters |
Grouping
Parentheses group patterns together and capture matches.
| Pattern | Purpose |
|---|---|
(abc) |
Group and capture “abc” |
(?:abc) |
Group without capturing |
(a\|b) |
Match “a” or “b” |
Examples:
| Pattern | Matches |
|---|---|
(ha)+ |
ha, haha, hahaha |
(cat\|dog) |
cat or dog |
(\d{3})-(\d{4}) |
Phone number with captured area code and number |
(?:Mr\|Mrs\|Ms)\.?\s |
Title prefix without capturing |
Backreferences
Reference previously captured groups.
| Pattern | Meaning |
|---|---|
\1 |
Match same text as first capture group |
\2 |
Match same text as second capture group |
Examples:
| Pattern | Matches | Explanation |
|---|---|---|
(\w+)\s+\1 |
the the, is is |
Repeated words |
(['"])(.*?)\1 |
"hello", 'world' |
Quoted strings (matching quotes) |
<(\w+)>.*?</\1> |
<b>text</b> |
Matching HTML tags |
Use case: Finding duplicate words in text:
\b(\w+)\s+\1\b
This matches “the the”, “is is”, etc. - useful for catching typos.