Compilers

Regular Expressions

A primer on regex for compiler construction

Why learn regular expressions

We don’t actually use regular expressions for the CatScript compiler, but many compilers do use them, so it’s useful to know.

  • They’re terse (maybe not a good thing)
  • From Automata Theory (finite state machines)
  • Tools exist to generate lexical scanners from regex patterns

“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”

  • Jamie Zawinski

Limitations

Regular expressions are non-recursive (no stack/memory) and cannot recognize certain languages:

  • Balanced parentheses - no way to count an arbitrary number of opening parens and match them with closing parens
  • Nested structures - HTML, JSON, programming languages with nested blocks

A push-down automaton (parser with a stack) can recognize these languages, which is why we need a separate parsing phase.


Tools

Useful tools for learning and testing regular expressions:


Literal characters

Most characters match themselves literally.

Pattern Matches Example input Result
cat The literal string “cat” "the cat sat" cat
123 The literal string “123” "order 123" 123
hello The literal string “hello” "say hello" hello

Special characters

These characters have special meaning and must be escaped with \ to match literally.

Character Meaning To match literally
. Any single character \.
* Zero or more \*
+ One or more \+
? Zero or one \?
^ Start of string \^
$ End of string \$
\| Alternation (or) \\|
() Grouping \(\)
[] Character class \[\]

Examples:

Pattern Matches Example
a.b “a”, any char, “b” acb, a1b, a b
a\.b Literal “a.b” a.b
\$100 Literal “$100” $100

Character classes

Match any single character from a set.

Pattern Matches
[abc] Any of: a, b, or c
[a-z] Any lowercase letter
[A-Z] Any uppercase letter
[0-9] Any digit
[a-zA-Z] Any letter
[^abc] Any character except a, b, or c
[a-zA-Z_] Letters or underscore

Shorthand classes:

Pattern Equivalent Matches
\d [0-9] Any digit
\D [^0-9] Any non-digit
\w [a-zA-Z0-9_] Word character
\W [^a-zA-Z0-9_] Non-word character
\s [ \t\n\r] Whitespace
\S [^ \t\n\r] Non-whitespace

Examples:

Pattern Matches
[aeiou] Any vowel: a, e, i, o, u
[0-9][0-9] Two-digit number: 42, 99
\d\d\d-\d\d\d\d Phone format: 555-1234

Anchors

Match positions, not characters.

Pattern Matches
^ Start of string/line
$ End of string/line
\b Word boundary
\B Non-word boundary

Examples:

Pattern Input Matches?
^hello "hello world" Yes - starts with “hello”
^hello "say hello" No - doesn’t start with “hello”
world$ "hello world" Yes - ends with “world”
\bcat\b "the cat sat" Yes - “cat” as whole word
\bcat\b "category" No - “cat” is part of word

Optional

The ? quantifier matches zero or one occurrence.

Pattern Matches
colou?r color or colour
https? http or https
-?\d+ Optional negative sign: 42 or -42

Examples:

Pattern Matches Doesn’t match
Nov(ember)? Nov, November Novem
\d{3}-?\d{4} 555-1234, 5551234 555--1234

Repetition

Quantifiers specify how many times a pattern repeats.

Pattern Meaning
* Zero or more
+ One or more
{n} Exactly n times
{n,} n or more times
{n,m} Between n and m times

Greedy vs Non-Greedy:

By default, quantifiers are greedy (match as much as possible). Add ? for non-greedy (match as little as possible).

Pattern Behavior Input "<b>bold</b>"
<.*> Greedy <b>bold</b> (entire string)
<.*?> Non-greedy <b> (first tag only)

Examples:

Pattern Matches
a+ One or more a’s: a, aa, aaa
\d{3} Exactly 3 digits: 123, 999
\d{2,4} 2 to 4 digits: 12, 123, 1234
[a-z]+ One or more lowercase letters

Grouping

Parentheses group patterns together and capture matches.

Pattern Purpose
(abc) Group and capture “abc”
(?:abc) Group without capturing
(a\|b) Match “a” or “b”

Examples:

Pattern Matches
(ha)+ ha, haha, hahaha
(cat\|dog) cat or dog
(\d{3})-(\d{4}) Phone number with captured area code and number
(?:Mr\|Mrs\|Ms)\.?\s Title prefix without capturing

Backreferences

Reference previously captured groups.

Pattern Meaning
\1 Match same text as first capture group
\2 Match same text as second capture group

Examples:

Pattern Matches Explanation
(\w+)\s+\1 the the, is is Repeated words
(['"])(.*?)\1 "hello", 'world' Quoted strings (matching quotes)
<(\w+)>.*?</\1> <b>text</b> Matching HTML tags

Use case: Finding duplicate words in text:

\b(\w+)\s+\1\b

This matches “the the”, “is is”, etc. - useful for catching typos.