Regular Expressions

A primer on regex for compiler construction

Why learn regular expressions

We don’t actually use regular expressions for the CatScript compiler, but many compilers do use them, so it’s useful to know.

They’re terse (maybe not a good thing)
From Automata Theory (finite state machines)
Tools exist to generate lexical scanners from regex patterns

“Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”

Jamie Zawinski

Limitations

Regular expressions are non-recursive (no stack/memory) and cannot recognize certain languages:

Balanced parentheses - no way to count an arbitrary number of opening parens and match them with closing parens
Nested structures - HTML, JSON, programming languages with nested blocks

A push-down automaton (parser with a stack) can recognize these languages, which is why we need a separate parsing phase.

Tools

Useful tools for learning and testing regular expressions:

regex101.com - Interactive regex tester with explanation
ihateregex.io - Common regex patterns with visual explanations

Literal characters

Most characters match themselves literally.

Pattern	Matches	Example input	Result
`cat`	The literal string “cat”	`"the cat sat"`	`cat`
`123`	The literal string “123”	`"order 123"`	`123`
`hello`	The literal string “hello”	`"say hello"`	`hello`

Special characters

These characters have special meaning and must be escaped with \ to match literally.

Character	Meaning	To match literally
`.`	Any single character	`\.`
`*`	Zero or more	`\*`
`+`	One or more	`\+`
`?`	Zero or one	`\?`
`^`	Start of string	`\^`
`$`	End of string	`\$`
`\\|`	Alternation (or)	`\\\|`
`()`	Grouping	``
`[]`	Character class	`\[\]`

Examples:

Pattern	Matches	Example
`a.b`	“a”, any char, “b”	`acb`, `a1b`, `a b`
`a\.b`	Literal “a.b”	`a.b`
`\$100`	Literal “$100”	`$100`

Character classes

Match any single character from a set.

Pattern	Matches
`[abc]`	Any of: a, b, or c
`[a-z]`	Any lowercase letter
`[A-Z]`	Any uppercase letter
`[0-9]`	Any digit
`[a-zA-Z]`	Any letter
`[^abc]`	Any character except a, b, or c
`[a-zA-Z_]`	Letters or underscore

Shorthand classes:

Pattern	Equivalent	Matches
`\d`	`[0-9]`	Any digit
`\D`	`[^0-9]`	Any non-digit
`\w`	`[a-zA-Z0-9_]`	Word character
`\W`	`[^a-zA-Z0-9_]`	Non-word character
`\s`	`[ \t\n\r]`	Whitespace
`\S`	`[^ \t\n\r]`	Non-whitespace

Examples:

Pattern	Matches
`[aeiou]`	Any vowel: `a`, `e`, `i`, `o`, `u`
`[0-9][0-9]`	Two-digit number: `42`, `99`
`\d\d\d-\d\d\d\d`	Phone format: `555-1234`

Anchors

Match positions, not characters.

Pattern	Matches
`^`	Start of string/line
`$`	End of string/line
`\b`	Word boundary
`\B`	Non-word boundary

Examples:

Pattern	Input	Matches?
`^hello`	`"hello world"`	Yes - starts with “hello”
`^hello`	`"say hello"`	No - doesn’t start with “hello”
`world$`	`"hello world"`	Yes - ends with “world”
`\bcat\b`	`"the cat sat"`	Yes - “cat” as whole word
`\bcat\b`	`"category"`	No - “cat” is part of word

Optional

The ? quantifier matches zero or one occurrence.

Pattern	Matches
`colou?r`	`color` or `colour`
`https?`	`http` or `https`
`-?\d+`	Optional negative sign: `42` or `-42`

Examples:

Pattern	Matches	Doesn’t match
`Nov(ember)?`	`Nov`, `November`	`Novem`
`\d{3}-?\d{4}`	`555-1234`, `5551234`	`555--1234`

Repetition

Quantifiers specify how many times a pattern repeats.

Pattern	Meaning
`*`	Zero or more
`+`	One or more
`{n}`	Exactly n times
`{n,}`	n or more times
`{n,m}`	Between n and m times

Greedy vs Non-Greedy:

By default, quantifiers are greedy (match as much as possible). Add ? for non-greedy (match as little as possible).

Pattern	Behavior	Input `"<b>bold</b>"`
`<.*>`	Greedy	`<b>bold</b>` (entire string)
`<.*?>`	Non-greedy	`<b>` (first tag only)

Examples:

Pattern	Matches
`a+`	One or more a’s: `a`, `aa`, `aaa`
`\d{3}`	Exactly 3 digits: `123`, `999`
`\d{2,4}`	2 to 4 digits: `12`, `123`, `1234`
`[a-z]+`	One or more lowercase letters

Grouping

Parentheses group patterns together and capture matches.

Pattern	Purpose
`(abc)`	Group and capture “abc”
`(?:abc)`	Group without capturing
`(a\\|b)`	Match “a” or “b”

Examples:

Pattern	Matches
`(ha)+`	`ha`, `haha`, `hahaha`
`(cat\\|dog)`	`cat` or `dog`
`(\d{3})-(\d{4})`	Phone number with captured area code and number
`(?:Mr\\|Mrs\\|Ms)\.?\s`	Title prefix without capturing

Backreferences

Reference previously captured groups.

Pattern	Meaning
`\1`	Match same text as first capture group
`\2`	Match same text as second capture group

Examples:

Pattern	Matches	Explanation
`(\w+)\s+\1`	`the the`, `is is`	Repeated words
`(['"])(.*?)\1`	`"hello"`, `'world'`	Quoted strings (matching quotes)
`<(\w+)>.*?</\1>`	`<b>text</b>`	Matching HTML tags

Use case: Finding duplicate words in text:

\b(\w+)\s+\1\b

This matches “the the”, “is is”, etc. - useful for catching typos.