Tokenization
Turning source code into tokens
Core concepts
Lexeme - The smallest sequence of characters that represents something in the language. This is the raw character data.
Token - A lexeme combined with additional metadata:
- Type - What kind of token (identifier, operator, keyword, etc.)
- Literal values - String literals like "foo"
- Symbolic values - Variables and identifiers
- Location information - Line number, line offset, absolute offset, absolute end
  - Enables good error messages when something goes wrong
  - Lets IDEs underline syntax errors with squiggles
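To make the payoff concrete, here is a small hypothetical sketch (not CatScript code; the class and method names are invented for illustration) of how line and offset metadata can be turned into a readable error message with a caret under the offending character:

```java
// Hypothetical helper showing how location metadata (line number and
// line offset) supports readable error messages. Illustrative only.
public class ErrorReporter {
    static String report(String lineContent, int line, int lineOffset, String message) {
        StringBuilder sb = new StringBuilder();
        sb.append("line ").append(line).append(": ").append(message).append('\n');
        sb.append(lineContent).append('\n');
        // Pad up to the offending column, then mark it with a caret
        for (int i = 0; i < lineOffset; i++) sb.append(' ');
        sb.append('^');
        return sb.toString();
    }
}
```

For example, `report("x = $", 1, 4, "Unexpected character")` prints the source line with a caret under the `$`.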
The tokenization loop
At its core, tokenization is a loop:
while (more tokens)
    examine the next character to determine lexeme type
    scan until lexeme ends
    add the token + its location information
Regular expressions
Regular expressions are often used for tokenization because they can be fed to a scanner generator (a tool such as Lex or Flex), which produces a lexical scanner automatically.
[a-zA-Z_][a-zA-Z_0-9]*
This pattern matches variable names (identifiers): a letter or underscore, followed by any number of letters, underscores, or digits. Regular languages correspond to finite automata, and the lexing phase of compiler construction deals with exactly this class of languages.
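As a quick sketch, the identifier pattern above can be tested directly with java.util.regex (a generator would instead compile many such patterns into a full scanner; the class name here is invented for illustration):

```java
import java.util.regex.Pattern;

// Minimal sketch: checking whole strings against the identifier
// pattern from the text using the standard regex library.
public class IdentifierPattern {
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z_][a-zA-Z_0-9]*");

    static boolean isIdentifier(String s) {
        // matches() requires the whole string to fit the pattern
        return IDENTIFIER.matcher(s).matches();
    }
}
```

Here `isIdentifier("foo")` and `isIdentifier("_x9")` are true, while `isIdentifier("9lives")` is false, because an identifier cannot start with a digit.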
Hand-written tokenizers
In CatScript, the tokenizer and parser are built from scratch without using regular expression generators.
Benefits:
- Easy to read
- Easy to debug
The main scanning loop:
List<Token> scanTokens() {
    while (!isAtEnd()) {
        // At beginning of the next lexeme
        start = current;
        scanToken();
    }
    tokens.add(new Token(EOF, "", null, line));
    return tokens;
}
While not at the end of the string, scan the next token. Then add an EOF (sentinel) token and return.
The scanToken method examines each character:
private void scanToken() {
    char c = advance();
    switch (c) {
        case '(': addToken(LEFT_PAREN); break;
        // other cases omitted for brevity
    }
}
If a token is not recognized, we have a lexical error - one of the three classes of errors in programming languages (lexical, grammatical, and semantic):

default:
    Lox.error(line, "Unexpected character.");
    break;
Lookahead
Lookahead is the technique of examining upcoming characters without consuming them. It’s necessary because many tokens share common prefixes:
| Input | Possible tokens |
|---|---|
| ! | ! (not) or != (not equal) |
| / | / (divide) or // (comment) |
| < | < (less than) or <= (less or equal) |
| = | = (assign) or == (equals) |
Without lookahead, the tokenizer couldn’t distinguish between these. When we see !, we must peek at the next character to decide which token to produce.
Common tokenizer methods
Three methods form the core of most hand-written tokenizers:
advance()
Consumes and returns the current character, then moves forward:
private char advance() {
    return src.charAt(current++);
}
Use advance() when you’re ready to commit to consuming a character.
peek()
Returns the current character without consuming it:
private char peek() {
    if (isAtEnd()) return '\0';
    return src.charAt(current);
}
Use peek() to look ahead and make decisions without moving the position.
match()
Conditionally consumes a character if it matches the expected value:
private boolean match(char expected) {
    if (isAtEnd()) return false;
    if (src.charAt(current) != expected) return false;
    current++;
    return true;
}
Use match() for lookahead with conditional consumption - it combines peeking and advancing.
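The three methods can be exercised together in a small standalone sketch. The method names mirror the ones above, but this is illustrative code, not the CatScript tokenizer itself:

```java
// A minimal cursor over a source string showing how advance(), peek(),
// and match() cooperate. Standalone sketch for illustration only.
public class Cursor {
    private final String src;
    private int current = 0;

    Cursor(String src) { this.src = src; }

    boolean isAtEnd() { return current >= src.length(); }

    // Consume: return the current character and move forward
    char advance() { return src.charAt(current++); }

    // Look only: return the current character without moving
    char peek() { return isAtEnd() ? '\0' : src.charAt(current); }

    // Conditional consume: advance only if the character matches
    boolean match(char expected) {
        if (isAtEnd() || src.charAt(current) != expected) return false;
        current++;
        return true;
    }
}
```

On the input "!=", advance() returns '!', then match('=') succeeds and consumes the '=', so the pair is tokenized as a single not-equal lexeme.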
Handling compound lexemes
When ! is encountered, how do we know if it’s not vs != (not equal)? We use lookahead with match():
case '!':
    addToken(match('=') ? BANG_EQUAL : BANG);
    break;
Comments require similar lookahead:
case '/':
    if (match('/')) {
        // Comment goes until end of line
        while (peek() != '\n' && !isAtEnd()) advance();
    } else {
        addToken(SLASH);
    }
    break;
If we encounter a slash, we check whether it is followed by another slash. If so, the lexeme is a comment, and we advance until the end of the line. We use peek() rather than match() for the newline because the \n is not part of the comment; it is left unconsumed for the main loop to handle.
Although whitespace has no semantic meaning in CatScript, we must still do bookkeeping when we encounter it so that token offsets stay correct.
Testing tokenization
What is an identifier?
An identifier is a variable name like pet or i - something that is not a keyword.
@Test
public void identifierTokenization() {
    assertAll(
        () -> assertTokensAre("asdf", IDENTIFIER, EOF),
        () -> assertTokensAre("asdf", "asdf", "<EOF>"),
        () -> assertTokensAre("asdf asdf", IDENTIFIER, IDENTIFIER, EOF),
        () -> assertTokensAre("asdf asdf", "asdf", "asdf", "<EOF>")
    );
}
The first assert says: given the string "asdf", we should get back an IDENTIFIER token and an EOF token.
The EOF token is a sentinel that marks the end of the input.
The second assert tests the string values of the tokens, while the first tests their types.
Implementation infrastructure
The Token class
What is a Token? A lexeme with additional information:
package csci468.tokenizer;

public class Token {
    int start;
    int end;
    int line;
    int lineOffset;
    String stringValue;
    TokenType type;
    private final ICatScriptTokenizer tokenizer;

    public Token(int start, int end, int line, int lineOffset,
                 String stringValue, TokenType type, ICatScriptTokenizer tokenizer) {
        this.start = start;
        this.end = end;
        this.line = line;
        this.lineOffset = lineOffset;
        this.stringValue = stringValue;
        this.type = type;
        this.tokenizer = tokenizer;
    }

    public int getStart() { return start; }
    public int getEnd() { return end; }
    public int getLine() { return line; }
    public int getLineOffset() { return lineOffset; }
    public String getStringValue() { return stringValue; }
    public TokenType getType() { return type; }

    @Override
    public String toString() {
        return "Token(\"" + stringValue + "\"){" +
                "type=" + type +
                ", start=" + start +
                ", end=" + end +
                ", line=" + line +
                ", offset=" + lineOffset +
                '}';
    }

    public String getLineContent() {
        String[] lines = tokenizer.getSource().split("\n");
        return line - 1 >= lines.length ? "" : lines[line - 1];
    }
}
The tokenize function
What does our tokenize function look like?
public List<Token> tokenize() {
    consumeWhitespace();
    while (!tokenizationEnd()) {
        scanToken();
        consumeWhitespace();
    }
    addToken(EOF, "<EOF>", getCurrentPosition());
    return tokenList;
}
This list of tokens is what gets fed into our parser.
Handling whitespace
Let’s start by looking at a test:
@Test
public void onlyWhitespace() {
    assertAll(
        () -> assertTokensAre(" ", EOF),
        () -> assertTokensAre("\n\n\n", EOF),
        () -> assertTokensAre("\t\t\t", EOF),
        () -> assertTokensAre(" \n \t \r ", EOF)
    );
}
If we tokenize something that is all whitespace, we should only end up with an EOF token.
Referring back to consumeWhitespace in our tokenizer function:
private void consumeWhitespace() {
    while (!tokenizationEnd()) {
        char c = peek();
        if (c == ' ' || c == '\r' || c == '\t') {
            position += 1;
            continue;
        } else if (c == '\n') {
            // Newlines get their own branch; this is also where
            // line-number bookkeeping belongs
            position += 1;
            continue;
        }
        break;
    }
}
The peek function
What does peek() do?
private char peek() {
    return peek(0);
}

private char peek(int offset) {
    if (getCurrentPosition() + offset >= src.length()) return '\0';
    return src.charAt(getCurrentPosition() + offset);
}
- Get the current position
- If currentPosition + offset is greater than or equal to the source length, return '\0' (the null character) - this handles peeking past the end of the source
- Otherwise, call charAt to get the character at that position
If we say charAt(0), we’re asking what’s at the front of the string we’re looking at.
Going back to consumeWhitespace: if the character c is a space, carriage return, or tab, we increment position and continue. We keep running through the loop to consume all whitespace until we hit a non-whitespace character.
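That loop can be sketched standalone, with an explicit position parameter instead of the tokenizer's internal state (class and method names here are invented for illustration):

```java
// Illustrative sketch of the whitespace-consuming loop: look at the
// current character and step past spaces, carriage returns, tabs,
// and newlines until a non-whitespace character (or the end) is reached.
public class WhitespaceSkipper {
    static int skipWhitespace(String src, int position) {
        while (position < src.length()) {
            char c = src.charAt(position);
            if (c == ' ' || c == '\r' || c == '\t' || c == '\n') {
                position++;   // consume the whitespace character
            } else {
                break;        // first real character: stop here
            }
        }
        return position;
    }
}
```

For example, `skipWhitespace("   x", 0)` returns 3, the index of the x; on an all-whitespace string it returns the length of the string, leaving only the EOF token to emit.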