Tokenization
Turning source code into tokens
Core concepts
Lexeme - The smallest sequence of characters that represents something in the language. This is the raw character data.
Token - A lexeme combined with additional metadata:
- Type - What kind of token (identifier, operator, keyword, etc.)
- Literal values - String literals like "foo"
- Symbolic values - Variables and identifiers
- Location information - Line number, line offset, absolute offset, absolute end
  - Enables good error messages when something goes wrong
  - Lets IDEs underline syntax errors with squiggles
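To make the payoff concrete, here is a small hypothetical sketch (not CatScript code; the class and method names are invented for illustration) of how line and offset metadata can be turned into a readable error message with a caret under the offending character:

```java
// Hypothetical helper showing how location metadata (line number and
// line offset) supports readable error messages. Illustrative only.
public class ErrorReporter {
    static String report(String lineContent, int line, int lineOffset, String message) {
        StringBuilder sb = new StringBuilder();
        sb.append("line ").append(line).append(": ").append(message).append('\n');
        sb.append(lineContent).append('\n');
        // Pad up to the offending column, then mark it with a caret
        for (int i = 0; i < lineOffset; i++) sb.append(' ');
        sb.append('^');
        return sb.toString();
    }
}
```

For example, `report("x = $", 1, 4, "Unexpected character")` prints the source line with a caret under the `$`.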
The tokenization loop
At its core, tokenization is a loop:
while (more tokens)
    examine the next character to determine lexeme type
    scan until lexeme ends
    add the token + its location information
Regular expressions
Regular expressions are often used for tokenization because they can be fed to a scanner generator (a tool such as Lex or Flex), which produces a lexical scanner automatically.
[a-zA-Z_][a-zA-Z_0-9]*
This pattern matches variable names (identifiers): a letter or underscore, followed by any number of letters, underscores, or digits. Regular languages correspond to finite automata, and the lexing phase of compiler construction deals with exactly this class of languages.
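As a quick sketch, the identifier pattern above can be tested directly with java.util.regex (a generator would instead compile many such patterns into a full scanner; the class name here is invented for illustration):

```java
import java.util.regex.Pattern;

// Minimal sketch: checking whole strings against the identifier
// pattern from the text using the standard regex library.
public class IdentifierPattern {
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z_][a-zA-Z_0-9]*");

    static boolean isIdentifier(String s) {
        // matches() requires the whole string to fit the pattern
        return IDENTIFIER.matcher(s).matches();
    }
}
```

Here `isIdentifier("foo")` and `isIdentifier("_x9")` are true, while `isIdentifier("9lives")` is false, because an identifier cannot start with a digit.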
Hand-written tokenizers
In CatScript, the tokenizer and parser are built from scratch without using regular expression generators.
Benefits:
- Easy to read
- Easy to debug
The main scanning loop:
List<Token> scanTokens() {
    while (!isAtEnd()) {
        // At beginning of the next lexeme
        start = current;
        scanToken();
    }
    tokens.add(new Token(EOF, "", null, line));
    return tokens;
}
While not at the end of the string, scan the next token. Then add an EOF (sentinel) token and return.
The scanToken method examines each character:
private void scanToken() {
    char c = advance();
    switch (c) {
        case '(': addToken(LEFT_PAREN); break;
        // other cases omitted for brevity
    }
}
If a token is not recognized, we have a lexical error - one of the three classes of errors in programming languages (lexical, grammatical, and semantic):

default:
    Lox.error(line, "Unexpected character.");
    break;
Lookahead
Lookahead is the technique of examining upcoming characters without consuming them. It’s necessary because many tokens share common prefixes:
| Input | Possible tokens |
|---|---|
| ! | ! (not) or != (not equal) |
| / | / (divide) or // (comment) |
| < | < (less than) or <= (less or equal) |
| = | = (assign) or == (equals) |
Without lookahead, the tokenizer couldn’t distinguish between these. When we see !, we must peek at the next character to decide which token to produce.
Common tokenizer methods
Three methods form the core of most hand-written tokenizers:
advance()
Consumes and returns the current character, then moves forward:
private char advance() {
    return src.charAt(current++);
}
Use advance() when you’re ready to commit to consuming a character.
peek()
Returns the current character without consuming it:
private char peek() {
    if (isAtEnd()) return '\0';
    return src.charAt(current);
}
Use peek() to look ahead and make decisions without moving the position.
match()
Conditionally consumes a character if it matches the expected value:
private boolean match(char expected) {
    if (isAtEnd()) return false;
    if (src.charAt(current) != expected) return false;
    current++;
    return true;
}
Use match() for lookahead with conditional consumption - it combines peeking and advancing.
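The three methods can be exercised together in a small standalone sketch. The method names mirror the ones above, but this is illustrative code, not the CatScript tokenizer itself:

```java
// A minimal cursor over a source string showing how advance(), peek(),
// and match() cooperate. Standalone sketch for illustration only.
public class Cursor {
    private final String src;
    private int current = 0;

    Cursor(String src) { this.src = src; }

    boolean isAtEnd() { return current >= src.length(); }

    // Consume: return the current character and move forward
    char advance() { return src.charAt(current++); }

    // Look only: return the current character without moving
    char peek() { return isAtEnd() ? '\0' : src.charAt(current); }

    // Conditional consume: advance only if the character matches
    boolean match(char expected) {
        if (isAtEnd() || src.charAt(current) != expected) return false;
        current++;
        return true;
    }
}
```

On the input "!=", advance() returns '!', then match('=') succeeds and consumes the '=', so the pair is tokenized as a single not-equal lexeme.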
Handling compound lexemes
When ! is encountered, how do we know if it’s not vs != (not equal)? We use lookahead with match():
case '!':
    addToken(match('=') ? BANG_EQUAL : BANG);
    break;
Comments require similar lookahead:
case '/':
    if (match('/')) {
        // Comment goes until end of line
        while (peek() != '\n' && !isAtEnd()) advance();
    } else {
        addToken(SLASH);
    }
    break;
If we encounter a slash, we check whether it is followed by another slash. If so, the lexeme is a comment, and we advance until the end of the line. We use peek() rather than match() for the newline because the \n is not part of the comment; it is left unconsumed for the main loop to handle.
Although whitespace has no semantic meaning in CatScript, we must still do bookkeeping when we encounter it so that token offsets stay correct.
Testing tokenization
What is an identifier?
An identifier is a variable name like pet or i - something that is not a keyword.
@Test
public void identifierTokenization() {
    assertAll(
        () -> assertTokensAre("asdf", IDENTIFIER, EOF),
        () -> assertTokensAre("asdf", "asdf", "<EOF>"),
        () -> assertTokensAre("asdf asdf", IDENTIFIER, IDENTIFIER, EOF),
        () -> assertTokensAre("asdf asdf", "asdf", "asdf", "<EOF>")
    );
}
The first assert says: given the string "asdf", we should get back an IDENTIFIER token and an EOF token.
The EOF token is a sentinel that marks the end of the input.
The second assert tests the string values of the tokens, while the first tests their types.
Implementation infrastructure
The Token class
What is a Token? A lexeme with additional information:
package csci468.tokenizer;

public class Token {
    int start;
    int end;
    int line;
    int lineOffset;
    String stringValue;
    TokenType type;
    private final ICatScriptTokenizer tokenizer;

    public Token(int start, int end, int line, int lineOffset,
                 String stringValue, TokenType type, ICatScriptTokenizer tokenizer) {
        this.start = start;
        this.end = end;
        this.line = line;
        this.lineOffset = lineOffset;
        this.stringValue = stringValue;
        this.type = type;
        this.tokenizer = tokenizer;
    }

    public int getStart() { return start; }
    public int getEnd() { return end; }
    public int getLine() { return line; }
    public int getLineOffset() { return lineOffset; }
    public String getStringValue() { return stringValue; }
    public TokenType getType() { return type; }

    @Override
    public String toString() {
        return "Token(\"" + stringValue + "\"){" +
                "type=" + type +
                ", start=" + start +
                ", end=" + end +
                ", line=" + line +
                ", offset=" + lineOffset +
                '}';
    }

    public String getLineContent() {
        String[] lines = tokenizer.getSource().split("\n");
        return line - 1 >= lines.length ? "" : lines[line - 1];
    }
}
The tokenize function
What does our tokenize function look like?
public List<Token> tokenize() {
    consumeWhitespace();
    while (!tokenizationEnd()) {
        scanToken();
        consumeWhitespace();
    }
    addToken(EOF, "<EOF>", getCurrentPosition());
    return tokenList;
}
This list of tokens is what gets fed into our parser.
Handling whitespace
Let’s start by looking at a test:
@Test
public void onlyWhitespace() {
    assertAll(
        () -> assertTokensAre(" ", EOF),
        () -> assertTokensAre("\n\n\n", EOF),
        () -> assertTokensAre("\t\t\t", EOF),
        () -> assertTokensAre(" \n \t \r ", EOF)
    );
}
If we tokenize something that is all whitespace, we should only end up with an EOF token.
Referring back to consumeWhitespace in our tokenizer function:
private void consumeWhitespace() {
    while (!tokenizationEnd()) {
        char c = peek();
        if (c == ' ' || c == '\r' || c == '\t') {
            position += 1;
            continue;
        } else if (c == '\n') {
            // Newlines get their own branch; this is also where
            // line-number bookkeeping belongs
            position += 1;
            continue;
        }
        break;
    }
}
The peek function
What does peek() do?
private char peek() {
    return peek(0);
}

private char peek(int offset) {
    if (getCurrentPosition() + offset >= src.length()) return '\0';
    return src.charAt(getCurrentPosition() + offset);
}
- Get the current position
- If currentPosition + offset is greater than or equal to the source length, return '\0' (the null character) - this handles peeking past the end of the source
- Otherwise, call charAt to get the character at that position
If we say charAt(0), we’re asking what’s at the front of the string we’re looking at.
Going back to consumeWhitespace: if the character c is a space, carriage return, or tab, we increment position and continue. We keep running through the loop to consume all whitespace until we hit a non-whitespace character.
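That loop can be sketched standalone, with an explicit position parameter instead of the tokenizer's internal state (class and method names here are invented for illustration):

```java
// Illustrative sketch of the whitespace-consuming loop: look at the
// current character and step past spaces, carriage returns, tabs,
// and newlines until a non-whitespace character (or the end) is reached.
public class WhitespaceSkipper {
    static int skipWhitespace(String src, int position) {
        while (position < src.length()) {
            char c = src.charAt(position);
            if (c == ' ' || c == '\r' || c == '\t' || c == '\n') {
                position++;   // consume the whitespace character
            } else {
                break;        // first real character: stop here
            }
        }
        return position;
    }
}
```

For example, `skipWhitespace("   x", 0)` returns 3, the index of the x; on an all-whitespace string it returns the length of the string, leaving only the EOF token to emit.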