Compilers

Semantic Analysis

Moving beyond syntax to meaning - error phases, control flow, and optimization

The compiler pipeline so far

Phase Input Output
Tokenizer Raw text List of tokens
Parser List of tokens Parse tree
Semantic analysis Parse tree Validated parse tree (with types and errors)

Semantic analysis asks: does this tree make sense? We’re no longer concerned with lexical or grammatical correctness - we’re evaluating meaning.


Three types of errors

$x                          
var x : boolean 10          
var y : boolean = 1        
Line Error type Phase Why
1 Lexical Tokenizer $ has no meaning in CatScript outside of strings
2 Syntactic Parser Missing = - violates the grammar (require fails)
3 Semantic Validation 1 is not a boolean - lexically and grammatically valid, but meaningless

Errors come from different phases of the compiler. This distinction is fundamental and recurs throughout semantic analysis.


Where semantic analysis happens in CatScript

Unlike the tokenizer and parser which are separate components, semantic analysis code lives inside the parse tree elements themselves. Each parse element (expression or statement) implements a validate method:

// In AdditiveExpression.java
@Override
public void validate(SymbolTable symbolTable) {
    // type checking logic here
}

ParseElement is the base class. Statement and Expression extend it. Concrete classes like ForStatement and AdditiveExpression implement the abstract validate method with their own semantic rules.


Type errors vs. non-type errors

Semantic analysis in statically typed languages like CatScript focuses heavily on type systems, but type checking isn’t everything.


Use-before-declaration

print(x)
var x = 10

This is syntactically valid but semantically incorrect in CatScript - x is used before it’s declared.

In JavaScript with var, this is legal due to variable hoisting - variables are moved to the top of their scope. With let, it’s an error. CatScript follows Java-style semantics: variables must be declared before use.


Control flow analysis - return coverage

function sign(x : int) : string {
    if (x > 0) {
        return "positive"
    }
    if (x < 0) {
        return "negative"
    }
}

There’s no type error here. The problem is that if x equals zero, no value is returned. This is a control flow error - not every path through the function leads to a return statement.

The algorithm for validating return coverage is recursive:

  1. Look at the last statement in the statement list
  2. If it’s a return statement - covered
  3. If it’s an if statement - recurse on both the true and else branches (both must have return coverage)
  4. If it’s a for or while loop - recurse on its body
  5. Otherwise - not covered

Void functions don’t need return coverage because they can fall off the end naturally.


Dead code elimination

Semantic analysis can also be used for optimization:

if (DEBUG) {
    logExpensiveState();
}

If DEBUG is a compile-time constant known to be false, the entire block can be eliminated. The compiler evaluates the condition at compile time and drops the unreachable code.

Similarly, Java optimizes constant expressions:

var x = 1 + 2;  // javac compiles this as: var x = 3;

No addition happens at runtime - the compiler does it during semantic analysis.


Other semantic errors in Java

Error Description
Variable might not have been initialized Control flow analysis - all paths between declaration and use must assign a value
Unhandled exception Checked exceptions must be caught or declared thrown
Checked vs. runtime exceptions SQLException (checked) must be handled; RuntimeException subclasses don’t

The 80/20 rule

Semantic analysis involves design decisions. Every language makes different choices:

  • JavaScript: no use-before-declaration error, missing returns produce undefined
  • Rust: allows variable shadowing
  • Haskell: very strict type systems, aim for zero type errors
  • CatScript: standard static typing, not sound but practical

The general principle: don’t overcook it. Catch the most impactful errors without making the type system so complex that it becomes a burden.


Summary

Concept Key point
Three error phases Lexical (tokenizer), syntactic (parser), semantic (validation)
Validate methods Semantic analysis is implemented on parse tree elements, not in a separate visitor
Type errors Detected by checking type assignability
Control flow errors Return coverage, uninitialized variables - not about types
Dead code elimination Compile-time evaluation of constants for optimization
Design decisions Every language makes different semantic choices - there’s no single “right” answer