Lexical Analysis – Tokens, DFA, LEX Tool and Maximal Munch Explained

Key Takeaways:

  • Lexical analysis (lexing/scanning) is the first phase of a compiler — it converts a stream of characters into a stream of tokens.
  • Each token has a type (keyword, identifier, operator, literal) and a lexeme (the actual matched text).
  • Token patterns are specified as regular expressions and implemented using Deterministic Finite Automata (DFAs).
  • The LEX (or FLEX) tool automatically generates a lexer from a set of regex rules.
  • Lexers strip whitespace and comments (so the parser never sees them) and track line and column numbers for error reporting.
  • The longest-match rule (maximal munch) resolves ambiguity: always match the longest possible token.

1. Role of the Lexer

The lexical analyser (lexer or scanner) is the first component of a compiler to touch the source code. It reads the raw text of the program character by character and groups the characters into meaningful chunks called tokens. The parser (the next phase) never sees raw characters — it works entirely with the token stream produced by the lexer.

Think of it like reading a sentence: before understanding the grammar of “The cat sat on the mat,” you first identify that “The”, “cat”, “sat”, “on”, “the”, “mat” are the words (tokens). The lexer does the same for programs.

What the Lexer Does

  • Groups characters into tokens
  • Strips whitespace and comments (usually)
  • Tracks line and column numbers for error reporting
  • Looks up identifiers in the symbol table (or adds them)
  • Returns (token-type, lexeme) pairs to the parser on demand
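These duties can be made concrete with a minimal hand-written scanner for a toy language. This is an illustrative sketch: the token names, the 64-byte lexeme limit, and the supported operators are our own choices, not from any particular compiler.

```c
#include <ctype.h>
#include <string.h>

/* Illustrative token types for a toy language. */
typedef enum { TOK_EOF, TOK_IDENT, TOK_INT, TOK_PLUS, TOK_ASSIGN, TOK_ERROR } TokType;

typedef struct {
    TokType type;
    char lexeme[64];  /* the matched text */
    int line;         /* tracked for error reporting */
} Token;

/* Return the next (token-type, lexeme) pair, advancing *pos. */
Token next_token(const char *src, int *pos, int *line) {
    Token t = { TOK_EOF, "", 0 };

    /* Strip whitespace, tracking line numbers as we go. */
    while (src[*pos] == ' ' || src[*pos] == '\t' || src[*pos] == '\n') {
        if (src[*pos] == '\n') (*line)++;
        (*pos)++;
    }
    t.line = *line;

    int start = *pos;
    char c = src[*pos];
    if (c == '\0') return t;

    if (isdigit((unsigned char)c)) {                    /* [0-9]+     */
        while (isdigit((unsigned char)src[*pos])) (*pos)++;
        t.type = TOK_INT;
    } else if (isalpha((unsigned char)c) || c == '_') { /* identifier */
        while (isalnum((unsigned char)src[*pos]) || src[*pos] == '_') (*pos)++;
        t.type = TOK_IDENT;
    } else if (c == '+') { (*pos)++; t.type = TOK_PLUS;   }
    else if (c == '=')   { (*pos)++; t.type = TOK_ASSIGN; }
    else                 { (*pos)++; t.type = TOK_ERROR;  }

    int n = *pos - start;
    if (n > 63) n = 63;
    memcpy(t.lexeme, src + start, (size_t)n);
    t.lexeme[n] = '\0';
    return t;
}
```

Each call returns exactly one token, which is how the parser consumes the lexer's output on demand.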

2. Tokens, Patterns, and Lexemes

Three related concepts:

  • Token: A category — a name for a class of lexemes. Examples: IDENTIFIER, INTEGER_LITERAL, PLUS, IF, WHILE.
  • Pattern: The rule (usually a regular expression) that describes which strings belong to a token class.
  • Lexeme: The actual string matched in the source. For token IDENTIFIER, the lexemes might be “count”, “x”, “totalSum”.
Token         Pattern                               Example Lexemes
KEYWORD       if | else | while | for | int | …     if, while, int
IDENTIFIER    [a-zA-Z_][a-zA-Z0-9_]*                count, x1, _tmp
INTEGER_LIT   [0-9]+                                0, 42, 1000
FLOAT_LIT     [0-9]+\.[0-9]+                        3.14, 0.5
STRING_LIT    "[^"]*"                               "hello", ""
PLUS          \+                                    +
ASSIGN        =                                     =
EQ            ==                                    ==
SEMICOLON     ;                                     ;
COMMENT       //[^\n]* (discarded)                  // this is a comment

3. Specifying Tokens with Regular Expressions

Regular expressions are the natural tool for specifying token patterns because token classes are, with rare exceptions (such as nested comments), regular languages. Every regex can be converted to a DFA, which runs in O(n) time on an input of length n — making lexers extremely fast.

Common Regex Constructs in Lexer Specs

Notation    Meaning                                    Example
[abc]       Character class — one of a, b, or c        [0-9] matches any digit
[^abc]      Negated class — any char except a, b, c    [^\n] matches non-newline
r+          One or more of r (r* but not ε)            [0-9]+ matches integers
r?          Zero or one of r (optional)                [0-9]+(\.[0-9]+)? matches int or float
r{n,m}      Between n and m repetitions                [0-9]{2,4} matches 2–4 digits
.           Any character except newline               .* matches any line
\d \w \s    Digit, word char, whitespace               Common in practical tools
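The r+ and r? constructs can be seen working together in a tiny hand-rolled matcher for the int-or-float pattern from the table. The function name and return convention here are our own illustrative choices.

```c
/* Length of the longest prefix of s matching [0-9]+(\.[0-9]+)? ;
   returns 0 if there is no match. */
int match_number(const char *s) {
    int i = 0;
    while (s[i] >= '0' && s[i] <= '9') i++;   /* [0-9]+ : at least one digit */
    if (i == 0) return 0;
    if (s[i] == '.') {                         /* (\.[0-9]+)? : optional part */
        int j = i + 1;
        while (s[j] >= '0' && s[j] <= '9') j++;
        if (j > i + 1) i = j;  /* only take the dot if digits follow it */
    }
    return i;
}
```

Note that on input "3." the optional group fails (no digits after the dot), so only "3" is matched — the same decision a generated lexer's DFA would make.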

4. From Regex to DFA — Building the Lexer

A lexer for multiple token types is built by combining all patterns into a single DFA. The standard pipeline is:

Regex → NFA (Thompson's construction) → DFA (subset construction) → Minimal DFA (state minimisation)

Step 1: Regex → NFA (Thompson’s Construction)

Build an NFA for each token’s regex. Combine them with a new start state that has epsilon transitions to each individual NFA’s start state. The accepting state of each sub-NFA is labelled with the corresponding token type.

Step 2: NFA → DFA (Subset Construction)

Apply the subset construction to get a DFA. Each DFA state is a set of NFA states. DFA accept states are any state that contains an NFA accept state.

Step 3: Minimise the DFA

Apply the table-filling algorithm to merge equivalent states. This produces the smallest DFA that implements the lexer.

Step 4: Generate Transition Table

The final DFA is stored as a two-dimensional transition table: rows = states, columns = input characters (often grouped into character classes). The lexer engine uses this table to drive scanning.
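A minimal sketch of such a table-driven engine, for the single pattern [0-9]+ with just two character classes. The state and class numbering are illustrative; a real generated lexer has one combined table covering every token type.

```c
/* Transition table for the DFA of [0-9]+.
   States: 0 = start, 1 = accepting (one or more digits seen), 2 = dead.
   Character classes: 0 = digit, 1 = anything else. */
enum { N_STATES = 3, N_CLASSES = 2, DEAD = 2 };

static const int trans[N_STATES][N_CLASSES] = {
    /* digit  other */
    {  1,     DEAD },   /* state 0: start     */
    {  1,     DEAD },   /* state 1: accepting */
    {  DEAD,  DEAD },   /* state 2: dead      */
};
static const int accepting[N_STATES] = { 0, 1, 0 };

static int char_class(char c) { return (c >= '0' && c <= '9') ? 0 : 1; }

/* Run the DFA from the start of s; return the length of the longest
   accepted prefix (maximal munch), or 0 if none. */
int scan_int(const char *s) {
    int state = 0, last_accept = 0;
    for (int i = 0; s[i] != '\0'; i++) {
        state = trans[state][char_class(s[i])];
        if (state == DEAD) break;
        if (accepting[state]) last_accept = i + 1;
    }
    return last_accept;
}
```

Grouping characters into classes (here: digit vs other) keeps the table small — 128 or more columns would otherwise be needed per state.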

5. Longest Match and Priority Rules

Maximal Munch (Longest Match)

When multiple tokens could start at the current position, the lexer always takes the longest possible match. This is called the maximal munch rule.

Example: On input “==”, the lexer matches “==” as EQ (equality operator) rather than two separate “=” (ASSIGN) tokens. On input “iffy”, it matches the full identifier “iffy” rather than the keyword “if” followed by identifier “fy”.

Priority Rules

When two patterns match the same-length string, a priority rule decides. In most lexer tools:

  • Keywords have higher priority than identifiers — so “if” is always a KEYWORD, never an IDENTIFIER.
  • Rules are listed in order; the first matching rule wins (used in LEX).

How the Lexer Knows When to Stop

The DFA drives forward through the input. When it reaches a state from which no further transition is possible (or a non-accepting dead state), it backs up to the last position where it was in an accepting state and emits that token.
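The back-up behaviour can be sketched for the ASSIGN/EQ pair alone. This is a deliberately tiny DFA; a real lexer does the same thing over the combined table for all token types.

```c
/* Toy DFA for ASSIGN (=) vs EQ (==), showing back-up to the last
   accepting state. States: 0 = start, 1 accepts ASSIGN, 2 accepts EQ. */
const char *scan_assign_or_eq(const char *s, int *len) {
    int state = 0, i = 0;
    int last_len = 0;            /* position of the last accepting state */
    const char *last_tok = 0;    /* token type accepted there            */

    while (s[i] == '=' && state < 2) {
        state++; i++;
        last_len = i;
        last_tok = (state == 1) ? "ASSIGN" : "EQ";
    }
    /* No further transition possible: back up to the last accept and emit. */
    *len = last_len;
    return last_tok;             /* 0 (NULL) means lexical error */
}
```

On "==" the scanner reaches the EQ state and emits a length-2 token; on "=x" it gets stuck after one character and backs up to the ASSIGN accept.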

6. The LEX/FLEX Tool

LEX (and its modern replacement FLEX) is a lexer generator — you provide a specification file with token patterns and associated actions, and LEX generates C code implementing the lexer as a DFA.

Structure of a LEX Specification

%{ C declarations %}
%%
pattern1 { action1; }
pattern2 { action2; }
...
%%
C functions

Example LEX Snippet

[0-9]+ { return INTEGER; }
[a-zA-Z_][a-zA-Z0-9_]* { return IDENT; }
"if" { return IF; }
[ \t\n] { /* skip whitespace */ }
. { yyerror("unknown char"); }

LEX is paired with YACC (Yet Another Compiler Compiler) for parser generation. LEX produces tokens that YACC’s parser consumes.

7. Lexical Errors

A lexical error occurs when the lexer cannot match the current input to any token pattern. Common strategies for error recovery:

  • Delete the character and continue scanning
  • Insert a missing character (uncommon — risks cascading errors)
  • Replace the character with the most likely intended character
  • Panic mode: Skip characters until a synchronisation point (whitespace, semicolon) is found
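The panic-mode strategy above can be sketched in a few lines. The synchronisation set chosen here (semicolon, newline, space) is an illustrative choice; real lexers pick sync points suited to the language.

```c
/* Panic-mode recovery: skip characters until a synchronisation point.
   Returns the position at which scanning should resume. */
int skip_to_sync(const char *s, int pos) {
    while (s[pos] != '\0' && s[pos] != ';' &&
           s[pos] != '\n' && s[pos] != ' ')
        pos++;
    return pos;
}
```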

Lexical errors are relatively rare — most “syntax errors” are detected by the parser, not the lexer.

8. Common Misconceptions

Misconception 1: The lexer parses the program.
The lexer only tokenises — it groups characters into tokens but applies no grammatical structure. Grammar checking (balanced braces, correct statement structure) is the parser’s job. The lexer cannot detect “{ if ( }” as an error; it just sees the token sequence LBRACE, IF, LPAREN, RBRACE.

Misconception 2: Keywords and identifiers are distinguished by the lexer’s regex alone.
They are distinguished by priority rules. Both “if” and “iffy” match the identifier regex [a-zA-Z_][a-zA-Z0-9_]*, and “if” also matches the keyword pattern. The lexer uses priority: keyword rules are listed first, so “if” becomes a KEYWORD. “iffy” only matches the identifier rule.
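An alternative to listing keyword patterns first — common in hand-written lexers and supported by some generated ones — is to match the identifier rule and then consult a keyword table. The table contents below are an illustrative subset.

```c
#include <string.h>

/* A keyword table consulted after the identifier rule matches. */
static const char *keywords[] = { "if", "else", "while", "for", "int" };

/* Return 1 if the matched identifier lexeme is actually a keyword. */
int is_keyword(const char *lexeme) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return 1;
    return 0;
}
```

With this scheme "if" is reclassified as a keyword after matching, while "iffy" falls through as an ordinary identifier — the same outcome the priority rule produces.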

Misconception 3: Lexers process the entire file first, then return all tokens.
In most implementations, the lexer and parser run interleaved. The parser calls yylex() to get the next token on demand — the lexer returns one token at a time, advancing its position in the input stream.
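The pull model can be sketched as follows. The yylex below is a stand-in for the function LEX/FLEX would generate, and the token codes are our own; the point is that each call returns exactly one token rather than tokenising the whole file up front.

```c
#include <ctype.h>

enum { T_EOF = 0, T_NUM = 1, T_PLUS = 2 };

static const char *input;   /* set by the driver */
static int cursor;

/* Stand-in for the generated yylex(): returns the next token code,
   advancing the cursor by one token per call. */
int yylex(void) {
    while (input[cursor] == ' ') cursor++;
    if (input[cursor] == '\0') return T_EOF;
    if (isdigit((unsigned char)input[cursor])) {
        while (isdigit((unsigned char)input[cursor])) cursor++;
        return T_NUM;
    }
    if (input[cursor] == '+') { cursor++; return T_PLUS; }
    cursor++;
    return -1;  /* lexical error */
}

/* A parser-like driver that pulls tokens on demand. */
int count_tokens(const char *src) {
    input = src;
    cursor = 0;
    int n = 0;
    while (yylex() != T_EOF) n++;
    return n;
}
```

A real YACC-generated parser works the same way, calling yylex() whenever its grammar needs the next lookahead token.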

Misconception 4: Comments are tokens.
Comments are typically matched by a pattern whose action discards the text and returns nothing (or recursively calls the lexer again). The parser never sees comment tokens in most language designs.

Misconception 5: LEX generates a parser.
LEX generates a lexer (scanner). YACC generates a parser. They are separate tools that are typically used together: LEX produces tokens, YACC consumes them.

9. Frequently Asked Questions

What is the difference between a token, a pattern, and a lexeme?

A token is the category name (like IDENTIFIER or INTEGER). A pattern is the regular expression rule that defines what strings belong to that token class. A lexeme is the actual matched string from the source code. For example: token = IDENTIFIER, pattern = [a-zA-Z_][a-zA-Z0-9_]*, lexeme = “totalCount”.

How does a lexer handle “==” vs “=”?

Using the maximal munch rule. When the lexer sees “==”, it reads the first “=” and transitions its DFA. The DFA then sees the second “=” and moves to a state representing “==”. No further “=” follows, so it accepts “==” as the EQ token. If the input were “= =”, the lexer would match “=” as ASSIGN, skip the space, then match the second “=” as another ASSIGN.

Why use a DFA rather than directly running a regex engine for lexing?

A DFA always runs in O(n) time — one state transition per input character, no backtracking. Backtracking regex engines (such as those in many scripting-language libraries) can take O(2^n) time on adversarial inputs, a weakness exploited by ReDoS (Regular Expression Denial of Service) attacks. Compilers use DFAs precisely to guarantee linear-time lexing regardless of input.

What happens when the lexer encounters an error?

When no pattern matches the current character, the lexer triggers an error. Common recovery strategies include skipping the invalid character and resuming, or entering a panic mode where scanning skips forward to a known synchronisation point (like the end of the line or a semicolon). The error is reported with the line and column number, and scanning continues so that the lexer can find as many errors as possible in one pass.
