Syntax Analysis and Parsing – LL(1), LR, SLR, LALR, FIRST and FOLLOW Sets Explained

Home
CS & IT
Compiler Design
Syntax Analysis & Parsing
Key Takeaways:

  • The parser takes the token stream from the lexer and checks it against the grammar, building a parse tree or AST.
  • FIRST(A) is the set of terminals that can begin any string derived from A. FOLLOW(A) is the set of terminals that can appear immediately after A in any sentential form.
  • LL(1) parsers scan left-to-right, use leftmost derivation, and look 1 token ahead. They use a predictive parse table and an explicit stack.
  • LR parsers (LR(0), SLR, CLR/LR(1), LALR) scan left-to-right, use rightmost derivation in reverse, and are more powerful than LL parsers.
  • LALR(1) is used by most real parser generators (YACC, Bison) — it handles almost all practical programming language grammars.
  • Parse table conflicts (shift-reduce or reduce-reduce) mean the grammar is not in the parser’s class — not necessarily that the grammar is ambiguous. An unambiguous grammar can still cause SLR/LALR conflicts.

1. Role of the Parser

The parser (syntax analyser) receives the token stream from the lexer and verifies that the sequence of tokens conforms to the grammatical rules of the language. If it does, the parser builds a parse tree (or abstract syntax tree — AST) representing the hierarchical structure of the program. If it doesn’t, the parser reports a syntax error.

The grammar used by the parser is a Context-Free Grammar (CFG). Real programming language parsers use deterministic subsets of CFGs (LL or LR grammars) to achieve efficient O(n) parsing.

2. FIRST and FOLLOW Sets

FIRST and FOLLOW are the foundation of predictive (LL) parsing — they tell the parser which production rule to use based on the current lookahead token.

FIRST(X)

The set of terminal symbols that can appear as the first symbol in any string derived from X.

  • If X is a terminal: FIRST(X) = {X}
  • If X is a non-terminal: FIRST(X) = terminals that can start strings derived from X
  • If X can derive ε (empty string): ε ∈ FIRST(X)

Computing FIRST for a production A → X1X2…Xk:

  1. Add FIRST(X1) − {ε} to FIRST(A)
  2. If ε ∈ FIRST(X1), also add FIRST(X2) − {ε}, and so on
  3. If all Xi can derive ε, add ε to FIRST(A)

FOLLOW(A)

The set of terminal symbols that can appear immediately to the right of A in any sentential form. FOLLOW never contains ε but always contains $ (end-of-input marker) for the start symbol.

  • $ ∈ FOLLOW(start symbol)
  • For each production B → αAβ: FIRST(β) − {ε} ⊂ FOLLOW(A)
  • For each production B → αA or B → αAβ where ε ∈ FIRST(β): FOLLOW(B) ⊂ FOLLOW(A)

FIRST of a String

FIRST(α) where α = X1X2...Xk:
= (FIRST(X1) - {ε}) ∪ (if ε ∈ FIRST(X1): FIRST(X2X3...Xk))

3. LL(1) Predictive Parsing

An LL(1) parser reads Left-to-right, uses Leftmost derivation, and looks ahead 1 token. It uses a predictive parse table M[A, a] that maps (non-terminal A, lookahead token a) to the production rule to apply.

Building the LL(1) Parse Table

For each production A → α:

  1. For each terminal a ∈ FIRST(α) − {ε}: add A → α to M[A, a]
  2. If ε ∈ FIRST(α): for each terminal b ∈ FOLLOW(A): add A → α to M[A, b]
  3. If ε ∈ FIRST(α) and $ ∈ FOLLOW(A): add A → α to M[A, $]

A grammar is LL(1) if and only if the parse table has no cell with more than one entry.

LL(1) Parsing Algorithm

Maintain a stack. Initially: push $, then push start symbol. Repeat:

  • Let X = top of stack, a = current lookahead token
  • If X = a = $: accept
  • If X = a (terminal): pop X, advance input
  • If X is a non-terminal: look up M[X, a]; if it contains X → Y1Y2…Yk, pop X, push Yk,…,Y1 (rightmost first)
  • If M[X, a] is empty: error

Conditions That PREVENT LL(1)

  • Left recursion: A → Aα causes the parser to loop infinitely. Eliminate by rewriting: A → βA’, A’ → αA’ | ε
  • Common prefixes: A → αβ | αγ — the parser can’t decide which rule to use. Fix by left factoring: A → αA’, A’ → β | γ

4. Recursive Descent Parsing

A recursive descent parser implements each non-terminal as a function. The function for non-terminal A tries to match A’s production rules by calling the functions for the right-hand side symbols. It is the manual, hand-coded equivalent of LL(1) parsing.

// Grammar: E → T E', E' → + T E' | ε, T → id
void parseE() { parseT(); parseE_prime(); }
void parseE_prime() {
  if (lookahead == '+') { match('+'); parseT(); parseE_prime(); }
  // else: epsilon production, do nothing
}

Recursive descent is easy to write, understand, and debug. Most hand-written parsers for programming languages (including GCC’s C parser and Clang) use recursive descent.

5. Bottom-Up LR Parsing

LR parsers work bottom-up: they start with the input tokens and work backward to the start symbol. They scan Left-to-right and produce a Rightmost derivation in reverse. LR parsers are more powerful than LL parsers — every LL(1) grammar is also LR(1), but many grammars are LR but not LL.

An LR parser uses two data structures: a stack (storing grammar symbols and states) and an action/goto table. Actions are:

  • Shift: Push the next input token onto the stack and move to a new state
  • Reduce: Pop symbols matching the right-hand side of a production, push the left-hand side non-terminal
  • Accept: Input is fully parsed
  • Error: No valid action exists

Items and the Canonical Collection

An LR item is a production with a dot (.) marking how far parsing has progressed. For A → αβ, the item A → α•β means “α has been seen; expecting β next.” The collection of all LR item sets (states) forms the LR automaton used to build the parse table.

6. LR(0) and SLR Parsers

LR(0)

Uses only the current state (no lookahead) to decide between shift and reduce. Very few practical grammars are LR(0) without conflicts.

SLR (Simple LR)

Uses 1 token of lookahead based on FOLLOW sets to resolve conflicts. A reduction A → α is only performed when the lookahead token is in FOLLOW(A). SLR is stronger than LR(0) but still fails on some grammars where FOLLOW sets are too coarse.

Building the SLR Table

  1. Augment grammar: S’ → S
  2. Compute LR(0) item sets (canonical collection)
  3. For each item set I and each symbol X: build ACTION and GOTO entries
    • If [A → α•aβ] ∈ I and goto(I,a) = J: ACTION[I,a] = shift J
    • If [A → α•] ∈ I (A ≠ S’): ACTION[I,a] = reduce A → α for all a ∈ FOLLOW(A)
    • If [S’ → S•] ∈ I: ACTION[I,$] = accept

7. CLR(1) and LALR(1) Parsers

CLR(1) — Canonical LR(1)

Uses LR(1) items: each item carries a lookahead token. The item [A → α•β, a] means “we have matched α so far and expect β next; when this production is fully matched (dot reaches the end), we reduce A → αβ only if the next input token is a.” Specifically, the reduce action fires only from the completed item [A → αβ•, a] when the lookahead is a — not from mid-production items. CLR(1) is the most powerful LR parser but produces a very large parse table (many states) because different lookaheads create distinct states.

LALR(1) — Lookahead LR(1)

LALR merges CLR(1) states that have the same LR(0) core (same items, ignoring lookaheads). This dramatically reduces the number of states while retaining most of CLR(1)’s power. LALR(1) is the practical sweet spot — used by YACC, Bison, and most industrial parser generators.

Power Comparison

ParserLookaheadPowerStatesUsed In
LR(0)NoneWeakestFewestTheoretical only
SLR(1)FOLLOW setsWeakSame as LR(0)Teaching
LALR(1)Merged lookaheadsStrongSame as LR(0)YACC, Bison
CLR(1)Full lookaheadsStrongestMostHigh-end tools
LL(1)FIRST/FOLLOWWeakest overallN/A (table-driven)Antlr, hand-written

8. LL vs LR — Key Differences

FeatureLL(1)LR(1)
DerivationLeftmostRightmost (in reverse)
DirectionTop-down (start symbol → input)Bottom-up (input → start symbol)
Left recursionCannot handleHandles naturally
Left factoring needed?YesNo
Grammar classLL(1) ⊂ LR(1)More powerful
ImplementationSimple (recursive descent)Table-driven (complex)
Error messagesBetter (natural recursion)Harder to produce good messages
Used byANTLR, hand-written parsersYACC, Bison, most generators

9. Common Misconceptions

Misconception 1: LALR(1) is just a faster version of LR(1).
LALR(1) is strictly weaker than LR(1) — it cannot handle all LR(1) grammars. Some LR(1) grammars produce reduce-reduce conflicts when states are merged for LALR. LALR is chosen because it produces a smaller table, not because it is equivalent to LR(1).

Misconception 2: A grammar that is not LL(1) cannot be parsed with a top-down parser.
With backtracking, any unambiguous CFG can be parsed top-down — just not deterministically (not in O(n)). LL(1) is the condition for deterministic top-down parsing with one-token lookahead. Parsers like Earley and CYK can handle arbitrary CFGs.

Misconception 3: FIRST and FOLLOW are only needed for LL parsing.
FOLLOW sets are also used in SLR parsing (to decide when to reduce). FIRST sets appear in CLR(1)/LALR lookahead computation. They are fundamental to all table-driven parsing.

Misconception 4: Shift-reduce conflict always means the grammar is ambiguous.
Not necessarily. Some unambiguous grammars have shift-reduce conflicts in SLR or LALR because the lookahead information is insufficient. The grammar may be LR(1) (no conflicts in CLR) but cause conflicts in the weaker LALR table.

Misconception 5: Left recursion is always bad and must be eliminated.
Left recursion is only a problem for LL parsers. LR parsers handle left-recursive grammars perfectly — in fact, left-recursive grammars naturally express left-associative operators (like subtraction and division) and are preferred in LR grammars for this reason.

10. Frequently Asked Questions

What are FIRST and FOLLOW sets used for?

FIRST sets tell the parser which production to use: “if the next token is in FIRST(A → α), use this production.” FOLLOW sets are used when a production derives ε — “if the next token is in FOLLOW(A), apply the epsilon production.” Together they determine the LL(1) parse table uniquely, and they also guide SLR reduce actions.

What is the difference between SLR, LALR, and CLR parsers?

All three are bottom-up LR parsers differing in how they handle lookaheads when deciding to reduce. SLR uses FOLLOW sets (too coarse — may cause spurious conflicts). CLR uses precise per-item lookaheads (fewest conflicts but many states). LALR merges CLR states with the same LR(0) core, giving CLR’s power with LR(0)’s table size — the practical choice for most parser generators.

How do you eliminate left recursion from a grammar?

Replace left-recursive rules with right-recursive rules using a new non-terminal. For direct left recursion A → Aα | β: rewrite as A → βA’ and A’ → αA’ | ε. For indirect left recursion (A → B, B → A…), first substitute to expose the direct recursion, then eliminate. Left factoring similarly removes common prefixes: A → αβ | αγ becomes A → αA’, A’ → β | γ.

Why do most compilers use LALR(1) rather than LR(1)?

CLR(1) can have exponentially more states than LALR(1) for the same grammar. This makes the parse table very large and slow to generate. LALR(1) produces the same number of states as LR(0) or SLR, making it practical. Almost all real programming language grammars can be expressed in LALR(1) — C, C++, Java, Pascal, and most languages were designed with LALR parsability in mind.