Context-Free Languages – CFG, PDA, CYK Algorithm and Pumping Lemma Explained

Home
CS & IT
Theory of Computation
Context-Free Languages
Key Takeaways:

  • A Context-Free Grammar (CFG) uses recursive production rules to generate strings — the backbone of every programming language parser.
  • A Pushdown Automaton (PDA) is an NFA augmented with a stack — it recognises exactly the context-free languages.
  • Chomsky Normal Form (CNF) rewrites any CFG so every rule has the form A → BC or A → a, enabling efficient parsing.
  • The CYK algorithm decides membership in a CFL in O(n3) time using dynamic programming on CNF grammars.
  • The Pumping Lemma for CFLs proves languages like {anbncn} are not context-free.
  • CFLs are closed under union, concatenation, and star — but not under intersection or complement.

1. Context-Free Grammars (CFG)

A Context-Free Grammar is a set of rules for generating strings. The defining feature: each rule rewrites a single non-terminal symbol, regardless of its context (hence “context-free”). This makes CFGs ideal for describing nested or recursive structures like balanced parentheses, arithmetic expressions, and programming language syntax.

Formal Definition

A CFG is a 4-tuple G = (V, Σ, R, S) where:

  • V — finite set of variables (non-terminals), e.g. {S, A, B}
  • Σ — finite set of terminals (alphabet symbols), e.g. {a, b, (, )}
  • R — finite set of production rules of the form A → w where A ∈ V and w ∈ (V ∪ Σ)*
  • S ∈ V — start variable

Example: Grammar for balanced parentheses

S → (S) | SS | ε

This generates: ε, (), (()), ()(), (()()), … — every string of balanced parentheses.

Example: Grammar for arithmetic expressions

E → E + T | T
T → T * F | F
F → (E) | id

This grammar correctly handles operator precedence (* before +) and left-associativity through the structure of the rules.

2. Derivations and Parse Trees

A derivation is a sequence of rule applications starting from S to reach a terminal string. A parse tree visualises the derivation as a tree where the root is S, interior nodes are variables, leaves are terminals or ε, and each interior node with its children represents a rule application.

Leftmost vs Rightmost Derivation

  • Leftmost: always expand the leftmost variable first — used in LL (top-down) parsers
  • Rightmost: always expand the rightmost variable first — used in LR (bottom-up) parsers

Both derivations produce the same parse tree if the grammar is unambiguous.

3. Ambiguity in Grammars

A grammar is ambiguous if some string has more than one parse tree (equivalently, more than one leftmost derivation). Ambiguity is a property of grammars, not languages.

Example of an Ambiguous Grammar

E → E + E | E * E | (E) | id

The string “id + id * id” has two parse trees — one grouping it as (id + id) * id, another as id + (id * id). This is why real programming language grammars explicitly encode precedence (as in the arithmetic expression grammar above).

Inherent Ambiguity

Some languages are inherently ambiguous — every grammar that generates them is ambiguous. Example: {anbncm} ∪ {anbmcm}. No unambiguous CFG can generate this language.

4. Chomsky Normal Form (CNF)

A CFG is in Chomsky Normal Form if every production rule has one of two forms:

  • A → BC — two non-terminals (binary rules)
  • A → a — exactly one terminal
  • (Only exception: S → ε is allowed if ε is in the language)

Converting any CFG to CNF — 5 Steps

  1. Add new start: Introduce S0 → S to prevent the old start from appearing on the right-hand side.
  2. Eliminate ε-rules: Find all “nullable” variables (those that can derive ε). For each rule containing a nullable variable, add a version with that variable removed. Remove the ε rules (except S0 → ε if needed).
  3. Eliminate unit rules (A → B): Replace each unit rule A → B with all rules B can derive, then remove the unit rules.
  4. Eliminate long rules: Replace each rule A → B1B2…Bk (k ≥ 3) with A → B1A1, A1 → B2A2, …, Ak-2 → Bk-1Bk.
  5. Move terminals to their own rules: For any terminal a in a binary rule, introduce a variable Na → a and replace a with Na in that rule.

5. Pushdown Automata (PDA)

A Pushdown Automaton is an NFA equipped with an infinite stack. The stack allows the PDA to remember an unbounded amount of information — enough to match nested structures like balanced parentheses or equal-length prefixes and suffixes.

Formal Definition

A PDA is a 6-tuple P = (Q, Σ, Γ, δ, q0, F) where:

  • Q — finite set of states
  • Σ — input alphabet
  • Γ — stack alphabet
  • δ : Q × (Σ ∪ {ε}) × Γε → 2Q × Γε — transition function
  • q0 — start state
  • F — set of accept states

Each transition reads an input symbol (or ε), pops a stack symbol (or ε), and pushes a string of stack symbols (or ε).

Acceptance Modes

  • Accept by final state: Input is consumed and machine is in an accept state
  • Accept by empty stack: Input is consumed and stack is empty

Both modes are equivalent — the same class of languages is recognised either way.

Example: PDA for {0n1n | n ≥ 1}

Push a $ onto the stack (as bottom marker), then push one stack symbol for each 0 read. When 1s appear, pop one symbol for each 1. Accept if the stack contains only $ when input ends. If 0s and 1s were equal in count, the stack is exactly emptied of the pushed symbols.

6. Equivalence: CFG and PDA

Context-Free Grammars and Pushdown Automata recognise exactly the same class of languages — the Context-Free Languages (CFLs).

  • CFG → PDA: Simulate leftmost derivation. The stack holds the current “sentential form.” On each step, if the top of stack is a non-terminal, apply a grammar rule (push the right-hand side). If the top is a terminal, match it against the input.
  • PDA → CFG: For each pair of states (p, q) in the PDA, create a variable Apq that generates all strings that take the PDA from state p with empty stack to state q with empty stack.

7. CYK Parsing Algorithm

The Cocke-Younger-Kasami (CYK) algorithm determines whether a string belongs to a CFL in O(n3 |G|) time, where n is the string length and |G| is the grammar size. It requires the grammar to be in CNF.

How It Works

CYK fills a triangular table. Entry T[i][j] = set of variables that can generate the substring from position i to position j.

  1. Base case: T[i][i] = {A | A → w[i] is a rule} (single characters)
  2. Inductive case: For each substring of length l > 1, for each split point k: if A → BC is a rule and B ∈ T[i][k] and C ∈ T[k+1][j], then add A to T[i][j]
  3. Accept if S ∈ T[1][n]

CYK is the basis for many practical parsing algorithms and is used in natural language processing to parse sentences according to a probabilistic context-free grammar.

8. Pumping Lemma for CFLs

The Pumping Lemma for context-free languages proves that certain languages (those requiring three or more interdependent counts) are not context-free.

Statement

If L is a context-free language, then there exists a pumping length p such that any string w ∈ L with |w| ≥ p can be written as w = uvxyz where:

  • |vy| ≥ 1 (v and y are not both empty)
  • |vxy| ≤ p
  • For all i ≥ 0, uvixyiz ∈ L

Example: {anbncn} is not context-free

Assume it is a CFL with pumping length p. Choose w = apbpcp. Since |vxy| ≤ p, the combined v and y can span at most two of the three symbol types. Pumping (i=2) increases the count of at most two symbol types, but not all three equally. So some symbol type’s count becomes unequal and the string leaves the language. Contradiction.

9. Closure Properties

OperationCFLs Closed?Note
Union (L1 ∪ L2)YesNew start: S → S1 | S2
Concatenation (L1 · L2)YesNew start: S → S1 S2
Kleene Star (L*)YesNew start: S → S1 S | ε
Intersection (L1 ∩ L2)No{anbncn} = CFL ∩ CFL is not CFL
ComplementNoComplement of a CFL may not be CFL
Intersection with RegularYesPDA + DFA product construction
Reversal (LR)YesReverse all grammar rules
HomomorphismYesApply substitution to terminals

Key insight: Intersection with a regular language always yields a CFL — this is used in compiler design to restrict a CFL by additional regular constraints.

10. Common Misconceptions

Misconception 1: Every CFL can be parsed in O(n2) time.
The general CYK algorithm is O(n3). However, restricted grammar classes like LL(k) and LR(k) can be parsed in O(n) time — which is why compiler parsers are designed to use these grammar subclasses.

Misconception 2: An ambiguous grammar generates an ambiguous language.
A language is ambiguous only if every grammar for it is ambiguous (inherently ambiguous). Many ambiguous grammars can be rewritten as unambiguous grammars for the same language.

Misconception 3: PDAs are deterministic, like DFAs.
PDAs are inherently nondeterministic. Deterministic PDAs (DPDAs) are a strict subset and recognise a strict subset of CFLs. For example, {wwR} (palindromes) is a CFL but not recognised by any DPDA. LL and LR parsing corresponds to DPDA recognition.

Misconception 4: The intersection of two CFLs is always a CFL.
False. {anbncm} and {anbmcm} are both CFLs, but their intersection {anbncn} is not context-free. Intersection is not closed for CFLs.

Misconception 5: You can always remove ambiguity from a grammar.
Not always. Some languages are inherently ambiguous — no unambiguous grammar exists for them. However, the grammars used in real programming languages can always be made unambiguous (by design — that is what grammar engineering is about).

11. Frequently Asked Questions

What is a context-free grammar and why is it used in compilers?

A context-free grammar is a set of recursive rules that define the syntactic structure of a language. In compilers, the parser uses a CFG to verify that source code has correct syntax and to build a parse tree (or abstract syntax tree). CFGs are ideal because programming languages have nested structures (if-else inside while loops, expressions inside function calls) that regular expressions cannot describe — you need recursion, which CFGs provide.

What is the difference between a PDA and a DFA?

A DFA has only its current state as memory — a fixed, finite amount. A PDA additionally has a stack, which provides unlimited (but last-in-first-out) memory. This extra memory lets PDAs match nested structures: a PDA can push when it sees an opening bracket and pop when it sees a closing one, checking they match. A DFA cannot count unbounded quantities and cannot do this. DFAs recognise regular languages; PDAs recognise the larger class of context-free languages.

Why is {a^n b^n c^n} not context-free when {a^n b^n} is?

{anbn} is context-free because a PDA can push one symbol per ‘a’ then pop one per ‘b’ — the stack tracks the single relationship. {anbncn} requires tracking two simultaneous equalities: a’s equal b’s AND b’s equal c’s. A stack can only track one unbounded counter at a time (in a push-then-pop sequence). The Pumping Lemma for CFLs formally proves this impossibility.

What is Greibach Normal Form (GNF)?

Greibach Normal Form is an alternative normal form for CFGs where every production starts with exactly one terminal: A → aα where a is a terminal and α is a (possibly empty) string of non-terminals. GNF is useful theoretically and guarantees that derivations never loop infinitely. Every CFG without ε can be converted to GNF. CNF is more commonly used for parsing (CYK algorithm), while GNF is more common in formal proofs.