Intermediate Code Generation – Three-Address Code, DAG, Basic Blocks and Control Flow Graph Explained

Key Takeaways:

  • Intermediate code is a language-independent, machine-independent representation that sits between the AST and target machine code — it enables retargeting (one front-end, many back-ends).
  • Three-address code (3AC) is the most common intermediate form: each instruction has at most one operator and three operands (two sources, one destination).
  • 3AC is stored as quadruples (op, arg1, arg2, result) or triples (op, arg1, arg2 — result is implicit position).
  • A DAG (Directed Acyclic Graph) represents an expression without redundancy — common subexpressions share nodes.
  • A basic block is a maximal straight-line sequence of instructions with one entry and one exit. Local optimisation works within basic blocks.
  • A Control Flow Graph (CFG) models how basic blocks connect through jumps and branches.

1. Why Intermediate Code?

A compiler translates from source language to target machine code. Without an intermediate representation, you would need a separate compiler for every (source language, target machine) pair — N languages × M machines = N×M compilers.

With an intermediate representation (IR), you need only N front-ends and M back-ends — N+M total. The front-end translates any source language to IR; the back-end translates IR to any target machine. This is the key design behind LLVM (used by Clang, Rust, Swift) and GCC (GIMPLE IR).

Intermediate code also separates concerns: language-specific transformations happen in the front-end; machine-independent optimisations happen on the IR; machine-specific optimisations happen in the back-end.

2. Three-Address Code (3AC)

Three-address code (3AC) is the most widely used IR. Each statement has at most one operator and references at most three addresses (operands). Operands can be variable names, constants, or compiler-generated temporaries (t1, t2, …).

Forms of 3AC Instructions

Instruction Type      Form                       Example
Assignment            x = y op z                 t1 = a + b
Unary assignment      x = op y                   t2 = -c
Copy                  x = y                      d = t1
Unconditional jump    goto L                     goto L3
Conditional jump      if x relop y goto L        if t1 > 0 goto L2
Procedure call        param x; call p, n         param a; call f, 1
Return                return y                   return t3
Array access          x = y[i] or y[i] = x       t4 = a[i]
Address / pointer     x = &y or x = *y           t5 = &a

Example: Translating a = b * c + b * c

t1 = b * c
t2 = b * c
t3 = t1 + t2
a = t3

Notice that b * c is computed twice. A DAG or common subexpression elimination would eliminate the redundancy:

t1 = b * c
t2 = t1 + t1
a = t2

3. Quadruples and Triples

3AC instructions can be stored in two ways:

Quadruples

Each instruction is a 4-field record: (operator, argument1, argument2, result).

#    op    arg1    arg2    result
0    *     b       c       t1
1    *     b       c       t2
2    +     t1      t2      t3
3    =     t3              a

Quadruples explicitly name the result — easy to rearrange and optimise.

Triples

Each instruction is a 3-field record: (operator, argument1, argument2). The result is the instruction’s position number (index into the array).

#    op    arg1    arg2
0    *     b       c
1    *     b       c
2    +     (0)     (1)
3    =     a       (2)

Triples are more compact but harder to reorder (instruction positions change).
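The two storage schemes can be sketched as plain record lists (a minimal illustration in Python; the tuple encodings are assumed, not a fixed standard):

```python
# Quadruples: (op, arg1, arg2, result) — the result is named explicitly,
# so records can be reordered without breaking anything.
quads = [
    ("*", "b", "c", "t1"),
    ("*", "b", "c", "t2"),
    ("+", "t1", "t2", "t3"),
    ("=", "t3", None, "a"),
]

# Triples: (op, arg1, arg2) — an integer operand is a reference to the
# result of the instruction at that index. Moving an instruction changes
# its index and silently invalidates every reference to it.
triples = [
    ("*", "b", "c"),   # 0
    ("*", "b", "c"),   # 1
    ("+", 0, 1),       # 2: uses the results of instructions 0 and 1
    ("=", "a", 2),     # 3: assigns the result of instruction 2 to a
]
```

This is exactly why the tables above differ only in the last column: quadruples carry it as data, triples encode it in position.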

4. DAG Representation

A Directed Acyclic Graph (DAG) represents an expression block without redundancy. Nodes represent operators; leaves represent operands (variables and constants). If the same subexpression appears multiple times, it shares a single node — this naturally identifies common subexpressions.

Building a DAG

  1. For each leaf operand (variable or constant): find or create a node labelled with that operand
  2. For each operator: check if a node for (op, left-child, right-child) already exists. If yes, share it. If no, create a new node.
  3. Attach the result variable name to the node
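The three steps above can be sketched with a hash-consing table that shares any repeated (op, left, right) node (a minimal Python sketch; the `DAG` class and its field names are hypothetical):

```python
class DAG:
    def __init__(self):
        self.nodes = {}   # (op, left, right) or ("leaf", name) -> node id
        self.labels = {}  # node id -> variable names attached to it

    def leaf(self, name):
        # Step 1: find or create a leaf node for an operand.
        return self.nodes.setdefault(("leaf", name), len(self.nodes))

    def op(self, op, left, right):
        # Step 2: if a node for (op, left, right) exists, share it;
        # otherwise create a fresh node id.
        return self.nodes.setdefault((op, left, right), len(self.nodes))

    def assign(self, var, node):
        # Step 3: attach the result variable name to the node.
        self.labels.setdefault(node, []).append(var)

# a = b * c + b * c  — both uses of b * c share a single node.
d = DAG()
b, c = d.leaf("b"), d.leaf("c")
t1 = d.op("*", b, c)
t2 = d.op("*", b, c)     # shared: same node id as t1
root = d.op("+", t1, t2)
d.assign("a", root)
```

The dictionary lookup in step 2 is what makes common subexpressions fall out for free: the second `b * c` never allocates a node.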

Uses of DAGs

  • Common Subexpression Elimination: Shared nodes mean shared computation — compute once, reuse result
  • Dead code detection: A node with no variable name attached (and not the root) is dead
  • Value numbering: Assign numbers to expressions; identical numbers mean identical values

5. Basic Blocks

A basic block is a maximal sequence of consecutive 3AC instructions such that:

  • Control enters only at the first instruction (no jumps into the middle)
  • Control leaves only at the last instruction (no jumps out except at the end)

Basic blocks are the unit of local optimisation — within a block, there are no branches, so dataflow is completely predictable.

Algorithm to Identify Basic Block Leaders

A leader is the first instruction of a basic block:

  1. The first instruction of the program is a leader
  2. Any instruction that is the target of a conditional or unconditional jump is a leader
  3. Any instruction immediately following a conditional or unconditional jump is a leader

A basic block consists of a leader and all instructions up to (but not including) the next leader.
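The leader rules translate almost line-for-line into code (a Python sketch, assuming an encoding where each instruction carries the index of its jump target, or None if it does not jump):

```python
def find_leaders(code):
    """Apply the three leader rules to a list of (text, jump_target) pairs."""
    leaders = {0}                        # rule 1: first instruction
    for i, (_, target) in enumerate(code):
        if target is not None:
            leaders.add(target)          # rule 2: jump target is a leader
            if i + 1 < len(code):
                leaders.add(i + 1)       # rule 3: instruction after a jump
    return sorted(leaders)

def basic_blocks(code):
    """A block runs from one leader up to (not including) the next."""
    ls = find_leaders(code)
    return [code[a:b] for a, b in zip(ls, ls[1:] + [len(code)])]

# i = 0; L: if i >= n goto E; i = i + 1; goto L; E: halt
code = [
    ("i = 0", None),           # 0  leader (rule 1)
    ("if i >= n goto 4", 4),   # 1  leader (target of goto at 3)
    ("i = i + 1", None),       # 2  leader (follows the conditional jump)
    ("goto 1", 1),             # 3
    ("halt", None),            # 4  leader (target of the conditional jump)
]
```

Running this on the five-instruction loop yields four blocks, with the increment and the back-jump grouped together.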

6. Control Flow Graph (CFG)

A Control Flow Graph is a directed graph where:

  • Each node is a basic block
  • A directed edge from block B1 to block B2 means control can flow from B1 to B2

CFGs are the foundation for global optimisation and data flow analysis. Key CFG concepts:

  • Entry node: The block containing the program’s first instruction
  • Exit node: The block containing the last instruction (or return)
  • Predecessors of B: All blocks with an edge to B
  • Successors of B: All blocks B has an edge to
  • Dominator: Node D dominates node N if every path from entry to N passes through D

Back Edges and Loops

A back edge is a CFG edge from a node to one of its dominators. Back edges identify loops. A natural loop corresponding to a back edge (n, h) is the set of nodes that can reach n without going through h, plus h itself. Loop detection is essential for loop optimisation.
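The natural-loop definition is a backward reachability search from n that never walks through h (a small Python sketch; the predecessor map and node names are hypothetical, and the back edge is assumed to be given):

```python
def natural_loop(preds, n, h):
    """Natural loop of back edge (n, h): all nodes that can reach n
    without passing through h, plus h itself."""
    loop = {h, n}
    stack = [n]
    while stack:
        node = stack.pop()
        for p in preds.get(node, []):
            if p not in loop:     # h is already in `loop`, so the walk
                loop.add(p)       # never continues through the header
                stack.append(p)
    return loop

# CFG: entry -> h -> a -> n, with back edge n -> h.
preds = {"h": ["entry", "n"], "a": ["h"], "n": ["a"]}
```

Seeding the worklist with h already "visited" is the whole trick: predecessors of h outside the loop (here, entry) are never explored.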

7. Static Single Assignment (SSA) Form

In SSA form, every variable is assigned exactly once. When a variable is used in multiple branches that merge, a special φ (phi) function is inserted at the merge point to select the appropriate version.

Original: x = 1; if (c) x = 2; y = x;
SSA form: x1 = 1; if (c) x2 = 2; x3 = φ(x1, x2); y1 = x3;

SSA simplifies many optimisation algorithms because dataflow is explicit — each use of a variable directly points to its unique definition. Used by LLVM, GCC, and virtually all modern compiler backends.
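For straight-line code — no merge points, hence no φ functions needed — SSA renaming is just a per-variable version counter (a minimal Python sketch; the (dest, op, args) tuple encoding is an assumed one):

```python
from collections import defaultdict

def to_ssa(code):
    """Rename straight-line 3AC so every variable is assigned once.
    `code` is a list of (dest, op, args) tuples."""
    version = defaultdict(int)   # how many times each variable was defined
    current = {}                 # variable -> its latest SSA name
    out = []
    for dest, op, args in code:
        # Uses refer to the latest version; unknown names (constants,
        # parameters) pass through unchanged.
        new_args = [current.get(a, a) for a in args]
        version[dest] += 1
        new_dest = f"{dest}{version[dest]}"
        current[dest] = new_dest
        out.append((new_dest, op, new_args))
    return out

# x = 1; x = x + 2; y = x
code = [("x", "const", ["1"]), ("x", "+", ["x", "2"]), ("y", "copy", ["x"])]
# to_ssa(code) ->
#   [("x1", "const", ["1"]), ("x2", "+", ["x1", "2"]), ("y1", "copy", ["x2"])]
```

Handling branches is where the real work lies: φ placement at merge points is usually driven by dominance frontiers, which this sketch deliberately omits.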

8. Translating Language Constructs to 3AC

If-Else Statement

if (E) S1 else S2

[evaluate E to t]
if t == false goto L_false
[code for S1]
goto L_end
L_false: [code for S2]
L_end:

While Loop

while (E) S

L_begin: [evaluate E to t]
if t == false goto L_end
[code for S]
goto L_begin
L_end:
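Both templates above can be mechanised with a small label-allocating emitter (a Python sketch; the `Emitter` class and its string encoding of 3AC are hypothetical):

```python
class Emitter:
    def __init__(self):
        self.code, self.n = [], 0

    def label(self):
        # Fresh labels L1, L2, ... so nested constructs never collide.
        self.n += 1
        return f"L{self.n}"

    def gen_if(self, cond, then_code, else_code):
        # Mirrors the if-else template: fall through to S1, jump to S2.
        l_false, l_end = self.label(), self.label()
        self.code.append(f"if {cond} == false goto {l_false}")
        self.code.extend(then_code)
        self.code.append(f"goto {l_end}")
        self.code.append(f"{l_false}:")
        self.code.extend(else_code)
        self.code.append(f"{l_end}:")

    def gen_while(self, cond, body):
        # Mirrors the while template: test at the top, jump back at the end.
        l_begin, l_end = self.label(), self.label()
        self.code.append(f"{l_begin}:")
        self.code.append(f"if {cond} == false goto {l_end}")
        self.code.extend(body)
        self.code.append(f"goto {l_begin}")
        self.code.append(f"{l_end}:")

e = Emitter()
e.gen_while("t1", ["x = x + 1"])
# e.code: ["L1:", "if t1 == false goto L2", "x = x + 1", "goto L1", "L2:"]
```

In a real compiler the body lists would come from recursive calls over the AST; here they are pre-rendered strings to keep the sketch short.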

Array Access: a[i][j] (row-major)

offset = i * (number_of_cols) + j
t = a[offset]

Note that this offset counts elements; the back-end scales it by the element width to obtain a byte address.
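As a quick check of the row-major formula (a hypothetical helper, not part of any compiler API):

```python
def row_major_offset(i, j, num_cols):
    """Element offset of a[i][j] in a 2-D array stored row by row."""
    return i * num_cols + j

# In a 3x4 array, a[2][3] is the last of the 12 elements:
# row_major_offset(2, 3, 4) == 11
```

Column-major layout (as in Fortran) simply swaps the roles of the indices: j * num_rows + i.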

9. Common Misconceptions

Misconception 1: Three-address code uses exactly three addresses.
The name means “at most three addresses” (two sources and one result). Instructions like x = y (copy) or goto L have fewer than three addresses. The term refers to the maximum, not a fixed count.

Misconception 2: A DAG and a tree represent the same thing.
A parse tree has no sharing — the same subexpression appearing twice creates two separate subtrees. A DAG allows sharing: the same subexpression is represented by a single node with multiple parents. DAGs are more compact and directly expose redundancy for optimisation.

Misconception 3: Quadruples and triples are equivalent in all respects.
Both represent the same computation, but triples are harder to reorder because the result address is the instruction’s index. Moving an instruction changes its index, breaking all references to it. Quadruples name results explicitly, making reordering during optimisation much simpler.

Misconception 4: Basic blocks can have multiple entry points.
By definition, a basic block has exactly one entry point (the leader) and one exit point. If a jump targets the middle of a block, the block is split — the target becomes a new leader, creating a new basic block.

Misconception 5: SSA form is used only in academic compilers.
SSA is used by every major industrial compiler. LLVM IR is in SSA form. GCC uses GIMPLE, which is converted to SSA for optimisation. Java’s JIT compilers (HotSpot) use SSA internally. It is the standard IR for modern optimising compilers.

10. Frequently Asked Questions

Why is intermediate code necessary? Can’t the compiler go directly from AST to machine code?

Technically yes, but intermediate code provides crucial advantages: (1) Retargetability — the same front-end works for multiple target architectures by swapping the back-end. (2) Machine-independent optimisation — many powerful optimisations (constant folding, common subexpression elimination) work best on IR, before the complexity of real machine constraints is introduced. (3) Separation of concerns — front-end engineers and back-end engineers can work independently. This is why LLVM IR, GCC’s GIMPLE, and Java bytecode exist.

What is the difference between a basic block and a function?

A function is a named unit of code with parameters and a return value — it may contain many basic blocks. A basic block is a straight-line sequence of instructions with no branches inside — it is the smallest unit of control flow analysis. A function is typically broken into multiple basic blocks by if-statements, loops, and switch statements. One function can have dozens of basic blocks; one basic block always belongs to exactly one function.

What is SSA form and why is it used?

In SSA (Static Single Assignment) form, every variable is defined (assigned) exactly once. When multiple branches of code assign to the same variable and the branches merge, a phi function selects the right version. SSA makes dataflow explicit and simplifies many optimisation algorithms — dead code elimination, constant propagation, and register allocation all become simpler in SSA. It is the standard IR for LLVM and GCC’s optimisation passes.

How does a compiler identify loops in intermediate code?

Loops are identified through the Control Flow Graph using dominance analysis. A back edge is an edge from node n to a dominator h (meaning h appears on every path from entry to n). Each back edge identifies a natural loop — all nodes that can reach n without going through h, plus h itself. These natural loops are the targets of loop optimisation passes like loop invariant code motion and loop unrolling.