What Is a Compiler?
A compiler is a program that reads source code in one language and produces equivalent code in another language, usually with the goal of producing efficient machine-executable output. The translation happens in a series of well-defined phases, each building on the output of the previous one.
A related tool, an interpreter, executes source code directly without producing a separate output file. Most modern language implementations use a combination — Python compiles to bytecode, which is then interpreted by the Python VM; Java compiles to JVM bytecode, which is JIT-compiled to native code at runtime.
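Python itself makes this hybrid visible: the standard library's `dis` module disassembles the bytecode that `compile()` produces and the VM later interprets. A quick sketch:

```python
import dis

# Compile a small expression to CPython bytecode, then disassemble it.
# The exact instruction names vary between Python versions, but the
# name lookups (LOAD_NAME) and arithmetic operations are always visible.
code = compile("x * 2 + 1", "<example>", "eval")
dis.dis(code)
```

Running this prints the instruction stream the Python VM would execute, which is exactly the "compile to bytecode, then interpret" pipeline described above.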
The Phases of a Compiler
A typical compiler is organised into front-end, middle-end, and back-end phases:
| Phase | Input | Output | Theory Behind It |
|---|---|---|---|
| Lexical Analysis | Source characters | Token stream | Regular expressions, DFA |
| Syntax Analysis (Parsing) | Token stream | Parse tree / AST | Context-free grammars, LL/LR parsing |
| Semantic Analysis | AST | Annotated AST | Type systems, symbol tables, SDT |
| Intermediate Code Generation | Annotated AST | Intermediate code (3-address code) | DAG, quadruples, SSA |
| Code Optimisation | Intermediate code | Optimised intermediate code | Data flow analysis, loop optimisation |
| Code Generation | Optimised intermediate code | Target machine code | Register allocation, instruction selection |
Why Study Compiler Design?
- Language design: Compilers force you to think precisely about syntax and semantics — what is a valid program and what does it mean?
- Performance: Understanding how compilers optimise code helps you write code that compiles to efficient machine instructions.
- Tools: Compiler techniques are used in linters, static analysers, code formatters, query optimisers (SQL compilers), and domain-specific languages.
- Theory in practice: Compiler phases directly apply automata theory (DFAs for lexing), formal grammars (CFGs for parsing), and graph algorithms (data flow analysis).
Topics in This Cluster
Lexical Analysis
Tokens, patterns, lexemes, regular expressions, DFA-based lexers, the Lex/Flex tools, and the handling of whitespace and comments.
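The core idea can be sketched in a few lines: a lexer is a set of (token name, regular expression) pairs applied to the input, with some matches (whitespace) discarded. The token names and patterns below are illustrative choices, not a fixed standard:

```python
import re

# Minimal regex-based lexer sketch. Each token class is a named group;
# the combined pattern plays the role of the DFA a tool like Flex builds.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),          # whitespace: matched but discarded
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(src):
    for m in MASTER.finditer(src):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("rate = 42 + x")))
# → [('IDENT', 'rate'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('IDENT', 'x')]
```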
Syntax Analysis & Parsing
Top-down parsing (LL(1), recursive descent), bottom-up parsing (LR(0), SLR, CLR, LALR), FIRST and FOLLOW sets, parse tables.
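Recursive descent is the most direct of these techniques: one function per nonterminal, consuming tokens left to right. A sketch for a tiny grammar (`expr -> term (('+'|'-') term)*`, `term -> NUMBER`), with the token and AST representations chosen purely for illustration:

```python
# Recursive-descent parser sketch. Tokens are plain strings; the AST is
# built from (operator, left, right) tuples.
def parse_expr(tokens, pos=0):
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)        # left-associative: fold as we go
    return node, pos

def parse_term(tokens, pos):
    return int(tokens[pos]), pos + 1    # term -> NUMBER

ast, _ = parse_expr(["1", "+", "2", "-", "3"])
print(ast)   # → ('-', ('+', 1, 2), 3)
```

Note how the loop, not right-recursion, gives left-associativity; this is the standard trick for turning a left-recursive grammar into LL(1)-friendly code.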
Semantic Analysis
Type checking, symbol tables, scope rules, syntax-directed translation (SDT), attribute grammars, inherited and synthesised attributes.
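A minimal sketch of what type checking looks like in practice: walk the AST, look identifiers up in a symbol table, and reject operations whose operand types disagree. The AST shape (`(op, left, right)` tuples) and the type names are assumptions for illustration:

```python
# Semantic-analysis sketch: synthesise a type for each AST node,
# consulting the symbol table for identifier references.
def check(node, symtab):
    if isinstance(node, int):
        return "int"
    if isinstance(node, str):                  # identifier reference
        if node not in symtab:
            raise NameError(f"undeclared identifier: {node}")
        return symtab[node]
    op, left, right = node
    lt, rt = check(left, symtab), check(right, symtab)
    if lt != rt:
        raise TypeError(f"type mismatch: {lt} {op} {rt}")
    return lt                                  # synthesised attribute

symtab = {"x": "int", "s": "str"}
print(check(("+", "x", 1), symtab))   # → int
```

The returned type here is a synthesised attribute in the attribute-grammar sense: computed bottom-up from the children.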
Intermediate Code Generation
Three-address code, quadruples, triples, DAG representation, basic blocks, control flow graphs, and SSA form.
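Lowering an AST to three-address code means flattening nested expressions into instructions with at most one operator each, introducing temporaries. A sketch, again over an illustrative `(op, left, right)` tuple AST, emitting quadruples:

```python
import itertools

# Sketch: lower an AST into three-address code as quadruples
# (operator, arg1, arg2, result), generating fresh temporaries t1, t2, ...
def gen_3ac(node, code, temps):
    if not isinstance(node, tuple):
        return str(node)                 # leaf: variable name or constant
    op, left, right = node
    a = gen_3ac(left, code, temps)
    b = gen_3ac(right, code, temps)
    t = f"t{next(temps)}"
    code.append((op, a, b, t))
    return t

code = []
gen_3ac(("+", ("*", "a", "b"), "c"), code, itertools.count(1))
for op, a, b, t in code:
    print(f"{t} = {a} {op} {b}")
# → t1 = a * b
#   t2 = t1 + c
```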
Code Optimisation
Peephole optimisation, common subexpression elimination, constant folding, loop optimisation, data flow analysis, liveness analysis.
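Constant folding is the simplest of these to show: any subtree whose operands are all compile-time constants is evaluated by the compiler itself. A sketch over the same illustrative `(op, left, right)` tuple AST:

```python
# Constant-folding sketch: fold children first, then collapse the node
# if both operands are now integer constants.
OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def fold(node):
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)      # evaluated at "compile time"
    return (op, left, right)

print(fold(("+", ("*", 2, 3), "x")))   # → ('+', 6, 'x')
```

Real compilers run this over the intermediate representation rather than the AST, and interleave it with other data-flow-based optimisations, but the fixed pattern — simplify children, then simplify the parent — is the same.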
Code Generation
Instruction selection, register allocation (graph colouring), instruction scheduling, and target-specific considerations.
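Register allocation via graph colouring can be sketched greedily: variables are nodes, an edge means two variables are live at the same time, and colours are registers. The interference graph and register count below are made-up examples, and a real allocator orders nodes carefully (e.g. by degree) rather than taking them as they come:

```python
# Greedy graph-colouring sketch for register allocation. k is the number
# of available registers; a node with no free colour is spilled to memory.
def colour(graph, k):
    assignment = {}
    for node in graph:
        used = {assignment[n] for n in graph[node] if n in assignment}
        reg = next((r for r in range(k) if r not in used), None)
        assignment[node] = reg if reg is not None else "SPILL"
    return assignment

# a interferes with b and c; b and c never live simultaneously,
# so they can share a register.
interference = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
print(colour(interference, 2))   # → {'a': 0, 'b': 1, 'c': 1}
```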
Formulas & Quick Reference
FIRST/FOLLOW rules, LL(1) table construction, LR parsing steps, 3AC patterns, optimisation taxonomy, and key definitions.
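As a taste of the FIRST-set rules, they can be computed by fixed-point iteration: keep applying the rules until no set grows. The grammar below (`E -> T X`, `X -> + T X | ε`, `T -> id`) is an illustrative LL(1)-style expression grammar, with `"eps"` standing in for ε:

```python
# FIRST-set computation sketch by fixed-point iteration.
# grammar maps nonterminal -> list of productions (lists of symbols);
# any symbol not in the grammar is treated as a terminal.
def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                for sym in prod:
                    if sym in grammar:                     # nonterminal
                        add = first[sym] - {"eps"}
                        if not add <= first[nt]:
                            first[nt] |= add
                            changed = True
                        if "eps" not in first[sym]:
                            break                          # sym can't vanish
                    else:                                  # terminal (incl. "eps")
                        if sym not in first[nt]:
                            first[nt].add(sym)
                            changed = True
                        break
                else:
                    # every symbol in prod can derive eps
                    if "eps" not in first[nt]:
                        first[nt].add("eps")
                        changed = True
    return first

grammar = {
    "E": [["T", "X"]],
    "X": [["+", "T", "X"], ["eps"]],
    "T": [["id"]],
}
print(first_sets(grammar))
# FIRST(E) = {id}, FIRST(X) = {+, eps}, FIRST(T) = {id}
```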
The Compiler as a Pipeline
Think of a compiler as an assembly line. The source code enters at one end; each station transforms it into a more “machine-friendly” form; executable code exits at the other end. Each phase has a clean interface with the next — the lexer knows nothing about grammar, the parser knows nothing about machine registers, and the register allocator knows nothing about the original syntax. This separation of concerns is what makes compilers modular, testable, and retargetable (the same front-end can target multiple architectures by swapping the back-end).
Symbol Table — The Compiler’s Memory
Running through all phases is the symbol table — a data structure that stores information about every identifier in the program (variables, functions, types). The lexer populates it, the parser and semantic analyser refine it, and the code generator reads from it. Efficient symbol-table implementation (typically hash tables) is critical to compiler performance.
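A common implementation is a chain of hash tables, one per scope, with lookups walking outward from the innermost scope. The class and attribute names below are illustrative:

```python
# Scoped symbol-table sketch: each scope is a dict; lookup walks the
# parent chain, so inner declarations shadow outer ones.
class SymbolTable:
    def __init__(self, parent=None):
        self.parent = parent
        self.entries = {}                    # name -> attribute record

    def define(self, name, info):
        self.entries[name] = info

    def lookup(self, name):
        scope = self
        while scope is not None:
            if name in scope.entries:
                return scope.entries[name]
            scope = scope.parent
        raise KeyError(f"undeclared identifier: {name}")

globals_ = SymbolTable()
globals_.define("x", {"type": "int"})
inner = SymbolTable(parent=globals_)         # a nested block scope
inner.define("x", {"type": "float"})         # shadows the outer x

print(inner.lookup("x")["type"])     # → float
print(globals_.lookup("x")["type"])  # → int
```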