Query Processing and Optimisation in DBMS – Cost Estimation and Join Algorithms
What You Will Learn
- Query processing pipeline: parsing, optimisation, execution
- Equivalence rules for relational algebra expressions
- Cost estimation: block I/O model and statistics
- Selection algorithms: sequential scan, binary search, index scan
- Join algorithms: nested loop, block nested loop, merge join, hash join
- Heuristic optimisation rules
- GATE PYQs with solutions
GATE Weightage: 1 mark (occasionally). Join cost calculations and heuristic rules are the likely question types.
1. Query Processing Pipeline
When a SQL query is submitted, the DBMS processes it through the following stages:
- Parser: Checks SQL syntax; produces a parse tree
- Translator: Converts parse tree to a relational algebra (RA) expression
- Optimiser: Generates alternative RA expressions; estimates costs; selects cheapest plan
- Query Evaluation Engine: Executes the chosen plan; returns results
2. Cost Model
DBMS optimisers use cost models based on estimated number of disk I/Os (block accesses). Main statistics used:
| Parameter | Symbol | Meaning |
|---|---|---|
| Number of tuples | n_r | Total rows in relation r |
| Number of blocks | b_r | Disk blocks used by r |
| Blocking factor | f_r | Tuples per block = n_r / b_r |
| Distinct values | V(A, r) | Number of distinct values for attribute A in r |
| Selection cardinality | s | Estimated tuples satisfying a condition |
Equality: s = n_r / V(A, r) (uniform distribution assumption)
Range A < v: s = ((v − min(A)) / (max(A) − min(A))) × n_r
Join result size estimate:
|R ⋈ S| ≈ n_R × n_S / max(V(A, R), V(A, S)) (on join attribute A)
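The three estimation formulas above can be written as a short calculator. This is an illustrative sketch of the optimiser's arithmetic, not any real DBMS API; the function names are made up for this example.

```python
def equality_selectivity(n_r: int, v_a: int) -> float:
    """Estimated tuples for sigma_{A = v}(r), assuming uniform distribution:
    s = n_r / V(A, r)."""
    return n_r / v_a

def range_selectivity(n_r: int, v: float, min_a: float, max_a: float) -> float:
    """Estimated tuples for sigma_{A < v}(r):
    s = (v - min(A)) / (max(A) - min(A)) * n_r."""
    return (v - min_a) / (max_a - min_a) * n_r

def join_size(n_r: int, n_s: int, v_a_r: int, v_a_s: int) -> float:
    """Estimated |R join S| on attribute A:
    n_R * n_S / max(V(A, R), V(A, S))."""
    return n_r * n_s / max(v_a_r, v_a_s)

# Example: 10,000 tuples, 100 distinct values of the selection attribute
print(equality_selectivity(10_000, 100))   # 100.0 expected matching tuples
print(join_size(10_000, 500, 100, 50))     # 50000.0 estimated join tuples
```

Note the uniform-distribution assumption: if values are skewed, the true cardinality can differ substantially from these estimates.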
3. Equivalence Rules
Two RA expressions are equivalent if they produce the same set of tuples for all valid database instances. Key equivalence rules used in optimisation:
| Rule | Description |
|---|---|
| Cascade of selections | σ_{c1 ∧ c2}(R) ≡ σ_{c1}(σ_{c2}(R)) |
| Commutativity of selection | σ_{c1}(σ_{c2}(R)) ≡ σ_{c2}(σ_{c1}(R)) |
| Cascade of projections | π_{L1}(π_{L2}(R)) ≡ π_{L1}(R) if L1 ⊆ L2 |
| Selection-join commute | σ_c(R ⋈ S) ≡ σ_c(R) ⋈ S if c involves only R’s attributes |
| Commutativity of join | R ⋈ S ≡ S ⋈ R |
| Associativity of join | (R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T) |
4. Selection Algorithms
| Algorithm | Condition | Cost (block I/Os) |
|---|---|---|
| Linear scan | Any condition | b_r (scan all blocks) |
| Binary search | Ordered file, equality on sort key | ⌈log₂(b_r)⌉ |
| Primary B+ tree index, equality | Key attribute | h_i + 1 (h_i = tree height) |
| Primary B+ tree index, equality | Non-key attribute | h_i + ⌈s/f_r⌉ (s = matching tuples, stored contiguously) |
| Secondary B+ tree index, equality | Key attribute | h_i + 1 |
| Secondary B+ tree index, equality | Non-key attribute | h_i + n (n = matching records — worst case) |
| Primary index, range | A ≥ v on primary index key | h_i + number of matching blocks |
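The costs in the table above are simple closed-form expressions, so they can be checked directly. A minimal sketch, with parameter values chosen purely for illustration:

```python
import math

def cost_linear_scan(b_r: int) -> int:
    return b_r                              # read every block

def cost_binary_search(b_r: int) -> int:
    return math.ceil(math.log2(b_r))        # ordered file, equality on sort key

def cost_primary_index_key(h_i: int) -> int:
    return h_i + 1                          # traverse tree, then 1 data block

def cost_primary_index_nonkey(h_i: int, s: int, f_r: int) -> int:
    return h_i + math.ceil(s / f_r)         # s matching tuples are contiguous

def cost_secondary_index_nonkey(h_i: int, s: int) -> int:
    return h_i + s                          # worst case: 1 I/O per matching record

b_r, h_i = 1_000, 3
print(cost_linear_scan(b_r))        # 1000
print(cost_binary_search(b_r))      # ceil(log2(1000)) = 10
print(cost_primary_index_key(h_i))  # 4
```

The last two functions show why a primary index beats a secondary index for non-key equality: matching tuples sit in contiguous blocks under a primary index, but may each need a separate block read under a secondary one.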
5. Join Algorithms
Notation
Relations: R (outer, n_R tuples, b_R blocks) and S (inner, n_S tuples, b_S blocks). M = memory buffers available.
Nested Loop Join
Tuple-based: Cost = b_R + n_R × b_S (for each tuple of R, scan all of S)
Block-based with 3 buffers (1 for outer, 1 for inner, 1 for output): Cost = b_R + b_R × b_S
Minimise cost: use the smaller relation as the outer, so the multiplier (n_R or b_R) on the inner scan is as small as possible
Block Nested Loop Join
With M memory pages: use M-2 pages to hold a chunk of the outer relation, 1 for inner, 1 for output.
Cost = b_R + ⌈b_R/(M-2)⌉ × b_S
If M-2 ≥ b_R (outer fits in memory): Cost = b_R + b_S (optimal)
Sort-Merge Join
Sort cost per relation of b blocks = 2b × (⌈log_{M-1}(⌈b/M⌉)⌉ + 1) block I/Os (one run-creation pass plus the merge passes, each reading and writing the whole relation)
Total cost = sort cost of R + sort cost of S + (b_R + b_S) for the merge
Simplified: if R and S are already sorted → Cost = b_R + b_S (just merge)
Best for: relations already sorted, or range join conditions
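The sort cost formula above is easy to get wrong in exams, so here is a sketch that evaluates it step by step. The function names are illustrative; the formula assumes M-way run creation followed by (M-1)-way merge passes.

```python
import math

def external_sort_cost(b: int, m: int) -> int:
    """2b * (ceil(log_{M-1}(ceil(b/M))) + 1) block I/Os:
    one run-creation pass plus ceil(log_{M-1}(#runs)) merge passes,
    each pass reading and writing all b blocks."""
    runs = math.ceil(b / m)
    merge_passes = math.ceil(math.log(runs, m - 1)) if runs > 1 else 0
    return 2 * b * (merge_passes + 1)

def sort_merge_join_cost(b_r: int, b_s: int, m: int, presorted: bool = False) -> int:
    merge = b_r + b_s               # final merge reads each relation once
    if presorted:
        return merge                # no sorting phase needed
    return external_sort_cost(b_r, m) + external_sort_cost(b_s, m) + merge

# b_R = 100, b_S = 50, M = 5 buffer pages
print(sort_merge_join_cost(100, 50, 5, presorted=True))   # 150
print(sort_merge_join_cost(100, 50, 5))                   # 800 + 300 + 150 = 1250
```

The `presorted=True` branch is the case GATE questions often test: when both inputs are already sorted on the join attribute, only the b_R + b_S merge remains.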
Hash Join
Phase 1 (Partition): Hash both relations on the join attribute into partitions; each relation is read once and written once
Cost = 2(b_R + b_S)
Phase 2 (Build-Probe): For each partition pair, build an in-memory hash table on the smaller partition, probe with the larger
Cost = b_R + b_S
Total Cost = 3(b_R + b_S)
Condition: each partition of the smaller relation must fit in M-2 memory pages
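The two phases can be illustrated with a toy in-memory version of the algorithm. This is a sketch for intuition only: real hash joins partition to disk, and the relation/attribute names (emp, dept) are invented for this example.

```python
from collections import defaultdict

def hash_join(r, s, key_r, key_s, n_partitions=4):
    """Toy grace hash join on an equi-join condition."""
    # Phase 1 (Partition): route every tuple of both inputs to a
    # partition using the same hash function on the join attribute.
    part_r, part_s = defaultdict(list), defaultdict(list)
    for t in r:
        part_r[hash(t[key_r]) % n_partitions].append(t)
    for t in s:
        part_s[hash(t[key_s]) % n_partitions].append(t)

    # Phase 2 (Build-Probe): per partition, build a hash table on the
    # r-side tuples, then probe it with the matching s-side partition.
    result = []
    for i in range(n_partitions):
        table = defaultdict(list)
        for t in part_r[i]:
            table[t[key_r]].append(t)
        for t in part_s[i]:
            for match in table[t[key_s]]:
                result.append({**match, **t})
    return result

emp  = [{"eid": 1, "dept": 5}, {"eid": 2, "dept": 6}]
dept = [{"dept": 5, "dname": "R&D"}]
print(hash_join(emp, dept, "dept", "dept"))
# → [{'eid': 1, 'dept': 5, 'dname': 'R&D'}]
```

Matching tuples always land in the same partition pair because both inputs use the same hash function, which is why only equality conditions work: a range condition gives no way to confine matches to one partition.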
| Algorithm | I/O Cost | Best For |
|---|---|---|
| Nested Loop (tuple) | b_R + n_R × b_S | Small relations; one fits in memory |
| Block Nested Loop | b_R + ⌈b_R/(M-2)⌉ × b_S | When one relation fits in memory |
| Sort-Merge Join | b_R + b_S + sort cost | Pre-sorted relations; range joins |
| Hash Join | 3(b_R + b_S) | Equi-joins; large relations; enough memory |
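Plugging one set of statistics into all the formulas in the table makes the ranking concrete. A minimal sketch, using the same sample numbers as the GATE 2016 example below (b_R = 100, n_R = 1,000, b_S = 50, M = 5); the function names are invented:

```python
import math

def tuple_nlj(b_r, n_r, b_s):
    return b_r + n_r * b_s                            # per-tuple inner scan

def block_nlj(b_r, b_s, m):
    return b_r + math.ceil(b_r / (m - 2)) * b_s       # per-chunk inner scan

def hash_join_cost(b_r, b_s):
    return 3 * (b_r + b_s)                            # partition + build-probe

def merge_only(b_r, b_s):
    return b_r + b_s                                  # both inputs pre-sorted

b_r, n_r, b_s, m = 100, 1_000, 50, 5
print(tuple_nlj(b_r, n_r, b_s))   # 100 + 1000*50 = 50100
print(block_nlj(b_r, b_s, m))     # 100 + 34*50  = 1800
print(hash_join_cost(b_r, b_s))   # 3*150        = 450
print(merge_only(b_r, b_s))       # 150
```

The spread (50,100 vs 1,800 vs 450 vs 150 block I/Os) shows why the optimiser's algorithm choice matters far more than constant-factor tuning.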
6. Heuristic Optimisation
Heuristics transform relational algebra expressions to reduce intermediate result sizes before cost-based evaluation:
- Push selections down: Apply σ as early as possible to reduce tuple count early. Most important heuristic.
- Push projections down: Apply π early to reduce tuple width (fewer bytes to process).
- Most restrictive first: Apply the most selective condition (the one matching the fewest tuples) first.
- Avoid Cartesian product: Combine σ and × into ⋈ whenever possible.
- Left-deep join trees: Organise joins as left-deep trees to enable pipelining — the right input to each join is always a base relation (can be streamed from disk).
Original: σ_{Dept=5}(EMPLOYEE ⋈ DEPARTMENT)
If Dept is an attribute of EMPLOYEE:
Optimised: σ_{Dept=5}(EMPLOYEE) ⋈ DEPARTMENT
Result: Instead of joining all tuples then filtering, filter EMPLOYEE first (fewer rows to join).
7. GATE Examples
GATE 2016: Relation R has 1000 tuples in 100 blocks. Relation S has 500 tuples in 50 blocks. 5 buffer pages available. What is the cost of block nested loop join with R as outer?
Block nested loop join cost = b_R + ⌈b_R / (M-2)⌉ × b_S
M = 5, M-2 = 3 buffers for outer chunks
Cost = 100 + ⌈100/3⌉ × 50 = 100 + 34 × 50 = 100 + 1700 = 1800 block I/Os
With S as outer: 50 + ⌈50/3⌉ × 100 = 50 + 17 × 100 = 50 + 1700 = 1750 I/Os
Better to use S as outer (smaller relation).
GATE 2014: A relation with 10,000 tuples is stored in 1,000 blocks. An index has 3 levels. How many disk accesses are needed to retrieve a tuple by its primary key using the B+ tree index?
Using primary B+ tree index for equality on key attribute:
Cost = tree height + 1 data block access
Tree height = 3 levels → 3 I/Os for tree traversal + 1 for data block
Answer: 4 disk accesses
8. Common Mistakes
- Nested loop vs block nested loop: Basic nested loop iterates per tuple (cost = b_R + n_R × b_S); block nested loop iterates per block of outer relation (cost = b_R + ⌈b_R/(M-2)⌉ × b_S). Always use block nested loop when memory is given.
- Inner vs outer relation in join: For nested loop join, always put the SMALLER relation as the outer to minimise the multiplier term. For block nested loop, put smaller as outer to reduce ⌈b_R/(M-2)⌉.
- Hash join requires equi-join: Hash join only works for equi-join conditions (equality). It cannot handle range join conditions (A < B). Sort-merge join works for both.
- Sort-merge with pre-sorted input: If both relations are already sorted on the join attribute, sort-merge join costs only b_R + b_S (no sorting phase needed). Students often add sorting cost unnecessarily.
- Secondary index for range queries: Secondary indexes are efficient for equality but very inefficient for range queries on large result sets — each matching record may be in a different block, requiring one I/O per record (worst case).
9. Frequently Asked Questions
What are the steps in query processing in DBMS?
Four steps: (1) Parsing — SQL is parsed into a parse tree; (2) Translation — parse tree converted to relational algebra; (3) Optimisation — alternative plans generated and costed using statistics; cheapest chosen; (4) Execution — query engine runs the plan and returns results. The optimiser uses cost estimation (block I/O model with tuple counts, block counts, distinct value statistics) to compare plans.
What is the cost of nested loop join in DBMS?
Tuple nested loop: b_R + n_R × b_S. Block nested loop (with M buffers): b_R + ⌈b_R/(M-2)⌉ × b_S. If the outer relation fits entirely in M-2 buffers: cost reduces to b_R + b_S. Always put the smaller relation as the outer to minimise cost. For GATE: use block nested loop formula when buffer count M is given.
What is heuristic optimisation in DBMS?
Heuristic optimisation applies transformation rules to relational algebra expressions before cost-based analysis. The key heuristics are: push selections as low as possible in the tree (reduces intermediate size); push projections down (reduces tuple width); apply most selective operations first; replace Cartesian products with joins; use left-deep join trees for pipelining. These rules reduce the search space for cost-based optimisation.