Query Processing and Optimisation in DBMS – Cost Estimation and Join Algorithms



Query Processing and Optimisation in DBMS – Cost Estimation and Join Algorithms

What You Will Learn

  • Query processing pipeline: parsing, optimisation, execution
  • Equivalence rules for relational algebra expressions
  • Cost estimation: block I/O model and statistics
  • Selection algorithms: sequential scan, binary search, index scan
  • Join algorithms: nested loop, block nested loop, merge join, hash join
  • Heuristic optimisation rules
  • GATE PYQs with solutions

GATE Weightage: 1 mark (occasionally). Join cost calculations and heuristic rules are the likely question types.

1. Query Processing Pipeline

When a SQL query is submitted, the DBMS processes it through the following stages:

  1. Parser: Checks SQL syntax; produces a parse tree
  2. Translator: Converts parse tree to a relational algebra (RA) expression
  3. Optimiser: Generates alternative RA expressions; estimates costs; selects cheapest plan
  4. Query Evaluation Engine: Executes the chosen plan; returns results
Query Optimiser Output: An annotated relational algebra expression called an evaluation plan — specifies which algorithm to use for each operator (e.g., “use hash join for this join”, “use index scan for this selection”).

2. Cost Model

DBMS optimisers use cost models based on estimated number of disk I/Os (block accesses). Main statistics used:

ParameterSymbolMeaning
Number of tuplesn_rTotal rows in relation r
Number of blocksb_rDisk blocks used by r
Blocking factorf_rTuples per block = n_r / b_r
Distinct valuesV(A, r)Number of distinct values for attribute A in r
Selection cardinalitysEstimated tuples satisfying a condition
Selectivity (fraction of tuples satisfying condition):
Equality: s = n_r / V(A, r) (uniform distribution assumption)
Range A < v: s = (v – min(A)) / (max(A) – min(A)) × n_r

Join result size estimate:
|R ⋈ S| ≈ n_R × n_S / max(V(A, R), V(A, S)) (on join attribute A)

3. Equivalence Rules

Two RA expressions are equivalent if they produce the same set of tuples for all valid database instances. Key equivalence rules used in optimisation:

RuleDescription
Cascade of selectionsσ_{c1 ∧ c2}(R) ≡ σ_{c1}(σ_{c2}(R))
Commutativity of selectionσ_{c1}(σ_{c2}(R)) ≡ σ_{c2}(σ_{c1}(R))
Cascade of projectionsπ_{L1}(π_{L2}(R)) ≡ π_{L1}(R) if L1 ⊆ L2
Selection-join commuteσ_c(R ⋈ S) ≡ σ_c(R) ⋈ S if c involves only R’s attributes
Commutativity of joinR ⋈ S ≡ S ⋈ R
Associativity of join(R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)

4. Selection Algorithms

AlgorithmConditionCost (block I/Os)
Linear scanAny conditionb_r (scan all blocks)
Binary searchOrdered file, equality on sort key⌈log₂(b_r)⌉
Primary B+ tree index, equalityKey attributeh_i + 1 (h_i = tree height)
Primary B+ tree index, equalityNon-key attributeh_i + ⌈b/f_r⌉ (b = matching blocks)
Secondary B+ tree index, equalityKey attributeh_i + 1
Secondary B+ tree index, equalityNon-key attributeh_i + n (n = matching records — worst case)
Primary index, rangeA ≥ v on primary index keyh_i + number of matching blocks

5. Join Algorithms

Notation

Relations: R (outer, n_R tuples, b_R blocks) and S (inner, n_S tuples, b_S blocks). M = memory buffers available.

Nested Loop Join

Cost = b_R + n_R × b_S (tuple-based outer loop)
OR
Cost = b_R + b_R × b_S (block-based, using 1 buffer for outer, 1 for inner, 1 for output)

Minimise cost: Put smaller relation as outer (to minimise b_R × b_S iterations)

Block Nested Loop Join

Cost = b_R + ⌈b_R / (M-2)⌉ × b_S

With M memory pages: use M-2 pages for outer relation blocks, 1 for inner, 1 for output.
If M-2 ≥ b_R (outer fits in memory): Cost = b_R + b_S (optimal)

Sort-Merge Join

Cost = Sort(R) + Sort(S) + (b_R + b_S)

Sort cost = 2 × b × ⌈log_{M-1}(⌈b/M⌉)⌉ + 2b passes × b_R blocks
Simplified: if R and S are already sorted → Cost = b_R + b_S (just merge)
Best for: relations already sorted, or range join conditions

Hash Join

Phase 1 (Partition): Hash R and S into n buckets on join attribute
Cost = 2(b_R + b_S)

Phase 2 (Build-Probe): For each bucket pair, build hash table on smaller, probe with larger
Cost = (b_R + b_S)

Total Cost = 3(b_R + b_S)
Condition: each partition of smaller relation must fit in M-2 memory pages

AlgorithmI/O CostBest For
Nested Loop (tuple)b_R + n_R × b_SSmall relations; one fits in memory
Block Nested Loopb_R + ⌈b_R/(M-2)⌉ × b_SWhen one relation fits in memory
Sort-Merge Join3(b_R + b_S) + sort costPre-sorted relations; range joins
Hash Join3(b_R + b_S)Equi-joins; large relations; enough memory

6. Heuristic Optimisation

Heuristics transform relational algebra expressions to reduce intermediate result sizes before cost-based evaluation:

  1. Push selections down: Apply σ as early as possible to reduce tuple count early. Most important heuristic.
  2. Push projections down: Apply π early to reduce tuple width (fewer bytes to process).
  3. Most restrictive first: Apply the selection with lowest selectivity (fewest matching tuples) first.
  4. Avoid Cartesian product: Combine σ and × into ⋈ whenever possible.
  5. Left-deep join trees: Organise joins as left-deep trees to enable pipelining — the right input to each join is always a base relation (can be streamed from disk).
Example (push selection down):
Original: σ_{Dept=5}(EMPLOYEE ⋈ DEPARTMENT)

If Dept is an attribute of EMPLOYEE:
Optimised: σ_{Dept=5}(EMPLOYEE) ⋈ DEPARTMENT

Result: Instead of joining all tuples then filtering, filter EMPLOYEE first (fewer rows to join).

7. GATE Examples

GATE 2016: Relation R has 1000 tuples in 100 blocks. Relation S has 500 tuples in 50 blocks. 5 buffer pages available. What is the cost of block nested loop join with R as outer?

View Solution

Block nested loop join cost = b_R + ⌈b_R / (M-2)⌉ × b_S

M = 5, M-2 = 3 buffers for outer chunks

Cost = 100 + ⌈100/3⌉ × 50 = 100 + 34 × 50 = 100 + 1700 = 1800 block I/Os

With S as outer: 50 + ⌈50/3⌉ × 100 = 50 + 17 × 100 = 50 + 1700 = 1750 I/Os

Better to use S as outer (smaller relation).

GATE 2014: A relation with 10,000 tuples is stored in 1,000 blocks. An index has 3 levels. How many disk accesses are needed to retrieve a tuple by its primary key using the B+ tree index?

View Solution

Using primary B+ tree index for equality on key attribute:

Cost = tree height + 1 data block access

Tree height = 3 levels → 3 I/Os for tree traversal + 1 for data block

Answer: 4 disk accesses

8. Common Mistakes

  • Nested loop vs block nested loop: Basic nested loop iterates per tuple (cost = b_R + n_R × b_S); block nested loop iterates per block of outer relation (cost = b_R + ⌈b_R/(M-2)⌉ × b_S). Always use block nested loop when memory is given.
  • Inner vs outer relation in join: For nested loop join, always put the SMALLER relation as the outer to minimise the multiplier term. For block nested loop, put smaller as outer to reduce ⌈b_R/(M-2)⌉.
  • Hash join requires equi-join: Hash join only works for equi-join conditions (equality). It cannot handle range join conditions (A < B). Sort-merge join works for both.
  • Sort-merge with pre-sorted input: If both relations are already sorted on the join attribute, sort-merge join costs only b_R + b_S (no sorting phase needed). Students often add sorting cost unnecessarily.
  • Secondary index for range queries: Secondary indexes are efficient for equality but very inefficient for range queries on large result sets — each matching record may be in a different block, requiring one I/O per record (worst case).

9. Frequently Asked Questions

What are the steps in query processing in DBMS?

Four steps: (1) Parsing — SQL is parsed into a parse tree; (2) Translation — parse tree converted to relational algebra; (3) Optimisation — alternative plans generated and costed using statistics; cheapest chosen; (4) Execution — query engine runs the plan and returns results. The optimiser uses cost estimation (block I/O model with tuple counts, block counts, distinct value statistics) to compare plans.

What is the cost of nested loop join in DBMS?

Tuple nested loop: b_R + n_R × b_S. Block nested loop (with M buffers): b_R + ⌈b_R/(M-2)⌉ × b_S. If the outer relation fits entirely in M-2 buffers: cost reduces to b_R + b_S. Always put the smaller relation as the outer to minimise cost. For GATE: use block nested loop formula when buffer count M is given.

What is heuristic optimisation in DBMS?

Heuristic optimisation applies transformation rules to relational algebra expressions before cost-based analysis. The key heuristics are: push selections as low as possible in the tree (reduces intermediate size); push projections down (reduces tuple width); apply most selective operations first; replace Cartesian products with joins; use left-deep join trees for pipelining. These rules reduce the search space for cost-based optimisation.

Leave a Comment