Query Processing and Optimisation in DBMS – Cost Estimation and Join Algorithms
What You Will Learn
- Query processing pipeline: parsing, optimisation, execution
- Equivalence rules for relational algebra expressions
- Cost estimation: block I/O model and statistics
- Selection algorithms: sequential scan, binary search, index scan
- Join algorithms: nested loop, block nested loop, merge join, hash join
- Heuristic optimisation rules
- GATE PYQs with solutions
GATE Weightage: 1 mark (occasionally). Join cost calculations and heuristic rules are the likely question types.
1. Query Processing Pipeline
When a SQL query is submitted, the DBMS processes it through the following stages:
- Parser: Checks SQL syntax; produces a parse tree
- Translator: Converts parse tree to a relational algebra (RA) expression
- Optimiser: Generates alternative RA expressions; estimates costs; selects cheapest plan
- Query Evaluation Engine: Executes the chosen plan; returns results
2. Cost Model
DBMS optimisers use cost models based on estimated number of disk I/Os (block accesses). Main statistics used:
| Parameter | Symbol | Meaning |
|---|---|---|
| Number of tuples | n_r | Total rows in relation r |
| Number of blocks | b_r | Disk blocks used by r |
| Blocking factor | f_r | Tuples per block = n_r / b_r |
| Distinct values | V(A, r) | Number of distinct values for attribute A in r |
| Selection cardinality | s | Estimated tuples satisfying a condition |
Equality: s = n_r / V(A, r) (uniform distribution assumption)
Range A < v: s = ((v − min(A)) / (max(A) − min(A))) × n_r
Join result size estimate:
|R ⋈ S| ≈ n_R × n_S / max(V(A, R), V(A, S)) (on join attribute A)
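The three estimation formulas above can be written as a short calculator. This is an illustrative sketch of the optimiser's arithmetic, not any real DBMS API; the function names are made up for this example.

```python
def equality_selectivity(n_r: int, v_a: int) -> float:
    """Estimated tuples for sigma_{A = v}(r), assuming uniform distribution:
    s = n_r / V(A, r)."""
    return n_r / v_a

def range_selectivity(n_r: int, v: float, min_a: float, max_a: float) -> float:
    """Estimated tuples for sigma_{A < v}(r):
    s = (v - min(A)) / (max(A) - min(A)) * n_r."""
    return (v - min_a) / (max_a - min_a) * n_r

def join_size(n_r: int, n_s: int, v_a_r: int, v_a_s: int) -> float:
    """Estimated |R join S| on attribute A:
    n_R * n_S / max(V(A, R), V(A, S))."""
    return n_r * n_s / max(v_a_r, v_a_s)

# Example: 10,000 tuples, 100 distinct values of the selection attribute
print(equality_selectivity(10_000, 100))   # 100.0 expected matching tuples
print(join_size(10_000, 500, 100, 50))     # 50000.0 estimated join tuples
```

Note the uniform-distribution assumption: if values are skewed, the true cardinality can differ substantially from these estimates.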
3. Equivalence Rules
Two RA expressions are equivalent if they produce the same set of tuples for all valid database instances. Key equivalence rules used in optimisation:
| Rule | Description |
|---|---|
| Cascade of selections | σ_{c1 ∧ c2}(R) ≡ σ_{c1}(σ_{c2}(R)) |
| Commutativity of selection | σ_{c1}(σ_{c2}(R)) ≡ σ_{c2}(σ_{c1}(R)) |
| Cascade of projections | π_{L1}(π_{L2}(R)) ≡ π_{L1}(R) if L1 ⊆ L2 |
| Selection-join commute | σ_c(R ⋈ S) ≡ σ_c(R) ⋈ S if c involves only R’s attributes |
| Commutativity of join | R ⋈ S ≡ S ⋈ R |
| Associativity of join | (R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T) |
4. Selection Algorithms
| Algorithm | Condition | Cost (block I/Os) |
|---|---|---|
| Linear scan | Any condition | b_r (scan all blocks) |
| Binary search | Ordered file, equality on sort key | ⌈log₂(b_r)⌉ |
| Primary B+ tree index, equality | Key attribute | h_i + 1 (h_i = tree height) |
| Primary B+ tree index, equality | Non-key attribute | h_i + ⌈s/f_r⌉ (s = matching tuples, stored contiguously) |
| Secondary B+ tree index, equality | Key attribute | h_i + 1 |
| Secondary B+ tree index, equality | Non-key attribute | h_i + n (n = matching records — worst case) |
| Primary index, range | A ≥ v on primary index key | h_i + number of matching blocks |
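The costs in the table above are simple closed-form expressions, so they can be checked directly. A minimal sketch, with parameter values chosen purely for illustration:

```python
import math

def cost_linear_scan(b_r: int) -> int:
    return b_r                              # read every block

def cost_binary_search(b_r: int) -> int:
    return math.ceil(math.log2(b_r))        # ordered file, equality on sort key

def cost_primary_index_key(h_i: int) -> int:
    return h_i + 1                          # traverse tree, then 1 data block

def cost_primary_index_nonkey(h_i: int, s: int, f_r: int) -> int:
    return h_i + math.ceil(s / f_r)         # s matching tuples are contiguous

def cost_secondary_index_nonkey(h_i: int, s: int) -> int:
    return h_i + s                          # worst case: 1 I/O per matching record

b_r, h_i = 1_000, 3
print(cost_linear_scan(b_r))        # 1000
print(cost_binary_search(b_r))      # ceil(log2(1000)) = 10
print(cost_primary_index_key(h_i))  # 4
```

The last two functions show why a primary index beats a secondary index for non-key equality: matching tuples sit in contiguous blocks under a primary index, but may each need a separate block read under a secondary one.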
5. Join Algorithms
Notation
Relations: R (outer, n_R tuples, b_R blocks) and S (inner, n_S tuples, b_S blocks). M = memory buffers available.
Nested Loop Join
Tuple-based: Cost = b_R + n_R × b_S (for each tuple of R, scan all of S)
Block-based with 3 buffers (1 for outer, 1 for inner, 1 for output): Cost = b_R + b_R × b_S
Minimise cost: use the smaller relation as the outer, so the multiplier (n_R or b_R) on the inner scan is as small as possible
Block Nested Loop Join
With M memory pages: use M-2 pages to hold a chunk of the outer relation, 1 for inner, 1 for output.
Cost = b_R + ⌈b_R/(M-2)⌉ × b_S
If M-2 ≥ b_R (outer fits in memory): Cost = b_R + b_S (optimal)
Sort-Merge Join
Sort cost per relation of b blocks = 2b × (⌈log_{M-1}(⌈b/M⌉)⌉ + 1) block I/Os (one run-creation pass plus the merge passes, each reading and writing the whole relation)
Total cost = sort cost of R + sort cost of S + (b_R + b_S) for the merge
Simplified: if R and S are already sorted → Cost = b_R + b_S (just merge)
Best for: relations already sorted, or range join conditions
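The sort cost formula above is easy to get wrong in exams, so here is a sketch that evaluates it step by step. The function names are illustrative; the formula assumes M-way run creation followed by (M-1)-way merge passes.

```python
import math

def external_sort_cost(b: int, m: int) -> int:
    """2b * (ceil(log_{M-1}(ceil(b/M))) + 1) block I/Os:
    one run-creation pass plus ceil(log_{M-1}(#runs)) merge passes,
    each pass reading and writing all b blocks."""
    runs = math.ceil(b / m)
    merge_passes = math.ceil(math.log(runs, m - 1)) if runs > 1 else 0
    return 2 * b * (merge_passes + 1)

def sort_merge_join_cost(b_r: int, b_s: int, m: int, presorted: bool = False) -> int:
    merge = b_r + b_s               # final merge reads each relation once
    if presorted:
        return merge                # no sorting phase needed
    return external_sort_cost(b_r, m) + external_sort_cost(b_s, m) + merge

# b_R = 100, b_S = 50, M = 5 buffer pages
print(sort_merge_join_cost(100, 50, 5, presorted=True))   # 150
print(sort_merge_join_cost(100, 50, 5))                   # 800 + 300 + 150 = 1250
```

The `presorted=True` branch is the case GATE questions often test: when both inputs are already sorted on the join attribute, only the b_R + b_S merge remains.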
Hash Join
Phase 1 (Partition): Hash both relations on the join attribute into partitions; each relation is read once and written once
Cost = 2(b_R + b_S)
Phase 2 (Build-Probe): For each partition pair, build an in-memory hash table on the smaller partition, probe with the larger
Cost = b_R + b_S
Total Cost = 3(b_R + b_S)
Condition: each partition of the smaller relation must fit in M-2 memory pages
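The two phases can be illustrated with a toy in-memory version of the algorithm. This is a sketch for intuition only: real hash joins partition to disk, and the relation/attribute names (emp, dept) are invented for this example.

```python
from collections import defaultdict

def hash_join(r, s, key_r, key_s, n_partitions=4):
    """Toy grace hash join on an equi-join condition."""
    # Phase 1 (Partition): route every tuple of both inputs to a
    # partition using the same hash function on the join attribute.
    part_r, part_s = defaultdict(list), defaultdict(list)
    for t in r:
        part_r[hash(t[key_r]) % n_partitions].append(t)
    for t in s:
        part_s[hash(t[key_s]) % n_partitions].append(t)

    # Phase 2 (Build-Probe): per partition, build a hash table on the
    # r-side tuples, then probe it with the matching s-side partition.
    result = []
    for i in range(n_partitions):
        table = defaultdict(list)
        for t in part_r[i]:
            table[t[key_r]].append(t)
        for t in part_s[i]:
            for match in table[t[key_s]]:
                result.append({**match, **t})
    return result

emp  = [{"eid": 1, "dept": 5}, {"eid": 2, "dept": 6}]
dept = [{"dept": 5, "dname": "R&D"}]
print(hash_join(emp, dept, "dept", "dept"))
# → [{'eid': 1, 'dept': 5, 'dname': 'R&D'}]
```

Matching tuples always land in the same partition pair because both inputs use the same hash function, which is why only equality conditions work: a range condition gives no way to confine matches to one partition.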
| Algorithm | I/O Cost | Best For |
|---|---|---|
| Nested Loop (tuple) | b_R + n_R × b_S | Small relations; one fits in memory |
| Block Nested Loop | b_R + ⌈b_R/(M-2)⌉ × b_S | When one relation fits in memory |
| Sort-Merge Join | b_R + b_S + sort cost | Pre-sorted relations; range joins |
| Hash Join | 3(b_R + b_S) | Equi-joins; large relations; enough memory |
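Plugging one set of statistics into all the formulas in the table makes the ranking concrete. A minimal sketch, using the same sample numbers as the GATE 2016 example below (b_R = 100, n_R = 1,000, b_S = 50, M = 5); the function names are invented:

```python
import math

def tuple_nlj(b_r, n_r, b_s):
    return b_r + n_r * b_s                            # per-tuple inner scan

def block_nlj(b_r, b_s, m):
    return b_r + math.ceil(b_r / (m - 2)) * b_s       # per-chunk inner scan

def hash_join_cost(b_r, b_s):
    return 3 * (b_r + b_s)                            # partition + build-probe

def merge_only(b_r, b_s):
    return b_r + b_s                                  # both inputs pre-sorted

b_r, n_r, b_s, m = 100, 1_000, 50, 5
print(tuple_nlj(b_r, n_r, b_s))   # 100 + 1000*50 = 50100
print(block_nlj(b_r, b_s, m))     # 100 + 34*50  = 1800
print(hash_join_cost(b_r, b_s))   # 3*150        = 450
print(merge_only(b_r, b_s))       # 150
```

The spread (50,100 vs 1,800 vs 450 vs 150 block I/Os) shows why the optimiser's algorithm choice matters far more than constant-factor tuning.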
6. Heuristic Optimisation
Heuristics transform relational algebra expressions to reduce intermediate result sizes before cost-based evaluation:
- Push selections down: Apply σ as early as possible to reduce tuple count early. Most important heuristic.
- Push projections down: Apply π early to reduce tuple width (fewer bytes to process).
- Most restrictive first: Apply the most selective condition (the one matching the fewest tuples) first.
- Avoid Cartesian product: Combine σ and × into ⋈ whenever possible.
- Left-deep join trees: Organise joins as left-deep trees to enable pipelining — the right input to each join is always a base relation (can be streamed from disk).
Original: σ_{Dept=5}(EMPLOYEE ⋈ DEPARTMENT)
If Dept is an attribute of EMPLOYEE:
Optimised: σ_{Dept=5}(EMPLOYEE) ⋈ DEPARTMENT
Result: Instead of joining all tuples then filtering, filter EMPLOYEE first (fewer rows to join).
7. GATE Examples
GATE 2016: Relation R has 1000 tuples in 100 blocks. Relation S has 500 tuples in 50 blocks. 5 buffer pages available. What is the cost of block nested loop join with R as outer?
Block nested loop join cost = b_R + ⌈b_R / (M-2)⌉ × b_S
M = 5, M-2 = 3 buffers for outer chunks
Cost = 100 + ⌈100/3⌉ × 50 = 100 + 34 × 50 = 100 + 1700 = 1800 block I/Os
With S as outer: 50 + ⌈50/3⌉ × 100 = 50 + 17 × 100 = 50 + 1700 = 1750 I/Os
Better to use S as outer (smaller relation).
GATE 2014: A relation with 10,000 tuples is stored in 1,000 blocks. An index has 3 levels. How many disk accesses are needed to retrieve a tuple by its primary key using the B+ tree index?
Using primary B+ tree index for equality on key attribute:
Cost = tree height + 1 data block access
Tree height = 3 levels → 3 I/Os for tree traversal + 1 for data block
Answer: 4 disk accesses
8. Common Mistakes
- Nested loop vs block nested loop: Basic nested loop iterates per tuple (cost = b_R + n_R × b_S); block nested loop iterates per block of outer relation (cost = b_R + ⌈b_R/(M-2)⌉ × b_S). Always use block nested loop when memory is given.
- Inner vs outer relation in join: For nested loop join, always put the SMALLER relation as the outer to minimise the multiplier term. For block nested loop, put smaller as outer to reduce ⌈b_R/(M-2)⌉.
- Hash join requires equi-join: Hash join only works for equi-join conditions (equality). It cannot handle range join conditions (A < B). Sort-merge join works for both.
- Sort-merge with pre-sorted input: If both relations are already sorted on the join attribute, sort-merge join costs only b_R + b_S (no sorting phase needed). Students often add sorting cost unnecessarily.
- Secondary index for range queries: Secondary indexes are efficient for equality but very inefficient for range queries on large result sets — each matching record may be in a different block, requiring one I/O per record (worst case).
9. Frequently Asked Questions
What are the steps in query processing in DBMS?
Four steps: (1) Parsing — SQL is parsed into a parse tree; (2) Translation — parse tree converted to relational algebra; (3) Optimisation — alternative plans generated and costed using statistics; cheapest chosen; (4) Execution — query engine runs the plan and returns results. The optimiser uses cost estimation (block I/O model with tuple counts, block counts, distinct value statistics) to compare plans.
What is the cost of nested loop join in DBMS?
Tuple nested loop: b_R + n_R × b_S. Block nested loop (with M buffers): b_R + ⌈b_R/(M-2)⌉ × b_S. If the outer relation fits entirely in M-2 buffers: cost reduces to b_R + b_S. Always put the smaller relation as the outer to minimise cost. For GATE: use block nested loop formula when buffer count M is given.
What is heuristic optimisation in DBMS?
Heuristic optimisation applies transformation rules to relational algebra expressions before cost-based analysis. The key heuristics are: push selections as low as possible in the tree (reduces intermediate size); push projections down (reduces tuple width); apply most selective operations first; replace Cartesian products with joins; use left-deep join trees for pipelining. These rules reduce the search space for cost-based optimisation.