Slides

S1: Ch. 15: Query Processing and Optimization

Learning Goals:
- ...Understand the factors affecting response time of a query, &
- ...Concepts behind PLAN_TABLE, EXPLAIN statement
Query Processing Steps
Strategies for relational operations
- Select (Point, Range), Join, set operations
- Cost models
Query Optimization Module
- Input: e.g. Query Trees / Graphs
- Output: e.g. Execution plans
- Algorithms: e.g. Dynamic Programming

S2: Query Processing

Process is similar to compilation (Fig. 15.1, pp 494)
- SQL Query -- > [scanner, parser] -- > query tree/graph
- ...-- > [OPTIMIZER] -- > efficient execution plan
- ...-- > [code generator] -- > code to execute the plan
- ...-- > [run-time support] -- > query result
We will FOCUS on the optimizer module
- Output:: Execution plan = set of selected access routines
- ...from possible access routines for each relational operator
- Input: Query tree or query graph
- Algorithms: Systematic Enumeration/ Dynamic Programming
- ...Algebraic optimization, Other heuristics
- ...Semantic Optimization, Source to Source transforms
Access Routines for relational operations & cost models
- Select (Point, Range): scan, binary search, index-search
- Join: nested loop, sort-merge, hybrid hash

S3: 15.8.1: Cost Models for Access Routines

Cost Metrics = #block accesses (OR response time in seconds)
- focus on I/O cost, not CPU/communication
I/O Cost DEPENDS ON
- Search criteria: point/range query on ordering /other fields
- File Structures: heap, sorted, hashed.
- Index-type: primary, clustering, secondary, B+tree, multi-level,...
- Other factors, e.g. buffering, disk placement, materialization,
- ...Overflow / free-space mgmt, spanned/unspanned blocking, ...
Parameters of Cost Models
- s = selectivity = fraction of tuples selected = |result| / |table|
- S(field) = selection cardinality = r(d) / (#distinct values of field)
- B, bd, bi(j) = block size, #blocks in datafile/ at level j of index
- r(d) = #records in data file
- u, bfr(d), bfr(i) = utilization, blocking factors for datafile/index
- fo, d = fanout and depth of index
- R, P, Pr = record size, tree pointersize , data record pointer size
- e = Pr[a record in overflow area] * E[overflow chain length]
Q? Compare these parameters to those in Oracle system catalog.

S4: 15.3: Access Routines for Relational Operators

Access routine - is an execution strategy for a relational operator
- May use available file-structures and indexes
- Specifies algorithms , and applicability
Point Query: (i) Linear, (ii) Binary, (iii) Hash, (iv) Indexed search
Range Query: (1) Linear search , (2) Binary search then scan
- ...3. Find 1st record in, then scan datafile (Primary/Clustered index)
- ...4. Find 1st record in, then scan index leafs (Secondary index/B+tree)
Conjunctive selection - 1. Use composite index
- ...2. Process one condition first followed by scan of result
- ...3. Intersect (record pointers retrieved by different conditions)
Q? How do we choose an appropriate plan?
- Rule based models
- Cost based models
Ex. Provide algebraic formula for access routines
- Point Search (UNIQUE field): bd/2 ; log(bd); 1+ e; d + 1;
- ...not unique = > + S/bfr(d) to sorted/hashed file; bd for heap
- Range Search: bd or bd/2 + s*bd; log(bd) + s*bd; d + 1 + s*bd;
- ...secondary index : d + s*[bi(leaf) + r(d)] worst case

S5: 15.3.2 + 15.8.4: JOIN strategies, Cost models

Focus on two way equi-join:: join(R, S, R.A = S.B)
Join-selectivity= js(R.A=S.B)=|join result|/(|R|*|S|); |X|=#records in X
- = > Join result has b(RS)= (js*|R|*|S| / bfr(RS)) blocks
(1)Nested loop: For each block in R, (a) Read matching record in S
- ...using scan or indexed search, and (b) Write partial result
- I/O Cost = bR + (bR * bS) + b(RS)
- ...Indexed search cost= f(available index(S.B)), see Page 499
- ...= bR + [|R|*(d(S.B) + S(S.B))] + b(RS), secondary index(S.B)
(2) Sort-Merge: (i) If Needed, Sort R on R.A, Sort S on S.B
- ...(ii) Scan R and S block by block [See Fig. 15.3(a), pp 503]
- Very efficient if R and S already sorted, Cost = bR + bS + b(RS)
- Sorting = > Add k*{bR * log(bR) + bS * log(bS)}
(3) Hash: (i) hash records of R, S into a common hash file (CHF)
- ...(ii) scan CHF to compute and write results
- Cost= 3*(bR+bS) + b(RS), w/ buffering (hybrid hash, pp507, para 3)

S6: Dominance Analysis of Access Routines

Ex. Compare scan and secondary-index for Range queries?
- Heap file Scan: bd; < = Secondary index: d + s*[bi(leaf) + r(d)]
- ...OR selectivity(s) > = (bd-d) / [bi(leaf)+rd] > bd/rd = (1/bfr(d))
- example: bfr(d) = 10; selectivity = 0.15 ;
Ex.Nested loop: Which relation should be in outer loop ?
- Smaller one = > reduces cost
Q? Which strategy wins for join(R,S), IF (bR > bS) and
- (1) primary index on R.A in R, S.B in S (depth = x)
- ...Let #record/distinct value of join field = a
- ...(bR = bS) = > sort merge if bR = bS
- ...(bR > a*x*bS) = > nested loop with index-scan
- (2) Heap Files , bR = bS :: hash-join (log(bR) > 10)
- (3) Hashed Files R on R.A, S on S.B,
- ...(bR = bS) = > hash join if bR = bS , log(bR) > 10
- ...(bR > 4*bfr(inner loop)*bS) = > nested loop with index-scan ?

S7: Dominance Analysis of Access Routines (Contd.)

Q? Which strategy wins for join(R,S), IF (bR > bS) and
- (4) primary index on R.A, S hashed on primary key S.B,
- ...bS = bR, bfr(R) = 3, index-depth = 1, (Hint: unique matches)
- ...Nested loop (inner = indexed search on R) cost = 4bR + b(RS)
- (5)bS < |memory| << bR ; cost = bR + bS + B(RS) for all 3
- ...1. buffer + index/sort/hash S in memory;
- ...2. For each block in R, bring the block and write result
Q? Which strategies can be used for non-equijoins?
- ...for select-project-equijoin?
Summary comparison of join strategies
- Nested loop is a dark horse, watch out!
- ...e.g. (bR/bS) >> 1, or bfr >> 1
- ...OR only one file has good search structure
- Both files sorted, sort-merge can win
- ...unless (bR/bS) >> 1, or bfr >> 1
- Both files have No good search structure, Hybrid hash can win
- ...unless (bR/bS) >> 1, or bfr >> 1

S8: 15.7: Heuristic Query Transformations

Output: Sequence or Set of access routines
Input:(i) Query Tree: syntax tree for eqv. rel. algebra query
- ...Leafs = input relations :: Non-leafs = rel. algebra operations
- Initial canonical tree: Easy to generate from parser (Fig. 15.5(a))
- ...cartesian product of all rel.+ 1 select + 1 project
- ...Ex. Fig. 15.5 (b) pp 516 with Q (pp 515)
(ii)Query graph: constraints in tuple calculus exp. (Fig. 15.4/pp 513)
- Nodes = relations (tuple variables) OR constants (double circle)
- ...Edges = join conditions
- We will not use it. Came from Ingres/QUEL optimizer.
ALGORITHM 1: Source to Source Transform (Ch 15.7.2, pp 520-1)
- Semantic transformation rules, e.g. integrity constraints (IC)
- ...Ex. Q(pp 533): Find Emps who earn more than their supervisors
- ...IC: No employee earns more than his/her supervisor.
- Implemented by Thm. prover = > exponential in #applicable rules
- Alternative is hand recoding

S9: Query Optimization Algorithms (Contd.)

ALGORITHM 2: Algebraic optimization by transforming query trees
- Start w/ Initial canonical tree
- Use laws of relational algebra (Ex. rule 4, pp 519)
- General steps (pp 520): apply those operations first
- ...which reduce the size of intermediate results
Ex. Trace: Fig. 15.5 (pp 516-518)
- Fig. 15.5(b) move selects down the tree (close to table)
- Fig. 15.5(c) more restrictive select 1st (pname = 'Aquarium')
- ...commute cartesian products, rearrange select conditions
- Fig. 15.5(d): Combine (product, select) into joins
- Fig. 15.5(e): Move projects down the tree, add project
- ...to drop fields not needed for result (and join/selects)
ALGORITHM 3: Dynamic Programming w/ cost models
- 1. Choose an ordering of join operation (e.g. left to right)
- 2. Given ordering, enumerate combination of access routines
- ...choose low-cost combination, 1 access routine per rel. op.
- Used in commercial systems, Step 2 is cubic in #relations,
- ...Step 1. is exponential and optimizers scrimp here!