CREATE TABLE A ( id INT PRIMARY KEY , MORE EXAMPLES val INT NOT NULL ); • Ignoring Projections SELECT * FROM A; • Merging Predicates SELECT * FROM A WHERE val BETWEEN 1 AND 100 OR val BETWEEN 50 AND 150; Source: Lukas Eder GT 8803 // Fall 2019 45
CREATE TABLE A ( id INT PRIMARY KEY , MORE EXAMPLES val INT NOT NULL ); • Ignoring Projections SELECT * FROM A; • Merging Predicates SELECT * FROM A WHERE val BETWEEN 1 AND 150; Source: Lukas Eder GT 8803 // Fall 2019 46
RELATIONAL ALGEBRA EQUIVALENCES • Joins: – Commutative, associative R ⋈ S = S ⋈ R (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T) – How many different orderings are there for an n - way join? GT 8803 // Fall 2019 47
RELATIONAL ALGEBRA EQUIVALENCES • How many different orderings are there for an n -way join? • Catalan number ≈4 n – Exhaustive enumeration will be too slow. • We’ll see in a second how an optimizer limits the search space. GT 8803 // Fall 2019 48
PLAN COST ESTIMATION 49 GT 8803 // Fall 2018
COST ESTIMATION • How long will a query take? – CPU: Small cost; tough to estimate – Disk: # of block transfers – Memory: Amount of DRAM used • How many tuples will be read/written? • What statistics do we need to keep? GT 8803 // Fall 2019 50
STATISTICS • The DBMS stores internal statistics about tables, attributes, and indexes in its internal catalog. • Different systems update them at different times. • Manual invocations: – Postgres/SQLite: ANALYZE – SQL Server: UPDATE STATISTICS GT 8803 // Fall 2019 51
STATISTICS • For each relation R , the DBMS maintains the following information: – N R : Number of tuples in R . – V(A,R) : Number of distinct values for attribute A . GT 8803 // Fall 2019 52
DERIVABLE STATISTICS GT 8803 // Fall 2019 53
DERIVABLE STATISTICS • The selection cardinality SC(A,R) is the average number of records with a value for an attribute A given N R / V(A,R) • Note that this assumes data uniformity. – 10,000 students, 10 colleges – how many students in SCS? GT 8803 // Fall 2019 54
SELECTION STATISTICS GT 8803 // Fall 2019 55
SELECTION STATISTICS SELECT * FROM people WHERE id = 123 GT 8803 // Fall 2019 56
SELECTION STATISTICS • Equality predicates on unique keys are easy to estimate. SELECT * FROM people WHERE id = 123 • What about more complex predicates? What is their selectivity? SELECT * FROM people WHERE val > 1000 SELECT * FROM people WHERE age = 30 AND status = 'Lit' GT 8803 // Fall 2019 57
COMPLEX PREDICATES • The selectivity ( sel ) of a predicate P is the fraction of tuples that qualify. • Formula depends on type of predicate: – Equality – Range – Negation – Conjunction – Disjunction GT 8803 // Fall 2019 58
COMPLEX PREDICATES • The selectivity ( sel ) of a predicate P is the fraction of tuples that qualify. • Formula depends on type of predicate: – Equality – Range – Negation – Conjunction – Disjunction GT 8803 // Fall 2019 59
SELECTIONS – COMPLEX PREDICATES SELECT * FROM people WHERE age = 2 GT 8803 // Fall 2019 60
SELECTIONS – COMPLEX PREDICATES • Assume that V(age,people) has five distinct values (0–4) and N R = 5 SELECT * FROM people WHERE age = 2 • Equality Predicate: A=constant – sel(A=constant) = SC(P) / N R – Example: sel(age=2) = GT 8803 // Fall 2019 61
SELECTIONS – COMPLEX PREDICATES • Assume that V(age,people) has five distinct values (0–4) and N R = 5 SELECT * FROM people WHERE age = 2 • Equality Predicate: A=constant – sel(A=constant) = SC(P) / N R – Example: sel(age=2) = count 0 1 2 3 4 age GT 8803 // Fall 2019 62
SELECTIONS – COMPLEX PREDICATES • Assume that V(age,people) has five distinct values (0–4) and N R = 5 SELECT * FROM people WHERE age = 2 • Equality Predicate: A=constant – sel(A=constant) = SC(P) / N R – Example: sel(age=2) = count V(age,people)=5 0 1 2 3 4 age GT 8803 // Fall 2019 63
SELECTIONS – COMPLEX PREDICATES • Assume that V(age,people) has five distinct values (0–4) and N R = 5 SELECT * FROM people WHERE age = 2 • Equality Predicate: A=constant – sel(A=constant) = SC(P) / N R – Example: sel(age=2) = SC(age=2)=1 count V(age,people)=5 0 1 2 3 4 age GT 8803 // Fall 2019 64
SELECTIONS – COMPLEX PREDICATES • Assume that V(age,people) has five distinct values (0–4) and N R = 5 SELECT * FROM people WHERE age = 2 • Equality Predicate: A=constant – sel(A=constant) = SC(P) / N R – Example: sel(age=2) = 1/5 SC(age=2)=1 count V(age,people)=5 0 1 2 3 4 age GT 8803 // Fall 2019 65
SELECTIONS – COMPLEX PREDICATES • Range Query: – sel(A>=a) = (A max –a) / (A max –A min ) SELECT * FROM people – Example: sel(age >= 2) WHERE age >= 2 count 0 1 2 3 4 age GT 8803 // Fall 2019 66
SELECTIONS – COMPLEX PREDICATES • Range Query: – sel(A>=a) = (A max –a) / (A max –A min ) SELECT * FROM people – Example: sel(age >= 2) WHERE age >= 2 count 0 1 2 3 4 age GT 8803 // Fall 2019 67
SELECTIONS – COMPLEX PREDICATES • Range Query: – sel(A>=a) = (A max –a) / (A max –A min ) SELECT * FROM people – Example: sel(age >= 2) WHERE age >= 2 age min = 0 age max = 4 count 0 1 2 3 4 age GT 8803 // Fall 2019 68
SELECTIONS – COMPLEX PREDICATES • Range Query: – sel(A>=a) = (A max –a) / (A max –A min ) SELECT * FROM people – Example: sel(age >= 2) WHERE age >= 2 = (4–2) / (4–0) = 1/2 age min = 0 age max = 4 count 0 1 2 3 4 age GT 8803 // Fall 2019 69
SELECTIONS – COMPLEX PREDICATES SELECT * FROM people WHERE age != 2 count 0 1 2 3 4 age GT 8803 // Fall 2019 70
SELECTIONS – COMPLEX PREDICATES SELECT * FROM people WHERE age != 2 SC(age=2)=1 count 0 1 2 3 4 age GT 8803 // Fall 2019 71
SELECTIONS – COMPLEX PREDICATES SELECT * FROM people WHERE age != 2 SC(age!=2)=2 SC(age!=2)=2 count 0 1 2 3 4 age GT 8803 // Fall 2019 72
SELECTIONS – COMPLEX PREDICATES SELECT * FROM people = 1 – (1/5) = 4/5 WHERE age != 2 SC(age!=2)=2 SC(age!=2)=2 count 0 1 2 3 4 age GT 8803 // Fall 2019 73
SELECTIONS – COMPLEX PREDICATES • Negation Query: – sel(not P) = 1 – sel(P) SELECT * FROM people – Example: sel(age != 2) = 1 – (1/5) = 4/5 WHERE age != 2 • Observation: Selectivity ≈ Probability SC(age!=2)=2 SC(age!=2)=2 count 0 1 2 3 4 age GT 8803 // Fall 2019 74
SELECTIONS – COMPLEX PREDICATES • Conjunction: – sel(P1 ⋀ P2) = sel(P1) · sel(P2) SELECT * FROM people – sel(age=2 ⋀ name LIKE 'A%') WHERE age = 2 AND name LIKE 'A%' • This assumes that the predicates are independent. P 1 P 2 GT 8803 // Fall 2019 75
SELECTIONS – COMPLEX PREDICATES • Conjunction: – sel(P1 ⋀ P2) = sel(P1) · sel(P2) SELECT * FROM people – sel(age=2 ⋀ name LIKE 'A%') WHERE age = 2 AND name LIKE 'A%' • This assumes that the predicates are independent. P 1 P 2 GT 8803 // Fall 2019 76
SELECTIONS – COMPLEX PREDICATES • Conjunction: – sel(P1 ⋀ P2) = sel(P1) · sel(P2) SELECT * FROM people – sel(age=2 ⋀ name LIKE 'A%') WHERE age = 2 AND name LIKE 'A%' • This assumes that the predicates are independent. P 1 P 2 GT 8803 // Fall 2019 77
SELECTIONS – COMPLEX PREDICATES • Disjunction: – sel(P1 ⋁ P2) SELECT * FROM people = sel(P1)+ sel(P2)–sel(P1 ⋁ P2) WHERE age = 2 = sel(P1)+ sel(P2)–sel(P1)· sel(P2) OR name LIKE 'A%' – sel(age=2 OR name LIKE 'A%') • This again assumes that the P 1 P 2 selectivities are independent. GT 8803 // Fall 2019 78
SELECTIONS – COMPLEX PREDICATES • Disjunction: – sel(P1 ⋁ P2) SELECT * FROM people = sel(P1)+ sel(P2)–sel(P1 ⋁ P2) WHERE age = 2 = sel(P1)+ sel(P2)–sel(P1)· sel(P2) OR name LIKE 'A%' – sel(age=2 OR name LIKE 'A%') • This again assumes that the P 1 P 2 selectivities are independent. GT 8803 // Fall 2019 79
RESULT SIZE ESTIMATION FOR JOINS • Given a join of R and S , what is the range of possible result sizes in # of tuples? • In other words, for a given tuple of R , how many tuples of S will it match? GT 8803 // Fall 2019 80
RESULT SIZE ESTIMATION FOR JOINS • General case: R cols ⋂ S cols ={A} where A is not a key for either table. – Match each R -tuple with S -tuples: estSize ≈ N R · N S / V(A,S) – Symmetrically, for S : estSize ≈ N R · N S / V(A,R) • Overall: – estSize ≈ N R · N S / max({V(A,S), V(A,R)}) GT 8803 // Fall 2019 81
COST ESTIMATIONS • Our formulas are nice but we assume that data values are uniformly distributed. Uniform Approximation 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 GT 8803 // Fall 2019 82
COST ESTIMATIONS • Our formulas are nice but we assume that data values are uniformly distributed. Uniform Approximation 10 # of occurrences 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Distinct values of attribute GT 8803 // Fall 2019 83
COST ESTIMATIONS • Our formulas are nice but we assume that data values are uniformly distributed. Non-Uniform Approximation 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 GT 8803 // Fall 2019 84
COST ESTIMATIONS • Our formulas are nice but we assume that data values are uniformly distributed. Non-Uniform Approximation 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Bucket #1 Bucket #2 Bucket #3 Bucket #4 Bucket #5 Count=8 Count=4 Count=15 Count=3 Count=14 GT 8803 // Fall 2019 85
COST ESTIMATIONS • Our formulas are nice but we assume that data values are uniformly distributed. Non-Uniform Approximation 15 10 5 Bucket Ranges 0 1-3 4-6 7-9 10-12 13-15 Bucket #1 Bucket #2 Bucket #3 Bucket #4 Bucket #5 Count=8 Count=4 Count=15 Count=3 Count=14 GT 8803 // Fall 2019 86
HISTOGRAMS WITH QUANTILES • A histogram type wherein the "spread" of each bucket is same. Equi-width Histogram (Quantiles) 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 GT 8803 // Fall 2019 87
HISTOGRAMS WITH QUANTILES • A histogram type wherein the "spread" of each bucket is same. Equi-width Histogram (Quantiles) 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Bucket #1 Bucket #2 Bucket #3 Bucket #4 Count=12 Count=12 Count=9 Count=12 GT 8803 // Fall 2019 88
HISTOGRAMS WITH QUANTILES • A histogram type wherein the "spread" of each bucket is same. Equi-width Histogram (Quantiles) 15 10 5 0 1-5 6-8 9-13 14-15 GT 8803 // Fall 2019 89
HISTOGRAMS WITH QUANTILES • A histogram type wherein the "spread" of each bucket is same. Equi-width Histogram (Quantiles) 15 10 5 0 1-5 6-8 9-13 14-15 GT 8803 // Fall 2019 90
SAMPLING • Modern DBMSs also collect samples from SELECT AVG (age) FROM people tables to estimate selectivities. WHERE age > 50 • Update samples when the underlying tables id name age status 1001 Obama 56 Rested changes significantly. 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk 1005 Andy 37 Lit ⋮ 1 billion tuples GT 8803 // Fall 2019 91
SAMPLING • Modern DBMSs also collect samples from SELECT AVG (age) FROM people tables to estimate selectivities. WHERE age > 50 • Update samples when the underlying tables id name age status 1001 Obama 56 Rested changes significantly. 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk 1005 Andy 37 Lit ⋮ 1 billion tuples GT 8803 // Fall 2019 92
SAMPLING • Modern DBMSs also collect samples from SELECT AVG (age) FROM people tables to estimate selectivities. WHERE age > 50 • Update samples when the underlying tables id name age status 1001 Obama 56 Rested changes significantly. 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk Table Sample 1005 Andy 37 Lit ⋮ 1001 Obama 56 Rested 1003 Tupac 25 Dead 1 billion tuples 1005 Andy 37 Lit GT 8803 // Fall 2019 93
SAMPLING • Modern DBMSs also collect samples from SELECT AVG (age) FROM people tables to estimate selectivities. WHERE age > 50 • Update samples when the underlying tables id name age status 1001 Obama 56 Rested changes significantly. 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk Table Sample 1005 Andy 37 Lit ⋮ 1001 Obama 56 Rested sel(age>50) = 1003 Tupac 25 Dead 1 billion tuples 1005 Andy 37 Lit GT 8803 // Fall 2019 94
SAMPLING • Modern DBMSs also collect samples from SELECT AVG (age) FROM people tables to estimate selectivities. WHERE age > 50 • Update samples when the underlying tables id name age status 1001 Obama 56 Rested changes significantly. 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk Table Sample 1005 Andy 37 Lit ⋮ 1001 Obama 56 Rested sel(age>50) = 1003 Tupac 25 Dead 1 billion tuples 1005 Andy 37 Lit GT 8803 // Fall 2019 95
SAMPLING • Modern DBMSs also collect samples from SELECT AVG (age) FROM people tables to estimate selectivities. WHERE age > 50 • Update samples when the underlying tables id name age status 1001 Obama 56 Rested changes significantly. 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk Table Sample 1005 Andy 37 Lit ⋮ 1001 Obama 56 Rested sel(age>50) = 1/3 1003 Tupac 25 Dead 1 billion tuples 1005 Andy 37 Lit GT 8803 // Fall 2019 96
OBSERVATION • Now that we can (roughly) estimate the selectivity of predicates, what can we actually do with them? GT 8803 // Fall 2019 97
PLAN ENUMERATION 98 GT 8803 // Fall 2018
QUERY OPTIMIZATION • After performing rule-based rewriting, the DBMS will enumerate different plans for the query and estimate their costs. – Single table. – Multiple tables. • It chooses the best plan it has seen for the query after exhausting all plans or some timeout. GT 8803 // Fall 2019 99
SINGLE-TABLE QUERY PLANNING • Pick the best access method. – Sequential Scan – Binary Search (clustered indexes) – Index Scan • Simple heuristics are often good enough for this. • OLTP queries are especially easy. GT 8803 // Fall 2019 100
Recommend
More recommend