data analytics using deep learning
play

DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2019 // JOY - PowerPoint PPT Presentation

DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2019 // JOY ARULRAJ L E C T U R E # 0 9 : Q U E R Y O P T I M I Z A T I O N administrivia Reminders Assignment 1: postponed to next Monday Sign up for discussion slots on Thursday


  1. CREATE TABLE A ( id INT PRIMARY KEY , MORE EXAMPLES val INT NOT NULL ); • Ignoring Projections SELECT * FROM A; • Merging Predicates SELECT * FROM A WHERE val BETWEEN 1 AND 100 OR val BETWEEN 50 AND 150; Source: Lukas Eder GT 8803 // Fall 2019 45

  2. CREATE TABLE A ( id INT PRIMARY KEY , MORE EXAMPLES val INT NOT NULL ); • Ignoring Projections SELECT * FROM A; • Merging Predicates SELECT * FROM A WHERE val BETWEEN 1 AND 150; Source: Lukas Eder GT 8803 // Fall 2019 46

  3. RELATIONAL ALGEBRA EQUIVALENCES • Joins: – Commutative, associative R ⋈ S = S ⋈ R (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T) – How many different orderings are there for an n - way join? GT 8803 // Fall 2019 47

  4. RELATIONAL ALGEBRA EQUIVALENCES • How many different orderings are there for an n -way join? • Catalan number ≈4 n – Exhaustive enumeration will be too slow. • We’ll see in a second how an optimizer limits the search space. GT 8803 // Fall 2019 48

  5. PLAN COST ESTIMATION 49 GT 8803 // Fall 2018

  6. COST ESTIMATION • How long will a query take? – CPU: Small cost; tough to estimate – Disk: # of block transfers – Memory: Amount of DRAM used • How many tuples will be read/written? • What statistics do we need to keep? GT 8803 // Fall 2019 50

  7. STATISTICS • The DBMS stores internal statistics about tables, attributes, and indexes in its internal catalog. • Different systems update them at different times. • Manual invocations: – Postgres/SQLite: ANALYZE – SQL Server: UPDATE STATISTICS GT 8803 // Fall 2019 51

  8. STATISTICS • For each relation R , the DBMS maintains the following information: – N R : Number of tuples in R . – V(A,R) : Number of distinct values for attribute A . GT 8803 // Fall 2019 52

  9. DERIVABLE STATISTICS GT 8803 // Fall 2019 53

  10. DERIVABLE STATISTICS • The selection cardinality SC(A,R) is the average number of records with a value for an attribute A given N R / V(A,R) • Note that this assumes data uniformity. – 10,000 students, 10 colleges – how many students in SCS? GT 8803 // Fall 2019 54

  11. SELECTION STATISTICS GT 8803 // Fall 2019 55

  12. SELECTION STATISTICS SELECT * FROM people WHERE id = 123 GT 8803 // Fall 2019 56

  13. SELECTION STATISTICS • Equality predicates on unique keys are easy to estimate. SELECT * FROM people WHERE id = 123 • What about more complex predicates? What is their selectivity? SELECT * FROM people WHERE val > 1000 SELECT * FROM people WHERE age = 30 AND status = 'Lit' GT 8803 // Fall 2019 57

  14. COMPLEX PREDICATES • The selectivity ( sel ) of a predicate P is the fraction of tuples that qualify. • Formula depends on type of predicate: – Equality – Range – Negation – Conjunction – Disjunction GT 8803 // Fall 2019 58

  15. COMPLEX PREDICATES • The selectivity ( sel ) of a predicate P is the fraction of tuples that qualify. • Formula depends on type of predicate: – Equality – Range – Negation – Conjunction – Disjunction GT 8803 // Fall 2019 59

  16. SELECTIONS – COMPLEX PREDICATES SELECT * FROM people WHERE age = 2 GT 8803 // Fall 2019 60

  17. SELECTIONS – COMPLEX PREDICATES • Assume that V(age,people) has five distinct values (0–4) and N R = 5 SELECT * FROM people WHERE age = 2 • Equality Predicate: A=constant – sel(A=constant) = SC(P) / N R – Example: sel(age=2) = GT 8803 // Fall 2019 61

  18. SELECTIONS – COMPLEX PREDICATES • Assume that V(age,people) has five distinct values (0–4) and N R = 5 SELECT * FROM people WHERE age = 2 • Equality Predicate: A=constant – sel(A=constant) = SC(P) / N R – Example: sel(age=2) = count 0 1 2 3 4 age GT 8803 // Fall 2019 62

  19. SELECTIONS – COMPLEX PREDICATES • Assume that V(age,people) has five distinct values (0–4) and N R = 5 SELECT * FROM people WHERE age = 2 • Equality Predicate: A=constant – sel(A=constant) = SC(P) / N R – Example: sel(age=2) = count V(age,people)=5 0 1 2 3 4 age GT 8803 // Fall 2019 63

  20. SELECTIONS – COMPLEX PREDICATES • Assume that V(age,people) has five distinct values (0–4) and N R = 5 SELECT * FROM people WHERE age = 2 • Equality Predicate: A=constant – sel(A=constant) = SC(P) / N R – Example: sel(age=2) = SC(age=2)=1 count V(age,people)=5 0 1 2 3 4 age GT 8803 // Fall 2019 64

  21. SELECTIONS – COMPLEX PREDICATES • Assume that V(age,people) has five distinct values (0–4) and N R = 5 SELECT * FROM people WHERE age = 2 • Equality Predicate: A=constant – sel(A=constant) = SC(P) / N R – Example: sel(age=2) = 1/5 SC(age=2)=1 count V(age,people)=5 0 1 2 3 4 age GT 8803 // Fall 2019 65

  22. SELECTIONS – COMPLEX PREDICATES • Range Query: – sel(A>=a) = (A max –a) / (A max –A min ) SELECT * FROM people – Example: sel(age >= 2) WHERE age >= 2 count 0 1 2 3 4 age GT 8803 // Fall 2019 66

  23. SELECTIONS – COMPLEX PREDICATES • Range Query: – sel(A>=a) = (A max –a) / (A max –A min ) SELECT * FROM people – Example: sel(age >= 2) WHERE age >= 2 count 0 1 2 3 4 age GT 8803 // Fall 2019 67

  24. SELECTIONS – COMPLEX PREDICATES • Range Query: – sel(A>=a) = (A max –a) / (A max –A min ) SELECT * FROM people – Example: sel(age >= 2) WHERE age >= 2 age min = 0 age max = 4 count 0 1 2 3 4 age GT 8803 // Fall 2019 68

  25. SELECTIONS – COMPLEX PREDICATES • Range Query: – sel(A>=a) = (A max –a) / (A max –A min ) SELECT * FROM people – Example: sel(age >= 2) WHERE age >= 2 = (4–2) / (4–0) = 1/2 age min = 0 age max = 4 count 0 1 2 3 4 age GT 8803 // Fall 2019 69

  26. SELECTIONS – COMPLEX PREDICATES SELECT * FROM people WHERE age != 2 count 0 1 2 3 4 age GT 8803 // Fall 2019 70

  27. SELECTIONS – COMPLEX PREDICATES SELECT * FROM people WHERE age != 2 SC(age=2)=1 count 0 1 2 3 4 age GT 8803 // Fall 2019 71

  28. SELECTIONS – COMPLEX PREDICATES SELECT * FROM people WHERE age != 2 SC(age!=2)=2 SC(age!=2)=2 count 0 1 2 3 4 age GT 8803 // Fall 2019 72

  29. SELECTIONS – COMPLEX PREDICATES SELECT * FROM people = 1 – (1/5) = 4/5 WHERE age != 2 SC(age!=2)=2 SC(age!=2)=2 count 0 1 2 3 4 age GT 8803 // Fall 2019 73

  30. SELECTIONS – COMPLEX PREDICATES • Negation Query: – sel(not P) = 1 – sel(P) SELECT * FROM people – Example: sel(age != 2) = 1 – (1/5) = 4/5 WHERE age != 2 • Observation: Selectivity ≈ Probability SC(age!=2)=2 SC(age!=2)=2 count 0 1 2 3 4 age GT 8803 // Fall 2019 74

  31. SELECTIONS – COMPLEX PREDICATES • Conjunction: – sel(P1 ⋀ P2) = sel(P1) · sel(P2) SELECT * FROM people – sel(age=2 ⋀ name LIKE 'A%') WHERE age = 2 AND name LIKE 'A%' • This assumes that the predicates are independent. P 1 P 2 GT 8803 // Fall 2019 75

  32. SELECTIONS – COMPLEX PREDICATES • Conjunction: – sel(P1 ⋀ P2) = sel(P1) · sel(P2) SELECT * FROM people – sel(age=2 ⋀ name LIKE 'A%') WHERE age = 2 AND name LIKE 'A%' • This assumes that the predicates are independent. P 1 P 2 GT 8803 // Fall 2019 76

  33. SELECTIONS – COMPLEX PREDICATES • Conjunction: – sel(P1 ⋀ P2) = sel(P1) · sel(P2) SELECT * FROM people – sel(age=2 ⋀ name LIKE 'A%') WHERE age = 2 AND name LIKE 'A%' • This assumes that the predicates are independent. P 1 P 2 GT 8803 // Fall 2019 77

  34. SELECTIONS – COMPLEX PREDICATES • Disjunction: – sel(P1 ⋁ P2) SELECT * FROM people = sel(P1)+ sel(P2)–sel(P1 ⋁ P2) WHERE age = 2 = sel(P1)+ sel(P2)–sel(P1)· sel(P2) OR name LIKE 'A%' – sel(age=2 OR name LIKE 'A%') • This again assumes that the P 1 P 2 selectivities are independent. GT 8803 // Fall 2019 78

  35. SELECTIONS – COMPLEX PREDICATES • Disjunction: – sel(P1 ⋁ P2) SELECT * FROM people = sel(P1)+ sel(P2)–sel(P1 ⋁ P2) WHERE age = 2 = sel(P1)+ sel(P2)–sel(P1)· sel(P2) OR name LIKE 'A%' – sel(age=2 OR name LIKE 'A%') • This again assumes that the P 1 P 2 selectivities are independent. GT 8803 // Fall 2019 79

  36. RESULT SIZE ESTIMATION FOR JOINS • Given a join of R and S , what is the range of possible result sizes in # of tuples? • In other words, for a given tuple of R , how many tuples of S will it match? GT 8803 // Fall 2019 80

  37. RESULT SIZE ESTIMATION FOR JOINS • General case: R cols ⋂ S cols ={A} where A is not a key for either table. – Match each R -tuple with S -tuples: estSize ≈ N R · N S / V(A,S) – Symmetrically, for S : estSize ≈ N R · N S / V(A,R) • Overall: – estSize ≈ N R · N S / max({V(A,S), V(A,R)}) GT 8803 // Fall 2019 81

  38. COST ESTIMATIONS • Our formulas are nice but we assume that data values are uniformly distributed. Uniform Approximation 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 GT 8803 // Fall 2019 82

  39. COST ESTIMATIONS • Our formulas are nice but we assume that data values are uniformly distributed. Uniform Approximation 10 # of occurrences 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Distinct values of attribute GT 8803 // Fall 2019 83

  40. COST ESTIMATIONS • Our formulas are nice but we assume that data values are uniformly distributed. Non-Uniform Approximation 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 GT 8803 // Fall 2019 84

  41. COST ESTIMATIONS • Our formulas are nice but we assume that data values are uniformly distributed. Non-Uniform Approximation 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Bucket #1 Bucket #2 Bucket #3 Bucket #4 Bucket #5 Count=8 Count=4 Count=15 Count=3 Count=14 GT 8803 // Fall 2019 85

  42. COST ESTIMATIONS • Our formulas are nice but we assume that data values are uniformly distributed. Non-Uniform Approximation 15 10 5 Bucket Ranges 0 1-3 4-6 7-9 10-12 13-15 Bucket #1 Bucket #2 Bucket #3 Bucket #4 Bucket #5 Count=8 Count=4 Count=15 Count=3 Count=14 GT 8803 // Fall 2019 86

  43. HISTOGRAMS WITH QUANTILES • A histogram type wherein the "spread" of each bucket is same. Equi-width Histogram (Quantiles) 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 GT 8803 // Fall 2019 87

  44. HISTOGRAMS WITH QUANTILES • A histogram type wherein the "spread" of each bucket is same. Equi-width Histogram (Quantiles) 10 8 6 4 2 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Bucket #1 Bucket #2 Bucket #3 Bucket #4 Count=12 Count=12 Count=9 Count=12 GT 8803 // Fall 2019 88

  45. HISTOGRAMS WITH QUANTILES • A histogram type wherein the "spread" of each bucket is same. Equi-width Histogram (Quantiles) 15 10 5 0 1-5 6-8 9-13 14-15 GT 8803 // Fall 2019 89

  46. HISTOGRAMS WITH QUANTILES • A histogram type wherein the "spread" of each bucket is same. Equi-width Histogram (Quantiles) 15 10 5 0 1-5 6-8 9-13 14-15 GT 8803 // Fall 2019 90

  47. SAMPLING • Modern DBMSs also collect samples from SELECT AVG (age) FROM people tables to estimate selectivities. WHERE age > 50 • Update samples when the underlying tables id name age status 1001 Obama 56 Rested changes significantly. 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk 1005 Andy 37 Lit ⋮ 1 billion tuples GT 8803 // Fall 2019 91

  48. SAMPLING • Modern DBMSs also collect samples from SELECT AVG (age) FROM people tables to estimate selectivities. WHERE age > 50 • Update samples when the underlying tables id name age status 1001 Obama 56 Rested changes significantly. 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk 1005 Andy 37 Lit ⋮ 1 billion tuples GT 8803 // Fall 2019 92

  49. SAMPLING • Modern DBMSs also collect samples from SELECT AVG (age) FROM people tables to estimate selectivities. WHERE age > 50 • Update samples when the underlying tables id name age status 1001 Obama 56 Rested changes significantly. 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk Table Sample 1005 Andy 37 Lit ⋮ 1001 Obama 56 Rested 1003 Tupac 25 Dead 1 billion tuples 1005 Andy 37 Lit GT 8803 // Fall 2019 93

  50. SAMPLING • Modern DBMSs also collect samples from SELECT AVG (age) FROM people tables to estimate selectivities. WHERE age > 50 • Update samples when the underlying tables id name age status 1001 Obama 56 Rested changes significantly. 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk Table Sample 1005 Andy 37 Lit ⋮ 1001 Obama 56 Rested sel(age>50) = 1003 Tupac 25 Dead 1 billion tuples 1005 Andy 37 Lit GT 8803 // Fall 2019 94

  51. SAMPLING • Modern DBMSs also collect samples from SELECT AVG (age) FROM people tables to estimate selectivities. WHERE age > 50 • Update samples when the underlying tables id name age status 1001 Obama 56 Rested changes significantly. 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk Table Sample 1005 Andy 37 Lit ⋮ 1001 Obama 56 Rested sel(age>50) = 1003 Tupac 25 Dead 1 billion tuples 1005 Andy 37 Lit GT 8803 // Fall 2019 95

  52. SAMPLING • Modern DBMSs also collect samples from SELECT AVG (age) FROM people tables to estimate selectivities. WHERE age > 50 • Update samples when the underlying tables id name age status 1001 Obama 56 Rested changes significantly. 1002 Kanye 40 Weird 1003 Tupac 25 Dead 1004 Bieber 23 Crunk Table Sample 1005 Andy 37 Lit ⋮ 1001 Obama 56 Rested sel(age>50) = 1/3 1003 Tupac 25 Dead 1 billion tuples 1005 Andy 37 Lit GT 8803 // Fall 2019 96

  53. OBSERVATION • Now that we can (roughly) estimate the selectivity of predicates, what can we actually do with them? GT 8803 // Fall 2019 97

  54. PLAN ENUMERATION 98 GT 8803 // Fall 2018

  55. QUERY OPTIMIZATION • After performing rule-based rewriting, the DBMS will enumerate different plans for the query and estimate their costs. – Single table. – Multiple tables. • It chooses the best plan it has seen for the query after exhausting all plans or some timeout. GT 8803 // Fall 2019 99

  56. SINGLE-TABLE QUERY PLANNING • Pick the best access method. – Sequential Scan – Binary Search (clustered indexes) – Index Scan • Simple heuristics are often good enough for this. • OLTP queries are especially easy. GT 8803 // Fall 2019 100

Recommend


More recommend