  1. Wentao Wu [1], Yun Chi [2], Shenghuo Zhu [2], Junichi Tatemura [2], Hakan Hacigumus [2], Jeffrey Naughton [1]. [1] Dept. of Computer Sciences, University of Wisconsin-Madison; [2] NEC Laboratories America

  2. Motivation: Database as a service (DaaS). A user submits queries to a service provider, which hosts the database under a Service Level Agreement (SLA). How can we predict the execution time of a query before it runs?

  3. Applications
     - Admission control: run this query or not?
     - Query scheduling: if we decide to run it, when?
     - Progress monitoring: how long should we wait if something is wrong?
     - System sizing: how much hardware does it require to run in the given time?

  4. Use Optimizers' Cost Estimates?
     - Query optimizers have cost estimates for queries. Can we just use them?
     - Previous work ([Ganapathi ICDE'09], [Akdere ICDE'12]): query optimizers' cost estimates are unusable as-is (average error around 120%; see Fig. 5 of [Akdere ICDE'12]).
     - Naïve scaling: predict the execution time T by scaling the cost estimate C, i.e., T = a · C.

  5. Why Does Naïve Cost Unit Value Scaling Fail?
     - PostgreSQL's cost model: C = n_s·c_s + n_r·c_r + n_t·c_t + n_i·c_i + n_o·c_o, with default cost units c_s (seq_page_cost) = 1.0, c_r (rand_page_cost) = 4.0, c_t (cpu_tuple_cost) = 0.01, c_i (cpu_index_tuple_cost) = 0.005, c_o (cpu_operator_cost) = 0.0025.
     - Naïve scaling rewrites to T = a·C = c_s'·(n_s + n_r·(c_r/c_s) + n_t·(c_t/c_s) + n_i·(c_i/c_s) + n_o·(c_o/c_s)), where c_s' = a·c_s = a·1.0 = a is the only factor the scaling makes correct.
     - The assumptions required (for naïve scaling to work): the ratios between the c's are correct, and the n's are correct.
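A minimal sketch of this arithmetic (Python, not from the paper): the PostgreSQL cost formula with its default c's, and a naïve-scaling prediction obtained by multiplying that cost by a single factor a. The counts in n_example and the factor a are made-up numbers, purely for illustration.

```python
# PostgreSQL's default cost units and the cost formula C = sum over k of n_k * c_k.
DEFAULT_C = {"c_s": 1.0, "c_r": 4.0, "c_t": 0.01, "c_i": 0.005, "c_o": 0.0025}

def optimizer_cost(n, c=DEFAULT_C):
    # C = n_s*c_s + n_r*c_r + n_t*c_t + n_i*c_i + n_o*c_o
    return sum(c[k] * n["n_" + k[2:]] for k in c)

# Hypothetical resource counts for one query (page reads, tuples, index entries, operator calls).
n_example = {"n_s": 1000, "n_r": 20, "n_t": 50000, "n_i": 3000, "n_o": 80000}
a = 0.05  # hypothetical scale factor (ms per default cost unit), fitted from past queries
print("naive scaling prediction:", a * optimizer_cost(n_example), "ms")
```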

  6. Beat Naïve Scaling
     - PostgreSQL's cost model: C = n_s·c_s + n_r·c_r + n_t·c_t + n_i·c_i + n_o·c_o. Unfortunately, both the c's and the n's could be incorrect!
     - To beat naïve scaling: use machine learning ([Ganapathi ICDE'09], [Akdere ICDE'12]), or calibrate the c's and the n's (our work).

  7. What if We Use Calibrated c's and n's?
     - Cost models become much more effective.
     - Prediction by naïve scaling: T_pred = a·(Σ c·n). Prediction by calibration: T_pred = Σ c'·n'.
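Continuing the illustrative sketch above, the calibrated prediction simply swaps in measured cost units and refined counts; the values below are placeholders, not measurements from the paper.

```python
# Calibrated prediction T_pred = sum over k of c'_k * n'_k, reusing optimizer_cost() from above.
calibrated_c = {"c_s": 5.5e-2, "c_r": 6.5e-2, "c_t": 1.7e-4, "c_i": 3.4e-5, "c_o": 1.1e-4}  # ms
refined_n = {"n_s": 1000, "n_r": 20, "n_t": 50000, "n_i": 3000, "n_o": 80000}  # refined counts
print("calibrated prediction:", optimizer_cost(refined_n, calibrated_c), "ms")
```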

  8. Main Idea: how can we calibrate the c's and the n's?
     - Calibrate the c's: use profiling queries.
     - Calibrate the n's: refine cardinality estimates.

  9. Contribution of This Work
     - We proposed a systematic framework to calibrate the cost models used by the query optimizer.
     - We showed that the calibrated cost model is much better than naïvely scaling the cost estimates.
     - We further showed that the calibrated cost model is also much better than the state-of-the-art machine-learning based approaches.

  10. Calibrating the c's
     - Cost units: c_s (seq_page_cost), c_r (rand_page_cost), c_t (cpu_tuple_cost), c_i (cpu_index_tuple_cost), c_o (cpu_operator_cost).
     - Basic idea (an example): suppose we want to know the true c_t and c_o, and R is in memory:
         q1: select * from R            t1 = c_t·n_t
         q2: select count(*) from R     t2 = c_t·n_t + c_o·n_o
     - General case: k cost units (i.e., k unknowns) => k queries (i.e., k equations); k = 5 in the case of PostgreSQL.

  11. How to Pick Profiling Queries?
     - Completeness: each c should be covered by at least one query.
     - Conciseness: the set of queries is incomplete if any query is removed.
     - Simplicity: each query should be as simple as possible.

  12. Profiling Queries for PostgreSQL: isolate the unknowns and solve them one per equation (a sketch of this sequential solve follows below).
         q1 (R in memory): select * from R                                        t1 = c_t·n_t1
         q2 (R in memory): select count(*) from R                                 t2 = c_t·n_t2 + c_o·n_o2
         q3 (R in memory): select * from R where R.A < a (index on R.A)           t3 = c_t·n_t3 + c_i·n_i3 + c_o·n_o3
         q4 (R on disk):   select * from R                                        t4 = c_s·n_s4 + c_t·n_t4
         q5 (R on disk):   select * from R where R.B < b (unclustered index on R.B)
                                                                                  t5 = c_s·n_s5 + c_r·n_r5 + c_t·n_t5 + c_i·n_i5 + c_o·n_o5
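Because each profiling query introduces exactly one new unknown, the system can be solved in order rather than with a general linear solver. A minimal Python sketch, where t[k] is the measured time of q_k and n[k] holds the optimizer's resource counts for q_k (in practice each query would be timed several times and the results averaged):

```python
def calibrate_cost_units(t, n):
    """Solve c_t, c_o, c_i, c_s, c_r from the five profiling-query timings, one per equation."""
    c = {}
    c["c_t"] = t[1] / n[1]["n_t"]                                         # q1: t1 = c_t*n_t1
    c["c_o"] = (t[2] - c["c_t"] * n[2]["n_t"]) / n[2]["n_o"]              # q2 adds c_o
    c["c_i"] = (t[3] - c["c_t"] * n[3]["n_t"]
                     - c["c_o"] * n[3]["n_o"]) / n[3]["n_i"]              # q3 adds c_i
    c["c_s"] = (t[4] - c["c_t"] * n[4]["n_t"]) / n[4]["n_s"]              # q4 adds c_s
    c["c_r"] = (t[5] - c["c_s"] * n[5]["n_s"] - c["c_t"] * n[5]["n_t"]
                     - c["c_i"] * n[5]["n_i"] - c["c_o"] * n[5]["n_o"]) / n[5]["n_r"]  # q5 adds c_r
    return c
```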

  13. Calibrating the n's
     - The n's are functions of the N's (i.e., input cardinalities), so calibrating the n's reduces to calibrating the N's.
     - Notation: sc = start-cost, rc = run-cost, tc = sc + rc = total-cost, N_t = # of input tuples.
     - Example 1 (in-memory sort): sc = 2·N_t·log(N_t)·c_o + tc_of_child; rc = c_t·N_t.
     - Example 2 (nested-loop join): sc = sc_of_outer_child + sc_of_inner_child; rc = c_o·N_t(inner) + N_t(outer)·rc_of_inner_child + c_t·N_t.
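A sketch of how such formulas turn (calibrated) input cardinalities into per-operator costs, written directly from the two examples above; log base 2 is assumed for the sort comparison count, and the exact constants in PostgreSQL's source may differ.

```python
import math

def sort_cost(N_t, child_total_cost, c):
    # in-memory sort: sc = 2*N_t*log2(N_t)*c_o + tc_of_child, rc = c_t*N_t
    start_cost = 2.0 * N_t * math.log2(N_t) * c["c_o"] + child_total_cost
    run_cost = c["c_t"] * N_t
    return start_cost, run_cost

def nested_loop_cost(N_t_outer, N_t_inner, N_t_out, outer_cost, inner_cost, c):
    # outer_cost / inner_cost are (start_cost, run_cost) pairs of the child plans;
    # N_t_out is the estimated number of output tuples of the join
    start_cost = outer_cost[0] + inner_cost[0]
    run_cost = c["c_o"] * N_t_inner + N_t_outer * inner_cost[1] + c["c_t"] * N_t_out
    return start_cost, run_cost
```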

  14. Refine Cardinality Estimates: cardinality estimation plays a different role here (execution time prediction) than in its traditional role (query optimization).
     - # of plans: hundreds/thousands (optimization) vs. 1 (prediction)
     - Time per plan: must be very short vs. can be a bit longer
     - Precision: important vs. critical
     - Approach: histograms (dominant) vs. sampling (one option)

  15. A Sampling-Based Estimator [Haas et al., J. Comput. Syst. Sci. 1996]
     - Goal: estimate the selectivity ρ_q of a select-join query q, e.g., q: R1 ⋈ R2.
     - Partition R1 into blocks B_{1,1}, ..., B_{1,N1} and R2 into blocks B_{2,1}, ..., B_{2,N2}. Take n samples (with replacement); each sample s_k picks one block from each relation and computes ρ_k = |B_{1,j} ⋈ B_{2,j'}| / (|B_{1,j}| × |B_{2,j'}|).
     - The estimate is the average ρ̂_q = (1/n) · Σ_{k=1..n} ρ_k. The estimator ρ̂_q is unbiased and strongly consistent!
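A self-contained sketch of this estimator, assuming each relation has already been split into equal-size blocks held as Python lists of tuples and that the join predicate can be evaluated per tuple pair:

```python
import random

def estimate_join_selectivity(blocks1, blocks2, join_pred, n_samples):
    """Average of per-sample block selectivities; unbiased for the true join selectivity."""
    total = 0.0
    for _ in range(n_samples):
        b1 = random.choice(blocks1)      # one block per relation, sampled with replacement
        b2 = random.choice(blocks2)
        matches = sum(1 for t1 in b1 for t2 in b2 if join_pred(t1, t2))
        total += matches / (len(b1) * len(b2))
    return total / n_samples
```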

  16. The Cardinality Refinement Algorithm: design the algorithm based on the previous estimator.
     - Problem 1: the estimator needs random I/Os at runtime to take samples. Solution: take samples offline and store them as tables in the database.
     - Problem 2: query plans usually contain more than one operator. Solution: estimate multiple operators in a single run, by reusing partial results.
     - Problem 3: the estimator only works for select/join operators. Solution: rely on PostgreSQL's cost models for aggregates. Future work: add estimators for aggregates ([Charikar PODS'00]).

  17. The Cardinality Refinement Algorithm (Example)
     - Plan for q: an aggregate (agg) on top of R1 ⋈ R2 ⋈ R3, with intermediate queries q1 = R1 ⋈ R2 and q2 = R1 ⋈ R2 ⋈ R3.
     - Rewrite q1 and q2 over the sample tables R1^s, R2^s, R3^s (samples of R1, R2, R3 stored as tables), run them, and reuse the result of R1^s ⋈ R2^s when computing the three-way join:
         ρ̂_q1 = |R1^s ⋈ R2^s| / (|R1^s| × |R2^s|)
         ρ̂_q2 = |R1^s ⋈ R2^s ⋈ R3^s| / (|R1^s| × |R2^s| × |R3^s|)
     - For agg, use PostgreSQL's estimates based on the refined input estimates from q2.
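A sketch of this rewriting step, with hypothetical sample-table names r1_s, r2_s, r3_s, a hypothetical join column k, and a hypothetical run_sql() helper that executes a SQL string and returns a single number (e.g., via psycopg2). For brevity the two join counts are issued as separate statements here, whereas the algorithm above reuses the partial two-way join result when computing the three-way join:

```python
def refined_selectivities(run_sql):
    n1 = run_sql("SELECT count(*) FROM r1_s")
    n2 = run_sql("SELECT count(*) FROM r2_s")
    n3 = run_sql("SELECT count(*) FROM r3_s")
    j12 = run_sql("SELECT count(*) FROM r1_s JOIN r2_s ON r1_s.k = r2_s.k")
    j123 = run_sql("SELECT count(*) FROM r1_s JOIN r2_s ON r1_s.k = r2_s.k "
                   "JOIN r3_s ON r2_s.k = r3_s.k")
    rho_q1 = j12 / (n1 * n2)        # refined selectivity of R1 join R2
    rho_q2 = j123 / (n1 * n2 * n3)  # refined selectivity of R1 join R2 join R3
    return rho_q1, rho_q2
```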

  18. Experimental Settings
     - PostgreSQL 9.0.4, Linux 2.6.18
     - TPC-H 1GB and 10GB databases, with both uniform and skewed data distributions
     - Two different hardware configurations: PC1 (1-core 2.27 GHz Intel CPU, 2GB memory) and PC2 (8-core 2.40 GHz Intel CPU, 16GB memory)

  19. Calibrating Cost Units (calibrated value in ms / calibrated value normalized to c_s / PostgreSQL default)
     PC1:
         c_s (seq_page_cost):        5.53e-2 ms / 1.0     / 1.0
         c_r (rand_page_cost):       6.50e-2 ms / 1.2     / 4.0
         c_t (cpu_tuple_cost):       1.67e-4 ms / 0.003   / 0.01
         c_i (cpu_index_tuple_cost): 3.41e-5 ms / 0.0006  / 0.005
         c_o (cpu_operator_cost):    1.12e-4 ms / 0.002   / 0.0025
     PC2:
         c_s (seq_page_cost):        5.03e-2 ms / 1.0     / 1.0
         c_r (rand_page_cost):       4.89e-1 ms / 9.7     / 4.0
         c_t (cpu_tuple_cost):       1.41e-4 ms / 0.0028  / 0.01
         c_i (cpu_index_tuple_cost): 3.34e-5 ms / 0.00066 / 0.005
         c_o (cpu_operator_cost):    7.10e-5 ms / 0.0014  / 0.0025
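One way to put such numbers to work (an assumed usage pattern, not something the slides prescribe) is to load the normalized PC1 ratios into a PostgreSQL session, so that the planner's cost output is already expressed in the calibrated unit (1 unit corresponds to one sequential page read, about 5.53e-2 ms on PC1). The parameter names below are PostgreSQL's standard planner settings:

```python
PC1_NORMALIZED = {
    "seq_page_cost": 1.0,
    "random_page_cost": 1.2,
    "cpu_tuple_cost": 0.003,
    "cpu_index_tuple_cost": 0.0006,
    "cpu_operator_cost": 0.002,
}

def apply_calibration(cursor, units=PC1_NORMALIZED):
    # cursor is assumed to come from a DB-API connection (e.g., psycopg2)
    for guc, value in units.items():
        cursor.execute(f"SET {guc} = {value}")
```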

  20. Prediction Precision
     - Metric of precision: Mean Relative Error (MRE).
     - Dynamic database workloads: unseen queries frequently occur.
     - Compared with existing approaches: naïve scaling and more complex machine-learning approaches.
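For reference, a one-line version of the metric, assuming the standard definition of mean relative error (|predicted - actual| / actual, averaged over the workload):

```python
def mean_relative_error(predicted, actual):
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)
```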

  21. Existing Machine-Learning Methods
     - The idea: represent a query as a feature vector and train a regression model.
     - SVM [Akdere ICDE'12], REP trees [Xiong SoCC'11], KCCA [Ganapathi ICDE'09] (not compared against directly, since [Akdere ICDE'12] reports better results).
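A rough sketch of what this baseline family looks like in code, using scikit-learn's SVR as a stand-in regressor; the feature encoding (whatever per-plan numbers one extracts) is illustrative and not the exact feature set of the cited papers:

```python
from sklearn.svm import SVR

def train_baseline(feature_vectors, observed_times):
    # feature_vectors: one fixed-length numeric vector per training query
    # observed_times: measured execution times for those queries
    model = SVR(kernel="rbf")
    model.fit(feature_vectors, observed_times)
    return model

# prediction for a new query: model.predict([new_query_features])
```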

  22. Precision on TPC-H 1GB DB (uniform data). Compared predictors: Naïve Scaling; E_t: calibrated c's + n's from true cardinalities; E_o: calibrated c's + n's from the optimizer's cardinality estimates; E_s: calibrated c's + n's from cardinalities estimated by sampling.

  23. Precision on TPC-H 1GB DB (cont., skewed data), with the same predictors E_t, E_o, and E_s.

  24. Precision on TPC-H 10GB DB (uniform data; similar results on skewed data), with the same predictors E_t, E_o, and E_s.
