dynamic and decentralized global analytics via machine
play

Dynamic and Decentralized Global Analytics via Machine Learning Hao - PowerPoint PPT Presentation

SoCC18, Carlsbad, CA, USA Dynamic and Decentralized Global Analytics via Machine Learning Hao Wang 1 , Di Niu 2 , Baochun Li 1 1 University of Toronto, 2 University of Alberta Query Processing Query Plan startName , studioName 1.CREATE


  1. SoCC’18, Carlsbad, CA, USA Dynamic and Decentralized Global Analytics via Machine Learning Hao Wang 1 , Di Niu 2 , Baochun Li 1 1 University of Toronto, 2 University of Alberta

  2. Query Processing Query Plan π startName , studioName 1.CREATE VIEW MoviesOf1996 AS 2. SELECT * 3. FROM Movies 4. WHERE year = 1996; ⋈ 5. 6.SELECT starName, studioName σ year = 1996 7.FROM MoviesOf1996 JOIN StarsIN; StarsIn Movies ! 2

  3. Decentralized Global Analytics SQL Query ⋈ DAG Hive / SparkSQL parse QEP Map-Reduce Candidates ⋈ select QEP Hadoop / Spark DC1 DC3 ⋈ DC2 … ! 3

  4. Fluctuating WAN 800 600 400 Bandwidth (Mbps) Bandwidth in total 200 150 100 50 Bandwidth per connection 3:00 3:15 3:30 3:45 4:00 4:15 4:30 4:45 5:00 Time Google Cloud Taiwan Iowa iperf -t 10 -P 5 ! 4

  5. A Toy Example 01. SELECT 02. C.name, O.orderstatus, 03. L.discount, PS.availqty customer order DC 1 DC 2 400 Mbps 04. FROM 0.5 GB 3.3 GB 05. customer as C, 300 Mbps 100 Mbps 06. order as O, 150 Mbps 150 Mbps 07. lineitem as L, 08. partsupp as PS lineitem partsupp 09. WHERE O.orderkey == L.orderkey, 500 Mbps DC 3 DC 4 15 GB 2.3 GB 10. AND PS.partkey == L.partkey, 11. AND PS.suppkey == L.suppkey, 12. AND C.custkey == O.custkey ! 5

  6. Query Plan Candidates Plan A JOIN JOIN JOIN • The worst plan 1.4G 0.5G 2G 0.5G 1.7G 0.5G JOIN JOIN C C • The baseline JOIN JOIN 1.7G 3.3G 2.6G 2.3G Plan B JOIN JOIN PS O 2.3G 15G 3.3G 0.5G • The initial optimal plan 3.3G 15G 2.3G 15G PS L O C O L PS L • Selected by Clarinet Plan A Plan B Plan C Plan C • … ! 6

  7. A Toy Example 400 BW (Mbps) 300 JOIN JOIN 200 Start End 1.4G 0.5G 1.7G 0.5G 100 t (s) JOIN 0 50 100 150 200 250 C JOIN JOIN 1.7G 3.3G JOIN O 2.3G 15G 3.3G 0.5G 2.3G 15G customer order PS L O C PS L DC 1 DC 2 400 Mbps 0.5 GB 3.3 GB Plan B Plan C 300 Mbps 100 Mbps 150 Mbps 150 Mbps Plan C • The adjusted plan lineitem partsupp 500 Mbps • Adapt to bandwidth fluctuation DC 3 DC 4 15 GB 2.3 GB ! 7

  8. Query Completion Time Δ t: the data movement time Centralized plan Query Completion Time (s) 600 Baseline (Plan A) 550 + Δ t Plan selected by Clarinet (Plan B) 500 Dynamic adjusted plan (Plan C) 450 Δ t The data movement time 400 ! 8

  9. Dynamic Query Planning Challenges: JOIN JOIN 1.4G 0.5G 1.7G 0.5G • Accurately estimating runtime cost JOIN C of query plans. JOIN JOIN 1.7G 3.3G JOIN O 2.3G 15G 3.3G 0.5G • Minimize overall completion time of 2.3G 15G PS L O C PS L queries. Plan B Plan C ! 9

  10. Turbo Data Turbo SQL Query Hive / SparkSQL Model Training parse QEP Planning QEP Candidates Adjustment select QEP Cost Estimator Model Hadoop / Spark DAG DC1 DC3 DC2 … Map-Reduce Evaluation ! 10

  11. Prediction Target ⋈ ⋈ (duration, output size) ⋈ ! 11

  12. Data Generation JOIN reduce tables maps 1.Operator —> Map stage σ price > 100 filter(order o=>(o.price>100)) 2. Operator —> MapReduce stages map(customer c=>(c.custkey, c.values)) orders customer map(order o=>(o.custkey, o.values)) ⋈ reduce(custkey, values) ! 12

  13. Data Generation 15K records Raw Features Range 1 − 16 total_exec_num 1 − 8 per executor cpu_core_num 1 − 4 GB per executor mem_size 5 − 1000 Mbps per link avail_bw tbl1_size , tbl2_size 0 . 3 − 12 GB per table 1 − 90 hdfs_block_num ! 13

  14. Data Preprocessing [1, a , b , c , a 2 , ab , ac , b 2 , bc , c 2 ] [ a , b , c ] 1. Handcrafting features Handcrafted Features 2. Polynomial feature crossing tbl_size_sum = sum ( tbl1_size , tbl2_size ) 3. Feature selection by max_tbl_size = max ( tbl1_size , tbl2_size ) LASSO path min_tbl_size = min ( tbl1_size , tbl2_size ) 1 / avail_bw , 1 / total_exec_num , 1 / cpu_core_num ! 14

  15. Feature Selection duration 1. max_tbl_size 4. cpu_core_num 7. total_exec_num 2. tbl_size_sum 5. max_tbl_size / bw 8. mem_size 6. 1/bw 2 3. min_tbl_size 9. Other features 150 5 Coefficients 100 1 50 2 7 6 4 3 8 0 9 L 1 penalty (decreasing) ! 15

  16. Feature Selection output size 1. max_tbl_size 3. tbl_size_sum 5. Other features 4. min_tbl_size 2 2. tbl1_size 5×10 6 4×10 6 1 100×10 3 3×10 6 Coefficients 4 0 5 2×10 6 3 −100×10 3 1×10 6 2 0 L 1 penalty (decreasing) ! 16

  17. Training 100 LASSO Regression depth-1 80 depth-2 depth-3 Error (%) depth-4 Linear Regression with L1 penalty 60 APE (%) 40 GBRT 20 0 0 200 400 600 Gradient Boosting Regression Tree Number of regression trees 500 ternary regression trees of depth 3 ! 17

  18. Model Test APE i = | y i − h ( x i ) | Absolute × 100%. Percentage Error: y i 40 30 30 APE (%) APE (%) 20 20 10 10 0 0 LASSO GBRT-raw GBRT LASSO GBRT-raw GBRT Duration Output Size ! 18

  19. Model Test 30 30 APE (%) APE (%) 20 20 10 10 0 0 3K 5K 7K 9K 11K 13K 15K 3K 5K 7K 9K 11K 13K 15K Dataset size Dataset size Duration Output Size ! 19

  20. Dynamic Planning Strategies • Shortest Completion Time First (SCTF) duration • Maximum Data Reduction First (MDRF) data_reduction • Maximum Data Reduction Rate First (MDRRF) data_reduction / duration ! 20

  21. Evaluation Setup • TPC-H benchmark Table Location Table Location Taiwan Frankfurt lineitem customer • Google Cloud Singapore Sao Paulo region orders Sydney Northern Virginia supplier nation - 33 instances across Belgium Oregon part partsupp 8 regions ! 21

  22. Query Baseline Turbo-SCTF Turbo-MDRRF Clarinet Turbo-MDRF Query completion time (s) 600 400 Turbo-SCTF 200 • 25.1-38.5% 0 Turbo-MDRF Q2 Q3 Q5 Q7 Q8 Query completion time (s) 2000 • 12.6-37.1% 1500 Turbo-MDRRF 1000 • 25.2-41.4% 500 0 Q9 Q10 Q11 Q18 Q21 ! 22

  23. Pairwise Join JOIN reduce tables maps Baseline Turbo-SCTF Turbo-MDRRF Clarinet Turbo-MDRF Stage completion time (s) Stage completion time (s) 300 Completion time (s) 300 40 300 200 200 200 200 20 100 100 100 The completion time 100 distributions of 0 0 0 0 0 Q2 Q3 Q5 Q7 Q8 pairwise joins. 400 600 Completion time (s) 200 40 300 300 400 200 200 100 20 200 100 100 0 0 0 0 0 Q9 Q10 Q11 Q18 Q21 ! 23

  24. Case Study Brazil--Taiwan Virginia--Taiwan Virginia--Sydney Brazil--Virginia Brazil--Sydney Taiwan--Sydney 800 600 BW (Mbps) 400 The Gantt chart of 200 the query Q21 0 Baseline Clarinet SCTF MDRF MDRRF 5:20 5:30 5:40 Time ! 24

  25. Related Work Data Task Plan Working Work Placement Scheduling Optimization Mode √ √ Geode [26] static √ √ WANanalytics [27] static √ √ Iridium [20] static √ SWAG [16] static √ JetSteam [21] static √ √ Clarinet [25] static √ Lube [15] dynamic √ Graphene [14] static √ Turbo dynamic ! 25

  26. Conclusion • Turbo: dynamic query planning with awareness of WAN bandwidths • Data-driven cost estimation of pairwise join with accuracy over 95% • Greedy strategies that reduces the query completion times by up to 41% based on the TPC-H benchmark ! 26

  27. The End Thank You

Recommend


More recommend