Toward timely, predictable and cost-effective data analytics
Renata Borovica-Gajić, DIAS, EPFL
  1. Toward timely, predictable and cost-effective data analytics. Renata Borovica-Gajić, DIAS, EPFL

  2. Big data proliferation. “Big data is when the current technology does not enable users to obtain timely, cost-effective, and quality answers to data-driven questions.” [Steve Todd, Berkeley] Technology follows Moore’s Law. [“Trends in big data analytics”, Kambatla et al., 2014; “The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East”, IDC, 2012]

  3. What business analysts want: timely, predictable, cost-effective queries. [Chart: data warehouse cost in million $ (WinterCorp, 2013), broken into development, administration and system; actual costs far exceed expected, causing user frustration and wasted resources.] 80% reuse within 3 hours. Goals: minimal data-to-insight time, predictable response time, low infrastructure cost.

  4. Thesis statement: as traditional DBMSs rely on predefined assumptions about workload, data and storage, changes to any of these cause loss of performance and unpredictability. Insight: query execution must adapt at three levels (to workload, data and hardware) to stabilize and optimize performance and cost.

  5. Outline • Minimize data-to-insight time – Workload-driven adaptation [SIGMOD’12, VLDB’12, CACM’15] • Improve predictability of response time – Data-driven adaptation [DBTest’12, ICDE’15] • Reduce analytics cost – Cold storage & hardware-driven adaptation [VLDB’16]

  6. Outline • Minimize data-to-insight time – Workload-driven adaptation • Improve predictability of response time – Data-driven adaptation • Reduce analytics cost – Cold storage & hardware-driven adaptation

  7. Data-to-insight time. Traditional query stack: data must first be loaded before querying yields insight; the raw data querying stack queries files in place. [Chart: execution breakdown (%) of Q1 over the raw data querying stack; converting, tokenizing, parsing and I/O dominate.] Time to first insight is too long with the traditional stack, raw data querying overheads are too high, and neither scales with data growth. Current technology ≠ efficient exploration.

  8. Optimize the raw data querying stack. [Chart: response time of Q1–Q4; a traditional DBMS pays a large upfront LOAD before Q1, the raw data querying stack repays parsing overheads on every query, while NoDB answers Q1 immediately and gets cheaper with each query.] Not everything is needed for Q1: let users show what matters by asking queries. NoDB: workload-driven data loading & tuning.

  9. PostgresRaw: NoDB from idea to practice. 1. Positional indexing: pointers to the ends of tuples and to attributes within each tuple of the raw file (e.g., 1|Supplier#01|17|335-1736|5755.94|each slyly...). 2. Caching: attributes parsed by the workload (e.g., NationKey, Name) are kept in memory. 3. Statistics: histograms are collected on the fly during scans. Adjusting to queries makes access progressively cheaper.
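The positional-indexing and caching ideas on this slide can be sketched in a few lines. The following is a hypothetical illustration, not PostgresRaw's actual implementation; all class and method names are invented. One sequential pass records byte offsets of tuples and attribute delimiters in a '|'-separated raw file, and parsed values are cached so repeated accesses become progressively cheaper.

```python
class PositionalIndex:
    """Sketch of NoDB-style positional indexing over a raw delimited file."""

    def __init__(self, path, delimiter="|"):
        self.path = path
        self.delimiter = delimiter.encode()
        self.tuple_offsets = []   # byte offset where each tuple starts
        self.attr_offsets = []    # per tuple: positions of delimiters
        self.cache = {}           # (row, col) -> previously parsed value

    def build(self):
        """One sequential pass: record tuple starts and attribute pointers."""
        offset = 0
        with open(self.path, "rb") as f:
            for line in f:
                self.tuple_offsets.append(offset)
                positions, pos = [], 0
                while True:
                    pos = line.find(self.delimiter, pos)
                    if pos == -1:
                        break
                    positions.append(pos)
                    pos += 1
                self.attr_offsets.append(positions)
                offset += len(line)

    def get(self, row, col):
        """Fetch one attribute; repeated calls hit the in-memory cache."""
        if (row, col) in self.cache:
            return self.cache[(row, col)]
        with open(self.path, "rb") as f:
            f.seek(self.tuple_offsets[row])   # jump straight to the tuple
            line = f.readline()
        bounds = self.attr_offsets[row]
        start = 0 if col == 0 else bounds[col - 1] + 1
        end = bounds[col] if col < len(bounds) else len(line.rstrip(b"\n"))
        value = line[start:end].decode()
        self.cache[(row, col)] = value
        return value
```

The key property mirrors the slide: the first query pays for tokenizing, while later queries over the same attributes are served from offsets and the cache.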

  10. PostgresRaw in action. Setting: 7.5M tuples, 150 attributes, 11GB file. Queries: 10 arbitrary attributes per query, varying selectivity. [Chart: cumulative execution time (sec) of LOAD plus Q1–Q20 for MySQL CSV engine, DBMS X, DBMS X w/ external files, PostgreSQL and PostgresRaw; the slowest configurations reach ~7000s and ~4806s.] Data-to-insight time is halved with PostgresRaw, and per-query performance is comparable to a traditional DBMS.

  11. Summary of PostgresRaw • Query processing engine over raw data files • Uses user queries to drive partial data loading and tuning • Comparable performance to a traditional DBMS. IMPACT: • Enables timely data exploration with zero initialization cost • Decouples user interest from data growth

  12. Outline • Minimize data-to-insight time – Workload-driven adaptation • Improve predictability of response time – Data-driven adaptation • Reduce analytics cost – Cold storage & hardware-driven adaptation

  13. Index: with or without? Setting: TPC-H, SF10, DBMS-X, tuning tool given 5GB of space for indexes. [Chart: normalized execution time (log scale) per TPC-H query, tuned (with indexes) vs. original (without indexes); several queries run orders of magnitude slower after tuning.] Performance hurt after tuning.

  14. Access path selection problem. [Chart: execution time vs. selectivity; the optimizer switches from index scan to full scan at an estimated selectivity threshold, but when the actual selectivity exceeds the estimate, the index scan runs into a performance cliff.] Statistics: an unreliable advisor. Re-optimization [MID’98, POP’04, RIO’05, BOU’14]: risky.
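The cliff on this slide can be illustrated with a toy cost model. The constants below are assumptions chosen for illustration, not measurements from the talk: an unclustered index scan pays roughly one random I/O per qualifying tuple, while a full scan pays a fixed sequential pass, so the two curves cross at a very low selectivity and the penalty for a wrong estimate is steep.

```python
RANDOM_IO_MS = 10.0   # assumed cost of one random page read
SEQ_PAGE_MS = 0.1     # assumed cost of one sequential page read
NUM_TUPLES = 1_000_000
NUM_PAGES = 100_000

def index_scan_cost(selectivity):
    # Worst case (no clustering): one random I/O per qualifying tuple.
    return selectivity * NUM_TUPLES * RANDOM_IO_MS

def full_scan_cost(selectivity):
    # A full scan reads every page sequentially, regardless of selectivity.
    return NUM_PAGES * SEQ_PAGE_MS

def crossover():
    # Selectivity at which the full scan becomes the cheaper plan.
    return (NUM_PAGES * SEQ_PAGE_MS) / (NUM_TUPLES * RANDOM_IO_MS)

print(f"crossover at {crossover():.4%} selectivity")
# prints: crossover at 0.1000% selectivity
```

With these (assumed) constants, misestimating a 10% selectivity as 0.05% makes the index scan two orders of magnitude more expensive than the full scan, which is exactly the cliff the chart depicts.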

  15. Quest for predictable execution. [Chart: execution time vs. selectivity; a predictable execution curve tracks the cheaper of index scan and full scan over the whole 0–100% range, removing the risk region.] Goal: remove variability due to (sub-optimal) access path choices.

  16. Smooth Scan: morph between an index scan and a sequential scan based on the observed result distribution.

  17. Morphing mechanism. Modes: 1. Index Access: traditional index access. 2. Entire Page Probe: an index probe reads the entire heap page. 3. Gradual Flattening Access: the probe extends to adjacent region(s) of heap pages. [Diagram: an index over heap pages, with modes 1–3 touching progressively larger regions.]

  18. Morphing policy • Selectivity increase (SEL_region >= SEL_global) -> mode increase • Selectivity decrease (SEL_region < SEL_global) -> mode decrease. [Diagram: heap pages with result pages marked X, annotated with region selectivities SR and the running global selectivity SG.] Region snooping = data-driven adaptation.
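A minimal sketch of this policy, with invented names and a simplified notion of a region: the scan compares each region's observed selectivity against the running global selectivity and steps the mode up or down accordingly.

```python
INDEX_ACCESS, PAGE_PROBE, FLATTEN = 1, 2, 3  # the three modes on slide 17

def morph(regions):
    """regions: list of (result_tuples, pages) pairs observed per region.
    Returns the scan mode chosen after snooping each region."""
    mode = INDEX_ACCESS
    results_seen = pages_seen = 0
    trace = []
    for hits, pages in regions:
        sel_region = hits / pages               # SEL_region for this region
        results_seen += hits
        pages_seen += pages
        sel_global = results_seen / pages_seen  # running SEL_global
        if sel_region >= sel_global and mode < FLATTEN:
            mode += 1   # results are dense here: widen the probe
        elif sel_region < sel_global and mode > INDEX_ACCESS:
            mode -= 1   # results are sparse: fall back toward index access
        trace.append(mode)
    return trace
```

For example, two dense regions escalate the scan to flattening mode, and a subsequent sparse region de-escalates it, which is how the mechanism avoids committing to either extreme up front.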

  19. Smooth Scan in action. Setting: micro-benchmark, 25GB table, ORDER BY, selectivity 0–100%. [Chart: execution time (sec, log scale) vs. selectivity (%) for Full Scan, Index Scan, Sort Scan and Smooth Scan; Smooth Scan stays close to the best alternative at every point.] Near-optimal over the entire selectivity range.

  20. Summary of Smooth Scan • Statistics-oblivious access path • Uses region snooping to morph between alternatives • Near-optimal performance for all selectivities. IMPACT: • Removes the access path selection decision • Improves predictability by reducing variability in query execution

  21. Outline • Minimize data-to-insight time – Workload-driven adaptation • Improve predictability of response time – Data-driven adaptation • Reduce analytics cost – Cold storage & hardware-driven adaptation

  22. Proliferation of cold data. “80% of enterprise data is cold, with 60% CAGR” [Horison, 2015]; “cold data: incredibly valuable for analysis” [Intel, 2013]. Cold Storage Devices (CSD) to the rescue. [Diagram: disk groups A and B; only a few disks spin at a time, so an active disk serves data at ~10ms latency, while powering one disk up and cooling another down costs ~10s.] PB-size storage at cost ~ tape and latency ~ disks.

  23. CSD in the storage tiering hierarchy. [Diagram: tiers vs. data access latency (ns, µs, ms, sec, min, hour): DRAM and SSD in the performance tier ($$$), 15k RPM HDD ($$), 7200 RPM capacity HDD ($), tape in the archival tier.]

  24. CSD in the storage tiering hierarchy. [Diagram: the same hierarchy with a CSD-based cold tier ($) inserted between the capacity HDD tier and tape.] Can we shrink tiers to reduce cost?

  25. CSD in the storage tiering hierarchy. [Diagram: the hierarchy with the CSD cold tier ($) also absorbing the archival tape tier.] Can we shrink tiers to reduce cost?

  26. CSD in the storage tiering hierarchy. [Chart: cost (x1000$) of storing 100TB of data (Horison, 2015) for a CSD-based 2-tier hierarchy vs. a traditional 3-tier hierarchy; annotated figure: $159,641.] CSD offer significant cost savings (~40%). But… can we run queries over CSD?

  27. Query execution over CSD. Setting: virtualized enterprise datacenter; clients run PostgreSQL, TPC-H SF50, Q12; the CSD is shared, layout: one client per group. [Chart: average execution time (x1000 sec) vs. number of clients (groups) for PostgreSQL on CSD vs. an ideal HDD baseline; execution over CSD degrades sharply as clients are added.] Lost opportunity: CSD relegated to archival storage.

  28. Skipper to the rescue. [Architecture: a virtualized enterprise data center; VM1–VM3 each run PostgreSQL with a multi-way join (MJoin) over hash-based scans of A, B and C, backed by a cache, an I/O scheduler with an object-group map, and cold storage.] 1. Novel ranking algorithm: balances access efficiency across groups and fairness across clients. 2. Opportunistic execution: multi-way joins are triggered upon data arrival. 3. Progress-driven caching: favors caching the objects that maximize query progress.
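The ranking idea can be sketched as a scoring function. The weighting below is an assumption for illustration, not Skipper's actual algorithm: each disk group is scored by how many outstanding requests it would satisfy (access efficiency, since spin-ups are expensive and should be batched) and by how long its clients have waited (fairness).

```python
def rank_groups(pending, wait_time, alpha=0.5):
    """pending: group id -> number of outstanding client requests.
    wait_time: group id -> time since the group was last serviced.
    alpha (assumed) trades batching efficiency against fairness."""
    def score(group):
        return alpha * pending[group] + (1 - alpha) * wait_time[group]
    # Best-first: spin up the group with the highest combined score.
    return sorted(pending, key=score, reverse=True)

# Example: group B has few pending requests but has starved the longest.
print(rank_groups({"A": 5, "B": 1, "C": 3}, {"A": 0, "B": 10, "C": 2}))
# prints: ['B', 'A', 'C']
```

Without the fairness term, a popular group could monopolize the spinning disks and starve clients whose data sits in rarely requested groups; the aging term bounds that starvation.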

  29. Skipper in action. Setting: multi-tenant enterprise datacenter; clients: TPC-H SF50, Q12; the CSD is shared, one client per group. [Chart: average execution time (x1000 sec) vs. number of clients (groups) for PostgreSQL on CSD, PostgreSQL on HDD (ideal) and Skipper.] Skipper approximates the HDD-based capacity tier to within 20% on average.

  30. Summary of Skipper • Efficient query execution over CSD with: 1. Rank-based I/O scheduling 2. Out-of-order execution based on multi-way joins 3. Progress-based caching policy • Approximates the performance of an HDD-based storage tier. IMPACT: • Cold storage can reduce TCO by shrinking the storage hierarchy • Skipper enables data-analytics-over-CSD-as-a-service
