Cheap data analytics using cold storage devices
Renata Borovica-Gajic, Raja Appuswamy, and Anastasia Ailamaki
(PowerPoint presentation transcript)


  1. Cheap data analytics using cold storage devices Renata Borovica-Gajic, Raja Appuswamy, and Anastasia Ailamaki

  2. Proliferation of cold data
  "80% of enterprise data is cold, with 60% CAGR" [Horison]
  "Cold data: an incredibly valuable piece of the analysis pipeline" [Intel]
  Cold Storage Devices (CSD) to the rescue: a few active disks serve requests at ~10 ms latency, while the remaining disks are powered down; accessing data on a cooled disk takes ~10 sec (the disk must first be powered up). The result: PB of storage at a cost close to tape with latency close to disks.

  3. CSD in the storage tiering hierarchy
  [Figure: storage tiers plotted against data access latency (ns, µs, ms, sec, min, hour): DRAM ($$$), SSD, 15k RPM HDD (performance tier, $$), 7200 RPM capacity HDD ($), and VTL/archival.]

  4. CSD in the storage tiering hierarchy
  [Figure: the same hierarchy with a CSD cold tier ($) inserted between the capacity HDD tier and VTL/archival.]
  Can we shrink tiers to further save cost?

  5. CSD in the storage tiering hierarchy
  [Figure: the hierarchy with the VTL/archival tier removed; the CSD cold tier serves archival, and the capacity HDD tier is marked with a question mark.]
  Can we shrink tiers to further save cost?

  6. CSD in the storage tiering hierarchy
  [Bar chart: cost (x1000 $) of storing 100 TB of data (Horison, 2015) for a CSD-based 2-tier hierarchy vs. a traditional 3-tier hierarchy; the annotated saving is $159,641.]
  CSD offers significant cost savings (~40%).
  But... can we run queries over CSD?

  7. Query execution over CSD
  Traditional setting: a virtualized enterprise data center. Client VMs (VM1-VM3) run databases (DB1-DB3) and access the storage tiers over the network as objects or blocks.
  The HDD-based capacity tier provides uniform access latency and control over data layout, so static (pull-based) execution works.
  The cold storage tier provides neither uniform access nor layout control: pull-based execution will trigger unwarranted group switches.

  8. What this means for an enterprise datacenter
  Setting: multitenant enterprise datacenter; clients run PostgreSQL, TPC-H scale factor 50, query Q12; the CSD is shared; layout: one client per group.
  [Charts: average execution time (x1000 sec) of PostgreSQL on the CSD vs. an ideal HDD baseline, as a function of group switch latency (sec) and of the number of clients (groups).]
  Lost opportunity: CSD relegated to archival storage.

  9. Need hardware-software codesign
  1. Data access has to be hardware-driven to minimize group switches.
  2. The query execution engine has to process data pushed from storage in an out-of-order (unpredictable) manner.
  3. Data round-trips to cold storage have to be reduced via smart caching.

  10. Skipper to the rescue
  [Architecture diagram: in a virtualized enterprise data center, each VM (VM1-VM3) runs PostgreSQL over its database (DB1-DB3). Skipper adds: (1) cache management with progress-driven caching, (2) opportunistic execution with multi-way joins (an MJoin with hash tables over scans of A, B, and C), and (3) an I/O scheduler holding the object-to-group mapping and a novel ranking algorithm, sitting between the network and the cold storage.]

  11. Multi-way joins in PostgreSQL
  Setting: query AxBxC with fragments A: A1, A2; B: B1, B2; C: C1, C2; the VM runs PostgreSQL.
  The MJoin operator builds a hash table per input (Scan A, Scan B, Scan C). A state manager tracks all subplans (the fragment combinations A1,B1,C1 through A2,B2,C2), marking each as executed or pending; as fragments arrive from cold storage in arbitrary order, newly executable subplans are run and moved from pending to executed.
  Enables out-of-order opportunistic execution.
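The subplan bookkeeping on this slide can be sketched in a few lines of Python. This is an illustrative reconstruction, not Skipper's actual code: a state manager enumerates all fragment combinations up front and, as fragments arrive in arbitrary order, reports which pending subplans have become executable.

```python
from itertools import product

class SubplanStateManager:
    """Tracks executed vs. pending subplans of a multi-way join.

    Each subplan is one combination of fragments, e.g. ('A1', 'B1', 'C1').
    Simplified sketch: fragments are assumed to stay available once arrived.
    """

    def __init__(self, fragments_per_table):
        # All combinations, e.g. A x B x C -> 8 subplans for 2 fragments each.
        self.pending = set(product(*fragments_per_table))
        self.executed = set()
        self.arrived = set()

    def on_fragment_arrival(self, fragment):
        """Record an out-of-order arrival; return subplans now executable."""
        self.arrived.add(fragment)
        runnable = {sp for sp in self.pending if set(sp) <= self.arrived}
        self.pending -= runnable
        self.executed |= runnable
        return runnable

mgr = SubplanStateManager([["A1", "A2"], ["B1", "B2"], ["C1", "C2"]])
for frag in ["B1", "C1", "A1"]:      # fragments pushed in storage order
    done = mgr.on_fragment_arrival(frag)
print(done)                          # {('A1', 'B1', 'C1')}
print(len(mgr.pending))              # 7
```

Only once the third fragment of a combination has arrived does that subplan run; the other seven combinations stay pending until A2, B2, or C2 show up.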

  12. Progress-driven caching
  Setting: query AxBxC; cache size 4; cache full with A1, A2, B1, C1; subplans A1,B1,C1 and A2,B1,C1 already executed; C2 arrives, so one cached object must be evicted.
  LRU would drop B1, enabling no pending subplan (no progress). The new "Max progress" algorithm instead counts, for each eviction candidate, how many pending subplans become executable from the cache after admitting C2: evicting A1 or A2 yields progress 1, evicting B1 yields 0, and evicting C1 yields 2 (both A1,B1,C2 and A2,B1,C2 can run). So C1 is evicted.
  Minimizes data round-trips, maximizes query progress.
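A minimal sketch of this max-progress eviction policy (a reconstruction from the slide, with assumed names, not Skipper's implementation): for each eviction candidate, count how many pending subplans could run from the cache if the candidate were replaced by the newly arriving fragment, and evict the candidate that maximizes that count.

```python
def max_progress_evict(cache, new_fragment, pending_subplans):
    """Pick the cached fragment whose eviction enables the most pending subplans.

    cache:            set of cached fragment names, e.g. {'A1','A2','B1','C1'}
    new_fragment:     fragment arriving from cold storage, e.g. 'C2'
    pending_subplans: iterable of fragment-name tuples not yet joined
    Illustrative sketch; ties are broken arbitrarily.
    """
    def progress_if_evicted(victim):
        hypothetical = (cache - {victim}) | {new_fragment}
        return sum(1 for sp in pending_subplans if set(sp) <= hypothetical)

    return max(cache, key=progress_if_evicted)

# The slide's scenario: A1,B1,C1 and A2,B1,C1 already executed, C2 arriving.
pending = [("A1", "B1", "C2"), ("A1", "B2", "C1"), ("A1", "B2", "C2"),
           ("A2", "B1", "C2"), ("A2", "B2", "C1"), ("A2", "B2", "C2")]
victim = max_progress_evict({"A1", "A2", "B1", "C1"}, "C2", pending)
print(victim)  # 'C1': evicting it leaves {A1,A2,B1,C2}, enabling 2 subplans
```

Evicting B1 would enable nothing (every B1 subplan with the cached fragments needs C2 *and* B1 together), which is exactly why an oblivious policy like LRU can stall progress here.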

  13. Rank-based scheduling
  Which group should the CSD switch to next?
  Setting: group table G1: O1 (DB1), O3 (DB3); G2: O2 (DB2), O4 (DB4); G3: O5 (DB5); a stream of requests for objects O1...O5 arrives over time.
  FCFS: fair but inefficient (it switches groups for every request in arrival order).
  Max-requests: efficient but not fair (G3, with only O5 queued, starves).
  New ranking algorithm: Rank(G) = #Requests + ΣWait, which balances efficiency (serving groups with many requests) and fairness (accumulated waiting time eventually promotes starved groups).
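The ranking rule Rank(G) = #Requests + ΣWait can be sketched as follows. This is an illustrative reconstruction under assumed units (the wait term here is each queued request's age in seconds), not the paper's exact formulation:

```python
def pick_group(groups, now):
    """Choose the next group to spin up on the CSD.

    groups: dict mapping group id -> list of arrival times of queued requests
    Rank(G) = number of queued requests + sum of their waiting times,
    so a lone long-waiting request (e.g. O5's group) eventually outranks
    groups that merely have many fresh requests.
    """
    def rank(g):
        arrivals = groups[g]
        return len(arrivals) + sum(now - t for t in arrivals)
    return max(groups, key=rank)

# G1 and G2 each hold two fresh requests; G3 holds one request waiting 10 s.
queues = {"G1": [9.0, 9.5], "G2": [9.2, 9.8], "G3": [0.0]}
print(pick_group(queues, now=10.0))  # 'G3' (rank 1 + 10 beats 2 + 1.5)
```

Under pure max-requests, G3 would never win this comparison; the additive wait term is what prevents starvation while still favoring busy groups when waits are comparable.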

  14. Skipper in action
  Setting: multitenant enterprise datacenter; clients: TPC-H scale factor 50, Q12; the CSD is shared; layout: one client per group.
  [Charts: average execution time (x1000 sec) of PostgreSQL, Skipper, and an ideal HDD baseline, as a function of group switch latency (sec) and of the number of clients.]
  Skipper performs within 20% of the HDD-based capacity tier and is resilient to group switch latency.

  15. Minimizing group switches
  Setting: multitenant enterprise datacenter; 5 clients: TPC-H scale factor 50, Q12; the CSD is shared; layout: one client per group.
  [Chart: execution time breakdown (%) into transfer time, switch time, and processing, for PostgreSQL vs. Skipper.]
  Skipper substantially reduces the overhead of group switches.

  16. Conclusions
  • Cold storage can substantially reduce TCO
    – But DBMS performance suffers due to pull-based execution
  • Skipper enables efficient query execution over CSD with
    – Out-of-order execution based on multi-way joins
    – A novel progress-based caching policy
    – Rank-based I/O scheduling
  • Skipper makes data analytics over CSD as a service possible
    – Providers reduce cost by offloading data to CSD
    – Customers reduce cost by running inexpensive data analytics over CSD
  Thank you!
