masm efficient online updates in data warehouses
play

MaSM: Efficient Online Updates in Data Warehouses Manos - PowerPoint PPT Presentation

MaSM: Efficient Online Updates in Data Warehouses Manos Athanassoulis 1 Shimin Chen 2 Anastasia Ailamaki 1 Phillip Gibbons 2 Radu Stoica 1 1 EPFL 2 Intel Labs Freshness vs Performance Data warehouse workload Read-only queries (scans)


  1. MaSM: Efficient Online Updates in Data Warehouses Manos Athanassoulis 1 Shimin Chen 2 Anastasia Ailamaki 1 Phillip Gibbons 2 Radu Stoica 1 1 EPFL 2 Intel Labs

  2. Freshness vs Performance • Data warehouse workload – Read-only queries (scans) – Scattered updates – Difficult to combine efficiently TPCH queries (on avg) 2.5 Normalized execution time 2 • Traditionally two choices Freshness 1.5 – Freshness: in-place updates Performance 1 – Performance: batch updates 0.5 0 Query only Query w/ Query only + Ideal • Ideally, zero overhead updates Updates only 2

  3. Freshness vs Performance • Data warehouse workload – Read-only queries (scans) – Scattered updates – Difficult to combine efficiently TPCH queries (on avg) 2.5 Normalized execution time 2 • Traditionally two choices Freshness 1.5 – Freshness: in-place updates Performance 1 – Performance: batch updates 0.5 0 Query only Query w/ Query only + Ideal • Ideally, zero overhead updates Updates only Is zero overhead possible? 3

  4. Freshness AND Performance In Memory Buffered Updates cache updates in memory 1000 normalized migration overhead [Stonebraker et al .’05] [Heman et al.’10] 100 Ø Apply them online Ø Apply them as differential updates 10 ideal x Large memory overhead x Trade-off migration overhead 1 16MB 128MB 1GB 8GB for memory footprint in-memory buffer size To sum up Update Approach Freshness Performance ↓ mem overhead Batched X J J In place X J J In-memory differential X J J 4

  5. Freshness AND Performance In Memory Buffered Updates cache updates in memory 1000 normalized migration overhead [Stonebraker et al .’05] [Heman et al.’10] 100 Ø Apply them online Ø Apply them as differential updates 10 ideal x Large memory overhead x Trade-off migration overhead 1 16MB 128MB 1GB 8GB for memory footprint in-memory buffer size To sum up Update Approach Freshness Performance ↓ mem overhead Batched X J J In place X J J In-memory differential X J J Can we have the cake and eat it too? 5

  6. Use MaSM! Incoming updates Merge data Answer from disks query and flash SSD • Buffer updates on Flash instead of memory Ø Flash has larger capacity and smaller price • But: Flash friendly design is important – Avoid random writes – Limit total writes • e.g. Log-Structure Merge Tree incurs a high number writes per update [O’ Neil et al.’96] 6

  7. Use MaSM! Incoming updates Merge data Answer from disks query and flash SSD • Buffer updates on Flash instead of memory Ø Flash has larger capacity and smaller price • But: Flash friendly design is important – Avoid random writes Let’s think again! – Limit total writes • e.g. Log-Structure Merge Tree incurs a high number writes per update [O’ Neil et al.’96] 7

  8. MaSM core idea Updates (U) Data (D) Current data K Value Type Key Value Key Value Type ü Outer join: D ⟕ U 1 V1’ Mod 1 V1 5 V5’ Mod 5 V5’’ Mod 2 V2 ü Keep latest update only 19 V19’ Mod 9 N/A Del 3 V3 1 V1’ Mod 19 V19’ Mod 4 V4 9 N/A Del Efficient execution 125 V125 Ins 5 V5 125 V125 Ins ü Discard duplicates 6 V6 5 V5’’ Mod ü Re-use information for 7 V7 future queries 8 V8 9 V9 Sort-Merge Join ü Intuitively does both 8

  9. MaSM core idea Updates (U) Data (D) Current data K Value Type Key Value Key Value Type ü Outer join: D ⟕ U 1 V1’ Mod 1 V1 5 V5’ Mod 5 V5’’ Mod 2 V2 ü Keep latest update only 19 V19’ Mod 9 N/A Del 3 V3 1 V1’ Mod 19 V19’ Mod 4 V4 9 N/A Del Efficient execution 125 V125 Ins 5 V5 125 V125 Ins ü Discard duplicates 6 V6 5 V5’’ Mod ü Re-use information for 7 V7 future queries 8 V8 9 V9 Sort-Merge Join MaSM merges data with updates using sort- ü Intuitively does both merge join and materializing sorted runs 9

  10. Outline • Introduction – Prior work: Differential Updates – MaSM sneak peak • MaSM architecture • Evaluation – Query response time – Sustained update rate • Conclusions 10

  11. MaSM in detail Updates Main memory M= !!" M pages SSD e.g. GBs Disks (main data) e.g. TBs 11

  12. MaSM in detail Main memory M= !!" Merge data & updates Incoming Mem Scan query Merge updates Table Range Scan M pages Run Scan Run Scan Run Scan M pages SSD e.g. GBs Disks (main data) e.g. TBs 12

  13. MaSM in detail Main memory M= !!" Merge data & updates Incoming Mem Scan query Merge updates Table Range Scan M pages Run Scan Run Scan Run Scan M pages SSD e.g. GBs Disks (main data) e.g. TBs Merge pages from HDD, SSD and RAM with negligible overhead! 13

  14. Reducing MaSM memory Main memory ! ≤ 2, S ≤ & Merge data & updates Incoming Mem Scan query Merge updates Table Range Scan S pages Run Scan Run Scan Run Scan αM-S pages SSD e.g. GBs Disks (main data) e.g. TBs 1-pass runs 2-pass runs 14

  15. Reducing MaSM memory Main memory ! ≤ 2, S ≤ & Merge data & updates Incoming Mem Scan query Merge updates Table Range Scan S pages Run Scan Run Scan Run Scan αM-S pages SSD e.g. GBs Disks (main data) e.g. TBs 1-pass runs 2-pass runs Trade-off extra writes for memory size. 15

  16. Impact of α on SSD wear 2 SSD writes per update ! − #. !%& ! 1 Memory footprint = αM 0.2M 2M 0.2 2.0 f (M) ≤ α ≤ 2 e.g., M=1000, 0.2 ≤ α ≤ 2 M= **+ 16

  17. Impact of α on SSD wear 2 SSD writes per update ! − #. !%& ! 1 Memory footprint = αM 0.2M 2M 0.2 2.0 f (M) ≤ α ≤ 2 e.g., M=1000, 0.2 ≤ α ≤ 2 M= **+ 10x smaller memory for 2x more writes! 17

  18. Outline • Introduction – Prior work: Differential Updates – MaSM sneak peak • MaSM architecture • Evaluation – Query response time – Sustained update rate • Conclusions 18

  19. Experimental setup • Dell Precision 690 Workstation – Intel Xeon Quad (2.33MHz, 8MB L2), 4GB DRAM, Ubuntu Linux, kernel 2.6.24 • Dedicated SATA disk for main data – 7200rpm Seagate Barracuda, 77MB/s sequential bandwidth • Intel X25-E SSD for caching updates – 250 MB/s sequential read, 170MB/s sequential write bandwidth; 35,000 4KB-sized random reads/second • Prototype row store: – Implemented in-place updates, indexed updates, MaSM 19

  20. Query performance on synthetic data 100GB main data, 4GB flash for cached updates, 16MB memory normalized time 4 3 2 1 0 4KB 100KB 1MB 10MB 100MB 1GB 10GB 100GB range size in-place updates MaSM w/ coarse-grain index MaSM w/ fine-grain index • MaSM has negligible impact on 10MB or larger scans • MaSM with fine-grain index incurs 4% overhead for 4KB ranges (modeling point queries) 20

  21. TPCH replay experiment 3537s 2000 execution time (s) 1500 1000 500 0 q10 q11 q12 q13 q14 q15 q16 q18 q19 q21 q22 q1 q2 q3 q4 q5 q6 q7 q8 q9 query w/o updates query w/ in-place updates query w/ MaSM updates • Replay TPCH disk traces recorded from commercial row store; random online updates 21

  22. TPCH replay experiment 3537s 2000 execution time (s) 1500 1000 500 0 q10 q11 q12 q13 q14 q15 q16 q18 q19 q21 q22 q1 q2 q3 q4 q5 q6 q7 q8 q9 query w/o updates query w/ in-place updates query w/ MaSM updates • Replay TPCH disk traces recorded from commercial row store; random online updates Queries with MaSM see less than 1% overhead! 22

  23. Update performance 14000 Update Rate (upd/s) 12K 12000 10000 8000 6K 6000 3.5K 4000 2000 48 0 in-place updates MaSM 2GB SSD MaSM 4GB SSD MaSM 8GB SSD 23

  24. Update performance 14000 Update Rate (upd/s) 12K 12000 10000 8000 6K 6000 3.5K 4000 2000 48 Efficient usage of a few GB of flash can 0 in-place updates MaSM 2GB SSD MaSM 4GB SSD MaSM 8GB SSD increase update rate up to 258x! 24

  25. To sum up MaSM enables on-line updates in DW • Negligible query overhead (less than 1% for TPCH) • Supports a high update rate (up to 12k) • Tunable memory footprint vs SSD wear • Low migration cost (one-time 2.2x) • SSD-friendly behavior – Limited number of writes per updates TPCH queries (on avg) 2.5 – No random writes on SSD Normalized execution time • Easy DBMS integration 2 • Ensure ACID properties 1.5 1 0.5 0 Query only Query w/ Query only + Ideal 25 MaSM updates Updates only

  26. To sum up MaSM enables on-line updates in DW • Negligible query overhead (less than 1% for TPCH) Update Approach Freshness Performance ↓ mem overhead • Supports a high update rate (up to 12k) Batched X J J • Tunable memory footprint vs SSD wear In place J X J • Low migration cost (one-time 2.2x) In-memory differential X J J • SSD-friendly behavior MaSM and SSD J J J – Limited number of writes per updates TPCH queries (on avg) 2.5 – No random writes on SSD Normalized execution time • Easy DBMS integration 2 • Ensure ACID properties 1.5 Thank you! 1 0.5 0 Query only Query w/ Query only + Ideal 26 MaSM updates Updates only

Recommend


More recommend