dbms data loading
play

DBMS Data Loading: An Analysis on Modern Hardware Adam Dziedzic, - PowerPoint PPT Presentation

DBMS Data Loading: An Analysis on Modern Hardware Adam Dziedzic, Manos Karpathiotakis* , Ioannis Alagiannis, Raja Appuswamy, Anastasia Ailamaki Data loading: A necessary evil Volume => Expensive Top query performance 40 zettabytes by


  1. DBMS Data Loading: An Analysis on Modern Hardware Adam Dziedzic, Manos Karpathiotakis* , Ioannis Alagiannis, Raja Appuswamy, Anastasia Ailamaki

  2. Data loading: A necessary evil  Volume => Expensive  Top query performance 40 zettabytes by 2020*  Velocity => Continuous  ACID guarantees Fresh data = Interesting data ₸ * [IDC12] ₸ Abad [IISWC12] Data loading is a persistent analysis bottleneck 2

  3. Loading a DBMS Convert Read Load How does hardware affect loading? 3

  4. Loading a DBMS Convert Read Load How does hardware affect loading? 3

  5. Experimental setup  Hardware – Dual socket 8 cores Intel(R) Xeon(R) CPU E5-2640 – 64 GB RAM – HDD: 4 x 500 GB 7.5k RPM SATA disks – SSD: 3 x 200GB SSD disks – DAS: 24 x 500 GB 7.5k RPM SATA disks • Software – PostgreSQL, DBMS-R – MonetDB, DBMS-C – PostgreSQL parallel external loader ( “PCOPY” ) • Benchmarks & Real-world Datasets 4

  6. Single-threaded data loading [Input storage: HDD Destination storage: DAS] Symantec Loading Time TPC-H Loading Time 7500 7500 DBMS-C Loading Time (sec) Loading Time (sec) 6000 6000 PostgreSQL MonetDB 4500 4500 DBMS-R 3000 3000 1500 1500 0 0 0 25 50 75 100 0 25 50 75 100 File size (GB) File size (GB) Dataset characteristics matter Effect of compression 5

  7. Parallel data loading Input storage: HDD - Destination storage: DAS 16 threads TPC-H Loading Time Symantec Loading Time 3000 3000 DBMS-C Loading Time (sec) Loading Time (sec) MonetDB 2000 2000 DBMS-R PCOPY 1000 1000 0 0 0 25 50 75 100 0 25 50 75 100 File size (GB) File size (GB) Speedup – 16 threads DBMS-R PCOPY MonetDB DBMS-C TPC-H 100GB 1.25 2.77 1.72 2.84 Symantec 100GB 0.87 1.9 2.1 - Sublinear speedup for 16 threads 6

  8. Resource Utilization DBMS-R, TPC-H SF10 Write BW In: HDD - Out: DAS 100 Utilzation (%) 80 60 40 20 0 0 25 50 75 100 Time (sec) Unable to saturate resources 7

  9. Resource Utilization DBMS-R, TPC-H SF10 Write BW In: HDD - Out: DAS 100 Utilzation (%) 80 60 40 20 0 0 25 50 75 100 Time (sec) Unable to saturate resources 7

  10. Resource Utilization DBMS-R, TPC-H SF10 Read BW Write BW In: HDD - Out: DAS 100 Utilzation (%) 80 60 40 20 0 0 25 50 75 100 Time (sec) Unable to saturate resources 7

  11. Resource Utilization DBMS-R, TPC-H SF10 CPU Read BW Write BW In: HDD - Out: DAS 100 Utilzation (%) 80 60 40 20 0 0 25 50 75 100 Time (sec) Unable to saturate resources 7

  12. Resource Utilization 30 DBMS-R MonetDB 25 I/O Wait (%) 20 15 10 5 0 0 25 50 75 100 DBMS-R, TPC-H SF10 CPU Read BW Write BW In: HDD - Out: DAS 100 Utilzation (%) 80 60 40 20 0 0 25 50 75 100 Time (sec) Unable to saturate resources 7

  13. Read patterns [TPC-H SF10 Input storage: HDD Destination storage: DAS] MonetDB DBMS-R Block Address Block Address Elapsed Time Elapsed Time Random I/O causes underutilization 8

  14. Serial reader vs. Parallel readers [TPC-H SF10 Input storage: HDD Destination storage: DAS] Serial Reader PCOPY read utilization 100 Read Utilzation (%) Parallel Readers 80 60 40 20 0 0 10 20 30 40 50 60 70 80 90 100 Time (sec) Serial reader improves read utilization # readers depends on input device speed 9

  15. Impact of storage 10

  16. Impact of storage [TPC-H SF10] Slow input storage Varying input storage HDD Source Storage DAS Destination Storage Loading Time (sec) Loading Time (sec) 400 400 HDD to DAS HDD to DAS HDD to SSD SSD to DAS 300 300 HDD to ramfs ramfs to DAS 200 200 100 100 0 0 PCOPY DBMS-R DBMS-C PCOPY DBMS-R DBMS-C Slow source storage bottlenecks all systems Write bottleneck when source storage is fast 11

  17. Best-case storage scenario [TPC-H SF10] DBMS-R ramfs Source Storage In: ramfs – Out: ramfs ramfs to HDD ramfs to DAS ramfs to SSD ramfs to ramfs 100 400 Loading Time (sec) CPU Utilization (%) 80 300 60 200 40 100 20 0 0 PCOPY DBMS-R DBMS-C 0 5 10 15 20 25 30 Time (sec) Device Bandwidth: 12.8 GB/sec Read Rate: 250 MB/sec 100% CPU utilization, yet B/W still underutilized 12

  18. Data loading: Where does time go? [10 int columns; 10GB] 100% Other Write data 80% CPU breakdown (%) Logging 60% Tuple creation 40% Conversion Tokenizing 20% Parsing 0% PostgreSQL MonetDB Parsing, conversion, tokenization hotspots 13

  19. Reducing data loading overheads Reduced NoDB RAW Accesses Data Vaults SDS/Q Optimized Code Path Instant Loading GPUs HW FPGAs 14

  20. Bulk loading on modern hardware • General case: Resource under-utilization • Slow destination storage matters • Complex code paths bound max speed 15

  21. Bulk loading on modern hardware • General case: Resource under-utilization • Slow destination storage matters • Complex code paths bound max speed Thank You! Questions? 15

  22. Backup Slides 16

  23. 50x data growth from 2010 to 2020 [IDC2012] Can DBMS keep up with data growth? 23

  24. Storage Characteristics Name Capacity Configuration Read Speed Write Speed RPM HDD 2TB 4 x HDD (RAID-0) 170 MB/s 160 MB/s 7.5K DAS 12TB 24 x HDD (RAID-0) 1100 MB/s 330 MB/s 7.5K SSD 600GB 3 x SSD (RAID-0) 565 MB/s 268 MB/s n/a 24

  25. Parallel data loading – 16 threads [Input storage: HDD Destination storage: DAS] Symantec TPC-H 4 16 PCOPY MonetDB DBMS-R DBMS-C 3 12 Speedup Speedup 2 8 1 4 0 0 0 25 50 75 100 0 25 50 75 100 File size (GB) File size (GB) Sublinear speedup for 16x DoP 7

  26. Single-threaded loading – Extra datasets Input storage: HDD Destination storage: DAS TPC-C Loading Time SDSS Loading Time 7500 7500 PostgreSQL Loading Time (sec) Loading Time (sec) 6000 6000 MonetDB DBMS-R 4500 4500 DBMS-C 3000 3000 1500 1500 0 0 0 25 50 75 100 0 25 50 75 100 File size (GB) File size (GB) Column stores invest in compression 26

  27. Parallel data loading – Extra datasets Input storage: HDD Destination storage: DAS SDSS Loading Time TPC-C Loading Time 3000 4000 PCOPY Loading Time (sec) Loading Time (sec) MonetDB 3000 2000 DBMS-R 2000 DBMS-C 1000 1000 0 0 0 25 50 75 100 0 25 50 75 100 File size (GB) File size (GB) 27

  28. Parallel data loading – Extra datasets Input storage: HDD Destination storage: DAS SDSS TPC-C 16 4 PostgreSQL MonetDB DBMS-R DBMS-C 12 3 Speedup Speedup 8 2 4 1 0 0 0 25 50 75 100 0 25 50 75 100 File size (GB) File size (GB) 28

  29. The effect of compression [10GB] DB size / input file Name TPC-H TPC-C SDSS Symantec DBMS-R 1.5 1.3 1.5 1.5 PostgreSQL 1.4 1.4 1.4 1.1 DBMS-C 0.27 0.82 0.18 0.25 MonetDB 1.1 1.4 1.0 0.92 Column stores: Reduced footprint favors OLAP 29

  30. Resource Utilization 30 PCOPY DBMS-R MonetDB DBMS-C 25 I/O Wait (%) 20 15 10 5 0 0 25 50 75 100 DBMS-R, TPC-H SF10 CPU Read BW Write BW In: HDD - Out: DAS 100 Utilzation (%) 80 60 40 20 0 0 25 50 75 100 Time (sec) Unable to saturate resources 8

  31. MonetDB utilization [Data: TPCH – SF10 Input storage: HDD Destination storage: DAS] 100 Utilzation (%) 80 60 40 20 0 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 Time (sec) 31

  32. PCOPY utilization [Data: TPCH – SF10 Input storage: HDD Destination storage: DAS] 100 CPU Read BW Utilzation (%) 80 60 40 20 0 0 10 20 30 40 50 60 70 80 90 100 110 Time (sec) 32

  33. DBMS-C utilization [Data: TPCH – SF10 Input storage: HDD Destination storage: DAS] 100 Utilzation (%) 80 60 40 20 0 0 25 50 75 100 125 150 175 200 225 Time (sec) 33

  34. DBMS-C read patterns 34

  35. Reducing data loading overheads • In situ querying [SIGMOD12, VLDB14] • Data Vaults: Exploit metadata [Ivanova12,Kargin15] • Instant Loading: SIMD & Code gen. [Muehlbauer13] • Accelerators (FPGAs, GPUs) 35

Recommend


More recommend