insights of approximate query processing systems
play

Insights of Approximate Query Processing Systems Presented by: - PowerPoint PPT Presentation

Insights of Approximate Query Processing Systems Presented by: Huanyi Chen Ruoxi Zhang Agenda Introduction Background VerdictDB & SnappyData Experiment Setup Evaluation Insights Insights of Approximate Query Processing


  1. Insights of Approximate Query Processing Systems Presented by: Huanyi Chen Ruoxi Zhang

  2. Agenda § Introduction § Background § VerdictDB & SnappyData § Experiment Setup § Evaluation § Insights Insights of Approximate Query Processing Systems PAGE 2

  3. Why AQP? # of Day Income (CAD) 1 150 240 2 195.71 Avg(Income) 180 3 4 200 5 230 6 190 180 7 shop income Insights of Approximate Query Processing Systems PAGE 3

  4. Why AQP? more efficient (50% rows) accuracy > 95% # of Day Income (CAD) 1 150 240 2 195.71 Avg(Income) 180 3 4 200 186.67 5 230 6 190 180 7 shop income Insights of Approximate Query Processing Systems PAGE 4

  5. Why AQP? 99.9% Identical 100x-200x Faster Insights of Approximate Query Processing Systems PAGE 5

  6. Sampling Based AQP Insights of Approximate Query Processing Systems PAGE 6

  7. Sampling Based AQP Query Column Set (QCS) Insights of Approximate Query Processing Systems PAGE 7

  8. • Spark Why SnappyData & VerdictDB ? • Open-source* Online/ Distributed/ Name Platform Algorithm Skewed Offline Standalone BlinkDB Offline Distributed Hive/Hadoop (Shark) Stratified sampling Yes Distribution-aware Online Sapprox Online Distributed Hadoop No sampling Approximation-enabled Approxhadoop Online Distributed Hadoop No MapReduce Quickr Online Distributed N/A ASALQA algorithm No Spark as a computational SnappyData Online Distributed Spark and GemFire engine; GemFire as No transactional store Mini-batch execution OLA FluoDB Online Distributed Spark No Model XDB Online Standalone PostgreSQL Wander join No VerdictDB Online Standalone Spark SQL Database learning No Reuse answers of past IDEA Online Standalone N/A overlapping queries for new No query BEAS Online Standalone Commercial DBMS Approximability theorem No ABS Online Standalone N/A Bootstrap No Insights of Approximate Query Processing Systems PAGE 8

  9. SnappyData SDE is NOT open source Insights of Approximate Query Processing Systems PAGE 9

  10. SnappyData + WITH ERROR QCS FRACTION Insights of Approximate Query Processing Systems PAGE 10

  11. VerdictDB Insights of Approximate Query Processing Systems PAGE 11

  12. VerdictDB Insights of Approximate Query Processing Systems PAGE 12

  13. Experiment Setup § Cluster Setup § SnappyData: 1 locator, 1 lead, and 2 servers Insights of Approximate Query Processing Systems PAGE 13

  14. Experiment Setup § Cluster Setup § SnappyData: 1 locator, 1 lead, and 2 servers § VerdictDB on Spark: 1 master and 2 executors § Each Node § 24/32 GB memory used § 500 GB HDD Insights of Approximate Query Processing Systems PAGE 14

  15. Experiment Setup § TPC-H Benchmark § OLAP § 22 queries includes Aggregation, Join, etc. § Well known and standard § Customizable § Data § 1GB and 10GB § Uniformly distributed Insights of Approximate Query Processing Systems PAGE 15

  16. Evaluation SnappyData • Stratified Sampling • In-memory VerdictDB • Uniform Sampling • Not in-memory (bug?) Insights of Approximate Query Processing Systems PAGE 16

  17. SnappyData - Latency Execution time (ms) using TPC-H (SF=10, fraction 0.1) 35,000 Q1: Up to 3.6x speedup 29439 30,000 ~0.0001 Error 25,000 20,000 15,000 10,000 8092 6629 3870 5,000 1399 1832 0 Q1 Q6 Q14 SnappyData SnappyData_AQP (>95% accuracy) Insights of Approximate Query Processing Systems PAGE 17

  18. SnappyData - Accuracy fraction 0.01 Base Table Actual Error for TPC-H Q14 result (SF=10) given different sample tables (fraction) 0.02 fraction 0.02 0.1 0.01 Sample Tables ... 0.01 0.00 Time (ms) for TPC-H Q14 result (SF=10) 0.00 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3 given different sample tables (fraction) 7,000 6,000 5,000 4,000 3,000 2,000 1,000 0 fraction fraction fraction fraction Snappy 0.01 0.1 0.2 0.3 Insights of Approximate Query Processing Systems PAGE 18

  19. SnappyData- Creating Sample Tables Time (ms) for creating SnappyData sample tables with different fractions 250,000 200,000 150,000 100,000 50,000 0 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3 Insights of Approximate Query Processing Systems PAGE 19

  20. VerdictDB - Latency Execution time (ms) using TPC-H (SF=10, fraction 0.1) 250,000 Up to ~ 11x speedup! 206034 200,000 150,000 99195 88912 100,000 50,000 24355 17598 15210 0 Q1 Q6 Q14 SparkSQL VerdictDB (> 95% accuracy) Insights of Approximate Query Processing Systems PAGE 20

  21. VerdictDB - Speedup Speedup for TPC-H (SF=1, fraction=0.1) Speedup for TPC-H (SF=10, fraction=0.1) 12 14 12 10 10 8 8 6 6 4 4 2 2 0 0 Q1 Q6 Q14 Q1 Q6 Q14 Speedup Speedup Insights of Approximate Query Processing Systems PAGE 21

  22. VerdictDB - Creating Sample Tables Time (ms) for creating VerdictDB sample tables with different fraction 900,000 800,000 700,000 600,000 500,000 400,000 300,000 200,000 100,000 0 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3 Insights of Approximate Query Processing Systems PAGE 22

  23. VerdictDB - Accuracy fraction 0.01 Base Table Actual Error for TPC-H Q14 result (SF=10) given different sample tables (fraction) 0.25 fraction 0.1 0.2 0.15 Sample Tables ... 0.1 Time (ms) for TPC-H Q14 result (SF=10) 0.05 given different sample tables (fraction) 0 120,000 fraction fraction fraction fraction fraction 100,000 0.01 0.05 0.1 0.2 0.3 80,000 60,000 converge! 40,000 20,000 0 fraction fraction fraction 0.1 fraction 0.2fraction 0.3 SparkSQL 0.01 0.05 Insights of Approximate Query Processing Systems PAGE 23

  24. Other Queries? Q14 Error: ~ 1.7% Speedup: ~1.7X Q19 Error: ~ 80% Speedup: ~5.5X Insights of Approximate Query Processing Systems PAGE 24

  25. Other Queries? Key missing in sample tables! Careful design of sample table or original table! Q7 AQP not working! Insights of Approximate Query Processing Systems PAGE 25

  26. Insights § AQP performs well: § For aggregate functions such as SUM, AVG and COUNT § When WHERE is simple § Users’ foreseen is important! § for both query and original table Insights of Approximate Query Processing Systems PAGE 26

  27. Future Work § Test error estimation in sampling § Other sampling techniques § Biased Sampling § Database learning § Approximate hardware Insights of Approximate Query Processing Systems PAGE 27

Recommend


More recommend