Insights of Approximate Query Processing Systems Presented by: Huanyi Chen Ruoxi Zhang
Agenda § Introduction § Background § VerdictDB & SnappyData § Experiment Setup § Evaluation § Insights Insights of Approximate Query Processing Systems PAGE 2
Why AQP? # of Day Income (CAD) 1 150 240 2 195.71 Avg(Income) 180 3 4 200 5 230 6 190 180 7 shop income Insights of Approximate Query Processing Systems PAGE 3
Why AQP? more efficient (50% rows) accuracy > 95% # of Day Income (CAD) 1 150 240 2 195.71 Avg(Income) 180 3 4 200 186.67 5 230 6 190 180 7 shop income Insights of Approximate Query Processing Systems PAGE 4
Why AQP? 99.9% Identical 100x-200x Faster Insights of Approximate Query Processing Systems PAGE 5
Sampling Based AQP Insights of Approximate Query Processing Systems PAGE 6
Sampling Based AQP Query Column Set (QCS) Insights of Approximate Query Processing Systems PAGE 7
• Spark Why SnappyData & VerdictDB ? • Open-source* Online/ Distributed/ Name Platform Algorithm Skewed Offline Standalone BlinkDB Offline Distributed Hive/Hadoop (Shark) Stratified sampling Yes Distribution-aware Online Sapprox Online Distributed Hadoop No sampling Approximation-enabled Approxhadoop Online Distributed Hadoop No MapReduce Quickr Online Distributed N/A ASALQA algorithm No Spark as a computational SnappyData Online Distributed Spark and GemFire engine; GemFire as No transactional store Mini-batch execution OLA FluoDB Online Distributed Spark No Model XDB Online Standalone PostgreSQL Wander join No VerdictDB Online Standalone Spark SQL Database learning No Reuse answers of past IDEA Online Standalone N/A overlapping queries for new No query BEAS Online Standalone Commercial DBMS Approximability theorem No ABS Online Standalone N/A Bootstrap No Insights of Approximate Query Processing Systems PAGE 8
SnappyData SDE is NOT open source Insights of Approximate Query Processing Systems PAGE 9
SnappyData + WITH ERROR QCS FRACTION Insights of Approximate Query Processing Systems PAGE 10
VerdictDB Insights of Approximate Query Processing Systems PAGE 11
VerdictDB Insights of Approximate Query Processing Systems PAGE 12
Experiment Setup § Cluster Setup § SnappyData: 1 locator, 1 lead, and 2 servers Insights of Approximate Query Processing Systems PAGE 13
Experiment Setup § Cluster Setup § SnappyData: 1 locator, 1 lead, and 2 servers § VerdictDB on Spark: 1 master and 2 executors § Each Node § 24/32 GB memory used § 500 GB HDD Insights of Approximate Query Processing Systems PAGE 14
Experiment Setup § TPC-H Benchmark § OLAP § 22 queries includes Aggregation, Join, etc. § Well known and standard § Customizable § Data § 1GB and 10GB § Uniformly distributed Insights of Approximate Query Processing Systems PAGE 15
Evaluation SnappyData • Stratified Sampling • In-memory VerdictDB • Uniform Sampling • Not in-memory (bug?) Insights of Approximate Query Processing Systems PAGE 16
SnappyData - Latency Execution time (ms) using TPC-H (SF=10, fraction 0.1) 35,000 Q1: Up to 3.6x speedup 29439 30,000 ~0.0001 Error 25,000 20,000 15,000 10,000 8092 6629 3870 5,000 1399 1832 0 Q1 Q6 Q14 SnappyData SnappyData_AQP (>95% accuracy) Insights of Approximate Query Processing Systems PAGE 17
SnappyData - Accuracy fraction 0.01 Base Table Actual Error for TPC-H Q14 result (SF=10) given different sample tables (fraction) 0.02 fraction 0.02 0.1 0.01 Sample Tables ... 0.01 0.00 Time (ms) for TPC-H Q14 result (SF=10) 0.00 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3 given different sample tables (fraction) 7,000 6,000 5,000 4,000 3,000 2,000 1,000 0 fraction fraction fraction fraction Snappy 0.01 0.1 0.2 0.3 Insights of Approximate Query Processing Systems PAGE 18
SnappyData- Creating Sample Tables Time (ms) for creating SnappyData sample tables with different fractions 250,000 200,000 150,000 100,000 50,000 0 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3 Insights of Approximate Query Processing Systems PAGE 19
VerdictDB - Latency Execution time (ms) using TPC-H (SF=10, fraction 0.1) 250,000 Up to ~ 11x speedup! 206034 200,000 150,000 99195 88912 100,000 50,000 24355 17598 15210 0 Q1 Q6 Q14 SparkSQL VerdictDB (> 95% accuracy) Insights of Approximate Query Processing Systems PAGE 20
VerdictDB - Speedup Speedup for TPC-H (SF=1, fraction=0.1) Speedup for TPC-H (SF=10, fraction=0.1) 12 14 12 10 10 8 8 6 6 4 4 2 2 0 0 Q1 Q6 Q14 Q1 Q6 Q14 Speedup Speedup Insights of Approximate Query Processing Systems PAGE 21
VerdictDB - Creating Sample Tables Time (ms) for creating VerdictDB sample tables with different fraction 900,000 800,000 700,000 600,000 500,000 400,000 300,000 200,000 100,000 0 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3 Insights of Approximate Query Processing Systems PAGE 22
VerdictDB - Accuracy fraction 0.01 Base Table Actual Error for TPC-H Q14 result (SF=10) given different sample tables (fraction) 0.25 fraction 0.1 0.2 0.15 Sample Tables ... 0.1 Time (ms) for TPC-H Q14 result (SF=10) 0.05 given different sample tables (fraction) 0 120,000 fraction fraction fraction fraction fraction 100,000 0.01 0.05 0.1 0.2 0.3 80,000 60,000 converge! 40,000 20,000 0 fraction fraction fraction 0.1 fraction 0.2fraction 0.3 SparkSQL 0.01 0.05 Insights of Approximate Query Processing Systems PAGE 23
Other Queries? Q14 Error: ~ 1.7% Speedup: ~1.7X Q19 Error: ~ 80% Speedup: ~5.5X Insights of Approximate Query Processing Systems PAGE 24
Other Queries? Key missing in sample tables! Careful design of sample table or original table! Q7 AQP not working! Insights of Approximate Query Processing Systems PAGE 25
Insights § AQP performs well: § For aggregate functions such as SUM, AVG and COUNT § When WHERE is simple § Users’ foreseen is important! § for both query and original table Insights of Approximate Query Processing Systems PAGE 26
Future Work § Test error estimation in sampling § Other sampling techniques § Biased Sampling § Database learning § Approximate hardware Insights of Approximate Query Processing Systems PAGE 27
Recommend
More recommend