Insights of Approximate Query Processing Systems Presented by: - PowerPoint PPT Presentation

Insights of Approximate Query Processing Systems Presented by: Huanyi Chen Ruoxi Zhang

Agenda § Introduction § Background § VerdictDB & SnappyData § Experiment Setup § Evaluation § Insights Insights of Approximate Query Processing Systems PAGE 2

Why AQP? # of Day Income (CAD) 1 150 240 2 195.71 Avg(Income) 180 3 4 200 5 230 6 190 180 7 shop income Insights of Approximate Query Processing Systems PAGE 3

Why AQP? more efficient (50% rows) accuracy > 95% # of Day Income (CAD) 1 150 240 2 195.71 Avg(Income) 180 3 4 200 186.67 5 230 6 190 180 7 shop income Insights of Approximate Query Processing Systems PAGE 4

Why AQP? 99.9% Identical 100x-200x Faster Insights of Approximate Query Processing Systems PAGE 5

Sampling Based AQP Insights of Approximate Query Processing Systems PAGE 6

Sampling Based AQP Query Column Set (QCS) Insights of Approximate Query Processing Systems PAGE 7

• Spark Why SnappyData & VerdictDB ? • Open-source* Online/ Distributed/ Name Platform Algorithm Skewed Offline Standalone BlinkDB Offline Distributed Hive/Hadoop (Shark) Stratified sampling Yes Distribution-aware Online Sapprox Online Distributed Hadoop No sampling Approximation-enabled Approxhadoop Online Distributed Hadoop No MapReduce Quickr Online Distributed N/A ASALQA algorithm No Spark as a computational SnappyData Online Distributed Spark and GemFire engine; GemFire as No transactional store Mini-batch execution OLA FluoDB Online Distributed Spark No Model XDB Online Standalone PostgreSQL Wander join No VerdictDB Online Standalone Spark SQL Database learning No Reuse answers of past IDEA Online Standalone N/A overlapping queries for new No query BEAS Online Standalone Commercial DBMS Approximability theorem No ABS Online Standalone N/A Bootstrap No Insights of Approximate Query Processing Systems PAGE 8

SnappyData SDE is NOT open source Insights of Approximate Query Processing Systems PAGE 9

SnappyData + WITH ERROR QCS FRACTION Insights of Approximate Query Processing Systems PAGE 10

VerdictDB Insights of Approximate Query Processing Systems PAGE 11

VerdictDB Insights of Approximate Query Processing Systems PAGE 12

Experiment Setup § Cluster Setup § SnappyData: 1 locator, 1 lead, and 2 servers Insights of Approximate Query Processing Systems PAGE 13

Experiment Setup § Cluster Setup § SnappyData: 1 locator, 1 lead, and 2 servers § VerdictDB on Spark: 1 master and 2 executors § Each Node § 24/32 GB memory used § 500 GB HDD Insights of Approximate Query Processing Systems PAGE 14

Experiment Setup § TPC-H Benchmark § OLAP § 22 queries includes Aggregation, Join, etc. § Well known and standard § Customizable § Data § 1GB and 10GB § Uniformly distributed Insights of Approximate Query Processing Systems PAGE 15

Evaluation SnappyData • Stratified Sampling • In-memory VerdictDB • Uniform Sampling • Not in-memory (bug?) Insights of Approximate Query Processing Systems PAGE 16

SnappyData - Latency Execution time (ms) using TPC-H (SF=10, fraction 0.1) 35,000 Q1: Up to 3.6x speedup 29439 30,000 ~0.0001 Error 25,000 20,000 15,000 10,000 8092 6629 3870 5,000 1399 1832 0 Q1 Q6 Q14 SnappyData SnappyData_AQP (>95% accuracy) Insights of Approximate Query Processing Systems PAGE 17

SnappyData - Accuracy fraction 0.01 Base Table Actual Error for TPC-H Q14 result (SF=10) given different sample tables (fraction) 0.02 fraction 0.02 0.1 0.01 Sample Tables ... 0.01 0.00 Time (ms) for TPC-H Q14 result (SF=10) 0.00 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3 given different sample tables (fraction) 7,000 6,000 5,000 4,000 3,000 2,000 1,000 0 fraction fraction fraction fraction Snappy 0.01 0.1 0.2 0.3 Insights of Approximate Query Processing Systems PAGE 18

SnappyData- Creating Sample Tables Time (ms) for creating SnappyData sample tables with different fractions 250,000 200,000 150,000 100,000 50,000 0 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3 Insights of Approximate Query Processing Systems PAGE 19

VerdictDB - Latency Execution time (ms) using TPC-H (SF=10, fraction 0.1) 250,000 Up to ~ 11x speedup! 206034 200,000 150,000 99195 88912 100,000 50,000 24355 17598 15210 0 Q1 Q6 Q14 SparkSQL VerdictDB (> 95% accuracy) Insights of Approximate Query Processing Systems PAGE 20

VerdictDB - Speedup Speedup for TPC-H (SF=1, fraction=0.1) Speedup for TPC-H (SF=10, fraction=0.1) 12 14 12 10 10 8 8 6 6 4 4 2 2 0 0 Q1 Q6 Q14 Q1 Q6 Q14 Speedup Speedup Insights of Approximate Query Processing Systems PAGE 21

VerdictDB - Creating Sample Tables Time (ms) for creating VerdictDB sample tables with different fraction 900,000 800,000 700,000 600,000 500,000 400,000 300,000 200,000 100,000 0 fraction 0.01 fraction 0.1 fraction 0.2 fraction 0.3 Insights of Approximate Query Processing Systems PAGE 22

VerdictDB - Accuracy fraction 0.01 Base Table Actual Error for TPC-H Q14 result (SF=10) given different sample tables (fraction) 0.25 fraction 0.1 0.2 0.15 Sample Tables ... 0.1 Time (ms) for TPC-H Q14 result (SF=10) 0.05 given different sample tables (fraction) 0 120,000 fraction fraction fraction fraction fraction 100,000 0.01 0.05 0.1 0.2 0.3 80,000 60,000 converge! 40,000 20,000 0 fraction fraction fraction 0.1 fraction 0.2fraction 0.3 SparkSQL 0.01 0.05 Insights of Approximate Query Processing Systems PAGE 23

Other Queries? Q14 Error: ~ 1.7% Speedup: ~1.7X Q19 Error: ~ 80% Speedup: ~5.5X Insights of Approximate Query Processing Systems PAGE 24

Other Queries? Key missing in sample tables! Careful design of sample table or original table! Q7 AQP not working! Insights of Approximate Query Processing Systems PAGE 25

Insights § AQP performs well: § For aggregate functions such as SUM, AVG and COUNT § When WHERE is simple § Users’ foreseen is important! § for both query and original table Insights of Approximate Query Processing Systems PAGE 26

Future Work § Test error estimation in sampling § Other sampling techniques § Biased Sampling § Database learning § Approximate hardware Insights of Approximate Query Processing Systems PAGE 27

Insights of Approximate Query Processing Systems Presented by: - PowerPoint PPT Presentation

Insights of Approximate Query Processing Systems Presented by: Huanyi Chen Ruoxi Zhang Agenda Introduction Background VerdictDB & SnappyData Experiment Setup Evaluation Insights Insights of Approximate Query Processing

Chapter 3: Top-k Query Processing and Indexing 3.1 Top-k Algorithms 3.2 Approximate Top-k Query

Improve Query Performance with the Query Log Analyzer Kees Vegter Field Engineer Query Log

CS4224/CS5424 Lecture 9 Distributed Query Processing Query Processing Translates query into a

Query Processing Relevance feedback; query expansion; Web Search 1 Overview Indexes Query

Query Execution 2 and Query Optimization Instructor: Matei Zaharia cs245.stanford.edu Query

Query Understanding: A Manifesto Daniel Tunkelang queryunderstanding.com Overview What is

Perfect Query FORMULA 5 critical sections in every successful query letter (c) 2019

Query Op)miza)on 1 Query op)miza)on Given an SQL query,

CAS CS 460/660 Introduction to Database Systems Query Evaluation II 1.1 Cost-based Query

Database Systems II Query Compiler CMPT 454, Simon Fraser University, Fall 2009, Martin Ester

CSE 232A Graduate Database Systems Arun Kumar Topic 4: Query Optimization Chapters 12 and

Approximate Computing Is Dead; Long Live Approximate Computing Adrian Sampson Cornell Hardware

Approximate Nearest Neighbors Search Approximate Nearest Neighbors Search in High Dimensions in

Online Query Processing Exposure to online query processing algorithms and fundamentals A

V.3 Query Processing 1. Term-at-a-Time 2. Document-at-a-Time 3. WAND 4. Quit & Continue 5.

Efficient query processing Efficient scoring, distributed query processing Web Search 1 Ranking

Taster: Self-Tuning , Elastic and Online Approximate Query Processing Matthaios Olma Odysseas

Asymmetric Price Transmission in the Brazilian Rice Market: A review of methodologies Jacques

CSE 521S Final Review Set up and test your demo in

Solar Apartments Opportunities for deploying PV on multi-occupancy residential buildings Mike

Database Learning: Toward a Database that Becomes Smarter Over Time Yongjoo Park Our Goal: reuse

Database Learning Yongjoo Park Our Goal: reuse the work. Users Database query Answer to query

Speeding Up Data Science: From a Data Management Perspective Jiannan Wang Database System Lab

Incremental and Approximate Inference for Faster Occlusion-based Deep CNN Explanations Supun

Insights of Approximate Query Processing Systems Presented by: - PowerPoint PPT Presentation

Insights of Approximate Query Processing Systems Presented by: Huanyi Chen Ruoxi Zhang Agenda Introduction Background VerdictDB & SnappyData Experiment Setup Evaluation Insights Insights of Approximate Query Processing

Chapter 3: Top-k Query Processing and Indexing 3.1 Top-k Algorithms 3.2 Approximate Top-k Query

Improve Query Performance with the Query Log Analyzer Kees Vegter Field Engineer Query Log

CS4224/CS5424 Lecture 9 Distributed Query Processing Query Processing Translates query into a

Query Processing Relevance feedback; query expansion; Web Search 1 Overview Indexes Query

Query Execution 2 and Query Optimization Instructor: Matei Zaharia cs245.stanford.edu Query

Query Understanding: A Manifesto Daniel Tunkelang queryunderstanding.com Overview What is

Perfect Query FORMULA 5 critical sections in every successful query letter (c) 2019

Query Op)miza)on 1 Query op)miza)on Given an SQL query,

CAS CS 460/660 Introduction to Database Systems Query Evaluation II 1.1 Cost-based Query

Database Systems II Query Compiler CMPT 454, Simon Fraser University, Fall 2009, Martin Ester

CSE 232A Graduate Database Systems Arun Kumar Topic 4: Query Optimization Chapters 12 and

Approximate Computing Is Dead; Long Live Approximate Computing Adrian Sampson Cornell Hardware

Approximate Nearest Neighbors Search Approximate Nearest Neighbors Search in High Dimensions in

Online Query Processing Exposure to online query processing algorithms and fundamentals A

V.3 Query Processing 1. Term-at-a-Time 2. Document-at-a-Time 3. WAND 4. Quit &amp; Continue 5.

Efficient query processing Efficient scoring, distributed query processing Web Search 1 Ranking

Taster: Self-Tuning , Elastic and Online Approximate Query Processing Matthaios Olma Odysseas

Asymmetric Price Transmission in the Brazilian Rice Market: A review of methodologies Jacques

CSE 521S Final Review Set up and test your demo in

Solar Apartments Opportunities for deploying PV on multi-occupancy residential buildings Mike

Database Learning: Toward a Database that Becomes Smarter Over Time Yongjoo Park Our Goal: reuse

Database Learning Yongjoo Park Our Goal: reuse the work. Users Database query Answer to query

Speeding Up Data Science: From a Data Management Perspective Jiannan Wang Database System Lab

Incremental and Approximate Inference for Faster Occlusion-based Deep CNN Explanations Supun

V.3 Query Processing 1. Term-at-a-Time 2. Document-at-a-Time 3. WAND 4. Quit & Continue 5.