Taster: Self-Tuning, Elastic and Online Approximate Query Processing

Matthaios Olma, Odysseas Papapetrou, Raja Appuswamy, Anastasia Ailamaki

Data Exploration vs. Data Preparation

Challenges of interactive data exploration
• Exploratory applications: dynamic & data-driven
– Scientific exploration
– “Internet of Things” analytics
• Interactive response time
• Instant access to data
• Reduce result precision – use AQP
• Building data summaries is expensive

Enable AQP with minimal pre-processing

Performance vs. flexibility

                     Offline AQP (e.g., BlinkDB)    Online AQP (e.g., Quickr)
Approach             Sample selection               Inject sampling into the query
Preparation          Pre-sampling                   No preprocessing
Workload knowledge   Required                       Not required
Storage overhead     0.5-2x                         None
Performance          ~10x                           ~2x

• Reduce intermediate data
• Reduce I/O
• Reduce CPU load

Reducing pre-processing time

11-node SparkSQL cluster, TPC-H (300GB), 200 queries (18 TPC-H query templates)

[Figure: cumulative time (hours) vs. query sequence for Baseline, Offline AQP (BlinkDB), and Online AQP (Quickr)]
• Sampling pays off after 85 queries
• Sampling pays off after 159 queries over online AQP

Ideal: no sampling preparation cost & interactive access

Enhancing Online Approx. Query Processing

• Reduce the amount of data accessed
– Materialize and re-use intermediate generated summaries
• Adapt materialized summaries to workload and storage budget
• Use a variety of summaries other than samples

Three decisions: what to materialize, if/when to materialize, if/when to evict

Materialize and re-use synopses

• Store all subplans and statistics in hitmap
– Update when subplans re-appear
• Calculate prospective gains (cost:benefit)
– Performance gains over future workload
– Storage cost
– Maximize benefit: knapsack constraint problem
• Decide to materialize
– Inject materializer operator
– Store intermediate result in-memory and flush offline

[Figure: query plan over tables A, B and C with injected sampler operators; summary S1 is stored in the summary warehouse]
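The slide frames the choice of what to materialize as a knapsack problem: maximize benefit under a storage constraint. A minimal sketch of that framing follows, assuming each candidate synopsis already has an estimated benefit and a size. The `Candidate` class, the `choose_synopses` function, the units, and the use of an exact dynamic-programming solver are all illustrative assumptions; the slides do not say how Taster solves the problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical description of one candidate synopsis."""
    name: str
    benefit: float  # assumed: estimated time saved over the expected future workload
    size_mb: int    # assumed: storage cost of the materialized synopsis


def choose_synopses(candidates, budget_mb):
    """Pick the subset with maximum total benefit that fits in budget_mb (0/1 knapsack)."""
    # best[b] = (total benefit, chosen names) using at most b MB of storage
    best = [(0.0, ())] * (budget_mb + 1)
    for c in candidates:
        # Walk budgets downward so each candidate is chosen at most once.
        for b in range(budget_mb, c.size_mb - 1, -1):
            value = best[b - c.size_mb][0] + c.benefit
            if value > best[b][0]:
                best[b] = (value, best[b - c.size_mb][1] + (c.name,))
    return best[budget_mb]


candidates = [
    Candidate("S1", benefit=60.0, size_mb=10),
    Candidate("S2", benefit=100.0, size_mb=20),
    Candidate("S3", benefit=120.0, size_mb=30),
]
total_benefit, chosen = choose_synopses(candidates, budget_mb=50)
```

With these made-up numbers the budget cannot hold all three candidates, so the solver keeps S2 and S3 and leaves S1 out even though S1 has the best benefit per megabyte. That is the case where an exact knapsack solution differs from a greedy ratio-based pick.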

Adapting materialized summaries

• Window-based prediction (example: w = 2)

[Figure: queries Q1 to Q6 and the useful summaries (S1, S2, S4) that are materialized into and used from the summary warehouse]

• Ideal window size depends on: user, task, data
– Keep statistics for (1-a)w, w, (1+a)w
• Adapt window size based on quality of predictions
• Abide by storage requirements despite workload shifts

Online tuning of window size improves forecast efficiency
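One way to read the window-tuning idea above is sketched below: score the three candidate window sizes (1-a)w, w and (1+a)w against the observed query trace and move to the best one. The scoring rule (one point per correct prediction, minus a penalty per predicted summary as a stand-in for storage pressure), the rounding of candidates, and the function names are assumptions made for illustration; the slides do not specify how prediction quality is measured.

```python
import math


def window_quality(trace, w, penalty=0.1):
    """Average score of predicting each query's summary from the previous w queries."""
    total = 0.0
    for i in range(1, len(trace)):
        predicted = set(trace[max(0, i - w):i])  # summaries used in the window
        hit = 1 if trace[i] in predicted else 0
        # Assumed quality metric: reward hits, charge for every summary kept.
        total += hit - penalty * len(predicted)
    return total / max(1, len(trace) - 1)


def adapt_window(trace, w, a=0.5, penalty=0.1):
    """Return whichever of (1-a)w, w, (1+a)w scored best on the trace so far."""
    smaller = max(1, math.floor((1 - a) * w))
    larger = max(1, math.ceil((1 + a) * w))
    candidates = sorted({smaller, w, larger})
    # max() keeps the first best candidate, so ties favor the smaller window.
    return max(candidates, key=lambda c: window_quality(trace, c, penalty))


# Hypothetical trace: the id of the summary each query could have used.
alternating = ["S1", "S2"] * 10
new_w = adapt_window(alternating, w=1)
```

On the alternating trace a window of 1 always predicts the wrong summary, so the sketch grows the window to 2. Repeating the call after each batch of queries gives a simple online tuning loop.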

Combining different data summaries

Sampling: all queries on a subset of the data
- Keep schema of original table
- Precision depends on query
- Uniform/Stratified sampling
- Answer large subset of queries
- Large size: ~10% of input
- I/O cost depending on size

Sketches: some queries on all the data
- Count/Sum/Avg
- Aggregations
- Single grouping attribute
- Answer specific queries
- Compact: ~KB
- Constant access time

Utilize each summary where useful
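The contrast above can be made concrete with a small sketch: a uniform sample keeps whole rows and can serve many queries, while a sketch keeps a fixed-size counter table and serves only per-key aggregates. A count-min sketch is used here as one well-known example of a sketch; the slides do not name the sketch type Taster uses, and the class, function, and data below are hypothetical.

```python
import hashlib
import random


class CountMinSketch:
    """Fixed-size counter table that answers per-key COUNT/SUM estimates only."""

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _slots(self, key):
        # One deterministic hash per row, derived from SHA-256.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, amount=1):
        # amount is assumed non-negative so estimates never undercount.
        for row, col in self._slots(key):
            self.table[row][col] += amount

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum is the tightest bound.
        return min(self.table[row][col] for row, col in self._slots(key))


def uniform_sample(rows, rate, seed=0):
    """Bernoulli sample that keeps full rows, so the original schema is preserved."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < rate]


# Hypothetical table with columns (country, amount).
rows = [("FR", 10.0), ("DE", 5.0), ("FR", 7.5), ("US", 1.0)] * 250

sketch = CountMinSketch()
for country, _ in rows:
    sketch.add(country)  # supports COUNT grouped by the single attribute `country`

sample = uniform_sample(rows, rate=0.1)
approx_fr_rows = sum(1 for country, _ in sample if country == "FR") / 0.1
```

The sketch occupies `width * depth` counters regardless of input size and is read in constant time, but it can only answer the aggregate it was built for. The sample grows with the input and the sampling rate, yet any filter or aggregate can be evaluated on it and scaled up.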

Taster Architecture

[Figure: an SQL query flows through Query Optimization and Query Execution, supported by an online tuner, a metadata store, and a synopsis warehouse next to the data]

- Inject approximation operators into plans
- Re-use existing materialized synopses
- Choose which synopsis to generate
- Metadata store: store statistics about the historical plans
- Synopsis warehouse: store the synopsis over HDFS

Experimental Setup

Datasets
- TPC-H: sf300 (300GB), 18 query templates

Systems
- SparkSQL (2.1.0)
- BlinkDB, Quickr, Taster over SparkSQL (2.1.0)

Hardware
- 11 nodes x 2 x Intel Xeon X5660 CPU @ 2.80GHz, 48GB RAM, 10GbE (fix for each)

End-to-End execution time

11-node SparkSQL cluster, TPC-H sf300, 200 queries (18 TPC-H templates)

[Figure: execution time (min), split into offline sampling and query execution, for Baseline, Quickr, BlinkDB (50%), Taster (50%), BlinkDB (100%), and Taster (100%)]

Taster offers comparable execution time to state-of-the-art

Adapting to shifting workload

11-node SparkSQL cluster, TPC-H sf300, 80 queries (18 TPC-H templates)

[Figure: execution time (min) vs. query sequence]

Taster adapts efficiently to changes in workload

Take home message

• Piggy-back the creation of summaries over the query execution
– In the context of distributed approximate query processing
• Adapt data summaries to workload shifts and reduce storage budget
• Provide query performance comparable to offline AQP approaches
– With reduced building and storage cost

Thank you!
