Taster: Self-Tuning, Elastic and Online Approximate Query Processing
Matthaios Olma, Odysseas Papapetrou, Raja Appuswamy, Anastasia Ailamaki

Data Exploration vs. Data Preparation
Challenges of interactive data exploration
Exploratory applications: dynamic & data-driven
- Scientific exploration
- “Internet of Things” analytics
Interactive response time
- Reduce result precision – use AQP
Instant access to data
- Building data summaries is expensive
Enable AQP with minimal pre-processing

Performance vs. flexibility
Offline AQP (e.g., BlinkDB): pre-sampling from the workload, then sample selection per query
- Reduce I/O
- Workload knowledge required
- 0.5-2x storage overhead
- ~10x performance
Online AQP (e.g., Quickr): inject sampling into the query plan
- Reduce intermediate data
- Reduce CPU load
- No preprocessing
- No workload knowledge
- No storage overhead
- ~2x performance
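
To make the online approach concrete, here is a minimal sketch of a sampler applied at query time. It assumes a uniform (Bernoulli) sampler with rate `p` and a SUM aggregate scaled by `1/p`; the function names and the data are hypothetical, and this is not Quickr's or Taster's actual operator.

```python
import random

def bernoulli_sample(rows, p, seed=0):
    """Keep each row independently with probability p (uniform sampler)."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < p]

def estimate_sum(sample, p):
    """Scale the sample sum by 1/p to get an unbiased estimate of the full sum."""
    return sum(sample) / p

# Hypothetical column of 100,000 values, sampled at a 10% rate.
rows = list(range(1, 100_001))
p = 0.1
sample = bernoulli_sample(rows, p)

exact = sum(rows)
approx = estimate_sum(sample, p)
rel_error = abs(approx - exact) / exact
print(f"rows read by the aggregate: {len(sample)} of {len(rows)}")
print(f"relative error of SUM: {rel_error:.4f}")
```

The aggregate processes roughly a tenth of the rows, which is where the reduction in intermediate data and CPU load comes from; the scan itself still reads every row, so I/O is not reduced.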

Reducing pre-processing time
11-node SparkSQL cluster, TPC-H (300GB), 200 queries (18 TPC-H query templates)
[Figure: cumulative time (hours) vs. query sequence (0-200) for Baseline, Offline AQP (BlinkDB), and Online AQP (Quickr). Annotations: "Sampling pays off after 85 queries"; "Sampling pays off after 159 queries over online AQP"]
Ideal: no sampling preparation cost & interactive access

Enhancing Online Approx. Query Processing
• Reduce the amount of data accessed
 – Materialize and re-use intermediate generated summaries
• Adapt materialized summaries to workload and storage budget
• Use a variety of summaries other than samples
Three decisions: what to materialize, if/when to materialize, if/when to evict

Materialize and re-use synopses
• Store all subplans and statistics in hitmap
 – Update when subplans re-appear
• Calculate prospective gains (cost:benefit)
 - Performance gains over future workload
 - Storage cost
 - Maximize benefit – Knapsack constraint problem
• Decide to materialize
 - Inject materializer operator
 - Store intermediate result in-memory and flush offline
[Figure: query plan over tables A, B, C with σ, ⨝, sampler and Γ operators; summary S1 written to the summary warehouse]
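
The "maximize benefit under a storage budget" step can be illustrated as a 0/1 knapsack. The sketch below is an assumption about how such a selection could look, not Taster's actual algorithm: the `Candidate` type, the `choose_synopses` function, and all benefit and size numbers are hypothetical.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    benefit: float  # assumed: estimated time saved over the predicted workload
    size: int       # assumed: storage cost in integer units (e.g., MB)

def choose_synopses(candidates, budget):
    """0/1 knapsack by dynamic programming over the storage budget.

    best[b] holds (total benefit, chosen names) using at most b units.
    """
    best = [(0.0, ())] * (budget + 1)
    for c in candidates:
        # Iterate budgets downwards so each candidate is used at most once.
        for b in range(budget, c.size - 1, -1):
            benefit = best[b - c.size][0] + c.benefit
            if benefit > best[b][0]:
                best[b] = (benefit, best[b - c.size][1] + (c.name,))
    return best[budget]

# Made-up candidates and budget, for illustration only.
candidates = [
    Candidate("S1", benefit=60.0, size=10),
    Candidate("S2", benefit=100.0, size=20),
    Candidate("S4", benefit=120.0, size=30),
]
total, chosen = choose_synopses(candidates, budget=50)
print(total, chosen)
```

With these numbers the best choice is S2 and S4: S1 has the highest benefit per unit of storage, yet including it leaves room for only one of the other two.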

Adapting materialized summaries
• Window-based prediction (w = 2)
[Figure: query sequence Q1-Q6; summaries S1, S2, S4 are materialized into the summary warehouse and re-used by later queries; the useful summaries are derived from the last w queries]
• Ideal window size depends on: user, task, data
 - Keep statistics for (1-a)w, w, (1+a)w
• Adapt window size based on quality of predictions
• Abide by storage requirements despite workload shifts
Online tuning of window size improves forecast efficiency
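
The window tuning can be sketched as follows. The slide gives neither the prediction rule nor the quality metric, so this sketch assumes both: a summary is predicted useful if one of the last `w` queries used it, and quality is the fraction of queries whose summaries were all predicted. All names are hypothetical.

```python
def predict(history, w):
    """Assumed rule: summaries used by the last w queries are predicted useful."""
    predicted = set()
    for used in history[-w:]:
        predicted |= used
    return predicted

def hit_rate(history, w):
    """Fraction of queries whose summaries were all predicted by window w."""
    if len(history) < 2:
        return 0.0
    hits = sum(
        1 for i in range(1, len(history))
        if history[i] <= predict(history[:i], w)
    )
    return hits / (len(history) - 1)

def adapt_window(history, w, a):
    """Score the windows (1-a)w, w, (1+a)w and move to the best one."""
    candidates = sorted({
        max(1, round((1 - a) * w)),
        w,
        max(1, round((1 + a) * w)),
    })
    # max() returns the first maximal element, so ties go to the smaller
    # window, which keeps fewer summaries alive.
    return max(candidates, key=lambda c: hit_rate(history, c))

# Hypothetical workload that cycles through three summaries.
history = [{"S1"}, {"S2"}, {"S3"}] * 3
new_w = adapt_window(history, w=2, a=0.5)
print(new_w)
```

For this cyclic workload, windows of size 1 and 2 never contain the summary the next query needs, so the window grows to 3.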

Combining different data summaries
Sampling: all queries on subset of data
- Keep schema of original table
- Precision depends on query
- Uniform/Stratified sampling
- Answer large subset of queries
- Large size, ~10% of input
- I/O cost depending on size
Sketches: some queries on all data
- Count/Sum/Avg
- Aggregations
- Single grouping attribute
- Answer specific queries
- Compact, ~KB
- Constant access time
Utilize each summary where useful
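
As an illustration of the sketch side of this comparison, here is a minimal Count-Min sketch answering per-group COUNT queries on a single grouping attribute. Count-Min is used only as a well-known example; the slides do not say which sketches Taster builds, and the width, depth, and data are arbitrary assumptions.

```python
import hashlib
from collections import Counter

class CountMinSketch:
    """Fixed-size table of counters; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One hash function per row, derived by salting the key with the row id.
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.table[row][self._index(key, row)]
                   for row in range(self.depth))

# Hypothetical grouping column, e.g. a country code per row.
groups = ["DE", "FR", "US", "US", "DE", "US"] * 1000
sketch = CountMinSketch()
for g in groups:
    sketch.add(g)

true_counts = Counter(groups)
for key in sorted(true_counts):
    print(key, true_counts[key], sketch.estimate(key))
```

The sketch holds 256 x 4 counters regardless of input size and answers each lookup with four hash probes, which matches the "compact" and "constant access time" properties above. It cannot answer queries outside the aggregate it was built for, which is the case samples cover.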

Taster Architecture
[Figure: SQL query → online query optimization → query execution, with a tuner, a metadata store, and a synopsis warehouse alongside the data]
Online query optimization
- Inject approximation operators into plans
- Re-use existing materialized synopses
Tuner
- Choose which synopsis to generate
Metadata store
- Store statistics about the historical plans
Synopsis warehouse
- Store the synopsis over HDFS

Experimental Setup
Datasets
- TPC-H: sf300 (300GB), 18 query templates
Systems
- SparkSQL (2.1.0)
- BlinkDB, Quickr, Taster over SparkSQL (2.1.0)
Hardware
- 11 nodes x 2 x Intel Xeon X5660 CPU @ 2.80GHz, 48GB RAM, 10GbE (fix for each)

End-to-End execution time
11-node SparkSQL cluster, TPC-H sf300, 200 queries (18 TPC-H templates)
[Figure: bar chart of execution time (min), split into offline sampling and query execution, for Baseline, Quickr, BlinkDB (50%), Taster (50%), BlinkDB (100%), Taster (100%)]
Taster offers comparable execution time to state-of-the-art

Adapting to shifting workload
11-node SparkSQL cluster, TPC-H sf300, 80 queries (18 TPC-H templates)
[Figure: execution time (min) vs. query sequence (0-80)]
Taster adapts efficiently to changes in workload

Take home message
• Piggy-back the creation of summaries over the query execution
 – In the context of distributed approximate query processing
• Adapt data summaries to workload shifts and reduce storage budget
• Provide query performance comparable to offline AQP approaches
 – With reduced building and storage cost
Thank you!