How to Win a Hot Dog Eating Contest: Incremental View Maintenance with Batch Updates
Milos Nikolic, Mohammad Dashti, Christoph Koch
DATA lab, EPFL
SIGMOD, 28th June 2016

REALTIME APPLICATIONS
Examples: Web Analytics, Sensor Networks, Cloud Monitoring
Pipeline: EVENTS (continuously arriving data) → RUNTIME ENGINE → continuously evaluated views → DECISION SUPPORT → ACTIONS

REALTIME SYSTEMS: REQUIREMENTS
LOW LATENCY PROCESSING: Incremental view maintenance, Q(D + ∆D) = Q(D) + ∆Q(D, ∆D)
COMPLEX CONTINUOUS QUERIES: SQL queries (w/ nested aggregates), no window semantics
SCALABLE PROCESSING: Synchronous execution model

IN THIS TALK
Q1: How does the size of update affect the performance of incremental computation?
Q2: (Idea) How to achieve efficient distributed incremental computation?

HIGH-PERFORMANCE INCREMENTAL COMPUTATION
PROBLEM: DBMS & stream engines with classical IVM can have poor performance on fast, long-lived data
OUR APPROACH: Compilation of SQL queries into incremental engines
  = Recursive IVM + Code Generation (C++, Scala, Spark)
PERF: Million view refreshes/sec for single-tuple updates

Relations: R(A,B), S(B,C)
Q := SELECT SUM(R.A * S.C) FROM R, S WHERE R.B = S.B
Query plan: SUM(A * C) over (R ⋈ S on B)
Update Q, delta for ΔR:
  Δ_R Q = SUM(A * C) over (ΔR ⋈ S on B)
Optimize:
  Optimized Δ_R Q = SUM(L * R) over (L ⋈ R on B), where
    L = SUM(A) GROUP BY B over ΔR
    R = SUM(C) GROUP BY B over S
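
The step from the delta to the optimized delta can be written out as a short derivation. This is a sketch only: the sum notation over tuples (counted with multiplicity) is an assumption chosen for illustration, not the notation of the paper. It instantiates Q(D + ∆D) = Q(D) + ∆Q(D, ∆D) for an update to R and then groups terms by the join key B.

```latex
\begin{align*}
Q(R, S) &= \sum_{(a,b) \in R} \; \sum_{(b',c) \in S,\; b' = b} a \cdot c \\
\Delta_R Q &= Q(R + \Delta R, S) - Q(R, S)
            = \sum_{(a,b) \in \Delta R} \; \sum_{(b',c) \in S,\; b' = b} a \cdot c \\
           &= \sum_{b} \Big( \sum_{(a,b) \in \Delta R} a \Big) \cdot \Big( \sum_{(b,c) \in S} c \Big)
\end{align*}
```

The last line is the optimized delta: both sides are aggregated per value of B before the join, so the join runs over one entry per key rather than over every tuple.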

Relations: R(A,B), S(B,C)
Q := SELECT SUM(R.A * S.C) FROM R, S WHERE R.B = S.B
Update Q on ΔR:
  Optimized Δ_R Q = SUM(L * R) over ((SUM(A) GROUP BY B over ΔR) ⋈ mS on B)
  Pre-compute: mS = SUM(C) GROUP BY B over S
Update Q on ΔS:
  Optimized Δ_S Q = SUM(L * R) over (mR ⋈ (SUM(C) GROUP BY B over ΔS) on B)
  Pre-compute: mR = SUM(A) GROUP BY B over R

Maintained views:
  Q  = SUM(A * C) over (R ⋈ S on B)
  mR = SUM(A) GROUP BY B over R
ON UPDATE R BY ΔR:
  // Pre-aggregate batch
  tmp[B] := SELECT B, SUM(A) FROM ΔR GROUP BY B
  // Update Q
  Q += SELECT SUM(tmp.V * mS.V) FROM tmp, mS WHERE tmp.B = mS.B
  // Update mR
  mR[B] += SELECT * FROM tmp
Common delta expressions: tmp (SUM(A) GROUP BY B over ΔR) is computed once and shared by Update Q (SUM(L * R) over tmp ⋈ mS on B) and Update mR.

Trigger program:
  ON UPDATE R BY ΔR:
    // Pre-aggregate batch
    tmp[B] := SELECT B, SUM(A) FROM ΔR GROUP BY B
    // Update Q
    Q += SELECT SUM(tmp.V * mS.V) FROM tmp, mS WHERE tmp.B = mS.B
    // Update mR
    mR[B] += SELECT * FROM tmp
Generated code:
  void onUpdateR(List<T> dR) {
    // Pre-aggregate batch
    HashMap<int,int> tmp;
    foreach (dA,dB) in dR
      tmp[dB] += dA;
    // Update Q (of type int)
    foreach (k,v) in tmp
      Q += v * mS[k];
    // Update mR
    foreach (k,v) in tmp
      mR[k] += v;
  }
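
The pseudocode above can be turned into compilable C++17 roughly as follows. This is a minimal sketch, not the generated code itself: the names `Tuple`, `View`, `BatchEngine`, `recompute` and `incrementalMatchesRecompute` are hypothetical, the `onUpdateS` trigger is the assumed mirror image of `onUpdateR` (the slide shows only the R trigger), and `long` is used for aggregates where the slide uses `int`.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical types: for R a Tuple is (A, B); for S it is (B, C).
using Tuple = std::pair<int, int>;
using View  = std::unordered_map<int, long>;  // join key B -> aggregate

struct BatchEngine {
    long Q = 0;  // SUM(R.A * S.C) over R join S on B
    View mR;     // SUM(A) GROUP BY B over R
    View mS;     // SUM(C) GROUP BY B over S

    // Batch trigger for R: the three steps of the slide.
    void onUpdateR(const std::vector<Tuple>& dR) {
        View tmp;  // pre-aggregate the batch by B
        for (const auto& [dA, dB] : dR) tmp[dB] += dA;
        for (const auto& [k, v] : tmp) {  // update Q
            auto it = mS.find(k);         // find() does not insert missing keys
            if (it != mS.end()) Q += v * it->second;
        }
        for (const auto& [k, v] : tmp) mR[k] += v;  // update mR
    }

    // Assumed symmetric trigger for S (not spelled out on the slide).
    void onUpdateS(const std::vector<Tuple>& dS) {
        View tmp;
        for (const auto& [dB, dC] : dS) tmp[dB] += dC;
        for (const auto& [k, v] : tmp) {
            auto it = mR.find(k);
            if (it != mR.end()) Q += it->second * v;
        }
        for (const auto& [k, v] : tmp) mS[k] += v;
    }
};

// Reference result: recompute Q from scratch with a nested-loop join.
long recompute(const std::vector<Tuple>& R, const std::vector<Tuple>& S) {
    long q = 0;
    for (const auto& [a, rb] : R)
        for (const auto& [sb, c] : S)
            if (rb == sb) q += static_cast<long>(a) * c;
    return q;
}

// Loads S, then R, as one batch each and compares with recomputation.
bool incrementalMatchesRecompute(const std::vector<Tuple>& R,
                                 const std::vector<Tuple>& S) {
    BatchEngine e;
    e.onUpdateS(S);
    e.onUpdateR(R);
    return e.Q == recompute(R, S);
}
```

Using `find` for the read of `mS` (rather than `operator[]`, as in the pseudocode) keeps the read side from inserting empty entries; that is a design choice of this sketch, not something the slide prescribes.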

Single-tuple trigger (BASELINE):
  void onUpdateR(int dA, int dB) {
    Q += dA * mS[dB];
    mR[dB] += dA;
  }
Batch trigger:
  void onUpdateR(List<T> dR) {
    // Pre-aggregate batch
    HashMap<int,int> tmp;
    foreach (dA,dB) in dR
      tmp[dB] += dA;
    // Update Q (of type int)
    foreach (k,v) in tmp
      Q += v * mS[k];
    // Update mR
    foreach (k,v) in tmp
      mR[k] += v;
  }
CODE SPECIALIZATION:
  Primitive-type parameters
  No intermediate maps
  Loop elimination
  Partial evaluation, inlining
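
To make the relationship between the two triggers concrete, the sketch below feeds the same updates through a batch trigger and through a single-tuple trigger and compares the resulting state. The names `State`, `lookup`, `onUpdateRBatch`, `onUpdateRSingle` and `triggersAgree` are hypothetical, and the code is an illustration in C++17 rather than the compiler's actual output.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

using View = std::unordered_map<int, long>;  // join key B -> aggregate

// Hypothetical container for the maintained views.
struct State {
    long Q = 0;  // SUM(R.A * S.C)
    View mR;     // SUM(A) GROUP BY B over R
    View mS;     // SUM(C) GROUP BY B over S
};

// Read-only lookup: a missing key contributes 0 and is not inserted.
long lookup(const View& m, int key) {
    auto it = m.find(key);
    return it == m.end() ? 0 : it->second;
}

// Generic batch trigger: pre-aggregate, then update Q and mR.
void onUpdateRBatch(State& s, const std::vector<std::pair<int, int>>& dR) {
    View tmp;
    for (const auto& [dA, dB] : dR) tmp[dB] += dA;
    for (const auto& [k, v] : tmp) s.Q += v * lookup(s.mS, k);
    for (const auto& [k, v] : tmp) s.mR[k] += v;
}

// Specialized single-tuple trigger: primitive parameters,
// no intermediate map, no loops.
void onUpdateRSingle(State& s, int dA, int dB) {
    s.Q += dA * lookup(s.mS, dB);
    s.mR[dB] += dA;
}

// Applies dR once as a batch and once tuple by tuple,
// then compares the final Q and mR.
bool triggersAgree(const View& mS,
                   const std::vector<std::pair<int, int>>& dR) {
    State batch, single;
    batch.mS = mS;
    single.mS = mS;
    onUpdateRBatch(batch, dR);
    for (const auto& [dA, dB] : dR) onUpdateRSingle(single, dA, dB);
    return batch.Q == single.Q && batch.mR == single.mR;
}
```

Both paths end in the same state; they differ only in how much work is done per input tuple. The batch path pays for building `tmp` but touches `mS` and `mR` once per distinct key, while the single-tuple path touches them once per tuple.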

SINGLE-TUPLE VS. BATCH IVM
TPC-H, 10GB stream, batch size = 1…100,000, C++
MAIN RESULTS
1) Best performance w/ medium batch sizes (= bite sizes)
2) Single-tuple processing faster for 5 queries; 7 queries within 20% of best-batch performance
3) Batch pre-aggregation can enable cheaper maintenance
4) Orders of magnitude (OOM) faster than DBMS
[Figure: NORMALIZED THROUGHPUT (y-axis, 0.0 to 1.6) for batch sizes BS=1, BS=10, BS=100, BS=1K, BS=10K, BS=100K, with a Single-tuple reference line; queries shown: Q3, Q9]

DISTRIBUTED IVM
DESIGN CHOICE 1: Local ➞ Distributed programs
CHALLENGE: Dependencies among statements prevent arbitrary re-orderings
DESIGN CHOICE 2: Synchronous execution model (on top of Spark)
LOCAL IVM PROGRAM:
  ON UPDATE R
    STATEMENT 1
    STATEMENT 2
    STATEMENT 3
  ON UPDATE S
    STATEMENT 4
    STATEMENT 5
    STATEMENT 6
    STATEMENT 7

OUR APPROACH
LOCATION TAGS: LOCAL, PARTITIONED BY KEY, RANDOM
  Annotate each node in query plan with location info
LOCATION TRANSFORMERS: REPARTITION, GATHER, SCATTER
  Insert communication operations into query plan to preserve query semantics
HOLISTIC OPTIMIZATION: Minimize network cost
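
One way to picture how tags and transformers fit together is a small rule that picks a transformer from the tag of a plan node's output and the tag its consumer expects. The slide names the tags and transformers but does not define their exact semantics, so the mapping below is an assumption made for illustration: GATHER moves distributed data to the local node, SCATTER distributes local data, and REPARTITION changes how already-distributed data is laid out. The names `Loc`, `Transformer`, `Tag` and `transformerFor` are hypothetical.

```cpp
#include <optional>
#include <string>

enum class Loc { Local, PartitionedByKey, Random };
enum class Transformer { Repartition, Gather, Scatter };

// Hypothetical location tag; `key` matters only for PartitionedByKey.
struct Tag {
    Loc loc;
    std::string key;
};

// Returns the transformer assumed necessary to move data tagged `from`
// to a consumer expecting `to`, or nullopt when the tags already match.
std::optional<Transformer> transformerFor(const Tag& from, const Tag& to) {
    const bool sameLoc = from.loc == to.loc;
    const bool sameKey =
        from.loc != Loc::PartitionedByKey || from.key == to.key;
    if (sameLoc && sameKey) return std::nullopt;        // nothing to insert
    if (to.loc == Loc::Local) return Transformer::Gather;    // distributed -> local
    if (from.loc == Loc::Local) return Transformer::Scatter; // local -> distributed
    return Transformer::Repartition;  // distributed -> distributed (assumed)
}
```

Under this assumed rule, a join on B between a view partitioned by B and a batch tagged RANDOM would get a REPARTITION on the batch side, and a final scalar aggregate expected at the local node would get a GATHER. An optimizer that minimizes network cost would then choose tags so that as few of these transformers as possible are inserted.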

CONCLUSION
Much more in the paper:
• Single-tuple vs. batch incremental processing (single-tuple can be better!) + more experiments
• Distributed IVM (+ optimization framework)
• IVM of queries with nested aggregates
• Code and data-structure specialization
Download: http://www.dbtoaster.org