Engineering Aggregation Operators for Relational In-Memory Database Systems
Ingo Müller – PhD defense – February 11, 2016
Institute of Theoretical Informatics, Algorithmics II, Department of Informatics
In cooperation with SAP SE
www.kit.edu
KIT – The Research University in the Helmholtz Association

Introduction – The Race of Database Systems
[Figure: two plots over time. Left: "Data growth [RG12]", y-axis "Size of the Digital Universe [EiB]". Right: "Hardware evolution [Bui12]", y-axis "Relative Performance". Annotations in the plots: +60%/yr, +80%/yr, +9%/yr, "gap", "Database systems".]
Trend 1: data volumes increase exponentially (or faster)
Trend 2: compute power increases exponentially
  But hardware also becomes more and more complex, for example memory access
Database systems are in a continuous race to translate Moore's law.

Introduction – Grouping with Aggregation
What is the sum of the prices of all sold items per store?
SELECT Store, SUM(Price) AS Sum FROM Sales GROUP BY Store

Input (Sales)
Store    Item    Price
Berlin   pen     1.00 €
Berlin   paper   3.00 €
Paris    ruler   2.00 €
Berlin   pen     1.00 €
Paris    pen     1.00 €
Vienna   paper   3.00 €

Output
Store    Sum
Paris    3.00 €
Vienna   3.00 €
Berlin   5.00 €

Challenges and Overview
Cache efficiency: lower bound + (optimal) recursive algorithm
Optimizer independence: adaptive execution strategy
Memory constraint: [SIGMOD15] intra-operator pipelining
CPU friendliness: low-level tuning of inner loops
Parallelism: work stealing
Skewed data distribution: robust algorithm design
Communication efficiency: adaptive pre-aggregation
System integration: compatible with major DB architectures
Result: up to 3.7x faster and robust enough for use in production.

Challenge: Cache Efficiency – Motivation
Two textbook algorithms:
Hash-Aggregation
  Insert every row into a hash map with the grouping attributes as key
  Aggregate to the existing intermediate result
Sort-Aggregation
  Sort the input by the grouping attributes
  Aggregate consecutive rows in a single pass
M = cache size, B = block size, N = input size, K = output size
Can we do better? Long-standing conjecture: no!

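The two textbook algorithms can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation from the thesis: it uses the Sales rows from the grouping example, and the names `SALES`, `hash_aggregate`, and `sort_aggregate` are made up for this sketch.

```python
from itertools import groupby
from operator import itemgetter

# Rows of the Sales example: (store, item, price)
SALES = [
    ("Berlin", "pen", 1.0),
    ("Berlin", "paper", 3.0),
    ("Paris", "ruler", 2.0),
    ("Berlin", "pen", 1.0),
    ("Paris", "pen", 1.0),
    ("Vienna", "paper", 3.0),
]


def hash_aggregate(rows):
    """Hash-Aggregation: one hash map entry per group, updated row by row."""
    sums = {}
    for store, _item, price in rows:
        # Aggregate into the existing intermediate result (or start a new one).
        sums[store] = sums.get(store, 0.0) + price
    return sums


def sort_aggregate(rows):
    """Sort-Aggregation: sort by the grouping attribute, then one linear pass."""
    ordered = sorted(rows, key=itemgetter(0))
    result = {}
    # After sorting, all rows of a group are consecutive.
    for store, group in groupby(ordered, key=itemgetter(0)):
        result[store] = sum(price for _store, _item, price in group)
    return result
```

Both functions return the same sums per store. They differ in their memory access pattern, which is what the cache-efficiency question on this slide is about.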
External Memory Model – Proof Techniques
[Figure: external memory model. N input records and K output records reside in "external" memory; data moves in blocks of B records; the cache holds M records.]
Known lower bounds for Aggregation
  Based on comparisons [MR91, AK+93]
  Do not hold for Hashing!
Proof technique [AV88, Gre12]
  Count the number of possible permutations after t transfers
  Compare with the possible number of input permutations
Modifications for Aggregation
  Allow semi-group operation in cache
  Count "permutations" as before

External Memory Model – Result
Lower bound* for Aggregation: $\Omega\left(\frac{N}{B} \log_{M/B} \frac{K}{B}\right)$ block transfers
*simplified asymptotic worst case
Same bound as for Sorting Multisets [AK+93]
M = cache size, B = block size, N = input size, K = output size
We confirm: Aggregation is as hard as Sorting! Use as guideline.

Outline
Cache efficiency
  lower bound
  (optimal) recursive algorithm
Optimizer independence
  adaptive execution strategy
Memory constraint
  intra-operator pipelining

Challenge: Adaptivity – Motivation
Traditional approach [Gra93]
  Implement HashAggregation and SortAggregation
  Optimizer selects the implementation beforehand, based on statistics
Problem
  Wrong statistics may lead to suboptimal performance
M = cache size, B = block size, N = input size, K = output size
Our goal: adaptively switch between Hashing and Sorting during execution.

Adaptivity – Mixing Hashing and Sorting
Recursive algorithm:
In each level of recursion: mix Hashing and Sorting adaptively
  Partitioning recurses when necessary
  Hashing ends the recursion efficiently when possible

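The recursion can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the thesis implementation: the bounded dictionary stands in for a cache-sized hash table, partitioning uses successive base-`fanout` digits of Python's `hash`, and the switching rule (partition directly for a while when a table fill absorbed too few rows) as well as all parameter names and defaults are hypothetical.

```python
def adaptive_aggregate(rows, capacity=4, fanout=4, level=0,
                       min_reduction=2.0, skip_factor=4, max_level=16):
    """Sum values per key; rows is an iterable of (key, value) pairs."""
    if level >= max_level:
        # Safety guard for this sketch: stop recursing, use an unbounded table.
        out = {}
        for key, value in rows:
            out[key] = out.get(key, 0) + value
        return out

    table = {}                                  # models the cache-sized hash table
    partitions = [[] for _ in range(fanout)]
    spilled = False
    absorbed = 0                                # rows consumed by the current table
    skip = 0                                    # rows to partition without hashing

    def emit(key, value):
        # Partition by the hash digit that belongs to this recursion level.
        digit = (hash(key) // fanout ** level) % fanout
        partitions[digit].append((key, value))

    def flush():
        for key, value in table.items():
            emit(key, value)
        table.clear()

    for key, value in rows:
        if skip == 0 and key not in table and len(table) >= capacity:
            # Table is full. If hashing reduced the data too little,
            # switch to plain partitioning for the next rows (assumed rule).
            if absorbed < min_reduction * capacity:
                skip = skip_factor * capacity
            flush()
            spilled = True
            absorbed = 0
        if skip > 0:
            emit(key, value)
            skip -= 1
        else:
            table[key] = table.get(key, 0) + value
            absorbed += 1

    if not spilled:
        return table                            # hashing ended the recursion
    flush()
    result = {}
    for part in partitions:
        if part:                                # partitioning recurses when necessary
            result.update(adaptive_aggregate(part, capacity, fanout, level + 1,
                                             min_reduction, skip_factor, max_level))
    return result
```

With few groups the first table never fills and the call returns without recursing; with many groups the rows are split by hash digit and each partition is aggregated independently, so every key ends up in exactly one sub-result.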
Adaptivity – Mixing Hashing and Sorting
Our mechanism achieves the best of Hashing and Sorting.

Evaluation – Comparison with Prior Work
[Figure: performance comparison with prior work; annotated speedup: 3.7x]
Setup: 2 Xeon E7-8870 CPUs (10 cores each), N = 2^32, uniform distribution
Original implementation of [CR07, YR+11]
Efficient recursive processing is crucial for large outputs.

Outline
Cache efficiency
  lower bound
  (optimal) recursive algorithm
Optimizer independence
  adaptive execution strategy
Memory constraint
  intra-operator pipelining

Memory Constraint – Intra-Operator Pipelining
Split work into blocks
Recycle free blocks
Limit number of blocks
Interleave/overlap processing levels
Pipelining makes it possible to limit the amount of intermediate memory.

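The three block-related points (split, recycle, limit) can be pictured as a fixed-size block pool. The following Python sketch is an assumption-laden illustration: the class `BlockPool` and its methods are hypothetical names, and a block is modeled as a plain list rather than a memory region.

```python
class BlockPool:
    """Hands out at most max_blocks blocks and recycles released ones."""

    def __init__(self, max_blocks, block_size):
        self.max_blocks = max_blocks    # hard limit on intermediate memory
        self.block_size = block_size    # rows per block (informational here)
        self.allocated = 0              # blocks created so far
        self.free = []                  # released blocks waiting for reuse

    def acquire(self):
        """Return a block, or None if the limit is reached."""
        if self.free:
            return self.free.pop()      # recycle a free block
        if self.allocated < self.max_blocks:
            self.allocated += 1
            return []                   # create a new, empty block
        return None                     # caller must consume blocks first

    def release(self, block):
        """Give a fully processed block back to the pool."""
        block.clear()
        self.free.append(block)

    @property
    def usage(self):
        """Fraction of the block budget that is currently in use."""
        return (self.allocated - len(self.free)) / self.max_blocks
```

When `acquire` returns `None`, a producer at one level has to pause until a consumer at the next level has processed and released a block, which is what interleaving the processing levels achieves.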
Memory Constraint – Intra-Operator Scheduling
[Figure labels: PQ, PQ, Balance]
In which level to work?
  Heuristic: target 50% memory usage
On which partition to work?
  Priority queue on partition length

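A possible reading of the two scheduling decisions is sketched below in Python. The slide only states the 50% target and the priority queue on partition length; the direction of the level choice (go to the deepest level when usage is above target, because that level consumes intermediate blocks) and the longest-partition-first order are assumptions of this sketch, as are the names `choose_level` and `PartitionQueue`.

```python
import heapq

TARGET_USAGE = 0.5  # heuristic from the slide: aim for 50% memory usage


def choose_level(memory_usage, num_levels):
    """Pick the level to work on, given the fraction of memory in use.

    Assumption: above the target, work on the deepest level to free
    intermediate blocks; otherwise work on the first level to produce more.
    """
    return num_levels - 1 if memory_usage > TARGET_USAGE else 0


class PartitionQueue:
    """Priority queue on partition length (assumed: longest partition first)."""

    def __init__(self):
        self._heap = []

    def push(self, partition_id, length):
        # heapq is a min-heap, so store the negated length.
        heapq.heappush(self._heap, (-length, partition_id))

    def pop_longest(self):
        neg_length, partition_id = heapq.heappop(self._heap)
        return partition_id, -neg_length

    def __len__(self):
        return len(self._heap)
```

A scheduler loop would call `choose_level` with the current memory usage and then `pop_longest` on the queue of that level to obtain the next partition to process.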
Memory Constraint – Evaluation
[Figure: two plots. Settings: "Input size = 16 GiB, memory constraint = 256 MiB" and "Input size = 16 GiB, K = 2^23". Annotations: "2x", "1.2% of unconstrained".]
Performance basically preserved (for moderate result sizes)
Trade-off between memory usage and performance
Cache efficiency can be achieved under a memory constraint.

Summary
Cache efficiency: lower bound + (optimal) recursive algorithm
Optimizer independence: adaptive execution strategy
Memory constraint: [SIGMOD15] intra-operator pipelining
CPU friendliness: low-level tuning of inner loops
Parallelism: work stealing
Skewed data distribution: robust algorithm design
Communication efficiency: adaptive pre-aggregation
System integration: compatible with major DB architectures
Thank you! Questions?