

  1. Engineering Aggregation Operators for Relational In-Memory Database Systems Ingo Müller – PhD defense – February 11, 2016 Institute of Theoretical Informatics, Algorithmics II, Department of Informatics In cooperation with SAP SE www.kit.edu KIT – The Research University in the Helmholtz Association

  2. Introduction – The Race of Database Systems
    [Charts: data growth, shown as the size of the digital universe in EiB [RG12], and hardware evolution, shown as relative performance over time [Bui12]; annotated growth rates of +60%/yr, +80%/yr, and +9%/yr illustrate a widening gap for database systems]
    Trend 1: data volumes increase exponentially (or faster).
    Trend 2: compute power also increases exponentially, but hardware becomes more and more complex, for example memory access.
    Database systems are in a continuous race to translate Moore's law into performance.

  3. Introduction – Grouping with Aggregation
    Input (Sales):
      Store   Item    Price
      Berlin  pen     1.00 €
      Berlin  paper   3.00 €
      Paris   ruler   2.00 €
      Berlin  pen     1.00 €
      Paris   pen     1.00 €
      Vienna  paper   3.00 €
    Output:
      Store   Sum
      Paris   3.00 €
      Vienna  3.00 €
      Berlin  5.00 €
    What is the sum of the prices of all sold items per store?
    SELECT Store, SUM(Price) AS Sum FROM Sales GROUP BY Store
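The query's semantics can also be pictured in code. The following is a minimal, self-contained sketch (not from the thesis) that computes the same result with a hash map; the Row struct and the euro values simply mirror the example table above.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Row { std::string store; std::string item; double price; };

int main() {
    // The Sales table from the example above.
    std::vector<Row> sales = {
        {"Berlin", "pen", 1.00}, {"Berlin", "paper", 3.00}, {"Paris", "ruler", 2.00},
        {"Berlin", "pen", 1.00}, {"Paris", "pen", 1.00},    {"Vienna", "paper", 3.00},
    };

    // SELECT Store, SUM(Price) AS Sum FROM Sales GROUP BY Store
    std::unordered_map<std::string, double> sums;
    for (const Row& r : sales) sums[r.store] += r.price;   // one group per store

    for (const auto& [store, sum] : sums)
        std::cout << store << ": " << sum << " EUR\n";     // e.g. "Berlin: 5 EUR"
}
```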

  4. Challenges and Overview
    Cache efficiency → lower bound + (optimal) recursive algorithm [SIGMOD15]
    Optimizer independence → adaptive execution strategy [SIGMOD15]
    Memory constraint → intra-operator pipelining [SIGMOD15]
    CPU friendliness → low-level tuning of inner loops
    Parallelism → work stealing
    Skewed data distribution → robust algorithm design
    Communication efficiency → adaptive pre-aggregation
    System integration → compatible with major DB architectures
    Result: up to 3.7x faster and robust enough for use in production.

  5. Challenge: Cache Efficiency – Motivation
    Two textbook algorithms:
    Hash-Aggregation: insert every row into a hash map with the grouping attributes as key; aggregate into the existing intermediate result.
    Sort-Aggregation: sort the input by the grouping attributes; aggregate consecutive rows in a single pass.
    (M = cache size, B = block size, N = input size, K = output size)
    Can we do better? Long-standing conjecture: no!
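As a companion to the textbook descriptions above, here is a minimal sketch of Sort-Aggregation (Hash-Aggregation is sketched after slide 3). The (key, value) row layout and the SUM aggregate are assumptions for illustration, not code from the thesis.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Rows are (group key, value) pairs; the aggregate function is SUM(value).
using Row = std::pair<uint64_t, double>;

std::vector<Row> sort_aggregate(std::vector<Row> input) {
    // Step 1: sort the input by the grouping attribute.
    std::sort(input.begin(), input.end(),
              [](const Row& a, const Row& b) { return a.first < b.first; });

    // Step 2: aggregate consecutive rows of the same group in a single pass.
    std::vector<Row> output;
    for (const Row& r : input) {
        if (!output.empty() && output.back().first == r.first)
            output.back().second += r.second;  // same group: fold into running sum
        else
            output.push_back(r);               // new group: open a new result row
    }
    return output;
}
```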

  6. External Memory Model – Proof Techniques
    [Diagram: external memory model with a cache of M records, block transfers of B records, N input records, and K output records in "external" memory]
    Known lower bounds for Aggregation are based on comparisons [MR91, AK+93] → they do not hold for Hashing!
    Proof technique [AV88, Gre12]: count the number of possible permutations after t block transfers and compare it with the number of possible input permutations.
    Modifications for Aggregation: allow a semi-group operation in cache; count "permutations" as before.
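To make the counting technique concrete, here is a simplified rendering of the classic argument of [AV88] for sorting; the thesis's modifications for aggregation (semi-group operation in cache, counting over K output groups) are only mentioned, not derived.

```latex
% Simplified counting argument for the sorting lower bound [AV88].
% Each block transfer multiplies the number of reachable orderings by at most
% \binom{M}{B}; the first read of each input block contributes an extra B!
% for the internal order of its records. To realize every input permutation:
\[
  \binom{M}{B}^{t} \cdot (B!)^{N/B} \;\ge\; N!
  \quad\Longrightarrow\quad
  t \;=\; \Omega\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)
  \text{ block transfers.}
\]
```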

  7. External Memory Model – Result
    Lower bound* for Aggregation: Ω((N/B) · log_{M/B}(K/B)) block transfers
    *simplified asymptotic worst case
    Same bound as for Sorting Multisets [AK+93]
    (M = cache size, B = block size, N = input size, K = output size)
    We confirm: Aggregation is as hard as Sorting! → Use it as a guideline.

  8. Outline
    Cache efficiency → lower bound, (optimal) recursive algorithm
    Optimizer independence → adaptive execution strategy
    Memory constraint → intra-operator pipelining

  9. Challenge: Adaptivity – Motivation
    Traditional approach [Gra93]: implement HashAggregation and SortAggregation; the optimizer selects an implementation based on statistics beforehand.
    Problem: wrong statistics may lead to suboptimal performance.
    (M = cache size, B = block size, N = input size, K = output size)
    Our goal: adaptively switch between Hashing and Sorting during execution.
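To see why a fixed, up-front choice is fragile, consider this schematic of the traditional statistics-driven decision (the function name, the threshold, and the parameters are invented for illustration):

```cpp
enum class AggImpl { Hash, Sort };

// Schematic of the traditional approach: the optimizer commits to one
// implementation before execution, based on an estimated group count.
// If estimated_groups is far off, the choice can be poor and is never revisited.
AggImpl choose_implementation(double estimated_groups, double groups_fitting_in_cache) {
    return estimated_groups <= groups_fitting_in_cache ? AggImpl::Hash : AggImpl::Sort;
}
```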

  10. Adaptivity – Mixing Hashing and Sorting
    Recursive algorithm: in each level of recursion, mix Hashing and Sorting adaptively.
    Partitioning recurses when necessary; Hashing ends the recursion as soon as it can do so efficiently.
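One way to picture this mix is the sketch below. It is written under assumptions (the hash-table capacity, the fan-out, and the radix-style partition function are invented) and is not the thesis's operator: hash into a bounded table as long as the groups fit, and when the table overflows, spill everything into partitions and recurse.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::pair<uint64_t, double>;     // (group key, value); aggregate = SUM

constexpr size_t kHashCapacity = 1 << 16;    // illustrative "fits in cache" limit
constexpr size_t kFanout       = 64;         // illustrative partitioning fan-out

void aggregate_recursive(const std::vector<Row>& input, int level,
                         std::vector<Row>& output) {
    // Try Hashing first: it ends the recursion if all groups fit in the table.
    std::unordered_map<uint64_t, double> table;
    size_t consumed = 0;
    for (; consumed < input.size() && table.size() < kHashCapacity; ++consumed)
        table[input[consumed].first] += input[consumed].second;

    if (consumed == input.size()) {          // everything aggregated in cache
        for (const auto& [key, sum] : table) output.emplace_back(key, sum);
        return;
    }

    // Too many groups: switch to Partitioning and recurse on each partition.
    // (Simplistic radix-style partition function; different key bits per level.)
    const unsigned shift = level < 10 ? 6 * level : 54;
    auto part_of = [shift](uint64_t key) { return (key >> shift) % kFanout; };

    std::vector<std::vector<Row>> parts(kFanout);
    for (const auto& [key, sum] : table) parts[part_of(key)].emplace_back(key, sum);
    for (size_t i = consumed; i < input.size(); ++i)
        parts[part_of(input[i].first)].push_back(input[i]);

    for (const auto& part : parts)
        if (!part.empty()) aggregate_recursive(part, level + 1, output);
}
```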

  11. Adaptivity – Mixing Hashing and Sorting
    Our mechanism achieves the best of Hashing and Sorting.

  12. Evaluation – Comparison with Prior Work
    [Chart: our operator is up to 3.7x faster than the original implementations of [CR07, YR+11]]
    Setup: 2 Xeon E7-8870 CPUs (10 cores each), N = 2^32, uniform distribution.
    Efficient recursive processing is crucial for large outputs.

  13. Outline
    Cache efficiency → lower bound, (optimal) recursive algorithm
    Optimizer independence → adaptive execution strategy
    Memory constraint → intra-operator pipelining

  14. Memory Constraint – Intra-Operator Pipelining
    Split work into blocks; recycle free blocks; limit the number of blocks; interleave/overlap processing levels.
    Pipelining makes it possible to limit the amount of intermediate memory.
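A minimal sketch of the block-recycling idea (the class, the block representation, and the sizes are assumptions, not the thesis's data structures): a fixed pool of blocks bounds the intermediate memory, and a producer that finds no free block has to wait until a later processing level drains one, which is what forces the levels to overlap.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// A fixed-size pool of row blocks. Limiting the number of blocks bounds the
// operator's intermediate memory; waiting for a free block forces downstream
// levels to drain, i.e., processing levels are interleaved.
class BlockPool {
public:
    BlockPool(size_t num_blocks, size_t block_bytes)
        : free_(num_blocks, std::vector<char>(block_bytes)) {}

    std::vector<char> acquire() {                 // blocks until a block is free
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return !free_.empty(); });
        std::vector<char> block = std::move(free_.back());
        free_.pop_back();
        return block;
    }

    void release(std::vector<char> block) {       // recycle a drained block
        { std::lock_guard<std::mutex> lock(m_); free_.push_back(std::move(block)); }
        cv_.notify_one();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<std::vector<char>> free_;
};
```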

  15. Memory Constraint – Intra-Operator Scheduling
    [Diagram: priority queues (PQ) balancing work across processing levels]
    In which level to work? → Heuristic: target 50% memory usage.
    On which partition to work? → Priority queue on partition length.
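The two scheduling questions could be answered roughly as follows. This is one plausible reading of the slide's heuristic, not the thesis's scheduler: the Partition type, the per-level queues, and the producer/consumer interpretation of the 50% target are assumptions.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

struct Partition { size_t length = 0; int level = 0; /* ... rows ... */ };

// Order partitions by length so that the longest partition is processed first.
struct ByLength {
    bool operator()(const Partition& a, const Partition& b) const { return a.length < b.length; }
};

using LevelQueue = std::priority_queue<Partition, std::vector<Partition>, ByLength>;

// Pick the level so that intermediate memory stays around the 50% target,
// then pop that level's longest partition.
Partition next_task(std::vector<LevelQueue>& levels, double memory_used_fraction) {
    if (memory_used_fraction < 0.5) {
        // Below the target: prefer earlier levels, which produce intermediates.
        for (auto& pq : levels)
            if (!pq.empty()) { Partition p = pq.top(); pq.pop(); return p; }
    } else {
        // Above the target: prefer later levels, which consume intermediates.
        for (auto it = levels.rbegin(); it != levels.rend(); ++it)
            if (!it->empty()) { Partition p = it->top(); it->pop(); return p; }
    }
    return {};  // no work available
}
```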

  16. Memory Constraint – Evaluation
    [Charts: input size = 16 GiB with memory constraint = 256 MiB, and input size = 16 GiB with K = 2^23; annotations: 2x, 1.2% of unconstrained]
    Performance is basically preserved (for moderate result sizes).
    There is a trade-off between memory usage and performance.
    Cache efficiency can be achieved under a memory constraint.

  17. Summary
    Cache efficiency → lower bound + (optimal) recursive algorithm [SIGMOD15]
    Optimizer independence → adaptive execution strategy [SIGMOD15]
    Memory constraint → intra-operator pipelining [SIGMOD15]
    CPU friendliness → low-level tuning of inner loops
    Parallelism → work stealing
    Skewed data distribution → robust algorithm design
    Communication efficiency → adaptive pre-aggregation
    System integration → compatible with major DB architectures
    Thank you! Questions?
