An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar - PowerPoint PPT Presentation

An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database Stephan Müller, Hasso Plattner Enterprise Platform and Integration Concepts Hasso Plattner Institute, Potsdam (Germany) DOLAP 2012 – November 2nd

Agenda ■ Data aggregation ■ Aggregations in Hyrise ■ Cost factors and benchmarks ■ Summary and future work DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Motivation ■ Aggregation □ Abstractions for conceptualizing the real world □ Resource-intensive operation in OLAP workloads ■ Columnar in-memory databases □ Optimized for set processing □ Enables reunification of OLTP and OLAP [1] [1] H. Plattner. A common database approach for OLTP and OLAP using an in-memory column database. In SIGMOD, 2009. DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Data Aggregation ■ SELECT product, SUM (amount) AS revenue FROM sales WHERE year=2011 GROUP BY product ■ Two phases □ Grouping (hash-based, sorting, nested loops) □ Calculation DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Data Aggregation in Hyrise ■ Columnar Storage ■ Dictionary compression id fruit amount id fruit amount 1 apples 10 1 1 1 "fruit" dictionary 2 bananas 10 2 2 1 1 apples 3 cherries 20 3 3 2 2 bananas 4 apples 10 4 1 1 3 cherries 5 cherries 30 5 3 1 4 grapes 6 melons 50 6 5 3 5 melons 7 bananas 20 7 2 4 8 grapes 40 8 4 1 DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Data Aggregation – Grouping table id fruit amount 1 1 1 position on "fruit" dictionary 2 2 1 1 0, 3 1 apples 3 3 2 2 1, 6 2 bananas 4 1 1 3 2, 4 3 cherries 5 3 1 4 7 4 grapes 6 5 3 5 5 5 melons 7 2 4 8 4 1 DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Data Aggregation - Calculation table "amount" dictionary id fruit amount 1 1 1 1 10 position on "fruit" 2 2 1 2 20 1 0, 3 3 3 2 3 30 2 1, 6 4 1 1 4 40 3 2, 4 5 3 1 5 50 4 7 6 5 3 5 5 7 2 4 8 4 1 fruit sum(amount) apples 20 bananas 50 cherries 30 grapes 10 melons 40 DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Cost Modeling in IMDBs ■ Cost model based on cache misses [2] □ Basic patterns for each cache level □ Combine patterns to model complex operations [2] Manegold, S. et al. 2002. Generic database cost models for hierarchical memory systems. VLDB . (2002). DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Cost Modeling of the Aggregation DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Cost Factors ■ Data-dependent □ Dataset size □ Distinct grouping values □ Distinct grouping values distribution □ Sorting of grouping values ■ Data-independent □ Aggregation function □ Hash implementation DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Hash Implementations ■ Ordering □ Sorted tree ( std::map) □ Unsorted hash table ( std::unordered_map) ■ Pre-allocation strategies □ Naïve implementation (automatic growth) □ Distinct value setup (scaffold of hash map) □ Histogram-based setup (scaffold and position list) DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Data Set Size ■ Number of bytes processed during query □ Input table data □ Intermediate result data □ Final result data ■ Influencing factors □ Relation size □ Relational expression □ Grouping attributes □ Aggregation attributes □ Aggregation functions DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Distinct Values – L1 Cache Misses DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Distinct Values – CPU Cycles DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Distinct Values Distribution ● Uniform − L1 cache misses ● 8e+06 Exponential − L1 cache misses Uniform − L2 cache misses Exponential − L2 cache misses ● 6e+06 Number of cache misses ● 4e+06 2e+06 ● ● ● 0e+00 1 10 100 1000 10000 100000 Number of distinct values DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Impact of Sorted Attributes – L1 Misses DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Hash Implementations – L1 Misses DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Hash Implementations – CPU Cycles DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Summary ■ High impact of aggregation operation in a mixed OLTP and OLAP workload ■ Data characteristics influence aggregation performance ■ Hash implementation choice matters ■ Pre-allocation can reduce cache misses ■ Future work □ Multithreaded aggregations □ Extension of existing cost models DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

20 Thank you! stephan.mueller@hpi.uni-potsdam.de http://epic.hpi.uni-potsdam.de DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar - PowerPoint PPT Presentation

An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database Stephan Mller, Hasso Plattner Enterprise Platform and Integration Concepts Hasso Plattner Institute, Potsdam (Germany) DOLAP 2012 November 2nd Agenda

for each dst in my.out_edges if dst.depth > my.depth+1 then dst.depth = my.depth+1

Data Collection and Aggregation Data Collection and Aggregation 1 Challenges: data Challenges:

Evolution of valley depth and width Evolution of valley depth and width Evolution of valley depth

TUTORIAL - TUTORIAL -ABC ABC TOTAL COST for a COST OBJECT TOTAL COST for a COST OBJECT

Elmwood Park: Electricity Aggregation Developing an Opt-In Municipal Aggregation Program to

simplifying the customer experience through account aggregation Sim Sangha Business Development

The Axiomatic Method in Social Choice Theory: Preference Aggregation, Judgment Aggregation, Graph

RGBD Tutorial 14210240041 Gu Pan Image RGB YUV Lab Depth Image RGB image Depth image Each pixel in

Chapter 4 Chapter 4 Marginal Costing and Cost-Volume-Profit Analysis Cost behaviour Cost

Review of data aggregation Review of data aggregation Query distribution AVERAGE 1 1 2 2 3

Cost Report Capital Cost Operating Cost (Up front cost) (Annual cost over time) Utilities

Cost Allocation Plans and Indirect Cost Rates Cost Allocation Plans and Indirect Cost Rates

Cost Benefit Analysis ECN 240 CMD ECN 240 Cost Benefit Analysis Intro Cost Benefit Analysis

Grouping and Aggregation Grouping and Aggregation in the Concept- -Oriented Data Model Oriented

Chapter 2: Cost Behavior, Activity Analysis, and Cost Estimation Agenda History of Cost

Strategic Cyber Cost- Effectiveness Analysis Robin Smith Brief Introduction Arke Cost

Dremel: Interactice Analysis of Web-Scale Datasets By Sergey Melnik, Andrey Gubarev, Jing Jing

Deep Learning and Hardware: Matching the Demands from the Machine Learning Community Ekapol

Classifying local four gluon S-matrices Subham Dutta Chowdhury November 20, 2020 YITP Strings

EE 721: Types and Functions in VHDL September 9, 2009 1 Overview VHDL views the system being

P o s t g r e S Q L a s a C o l u m n a r S t o r e DCPUG May 2014 Reston, VA Stephen Frost

Check MIB <draft-nunzi-check-mib-00.txt> Giorgio Nunzi, Juergen Quittek, Marcus Brunner,

Intro to Oracle TimesTen --- By Sima Zhu Why in-memory? Basic Architecture TimesTen

Todays Class Carnegie Mellon Univ. Storage Models Dept. of Computer Science System

An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar - PowerPoint PPT Presentation

An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database Stephan Mller, Hasso Plattner Enterprise Platform and Integration Concepts Hasso Plattner Institute, Potsdam (Germany) DOLAP 2012 November 2nd Agenda

for each dst in my.out_edges if dst.depth &gt; my.depth+1 then dst.depth = my.depth+1

Data Collection and Aggregation Data Collection and Aggregation 1 Challenges: data Challenges:

Evolution of valley depth and width Evolution of valley depth and width Evolution of valley depth

TUTORIAL - TUTORIAL -ABC ABC TOTAL COST for a COST OBJECT TOTAL COST for a COST OBJECT

Elmwood Park: Electricity Aggregation Developing an Opt-In Municipal Aggregation Program to

simplifying the customer experience through account aggregation Sim Sangha Business Development

The Axiomatic Method in Social Choice Theory: Preference Aggregation, Judgment Aggregation, Graph

RGBD Tutorial 14210240041 Gu Pan Image RGB YUV Lab Depth Image RGB image Depth image Each pixel in

Chapter 4 Chapter 4 Marginal Costing and Cost-Volume-Profit Analysis Cost behaviour Cost

Review of data aggregation Review of data aggregation Query distribution AVERAGE 1 1 2 2 3

Cost Report Capital Cost Operating Cost (Up front cost) (Annual cost over time) Utilities

Cost Allocation Plans and Indirect Cost Rates Cost Allocation Plans and Indirect Cost Rates

Cost Benefit Analysis ECN 240 CMD ECN 240 Cost Benefit Analysis Intro Cost Benefit Analysis

Grouping and Aggregation Grouping and Aggregation in the Concept- -Oriented Data Model Oriented

Chapter 2: Cost Behavior, Activity Analysis, and Cost Estimation Agenda History of Cost

Strategic Cyber Cost- Effectiveness Analysis Robin Smith Brief Introduction Arke Cost

Dremel: Interactice Analysis of Web-Scale Datasets By Sergey Melnik, Andrey Gubarev, Jing Jing

Deep Learning and Hardware: Matching the Demands from the Machine Learning Community Ekapol

Classifying local four gluon S-matrices Subham Dutta Chowdhury November 20, 2020 YITP Strings

EE 721: Types and Functions in VHDL September 9, 2009 1 Overview VHDL views the system being

P o s t g r e S Q L a s a C o l u m n a r S t o r e DCPUG May 2014 Reston, VA Stephen Frost

Check MIB &lt;draft-nunzi-check-mib-00.txt&gt; Giorgio Nunzi, Juergen Quittek, Marcus Brunner,

Intro to Oracle TimesTen --- By Sima Zhu Why in-memory? Basic Architecture TimesTen

Todays Class Carnegie Mellon Univ. Storage Models Dept. of Computer Science System

for each dst in my.out_edges if dst.depth > my.depth+1 then dst.depth = my.depth+1

Check MIB <draft-nunzi-check-mib-00.txt> Giorgio Nunzi, Juergen Quittek, Marcus Brunner,