Real-Time On-line Analytical Processing (OLAP) On Multi-Core and Cloud Architectures Frank Dehne School of Computer Science Centre For Advanced Studies Canada Frank Dehne ■
Parallel Data Analytics
Joint work with R. Bordawekar (IBM Yorktown), J. Dale (IBM Littletown), R. Grosset (IBM Toronto), M. Genkin (IBM Toronto), S. Jou (IBM Toronto), P. Jain (IBM Littletown), M. Petitclerc (IBM Laval), A. Rau-Chaplin (Dalhousie), D. Robillard (Carleton), F. Thomas (IBM Ottawa), H. Zaboli (Carleton), R. Zhou (Carleton).

Online Analytical Processing (OLAP)
IBM/COGNOS
● Insight
● Workspace
● Report/Studio

Online Analytical Processing (OLAP)
Operations:
● roll-up
● drill-down
● slice
● dice
Traditional: Data Cube
● Pre-compute group-bys to improve query response time.
● Static or batch updates.
[Figure: lattice of group-bys over dimensions A, B, C, D, from ABCD through ABC, ABD, ACD, BCD and the two- and one-dimensional views down to All]

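As a minimal sketch of the lattice of group-bys on this slide, the following Python enumerates every group-by over dimensions A, B, C, D and pre-computes each one from a tiny fact table. The records, the measure field, and the helper names (`group_by`, `build_cube`) are illustrative assumptions, not part of the talk.

```python
from itertools import combinations
from collections import defaultdict

DIMENSIONS = ("A", "B", "C", "D")

# Hypothetical fact table: one dict per record, with a numeric measure.
RECORDS = [
    {"A": "a1", "B": "b1", "C": "c1", "D": "d1", "measure": 10},
    {"A": "a1", "B": "b2", "C": "c1", "D": "d2", "measure": 5},
    {"A": "a2", "B": "b1", "C": "c2", "D": "d1", "measure": 7},
]

def group_by(records, dims):
    """Sum the measure for every combination of values of `dims`."""
    totals = defaultdict(int)
    for r in records:
        key = tuple(r[d] for d in dims)
        totals[key] += r["measure"]
    return dict(totals)

def build_cube(records, dimensions=DIMENSIONS):
    """Pre-compute all 2^d group-bys, from the full ABCD view down to 'All'."""
    cube = {}
    for k in range(len(dimensions), -1, -1):
        for dims in combinations(dimensions, k):
            cube[dims] = group_by(records, dims)
    return cube

cube = build_cube(RECORDS)
print(len(cube))      # number of group-bys in the lattice
print(cube[("A",)])   # the view rolled up to dimension A
print(cube[()])       # 'All': the grand total
```

In this sketch a roll-up from AB to A amounts to reading `cube[("A",)]` instead of `cube[("A", "B")]`. Every new record would require updating all 16 views, which is the cost that static or batch updates try to amortize.
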
OLAP vs. OLTP
● Source of data. OLTP: operational data. OLAP: consolidated data.
● Purpose of data. OLTP: business operations. OLAP: planning, decision support.
● Type of data. OLTP: snapshot of ongoing business. OLAP: multi-dimensional views of “historic” data.
● Updates. OLTP: small and fast. OLAP: periodic long-running batch jobs.
● Queries. OLTP: relatively simple, involving few records. OLAP: often complex, involving aggregations of large data sets.
● Processing speed. OLTP: typically very fast. OLAP: depends on amount of data involved; batch updates and complex queries may take many hours.

The Five V's Of “Big Data”
● Volume
● Velocity
● Variety
● Veracity
● Value

Real-Time OLAP
● Avoid static data cube structure and batch updates.
● Stream of insert and OLAP query operations.
● Inserts are immediate.
● OLAP queries operate on latest up-to-date data set.
[Figure: insert and query stream into a real-time OLAP engine, which returns query results]

Real-Time OLAP
● Problem: Performance
● Static data cube was introduced to improve performance...

Real-Time OLAP
Research Question: Can parallel computing be used to improve performance for real-time OLAP?

Parallel Computing
● Multi-core processor: shared memory
● Cloud / cluster: distributed memory

Real-Time OLAP on Multi-Core Processors
Parallel DC-Tree:
● Multidimensional tree data structure.
● Operations: insert and query.
● Enhanced for data aggregation and dimension hierarchies (Kriegel, ICDE 2000).
● Enhanced for multi-core parallel computing (Dehne, CCGrid 2012).
[Figure: insert and query stream into a real-time OLAP engine built on a parallel DC-tree, which returns query results]

Sequential DC-Tree
● Ester, Kohlhammer, Kriegel (ICDE 2000).
● Adaptation of R-tree for OLAP.
● Replaces total ordering by conceptual hierarchies.
● Replaces minimum bounding rectangles (MBR) by minimum describing sets (MDS).
● Adds internal directory nodes.
[Figure: R-tree]

Conceptual Hierarchies
Data representation: [figure omitted]

Minimum Describing Set (MDS)
[Figure: MBR compared with MDS]

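To make the contrast between an MBR and an MDS concrete, here is a minimal sketch. It assumes a hypothetical two-level hierarchy (city below country) stored as a child-to-parent map. The hierarchy, the values, and the function names are invented for illustration and are not taken from the DC-tree paper.

```python
# Hypothetical conceptual hierarchy for one dimension: city -> country.
PARENT = {
    "Ottawa": "Canada",
    "Toronto": "Canada",
    "Boston": "USA",
    "Austin": "USA",
}

def describing_set(values, level):
    """Return the set of hierarchy values at `level` that covers `values`.

    Level 0 keeps the leaf values; level 1 lifts each value to its parent.
    """
    if level == 0:
        return set(values)
    return {PARENT[v] for v in values}

def mds_overlaps(mds_a, mds_b):
    """Two describing sets at the same level overlap if they share a value."""
    return bool(mds_a & mds_b)

leaf_values = ["Ottawa", "Toronto"]
node_mds = describing_set(leaf_values, level=1)    # covers the node's records
query_mds = describing_set(["Boston"], level=1)    # region asked for by a query
print(node_mds, mds_overlaps(node_mds, query_mds))
```

A bounding rectangle needs a total order on the values to define an interval. The describing set in this sketch only needs the hierarchy, so a query can skip a subtree whenever the sets do not overlap.
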
Parallel DC-Tree (Dehne, CCGrid 2012)
Stream of
● Inserts
● OLAP queries
[Figure: inserts/queries enter and results leave a parallel DC-tree held in the memory of a multi-core processor]

Parallel DC-Tree
Parallelization:
● Insert and OLAP query operations are executed concurrently.
● OLAP query operations that need to search multiple subtrees of a node are split into multiple concurrent processes.

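The second point, splitting a query across the subtrees of a node, can be sketched as follows. This is a simplified illustration under stated assumptions: the `Node` class is hypothetical, there is no MDS-based pruning, and no synchronization with concurrent inserts. CPython threads also do not run CPU-bound work in parallel, so the sketch shows only how the work is divided, not the speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Hypothetical tree node: leaves hold measures, inner nodes hold children."""
    measures: List[int] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)

def aggregate(node: Node) -> int:
    """Sequentially sum all measures in the subtree rooted at `node`."""
    return sum(node.measures) + sum(aggregate(c) for c in node.children)

def parallel_aggregate(root: Node, pool: ThreadPoolExecutor) -> int:
    """Search each child subtree of the root as a separate concurrent task."""
    futures = [pool.submit(aggregate, child) for child in root.children]
    return sum(root.measures) + sum(f.result() for f in futures)

# Small example tree: a root, two directory nodes, three leaves.
root = Node(children=[
    Node(children=[Node(measures=[10, 10]), Node(measures=[10])]),
    Node(children=[Node(measures=[10, 10])]),
])

with ThreadPoolExecutor(max_workers=4) as pool:
    total = parallel_aggregate(root, pool)
print(total)
```
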
Parallel DC-Tree
Main Problems:
● Interference between concurrent insert and OLAP query operations.
● Consistency (Strong Serialization): OLAP query results have to include transient inserts that have been issued prior.

Parallel DC-Tree
Race Conditions:
● Inserts and queries run at different speeds.
● Inserts traverse root to leaf and back to root.
● Queries need to traverse subtrees depending on data volume to be aggregated.
● Insert and query operations can overtake each other.

Data Structure
Add:
● Right Sibling Links
● Time Stamps
[Figure: example tree with root R, directory nodes D1 and D2, and leaf nodes L1 to L4; column labels MDS, ID, Time Stamp, Measure, List]

Lengthy Case Analysis...
CASE: New node gets old time stamp
● Insert creates a directory node split.
● Concurrent OLAP query returns back up the tree and finds tree structure changed.
[Figure: directory nodes D1 to D4 with their time stamps, before and after the split]

Parallel DC-Tree Performance
Architecture:
● Intel Xeon Westmere EX
● 20 Cores (2 Sockets)
● 40 Hardware Threads (Hyperthreading)
● 256 GB Memory
IBM Research Labs, Toronto

Parallel DC-Tree Performance
Data:
● Transaction Processing Performance Council Decision Support Benchmark (TPC-DS)

TPC-DS Benchmark
[Figure: 8 dimensions and their hierarchy levels]

Performance
OLAP Query Response Time
● 100 GB data set (10 Mil. Records)
● 10,000 queries
● 1,000 insertions

Performance
Throughput
● 100 GB data set (10 Mil. Records)
● 10,000 queries
● 1,000 insertions

Performance
● Total response time: 5 sec. -> 0.25 sec.
● Total response time: 2.7 sec. -> 0.13 sec.
IBM CAS Research Impact Of The Year Award

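For a sense of scale, the two reported improvements can be turned into ratios. The slide does not say which configurations the end points correspond to, so this small calculation only divides the reported times and assumes nothing else.

```python
# Response times reported on the slide, in seconds: (before, after).
reported = [(5.0, 0.25), (2.7, 0.13)]

# Ratio of the slower time to the faster time for each pair.
ratios = [before / after for before, after in reported]
print([round(r, 1) for r in ratios])
```

Both pairs work out to roughly a 20-fold reduction in total response time.
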
Real-Time OLAP on Cloud Architectures
[Figure: insert and OLAP query stream into a real-time OLAP engine, which returns OLAP query results]

Cloud Computing Architecture
● Large scale compute cluster
● Virtual machines on demand
● Elastic: Dynamic addition of compute resources
● Dedicated storage devices (e.g. S3 buckets)

Velocity OLAP (vOLAP) System Architecture
[Figure, insert: diagram labels Client C, Server, Image I k, Subset Di, Worker, Zookeeper: Global Image & Sync]
[Figure, OLAP query: diagram labels Client C, Server, Image I k, Subsets, three Workers]

Subset Data Structure

Load Balancing

Insert/Query Stream Serialization
● Strong serialization of insert and OLAP query operations within each session.
● Strong serialization of insert and OLAP query operations within sessions attached to the same server (workgroup).

Between Servers: Probabilistic Serialization
● n = 1 billion
● 50% coverage (500 million reported data items)
● 1 second elapsed time: approx. 1% probability of 2 missing data items (0.0000004% of the result)

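The percentage quoted in the last bullet follows directly from the other numbers on the slide. The short check below reproduces it; the variable names are chosen for illustration only.

```python
n = 1_000_000_000   # data items in the database
coverage = 0.5      # fraction of the items covered by the query
missing = 2         # data items possibly missing from the result

reported = int(n * coverage)
fraction_percent = missing / reported * 100
print(reported, f"{fraction_percent:.7f}%")
```

Two items out of 500 million is 4 in a billion, which is the 0.0000004% stated above.
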
vOLAP Performance
Architecture:
● Amazon EC2
● servers: c3.8xlarge
● workers: c3.4xlarge
● clients / manager / Zookeeper: c3.2xlarge
● Linux 3.14.35, ZeroMQ 4.0.5, Zookeeper 3.4.6.

vOLAP Performance
Data: TPC-DS
[Figure: 8 dimensions and their hierarchy levels]

Horizontal Scale-Up Performance

Impact of Workload Mix

Impact of Query Coverage