
Real-Time On-line Analytical Processing (OLAP) On Multi-Core and Cloud Architectures


  1. Real-Time On-line Analytical Processing (OLAP) On Multi-Core and Cloud Architectures. Frank Dehne, School of Computer Science; Centre For Advanced Studies Canada. Frank Dehne ■ www.dehne.net

  2. Parallel Data Analytics. Joint work with R. Bordawekar (IBM Yorktown), J. Dale (IBM Littletown), R. Grosset (IBM Toronto), M. Genkin (IBM Toronto), S. Jou (IBM Toronto), P. Jain (IBM Littletown), M. Petitclerc (IBM Laval), A. Rau-Chaplin (Dalhousie), D. Robillard (Carleton), F. Thomas (IBM Ottawa), H. Zaboli (Carleton), R. Zhou (Carleton).

  3. Online Analytical Processing (OLAP). IBM/COGNOS: ● Insight ● Workspace ● Report/Studio

  4. Online Analytical Processing (OLAP)

  5. Online Analytical Processing (OLAP). Operations: ● roll-up ● drill-down ● slice ● dice [Figure: data cube with group-bys A, B, C, AB, AC, BC, ABC]
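
The four operations can be illustrated on a small fact table. This is a minimal sketch and not taken from the talk: the records, the dimension names A, B, C, the `sales` measure and all function names are assumptions made for illustration. Roll-up is shown here as dropping a dimension; climbing a dimension hierarchy would be the other common form.

```python
from collections import defaultdict

# Hypothetical fact table with dimensions A, B, C and one measure (assumed data).
RECORDS = [
    {"A": "a1", "B": "b1", "C": "c1", "sales": 10},
    {"A": "a1", "B": "b2", "C": "c1", "sales": 20},
    {"A": "a2", "B": "b1", "C": "c2", "sales": 5},
    {"A": "a2", "B": "b2", "C": "c2", "sales": 7},
]

def aggregate(records, group_by, measure="sales"):
    """Sum the measure for each combination of values of the group-by dimensions."""
    totals = defaultdict(int)
    for record in records:
        totals[tuple(record[d] for d in group_by)] += record[measure]
    return dict(totals)

def slice_cube(records, dimension, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in records if r[dimension] == value]

def dice_cube(records, allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [r for r in records
            if all(r[d] in values for d, values in allowed.items())]

detailed = aggregate(RECORDS, ("A", "B"))   # drill-down view: group by A and B
rolled_up = aggregate(RECORDS, ("A",))      # roll-up: drop B, group by A only
c1_only = slice_cube(RECORDS, "C", "c1")    # slice on C = c1
sub_cube = dice_cube(RECORDS, {"A": {"a1", "a2"}, "B": {"b1"}})  # dice on A and B
```

Drill-down is the inverse of roll-up: going from `rolled_up` back to `detailed` re-introduces dimension B.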

  6. Online Analytical Processing (OLAP). Operations: ● roll-up ● drill-down ● slice ● dice [Figure: data cube (A, B, C, AB, AC, BC, ABC) and group-by lattice: ABCD; ABC, ABD, ACD, BCD; AB, AC, AD, BC, BD, CD; A, B, C, D; All]

  7. Online Analytical Processing (OLAP). Operations: ● roll-up ● drill-down ● slice ● dice. Traditional: Data Cube. Pre-compute group-bys to improve query response time. Static or batch updates. [Figure: data cube and group-by lattice: ABCD; ABC, ABD, ACD, BCD; AB, AC, AD, BC, BD, CD; A, B, C, D; All]
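
The idea of pre-computing every group-by of the lattice can be sketched as follows. This is an illustrative brute-force version under stated assumptions (toy records, a `sales` measure, hypothetical function names), not the cube construction used in any particular product.

```python
from collections import defaultdict
from itertools import combinations

def group_by_lattice(dimensions):
    """All subsets of the dimensions, from the full set down to 'All' (the empty tuple)."""
    return [subset
            for size in range(len(dimensions), -1, -1)
            for subset in combinations(dimensions, size)]

def build_data_cube(records, dimensions, measure):
    """Pre-compute the sum of `measure` for every group-by in the lattice."""
    cube = {}
    for group_by in group_by_lattice(dimensions):
        totals = defaultdict(int)
        for record in records:
            totals[tuple(record[d] for d in group_by)] += record[measure]
        cube[group_by] = dict(totals)
    return cube

# Hypothetical fact table with four dimensions A, B, C, D (assumed data).
records = [
    {"A": "a1", "B": "b1", "C": "c1", "D": "d1", "sales": 10},
    {"A": "a1", "B": "b2", "C": "c1", "D": "d2", "sales": 20},
    {"A": "a2", "B": "b1", "C": "c2", "D": "d1", "sales": 5},
]
cube = build_data_cube(records, ("A", "B", "C", "D"), "sales")

# A query is now a dictionary lookup instead of a scan over the records.
total_a1_c1 = cube[("A", "C")][("a1", "c1")]
grand_total = cube[()][()]
```

With four dimensions the lattice has 2^4 = 16 group-bys. Every insert would have to touch all of them, which is why such cubes are typically maintained by static or batch updates.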

  8. OLAP vs. OLTP
     ● Source of data. OLTP system: operational data. OLAP system: consolidated data.
     ● Purpose of data. OLTP system: business operations. OLAP system: planning, decision support.
     ● Type of data. OLTP system: snapshot of ongoing business. OLAP system: multi-dimensional views of “historic” data.
     ● Updates. OLTP system: small and fast. OLAP system: periodic long-running batch jobs.
     ● Queries. OLTP system: relatively simple, involving few data records. OLAP system: often complex, involving aggregations of large data sets.
     ● Processing speed. OLTP system: typically very fast. OLAP system: depends on amount of data involved; batch updates and complex queries may take many hours.
     Source: AcceleratedAnalytics.com

  9. The Five V's Of “Big Data”: ● Volume ● Velocity ● Variety ● Veracity ● Value [Figure: group-by lattice: ABCD; ABC, ABD, ACD, BCD; AB, AC, AD, BC, BD, CD; A, B, C, D; All]

  10. Real-Time OLAP ● Avoid static data cube structure and batch updates. ● Stream of insert and OLAP query operations. ● Inserts are immediate. ● OLAP queries operate on latest up-to-date data set. [Figure: insert & query stream -> real-time OLAP engine -> query results]

  11. Real-Time OLAP ● Problem: performance. ● Static data cube was introduced to improve performance... [Figure: insert & query stream -> real-time OLAP engine -> query results]

  12. Real-Time OLAP. Research Question: Can parallel computing be used to improve performance for real-time OLAP? [Figure: insert & query stream -> real-time OLAP engine -> query results]

  13. Parallel Computing ● Multi-core processor: shared memory ● Cloud / cluster: distributed memory

  14. Real-Time OLAP on Multi-Core Processors [Figure: insert & query stream -> real-time OLAP engine -> query results]

  15. Real-Time OLAP on Multi-Core Processors [Figure: insert & query stream -> real-time OLAP engine (parallel DC-tree) -> query results]

  16. Real-Time OLAP on Multi-Core Processors ● Multidimensional tree data structure. ● Operations: insert and query. ● Enhanced for data aggregation and dimension hierarchies (Kriegel et al., ICDE 2000). ● Enhanced for multi-core parallel computing (Dehne et al., CCGrid 2012). [Figure: insert & query stream -> real-time OLAP engine (parallel DC-tree) -> query results]

  17. Sequential DC-Tree ● Ester, Kohlhammer, Kriegel (ICDE 2000). ● Adaptation of R-tree for OLAP. ● Replaces total ordering by conceptual hierarchies. ● Replaces minimum bounding rectangles (MBR) by minimum describing sets (MDS). ● Adds internal directory nodes. [Figure: R-tree]

  18. Conceptual Hierarchies

  19. Conceptual Hierarchies. Data representation:

  20. Minimum Describing Set (MDS) [Figure panels: MBR, MDS]
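
To make the contrast with bounding rectangles concrete: a bounding rectangle needs a total order on each dimension, whereas a describing set only needs, per dimension, a set of values drawn from that dimension's conceptual hierarchy. The following is a simplified sketch under assumptions: the two-level hierarchies, the dimension names and the functions are all hypothetical, and the actual DC-tree definition in Ester, Kohlhammer and Kriegel (ICDE 2000) is more general.

```python
# Hypothetical two-level conceptual hierarchies: leaf value -> parent value.
HIERARCHIES = {
    "location": {"Ottawa": "Canada", "Toronto": "Canada", "Boston": "USA"},
    "product": {"Laptop": "Computers", "Tablet": "Computers", "Phone": "Telecom"},
}

def describing_set(records, level="leaf"):
    """Per dimension, the set of hierarchy values that covers all given records.

    level="leaf" keeps the record values themselves; level="parent" replaces
    each value by its parent in the hierarchy, giving a coarser description.
    """
    mds = {}
    for dimension, parent_of in HIERARCHIES.items():
        values = {record[dimension] for record in records}
        if level == "parent":
            values = {parent_of[value] for value in values}
        mds[dimension] = values
    return mds

def may_contain(mds, query):
    """True if, in every queried dimension, the describing set shares a value with the query."""
    return all(mds[dimension] & wanted for dimension, wanted in query.items())

records = [
    {"location": "Ottawa", "product": "Laptop"},
    {"location": "Toronto", "product": "Phone"},
]
leaf_mds = describing_set(records)
parent_mds = describing_set(records, level="parent")
```

A directory node can store the coarser description and prune any query whose value sets do not intersect it, which plays the role that rectangle overlap plays in an R-tree.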

  21. Parallel DC-Tree (Dehne et al., CCGrid 2012). Stream of: ● Inserts ● OLAP queries [Figure: inserts/queries -> parallel DC-tree in the memory of a multi-core processor -> results]

  22. Parallel DC-Tree. Parallelization: ● Insert and OLAP query operations are executed concurrently. ● OLAP query operations that need to search multiple subtrees of a node are split into multiple concurrent processes. [Figure: inserts/queries -> parallel DC-tree in the memory of a multi-core processor -> results]
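
The second point, splitting a query across the subtrees it has to search, can be sketched as below. This is an assumption-laden illustration, not the CCGrid 2012 implementation: it uses a one-dimensional stand-in for the MDS, threads from `concurrent.futures` rather than processes, splits only at the root, and ignores concurrent inserts entirely. The `Node` class and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class Node:
    values: frozenset   # dimension values covered by this subtree (stand-in for an MDS)
    measure: int        # pre-aggregated sum of all records below this node
    children: list = field(default_factory=list)

def leaf(value, measure):
    """A leaf covers exactly one value in this simplified model."""
    return Node(frozenset([value]), measure)

def directory(children):
    """A directory node covers the union of its children and stores their total."""
    values = frozenset().union(*(child.values for child in children))
    return Node(values, sum(child.measure for child in children), list(children))

def query(node, wanted):
    """Sequentially sum the measures of all records whose value is in `wanted`."""
    if not node.values & wanted:
        return 0                  # subtree is disjoint from the query: prune
    if node.values <= wanted:
        return node.measure       # subtree fully covered: use the stored aggregate
    return sum(query(child, wanted) for child in node.children)

def parallel_query(root, wanted, pool):
    """Search the root's subtrees concurrently and add up the partial results."""
    if not root.values & wanted:
        return 0
    if root.values <= wanted:
        return root.measure
    futures = [pool.submit(query, child, wanted) for child in root.children]
    return sum(future.result() for future in futures)

root = directory([
    directory([leaf("a", 1), leaf("b", 2)]),
    directory([leaf("c", 3), leaf("d", 4)]),
])

with ThreadPoolExecutor(max_workers=4) as pool:
    total = parallel_query(root, frozenset({"a", "b", "c"}), pool)
```

The sketch shows the shape of the decomposition only. In CPython, threads will not speed up this pure-Python traversal, and a real engine also has to coordinate these searches with concurrent inserts.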

  23. Parallel DC-Tree. Main Problems: ● Interference between concurrent insert and OLAP query operations. ● Consistency (strong serialization): OLAP query results have to include transient inserts that have been issued prior. [Figure: inserts/queries -> parallel DC-tree in the memory of a multi-core processor -> results]

  24. Parallel DC-Tree. Race Conditions: ● Inserts and queries run at different speeds. ● Inserts traverse from root to leaf and back to root. ● Queries need to traverse subtrees depending on data volume to be aggregated. ● Insert and query operations can overtake each other. [Figure: inserts/queries -> parallel DC-tree in the memory of a multi-core processor -> results]

  25. Data Structure. Add: ● Right sibling links ● Time stamps. Node fields shown: MDS, ID, Time Stamp, Measure, List. [Figure: example tree with root R and nodes labelled D1 (1, 20), D2 (2, 20), L1 (3, 10), L2 (6, 10), L3 (4, 10), L4 (5, 10)]
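
One way right sibling links and time stamps can work together is sketched below. The slides do not state the exact rule the parallel DC-tree uses, so everything here is an assumption: the sketch borrows the general idea of right links from B-link trees, treats the time stamp as a node's creation time, and lets a reader follow links only to siblings created after it started. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    node_id: str
    time_stamp: int                       # assumed meaning: creation time of the node
    entries: list = field(default_factory=list)   # (key, measure) pairs
    right_sibling: Optional["Node"] = None

def split(node, now):
    """Move the upper half of the entries into a new right sibling created at time `now`."""
    middle = len(node.entries) // 2
    new_node = Node(node.node_id + "'", now, node.entries[middle:], node.right_sibling)
    node.entries = node.entries[:middle]
    node.right_sibling = new_node
    return new_node

def read_entries(node, started_at):
    """Read a node as seen by an operation that started at time `started_at`.

    Siblings created after the reader started hold entries that were split off
    from this node in the meantime, so the reader follows the links to them.
    It stops at the first sibling that already existed when it started.
    """
    result = list(node.entries)
    sibling = node.right_sibling
    while sibling is not None and sibling.time_stamp > started_at:
        result.extend(sibling.entries)
        sibling = sibling.right_sibling
    return result

d2 = Node("D2", 2, [("k5", 20)])
d1 = Node("D1", 1, [("k1", 10), ("k2", 10), ("k3", 10), ("k4", 10)], right_sibling=d2)

query_started_at = 5
split(d1, now=6)                          # a concurrent insert splits D1
seen = read_entries(d1, query_started_at)
```

Under these assumptions, the query that started before the split still sees all four entries that D1 held when it began, without reading D2.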

  26. Lengthy Case Analysis... CASE: New node gets old time stamp ● Insert creates a directory node split. ● Concurrent OLAP query returns up the tree and finds tree structure changed. [Figure: directory nodes D1, D2, D3, D4 before and after the split]

  27. Parallel DC-Tree Performance. Architecture: ● Intel Xeon Westmere EX ● 20 cores (2 sockets) ● 40 hardware threads (Hyperthreading) ● 256 GB memory. IBM Research Labs, Toronto.

  28. Parallel DC-Tree Performance. Data: ● Transaction Processing Performance Council (tpc.org) Decision Support Benchmark (TPC-DS)

  29. TPC-DS Benchmark [Figure: 8 dimensions and their hierarchy levels]

  30. Performance: OLAP Query Response Time ● 100 GB data set (10 million records) ● 10,000 queries ● 1,000 insertions

  31. Performance: Throughput ● 100 GB data set (10 million records) ● 10,000 queries ● 1,000 insertions

  32. Performance. Total response time: 5 sec. -> 0.25 sec. and 2.7 sec. -> 0.13 sec. [Figure: two response-time plots]

  33. Performance. Total response time: 5 sec. -> 0.25 sec. and 2.7 sec. -> 0.13 sec. IBM CAS Research Impact Of The Year Award. [Figure: two response-time plots]
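
As a quick check of what the two before/after figures amount to, the reduction factor can be computed directly. The helper name is hypothetical, and the slides do not state which configurations the two pairs of numbers belong to.

```python
def reduction_factor(before_seconds, after_seconds):
    """How many times shorter the response time became."""
    return before_seconds / after_seconds

first = reduction_factor(5.0, 0.25)    # 5 sec. -> 0.25 sec.
second = reduction_factor(2.7, 0.13)   # 2.7 sec. -> 0.13 sec.
```

Both pairs correspond to a reduction of roughly a factor of 20.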

  34. Real-Time OLAP on Cloud Architectures

  35. Real-Time OLAP on Cloud Architectures [Figure: insert & OLAP query stream -> real-time OLAP engine -> OLAP query results]

  36. Cloud Computing Architecture ● Large-scale compute cluster ● Virtual machines on demand ● Elastic: dynamic addition of compute resources ● Dedicated storage devices (e.g. S3 buckets)

  37. Velocity OLAP (vOLAP) System Architecture

  38. Velocity OLAP (vOLAP) System Architecture: Insert [Figure: client C; server with image I_k; worker with subset D_i; Zookeeper: global image & sync]

  39. Velocity OLAP (vOLAP) System Architecture: OLAP Query [Figure: client C; server with image I_k; subsets held by several workers]

  40. Subset Data Structure

  41. Load Balancing

  42. Insert/Query Stream Serialization ● Strong serialization of insert and OLAP query operations within each session. ● Strong serialization of insert and OLAP query operations within sessions attached to the same server (workgroup).

  43. Between Servers: Probabilistic Serialization

  44. Between Servers: Probabilistic Serialization ● n = 1 billion ● 50% coverage (500 million reported data items) ● 1 second elapsed time: approx. 1% probability of 2 missing data items (0.0000004% of the result)
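
The quoted percentage follows directly from the other numbers on the slide, as this small check shows. The function name is hypothetical; the 1% probability itself comes from the authors' analysis and is not derived here.

```python
def missing_share_percent(total_items, coverage, missing_items):
    """Missing items as a percentage of the items a query reports."""
    reported_items = total_items * coverage
    return missing_items / reported_items * 100

reported = int(1_000_000_000 * 0.5)                    # 500 million reported data items
share = missing_share_percent(1_000_000_000, 0.5, 2)   # 2 missing data items
formatted = f"{share:.7f}%"
```

Two missing items out of 500 million is 4 in a billion, which is the 0.0000004% stated above.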

  45. vOLAP Performance. Architecture: ● Amazon EC2 ● servers: c3.8xlarge ● workers: c3.4xlarge ● clients / manager / Zookeeper: c3.2xlarge ● Linux 3.14.35, ZeroMQ 4.0.5, Zookeeper 3.4.6.

  46. vOLAP Performance. Data: TPC-DS [Figure: 8 dimensions and their hierarchy levels]

  47. Horizontal Scale-Up Performance

  48. Impact of Workload Mix

  49. Impact of Query Coverage
