This work has been supported by ESF project No. 2009/ 0216/ 1DP/ 1.1.1.2.0/ 09/ API A/ VI AA/ 044 Near Real-time Data Warehousing with Multi-stage Trickle & Flip J ā nis Zuters , University of Latvia, BIR 2011, October 8, 2011
Data Flow In a Data Warehouse Near Real-time Data Data Source Warehouse OLAP ETL Data Loading CDC
Near Real-time Refreshment Data Data Source Warehouse ETL Microbatch ETL OLAP Frequent data loading (e.g., hourly) Data loading conflicts
The ‘Trickle & Flip’ Methodology Data Staging Source Tables ETL OLAP 1h copy flip Copy of Data Staging Warehouse Tables
‘Trickle & Flip’: The Algorithm ALGORITHM trickle_and_flip_refresh ( R ) DW – data warehouse D1 , D2 – staging partitions with the same data format as DW R – refreshment rate (e.g., 1 hour) D1 is being fed from the source BEGIN Do Every R % e.g., every hour Copy D1 to D2 % DW should not be locked by querying Flip D2 and DW
‘Trickle & Flip’: Demonstration Data Source 1h ETL flip Staging Copy of Data Tables Staging Warehouse Tables • 3 complete copies of data!
‘Trickle & Flip’ With Real-time Partition Data Staging Source OLAP Tables Static Data ETL 1h copy flip • 3 complete Real-time Partition copies of Copy of real-time Data Warehouse Staging data! Tables
Why ‘Trickle & Flip’ is Better Than “Simple” Near Real-time? • Data warehouse suffers from data loading to a significantly less extent (flipping is very fast) flip Copy of Real-time Staging Partition Tables
What Are The Issues of the “Pure” ‘Trickle & Flip’ Approach? • Frequent • If refreshment cycle times copying of become very frequent: large amount querying over real-time data of data becomes unconvenient OLAP copy flip Real-time Partition
Assumptions to Add More Stages to ‘Trickle & Flip’ • Adding data to a smaller table (i.e., with less data) is faster; • Updating last changes to a table is faster than making full copy of the last version
Multi-stage ‘Trickle & Flip’ Data Staging Staging Source Tables 2 Tables 1 add? add? OLAP add add 5 min. 1 hour ETL Static Staging Real-time Data Real-time Tables 0 Partition 1 Partition 2 Data Warehouse
Multi-stage ‘Trickle & Flip’: The Algorithm (For a Single Stage) ALGORITHM multiple_trickle_and_flip_refresh ( R1, M, H`, H ) H – real-time partition for the current hour M , H` – staging partitions R1 –refreshment rate (e.g., 5 minutes) M is being continuously fed from the source BEGIN Do Every R1 % e.g., every 5 minutes Add M to H` Empty M If H is available % not locked by querying Add H` to H Empty H`
Multi-stage ‘Trickle & Flip’: Demonstration Data Staging Staging Source Tables 1 Tables 2 OLAP ETL Static Staging Data Tables 0 Real-time Real-time Partition 1 Partition 2
The Summary of Multi-stage ‘Trickle & Flip’ • Total amount of data copying is reduced; • Collisions between data loading and querying activities have been reduced; • More advanced querying system is required to fully benefit from the approach
Thank You! Data Data Data Staging Staging Staging Staging Staging Staging Source Source Source Tables 2 Tables 2 Tables 2 Tables 1 Tables 1 Tables 1 add? add? add? add? add? add? OLAP OLAP OLAP add add add add add add 5 min. 5 min. 5 min. 1 hour 1 hour 1 hour ETL ETL ETL Static Static Staging Staging Real-time Real-time Real-time Data Data Real-time Real-time Real-time Tables 0 Tables 0 Partition 1 Partition 1 Partition 1 Partition 2 Partition 2 Partition 2 Data Warehouse Data Warehouse J ā nis Zuters , University of Latvia
Recommend
More recommend