Data Ingestion for the Connected World
John Meehan, Cansu Aslantas, Stan Zdonik (Brown University)
Nesime Tatbul (Intel Labs & MIT)
Jiang Du (University of Toronto)
The IoT Era
Traditional Data Ingestion (ETL)
[Figure: ETL pipeline. Stages: Extract, Transform, Load. Labels: data sources, flat files, staging, data cleaning, data normalization, intermediate results, OLAP/storage, data warehouse.]
An Example: TPC-DI
• Brokerage firm
• 6 heterogeneous sources
• 3 key parts:
  1. Ingest raw data
  2. ETL transform
  3. Update warehouse
http://www.tpc.org/tpcdi/
Poess et al., VLDB 2014
An Example: TPC-DI
• Brokerage firm
• 6 heterogeneous sources
• 3 key parts:
  1. Ingest raw data
  2. ETL transform
  3. Update warehouse
✓ Data collected into flat files
✓ Heterogeneous data types
✓ Incremental update from an OLTP source, once a day
http://www.tpc.org/tpcdi/
Poess et al., VLDB 2014
An Example: TPC-DI
• Brokerage firm
• 6 heterogeneous sources
• 3 key parts:
  1. Ingest raw data
  2. ETL transform
  3. Update warehouse
✓ Storage for intermediate results
✓ Transactional state management
http://www.tpc.org/tpcdi/
Poess et al., VLDB 2014
An Example: TPC-DI
• Brokerage firm
• 6 heterogeneous sources
• 3 key parts:
  1. Ingest raw data
  2. ETL transform
  3. Update warehouse
✓ Bulk loading
http://www.tpc.org/tpcdi/
Poess et al., VLDB 2014
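The three parts listed above (ingest raw data, ETL transform, update warehouse) can be sketched in a few lines. This is only an illustration under stated assumptions: the two flat-file formats, the `trades` table, and the function names are hypothetical and are not taken from the TPC-DI specification, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file contents; TPC-DI's real files and schemas differ.
CSV_SOURCE = "T1,ibm,100\nT2,msft,250\n"
PIPE_SOURCE = "T3|aapl|75\n"


def ingest(text, delimiter):
    """Step 1: read raw rows from one flat-file source."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def transform(rows):
    """Step 2: clean and normalize (upper-case symbols, integer quantities)."""
    return [(trade_id, symbol.upper(), int(qty)) for trade_id, symbol, qty in rows]


def load(conn, rows):
    """Step 3: bulk-load the transformed rows in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.execute("CREATE TABLE trades (trade_id TEXT, symbol TEXT, qty INTEGER)")
staged = ingest(CSV_SOURCE, ",") + ingest(PIPE_SOURCE, "|")  # heterogeneous sources
load(conn, transform(staged))
```

The `staged` list plays the role of the intermediate results that a traditional pipeline keeps in a staging area between steps.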
Streaming Data Ingestion
• In modern apps such as IoT:
  – real-time streams of data from a large number of sources
  – the majority of these sources report in the form of time-series
  – data currency & low latency are key for real-time decision making & control
✓ Need a stream-based ingestion architecture
✓ Must pay attention to time-series data type and operations (both during ingestion & analytics)
An Architecture for Streaming Data Ingestion
Implementation
[Figure: architecture diagram. Labels: data sources; Kafka (data collector); S-Store (streaming ETL, stored procedures SP1, SP2, SP3, main-memory storage); data migrator (BigDAWG); Postgres (OLAP, disk storage).]
S-Store: Shared Mutable State in Streaming
• A hybrid system for transaction & stream processing
  – combines main-memory OLTP with streaming constructs (windowing, triggers, dataflow graphs)
• Transactions as user-defined stored procedures (Java + SQL)
• Three complementary correctness guarantees
  – ACID, for individual transactions
  – Ordered execution, for streams and dataflow graphs
  – Exactly-once processing, for streams (no loss or duplicates due to failures/recovery)
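To make the three guarantees concrete, here is a minimal sketch of how batch identifiers can give all-or-nothing updates, ordered execution, and duplicate suppression on replay. It is not S-Store's API or implementation: the class `ToyStreamProcedure`, its fields, and the use of consecutive integer batch ids are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStreamProcedure:
    """Hypothetical stand-in for a streaming stored procedure (not S-Store's API)."""
    state: dict = field(default_factory=dict)  # shared mutable state (a toy table)
    last_batch_id: int = 0                     # highest batch id committed so far

    def execute(self, batch_id, rows):
        """Apply one input batch atomically; return True if it was applied."""
        if batch_id <= self.last_batch_id:
            return False  # batch replayed after recovery: skip it (no duplicates)
        if batch_id != self.last_batch_id + 1:
            raise ValueError("out-of-order batch")  # enforce ordered execution
        staged = dict(self.state)  # work on a copy ...
        for key, amount in rows:
            staged[key] = staged.get(key, 0) + amount
        self.state = staged        # ... and commit all-or-nothing
        self.last_batch_id = batch_id
        return True
```

If any row in a batch raises an error, the copy is discarded and the committed state is untouched, so the batch can be retried as a whole.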
Example: A TPC-DI Dataflow Graph in S-Store
[Figure: dataflow graph. Recoverable labels: Trade Data (Staging); Update Date, Time, Status, Type; Security Lookup; Account Lookup; tables Date, Time, Status, Type, DimSecurity, DimAccount, DimTrade.]
Example: A TPC-DI Dataflow Graph in S-Store
• Transaction Execution (TE) = an instance of a stored procedure executing on an input batch
[Figure: dataflow graph annotated with TE1 and TE2. Recoverable labels: Trade Data (Staging); Update Date, Time, Status, Type; Security Lookup; Account Lookup; tables Date, Time, Status, Type, DimSecurity, DimAccount, DimTrade.]
Example: A TPC-DI Dataflow Graph in S-Store
• Shared state read or written by TEs
[Figure: dataflow graph annotated with TE1 and TE2. Recoverable labels: Trade Data (Staging); Update Date, Time, Status, Type; Security Lookup; Account Lookup; tables Date, Time, Status, Type, DimSecurity, DimAccount, DimTrade.]
Data Migrator
• Provides durable migration into the data warehouse using an ack mechanism that simulates 2PC
• Leverages the BigDAWG polystore middleware (see Session 4)
  – can support a variety of destination warehouses
  – can participate in federated querying
• Supports both “push” and “pull” modes
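One plausible reading of an ack-based migration is sketched below: the source keeps each batch until the destination acknowledges it, and the destination ignores batches it has already applied, so a lost ack leads to a harmless re-send rather than to loss or duplication. The classes `ToyWarehouse` and `ToyMigrator` and the `lose_ack` flag are hypothetical; this is not the BigDAWG or S-Store interface.

```python
class ToyWarehouse:
    """Hypothetical destination; remembers applied batch ids to stay idempotent."""

    def __init__(self):
        self.rows = []
        self.applied = set()

    def load(self, batch_id, rows):
        if batch_id not in self.applied:  # ignore batches that were re-sent
            self.rows.extend(rows)
            self.applied.add(batch_id)
        return batch_id                   # the acknowledgement


class ToyMigrator:
    """Hypothetical source side; a batch is dropped only after its ack arrives."""

    def __init__(self, warehouse):
        self.warehouse = warehouse
        self.pending = {}                 # batch_id -> rows still awaiting an ack

    def stage(self, batch_id, rows):
        self.pending[batch_id] = list(rows)

    def push(self, lose_ack=False):
        for batch_id in sorted(self.pending):  # sorted() copies the keys first
            ack = self.warehouse.load(batch_id, self.pending[batch_id])
            if lose_ack:
                continue                  # simulate a lost ack: keep the batch
            if ack == batch_id:
                del self.pending[batch_id]
```

Calling `push(lose_ack=True)` followed by `push()` delivers each row to the warehouse exactly once, even though the batch is sent twice.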
TPC-DI Experiment: Push vs. Pull Tradeoffs
• How often to migrate? Push or pull?
• Impacts:
  – Maximum ingest latency in S-Store
  – Query execution time in Postgres
  – Staleness of the query results in Postgres
• Result summary: Push in small batches, every 1-5 seconds. Fine-grained ingestion performs well.
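The "push in small batches" result can be illustrated with a time-based buffer that hands rows to a sink once a chosen interval has elapsed. This is a sketch, not the experiment's code: `MicroBatcher`, its parameters, and the injected `clock` are assumptions introduced here so the behavior can be exercised without real waiting.

```python
class MicroBatcher:
    """Hypothetical push-mode buffer: flush once `interval_s` seconds have passed."""

    def __init__(self, interval_s, sink, clock):
        self.interval_s = interval_s
        self.sink = sink              # callable that receives a list of rows
        self.clock = clock            # callable returning the current time in seconds
        self.buffer = []
        self.last_flush = clock()

    def add(self, row):
        self.buffer.append(row)
        if self.clock() - self.last_flush >= self.interval_s:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)    # push the whole batch downstream
            self.buffer = []
        self.last_flush = self.clock()
```

In real use one might pass `time.monotonic` as the clock and an interval between 1 and 5 seconds. The interval roughly bounds the staleness seen by warehouse queries, but because this sketch only checks the clock inside `add`, a stream that goes quiet would need a separate timer to call `flush`.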
Ongoing Work
• Time-series data management (ingestion & beyond)
  – New ingestion challenges and opportunities (e.g., synchronization/alignment of time-series, using predictive techniques for dealing with missing/delayed values)
  – Append-based updates, window-based reads
  – Need to support complex analytics operations (forecasting/prediction, pattern matching, anomaly detection, signal processing)
  – Exploit the resources on edge devices
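As a small illustration of the alignment problem mentioned above, the sketch below resamples one time-series onto a shared grid of timestamps, filling gaps by linear interpolation. Linear interpolation is a simple stand-in chosen here, not the predictive technique the slide refers to, and the function name `align` is hypothetical. It assumes a non-empty series with distinct, ascending timestamps and an ascending grid.

```python
def align(series, grid):
    """Resample sorted (timestamp, value) points onto `grid`.

    Missing grid points inside the observed range are linearly interpolated;
    grid points outside the observed range are returned as None.
    """
    out = []
    i = 0
    for t in grid:
        # Advance to the last observation at or before t.
        while i + 1 < len(series) and series[i + 1][0] <= t:
            i += 1
        t0, v0 = series[i]
        if t == t0:
            out.append(v0)            # exact observation
        elif t < t0 or i + 1 == len(series):
            out.append(None)          # before the first or after the last point
        else:
            t1, v1 = series[i + 1]
            out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return out
```

Applying `align` to several sensors with the same grid yields value lists of equal length that can be compared position by position, which is the precondition for window-based reads across sources.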