Autonomous ETL With Materialized Views Abhishek Somani, Adesh Rao May 2018

Agenda
1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark, etc.)
2. Difficulties in structuring data
3. A case for Materialized Views
4. Challenges with Materialized Views
5. Solution

Data structuring for SQL-on-Hadoop
● Partitioning

Data organization for SQL-on-Hadoop
● Columnar File Formats
  ○ Parquet
  ○ ORC

Data organization for SQL-on-Hadoop
● Sorting
● Bucketing

Data organization for SQL-on-Hadoop
Speedup of Unsorted vs Sorted ORC data on TPC-DS scale 1000

Difficulties in Structuring Data
● Evolving query patterns
● Data pipeline dependencies
● Large number of consumers
● Data Admin Involvement
● Downtime
● Workload Aware identification of optimal data structure
● Flexibility of data structuring
● Seamless restructuring
● Continuous and automatic maintenance
● NO DOWNTIME!

Basics: Materialized View
● A materialized view is a database object that contains the results of a query.
● It is a view for which the data has been materialized.
● Materialized Views can be consumed automatically by the query engine.
Example:
CREATE MATERIALIZED VIEW mv AS
SELECT seller_id, seller_name, num_item*cost AS value FROM sales;
Effect: Query rewrite
SELECT seller_id, num_item*cost AS value FROM sales;
~
SELECT seller_id, value FROM mv;
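The rewrite on this slide can be checked on a small scale. The following is a minimal sketch, assuming SQLite as a stand-in engine: SQLite has no materialized views, so a plain table built from the defining query plays the role of `mv`, and the sample rows are made up for illustration. It only shows that the original query and its rewritten form return the same rows; it does not model how Hive or Calcite performs the rewrite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (seller_id INTEGER, seller_name TEXT,
                        num_item INTEGER, cost REAL);
    -- Illustrative rows only; not data from the talk.
    INSERT INTO sales VALUES (1, 'alice', 2, 5.0),
                             (2, 'bob',   3, 4.0),
                             (1, 'alice', 1, 7.5);
    -- Stand-in for CREATE MATERIALIZED VIEW: store the query result as a table.
    CREATE TABLE mv AS
        SELECT seller_id, seller_name, num_item * cost AS value FROM sales;
""")

# Query as written by the user, against the base table.
original = conn.execute(
    "SELECT seller_id, num_item * cost AS value FROM sales "
    "ORDER BY seller_id, value"
).fetchall()

# Same query after rewriting it to read the precomputed column from mv.
rewritten = conn.execute(
    "SELECT seller_id, value FROM mv ORDER BY seller_id, value"
).fetchall()

print(original == rewritten)
```

If a row is later inserted into `sales`, the stand-in table `mv` is not updated, which is the staleness problem the later slides on invalidation address.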

Materialized Views in Hive for Data Restructuring
Interesting properties of Materialized Views in Hive:
● A copy of the data (full, partial or transformed)
● Used automatically by the engine based on cost analysis
● Can be stored as ORC, Parquet etc.
● Multiple materialized views can co-exist, optimally chosen
Plus: Storage is cheap
Idea: Create multiple materialized views of the full data with desired structures

Materialized Views for Data Restructuring
Example:
Original Table T1:
● Partitioned on Year, Month, Day
● Stored as Text
Materialized View MV1:
● Partitioned on Year, Month, Day
● Sorted on Customer_Id
● Stored as ORC
Materialized View MV2:
● Partitioned on Year, Month, Day
● Sorted on Seller_Id
● Stored as ORC
Query1: SELECT * from T1 where customer_id = 26988 and month = "January";
Rewritten: SELECT * from MV1 where customer_id = 26988 and month = "January";
Query2: SELECT * from T1 where seller_id = 121 and month = "January";
Rewritten: SELECT * from MV2 where seller_id = 121 and month = "January";
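To make the choice between T1, MV1 and MV2 concrete, here is a rough sketch of picking a layout from a query's filter columns. The deck says Hive chooses based on cost analysis; the scoring below is a simplified heuristic swapped in for illustration, and the `Layout` type, the weights and `pick_layout` are hypothetical names, not part of Hive or FastCopy.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Layout:
    name: str
    partition_cols: Tuple[str, ...]
    sort_col: Optional[str]
    file_format: str


# The three layouts from the slide.
LAYOUTS = [
    Layout("T1", ("year", "month", "day"), None, "text"),
    Layout("MV1", ("year", "month", "day"), "customer_id", "orc"),
    Layout("MV2", ("year", "month", "day"), "seller_id", "orc"),
]


def score(layout, filter_cols):
    """Hypothetical score: higher means the layout suits the filters better."""
    points = sum(1 for c in layout.partition_cols if c in filter_cols)
    if layout.sort_col in filter_cols:
        points += 2  # a filter on the sort column can skip most of the data
    if layout.file_format in ("orc", "parquet"):
        points += 1  # columnar formats are generally cheaper to scan
    return points


def pick_layout(filter_cols, layouts=LAYOUTS):
    # max() keeps the first layout on ties.
    return max(layouts, key=lambda layout: score(layout, filter_cols))


print(pick_layout({"customer_id", "month"}).name)
print(pick_layout({"seller_id", "month"}).name)
```

With these weights, Query1 is routed to MV1 and Query2 to MV2, matching the rewrites on the slide.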

Materialized Views in SQL-on-Hadoop engines
● Basic implementation available in Apache Hive 2.3.0
  ○ Uses Apache Calcite for query optimization and query rewrite
  ○ Multi file format support. Uses ORC (by default) for optimized columnar storage of materialized queries
● Not available in Presto
● Not available in Spark

Challenges with Materialized Views
● Invalidation
  ○ Only a subset of use cases can work with stale data
● Rebuilds and Refreshes
  ○ Prohibitively expensive for full data copies
● Maintenance Isolation
  ○ Ongoing queries get affected

FastCopy: A framework for Autonomous Materialized Views
● Materialized Views for Sorting, Partitioning and Bucketing for structuring data
● Synchronous invalidation on table updates
● Asynchronous automatic refreshes
● Maintenance isolation by refreshes in their own scheduler queues, or even their own cluster
● Recommendation Engine to suggest Materialized Views
● Cross engine support for using Materialized Views
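The split between synchronous invalidation and asynchronous refresh can be sketched as a small state tracker. This is a toy model under stated assumptions, not FastCopy's implementation: the class and method names are hypothetical, and falling back to the base table while a copy is invalid is an assumption made here so that queries never see stale data.

```python
from collections import deque


class FastCopyCatalog:
    """Toy catalog tracking whether each copy is valid for its base table."""

    def __init__(self):
        self.valid = {}               # copy name -> bool
        self.base_of = {}             # copy name -> base table name
        self.refresh_queue = deque()  # copies waiting for a rebuild

    def register(self, copy_name, base_table):
        self.base_of[copy_name] = base_table
        self.valid[copy_name] = True

    def on_table_update(self, base_table):
        # Synchronous step: runs as part of the write to the base table,
        # so no query can use a stale copy after the write completes.
        for copy_name, base in self.base_of.items():
            if base == base_table and self.valid[copy_name]:
                self.valid[copy_name] = False
                self.refresh_queue.append(copy_name)

    def route(self, copy_name):
        # Assumption: an invalid copy is bypassed in favour of the base table.
        return copy_name if self.valid[copy_name] else self.base_of[copy_name]

    def run_pending_refreshes(self):
        # Asynchronous step: would run later, in its own queue or cluster.
        refreshed = []
        while self.refresh_queue:
            copy_name = self.refresh_queue.popleft()
            # A real refresh would rebuild the copy's data here.
            self.valid[copy_name] = True
            refreshed.append(copy_name)
        return refreshed
```

Keeping invalidation cheap (a flag flip) and pushing the expensive rebuild onto a queue is what lets the refresh run in isolation from user queries.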

Qubole FastCopy Infrastructure

Qubole FastCopy Infrastructure: FastCopy Creation

Qubole FastCopy Infrastructure: Incoming query for rewrite

Qubole FastCopy Infrastructure: Query Rewrite

Qubole FastCopy Infrastructure: Invalidation and Refresh

Fun Details
● Auto detect added, dropped or updated partitions using partition level tokens
● Multi Version Concurrency Control for FastCopy
● Minion clusters for workload isolation
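Detecting changed partitions with per-partition tokens amounts to comparing two snapshots. The sketch below assumes a token is any value that changes whenever the partition's data changes (for example a version counter or a checksum); the deck does not say what FastCopy's tokens actually contain, and `diff_partitions` is a hypothetical helper.

```python
def diff_partitions(old_tokens, new_tokens):
    """Compare two {partition: token} snapshots.

    Returns three sorted lists: partitions added, dropped, and updated
    since the old snapshot was taken.
    """
    old_keys, new_keys = old_tokens.keys(), new_tokens.keys()
    added = sorted(new_keys - old_keys)
    dropped = sorted(old_keys - new_keys)
    updated = sorted(
        p for p in old_keys & new_keys if old_tokens[p] != new_tokens[p]
    )
    return added, dropped, updated


# Tokens recorded when the copy was last refreshed (illustrative values).
at_last_refresh = {"2018/01/01": "v1", "2018/01/02": "v1", "2018/01/03": "v1"}
# Tokens currently on the base table.
current = {"2018/01/02": "v2", "2018/01/03": "v1", "2018/01/04": "v1"}

added, dropped, updated = diff_partitions(at_last_refresh, current)
print(added, dropped, updated)
```

Only the partitions in these three lists need work during a refresh, which avoids rebuilding the full copy.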

Recommendations
● Top Tables

Recommendations
● Column Usage as Filter predicates
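One way to gather column usage as filter predicates is to count which columns appear in `WHERE` clauses across a query log. This is a rough sketch only: it uses a regular expression rather than a SQL parser, handles just simple comparison predicates, and the function name and sample queries are illustrative assumptions, not the recommendation engine described in the talk.

```python
import re
from collections import Counter

# An identifier immediately followed by a comparison operator.
_PREDICATE = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*(?:<=|>=|<>|!=|=|<|>)")


def filter_column_usage(queries):
    """Count how often each column is used in a WHERE-clause comparison."""
    usage = Counter()
    for query in queries:
        parts = re.split(r"\bwhere\b", query, maxsplit=1, flags=re.IGNORECASE)
        if len(parts) < 2:
            continue  # no WHERE clause, nothing to count
        for column in _PREDICATE.findall(parts[1]):
            usage[column.lower()] += 1
    return usage


query_log = [
    "SELECT * FROM T1 WHERE customer_id = 26988 AND month = 'January'",
    "SELECT * FROM T1 WHERE seller_id = 121 AND month = 'January'",
    "SELECT seller_id FROM T1",
]

usage = filter_column_usage(query_log)
print(usage.most_common())
```

Columns that are filtered on often but are not partition columns are natural candidates for the sort column of a new materialized view.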

Revise
1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark, etc.)
2. Difficulties in structuring data
3. A case for Materialized Views
4. Challenges with Materialized Views
5. Solution

Status
● FastCopy is at an internal Alpha
● Will be released as a beta for customers in the next quarter
● Contribute to Open Source

Thank You Abhishek Somani, Adesh Rao May 2018
