autonomous etl with materialized views
play

Autonomous ETL With Materialized Views Abhishek Somani, Adesh Rao - PowerPoint PPT Presentation

Autonomous ETL With Materialized Views Abhishek Somani, Adesh Rao May 2018 Agenda 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views


  1. Autonomous ETL With Materialized Views Abhishek Somani, Adesh Rao May 2018

  2. Agenda 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution 2

  3. Agenda 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution 3

  4. Data structuring for SQL-on-Hadoop ● Partitioning 4

  5. Data organization for SQL-on-Hadoop ● Columnar File Formats Parquet ORC 5

  6. Data organization for SQL-on-Hadoop ● Sorting ● Bucketing 6

  7. Data organization for SQL-on-Hadoop Speedup of Unsorted vs Sorted ORC data on TPCDS scale 1000 7

  8. Agenda 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution 8

  9. Difficulties in Structuring Data ● Workload Aware identification ● Evolving query patterns of optimal data structure ● Data pipeline dependencies ● Flexibility of data structuring ● Seamless restructuring ● Large number of consumers ● Continuous and automatic ● Data Admin Involvement maintenance ● Downtime NO DOWNTIME! 9

  10. Agenda 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution 10

  11. Basics: Materialized View ● A materialized view is a database object that contains the results of a query. ● It is a view for which the data has been materialized . ● Materialized Views can be consumed automatically by the query engine Example: CREATE MATERIALIZED VIEW mv AS SELECT seller_id, seller_name, num_item*cost AS value FROM sales; Effect: Query rewrite SELECT seller_id, num_item*cost AS value FROM sales; ~ SELECT seller_id, value FROM mv; 11

  12. Materialized Views in Hive for Data Restructuring Interesting properties of Materialized Views in Hive: ● A copy of the data(full, partial or transformed) ● Used automatically by the engine based on cost analysis ● Can be stored as ORC, Parquet etc ● Multiple materialized views can co-exist, optimally chosen Plus: Storage is cheap Idea: Create multiple materialized views of the full data with desired structures 12

  13. Materialized Views for Data Restructuring Example: Original Table T1: Query1: SELECT * from T1 where customer_id = 26988 and month ● Partitioned on Year, Month, Day = “January”; ● Stored as Text Rewritten: SELECT * from MV1 where customer_id = 26988 and month = “January”; Materialized View MV1: ● Partitioned on Year, Month, Day ● Sorted on Customer_Id Query2: SELECT * from T1 where seller_id = 121 and month = ● Stored as ORC “January”; Materialized View MV2: Rewritten: SELECT * from MV2 where seller_id = 121 and month = ● Partitioned on Year, Month, Day “January”; ● Sorted on Seller_Id ● Stored as ORC 13

  14. Materialized Views in SQL-on-Hadoop engines ● Basic implementation available in Apache Hive 2.3.0 ○ Uses Apache Calcite for query optimization and query rewrite ○ Multi file format support. Uses ORC (by default) for optimized columnar storage of materialized queries ● Not available in Presto ● Not available in Spark 14

  15. Agenda 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution 15

  16. Challenges with Materialized Views ● Invalidation ○ Only a subset of use cases can work with stale data ● Rebuilds and Refreshes ○ Prohibitively expensive for full data copies ● Maintenance Isolation ○ Ongoing queries get affected 16

  17. Agenda 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution 17

  18. FastCopy: A framework for Autonomous Materialized Views ● Materialized Views for Sorting, Partitioning and Bucketing for structuring data ● Synchronous invalidation on table updates ● Asynchronous automatic refreshes ● Maintenance isolation by refreshes in their own scheduler queues, or even their own cluster ● Recommendation Engine to suggest Materialized Views ● Cross engine support for using Materialized Views 18

  19. Qubole FastCopy Infrastructure 19

  20. Qubole FastCopy Infrastructure FastCopy Creation 20

  21. Qubole FastCopy Infrastructure FastCopy Creation 21

  22. Qubole FastCopy Infrastructure FastCopy Creation 22

  23. Qubole FastCopy Infrastructure FastCopy Creation FastCopy Creation 23

  24. Qubole FastCopy Infrastructure Incoming query for rewrite 24

  25. Qubole FastCopy Infrastructure Query Rewrite 25

  26. Qubole FastCopy Infrastructure Query Rewrite 26

  27. Qubole FastCopy Infrastructure Query Rewrite 27

  28. Qubole FastCopy Infrastructure Invalidation and Refresh 28

  29. Qubole FastCopy Infrastructure Invalidation and Refresh 29

  30. Qubole FastCopy Infrastructure Invalidation and Refresh 30

  31. Qubole FastCopy Infrastructure Invalidation and Refresh 31

  32. Qubole FastCopy Infrastructure Invalidation and Refresh 32

  33. Qubole FastCopy Infrastructure Invalidation and Refresh 33

  34. Qubole FastCopy Infrastructure Invalidation and Refresh 34

  35. Fun Details ● Auto detect added, dropped or updated partitions using partition level tokens ● Multi Version Concurrency Control for FastCopy ● Minion clusters for workload isolation 35

  36. Recommendations ● Top Tables 36

  37. Recommendations ● Top Tables 37

  38. Recommendations ● Column Usage as Filter predicates 38

  39. Recommendations ● Column Usage as Filter predicates 39

  40. Recommendations ● Column Usage as Filter predicates 40

  41. Recommendations ● Top Tables 41

  42. Recommendations ● Top Tables 42

  43. Recommendations ● Column Usage as Filter predicates 43

  44. Recommendations ● Column Usage as Filter predicates 44

  45. Recommendations ● Column Usage as Filter predicates 45

  46. Revise 1. Standard techniques for structuring data for SQL-on-Hadoop (Hive, Presto, Spark etc) 2. Difficulties in structuring data 3. A case for Materialized Views 4. Challenges with Materialized Views 5. Solution 46

  47. Status ● FastCopy is at an internal Alpha ● Will soon be released as a beta for customers in the next Quarter ● Contribute to Open Source Thank You 47

  48. Thank You Abhishek Somani, Adesh Rao May 2018

Recommend


More recommend