tactical data engineering
play

Tactical data engineering Julian Hyde April 1718, 2019 San - PowerPoint PPT Presentation

Tactical data engineering Julian Hyde April 1718, 2019 San Francisco @julianhyde DBMS Data pipeline & analytics DBMS tricks Evolving the data pipeline Tactical data Adaptive data engineering systems 1. DBMS query program


  1. Tactical data engineering Julian Hyde April 17–18, 2019 San Francisco

  2. @julianhyde

  3. DBMS Data pipeline & analytics DBMS tricks Evolving the data pipeline Tactical data Adaptive data engineering systems

  4. 1. DBMS

  5. query program table 1 table 2 file 1 file 2 File system vs. DBMS

  6. query program program file 1 file 2 file 1 file 2 File system vs. DBMS

  7. query program program (merge join) sorted sorted file 1 file 2 file 1 file 2 sorted file 2 Efficient join: reorganize the data and rewrite the program

  8. DBMS adds value Abstraction ● ● Declarative language ● Planning Easily reorganize data, add new algorithms ● Governance ● ● Metadata ● Security And, I propose: ● Adaptability

  9. 2. Data pipeline

  10. Cloud DB source table table table table table table source The data pipeline: Extract - Load - Transform

  11. interactive users business query SQL query Cloud DB source table table table table table table source The data pipeline: Extract - Load - Transform

  12. query program program file 1 file 2 file 1 file 2 File system vs. DBMS

  13. business query query SQL query program program Cloud DB table table table file 1 file 2 file 1 file 2 table table table File system vs. DBMS vs. analytic data system

  14. business programmers analysts users business query query SQL query program program Cloud DB table table table file 1 file 2 file 1 file 2 table table table File system vs. DBMS vs. analytic data system

  15. 3. DBMS tricks

  16. Re-organize data a 1 Raw data c 3 c 4 b 2 a 1 a 1 c 3 a 1 a 1 a 1 1 a .. b b 2 c 4 b 2 c 3 c 3 c 7 2 c .. c c 3 c 4 c 4 b 2 1 c 4 b 2 b 2 Index Sort Partition Replicate Summarize

  17. Caching a 1 Raw data a 1 Copy of c 3 c 3 data in memory c 4 c 4 b 2 b 2

  18. Apache Calcite Apache top-level project Query planning framework used in many projects and products Also works standalone: federated query engine with SQL / JDBC front end Apache community development model calcite.apache.org github.com/apache/calcite

  19. SELECT d.name, COUNT (*) AS c FROM Emps AS e Relational algebra JOIN Depts AS d USING (deptno) WHERE e.age < 40 GROUP BY d.deptno HAVING COUNT (*) > 5 Sort [c DESC] Based on set theory, plus ORDER BY c DESC operators: Project, Filter, Aggregate, Project [name, c] Union, Join, Sort Filter [c > 5] Requires: declarative language (SQL), query planner Aggregate [deptno, COUNT(*) AS c] Original goal: data independence Filter [e.age < 30] Enables: query optimization, new Join [e.deptno = d.deptno] algorithms and data structures Scan [Emps] Scan [Depts]

  20. SELECT d.name, COUNT (*) AS c FROM ( SELECT * FROM Emps Algebraic rewrite WHERE e.age > 50) AS e JOIN Depts AS d USING (deptno) GROUP BY d.deptno HAVING COUNT (*) > 5 Sort [c DESC] Optimize by applying rewrite rules that ORDER BY c DESC preserve semantics Project [name, c] Hopefully the result is less expensive; Filter [c > 5] but it’s OK if it’s not (planner keeps “before” and “after”) Aggregate [deptno, COUNT(*) AS c] Planner uses dynamic programming, Join [e.deptno = d.deptno] seeking the lowest total cost Filter [e.age > 50] Scan [Emps] Scan [Depts]

  21. Views Aggregate [deptno, MIN(salary)] SELECT deptno, MIN (salary) FROM Managers Filter [age > 50] WHERE age > 50 GROUP BY deptno Scan [Managers] CREATE VIEW Managers AS Project [id, deptno, salary, age] SELECT * FROM Emps AS e Join [e.id = underling.manager] WHERE EXISTS ( SELECT * Aggregate [manager] FROM Emps AS underling WHERE underling.manager = e.id) Scan [Emps] Scan [Emps]

  22. View query (after expansion) SELECT deptno, MIN (salary) FROM Managers WHERE age > 50 Aggregate [deptno, MIN(salary)] GROUP BY deptno Filter [age > 50] CREATE VIEW Managers AS Project [id, deptno, salary, age] SELECT * FROM Emps AS e Join [e.id = underling.manager] WHERE EXISTS ( SELECT * Aggregate [manager] FROM Emps AS underling WHERE underling.manager = e.id) Scan [Emps] Scan [Emps]

  23. Materialized view Aggregate [deptno, gender, COUNT(*), SUM(salary)] [EmpSummary] = Scan CREATE MATERIALIZED VIEW EmpSummary AS Scan [Emps] SELECT deptno, gender, COUNT (*) AS c, SUM (sal) AS s FROM Emps GROUP BY deptno, gender Aggregate [COUNT(*)] SELECT COUNT (*) AS c FROM Emps Filter [deptno = 10 AND gender = ‘M’] WHERE deptno = 10 AND gender = ‘M’ Scan [Emps]

  24. Materialized view: rewrite query to match Aggregate [deptno, gender, COUNT(*), SUM(salary)] [EmpSummary] = Scan CREATE MATERIALIZED VIEW EmpSummary AS Scan [Emps] SELECT deptno, gender, COUNT (*) AS c, SUM (sal) AS s FROM Emps GROUP BY deptno, gender Project [c] Filter [deptno = 10 AND gender = ‘M’] SELECT COUNT (*) AS c FROM Emps Aggregate [deptno, gender, COUNT(*), SUM(salary)] WHERE deptno = 10 AND gender = ‘M’ Scan [Emps]

  25. Materialized view: rewrite query to match Aggregate [deptno, gender, COUNT(*), SUM(salary)] [EmpSummary] = Scan CREATE MATERIALIZED VIEW EmpSummary AS Scan [Emps] SELECT deptno, gender, COUNT (*) AS c, SUM (sal) AS s FROM Emps GROUP BY deptno, gender Project [c] Filter [deptno = 10 AND gender = ‘M’] SELECT COUNT (*) AS c FROM Emps Aggregate [deptno, gender, COUNT(*), SUM(salary)] WHERE deptno = 10 AND gender = ‘M’ Scan [Emps]

  26. Materialized view: substitute table scan Aggregate [deptno, gender, COUNT(*), SUM(salary)] [EmpSummary] = Scan CREATE MATERIALIZED VIEW EmpSummary AS Scan [Emps] SELECT deptno, gender, COUNT (*) AS c, SUM (sal) AS s FROM Emps GROUP BY deptno, gender Project [c] Filter [deptno = 10 AND gender = ‘M’] SELECT COUNT (*) AS c FROM Emps Scan [EmpSummary] WHERE deptno = 10 AND gender = ‘M’

  27. Materialized view: substitute table scan Aggregate [deptno, gender, COUNT(*), SUM(salary)] [EmpSummary] = Scan CREATE MATERIALIZED VIEW EmpSummary AS Scan [Emps] SELECT deptno, gender, COUNT (*) AS c, SUM (sal) AS s FROM Emps GROUP BY deptno, gender Project [c] Filter [deptno = 10 AND gender = ‘M’] SELECT c FROM EmpSummary Scan [EmpSummary] WHERE deptno = 10 AND gender = ‘M’

  28. 4. Analytics

  29. view: orders { dimension: id { primary_key: yes type: number sql: ${TABLE}.id ;; } dimension: customer_id { # field: orders.customer_id sql: ${TABLE}.customer_id ;; } dimension: amount { # field: orders.amount type: number value_format: "0.00" sql: ${TABLE}.amount ;; } measure: count { # field: orders.count type: count # creates a sql COUNT(*) } measure: total_amount { type: sum sql: ${amount} ;; “orders” view in LookML } }

  30. 5. Evolving the data pipeline

  31. interactive users business query SQL query Cloud DB source table table table table table table source

  32. table table table table table table Data engineering

  33. table table In memory table table table table table table table file Data engineering is not a static problem

  34. data engineer table table In memory table table table table table table table file Who is responsible for data engineering?

  35. data data engineer analyst system scientist (runtime adaptation) table table In memory table table table table table table table file Data engineering - empower users, reduce friction

  36. view: customer_order_facts { derived_table: { sql: SELECT customer_id, MIN(DATE(time)) AS first_order_date, SUM(amount) AS lifetime_amount FROM order GROUP BY customer_id ;; } dimension: customer_id { type: number primary_key: yes sql: ${TABLE}.customer_id ;; } dimension_group: first_order { type: time timeframes: [date, week, month] sql: ${TABLE}.first_order_date ;; } dimension: lifetime_amount { type: number value_format: "0.00" sql: ${TABLE}.lifetime_amount ;; LookML - derived table (based } } on SQL)

  37. view: customer_order_facts { derived_table: { explore_source: orders { column: customer_id { field: order.customer_id } column: first_order { field: order.first_order } column: lifetime_amount { field: order.lifetime_amount } } } dimension: customer_id { type: number primary_key: yes sql: ${TABLE}.customer_id ;; } dimension_group: first_order { type: time timeframes: [date, week, month] LookML - derived table (based sql: ${TABLE}.first_order_date ;; } on an Explore)

  38. Flavors of derived table Derived table flavor Purpose SQL equivalent Ephemeral Query expansion CREATE VIEW Persistent Query is executed once, used by CREATE TABLE AS several queries until it expires SELECT Transparent Populated as persistent DT, but CREATE MATERIALIZED can be used even if the business VIEW query does not reference it by name Each flavor comes can be based on either an Explore or SQL

  39. Building materialized views Challenges: ● Design Which materializations to create? ● Populate Load them with data Maintain Incrementally populate when data changes ● Rewrite Transparently rewrite queries to use materializations ● ● Adapt Design and populate new materializations, drop unused ones ● Express Need a rich algebra, to model how data is derived Initial focus: summary tables (materialized views over star schemas)

Recommend


More recommend