Tactical data engineering Julian Hyde April 17–18, 2019 San Francisco
@julianhyde
DBMS Data pipeline & analytics DBMS tricks Evolving the data pipeline Tactical data Adaptive data engineering systems
1. DBMS
query program table 1 table 2 file 1 file 2 File system vs. DBMS
query program program file 1 file 2 file 1 file 2 File system vs. DBMS
query program program (merge join) sorted sorted file 1 file 2 file 1 file 2 sorted file 2 Efficient join: reorganize the data and rewrite the program
DBMS adds value Abstraction ● ● Declarative language ● Planning Easily reorganize data, add new algorithms ● Governance ● ● Metadata ● Security And, I propose: ● Adaptability
2. Data pipeline
Cloud DB source table table table table table table source The data pipeline: Extract - Load - Transform
interactive users business query SQL query Cloud DB source table table table table table table source The data pipeline: Extract - Load - Transform
query program program file 1 file 2 file 1 file 2 File system vs. DBMS
business query query SQL query program program Cloud DB table table table file 1 file 2 file 1 file 2 table table table File system vs. DBMS vs. analytic data system
business programmers analysts users business query query SQL query program program Cloud DB table table table file 1 file 2 file 1 file 2 table table table File system vs. DBMS vs. analytic data system
3. DBMS tricks
Re-organize data a 1 Raw data c 3 c 4 b 2 a 1 a 1 c 3 a 1 a 1 a 1 1 a .. b b 2 c 4 b 2 c 3 c 3 c 7 2 c .. c c 3 c 4 c 4 b 2 1 c 4 b 2 b 2 Index Sort Partition Replicate Summarize
Caching a 1 Raw data a 1 Copy of c 3 c 3 data in memory c 4 c 4 b 2 b 2
Apache Calcite Apache top-level project Query planning framework used in many projects and products Also works standalone: federated query engine with SQL / JDBC front end Apache community development model calcite.apache.org github.com/apache/calcite
SELECT d.name, COUNT (*) AS c FROM Emps AS e Relational algebra JOIN Depts AS d USING (deptno) WHERE e.age < 40 GROUP BY d.deptno HAVING COUNT (*) > 5 Sort [c DESC] Based on set theory, plus ORDER BY c DESC operators: Project, Filter, Aggregate, Project [name, c] Union, Join, Sort Filter [c > 5] Requires: declarative language (SQL), query planner Aggregate [deptno, COUNT(*) AS c] Original goal: data independence Filter [e.age < 30] Enables: query optimization, new Join [e.deptno = d.deptno] algorithms and data structures Scan [Emps] Scan [Depts]
SELECT d.name, COUNT (*) AS c FROM ( SELECT * FROM Emps Algebraic rewrite WHERE e.age > 50) AS e JOIN Depts AS d USING (deptno) GROUP BY d.deptno HAVING COUNT (*) > 5 Sort [c DESC] Optimize by applying rewrite rules that ORDER BY c DESC preserve semantics Project [name, c] Hopefully the result is less expensive; Filter [c > 5] but it’s OK if it’s not (planner keeps “before” and “after”) Aggregate [deptno, COUNT(*) AS c] Planner uses dynamic programming, Join [e.deptno = d.deptno] seeking the lowest total cost Filter [e.age > 50] Scan [Emps] Scan [Depts]
Views Aggregate [deptno, MIN(salary)] SELECT deptno, MIN (salary) FROM Managers Filter [age > 50] WHERE age > 50 GROUP BY deptno Scan [Managers] CREATE VIEW Managers AS Project [id, deptno, salary, age] SELECT * FROM Emps AS e Join [e.id = underling.manager] WHERE EXISTS ( SELECT * Aggregate [manager] FROM Emps AS underling WHERE underling.manager = e.id) Scan [Emps] Scan [Emps]
View query (after expansion) SELECT deptno, MIN (salary) FROM Managers WHERE age > 50 Aggregate [deptno, MIN(salary)] GROUP BY deptno Filter [age > 50] CREATE VIEW Managers AS Project [id, deptno, salary, age] SELECT * FROM Emps AS e Join [e.id = underling.manager] WHERE EXISTS ( SELECT * Aggregate [manager] FROM Emps AS underling WHERE underling.manager = e.id) Scan [Emps] Scan [Emps]
Materialized view Aggregate [deptno, gender, COUNT(*), SUM(salary)] [EmpSummary] = Scan CREATE MATERIALIZED VIEW EmpSummary AS Scan [Emps] SELECT deptno, gender, COUNT (*) AS c, SUM (sal) AS s FROM Emps GROUP BY deptno, gender Aggregate [COUNT(*)] SELECT COUNT (*) AS c FROM Emps Filter [deptno = 10 AND gender = ‘M’] WHERE deptno = 10 AND gender = ‘M’ Scan [Emps]
Materialized view: rewrite query to match Aggregate [deptno, gender, COUNT(*), SUM(salary)] [EmpSummary] = Scan CREATE MATERIALIZED VIEW EmpSummary AS Scan [Emps] SELECT deptno, gender, COUNT (*) AS c, SUM (sal) AS s FROM Emps GROUP BY deptno, gender Project [c] Filter [deptno = 10 AND gender = ‘M’] SELECT COUNT (*) AS c FROM Emps Aggregate [deptno, gender, COUNT(*), SUM(salary)] WHERE deptno = 10 AND gender = ‘M’ Scan [Emps]
Materialized view: rewrite query to match Aggregate [deptno, gender, COUNT(*), SUM(salary)] [EmpSummary] = Scan CREATE MATERIALIZED VIEW EmpSummary AS Scan [Emps] SELECT deptno, gender, COUNT (*) AS c, SUM (sal) AS s FROM Emps GROUP BY deptno, gender Project [c] Filter [deptno = 10 AND gender = ‘M’] SELECT COUNT (*) AS c FROM Emps Aggregate [deptno, gender, COUNT(*), SUM(salary)] WHERE deptno = 10 AND gender = ‘M’ Scan [Emps]
Materialized view: substitute table scan Aggregate [deptno, gender, COUNT(*), SUM(salary)] [EmpSummary] = Scan CREATE MATERIALIZED VIEW EmpSummary AS Scan [Emps] SELECT deptno, gender, COUNT (*) AS c, SUM (sal) AS s FROM Emps GROUP BY deptno, gender Project [c] Filter [deptno = 10 AND gender = ‘M’] SELECT COUNT (*) AS c FROM Emps Scan [EmpSummary] WHERE deptno = 10 AND gender = ‘M’
Materialized view: substitute table scan Aggregate [deptno, gender, COUNT(*), SUM(salary)] [EmpSummary] = Scan CREATE MATERIALIZED VIEW EmpSummary AS Scan [Emps] SELECT deptno, gender, COUNT (*) AS c, SUM (sal) AS s FROM Emps GROUP BY deptno, gender Project [c] Filter [deptno = 10 AND gender = ‘M’] SELECT c FROM EmpSummary Scan [EmpSummary] WHERE deptno = 10 AND gender = ‘M’
4. Analytics
view: orders { dimension: id { primary_key: yes type: number sql: ${TABLE}.id ;; } dimension: customer_id { # field: orders.customer_id sql: ${TABLE}.customer_id ;; } dimension: amount { # field: orders.amount type: number value_format: "0.00" sql: ${TABLE}.amount ;; } measure: count { # field: orders.count type: count # creates a sql COUNT(*) } measure: total_amount { type: sum sql: ${amount} ;; “orders” view in LookML } }
5. Evolving the data pipeline
interactive users business query SQL query Cloud DB source table table table table table table source
table table table table table table Data engineering
table table In memory table table table table table table table file Data engineering is not a static problem
data engineer table table In memory table table table table table table table file Who is responsible for data engineering?
data data engineer analyst system scientist (runtime adaptation) table table In memory table table table table table table table file Data engineering - empower users, reduce friction
view: customer_order_facts { derived_table: { sql: SELECT customer_id, MIN(DATE(time)) AS first_order_date, SUM(amount) AS lifetime_amount FROM order GROUP BY customer_id ;; } dimension: customer_id { type: number primary_key: yes sql: ${TABLE}.customer_id ;; } dimension_group: first_order { type: time timeframes: [date, week, month] sql: ${TABLE}.first_order_date ;; } dimension: lifetime_amount { type: number value_format: "0.00" sql: ${TABLE}.lifetime_amount ;; LookML - derived table (based } } on SQL)
view: customer_order_facts { derived_table: { explore_source: orders { column: customer_id { field: order.customer_id } column: first_order { field: order.first_order } column: lifetime_amount { field: order.lifetime_amount } } } dimension: customer_id { type: number primary_key: yes sql: ${TABLE}.customer_id ;; } dimension_group: first_order { type: time timeframes: [date, week, month] LookML - derived table (based sql: ${TABLE}.first_order_date ;; } on an Explore)
Flavors of derived table Derived table flavor Purpose SQL equivalent Ephemeral Query expansion CREATE VIEW Persistent Query is executed once, used by CREATE TABLE AS several queries until it expires SELECT Transparent Populated as persistent DT, but CREATE MATERIALIZED can be used even if the business VIEW query does not reference it by name Each flavor comes can be based on either an Explore or SQL
Building materialized views Challenges: ● Design Which materializations to create? ● Populate Load them with data Maintain Incrementally populate when data changes ● Rewrite Transparently rewrite queries to use materializations ● ● Adapt Design and populate new materializations, drop unused ones ● Express Need a rich algebra, to model how data is derived Initial focus: summary tables (materialized views over star schemas)
Recommend
More recommend