research collection
play

Research Collection https://doi.org/10.3929/ethz-b-000294819 - PDF document

Research Collection https://doi.org/10.3929/ethz-b-000294819 information please consult the Terms of use. This page was generated automatically upon download from the ETH Zurich Research Collection. For more Creative Commons Attribution 4.0


  1. Research Collection https://doi.org/10.3929/ethz-b-000294819 information please consult the Terms of use. This page was generated automatically upon download from the ETH Zurich Research Collection. For more Creative Commons Attribution 4.0 International Rights / License: http://doi.org/10.1145/3267809.3267822 Originally published in: Permanent Link: Conference Paper 2018-10 Publication Date: Marroquín, Renato; Müller, Ingo; Makreshanski, Darko; Alonso, Gustavo Author(s): Shared Query Execution Pay One, Get Hundreds for Free: Reducing Cloud Costs through ETH Library

  2. Pay One, Get Hundreds for Free: Reducing Cloud Costs through Shared Query Execution Renato Marroquín , Ingo Müller, Darko Makreshanski, Gustavo Alonso | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 1

  3. Motivation: Query-as-a-service  Pros  Disadvantages  No need to move data outside  Expensive with frequent usage the cloud  No intuitive way to optimize  No infrastructure deployment  Cost  No database maintenance  Throughput  Pricing model  Pay-per-byte-processed  Popular Systems | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 2

  4. Motivation: Use cases Parameter exploration BI reporting Ad-hoc analytics  Observations:  High overlap among concurrent queries  Query burst from single user  Similar workload from multiple users Work sharing: carry out redundant work only once. | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 3

  5. Our Goal Reduce query-as-a-service costs using work sharing techniques through query rewriting | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 4

  6. Outline  Related work  Multi-query execution  Evaluation  Summary  Conclusions | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 5

  7. Multi Query Execution: General sharing workflow Our Focus Collect Co-plan Separate Co-execute | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 6

  8. Shared execution techniques 2010 – 2017 1988 – 2005 2005 – 2010 Scan Work MQO Sharing Sharing 1 st generation: Crescando 2 nd generation: SharedDB, MQJoin, BatchDB Name Age Sex query_id  Annotate tuples with Renato 30 m Q1,Q3 query_id attribute Darko 30 m Q1,Q2,Q3  Supports all relational Ingo 32 m Q1 operators | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 7

  9. Shared execution techniques 2010 – 2017 1988 – 2005 2005 – 2010 Scan Work MQO Sharing Sharing  Previous approaches need dedicated engine  High implementation effort  Needs vendor support This work: rewrite shared query plans as SQL. | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 8

  10. Outline  Related work  Multi-query execution | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 9

  11. Multi Query Execution: Data-query model Name Age Sex SELECT * FROM Emp WHERE … Renato 30 m Darko 30 m Q1: … age < 40 Q2: … name LIKE ‘%k%’ Ingo 32 m Q3: … age <= 30 AND age > 20 Gustavo 50 m Name Age Sex query_id Renato 30 m { 1, 3 } Darko 30 m { 1, 2, 3 } Ingo 32 m { 1 } Gustavo 50 m { } | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 10

  12. Multi Query Execution: The query_id attribute Name: Age: Sex: query_id: VARCHAR INT ENUM INTEGER ARRAY Renato 30 m { 1, 3 } Darko 30 m { 1, 2, 3 } Ingo 32 m { 1 }  Result of Qx is given by SELECT * FROM R WHERE x = ANY(query_id)  This is standard SQL! | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 11

  13. Shared operators: Shared scan Name Age Sex SELECT * FROM Emp WHERE … Renato 30 m Q1: … age < 40 Darko 30 m Q2: … name LIKE ‘%k%’ Q3: … age <= 30 AND age > 20 Ingo 32 m Gustavo 50 m SELECT *, ARRAY_REMOVE( ARRAY[ CASE WHEN age < 40 THEN 1 ELSE 0 END, CASE WHEN name LIKE ‘%k%’ THEN 2 ELSE 0 END, CASE WHEN age <= 30 AND age > 20 THEN 3 ELSE 0 END ] , 0) AS query_id FROM Emp WHERE (age < 40) OR (name LIKE ‘%k%’) OR (age <= 30 AND age > 20) | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 12

  14. Shared operators: Shared join Departments Employees Name Age Sex D_ID query_id D_ID D_Name query_id Renato 30 m 1 { 1, 3 } 1 Systems { 2, 3 } Darko 30 m 1 { 1, 2, 3 } 2 Algorithms { 2 } Ingo 32 m 1 { 1 } 3 ML { 1, 2, 3 } SELECT * FROM Emp e JOIN Dep d ON e.did = d.did WHERE … Name Age Sex D_ID D_Name query_id Renato 30 m 1 Systems { 3 } Darko 30 m 1 Systems { 2, 3 } | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 13

  15. Shared operators: Shared join Departments Employees Name Age Sex D_ID query_id D_ID D_Name query_id Renato 30 m 1 { 1, 3 } 1 Systems { 2, 3 } Darko 30 m 1 { 1, 2, 3 } 2 Algorithms { 2 } Ingo 32 m 1 { 1 } 3 ML { 1, 2, 3 } WITH shared_emp AS (…), -- shared scan on emp shared_dep AS (…), -- shared scan on dep shared_join_helper AS ( SELECT R.name, R.age, R.sex, S.d_id, S.d_name, ARRAY_INTERSECT(R.query_id, S.query_id) AS query_id FROM shared_emp e JOIN shared_dep d ON e.d_id = d.d_id) SELECT * FROM shared_join_helper WHERE CARDINALITY(query_id) > 0 | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 14

  16. Shared operators: Other shared operators  GROUP BY  Use UNNEST to duplicate tuples for each query  Then group by query_id  LIMIT / TOP K  PARTITION BY query_id  Then filter by RANK() Enough to express all TPC-H queries! | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 15

  17. Outline  Related work  Multi-query execution  Evaluation  Summary  Conclusions | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 16

  18. Evaluation: Query-as-a-Service systems  Systems-under-test  Amazon Athena  Google BigQuery  Run SQL queries against files in cloud storage  Apache Parquet  Google internal columnar storage  Micro-benchmarks  End-to-end query execution | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 17

  19. Evaluation: Shared scan  Data  Lineitem from TPC-H Scale Factor 100  Workload  Parameterized queries  Queries in a group/batch have equal selectivity | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 18

  20. Evaluation: Shared scan Queries Selectivity: Google BigQuery Execution time increases sublinearly with query count. 84GB table, native columnar format | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 19

  21. Evaluation: Shared scan Queries Selectivity: Amazon Athena 27GB, Parquet format  Pay per processed byte Monetary cost (almost) independent of batch size. Google Big Query: constant 0.011$ | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 20

  22. Evaluation: TPC-H  Data  TPC-H Scale Factor 100  Simulate multiple apps interacting with QaaS  Each emitting different TPC-H queries  Workload  128 instantiations of each query type | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 21

  23. Evaluation: TPC-H Throughput improvement over Query-at-a-Time (batch size = 128) Athena BigQuery 100 Throughput Improvement 10 1 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 TPC-H Query Considerable speedup: up to ~128x. | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 22

  24. Evaluation: TPC-H Cost compared to Query-at-a-Time (batch size = 128) Query-at-a-time SharedExecution 1000.00 100.00 Cost [$] 16x 10.00 128x 1.00 107x $80 $240 ~$14 <$1 $227 ~$1.7 0.10 Athena BigQuery (all) BigQuery(without Q10) Again: cost (almost) independent of batch size. | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 23

  25. Outline  Related work  Multi-query execution  Evaluation  Summary  Conclusions | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 24

  26. Summary  Shared query execution through query rewriting  Rewriting can be done with standard SQL  Evaluate our approach by executing queries end-to-end  Show improvements in execution cost and throughput Conclusion  Sharing can be implemented in:  Client  Library  Middleware | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 25

  27. Thanks! | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 26

  28. Backup Slides | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 27

  29. Evaluation: TPC-H Relative cost compared to Query-at-a-Time (batch size = 128) 100 Cost Improvement 10 16x Query-at-a-time SharedExecution 107x 1 128x $80 $240 ~$14 <$1 $227 ~$1.7 0.1 Athena BigQuery (all) BigQuery (without Q10) Again: cost (almost) independent of batch size. | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 28

  30. Shared execution techniques 2010 – 2017 1988 – 2005 2005 – 2010 Scan Work MQO Sharing Sharing row value First generation: Crescando  Index predicates to reduce work idx  Lower complexity of work  Only scans … q1 q2 q3 | | Marroquín, Müller, Makreshanski, Alonso 13.10.2018 29

  31. Shared plans Sal Emp Dep Emp Dep Emp Dep 𝜏 𝜏 sex = “m” age > 30 𝜏 ⨝ ⨝ d_id d_id Q1 Q2 ⨝ ⨝ Sal Sal Emp Emp e_id d_id 𝜏 𝜏 sex = “f” age < 50 ⨝ ⨝ e_id e_id Q3,Q4 Q1,Q2 Q3 Q4 | | Marroquín, Müller, Makreshanski, Alonso 30.05.2018 30

Recommend


More recommend