  1. Distributed OLAP Databases Lecture #24 Database Systems Andy Pavlo Computer Science 15-445/15-645 Carnegie Mellon Univ. Fall 2018

  2. UPCOMING DATABASE EVENTS Swarm64 Tech Talk → Thursday November 29th @ 12pm → GHC 8102 ← Different Location! VoltDB Research Talk → Monday December 3rd @ 4:30pm → GHC 8102 CMU 15-445/645 (Fall 2018)

  3. OLTP VS. OLAP On-line Transaction Processing (OLTP): → Short-lived read/write txns. → Small footprint. → Repetitive operations. On-line Analytical Processing (OLAP): → Long-running, read-only queries. → Complex joins. → Exploratory queries.

  4. BIFURCATED ENVIRONMENT OLTP Databases → Extract / Transform / Load → OLAP Database

  5. DECISION SUPPORT SYSTEMS Applications that serve the management, operations, and planning levels of an organization to help people make decisions about future issues and problems by analyzing historical data. Star Schema vs. Snowflake Schema

  6. STAR SCHEMA
     SALES_FACT: PRODUCT_FK, TIME_FK, LOCATION_FK, CUSTOMER_FK, PRICE, QUANTITY
     PRODUCT_DIM: CATEGORY_NAME, CATEGORY_DESC, PRODUCT_CODE, PRODUCT_NAME, PRODUCT_DESC
     CUSTOMER_DIM: ID, FIRST_NAME, LAST_NAME, EMAIL, ZIP_CODE
     LOCATION_DIM: COUNTRY, STATE_CODE, STATE_NAME, ZIP_CODE, CITY
     TIME_DIM: YEAR, DAY_OF_YEAR, MONTH_NUM, MONTH_NAME, DAY_OF_MONTH

  7. SNOWFLAKE SCHEMA
     SALES_FACT: PRODUCT_FK, TIME_FK, LOCATION_FK, CUSTOMER_FK, PRICE, QUANTITY
     PRODUCT_DIM: CATEGORY_FK, PRODUCT_CODE, PRODUCT_NAME, PRODUCT_DESC
     CAT_LOOKUP: CATEGORY_ID, CATEGORY_NAME, CATEGORY_DESC
     CUSTOMER_DIM: ID, FIRST_NAME, LAST_NAME, EMAIL, ZIP_CODE
     LOCATION_DIM: COUNTRY, STATE_FK, ZIP_CODE, CITY
     STATE_LOOKUP: STATE_ID, STATE_CODE, STATE_NAME
     TIME_DIM: YEAR, DAY_OF_YEAR, MONTH_FK, DAY_OF_MONTH
     MONTH_LOOKUP: MONTH_NUM, MONTH_NAME, MONTH_SEASON

  8. STAR VS. SNOWFLAKE SCHEMA Issue #1: Normalization → Snowflake schemas take up less storage space. → Denormalized data models may incur integrity and consistency violations. Issue #2: Query Complexity → Snowflake schemas require more joins to get the data needed for a query. → Queries on star schemas will (usually) be faster.
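The normalization trade-off can be seen in miniature with an in-memory SQLite sketch. The tables and data below are a cut-down, hypothetical version of the schemas above: the star variant stores the category name inline in PRODUCT_DIM, while the snowflake variant normalizes it into CAT_LOOKUP and pays one extra join for the same answer.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Star-style dimension: category name denormalized into the product table.
    CREATE TABLE product_dim_star (id INTEGER PRIMARY KEY, name TEXT, category_name TEXT);
    -- Snowflake-style: category normalized out into a lookup table.
    CREATE TABLE cat_lookup (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE product_dim_snow (id INTEGER PRIMARY KEY, name TEXT, category_fk INTEGER);
    CREATE TABLE sales_fact (product_fk INTEGER, price REAL, quantity INTEGER);
""")
db.execute("INSERT INTO cat_lookup VALUES (1, 'Beverages')")
db.execute("INSERT INTO product_dim_star VALUES (10, 'Coffee', 'Beverages')")
db.execute("INSERT INTO product_dim_snow VALUES (10, 'Coffee', 1)")
db.execute("INSERT INTO sales_fact VALUES (10, 4.50, 2)")

# Star schema: a single join from the fact table to the dimension.
star = db.execute("""
    SELECT p.category_name, SUM(s.price * s.quantity)
    FROM sales_fact AS s JOIN product_dim_star AS p ON s.product_fk = p.id
    GROUP BY p.category_name
""").fetchall()

# Snowflake schema: the same answer needs an additional join through the lookup.
snow = db.execute("""
    SELECT c.category_name, SUM(s.price * s.quantity)
    FROM sales_fact AS s
    JOIN product_dim_snow AS p ON s.product_fk = p.id
    JOIN cat_lookup AS c ON p.category_fk = c.category_id
    GROUP BY c.category_name
""").fetchall()

print(star)  # [('Beverages', 9.0)]
print(snow)  # same result, one more join
```

The extra lookup table is what saves space when many products share a category, and the extra join is what (usually) makes the star schema faster.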

  9. PROBLEM SETUP The application server issues SELECT * FROM R JOIN S ON R.id = S.id against a database whose tables are spread across partitions P1-P4, so the join must combine tuples from multiple partitions.

  11. TODAY'S AGENDA Execution Models, Query Planning, Distributed Join Algorithms, Cloud Systems

  12. PUSH VS. PULL Approach #1: Push Query to Data → Send the query (or a portion of it) to the node that contains the data. → Perform as much filtering and processing as possible where the data resides before transmitting over the network. Approach #2: Pull Data to Query → Bring the data to the node that is executing the query that needs it for processing.
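A toy model of the two approaches, assuming a "node" is just a Python list of rows and the "network" is whatever a function hands back. The names and data are made up; the point is how many rows cross the boundary, not the mechanics.

```python
# A hypothetical remote node holding table R, ids 1-100.
node_r = [{"id": i, "val": i * 10} for i in range(1, 101)]

def push_query(predicate):
    # Approach #1: ship the predicate to the node; only matches come back.
    return [row for row in node_r if predicate(row)]

def pull_data():
    # Approach #2: ship every page to the querying node and filter there.
    return list(node_r)

pushed = push_query(lambda r: r["id"] <= 5)
pulled = [row for row in pull_data() if row["id"] <= 5]

assert pushed == pulled        # same answer either way
print(len(pushed))             # 5 rows crossed the "network"
print(len(pull_data()))        # 100 rows crossed the "network"
```

Either approach yields the same result set; the difference is that pushing moves 5 rows here while pulling moves all 100.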

  13. PUSH QUERY TO DATA For SELECT * FROM R JOIN S ON R.id = S.id, the application server sends the query to the node holding P1 (ids 1-100); the node holding P2 (ids 101-200) computes R ⨝ S for ids [101,200] where that data resides, and the combined R ⨝ S result is returned to the application server.

  14. PULL DATA TO QUERY For SELECT * FROM R JOIN S ON R.id = S.id, the node holding P1 (ids 1-100) pulls the pages it needs (Page ABC, Page XYZ) from the storage of the node holding P2 (ids 101-200), computes R ⨝ S for ids [101,200] itself, and returns the result to the application server.

  17. FAULT TOLERANCE Traditional distributed OLAP DBMSs were designed with the assumption that nodes do not fail during query execution. → If a node fails during query execution, then the whole query fails. The DBMS could take a snapshot of the intermediate results for a query during execution to allow it to recover after a crash.
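One way such a snapshot could work, sketched with an entirely made-up on-disk format: the checkpoint file name, the JSON layout, and the per-partition operator below are illustrative, not any particular system's design. Work finished before a crash is reloaded instead of recomputed.

```python
import json
import os
import tempfile

# Hypothetical snapshot file for one running query.
ckpt = os.path.join(tempfile.mkdtemp(), "query42.ckpt")

def process_partition(part):
    return sum(part)  # stand-in for an expensive per-partition operator

def run_query(partitions):
    # Load whatever was finished before a crash, if a snapshot exists.
    done = {}
    if os.path.exists(ckpt):
        with open(ckpt) as f:
            done = json.load(f)
    for i, part in enumerate(partitions):
        if str(i) in done:
            continue                      # skip work the snapshot already covers
        done[str(i)] = process_partition(part)
        with open(ckpt, "w") as f:        # snapshot intermediate results
            json.dump(done, f)
    return sum(done.values())

parts = [[1, 2], [3, 4], [5, 6]]
print(run_query(parts))   # 21
print(run_query(parts))   # 21 again; the second call reuses the snapshot
```

The trade-off the slide implies: writing snapshots costs I/O on every run, and only pays off when queries are long enough that restarting from scratch is worse.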

  18. QUERY PLANNING All of the optimizations that we talked about before are still applicable in a distributed environment: → Predicate Pushdown → Early Projections → Optimal Join Orderings But now the DBMS must also consider the location of the data at each partition when optimizing.

  19. QUERY PLAN FRAGMENTS Approach #1: Physical Operators → Generate a single query plan and then break it up into partition-specific fragments. → Most systems implement this approach. Approach #2: SQL → Rewrite the original query into partition-specific queries. → Allows for local optimization at each node. → MemSQL is the only system I know of that does this.

  20. QUERY PLAN FRAGMENTS The original query SELECT * FROM R JOIN S ON R.id = S.id is rewritten into one fragment per partition, and the output of each join is unioned together to produce the final result:
     SELECT * FROM R JOIN S ON R.id = S.id WHERE R.id BETWEEN 1 AND 100 (partition with ids 1-100)
     SELECT * FROM R JOIN S ON R.id = S.id WHERE R.id BETWEEN 101 AND 200 (partition with ids 101-200)
     SELECT * FROM R JOIN S ON R.id = S.id WHERE R.id BETWEEN 201 AND 300 (partition with ids 201-300)
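The rewrite-and-union idea can be sketched with SQLite, using three in-memory databases as stand-in partitions. The tables, rows, and id ranges are hypothetical; the coordinator's "union" here is simply concatenating the fragment outputs.

```python
import sqlite3

# Three in-memory databases stand in for the three partitions.
ranges = [(1, 100), (101, 200), (201, 300)]
partitions = []
for lo, hi in ranges:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE r (id INTEGER, a TEXT)")
    db.execute("CREATE TABLE s (id INTEGER, b TEXT)")
    for i in (lo, hi):  # two matching rows per partition, at the range edges
        db.execute("INSERT INTO r VALUES (?, ?)", (i, f"r{i}"))
        db.execute("INSERT INTO s VALUES (?, ?)", (i, f"s{i}"))
    partitions.append((db, lo, hi))

# The partition-specific rewrite of SELECT * FROM R JOIN S ON R.id = S.id.
fragment = """
    SELECT r.id, r.a, s.b FROM r JOIN s ON r.id = s.id
    WHERE r.id BETWEEN ? AND ?
"""

result = []
for db, lo, hi in partitions:
    result.extend(db.execute(fragment, (lo, hi)).fetchall())  # local join
result.sort()  # coordinator unions the fragment outputs

print(len(result))  # 6 joined rows across the three partitions
```

Because each fragment is ordinary SQL, each "node" is free to optimize it locally, which is the advantage the SQL-rewrite approach claims.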

  22. OBSERVATION The efficiency of a distributed join depends on the target tables' partitioning schemes. One approach is to put entire tables on a single node and then perform the join. → You lose the parallelism of a distributed DBMS. → Costly data transfer over the network.

  23. DISTRIBUTED JOIN ALGORITHMS To join tables R and S, the DBMS needs to get the proper tuples onto the same node. Once they are there, it executes the same join algorithms that we discussed earlier in the semester.
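For reference, this is the kind of single-node building block each node runs once the tuples are co-located. A minimal hash-join sketch (one of the join algorithms from earlier in the semester), with made-up tuples keyed on their first field:

```python
def hash_join(r, s, key=lambda t: t[0]):
    # Build phase: hash table over R on the join key.
    table = {}
    for tup in r:
        table.setdefault(key(tup), []).append(tup)
    # Probe phase: stream S and emit concatenated matches.
    out = []
    for tup in s:
        for match in table.get(key(tup), []):
            out.append(match + tup)
    return out

r = [(1, "a"), (2, "b")]
s = [(2, "x"), (3, "y")]
print(hash_join(r, s))  # [(2, 'b', 2, 'x')]
```

The four scenarios that follow differ only in how tuples get moved so that a local join like this one sees all of its matches.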

  24. SCENARIO #1 One table is replicated at every node: R is partitioned on id (P1: ids 1-100, P2: ids 101-200) and S is replicated on both nodes. For SELECT * FROM R JOIN S ON R.id = S.id, each node joins its local data (P1: R ⨝ S, P2: R ⨝ S) and then sends its results to a coordinating node.
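A sketch of this scenario with hypothetical data: every node holds the full copy of S, R is range-partitioned by id, the joins are purely local, and a coordinator concatenates the per-node outputs.

```python
# Every node has the full (replicated) S; R is partitioned by id.
s_replica = [(5, "x"), (150, "y")]
r_partitions = [
    [(5, "a"), (60, "b")],   # node 1: R tuples with ids 1-100
    [(150, "c")],            # node 2: R tuples with ids 101-200
]

def local_join(r_part, s_full):
    # Ordinary single-node hash join on the first field (the id).
    index = {}
    for tup in s_full:
        index.setdefault(tup[0], []).append(tup)
    return [r + s for r in r_part for s in index.get(r[0], [])]

coordinator = []
for part in r_partitions:               # conceptually runs in parallel
    coordinator.extend(local_join(part, s_replica))

print(coordinator)  # [(5, 'a', 5, 'x'), (150, 'c', 150, 'y')]
```

No data moves at query time; the cost was paid up front by replicating S, which only makes sense when S is small or rarely updated.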

  26. SCENARIO #2 Both tables are partitioned on the join attribute: ids 1-100 on one node and ids 101-200 on the other, for both R and S. For SELECT * FROM R JOIN S ON R.id = S.id, each node performs the join on its local data and then sends its result to a node for coalescing.
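Sketched the same way with hypothetical tuples, co-partitioning means a node's R partition can only ever match its own S partition, so no tuples move at all.

```python
# Both R and S are partitioned on the join key, with aligned ranges.
nodes = [
    {"r": [(5, "a")],   "s": [(5, "x"), (60, "z")]},   # ids 1-100
    {"r": [(150, "c")], "s": [(150, "y")]},            # ids 101-200
]

def local_join(r, s):
    # Single-node hash join on the first field (the id).
    index = {}
    for tup in s:
        index.setdefault(tup[0], []).append(tup)
    return [a + b for a in r for b in index.get(a[0], [])]

result = []
for node in nodes:
    result.extend(local_join(node["r"], node["s"]))   # purely local joins

print(result)  # [(5, 'a', 5, 'x'), (150, 'c', 150, 'y')]
```

This is the best case for a distributed join: the only network traffic is shipping final results to the coalescing node.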

  28. SCENARIO #3 Both tables are partitioned on different keys: R on id (ids 1-100 / 101-200) and S on val (vals 1-50 / 51-100). If one of the tables is small, then the DBMS broadcasts that table to all nodes: each node receives a full copy of S, computes R ⨝ S locally (P1: R ⨝ S, P2: R ⨝ S), and sends its result to a coordinating node.
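A broadcast-join sketch under the same toy model with hypothetical tuples: the "broadcast" is just materializing all of S's partitions on every node before the local join runs.

```python
# R is partitioned on id; S is partitioned on an unrelated key (val) but small.
r_partitions = [[(5, "a"), (60, "b")], [(150, "c")]]
s_partitions = [[(5, "x")], [(150, "y")]]

# Broadcast: ship every S partition to every node (here, one shared list).
broadcast_s = [t for part in s_partitions for t in part]

def local_join(r, s):
    # Single-node hash join on the first field (the id).
    index = {}
    for tup in s:
        index.setdefault(tup[0], []).append(tup)
    return [a + b for a in r for b in index.get(a[0], [])]

result = []
for part in r_partitions:
    result.extend(local_join(part, broadcast_s))  # each node joins against full S

print(result)  # [(5, 'a', 5, 'x'), (150, 'c', 150, 'y')]
```

The network cost is |S| times the number of nodes, which is why this only wins when S is small relative to R.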

  33. SCENARIO #4 Neither table is partitioned on the join key: R is partitioned on name (A-M / N-Z) and S on val (1-50 / 51-100). For SELECT * FROM R JOIN S ON R.id = S.id, the DBMS copies the tables by reshuffling them across nodes so that matching tuples end up on the same node.
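A shuffle-join sketch: both tables are re-routed by a hash of the join key so matches co-locate, then each node joins locally. The routing function here (id modulo node count) and the tuples are deliberately simplistic stand-ins.

```python
NODES = 2
# Neither table is partitioned on id: R is on name, S is on val.
r_partitions = [[(5, "a")], [(150, "c"), (6, "d")]]
s_partitions = [[(150, "y")], [(5, "x")]]

def shuffle(partitions):
    # Re-route every tuple to the node chosen by hashing its join key.
    buckets = [[] for _ in range(NODES)]
    for part in partitions:
        for tup in part:
            buckets[tup[0] % NODES].append(tup)
    return buckets

def local_join(r, s):
    # Single-node hash join on the first field (the id).
    index = {}
    for tup in s:
        index.setdefault(tup[0], []).append(tup)
    return [a + b for a in r for b in index.get(a[0], [])]

result = []
for r_bucket, s_bucket in zip(shuffle(r_partitions), shuffle(s_partitions)):
    result.extend(local_join(r_bucket, s_bucket))   # local join after the shuffle

print(sorted(result))  # [(5, 'a', 5, 'x'), (150, 'c', 150, 'y')]
```

This is the most expensive scenario: both tables cross the network, which is why planners prefer any of the earlier three layouts when they apply.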
