

  1. Analytical Query Processing
     Marco Serafini, COMPSCI 532, Lecture 7

  2. Announcement
     • Midterm date and location confirmed
     • October 22 at 7-9pm in ILC S331

  3. MapReduce vs. DBMSs

  4. Advantages of DBMSs
     • Abstract data representation
       • Relational model
       • Data storage is delegated to the DBMS
     • Functional query language (SQL)
       • Queries specified as simple relational operators
       • Actual query execution delegated to the DBMS, including parallelism, distribution, pipelining, etc.
     • Support for indexing
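A minimal sketch of this delegation, using Python's built-in sqlite3 with a hypothetical toy table: the query states *what* to compute, and the engine decides how to scan, group, and (if indexes exist) use them.

```python
import sqlite3

# Hypothetical toy table; how it is stored and scanned is up to the DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("west", 20.0), ("east", 5.0)])

# The query is declarative: execution strategy is delegated to the engine.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 15.0), ('west', 20.0)]
```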

  5. Disadvantages of DBMSs
     • SQL is a limited interface for complex analytics
       • E.g. image analysis, creating maps
     • Need to define a schema for data a priori
     • High cost of loading data and indexing
       • Can be amortized only if the same data and schema are reused
       • Too complex for “one-shot” analytics

  6. Advantages of MapReduce
     • Support for arbitrary UDFs
     • Support for a variety of arbitrary data formats
     • Simple API
     • Scalability
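The simple API amounts to supplying two arbitrary functions. A single-machine sketch of the model (the function names `map_fn` and `reduce_fn` are illustrative, not a framework API):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical user-defined functions (UDFs): any code is allowed here.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    return (word, sum(counts))

lines = ["a b a", "b c"]

# Map phase, then shuffle (sort/group by key), then reduce phase.
pairs = sorted(kv for line in lines for kv in map_fn(line))
result = [reduce_fn(key, [v for _, v in group])
          for key, group in groupby(pairs, key=itemgetter(0))]
print(result)  # [('a', 2), ('b', 2), ('c', 1)]
```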

  7. Disadvantages of MapReduce
     • Many of the optimizations of DBMSs must be reimplemented, for example:
       • Indices
       • Query execution plans (logical + physical)
       • Column-based storage
       • Data format specifications (ProtoBuf)
       • Support for updates
     • Several efforts toward closing the gap for analytics

  8. Data Analytics

  9. In Situ Analytics
     • Data is dumped on GFS/HDFS (data lake)
     • Some of this data is relational
     • Several systems execute relational queries on HDFS data
       • SQL-like language
       • Query optimization
       • Columnar data representation
     • Can be built on top of MR/Spark (e.g. Hive, SparkSQL) or standalone (e.g. Dremel, Impala, Presto)
     • We will discuss both classes

  10. Analytical Queries
     • Long-running, complex queries
       • Often aggregates
     • Run on read-only data or snapshots of dynamic data
     • Data characteristics
       • Tuples (rows) have many possible attributes (columns)
       • A row will have only a subset of attributes set

  11. Star Schema: Facts and Dimensions
     • Popular schema for analytics/data warehousing
       • Many others exist!
     • At the center is a large fact table
     • Foreign-key references to small dimension tables
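A tiny star schema in sqlite3, with hypothetical table and column names: a fact table of sales referencing product and date dimensions, queried with a typical join-and-aggregate analytical query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical star schema: one large fact table with foreign keys
# into small dimension tables.
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "toys"), (2, "books")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?)", [(1, 2023), (2, 2024)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (1, 2, 7.0), (2, 2, 3.0)])

# Typical analytical query: join the fact table with its dimensions,
# then aggregate.
rows = conn.execute("""
    SELECT d.year, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d    ON f.date_id = d.date_id
    GROUP BY d.year, p.category
    ORDER BY d.year, p.category
""").fetchall()
print(rows)  # [(2023, 'toys', 10.0), (2024, 'books', 3.0), (2024, 'toys', 7.0)]
```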


  13. Dremel
     • In-situ analytics
     • Independent query executor (not on top of MR)
     • Uses a columnar store

  14. Data Model: Column Families
     • Also called column groups, nested columns
     • Common to many systems, e.g. Cassandra, HBase
     • (Figure: a nested record shredded into columns, with fields nested in the repeated Name and Language groups)

  15. Assembling a Row
     • Finite state machine
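A deliberately flat sketch of row assembly from a columnar layout. Dremel's actual assembly runs a finite state machine driven by repetition and definition levels to reconstruct nested, repeated fields; here the columns are plain lists, so stepping across them in lockstep suffices. All names are hypothetical.

```python
# Hypothetical table stored column-by-column rather than row-by-row.
columns = {
    "name": ["alice", "bob"],
    "age":  [30, 25],
}

def assemble_rows(columns):
    # Step across all columns in lockstep, emitting one row at a time.
    # (Dremel's FSM generalizes this step to nested, repeated fields.)
    names = list(columns)
    for values in zip(*columns.values()):
        yield dict(zip(names, values))

rows = list(assemble_rows(columns))
print(rows)  # [{'name': 'alice', 'age': 30}, {'name': 'bob', 'age': 25}]
```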

  16. Pros and Cons of the Columnar Model
     • Pros
       • Compression: columns have uniform values
       • Less data to scan on projections (which are common)
     • Cons
       • Additional CPU load to decompress columns and rebuild rows
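The compression advantage comes from columns containing long runs of identical or similar values. A sketch using run-length encoding, one of several codecs columnar stores commonly apply:

```python
from itertools import groupby

def rle_encode(column):
    # Each run of identical values collapses to a (value, count) pair.
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_decode(runs):
    # Decompression (the CPU cost paid when rebuilding rows).
    return [value for value, count in runs for _ in range(count)]

column = ["US"] * 4 + ["EU"] * 2
encoded = rle_encode(column)
print(encoded)  # [('US', 4), ('EU', 2)]
assert rle_decode(encoded) == column
```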

  17. Query Execution

  18. SparkSQL: Spark + DBMS
     • Extends Spark with
       • Simple, high-level SQL-like operators
       • Query optimization
     • No need to transfer data across systems
       • ETL, query processing, and complex analytics in one system

  19. Architecture

  20. DataFrames
     • Collection of rows with a homogeneous schema
       • Like a table in a DBMS
     • Can be manipulated like an RDD
     • DataFrame operations
       • Similar to Python pandas or R data frames
       • Evaluated lazily (query planning is postponed)
       • Can optimize across multiple queries
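A minimal sketch of lazy evaluation, assuming a toy `LazyFrame` class (all names here are hypothetical, not the SparkSQL API): transformations only record a plan, and nothing executes until an action forces it, which is what gives the optimizer a whole plan to rewrite.

```python
class LazyFrame:
    def __init__(self, rows, plan=()):
        self.rows, self.plan = rows, plan

    def filter(self, pred):
        # Record the operation; do not execute it yet.
        return LazyFrame(self.rows, self.plan + (("filter", pred),))

    def select(self, *cols):
        return LazyFrame(self.rows, self.plan + (("select", cols),))

    def collect(self):
        # An action: an optimizer could rewrite self.plan here first.
        out = self.rows
        for op, arg in self.plan:
            if op == "filter":
                out = [r for r in out if arg(r)]
            elif op == "select":
                out = [{c: r[c] for c in arg} for r in out]
        return out

df = LazyFrame([{"name": "ann", "age": 34}, {"name": "bo", "age": 19}])
adults = df.filter(lambda r: r["age"] >= 21).select("name")  # no work yet
print(adults.collect())  # [{'name': 'ann'}]
```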

  21. SparkSQL Query Execution

  22. Advantages
     • Relational structure enables query optimization
     • In-memory caching using a columnar representation
       • Better compression
     • Mix SQL-like operators and arbitrary code
       • More flexible than UDFs in DBMSs
       • Can optimize across multiple SQL operations

  23. Catalyst
     • Query optimizer of SparkSQL
     • Rule-based optimization
       • Rule: find a pattern and transform it
       • Used for both logical and physical plans
       • Can customize rules
     • Code generation
       • Directly outputs bytecode (as opposed to interpreting a plan)
       • Much more CPU efficient
     • Flexible data sources
       • Can change the physical representation of DataFrames
       • Still use the optimizer

  24. Catalyst: Rule-Based Optimization
     • Apply rules to subtrees until a fixed point is reached
     • (Figure: an execution tree rewritten by transformation rules)
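A toy version of the fixed-point loop, with expression trees as tuples. The rules shown (constant folding and `x + 0` elimination) are classic textbook rewrites used for illustration, not Catalyst's actual rule set.

```python
def apply_rules(tree):
    # Recursively try each rewrite rule on every subtree.
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    left, right = apply_rules(left), apply_rules(right)
    if op == "+" and isinstance(left, int) and isinstance(right, int):
        return left + right      # constant folding
    if op == "+" and right == 0:
        return left              # x + 0  ->  x
    return (op, left, right)

def optimize(tree):
    # Keep applying rules until no rule fires: the fixed point.
    while True:
        new = apply_rules(tree)
        if new == tree:
            return tree
        tree = new

expr = ("+", ("+", "x", 0), ("+", 1, 2))
print(optimize(expr))  # ('+', 'x', 3)
```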

  25. Catalyst: Code Generation
     • Faster than interpreting a physical plan
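A sketch of why generated code beats interpretation, under simplifying assumptions (Catalyst emits JVM bytecode; here we compile a Python lambda once and reuse it per row, instead of walking the expression tree for every row). The names and the `+`-only expression language are illustrative.

```python
def interpret(tree, row):
    # Interpreted path: walks the tree on every invocation.
    if isinstance(tree, str):
        return row[tree]
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    return interpret(left, row) + interpret(right, row)  # only '+' here

def compile_expr(tree):
    # Generated path: translate the tree to source once, compile it,
    # and reuse the compiled function for every row.
    def to_src(t):
        if isinstance(t, str):
            return f"row[{t!r}]"
        if isinstance(t, (int, float)):
            return repr(t)
        _, left, right = t
        return f"({to_src(left)} + {to_src(right)})"
    return eval(f"lambda row: {to_src(tree)}")

expr = ("+", "a", ("+", "b", 1))
row = {"a": 2, "b": 3}
fn = compile_expr(expr)
assert interpret(expr, row) == fn(row) == 6  # same result, less per-row work
```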
