Analytical Query Processing Marco Serafini COMPSCI 532 Lecture 7
Announcement • Midterm date and location confirmed • October 22 at 7-9pm in ILC S331 2 2
MapReduce vs. DBMSs 3
Advantages of DBMSs • Abstract data representation • Relational model • Data storage is delegated to the DBMS • Functional query language (SQL) • Queries specified as simple relational operators • Actual query execution delegated to the DBMS… • … including parallelism, distribution, pipelining etc. • Support for indexing 4 4
Disadvantages of DBMSs • SQL is a limited interface for complex analytics • E.g. image analysis, creating maps • Need to define a schema for data a priori • High cost of loading data and indexing • Can be amortized only if same data and schema reused • Too complex for “one shot” analytics 5 5
Advantages of MapReduce • Support for arbitrary UDFs • Support for a variety of arbitrary data formats • Simple API • Scalability 6 6
Disadvantages • Many of the optimizations of DBMS must be reimplemented, for example • Indices • Query execution plans (logical + physical) • Column-based storage • Data format specifications (ProtoBuf) • Support for updates • Several efforts towards closing the gap for analytics 7 7
Data Analytics 8
In Situ Analytics • Data dumped on GFS/HDFS (data lake) • Some of this data is relational • Several systems to execute relational queries on HDFS data • SQL-like language • Query optimization • Columnar data representation • Can build on top of MR/Spark (e.g. Hive, SparkSQL) or not (e.g. Dremel, Impala, Presto) • We will discuss both classes 9 9
Analytical Queries • Long-running, complex queries • Often aggregates • Run on read-only data or snapshots of dynamic data • Data characteristics • Tuples (rows) have many possible attributes (columns) • A row will have only a subset of attributes set 10 10
Star Schema: Facts and Dimensions • Popular schema for analytics/data warehousing • Many others exist! • At the center is a large fact table • Foreign-key references to small dimension tables 11 11
12
Dremel • In-situ analytics • Independent query executor (not on top of MR) • Uses columnar store 13 13
Data Model: Column Families • Also called column groups, nested columns • Common to many systems, e.g. Cassandra, HBase Nested in repeated Name New record Nested in repeated Language 14 14
Assembling a Row • Finite state machine 15 15
Pros and Cons of Columnar Model • Pros • Compression: columns have uniform values • Less data to scan on projections (which are common) • Cons • Additional CPU load to decompress columns and rebuild rows 16 16
Query Execution 17 17
SparkSQL: Spark + DBMS • Extend Spark with • Simple, high-level SQL-like operators • Query optimization • No need to transfer data across systems • ETL, query processing, complex analytics in one system 19 19
Architecture 20 20
DataFrames • Collection of rows with homogeneous schema • Like a table in a DBMS • Can be manipulated like an RDD • DataFrame operations • Similar to Python Pandas or R data frames • Evaluated lazily (query planning is postponed) • Can optimize across multiple queries 21 21
SparkSQL Query Execution 22 22
Advantages • Relational structure enables query optimization • In-memory caching using columnar representation • Better compression • Mix SQL-like operators and arbitrary code • More flexible than UDFs in DBMSs • Can optimize across multiple SQL operations 23 23
Catalyst • Query optimizer of SparkSQL • Rule-based optimization • Rule: find pattern and transform • Used for both logical and physical plans • Can customize rules • Code generation • Directly outputs bytecode (as opposed to interpreting a plan) • Much more CPU efficient • Flexible data sources • Can change the physical representation of DataFrames • Still use the optimizer 24 24
Catalyst: Rule-Based Optimization • Apply rules to subtree until fixed point Execution tree Transformation rules 25 25
Catalyst: Code Generation • Faster than interpreting a physical plan 26 26
Recommend
More recommend