Analytical Query Processing Marco Serafini COMPSCI 532 Lecture 7

Announcement • Midterm date and location confirmed • October 22 at 7-9pm in ILC S331 2 2

MapReduce vs. DBMSs 3

Advantages of DBMSs • Abstract data representation • Relational model • Data storage is delegated to the DBMS • Functional query language (SQL) • Queries specified as simple relational operators • Actual query execution delegated to the DBMS… • … including parallelism, distribution, pipelining etc. • Support for indexing 4 4

Disadvantages of DBMSs • SQL is a limited interface for complex analytics • E.g. image analysis, creating maps • Need to define a schema for data a priori • High cost of loading data and indexing • Can be amortized only if same data and schema reused • Too complex for “one shot” analytics 5 5

Advantages of MapReduce • Support for arbitrary UDFs • Support for a variety of arbitrary data formats • Simple API • Scalability 6 6

Disadvantages • Many of the optimizations of DBMS must be reimplemented, for example • Indices • Query execution plans (logical + physical) • Column-based storage • Data format specifications (ProtoBuf) • Support for updates • Several efforts towards closing the gap for analytics 7 7

Data Analytics 8

In Situ Analytics • Data dumped on GFS/HDFS (data lake) • Some of this data is relational • Several systems to execute relational queries on HDFS data • SQL-like language • Query optimization • Columnar data representation • Can build on top of MR/Spark (e.g. Hive, SparkSQL) or not (e.g. Dremel, Impala, Presto) • We will discuss both classes 9 9

Analytical Queries • Long-running, complex queries • Often aggregates • Run on read-only data or snapshots of dynamic data • Data characteristics • Tuples (rows) have many possible attributes (columns) • A row will have only a subset of attributes set 10 10

Star Schema: Facts and Dimensions • Popular schema for analytics/data warehousing • Many others exist! • At the center is a large fact table • Foreign-key references to small dimension tables 11 11

Dremel • In-situ analytics • Independent query executor (not on top of MR) • Uses columnar store 13 13

Data Model: Column Families • Also called column groups, nested columns • Common to many systems, e.g. Cassandra, HBase Nested in repeated Name New record Nested in repeated Language 14 14

Assembling a Row • Finite state machine 15 15

Pros and Cons of Columnar Model • Pros • Compression: columns have uniform values • Less data to scan on projections (which are common) • Cons • Additional CPU load to decompress columns and rebuild rows 16 16

Query Execution 17 17

SparkSQL: Spark + DBMS • Extend Spark with • Simple, high-level SQL-like operators • Query optimization • No need to transfer data across systems • ETL, query processing, complex analytics in one system 19 19

Architecture 20 20

DataFrames • Collection of rows with homogeneous schema • Like a table in a DBMS • Can be manipulated like an RDD • DataFrame operations • Similar to Python Pandas or R data frames • Evaluated lazily (query planning is postponed) • Can optimize across multiple queries 21 21

SparkSQL Query Execution 22 22

Advantages • Relational structure enables query optimization • In-memory caching using columnar representation • Better compression • Mix SQL-like operators and arbitrary code • More flexible than UDFs in DBMSs • Can optimize across multiple SQL operations 23 23

Catalyst • Query optimizer of SparkSQL • Rule-based optimization • Rule: find pattern and transform • Used for both logical and physical plans • Can customize rules • Code generation • Directly outputs bytecode (as opposed to interpreting a plan) • Much more CPU efficient • Flexible data sources • Can change the physical representation of DataFrames • Still use the optimizer 24 24

Catalyst: Rule-Based Optimization • Apply rules to subtree until fixed point Execution tree Transformation rules 25 25

Catalyst: Code Generation • Faster than interpreting a physical plan 26 26

Analytical Query Processing Marco Serafini COMPSCI 532 Lecture 7 - PowerPoint PPT Presentation

Analytical Query Processing Marco Serafini COMPSCI 532 Lecture 7 Announcement Midterm date and location confirmed October 22 at 7-9pm in ILC S331 2 2 MapReduce vs. DBMSs 3 Advantages of DBMSs Abstract data representation

Improve Query Performance with the Query Log Analyzer Kees Vegter Field Engineer Query Log

CS4224/CS5424 Lecture 9 Distributed Query Processing Query Processing Translates query into a

Query Processing Relevance feedback; query expansion; Web Search 1 Overview Indexes Query

Query Execution 2 and Query Optimization Instructor: Matei Zaharia cs245.stanford.edu Query

Chapter 3: Top-k Query Processing and Indexing 3.1 Top-k Algorithms 3.2 Approximate Top-k Query

Query Understanding: A Manifesto Daniel Tunkelang queryunderstanding.com Overview What is

Perfect Query FORMULA 5 critical sections in every successful query letter (c) 2019

Query Op)miza)on 1 Query op)miza)on Given an SQL query,

Online Query Processing Exposure to online query processing algorithms and fundamentals A

V.3 Query Processing 1. Term-at-a-Time 2. Document-at-a-Time 3. WAND 4. Quit & Continue 5.

Efficient query processing Efficient scoring, distributed query processing Web Search 1 Ranking

Query Processing Query Processing Steps balance < 2500 ( balance ( account)) balance

BTEC: Analytical Services and Capabilities Nathaniel Hentz, Assistant Director Analytical What is

P1 Holistic Assessment for Mathematics 2013 Curricula Goal Curricula Goal Analytical

A Generic Mapping-based Query Translation A Generic Mapping-based Query Translation from SPARQL

Information Retrieval > Query Us User er Query Words Query Words Search Personalization

Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data

Block-Quantized Kernel Matrix for Fast Spectral Embedding Kai Zhang James T. Kwok Department of

T-61.3050 Machine Learning: Basic Principles Clustering Kai Puolam aki Laboratory of Computer

Stiefel Manifolds and their Applications Pierre-Antoine Absil (UCLouvain) CESAME seminar 22

Scalable Data Processing at Network transfer rates with nCorium Compute in Memory Modules Suresh

AIMS CDT - Signal Processing Michaelmas Term 2020 Xiaowen Dong Department of Engineering Science

Processing Data from Files n So far: n Inputs : n from user n

Imagine: Media Processing with Streams Brucek Khailany et al. and a little bit of Evaluating