CS 744: Big Data Systems
Shivaram Venkataraman
Fall 2018

ADMINISTRIVIA
- Assignment 2 grades: Tonight
- Midterm review session on Nov 2 at 5pm at 1221 CS
- Course Project Proposal feedback

EFFICIENT SQL ON MODERN HARDWARE

MOTIVATION
Query Model
- Need to handle diverse queries
- Real-time streaming, temporal queries on logs, progressive queries etc.
Language Integration
- Support for High-level Language (HLL)
- SQL Library
Performance

APPROACH
1. Temporal Logical Data Model
2. DAG of operators (Volcano, Spark, DryadLINQ etc.)
3. Performance
   i. Data batching
   ii. Columnar processing
   iii. Code Generation
   iv. Efficient Aggregation

ARCHITECTURE

DATA MODEL, QUERY
- LINQ-style queries
- Similar to SparkSQL, DryadLINQ
- Includes timestamp by default
- Support for windowing, aggregation

    var str = Network.ToStream(e => e.ClickTime, Latency(10secs));
    var query = str.Where(e => e.UserId % 100 < 5)
                   .Select(e => new { e.AdId })
                   .GroupApply(e => e.AdId,
                               s => s.Window(5min).Aggregate(w => w.Count()));

DATA BATCHING
Why is batching important?
- Vectorized operations, better throughput
Implementing batching
- Group a set of events together, each having a sync time
- Adaptively choose batch size
- Insert punctuation to ensure the batch gets flushed
Example: punctuation every 5 min, batch contains 500 tuples
- Throughput is 1000 tuples/sec → 600 batches per punctuation
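The example's arithmetic, and the way a punctuation forces a flush, can be sketched as follows. This is a Python illustration rather than the C# of the slides; `Batcher`, `on_event`, `on_punctuation`, and `batches_per_punctuation` are hypothetical names, and a fixed batch size stands in for the adaptive sizing mentioned on the slide.

```python
def batches_per_punctuation(throughput_per_sec, punctuation_secs, batch_size):
    """Number of full batches emitted between two punctuations."""
    tuples = throughput_per_sec * punctuation_secs
    return tuples // batch_size


class Batcher:
    """Groups events into batches; a punctuation forces a flush (hypothetical sketch)."""

    def __init__(self, batch_size, emit):
        self.batch_size = batch_size
        self.emit = emit      # callback that receives each finished batch
        self.pending = []     # buffered (sync_time, payload) pairs

    def on_event(self, sync_time, payload):
        self.pending.append((sync_time, payload))
        if len(self.pending) >= self.batch_size:
            self._flush()

    def on_punctuation(self, sync_time):
        # Push buffered events downstream even if the batch is not full.
        self._flush()

    def _flush(self):
        if self.pending:
            self.emit(self.pending)
            self.pending = []


# 1000 tuples/sec for 5 minutes, 500 tuples per batch
print(batches_per_punctuation(1000, 5 * 60, 500))  # 600
```

Without the punctuation, a partially filled batch would sit in `pending` until enough events arrived, which is the latency problem the flush addresses.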

COLUMNAR PROCESSING: LAYOUT
- Separate into control, payload fields
- BitVector to indicate absence
- Each of these has columnar layout
- Payload generated from user struct

    class DataBatch {
        long[] SyncTime;
        long[] OtherTime;
        Bitvector BV;
    }

    class UserData_Gen : DataBatch {
        long[] col_ClickTime;
        long[] col_UserId;
        long[] col_AdId;
    }

COLUMNAR PROCESSING: OPERATORS
- Operators → nodes in query DAG
- Chain operators together with On()
- Tight loop from code-gen
- Further optimizations: copy-on-write, zero-copy pointer-swing

    void On(UserData_Gen batch) {
        batch.BV.MakeWritable();
        for (int i = 0; i < batch.Count; i++)
            if ((batch.BV[i] == 0) &&
                !(batch.col_UserId[i] % 100 < 5))
                batch.BV[i] = 1;
        nextOperator.On(batch);
    }
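The filter operator above can be mimicked in a few lines of Python to show how the bitvector avoids copying rows. This is a sketch under assumptions, not the generated code itself: `UserBatch`, `Where`, and `Collect` are hypothetical names, plain lists stand in for the `long[]` columns, and copy-on-write is omitted.

```python
class UserBatch:
    """Columnar batch: one list per field plus a bitvector (1 = row absent)."""

    def __init__(self, click_time, user_id, ad_id):
        self.col_click_time = click_time
        self.col_user_id = user_id
        self.col_ad_id = ad_id
        self.count = len(user_id)
        self.bv = [0] * self.count


class Where:
    """Filter operator: marks failing rows absent instead of copying survivors."""

    def __init__(self, next_operator):
        self.next_operator = next_operator

    def on(self, batch):
        user_id, bv = batch.col_user_id, batch.bv
        for i in range(batch.count):
            if bv[i] == 0 and not (user_id[i] % 100 < 5):
                bv[i] = 1
        # Hand the same batch object to the next node in the DAG.
        self.next_operator.on(batch)


class Collect:
    """Sink operator that records the AdId of every row still present."""

    def __init__(self):
        self.ad_ids = []

    def on(self, batch):
        for i in range(batch.count):
            if batch.bv[i] == 0:
                self.ad_ids.append(batch.col_ad_id[i])


sink = Collect()
Where(sink).on(UserBatch([10, 11, 12, 13], [3, 250, 104, 99], [7, 8, 9, 7]))
print(sink.ad_ids)  # [7, 9]
```

The loop touches only the `col_user_id` column and the bitvector, which is the access pattern the columnar layout is meant to make cheap.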

COLUMNAR PROCESSING: OTHER
Serialization
- Store data in column batches
- Code generation of serialization/deserialization
String Handling
- Bloated string representation in Java/C#
- Encode multiple strings into MultiString
- stringsplit, substring: operate directly on MultiString

GROUPED AGGREGATION
Temporal Data Model
- Each event belongs to a data window or interval
- Aggregates can be stateless or stateful (more in next 3 lectures)
other_time
- When other_time > sync_time, represents interval
- When other_time is infinity, start at sync_time
- When other_time < sync_time, end at sync_time
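The three `other_time` cases can be written as a small classifier. This is a Python sketch of the rules as stated on the slide; `classify` and the tuple shapes it returns are hypothetical, and `math.inf` is an assumed stand-in for whatever sentinel the real system uses for infinity.

```python
import math

INFINITY = math.inf  # assumed sentinel for "no end time yet"


def classify(sync_time, other_time):
    """Interpret a (sync_time, other_time) pair using the three cases above."""
    if other_time == INFINITY:
        return ("start", sync_time)                 # start edge at sync_time
    if other_time > sync_time:
        return ("interval", sync_time, other_time)  # finite interval
    if other_time < sync_time:
        return ("end", sync_time)                   # end edge at sync_time
    raise ValueError("other_time == sync_time is not covered by the three cases")


print(classify(10, 25))        # ('interval', 10, 25)
print(classify(10, INFINITY))  # ('start', 10)
print(classify(25, 10))        # ('end', 25)
```

The infinity check comes first because infinity also satisfies `other_time > sync_time`.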

GROUPED AGGREGATION
- API for user-defined aggregation functions
- Efficient implementation using three data structures
Example for count:

    InitialState:  () => 0L
    Accumulate:    (oldCount, timestamp, input) => oldCount + 1
    Deaccumulate:  (oldCount, timestamp, input) => oldCount - 1
    Difference:    (leftCount, rightCount) => leftCount - rightCount
    ComputeResult: count => count
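To see why the API has both an accumulate and a deaccumulate function, the count example can be driven by a simple sliding window. The Python sketch below is an assumption-laden illustration: `sliding_aggregate` and the dictionary layout are hypothetical, a single deque replaces the three data structures mentioned on the slide, and `difference` is defined but not exercised by the loop.

```python
from collections import deque

# The count aggregate from the slide, written as five plain functions.
count_aggregate = {
    "initial_state": lambda: 0,
    "accumulate": lambda old, timestamp, value: old + 1,
    "deaccumulate": lambda old, timestamp, value: old - 1,
    "difference": lambda left, right: left - right,
    "compute_result": lambda state: state,
}


def sliding_aggregate(events, window, agg):
    """Yield (timestamp, result) after each event, over the last `window` time units.

    `events` is an iterable of (timestamp, value) pairs in timestamp order.
    """
    state = agg["initial_state"]()
    live = deque()  # events currently inside the window
    for timestamp, value in events:
        # Retire events that fell out of (timestamp - window, timestamp].
        while live and live[0][0] <= timestamp - window:
            old_ts, old_value = live.popleft()
            state = agg["deaccumulate"](state, old_ts, old_value)
        state = agg["accumulate"](state, timestamp, value)
        live.append((timestamp, value))
        yield timestamp, agg["compute_result"](state)


clicks = [(1, "a"), (2, "b"), (4, "c"), (7, "d")]
print(list(sliding_aggregate(clicks, 5, count_aggregate)))
# [(1, 1), (2, 2), (4, 3), (7, 2)]
```

Each event is added once and removed once, so the running state is updated incrementally instead of being recomputed over the whole window.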

MAP-REDUCE ON MULTI-CORE

SUMMARY
Flexible SQL library to handle workload patterns
Integration with high-level language
Efficient execution through
- Batching
- Columnar processing
- Code generation
