CS 744: Big Data Systems Shivaram Venkataraman Fall 2018
ADMINISTRIVIA - Assignment 2, Midterm grades this week - Course Projects: round 2 meetings next Friday - Next Tuesday: Guest speaker for first part
WHAT WE KNOW SO FAR
CONTINUOUS OPERATOR MODEL (e.g., Naiad)
- Long-lived operators
- Mutable state
- Distributed checkpoints
- High overhead for fault recovery
- Stragglers?
[Figure: continuous operator topology; legend: driver, control message, network transfer, task]
GOALS
1. Scalability to hundreds of nodes
2. Minimal cost beyond base processing (no replication)
3. Second-scale latency
4. Second-scale recovery from faults and stragglers
DISCRETIZED STREAMS
DISCRETIZED STREAMS (DSTREAMS)
Approach
- Use short, stateless, deterministic tasks
- Store state across tasks as in-memory RDDs
- Fine-grained tasks → parallel recovery / speculation
Model
- Chunk inputs into a sequence of micro-batches
- Process each micro-batch via parallel operations (map, reduce, groupBy, etc.)
- Save intermediate state as RDDs / write output to external systems
COMPUTATION MODEL: MICRO-BATCHES
[Figure: input split into micro-batches, each processed by parallel tasks separated by a shuffle; legend: driver, control message, network transfer, task]
EXAMPLE
pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
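The readStream and runningReduce calls above are the paper's pseudocode. Below is a minimal sketch of the same pipeline in the Spark Streaming API, assuming a hypothetical socket source on localhost:9999 that emits one page-view URL per line; the running count per URL is kept across micro-batches with updateStateByKey.

```scala
// Minimal Spark Streaming sketch of the page-view example (assumptions:
// input is a socket emitting one URL per line; checkpoint path is local).
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PageViewCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PageViewCounts")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches
    ssc.checkpoint("/tmp/checkpoints")               // required for stateful ops

    // Hypothetical input source: each line is a page-view URL.
    val pageViews = ssc.socketTextStream("localhost", 9999)
    val ones = pageViews.map(url => (url, 1))

    // Running count per URL, carried across micro-batches as RDD state.
    val counts = ones.updateStateByKey[Int] { (newViews: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + newViews.sum)
    }

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```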
ARCHITECTURE
DSTREAM API
Output operations: save output to an external database / filesystem
Transformations
- Stateless: map, reduce, groupBy, join
- Stateful: window("5s") → RDDs with data in [0,5), [1,6), [2,7)
  reduceByWindow("5s", (a, b) => a + b) → incremental aggregation
ASSOCIATIVE, INVERTIBLE
Naive windowed reduce: re-add the previous 5 intervals at every step.
With an associative and invertible reduce function: subtract the interval leaving the window and add the current one (see the sketch below).
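This incremental form maps onto Spark Streaming's reduceByKeyAndWindow, which accepts both an add and an inverse (subtract) function. A hedged sketch, reusing the DStream `ones` from the sketch above and assuming checkpointing is enabled:

```scala
// Incremental windowed count per URL: each step subtracts the interval that
// left the window and adds the one that entered, instead of re-summing all 5.
import org.apache.spark.streaming.Seconds

val windowedCounts = ones.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // add counts entering the window
  (a: Int, b: Int) => a - b,  // subtract counts leaving the window
  Seconds(5),                 // window length
  Seconds(1))                 // slide interval (one micro-batch)
windowedCounts.print()
```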
OTHER ASPECTS
Tracking state: streams of (Key, Event) → (Key, State)
- Initialize: create a State from the first event
- Update: return a new State given the old state and an event
- Timeout for dropping old states
events.track(
  (key, ev) => 1,                          // initialize
  (key, st, ev) => ev == Exit ? null : 1,  // update
  "30s")                                   // timeout
Unifying batch and stream
- Join a DStream with a static RDD
- Attach a console and query existing RDDs
- Shared codebase, functions, etc.
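The track() call above is the paper's pseudocode. The closest analogue in the Spark Streaming API is mapWithState; the sketch below is written under that assumption, for a hypothetical DStream `events` of (userId, eventType) string pairs. State is 1 while a session is active, removed on an "Exit" event, and dropped automatically after a 30-second timeout.

```scala
// Hedged mapWithState sketch of the track() semantics (initialize/update/timeout).
import org.apache.spark.streaming.{Seconds, State, StateSpec}

val spec = StateSpec.function(
  (key: String, event: Option[String], state: State[Int]) => {
    event match {
      case Some("Exit") => state.remove()  // session ended: drop the state
      case Some(_)      => state.update(1) // any other event: session is active
      case None         => ()              // timeout firing, no new event
    }
    (key, state.getOption.getOrElse(0))
  }
).timeout(Seconds(30))                     // drop states idle for 30s

// events: DStream[(String, String)] of (userId, eventType), assumed defined earlier.
val sessions = events.mapWithState(spec)
```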
SYSTEM IMPLEMENTATION
OPTIMIZATIONS
Network communication
- Rewrote Spark's data plane to use asynchronous I/O
Timestep pipelining
- No barrier across timesteps unless needed
- Tasks from the next timestep can be scheduled before the current one finishes
Checkpointing (configuration sketched below)
- Asynchronous I/O, since RDDs are immutable
- Forget lineage after a checkpoint
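For checkpointing, the user-facing knobs in the Spark Streaming API are the checkpoint directory and an optional per-stream checkpoint interval; the asynchronous writes and lineage truncation happen inside the runtime. A small sketch, reusing `ssc` and `counts` from the earlier example and a hypothetical HDFS path:

```scala
// Where and how often state RDDs are checkpointed (path is a placeholder).
ssc.checkpoint("hdfs://namenode:8020/checkpoints") // durable checkpoint directory
counts.checkpoint(Seconds(10))                     // checkpoint state RDDs every 10s
```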
FAULT TOLERANCE: PARALLEL RECOVERY
Worker failure
- Need to recompute state RDDs stored on the failed worker
- Re-execute tasks that were running on the worker
Strategy
- Run all independent recovery tasks in parallel
- Parallelism comes from partitions within a timestep and across timesteps
EXAMPLE (REVISITED)
pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
FAULT TOLERANCE
Straggler mitigation
- Use speculative execution
- A task running more than 1.4x longer than the median task → straggler (see the sketch below)
Master recovery
- At each timestep, write out the graph of DStreams and Scala function objects
- Workers connect to the new master and report their RDD partitions
- Note: no problem if a given RDD is computed twice (determinism)
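An illustrative sketch of the 1.4x rule (not Spark's actual scheduler code): given the running times of a stage's tasks, flag those exceeding 1.4x the median as speculation candidates.

```scala
// Flag tasks whose running time exceeds 1.4x the median of their stage.
// e.g. stragglers(Seq(100L, 110L, 120L, 400L)) == Seq(400L)
def stragglers(runningTimesMs: Seq[Long]): Seq[Long] = {
  val sorted = runningTimesMs.sorted
  val median = sorted(sorted.length / 2)
  runningTimesMs.filter(_ > 1.4 * median)
}
```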
DISCUSSION / SHORTCOMINGS
Expressiveness
- Current API requires users to "think" in micro-batches
Setting the batch interval
- Manual tuning: a larger batch interval → better throughput but worse latency
Memory usage
- State RDDs are kept in memory in an LRU cache
SUMMARY
Micro-batches: a new approach to stream processing
Trades higher latency for fault tolerance and straggler mitigation
Unifies batch and streaming analytics