CS 744: SPARK STREAMING Shivaram Venkataraman Fall 2019
ADMINISTRIVIA
- Midterm grades this week
- Course projects: sign up for meetings
[Course stack: Applications (Machine Learning, SQL, Streaming, Graph) → Computational Engines → Scalable Storage Systems → Resource Management → Datacenter Architecture]
CONTINUOUS OPERATOR MODEL
- Long-lived operators with mutable state
- Distributed checkpoints for fault recovery
- Stragglers?
[Figure: continuous operator dataflow (e.g., Naiad); driver sends control messages, tasks exchange data via network transfers]
CONTINUOUS OPERATORS
SPARK STREAMING: GOALS
1. Scalability to hundreds of nodes
2. Minimal cost beyond base processing (no replication)
3. Second-scale latency
4. Second-scale recovery from faults and stragglers
DISCRETIZED STREAMS (DSTREAMS)
Structure the streaming computation as a series of small, deterministic batch jobs; state between batches is stored as immutable RDDs.
EXAMPLE
pageViews = readStream("http://...", "1s")      // stream of page-view events in 1s batches
ones = pageViews.map(event => (event.url, 1))   // one count per view
counts = ones.runningReduce((a, b) => a + b)    // running count per URL
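The snippet above is the paper's pseudocode; `readStream` and `runningReduce` are not literal Spark Streaming calls. A minimal runnable sketch of the same pipeline against the actual DStream API might look like the following, where the socket source on localhost:9999, the checkpoint path, and the `local[2]` master are assumptions standing in for the elided HTTP source:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RunningCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("RunningCounts")
    val ssc = new StreamingContext(conf, Seconds(1))   // 1s batch interval, as in the slide
    ssc.checkpoint("/tmp/checkpoints")                 // stateful ops require a checkpoint dir

    // Socket source as a stand-in for the HTTP source: assumes a process
    // writing one URL per line to localhost:9999
    val pageViews = ssc.socketTextStream("localhost", 9999)
    val ones = pageViews.map(url => (url, 1))

    // runningReduce ~ updateStateByKey: fold each batch's counts into per-URL state
    val counts = ones.updateStateByKey[Int] { (batch: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + batch.sum)
    }
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```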
ARCHITECTURE
DSTREAM API
Transformations
- Stateless: map, reduce, groupBy, join
- Stateful: window("5s") → RDDs with data in [0,5), [1,6), [2,7)
  reduceByWindow("5s", (a, b) => a + b)
(see the sketch below)
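In the shipped Spark Streaming API these stateful operations take explicit window and slide durations. A sketch, assuming `ones: DStream[(String, Int)]` from the earlier example and a 1s batch interval:

```scala
import org.apache.spark.streaming.Seconds

// window("5s") → each output RDD covers the last 5 intervals, sliding by 1
val lastFive = ones.window(Seconds(5), Seconds(1))

// reduceByWindow-style per-key aggregate over the same window
val windowCounts = ones.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // associative reduce applied across the window
  Seconds(5), Seconds(1))
```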
SLIDING WINDOW
Naive approach: re-aggregate the previous 5 intervals on each slide. With an invertible reduce function, the window can instead be updated incrementally (see the sketch below).
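The incremental variant is expressed in Spark Streaming by supplying an inverse reduce function; continuing the earlier sketch (checkpointing, set up above, is required for this form):

```scala
// Incremental sliding window: instead of re-reducing all 5 intervals on every
// slide, add the interval entering the window and subtract the one leaving it.
val incrementalCounts = ones.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // new data entering the window
  (a: Int, b: Int) => a - b,   // old data leaving the window
  Seconds(5), Seconds(1))
```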
STATE MANAGEMENT
Tracking state: streams of (Key, Event) → (Key, State)
events.track(
  (key, ev) => 1,                          // initial state for a new key
  (key, st, ev) => ev == Exit ? null : 1,  // update; returning null drops the key
  "30s")                                   // timeout for idle keys
(see the sketch below)
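`track()` is the API proposed in the paper; the closest shipped analogue is Spark's later `mapWithState`, which supports per-key state with a timeout. A sketch, assuming a hypothetical `events: DStream[(String, String)]` of (sessionKey, eventType) pairs:

```scala
import org.apache.spark.streaming.{Seconds, State, StateSpec}

val spec = StateSpec.function(
  (key: String, ev: Option[String], state: State[Int]) => {
    if (state.isTimingOut()) {
      // key idle for 30s: Spark evicts the state; no update allowed here
    } else if (ev.contains("Exit")) {
      state.remove()             // like returning null in track(): drop the key
    } else {
      state.update(1)            // initialize/refresh: session is active
    }
    (key, state.getOption.getOrElse(0))
  }).timeout(Seconds(30))        // the "30s" timeout from the slide

val sessions = events.mapWithState(spec)
```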
SYSTEM IMPLEMENTATION
OPTIMIZATIONS
Timestep pipelining
- No barrier across timesteps unless needed
- Tasks from the next timestep are scheduled before the current one finishes
Checkpointing
- Async I/O, as RDDs are immutable
- Forget lineage after a checkpoint (see the sketch below)
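Continuing the earlier sketch, periodic checkpointing is just two calls (the HDFS path is an assumption). Because RDDs are immutable, the snapshot can be written asynchronously while the next timestep runs; once written, lineage before the checkpoint can be dropped:

```scala
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")  // metadata + RDD snapshots
counts.checkpoint(Seconds(10))   // snapshot the stateful `counts` DStream every 10 batches
```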
FAULT TOLERANCE: PARALLEL RECOVERY
Worker failure
- Need to recompute state RDDs stored on the worker
- Re-execute tasks that were running on the worker
Strategy
- Run all independent recovery tasks in parallel
- Parallelism comes from partitions within a timestep and across timesteps
FAULT TOLERANCE
Straggler mitigation
- Use speculative execution
- A task running more than 1.4x longer than the median task → straggler
Master recovery
- At each timestep, save the graph of DStreams and Scala function objects
- Workers connect to the new master and report their RDD partitions
- Note: no problem if a given RDD is computed twice (determinism)
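The 1.4x-of-median threshold maps onto real Spark configuration keys; a sketch assuming these are set when building the conf (Spark's default multiplier is 1.5):

```scala
import org.apache.spark.SparkConf

val specConf = new SparkConf()
  .set("spark.speculation", "true")            // enable speculative execution
  .set("spark.speculation.multiplier", "1.4")  // task > 1.4x median duration → speculate
```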
DISCUSSION https://forms.gle/xUvzC1bdV7H48mTM8
If the latency bound were reduced to 100ms, how do you think the figure above would change? What could be the reasons for it?
Consider the pros and cons of approaches in Naiad vs Spark Streaming. What application properties would you use to decide which system to choose?
NEXT STEPS Next class: Graph processing Sign up for project check-ins!
SHORTCOMINGS?
Expressiveness
- The current API requires users to "think" in micro-batches
Setting the batch interval
- Manual tuning: a larger batch interval → better throughput but worse latency
Memory usage
- An LRU cache stores state RDDs in memory
COMPUTATION MODEL: MICRO-BATCHES
[Figure: computation split into micro-batches separated by a shuffle; the driver sends a control message per batch, tasks exchange data via network transfers]
SUMMARY
- Micro-batches: a new approach to stream processing
- Slightly higher latency in exchange for fault tolerance and straggler mitigation
- Unifies batch and streaming analytics