Spark: Cluster Computing with Working Sets

Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica
UC Berkeley

Background
MapReduce and Dryad raised the level of abstraction in cluster programming by hiding scaling and faults
However, these systems provide a limited programming model: acyclic data flow
Can we design similarly powerful abstractions for a broader class of applications?

Spark Goals
Support applications with working sets (datasets reused across parallel operations)
  » Iterative jobs (common in machine learning)
  » Interactive data mining
Retain MapReduce’s fault tolerance and scalability
Experiment with programmability
  » Integrate into the Scala programming language
  » Support interactive use from the Scala interpreter

Programming Model
Resilient distributed datasets (RDDs)
  » Created from HDFS files or "parallelized" arrays
  » Can be transformed with map and filter
  » Can be cached across parallel operations
Parallel operations on RDDs
  » reduce, collect, foreach
Shared variables
  » Accumulators (add-only), broadcast variables
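The three pieces of the programming model (transformations, parallel operations, shared variables) can be imitated in a single process with plain Scala collections. The sketch below is only an analogy under stated assumptions: `AddOnlyAccumulator` and `Broadcast` are hypothetical stand-ins written for this example, not Spark classes, and the word list is made-up sample data.

```scala
// Single-process analogy of the programming model; no Spark dependency.
// AddOnlyAccumulator and Broadcast are hypothetical stand-ins, not Spark classes.
object ProgrammingModelSketch {
  // Add-only shared variable: tasks may only add to it; the driver reads the total.
  final class AddOnlyAccumulator(private var total: Long = 0L) {
    def +=(n: Long): Unit = synchronized { total += n }
    def value: Long = synchronized { total }
  }

  // Read-only shared variable: created once, then only read by tasks.
  final case class Broadcast[T](value: T)

  // Returns (sum of kept word lengths, number of skipped words).
  // Assumes at least one word survives the filter, because reduce fails on empty input.
  def run(records: Seq[String]): (Int, Long) = {
    val stopWords = Broadcast(Set("the", "a"))
    val skipped = new AddOnlyAccumulator()

    // A "parallelized" array transformed with filter and map.
    val lengths = records
      .filter { w =>
        val keep = !stopWords.value.contains(w)
        if (!keep) skipped += 1
        keep
      }
      .map(_.length)

    // Parallel operation: reduce combines the per-record results.
    (lengths.reduce(_ + _), skipped.value)
  }

  def main(args: Array[String]): Unit = {
    val (totalLength, skippedCount) = run(Seq("the", "spark", "a", "rdd"))
    println(s"total length = $totalLength, skipped = $skippedCount")
  }
}
```

Restricting the accumulator to addition is what makes it safe to update from many tasks: additions commute, so the order in which tasks finish does not change the total.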

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

  val lines = spark.textFile("hdfs://...")            // base RDD
  val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
  val messages = errors.map(_.split('\t')(2))
  val cachedMsgs = messages.cache()                   // cached RDD

  cachedMsgs.filter(_.contains("foo")).count          // parallel operation
  cachedMsgs.filter(_.contains("bar")).count

[Figure: the driver sends tasks to three workers and receives results; each worker reads one block of the file (Block 1 to 3) and keeps its own cache (Cache 1 to 3)]
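The same filter, map, cache, and count steps can be tried locally without a cluster. This is a minimal sketch using plain Scala collections in place of RDDs; the sample log lines and their tab-separated layout are assumptions made for illustration, and `toVector` merely stands in for `cache()` by materializing the messages once.

```scala
// Local analogy of the log-mining session using plain Scala collections.
// The sample log lines and the tab-separated layout are assumptions.
object LogMiningSketch {
  // Keep ERROR lines and extract the third tab-separated field (the message).
  def errorMessages(lines: Seq[String]): Vector[String] =
    lines.view
      .filter(_.startsWith("ERROR"))
      .map(_.split('\t'))
      .collect { case fields if fields.length > 2 => fields(2) }
      .toVector // materialize once, standing in for cache()

  // Count the messages that contain the given pattern.
  def countContaining(messages: Seq[String], pattern: String): Int =
    messages.count(_.contains(pattern))

  val sampleLines: Seq[String] = Seq(
    "ERROR\t10:01\tfoo failed on node 3",
    "INFO\t10:02\tfoo started",
    "ERROR\t10:03\tbar timed out",
    "ERROR\t10:04\tfoo retry exhausted"
  )

  def main(args: Array[String]): Unit = {
    val cachedMsgs = errorMessages(sampleLines)
    println(countContaining(cachedMsgs, "foo"))
    println(countContaining(cachedMsgs, "bar"))
  }
}
```

The point of the pattern is that the expensive part (reading and parsing the log) happens once, while each interactive query only scans the already extracted messages.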

RDD Representation
Each RDD object maintains lineage information that can be used to reconstruct lost partitions

Ex:
  cachedMsgs = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))
                            .cache()

  HdfsRDD (path: hdfs://…) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(…)) -> CachedRDD
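One way to picture lineage is as a chain of objects, each holding a pointer to its parent and the function that derives it. The sketch below is a hypothetical model, not Spark's implementation: the class names only echo the chain described above, the whole dataset is treated as a single partition, and the in-memory `Map` passed as `load` is an assumed stand-in for reading a file.

```scala
// Hypothetical model of lineage: every dataset can recompute itself from its parent.
// These are illustration-only classes, not Spark's RDD classes.
object LineageSketch {
  sealed trait Dataset {
    def compute(): Vector[String] // rebuild the contents by walking the lineage
  }
  final case class Source(path: String, load: String => Vector[String]) extends Dataset {
    def compute(): Vector[String] = load(path)
  }
  final case class Filtered(parent: Dataset, pred: String => Boolean) extends Dataset {
    def compute(): Vector[String] = parent.compute().filter(pred)
  }
  final case class Mapped(parent: Dataset, f: String => String) extends Dataset {
    def compute(): Vector[String] = parent.compute().map(f)
  }
  final class Cached(parent: Dataset) extends Dataset {
    private var data: Option[Vector[String]] = None
    def isCached: Boolean = data.isDefined
    def evict(): Unit = data = None // simulate losing the cached copy
    def compute(): Vector[String] = data.getOrElse {
      val rebuilt = parent.compute() // cache miss: recompute from lineage
      data = Some(rebuilt)
      rebuilt
    }
  }

  // Source -> Filtered -> Mapped -> Cached, as in the example chain.
  def pipeline(path: String, load: String => Vector[String]): Cached =
    new Cached(
      Mapped(
        Filtered(Source(path, load), _.contains("error")),
        _.split('\t')(2)))

  def main(args: Array[String]): Unit = {
    // Assumed sample data keyed by a made-up path.
    val store = Map("sample.log" -> Vector("error\ta\tdisk full", "info\tb\tok"))
    val cachedMsgs = pipeline("sample.log", store)
    val first = cachedMsgs.compute()
    cachedMsgs.evict()
    val rebuilt = cachedMsgs.compute()
    println(first == rebuilt)
  }
}
```

Because each node stores only its parent and a function, recovering lost data costs a recomputation rather than a stored replica.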

Example: Logistic Regression
Goal: find best line separating two sets of points
[Figure: scatter plot of + and – points, showing a random initial line and the target separating line]

Logistic Regression Code

  val data = spark.textFile(...).map(readPoint).cache()

  var w = Vector.random(D)

  for (i <- 1 to ITERATIONS) {
    val gradient = data.map(p => {
      val scale = (1/(1+exp(-p.y*(w dot p.x))) - 1) * p.y
      scale * p.x
    }).reduce(_ + _)
    w -= gradient
  }

  println("Final w: " + w)
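The update rule in this code can be exercised on a laptop with no cluster. The following is a minimal sketch that keeps the same gradient expression but swaps the RDD for a local `Seq` and the vector type for `Array[Double]`; the `Point` class, the helper names, the fixed seed, and the six sample points are all assumptions introduced for illustration.

```scala
import scala.math.exp
import scala.util.Random

// Local sketch of the same gradient step, using Array[Double] instead of a vector class.
// Point, dot, train, predict and the sample data are illustrative assumptions.
object LogisticRegressionSketch {
  final case class Point(x: Array[Double], y: Double) // y is +1.0 or -1.0

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  def train(data: Seq[Point], dims: Int, iterations: Int, seed: Long = 42L): Array[Double] = {
    val rng = new Random(seed)
    var w = Array.fill(dims)(2 * rng.nextDouble() - 1) // random initial line
    for (_ <- 1 to iterations) {
      // map: per-point gradient contribution; reduce: element-wise sum
      val gradient = data.map { p =>
        val scale = (1 / (1 + exp(-p.y * dot(w, p.x))) - 1) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.indices.map(i => a(i) + b(i)).toArray)
      w = w.indices.map(i => w(i) - gradient(i)).toArray
    }
    w
  }

  // Classify by the side of the line through the origin that x falls on.
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0) 1.0 else -1.0

  // Small, linearly separable sample set (made up for this sketch).
  val sampleData: Seq[Point] = Seq(
    Point(Array(2.0, 1.0), 1.0), Point(Array(3.0, 2.0), 1.0), Point(Array(2.5, 3.0), 1.0),
    Point(Array(-2.0, -1.0), -1.0), Point(Array(-3.0, -2.0), -1.0), Point(Array(-1.0, -3.0), -1.0))

  def main(args: Array[String]): Unit = {
    val w = train(sampleData, dims = 2, iterations = 10)
    val allCorrect = sampleData.forall(p => predict(w, p.x) == p.y)
    println("Final w: " + w.mkString("(", ", ", ")"))
    println("All training points classified correctly: " + allCorrect)
  }
}
```

Every iteration reads the whole dataset again, which is why keeping `data` in memory between iterations matters so much for this workload.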

Logistic Regression Performance
[Figure: running time chart]
Hadoop: 127 s / iteration
Spark: 174 s for the first iteration, 6 s for further iterations
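To see what these per-iteration figures imply for a whole job, the totals can be extrapolated linearly. This is a back-of-the-envelope sketch, assuming that 127 s is the constant per-iteration cost of the MapReduce-style baseline, that 174 s and 6 s are the first and subsequent iteration costs with caching, and that no other overhead applies; the object and function names are made up for this example.

```scala
// Linear extrapolation of total running time from the per-iteration figures.
// Assumes constant per-iteration costs and no other overhead.
object IterationCostSketch {
  val BaselinePerIteration = 127 // seconds, every iteration
  val CachedFirstIteration = 174 // seconds, first iteration (loads and caches the data)
  val CachedLaterIteration = 6   // seconds, each further iteration

  def baselineTotal(n: Int): Int = BaselinePerIteration * n

  def cachedTotal(n: Int): Int =
    if (n <= 0) 0 else CachedFirstIteration + CachedLaterIteration * (n - 1)

  def main(args: Array[String]): Unit =
    for (n <- Seq(1, 2, 10, 30))
      println(s"$n iterations: baseline ${baselineTotal(n)} s, cached ${cachedTotal(n)} s")
}
```

Under these assumptions the cached approach is slower for a single iteration (174 s against 127 s) but already ahead at two iterations (180 s against 254 s), and the gap widens with every further pass over the data.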

Conclusions & Future Work
Spark provides a limited but efficient set of fault-tolerant distributed memory abstractions
  » Resilient distributed datasets (RDDs)
  » Restricted shared variables
In future work, we plan to further extend this model:
  » More RDD transformations (e.g. shuffle)
  » More RDD persistence options (e.g. disk + memory)
  » Updatable RDDs (for incremental or streaming jobs)
  » Data sharing across applications

Related Work
DryadLINQ
  » Build queries through language-integrated SQL operations on lazy datasets
  » Cannot have a dataset persist across queries
  » No concept of shared variables for broadcast etc.
Pig and Hive
  » Query languages that can call into Java/Python/etc. UDFs
  » No support for caching datasets across queries
OpenMP
  » Compiler extension for parallel loops in C++
  » Annotate variables as read-only or accumulator above loop
  » Cluster version exists, but not fault-tolerant
Twister and HaLoop
  » Iterative MapReduce implementations using caching
  » Can’t define multiple distributed datasets, run multiple map & reduce pairs on them, or decide which operations to run next interactively
