Distributed Computing with Spark
Reza Zadeh
Thanks to Matei Zaharia

Outline
Data flow vs. traditional network programming
Limitations of MapReduce
Spark computing engine
Numerical computing on Spark
Ongoing work

Problem
Data growing faster than processing speeds
Only solution is to parallelize on large clusters
  » Wide use in both enterprises and web industry
How do we program these things?

Traditional Network Programming
Message-passing between nodes (e.g. MPI)
Very difficult to do at scale:
  » How to split problem across nodes?
    • Must consider network & data locality
  » How to deal with failures? (inevitable at scale)
  » Even worse: stragglers (node not failed, but slow)
  » Ethernet networking not fast
Rarely used in commodity datacenters

Data Flow Models
Restrict the programming interface so that the system can do more automatically
Express jobs as graphs of high-level operators
  » System picks how to split each operator into tasks and where to run each task
  » Run parts twice for fault recovery
Biggest example: MapReduce
[Figure: MapReduce data flow, with several Map tasks feeding Reduce tasks]
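To make the operator-graph idea concrete, here is a minimal single-machine sketch of one MapReduce job (map, shuffle, reduce). The names `map_reduce`, `word_mapper` and `sum_reducer` are hypothetical and chosen for illustration; a real engine would split each phase into tasks and place them on cluster nodes.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Single-machine imitation of one MapReduce job (illustrative only)."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: values are grouped by key.
            groups[key].append(value)
    # Reduce phase: each key's values collapse to a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    return sum(values)


word_counts = map_reduce(["a b a", "b c"], word_mapper, sum_reducer)
```

Because the user supplies only `mapper` and `reducer`, the system is free to decide how the work is split and to rerun any piece after a failure.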

MapReduce Numerical Algorithms
Matrix-vector multiplication
Power iteration (e.g. PageRank)
Gradient descent methods
Stochastic SVD
Tall skinny QR
Many others!
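As one illustration of the first item, sparse matrix-vector multiplication fits the map/reduce pattern: every nonzero entry emits a partial product keyed by its row, and the reduce step sums per row. The sketch below runs on one machine; the function name `matvec` and the (row, col, value) triple format are assumptions made for this example, and it assumes the last row contains at least one nonzero.

```python
from collections import defaultdict


def matvec(entries, x):
    """Multiply a sparse matrix, given as (row, col, value) triples, by x."""
    partial = defaultdict(float)
    for row, col, value in entries:
        # Map: each nonzero contributes value * x[col], keyed by its row.
        # Reduce: contributions that share a row key are summed.
        partial[row] += value * x[col]
    n_rows = 1 + max(row for row, _, _ in entries)
    return [partial[i] for i in range(n_rows)]


# The matrix [[2, 1], [0, 3]] times the vector [1, 2].
entries = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]
y = matvec(entries, [1.0, 2.0])
```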

Why Use a Data Flow Engine?
Ease of programming
  » High-level functions instead of message passing
Wide deployment
  » More common than MPI, especially “near” data
Scalability to very largest clusters
  » Even HPC world is now concerned about resilience
Examples: Pig, Hive, Scalding, Storm

Limitations of MapReduce
MapReduce is great at one-pass computation, but inefficient for multi-pass algorithms
No efficient primitives for data sharing
  » State between steps goes to distributed file system
  » Slow due to replication & disk storage

Example: Iterative Apps
[Figure: an iterative job reads from and writes to the file system on every iteration (iter. 1, iter. 2, ...); interactive queries (query 1, query 2, query 3, ...) each read the input from the file system again to produce result 1, result 2, result 3, ...]
Commonly spend 90% of time doing I/O

Example: PageRank
Repeatedly multiply sparse matrix and vector
Requires repeatedly hashing together page adjacency lists and rank vector
Same file grouped over and over
[Figure: Neighbors (id, edges) and Ranks (id, rank) are grouped together in iteration 1, iteration 2, iteration 3, ...]

Result
While MapReduce is simple, it can require asymptotically more communication or I/O

Spark Computing Engine
Extends MapReduce model with primitives for efficient data sharing
  » “Resilient distributed datasets”
Open source at Apache
  » Most active community in big data, with 50+ companies contributing
Clean APIs in Java, Scala, Python

Resilient Distributed Datasets (RDDs)
Collections of objects stored across a cluster
User-controlled partitioning & storage (memory, disk, …)
Automatically rebuilt on failure
    urls = spark.textFile("hdfs://...")
    records = urls.map(lambda s: (s, 1))
    counts = records.reduceByKey(lambda a, b: a + b)   # known to be hash-partitioned
    bigCounts = counts.filter(lambda kv: kv[1] > 10)   # kv is (url, cnt); partitioning also known

    bigCounts.cache()

    bigCounts.filter(lambda kv: "news" in kv[0]).count()

    bigCounts.join(otherPartitionedRDD)
[Figure: Input file → map → reduce → filter]

Key Idea
Resilient Distributed Datasets (RDDs)
  » Collections of objects across a cluster with user controlled partitioning & storage (memory, disk, ...)
  » Built via parallel transformations (map, filter, …)
  » Automatically rebuilt on failure

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
    lines = spark.textFile("hdfs://...")
    errors = lines.filter(lambda s: s.startswith("ERROR"))   # transformed RDD
    messages = errors.map(lambda s: s.split("\t")[2])
    messages.cache()

    messages.filter(lambda s: "foo" in s).count()            # action
    messages.filter(lambda s: "bar" in s).count()
    # . . .
[Figure: the Driver sends tasks to three Workers and collects results; each Worker reads one block of the file (Block 1, Block 2, Block 3) and keeps its part of the data in a cache (Cache 1, Cache 2, Cache 3)]
Result: full-text search of Wikipedia in 0.5 sec (vs 20 s for on-disk data)

Fault Tolerance
RDDs track lineage info to rebuild lost data
    file.map(lambda rec: (rec.type, 1)) \
        .reduceByKey(lambda x, y: x + y) \
        .filter(lambda kv: kv[1] > 10)   # kv is (type, count)
[Figure: lineage graph, Input file → map → reduce → filter]
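The lineage idea can be sketched without Spark. The toy class below is hypothetical (`ToyRDD`, `lose` and `get` are names invented for this example, not Spark APIs): each dataset remembers its parent and the per-partition function that produced it, so a lost partition is recomputed from its parent instead of being restored from a replica. Unlike real RDDs it evaluates eagerly and only models per-partition (narrow) transformations.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: partitions plus the recipe that built them."""

    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions  # materialized data; an entry may be lost (None)
        self.parent = parent          # lineage: the dataset this one was derived from
        self.transform = transform    # lineage: function applied to each parent partition

    def _derive(self, transform):
        computed = [transform(self.get(i)) for i in range(len(self.partitions))]
        return ToyRDD(computed, parent=self, transform=transform)

    def map(self, fn):
        return self._derive(lambda part: [fn(x) for x in part])

    def filter(self, pred):
        return self._derive(lambda part: [x for x in part if pred(x)])

    def lose(self, index):
        # Simulate a node failure that drops one partition.
        self.partitions[index] = None

    def get(self, index):
        if self.partitions[index] is None:
            if self.parent is None:
                raise RuntimeError("base data must be re-read from stable storage")
            # Rebuild only the lost partition by replaying the lineage.
            self.partitions[index] = self.transform(self.parent.get(index))
        return self.partitions[index]


base = ToyRDD([[1, 2, 3], [4, 5, 6]])
squared = base.map(lambda x: x * x)
evens_squared = squared.filter(lambda x: x % 2 == 0)

# Lose the same partition at two levels of the lineage, then recover it.
squared.lose(1)
evens_squared.lose(1)
recovered = evens_squared.get(1)
```

Only partition 1 is recomputed; partition 0 is never touched, which is the property that makes lineage-based recovery cheap.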

Partitioning
RDDs know their partitioning functions
    file.map(lambda rec: (rec.type, 1)) \
        .reduceByKey(lambda x, y: x + y) \
        .filter(lambda kv: kv[1] > 10)   # kv is (type, count)
[Figure: Input file → map → reduce → filter; the output of reduce is known to be hash-partitioned, and the partitioning of the filter output is also known]
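Why known partitioning matters can be shown with a small local sketch: if two keyed datasets were split with the same hash function and partition count, matching keys already sit in the same partition index, so a join can proceed partition by partition with no shuffle. `partition_by` and `local_join` are hypothetical helper names for this illustration, not Spark functions.

```python
def partition_by(pairs, num_partitions):
    """Place each (key, value) pair in the partition chosen by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def local_join(left_part, right_part):
    """Join two co-located partitions; no data leaves the partition."""
    right_index = {}
    for key, value in right_part:
        right_index.setdefault(key, []).append(value)
    return [(key, (lv, rv))
            for key, lv in left_part
            for rv in right_index.get(key, [])]


# Both datasets use the same partitioner, so partition i of one
# only ever needs partition i of the other.
links = partition_by([(1, [2, 3]), (2, [1]), (3, [1, 2])], 2)
ranks = partition_by([(1, 1.0), (2, 1.0), (3, 1.0)], 2)
joined = [local_join(l, r) for l, r in zip(links, ranks)]
```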

Logistic Regression
    import numpy
    from math import exp

    data = spark.textFile(...).map(readPoint).cache()

    w = numpy.random.rand(D)

    for i in range(iterations):
        # gradient of the logistic loss, summed over all points
        gradient = data.map(lambda p:
            (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda a, b: a + b)
        w -= gradient

    print("Final w: %s" % w)
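For readers who want to run the update rule without a cluster, here is a minimal single-machine sketch of batch gradient descent for logistic regression with labels in {-1, +1}. It minimizes the sum of log(1 + exp(-y * w·x)), whose per-point gradient is (1 / (1 + exp(-y * w·x)) - 1) * y * x. The function names, the step size of 0.1 and the four sample points are assumptions made for illustration.

```python
import math


def logistic_gradient(w, points):
    """Sum over points of the gradient of log(1 + exp(-y * w.x))."""
    grad = [0.0] * len(w)
    for x, y in points:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
        for j, xj in enumerate(x):
            grad[j] += scale * xj  # the "map" result, accumulated as in "reduce"
    return grad


def train(points, dims, iterations=100, step=0.1):
    w = [0.0] * dims
    for _ in range(iterations):
        grad = logistic_gradient(w, points)
        w = [wi - step * gi for wi, gi in zip(w, grad)]
    return w


# Each point is (features, label); the first feature is a constant bias term.
points = [([1.0, 2.0], 1), ([1.0, -1.5], -1), ([1.0, 1.0], 1), ([1.0, -2.0], -1)]
w = train(points, dims=2)
```

The inner loop over `points` is the part that a data flow engine distributes: each point's contribution is independent, and the sums combine associatively.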

Logistic Regression Results
[Figure: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s

PageRank
Using cache(), keep neighbor lists in RAM
Using partitioning, avoid repeated hashing
[Figure: Neighbors (id, edges) is partitioned once with partitionBy; Ranks (id, rank) is then joined with it in every iteration (join, join, join, ...), with matching partitions on the same node]

PageRank Code
    # RDD of (id, neighbors) pairs
    links = spark.textFile(...).map(parsePage) \
                 .partitionBy(128).cache()

    ranks = links.mapValues(lambda v: 1.0)  # RDD of (id, rank)

    def contributions(pair):
        # pair is (id, (neighbors, rank)); split rank evenly among neighbors
        _, (neighbors, rank) = pair
        return [(d, rank / len(neighbors)) for d in neighbors]

    for i in range(ITERATIONS):
        ranks = links.join(ranks).flatMap(contributions) \
                     .reduceByKey(lambda a, b: a + b)
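The same iteration can be tried on a tiny in-memory graph. The sketch below uses plain dictionaries in place of RDDs; `pagerank_step` and the three-page graph are hypothetical, and like the slide code it omits a damping factor. It assumes every page has at least one outgoing and one incoming link.

```python
def pagerank_step(links, ranks):
    """One iteration: each page splits its rank evenly among its neighbors."""
    contributions = {}
    for page, neighbors in links.items():   # plays the role of links.join(ranks)
        share = ranks[page] / len(neighbors)
        for dest in neighbors:               # plays the role of flatMap
            # plays the role of reduceByKey: sum the shares per destination
            contributions[dest] = contributions.get(dest, 0.0) + share
    return contributions


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 for page in links}
for _ in range(20):
    ranks = pagerank_step(links, ranks)
```

Note that `links` never changes between iterations, which is why caching and partitioning it once pays off.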

PageRank Results
[Figure: bar chart of time per iteration (s)]
Hadoop: 171
Basic Spark: 72
Spark + Controlled Partitioning: 23

Alternating Least Squares
[Figure: $R = A B^T$]
1. Start with random $A_1$, $B_1$
2. Solve for $A_2$ to minimize $\|R - A_2 B_1^T\|$
3. Solve for $B_2$ to minimize $\|R - A_2 B_2^T\|$
4. Repeat until convergence
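The alternating scheme can be sketched in a few lines for the simplest case, a rank-1 factorization of a fully observed matrix, where each least-squares solve reduces to a closed-form division. The function `als_rank1` and the sample matrix are assumptions made for illustration; a general rank-k version would solve a small linear system per row instead, and in this ordering only the column factor needs a random start.

```python
import random


def als_rank1(R, iterations=20, seed=0):
    """Rank-1 alternating least squares: find vectors a, b with R ~ a b^T."""
    rng = random.Random(seed)
    m, n = len(R), len(R[0])
    b = [rng.random() + 0.1 for _ in range(n)]  # random positive start
    a = [0.0] * m
    for _ in range(iterations):
        # Fix b, solve for a: a_i = (sum_j R_ij * b_j) / (sum_j b_j^2)
        bb = sum(v * v for v in b)
        a = [sum(R[i][j] * b[j] for j in range(n)) / bb for i in range(m)]
        # Fix a, solve for b: b_j = (sum_i R_ij * a_i) / (sum_i a_i^2)
        aa = sum(v * v for v in a)
        b = [sum(R[i][j] * a[i] for i in range(m)) / aa for j in range(n)]
    return a, b


# R is exactly rank 1: the outer product of [2, 3, 1] and [1, 2].
R = [[2.0, 4.0], [3.0, 6.0], [1.0, 2.0]]
a, b = als_rank1(R)
error = max(abs(R[i][j] - a[i] * b[j]) for i in range(3) for j in range(2))
```

Each update of `a` needs whole rows of R and each update of `b` needs whole columns, which is one way to see why it helps to have R available in both row-wise and column-wise layouts.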

ALS on Spark
[Figure: $R = A B^T$]
Cache 2 copies of R in memory, one partitioned by rows and one by columns
Keep A & B partitioned in corresponding way
Operate on blocks to lower communication

ALS Results
[Figure: bar chart of total time (s)]
Mahout / Hadoop: 4208
Spark (Scala): 481
GraphLab (C++): 297

Benefit for Users
Same engine performs data extraction, model training and interactive queries
[Figure: with separate engines, each of the parse, query and train stages reads from and writes to the DFS; with Spark, the same stages run in one engine between a single DFS read and the final DFS]

Spark Community
Most active open source community in big data
200+ developers, 50+ companies contributing
[Figure: bar chart of contributors in past year, axis 0 to 150; labeled bars include Giraph and Storm]
