Parallel Programming with Spark


  1. Parallel Programming with Spark, Qin Liu, The Chinese University of Hong Kong

  2. Previously on Parallel Programming
     OpenMP: an API for writing multi-threaded applications
     • A set of compiler directives and library routines for parallel application programmers
     • Greatly simplifies writing multi-threaded programs in Fortran and C/C++
     • Standardizes the last 20 years of symmetric multiprocessing (SMP) practice

  3. Compute π using Numerical Integration
     Let $F(x) = 4 / (1 + x^2)$, so that
     $$\pi = \int_0^1 F(x)\,dx$$
     Approximate the integral as a sum of rectangles:
     $$\sum_{i=0}^{N} F(x_i)\,\Delta x \approx \pi$$
     where each rectangle has width $\Delta x$ and height $F(x_i)$ at the middle of interval $i$.
     [Figure: plot of $F(x) = 4 / (1 + x^2)$ over $0 \le x \le 1$]

  4.-5. Example: π Program with OpenMP

       #include <stdio.h>
       #include <omp.h>                                  // header
       const long N = 100000000;
       #define NUM_THREADS 4                             // #threads
       int main() {
           double sum = 0.0;
           double delta_x = 1.0 / (double) N;
           omp_set_num_threads(NUM_THREADS);             // set #threads
           #pragma omp parallel for reduction(+:sum)     // parallel for
           for (int i = 0; i < N; i++) {
               double x = (i + 0.5) * delta_x;
               sum += 4.0 / (1.0 + x*x);
           }
           double pi = delta_x * sum;
           printf("pi is %f\n", pi);
       }

     How to parallelize the π program on distributed clusters?

  6. Outline
     • Why Spark?
     • Spark Concepts
     • Tour of Spark Operations
     • Job Execution
     • Spark MLlib

  7. Why Spark?

  8.-9. Apache Hadoop Ecosystem
     Component        | Hadoop
     Resource Manager | YARN
     Storage          | HDFS
     Batch            | MapReduce
     Streaming        | Flume
     Columnar Store   | HBase
     SQL Query        | Hive
     Machine Learning | Mahout
     Graph            | Giraph
     Interactive      | Pig
     ... mostly focused on large on-disk datasets: great for batch but slow

  10. Many Specialized Systems
     MapReduce doesn't compose well for large applications, and so specialized systems emerged as workarounds
     Component        | Hadoop    | Specialized
     Resource Manager | YARN      |
     Storage          | HDFS      | RAMCloud
     Batch            | MapReduce |
     Streaming        | Flume     | Storm
     Columnar Store   | HBase     |
     SQL Query        | Hive      |
     Machine Learning | Mahout    | DMLC
     Graph            | Giraph    | PowerGraph
     Interactive      | Pig       |

  11. Goals
     A new ecosystem that:
     • leverages the current generation of commodity hardware
     • provides fault tolerance and parallel processing at scale
     • is easy to use and combines SQL, Streaming, ML, Graph, etc.
     • is compatible with existing ecosystems

  12. Berkeley Data Analytics Stack
     Being built by AMPLab to make sense of Big Data [1]
     Component        | Hadoop    | Specialized | BDAS
     Resource Manager | YARN      |             | Mesos
     Storage          | HDFS      | RAMCloud    | Tachyon
     Batch            | MapReduce |             | Spark
     Streaming        | Flume     | Storm       | Spark Streaming
     Columnar Store   | HBase     |             | Parquet
     SQL Query        | Hive      |             | SparkSQL
     Approximate SQL  |           |             | BlinkDB
     Machine Learning | Mahout    | DMLC        | MLlib
     Graph            | Giraph    | PowerGraph  | GraphX
     Interactive      | Pig       |             | built-in
     [1] https://amplab.cs.berkeley.edu/software/

  13. Spark Concepts

  14.-16. What is Spark?
     Fast and expressive cluster computing system compatible with Hadoop
     • Works with many storage systems: local FS, HDFS, S3, SequenceFile, ...
     Improves efficiency (as much as 30x faster) through:
     • In-memory computing primitives
     • General computation graphs
     Improves usability through rich Scala/Java/Python APIs and an interactive shell (often 2-10x less code)

  17.-21. Main Abstraction - RDDs
     Goal: work with distributed collections as you would with local ones
     Concept: resilient distributed datasets (RDDs)
     • Immutable collections of objects spread across a cluster
     • Built through parallel transformations (map, filter, ...)
     • Automatically rebuilt on failure
     • Controllable persistence (e.g. caching in RAM) for reuse (see the sketch below)
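
     A minimal sketch of controllable persistence in PySpark (assuming an existing SparkContext sc; the input path is made up for illustration):

       from pyspark import StorageLevel

       lines = sc.textFile("hdfs://namenode:9000/logs.txt")   # hypothetical path
       errors = lines.filter(lambda s: s.startswith("ERROR"))

       # Keep the filtered RDD in memory so later actions reuse it;
       # cache() is the shorthand for the default in-memory level
       errors.persist(StorageLevel.MEMORY_ONLY)

       errors.count()   # first action: reads the file and populates the cache
       errors.count()   # second action: served from the in-memory copy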

  22.-24. Main Primitives
     Resilient distributed datasets (RDDs)
     • Immutable, partitioned collections of objects
     Transformations (e.g. map, filter, reduceByKey, join)
     • Lazy operations to build RDDs from other RDDs (illustrated below)
     Actions (e.g. collect, count, save)
     • Return a result or write it to storage
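
     A minimal sketch of the lazy/eager split between transformations and actions (assuming an existing SparkContext sc):

       nums = sc.parallelize([1, 2, 3, 4])

       # Transformations are lazy: nothing runs yet, Spark only records
       # how to build the new RDDs
       doubled = nums.map(lambda x: 2 * x)
       big = doubled.filter(lambda x: x > 4)

       # Actions trigger execution of the whole chain
       print(big.collect())   # => [6, 8]
       print(big.count())     # => 2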

  25.-27. Learning Spark
     Download the binary package and uncompress it
     Interactive shell (easiest way): ./bin/pyspark
     • a modified version of the Scala/Python interpreter
     • runs as an app on a Spark cluster or can run locally
     Standalone programs: ./bin/spark-submit <program>
     • Scala, Java, and Python
     This talk: mostly Python

  28. Example: Log Mining
     Load error messages from a log into memory, then interactively search for various patterns
     DEMO:

       lines = sc.textFile("hdfs://...")   # load from HDFS

       # transformation
       errors = lines.filter(lambda s: s.startswith("ERROR"))

       # transformation
       messages = errors.map(lambda s: s.split('\t')[1])

       messages.cache()

       # action; compute messages now
       messages.filter(lambda s: "life" in s).count()

       # action; reuse cached messages
       messages.filter(lambda s: "work" in s).count()

  29. RDD Fault Tolerance
     RDDs track the series of transformations used to build them (their lineage) to recompute lost data

       msgs = (sc.textFile("hdfs://...")
                 .filter(lambda s: s.startswith("ERROR"))
                 .map(lambda s: s.split('\t')[1]))
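
     A small sketch of inspecting that lineage (msgs as defined above; toDebugString returns a textual description of the RDD and its parent RDDs):

       # The dependency chain Spark would replay to rebuild lost partitions
       print(msgs.toDebugString())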

  30. Spark vs. MapReduce
     • Spark keeps intermediate data in memory
     • Hadoop only supports map and reduce, which may not be efficient for join, group, ... (see the sketch below)
     • Programming in Spark is easier
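
     A minimal sketch of grouping and joining, which are one-liners on Spark pair RDDs but awkward to express as plain map/reduce jobs (assuming an existing SparkContext sc; the data is made up for illustration):

       visits = sc.parallelize([("index.html", "1.2.3.4"),
                                ("about.html", "3.4.5.6"),
                                ("index.html", "1.3.3.1")])
       pages = sc.parallelize([("index.html", "Home"),
                               ("about.html", "About")])

       # Aggregation by key in one transformation
       counts = visits.map(lambda kv: (kv[0], 1)).reduceByKey(lambda a, b: a + b)

       # Relational-style join by key in one transformation
       joined = visits.join(pages)   # => (url, (visitor_ip, page_title)) pairs

       print(counts.collect())
       print(joined.collect())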

  31. Tour of Spark Operations

  32. Spark Context
     • Main entry point to Spark functionality
     • Created for you in the Spark shell as the variable sc
     • In standalone programs, you'd make your own:

       from pyspark import SparkContext

       sc = SparkContext(appName="ExampleApp")
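
     Putting it together, a minimal standalone script that could be launched with ./bin/spark-submit (the file name example_app.py and the toy job are made up for illustration):

       # example_app.py -- run with: ./bin/spark-submit example_app.py
       from pyspark import SparkContext

       if __name__ == "__main__":
           sc = SparkContext(appName="ExampleApp")

           # A tiny job: square some numbers and add them up
           nums = sc.parallelize(range(10))
           total = nums.map(lambda x: x * x).sum()
           print("sum of squares:", total)

           sc.stop()   # release resources when the app is done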

  33. Creating RDDs
     • Turn a local collection into an RDD
       rdd = sc.parallelize([1, 2, 3])
     • Load a text file from the local FS, HDFS, or other storage systems
       sc.textFile("file:///path/file.txt")
       sc.textFile("hdfs://namenode:9000/file.txt")
     • Use any existing Hadoop InputFormat
       sc.hadoopFile(keyClass, valClass, inputFmt, conf)

  34. Basic Transformations

       nums = sc.parallelize([1, 2, 3])

       # Pass each element through a function
       squares = nums.map(lambda x: x*x)             # => {1, 4, 9}

       # Keep elements passing a predicate
       even = squares.filter(lambda x: x % 2 == 0)   # => {4}

       # Map each element to zero or more others
       nums.flatMap(lambda x: range(x))              # => {0, 0, 1, 0, 1, 2}
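
     Returning to the question from the OpenMP slides, a minimal sketch of the same π computation expressed with these primitives (assuming an existing SparkContext sc; sum() is an action that adds up the elements of a numeric RDD):

       N = 100000000
       delta_x = 1.0 / N

       # Distribute the interval indices, map each index to its rectangle
       # height, then add up the heights with an action
       total = (sc.parallelize(range(N))
                  .map(lambda i: 4.0 / (1.0 + ((i + 0.5) * delta_x) ** 2))
                  .sum())

       pi = delta_x * total
       print("pi is %f" % pi)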
