Apache Spark. CS240A, Winter 2016, T. Yang. Some slides are based on P. Wendell's Spark slides.
Parallel Processing using Spark+Hadoop
• Hadoop: distributed file system that connects machines.
• MapReduce: parallel programming style built on a Hadoop cluster.
• Spark: Berkeley's design of the MapReduce programming model.
• The input file is treated as a big list; a file may be divided into multiple parts (splits).
• Map: each record (line) is processed by a Map function, which produces a set of intermediate key/value pairs.
• Reduce: combines the set of values for the same key.
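To make the map/reduce model above concrete, here is a minimal plain-Python sketch (not Spark or Hadoop code; the records and function names are illustrative only):

# A minimal plain-Python sketch of the MapReduce model above.
# Not Spark/Hadoop code; records and function names are illustrative.
from collections import defaultdict

records = ["to be or", "not to be"]           # each record is one line of an input split

def map_fn(line):                             # Map: line -> list of (key, value) pairs
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):                   # Reduce: combine all values for one key
    return (key, sum(values))

grouped = defaultdict(list)                   # "shuffle": group intermediate pairs by key
for line in records:
    for key, value in map_fn(line):
        grouped[key].append(value)

counts = [reduce_fn(k, vs) for k, vs in grouped.items()]
print(counts)                                 # e.g. [('to', 2), ('be', 2), ('or', 1), ('not', 1)]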
Python Examples and List Comprehension

for i in [5, 4, 3, 2, 1]:
    print i
print 'Blastoff!'

>>> lst = [3, 1, 4, 1, 5]
>>> lst[0]
3
>>> len(lst)
5
>>> lst.append(2)
>>> lst.sort()
>>> lst.insert(4, "Hello")
>>> [1] + [2]
[1, 2]

>>> S = [x**2 for x in range(10)]        # [0, 1, 4, 9, 16, ..., 81]
>>> M = [x for x in S if x % 2 == 0]

Python tuples
>>> num = (1, 2, 3, 4)
>>> num + (5,)
(1, 2, 3, 4, 5)

>>> words = 'hello lazy dog'.split()
>>> stuff = [(w.upper(), len(w)) for w in words]
[('HELLO', 5), ('LAZY', 4), ('DOG', 3)]

>>> numset = set([1, 2, 3, 2])       # duplicated entries are deleted
>>> numset = frozenset([1, 2, 3])    # such a set cannot be modified
Python map/reduce

a = [1, 2, 3]
b = [4, 5, 6, 7]
c = [8, 9, 1, 2, 3]

f = lambda x: len(x)
L = map(f, [a, b, c])         # [3, 4, 5]

g = lambda x, y: x + y
reduce(g, [47, 11, 42, 13])   # 113
MapReduce programming with Spark: key concept

Write programs in terms of operations on implicitly distributed datasets (RDDs).

RDD: Resilient Distributed Dataset
• Like a big list: a collection of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure

Operations
• Transformations (e.g. map, filter, groupBy)
• Make sure input/output match
MapReduce vs Spark
• MapReduce: map and reduce tasks operate on key-value pairs.
• Spark: operations work on RDDs.
Language Support

• Standalone programs: Python, Scala, & Java
• Interactive shells: Python & Scala
• Performance: Java & Scala are faster due to static typing, but Python is often fine.

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();
Spark Context and Creating RDDs

# Start with sc, the SparkContext: the main entry point to Spark functionality

# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])

# Load a text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
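A minimal sketch of creating a SparkContext in a standalone script (assumes a local PySpark installation; the master string, app name, and file path are illustrative choices):

# A minimal sketch; assumes PySpark is installed locally.
# "local[2]", the app name, and "file.txt" are illustrative.
from pyspark import SparkContext

sc = SparkContext("local[2]", "CS240A-demo")   # local mode, 2 worker threads

nums = sc.parallelize([1, 2, 3])               # RDD from a Python collection
lines = sc.textFile("file.txt")                # RDD from a text file (local FS, HDFS, or S3)

print(nums.count())
sc.stop()                                      # shut down the context when done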
Spark Architecture
Basic Transformations

> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
> squares = nums.map(lambda x: x*x)              # {1, 4, 9}

# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)    # {4}

# Read a text file and count the lines containing "ERROR"
> lines = sc.textFile("file.log")
> lines.filter(lambda s: "ERROR" in s).count()
Basic Actions

> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
> nums.collect()   # => [1, 2, 3]

# Return first K elements
> nums.take(2)     # => [1, 2]

# Count number of elements
> nums.count()     # => 3

# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)   # => 6

# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")
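A short sketch combining the transformations and actions above (assumes an existing SparkContext sc, e.g. from the pyspark shell); the key point is that transformations are lazy and nothing executes until an action runs:

# Assumes an existing SparkContext named sc.
nums = sc.parallelize(range(1, 11))

squares = nums.map(lambda x: x * x)             # transformation: nothing computed yet
evens = squares.filter(lambda x: x % 2 == 0)    # transformation: still lazy

print(evens.collect())                          # action: triggers the job => [4, 16, 36, 64, 100]
print(evens.reduce(lambda x, y: x + y))         # action: => 220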
Working with Key-Value Pairs

Spark's "distributed reduce" transformations operate on RDDs of key-value pairs.

Python:  pair = (a, b)
         pair[0]   # => a
         pair[1]   # => b

Scala:   val pair = (a, b)
         pair._1   // => a
         pair._2   // => b

Java:    Tuple2 pair = new Tuple2(a, b);
         pair._1   // => a
         pair._2   // => b
Some Key-Value Operations

> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

> pets.reduceByKey(lambda x, y: x + y)   # => {(cat, 3), (dog, 1)}
> pets.groupByKey()                      # => {(cat, [1, 2]), (dog, [1])}
> pets.sortByKey()                       # => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side.
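A small sketch contrasting the two operations above (assumes the pets RDD as defined on this slide); both produce the same per-key totals, but reduceByKey combines values on the map side before the shuffle, while groupByKey ships every value:

# Assumes an existing SparkContext named sc.
pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

by_reduce = pets.reduceByKey(lambda x, y: x + y)   # combines per key on the map side
by_group  = pets.groupByKey().mapValues(sum)       # groups all values first, then sums

print(sorted(by_reduce.collect()))                 # [('cat', 3), ('dog', 1)]
print(sorted(by_group.collect()))                  # [('cat', 3), ('dog', 1)]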
Example: Word Count

> lines = sc.textFile("hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" ")) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda x, y: x + y)

"to be or"  -> ("to", 1), ("be", 1), ("or", 1)
"not to be" -> ("not", 1), ("to", 1), ("be", 1)
after reduceByKey: ("be", 2), ("not", 1), ("or", 1), ("to", 2)
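A possible follow-up to the word count above (assumes the counts RDD just built and that hamlet.txt is reachable): listing the most frequent words with takeOrdered.

# Assumes `counts` is the (word, count) RDD built above.
top10 = counts.takeOrdered(10, key=lambda pair: -pair[1])   # 10 highest counts
for word, n in top10:
    print("%s %d" % (word, n))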
Other Key-Value Operations

> visits = sc.parallelize([("index.html", "1.2.3.4"),
                           ("about.html", "3.4.5.6"),
                           ("index.html", "1.3.3.1")])

> pageNames = sc.parallelize([("index.html", "Home"),
                              ("about.html", "About")])

> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))
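A related sketch (assumes the visits and pageNames RDDs above, plus one extra illustrative page): leftOuterJoin, listed later under More RDD Operators, keeps keys with no match on the right and fills in None.

# Assumes an existing SparkContext named sc; "faq.html" is an illustrative extra key.
visits = sc.parallelize([("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6"),
                         ("faq.html",   "9.9.9.9")])
pageNames = sc.parallelize([("index.html", "Home"),
                            ("about.html", "About")])

print(visits.leftOuterJoin(pageNames).collect())
# Order may vary:
# [('index.html', ('1.2.3.4', 'Home')),
#  ('about.html', ('3.4.5.6', 'About')),
#  ('faq.html',   ('9.9.9.9', None))]     # no matching page name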
Under The Hood: DAG Scheduler

• General task graphs
• Automatically pipelines functions
• Data locality aware
• Partitioning aware, to avoid shuffles

(Diagram: a job over RDDs A-F split into Stage 1 (groupBy), Stage 2 (map, filter), and Stage 3 (join); legend: RDD, cached partition.)
Setting the Level of Parallelism

All the pair RDD operations take an optional second parameter for the number of tasks:

> words.reduceByKey(lambda x, y: x + y, 5)
> words.groupByKey(5)
> visits.join(pageViews, 5)
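A quick check of the effect (assumes an existing SparkContext sc); the second argument sets the number of reduce tasks, which becomes the number of partitions of the result:

# Assumes an existing SparkContext named sc.
words = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

counts = words.reduceByKey(lambda x, y: x + y, 5)   # request 5 reduce tasks
print(counts.getNumPartitions())                    # => 5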
More RDD Operators

• map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
• reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
• sample, take, first, partitionBy, mapWith, pipe, save, ...
Interactive Shell

• The fastest way to learn Spark
• Available in Python and Scala
• Runs as an application on an existing Spark cluster, or can run locally
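What a short interactive session might look like (a sketch; the launch command and file path are illustrative, and the pyspark shell provides sc automatically):

# Launch the Python shell (illustrative):  ./bin/pyspark --master local[4]
# Inside the shell, `sc` already exists:
lines = sc.textFile("README.md")
print(lines.filter(lambda s: "Spark" in s).count())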
… or a Standalone Application

import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile(sys.argv[2])
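One hedged way to run the standalone program above (the script name and paths are illustrative; the hardcoded "local" master means it runs locally):

# Illustrative submission of the script above, saved as wordcount.py:
#   spark-submit wordcount.py input.txt output_dir
# input.txt is read via sys.argv[1]; results are written under output_dir (sys.argv[2]).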