Apache Spark. CS240A, Winter 2016, T. Yang. Some slides are based on P. Wendell's Spark slides.
Parallel Processing using Spark+Hadoop
• Hadoop: distributed file system that connects machines.
• MapReduce: parallel programming style built on a Hadoop cluster.
• Spark: Berkeley's design of the MapReduce programming model.
• The input file is treated as a big list; a file may be divided into multiple parts (splits).
• Map: each record (line) is processed by a Map function, which produces a set of intermediate key/value pairs.
• Reduce: combines the set of values for the same key.
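To make the map/reduce model above concrete, here is a minimal plain-Python sketch (not Spark or Hadoop code; the records and function names are illustrative only):

# A minimal plain-Python sketch of the MapReduce model above.
# Not Spark/Hadoop code; records and function names are illustrative.
from collections import defaultdict

records = ["to be or", "not to be"]           # each record is one line of an input split

def map_fn(line):                             # Map: line -> list of (key, value) pairs
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):                   # Reduce: combine all values for one key
    return (key, sum(values))

grouped = defaultdict(list)                   # "shuffle": group intermediate pairs by key
for line in records:
    for key, value in map_fn(line):
        grouped[key].append(value)

counts = [reduce_fn(k, vs) for k, vs in grouped.items()]
print(counts)                                 # e.g. [('to', 2), ('be', 2), ('or', 1), ('not', 1)]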
Python Examples and List Comprehension

for i in [5, 4, 3, 2, 1]:
    print i
print 'Blastoff!'

>>> lst = [3, 1, 4, 1, 5]
>>> lst[0]
3
>>> len(lst)
5
>>> lst.append(2)
>>> lst.sort()
>>> lst.insert(4, "Hello")
>>> [1] + [2]
[1, 2]

>>> S = [x**2 for x in range(10)]        # [0, 1, 4, 9, 16, ..., 81]
>>> M = [x for x in S if x % 2 == 0]

Python tuples
>>> num = (1, 2, 3, 4)
>>> num + (5,)
(1, 2, 3, 4, 5)

>>> words = 'hello lazy dog'.split()
>>> stuff = [(w.upper(), len(w)) for w in words]
[('HELLO', 5), ('LAZY', 4), ('DOG', 3)]

>>> numset = set([1, 2, 3, 2])       # duplicated entries are deleted
>>> numset = frozenset([1, 2, 3])    # such a set cannot be modified
Python map/reduce

a = [1, 2, 3]
b = [4, 5, 6, 7]
c = [8, 9, 1, 2, 3]

f = lambda x: len(x)
L = map(f, [a, b, c])         # [3, 4, 5]

g = lambda x, y: x + y
reduce(g, [47, 11, 42, 13])   # 113
MapReduce programming with Spark: key concept

Write programs in terms of operations on implicitly distributed datasets (RDDs).

RDD: Resilient Distributed Dataset
• Like a big list: a collection of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure

Operations
• Transformations (e.g. map, filter, groupBy)
• Make sure input/output match
MapReduce vs Spark
• MapReduce: map and reduce tasks operate on key-value pairs.
• Spark: operations work on RDDs.
Language Support

• Standalone programs: Python, Scala, & Java
• Interactive shells: Python & Scala
• Performance: Java & Scala are faster due to static typing, but Python is often fine.

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();
Spark Context and Creating RDDs

# Start with sc, the SparkContext: the main entry point to Spark functionality

# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])

# Load a text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
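A minimal sketch of creating a SparkContext in a standalone script (assumes a local PySpark installation; the master string, app name, and file path are illustrative choices):

# A minimal sketch; assumes PySpark is installed locally.
# "local[2]", the app name, and "file.txt" are illustrative.
from pyspark import SparkContext

sc = SparkContext("local[2]", "CS240A-demo")   # local mode, 2 worker threads

nums = sc.parallelize([1, 2, 3])               # RDD from a Python collection
lines = sc.textFile("file.txt")                # RDD from a text file (local FS, HDFS, or S3)

print(nums.count())
sc.stop()                                      # shut down the context when done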
Spark Architecture
Basic Transformations

> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
> squares = nums.map(lambda x: x*x)              # {1, 4, 9}

# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)    # {4}

# Read a text file and count the lines containing "ERROR"
> lines = sc.textFile("file.log")
> lines.filter(lambda s: "ERROR" in s).count()
Basic Actions

> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
> nums.collect()   # => [1, 2, 3]

# Return first K elements
> nums.take(2)     # => [1, 2]

# Count number of elements
> nums.count()     # => 3

# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)   # => 6

# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")
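A short sketch combining the transformations and actions above (assumes an existing SparkContext sc, e.g. from the pyspark shell); the key point is that transformations are lazy and nothing executes until an action runs:

# Assumes an existing SparkContext named sc.
nums = sc.parallelize(range(1, 11))

squares = nums.map(lambda x: x * x)             # transformation: nothing computed yet
evens = squares.filter(lambda x: x % 2 == 0)    # transformation: still lazy

print(evens.collect())                          # action: triggers the job => [4, 16, 36, 64, 100]
print(evens.reduce(lambda x, y: x + y))         # action: => 220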
Working with Key-Value Pairs

Spark's "distributed reduce" transformations operate on RDDs of key-value pairs.

Python:  pair = (a, b)
         pair[0]   # => a
         pair[1]   # => b

Scala:   val pair = (a, b)
         pair._1   // => a
         pair._2   // => b

Java:    Tuple2 pair = new Tuple2(a, b);
         pair._1   // => a
         pair._2   // => b
Some Key-Value Operations

> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

> pets.reduceByKey(lambda x, y: x + y)   # => {(cat, 3), (dog, 1)}
> pets.groupByKey()                      # => {(cat, [1, 2]), (dog, [1])}
> pets.sortByKey()                       # => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side.
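A small sketch contrasting the two operations above (assumes the pets RDD as defined on this slide); both produce the same per-key totals, but reduceByKey combines values on the map side before the shuffle, while groupByKey ships every value:

# Assumes an existing SparkContext named sc.
pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

by_reduce = pets.reduceByKey(lambda x, y: x + y)   # combines per key on the map side
by_group  = pets.groupByKey().mapValues(sum)       # groups all values first, then sums

print(sorted(by_reduce.collect()))                 # [('cat', 3), ('dog', 1)]
print(sorted(by_group.collect()))                  # [('cat', 3), ('dog', 1)]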
Example: Word Count

> lines = sc.textFile("hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" ")) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda x, y: x + y)

"to be or"  -> ("to", 1), ("be", 1), ("or", 1)
"not to be" -> ("not", 1), ("to", 1), ("be", 1)
after reduceByKey: ("be", 2), ("not", 1), ("or", 1), ("to", 2)
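A possible follow-up to the word count above (assumes the counts RDD just built and that hamlet.txt is reachable): listing the most frequent words with takeOrdered.

# Assumes `counts` is the (word, count) RDD built above.
top10 = counts.takeOrdered(10, key=lambda pair: -pair[1])   # 10 highest counts
for word, n in top10:
    print("%s %d" % (word, n))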
Other Key-Value Operations

> visits = sc.parallelize([("index.html", "1.2.3.4"),
                           ("about.html", "3.4.5.6"),
                           ("index.html", "1.3.3.1")])

> pageNames = sc.parallelize([("index.html", "Home"),
                              ("about.html", "About")])

> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))
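A related sketch (assumes the visits and pageNames RDDs above, plus one extra illustrative page): leftOuterJoin, listed later under More RDD Operators, keeps keys with no match on the right and fills in None.

# Assumes an existing SparkContext named sc; "faq.html" is an illustrative extra key.
visits = sc.parallelize([("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6"),
                         ("faq.html",   "9.9.9.9")])
pageNames = sc.parallelize([("index.html", "Home"),
                            ("about.html", "About")])

print(visits.leftOuterJoin(pageNames).collect())
# Order may vary:
# [('index.html', ('1.2.3.4', 'Home')),
#  ('about.html', ('3.4.5.6', 'About')),
#  ('faq.html',   ('9.9.9.9', None))]     # no matching page name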
Under The Hood: DAG Scheduler

• General task graphs
• Automatically pipelines functions
• Data locality aware
• Partitioning aware, to avoid shuffles

(Diagram: a job over RDDs A-F split into Stage 1 (groupBy), Stage 2 (map, filter), and Stage 3 (join); legend: RDD, cached partition.)
Setting the Level of Parallelism

All the pair RDD operations take an optional second parameter for the number of tasks:

> words.reduceByKey(lambda x, y: x + y, 5)
> words.groupByKey(5)
> visits.join(pageViews, 5)
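A quick check of the effect (assumes an existing SparkContext sc); the second argument sets the number of reduce tasks, which becomes the number of partitions of the result:

# Assumes an existing SparkContext named sc.
words = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

counts = words.reduceByKey(lambda x, y: x + y, 5)   # request 5 reduce tasks
print(counts.getNumPartitions())                    # => 5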
More RDD Operators

• map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
• reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
• sample, take, first, partitionBy, mapWith, pipe, save, ...
Interactive Shell

• The fastest way to learn Spark
• Available in Python and Scala
• Runs as an application on an existing Spark cluster, or can run locally
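What a short interactive session might look like (a sketch; the launch command and file path are illustrative, and the pyspark shell provides sc automatically):

# Launch the Python shell (illustrative):  ./bin/pyspark --master local[4]
# Inside the shell, `sc` already exists:
lines = sc.textFile("README.md")
print(lines.filter(lambda s: "Spark" in s).count())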
… or a Standalone Application

import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile(sys.argv[2])
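One hedged way to run the standalone program above (the script name and paths are illustrative; the hardcoded "local" master means it runs locally):

# Illustrative submission of the script above, saved as wordcount.py:
#   spark-submit wordcount.py input.txt output_dir
# input.txt is read via sys.argv[1]; results are written under output_dir (sys.argv[2]).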