
Programming MapReduce in Mathematica

Paul-Jean Letourneau, Data Scientist, Wolfram Research
Commercial Users of Functional Programming, Sept 22, 2013


  1. Programming MapReduce in Mathematica. Paul-Jean Letourneau, Data Scientist, Wolfram Research. Commercial Users of Functional Programming, Sept 22, 2013

  2. personal analytics

  3. (no text on this slide)

  4. experimental computation

  5. (no text on this slide)

  6. bioinformatics

  7. genomics

  8. distributed computation

  9. overview
      core principles of Mathematica
      examples
      programming MapReduce with Mathematica

  10. the fundamental principles
      1. everything is an expression
      2. expressions are transformed until they stop changing
      3. transformation rules are patterns

  11. 1. everything is an expression
      expressions are data structures
      Mathematica expression: head[arg1, arg2, ...]
      LISP expr: (head arg1 arg2 ...)

  12. 1. everything is an expression
      FullForm
      1 + 1
      Out: 2
      FullForm[Unevaluated[1 + 1]]
      Out: Unevaluated[Plus[1, 1]]
      FullForm[Unevaluated[1 + 1 - 3 a]]
      Out: Unevaluated[Plus[1, 1, Times[-1, Times[3, a]]]]

  13. 1. everything is an expression ... with lots of syntactic sugar
      # + 1 & /@ Range[10]
      Out: {2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
      FullForm[Unevaluated[# + 1 & /@ Range[10]]]
      Out: Unevaluated[Map[Function[Plus[Slot[1], 1]], Range[10]]]
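Because every expression has the shape head[arg1, arg2, ...], its parts can be inspected and swapped like the fields of any other data structure. A minimal sketch of that idea follows; the symbols f, g, a, b, c, x and the variable expr are illustrative only and are cleared first so that earlier definitions do not interfere.

```mathematica
ClearAll[f, g, a, b, c, x, expr];   (* illustrative symbols, not from the talk *)

expr = f[a, g[b, c]];

Head[expr]            (* f *)
Length[expr]          (* 2 *)
expr[[2]]             (* g[b, c] *)
expr[[2, 1]]          (* b *)
Apply[List, expr]     (* {a, g[b, c]} : the head f is replaced by List *)

(* atoms have heads too *)
Head /@ {1, 1.5, "s", x, {1, 2}}   (* {Integer, Real, String, Symbol, List} *)
```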

  14. 2. expressions are transformed until they stop changing
      definitions are rules
      Clear[a]; a = 1; a
      Out: 1

  15. 2. expressions are transformed until they stop changing
      rules transform expressions: infinite evaluation
      OwnValues[a]
      Out: {HoldPattern[a] :> 1}
      a // Trace
      Out: {a, 1}
      Clear[b]; a = 1; a + b + 1 // Trace
      Out: {{a, 1}, 1 + b + 1, 2 + b}
      b = 2; a + b + 1 // Trace
      Out: {{a, 1}, {b, 2}, 1 + 2 + 1, 4}
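The "transform until nothing changes" behaviour can be imitated with explicit rules, which may make the idea easier to see than a Trace. The sketch below is only an analogy for the evaluator, not a description of its internals; the symbols a, b, c and the names rules, onePass, repeated and steps are illustrative.

```mathematica
ClearAll[a, b, c];
rules = {a -> b, b -> c};

onePass  = a /. rules     (* b : the rules are applied once *)
repeated = a //. rules    (* c : the rules are reapplied until the result stops changing *)

(* the same process, keeping every intermediate result *)
steps = FixedPointList[# /. rules &, a]   (* {a, b, c, c} *)
```

The last element appears twice because FixedPointList stops only after seeing two identical successive results.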

  16. 3. rules are patterns
      rules have patterns
      a = 1; OwnValues[a]
      Out: {HoldPattern[a] :> 1}

  17. 3. rules are patterns
      functions are rules
      Clear[f, g, a, b]; f[x_Integer] := x + 1
      DownValues[f] // Column
      Out: HoldPattern[f[x_Integer]] :> x + 1
      Head[1]
      Out: Integer
      f[1]
      Out: 2
      f["a"]
      Out: f[a]
      Head["a"]
      Out: String

  18. 3. rules are patterns
      ordering of rules
      f[1] := 1000
      DownValues[f] // Column
      Out: HoldPattern[f[1]] :> 1000
           HoldPattern[f[x_Integer]] :> x + 1
      f /@ {0, 1, 2, 3, 4, 5}
      Out: {1, 1000, 3, 4, 5, 6}
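The ordering shown on this slide does not depend on the order in which the definitions were typed: more specific patterns are stored ahead of more general ones. A small sketch with a hypothetical function g, where the general rule is entered first:

```mathematica
ClearAll[g];               (* g is a throwaway symbol for this illustration *)

g[x_] := "general";        (* entered first *)
g[0]  := "zero";           (* entered second, but stored ahead of the general rule *)

DownValues[g]
(* {HoldPattern[g[0]] :> "zero", HoldPattern[g[x_]] :> "general"} *)

results = g /@ {0, 1}      (* {"zero", "general"} *)
```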

  19. program as data
      expressions are immutable
      10 = 1
      Set::setraw: Cannot assign to raw object 10. >>
      Out: 1
      Plus[1, 1] = 3
      Set::write: Tag Plus in 1 + 1 is Protected. >>
      Out: 3
      a = 10
      Out: 10
      a = 1
      Out: 1

  20. program as data
      homoiconicity: expressions ARE the data structure
      Clear[a]; TreeForm[Unevaluated[1 + 1 - 3 a]]
      Out: (tree diagram: Plus with children 2 and Times; Times with children -3 and a)

  21. examples
      Fibonacci sequence
      fib[n_] := fib[n] = fib[n - 2] + fib[n - 1];
      fib[1] = 1; fib[2] = 1;
      Table[fib[n], {n, 1, 10}]
      Out: {1, 1, 2, 3, 5, 8, 13, 21, 34, 55}
      ListLogLogPlot[Table[fib[n], {n, 1, 100}]]
      Out: (log-log plot; vertical ticks from 10^4 to 10^20, horizontal ticks from 2 to 100)
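The definition fib[n_] := fib[n] = ... memoizes: each call stores its result as a new, more specific rule, so later calls become lookups. One way to watch this happen is to count the stored rules before and after a call. The names before and after are illustrative only.

```mathematica
ClearAll[fib];
fib[1] = 1; fib[2] = 1;
fib[n_] := fib[n] = fib[n - 2] + fib[n - 1];

before = Length[DownValues[fib]]   (* 3 : two base cases plus the general rule *)
fib[10]                            (* 55 *)
after = Length[DownValues[fib]]    (* 11 : fib[3] through fib[10] are now stored as rules *)
```

Without the inner assignment the recursion would recompute the same values exponentially many times, which is why the slide can tabulate 100 terms instantly.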

  22. examples
      scrape a web page
      Grid @ Partition[
        Show[Import @ #, ImageSize -> 50] & /@
          Union @ Flatten @ Table[
            Cases[
              Import["http://cufp.org/conference/sessions/2013?page=" <> IntegerString @ n, "XMLObject"],
              s_String /; StringMatchQ[s, RegularExpression[".*\\.jpg"]], Infinity],
            {n, 0, 3}],
        5, 5, 1, {}]

  23. examples
      “everything is a one-liner in Mathematica ... for a sufficiently long line.” (Theo Gray)
      Show[
        ImageAssemble[
          Round[Rescale[ImageData[
            i = Nest[Darker, ImageResize[ExampleData[{"TestImage", "Elaine"}], 50], 3]]] 9]
            /. n_Integer :> Nest[Lighter, i, n]],
        ImageSize -> 400]

  24. gateway drug ... to declarative programming
      y = 0; For[i = 1, i <= 10, i++, y += i^2]; y
      Out: 385
      Fold[#1 + #2^2 &, 0, Range[10]]
      Out: 385
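The same sum of squares can be written in even more declarative ways, and FoldList exposes the running state that the For loop kept in y. These variants are offered as a sketch of the style rather than as part of the talk; the names viaTotal, viaSum and running are illustrative.

```mathematica
viaTotal = Total[Range[10]^2]        (* 385 : square the whole list, then add *)
viaSum   = Sum[k^2, {k, 1, 10}]      (* 385 *)

(* every intermediate value of the accumulator *)
running = FoldList[#1 + #2^2 &, 0, Range[10]]
(* {0, 1, 5, 14, 30, 55, 91, 140, 204, 285, 385} *)
```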

  25. advanced topics
      scoping
      evaluation control
      MathLink protocol

  26. MapReduce
      MapReduce in a nutshell
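For readers without the slide's diagram, the three phases (map, shuffle, reduce) can be sketched in a few lines that run in a single kernel with no Hadoop involved. Everything here is an assumption made for illustration: localMapReduce, identityMapper and maxReducer are hypothetical names, the data is made up, and the mapper returns a list of pairs instead of calling HadoopLink's Yield.

```mathematica
ClearAll[localMapReduce, identityMapper, maxReducer];

(* hypothetical helper, not part of HadoopLink *)
localMapReduce[mapper_, reducer_, pairs_List] :=
  Module[{mapped, grouped},
    (* map: each input {key, value} produces a list of intermediate {key, value} pairs *)
    mapped = Join @@ (mapper @@@ pairs);
    (* shuffle: bring together all pairs that share an intermediate key *)
    grouped = GatherBy[mapped, First];
    (* reduce: one call per distinct key, with the list of that key's values *)
    reducer[#[[1, 1]], #[[All, 2]]] & /@ grouped
  ];

identityMapper[k_, v_] := {{k, v}};
maxReducer[k_, vs_List] := {k, Max[vs]};

(* made-up readings keyed by year *)
readings = {{"2012", 31}, {"2013", 28}, {"2012", 35}, {"2013", 30}};

result = localMapReduce[identityMapper, maxReducer, readings]
(* {{"2012", 35}, {"2013", 30}} *)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data between them; only the data flow is the same.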

  27. HadoopLink
      WordCount
      textRaw = Import["http://www.gutenberg.org/cache/epub/1342/pg1342.txt"];
      StringTake[textRaw, 200]
      Out: The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away o
      Reverse @ SortBy[Tally[StringSplit[textRaw, RegularExpression["[\\W_]+"]]], Last] // Short
      Out: {{the, 4218}, {to, 4187}, {of, 3705}, <<7101>>, {10, 1}, {000, 1}}

  28. HadoopLink
      create key-value pairs
      paras = StringSplit[textRaw, RegularExpression["\n{2,}"]];
      paraPairs = Transpose[{paras, Table[1, {Length @ paras}]}];
      Grid[{#}, Frame -> All, Background -> {{LightGreen, LightRed}}] & /@ paraPairs[[1 ;; 4]] // Column
      Out: (one framed row per pair: paragraph | value)
      The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen | 1
      This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org | 1
      Title: Pride and Prejudice | 1
      Author: Jane Austen | 1

  29. HadoopLink
      export to the Hadoop filesystem
      << HadoopLink`
      $$link = OpenHadoopLink[
        "fs.default.name" -> "hdfs://hadoopheadlx.wolfram.com:8020",
        "mapred.job.tracker" -> "hadoopheadlx.wolfram.com:8021"];
      inputfile["pap"] = "/user/paul-jean/hadooplink/pap-paras.seq";
      DFSExport[$$link, inputfile["pap"], paraPairs, "SequenceFile"]
      Out: /user/paul-jean/hadooplink/pap-paras.seq
      Grid[Partition[Names["HadoopLink`*"], 4], Alignment -> Left, BaseStyle -> {FontSize -> 14}]
      Out:
      DFSAbsoluteFileName    DFSCloseSequenceStream  DFSCopyDirectory    DFSCopyFile
      DFSCopyFromLocal       DFSCopyToLocal          DFSCreateDirectory  DFSDeleteDirectory
      DFSDeleteFile          DFSDirectoryQ           DFSExport           DFSFileByteCount
      DFSFileDate            DFSFileExistsQ          DFSFileNames        DFSFileQ
      DFSFileType            DFSImport               DFSOpenSequenceStream  DFSReadList
      DFSRenameDirectory     DFSRenameFile           DFSSequenceStream   HadoopLink
      HadoopMapReduceJob     IncrementCounter        OpenHadoopLink      Yield

  30. HadoopLink
      mapper
      WordCountMapper = Function[{k, v},
        With[{words = ToLowerCase /@ StringSplit[k, RegularExpression["[\\W_]+"]]},
          Yield[#, 1] & /@ words
        ]
      ];

  31. HadoopLink
      reducer
      SumReducer = Function[{k, vs},
        Module[{sum = 0},
          While[vs @ hasNext[], sum += vs @ next[]];
          Yield[k, sum]
        ]
      ];
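The mapper logic can be tried out without a cluster by substituting a local stand-in for Yield. The sketch below assumes nothing about how HadoopLink implements Yield: yield is a hypothetical replacement that records each pair with Sow, wordCountMapperLocal is a copy of the mapper that calls it, the input paragraph is made up, and the final line stands in for the shuffle and SumReducer using an ordinary list instead of a Java iterator.

```mathematica
ClearAll[yield, wordCountMapperLocal];

(* hypothetical local stand-in for HadoopLink's Yield *)
yield[k_, v_] := Sow[{k, v}];

wordCountMapperLocal = Function[{k, v},
   yield[#, 1] & /@ StringSplit[ToLowerCase[k], RegularExpression["[\\W_]+"]]];

(* run the mapper on one made-up paragraph and collect everything it emitted *)
emitted = Flatten[Last[Reap[wordCountMapperLocal["It is a truth, it is", 1]]], 1]
(* {{"it", 1}, {"is", 1}, {"a", 1}, {"truth", 1}, {"it", 1}, {"is", 1}} *)

(* stand-in for shuffle + SumReducer: add up the values of each key *)
counts = {#[[1, 1]], Total[#[[All, 2]]]} & /@ GatherBy[emitted, First]
(* {{"it", 2}, {"is", 2}, {"a", 1}, {"truth", 1}} *)
```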

  32. HadoopLink
      run the job
      inputfile["pap"] = "/user/paul-jean/hadooplink/pap-paras.seq";
      outputdir["pap"] = "/user/paul-jean/hadooplink/pap-wordcount";
      HadoopMapReduceJob[
        $$link, "pap wordcount",
        inputfile["pap"], outputdir["pap"],
        WordCountMapper, SumReducer]

  33. HadoopLink
      control flow

  34. genome search engine
      prep data
      mtseq = GenomeData[{"Mitochondrion", {1, -1}}];
      StringTake[mtseq, 30]
      Out: GATCACAGGTCTATCACCCTATTAACCACT
      querybases = "GCACACACACA";
      StringPosition[mtseq, querybases]
      Out: {{515, 525}}

  35. genome search engine
      create key-value pairs
      mtchars = Characters[mtseq];
      mtbases = Transpose[{mtchars, Range @ Length @ mtchars}];
      Grid[{#}, Frame -> All, Background -> {{LightGreen, LightRed}}] & /@ mtbases[[1 ;; 20]]
      Out: {G 1, A 2, T 3, C 4, A 5, C 6, A 7, G 8, G 9, T 10, C 11, T 12, A 13, T 14, C 15, A 16, C 17, C 18, C 19, T 20}

  36. genome search engine
      mapper
      querybases = "GCACACACACA";
      GenomeSearchMapper[qchunks : {__String}] :=
        Function[{base, genomepos},
          Module[{pos, querypositions},
            querypositions = Flatten @ Position[qchunks, base];
            With[{querypos = #},
              Yield[genomepos - (querypos - 1), querypos]
            ] & /@ querypositions
          ]
        ]

  37. genome search engine
      mapper
      (diagram; each row gives the genome position, the genome base, then the query position and query base pairs drawn beside it)
      507 C   1 G
      508 C   2 C
      509 T   1 G   3 A
      510 A   2 C   4 C
      511 C   1 G   3 A   5 A
      512 C   2 C   4 C   6 C
      513 C   1 G   3 A   5 A   7 A
      514 A   2 C   4 C   6 C   8 C
      515 G   1 G   3 A   5 A   7 A   9 A
      516 C   2 C   4 C   6 C   8 C   10 C
      517 A   3 A   5 A   7 A   9 A   11 A
      518 C   4 C   6 C   8 C   10 C
      519 A   5 A   7 A   9 A   11 A
      520 C   6 C   8 C   10 C
      521 A   7 A   9 A   11 A
      522 C   8 C   10 C
      523 A   9 A   11 A
      524 C   10 C
      525 A   11 A
      526 C
      527 C
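The mapper's key, genomepos - (querypos - 1), is the genome position at which the query would have to start for that base to line up with that query character. The reducer is not part of this excerpt, so the sketch below makes an assumption about it: a start position counts as a hit when it has collected one vote from every query position. The sequence, the query and all variable names are toy values chosen for illustration, and the whole thing runs in one kernel without HadoopLink.

```mathematica
(* toy data, not from the talk *)
genome = "TTGCAGCACA";
query  = "GCA";
qchars = Characters[query];

(* map: a base at genome position p that equals the query character at
   position q votes for an alignment starting at p - (q - 1) *)
votes = Join @@ MapIndexed[
    Function[{base, idx},
      {idx[[1]] - (# - 1), #} & /@ Flatten[Position[qchars, base]]],
    Characters[genome]];

(* shuffle + assumed reduce: keep the start positions that received
   a vote from every query position *)
starts = Sort[
  Select[GatherBy[votes, First], Length[#] == Length[qchars] &][[All, 1, 1]]]
(* {3, 6} *)

(* direct search, for comparison *)
expected = StringPosition[genome, query][[All, 1]]   (* {3, 6} *)
```

Counting votes is sufficient here because each combination of start position and query position can be produced by only one genome position, so a full count means every query character matched.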
