MapReduce
February 13, 2020
Data Science CSCI 1951A, Brown University
Instructor: Ellie Pavlick
HTAs: Josh Levin, Diane Mutako, Sol Zitter

Announcements
  • Project Pitch Presentations
  • SQL Grades, late handins
  • Questions? Concerns?


  1. Input:
       Mapper 1: hello world
       Mapper 2: oh hi there world
       Mapper 3: why hello there , world
       Mapper 4: world ! how the hell are ya ?
     Map Phase:
       Mapper 1: (hello, 1) (world, 1)
       Mapper 2: (oh, 1) (hi, 1) (there, 1) (world, 1)
       Mapper 3: (why, 1) (hello, 1) (there, 1) (,, 1) (world, 1)
       Mapper 4: (world, 1) (!, 1) (how, 1) (the, 1) (hell, 1) (are, 1) (ya, 1)
     Shuffle Phase (“Group By”): guarantees that pairs with the same key are processed together. Use for e.g. uniquing, sorting, etc.
       Reducer 1: (hello, 1) (hello, 1)
       Reducer 2: (world, 1) (world, 1) (world, 1) (world, 1)
       Reducer 3: (oh, 1)
       Reducer 4: (hi, 1)
       Reducer 5: (there, 1) (there, 1)
     Reduce Phase:
       (hello, 2) (world, 4) (oh, 1) (hi, 1) (there, 2)

  4. The same input split across seven mappers (Mapper 1 to Mapper 7):
       world ! how ya ? why hello oh hi hello world there , the hell are there world world
     Map output:
       (there, 1) (world, 1) (the, 1) (hello, 1) (oh, 1) (there, 1) (why, 1) (ya, 1) (,, 1) (!, 1) (hell, 1) (world, 1) (hi, 1) (world, 1) (hello, 1) (?, 1) (world, 1) (how, 1) (are, 1)
     Shuffle:
       Reducer 1: (hello, 1) (hello, 1)
       Reducer 2: (world, 1) (world, 1) (world, 1) (world, 1)
       Reducer 3: (oh, 1)
       Reducer 4: (hi, 1)
       Reducer 5: (there, 1) (there, 1)
     Reduce output:
       (hello, 2) (world, 4) (oh, 1) (hi, 1) (there, 2)

  6. Map Reduce
     Warning: Code Snippets/Pseudocode (Don’t assume this will look exactly like this in the hw)
       //define your mapper function(s)
       def MapFn: (String, String) -> (String, Int) {
         TODO;
       }
       //define your reduce function(s)
       def ReduceFn: (String, List(Int)) -> (String, Int) {
         TODO;
       }
       //define your pipeline
       Table<String, String> table = read(table_path)
       Table<String, Int> output = table.MapFn().ReduceFn();
       write(output)

  8. Map Reduce
     table (input):
       DocID | Text
       1     | hello world
       2     | oh hi there world
       3     | why hello there , world
       4     | world ! how the hell are ya ?
     output:
       Word  | Count
       hello | 2
       world | 4
       oh    | 1
       hi    | 1
       there | 2
     //define your pipeline
     Table<String, String> table = read(table_path)
     Table<String, Int> output = table.MapFn().ReduceFn();
     write(output)

  9. Map Reduce
     Lots of data types: String, Int, Float, Tuples thereof.
       def MapFn: (String, String) -> (String, Int) { TODO; }
       def ReduceFn: (String, List(Int)) -> (String, Int) { TODO; }
       Table<String, String> table = read(table_path)
       Table<String, Int> output = table.MapFn().ReduceFn();
       write(output)

  11. Map Reduce
        // enumerate occurrences of each word, with count of 1
        def MapFn: (String, String) -> (String, Int) {
          for w in input.value().split() {   // input.value() is a String
            emit(w, 1);
          }
        }

  14. Map Reduce
        // sum the total counts of each word
        def ReduceFn: (String, List(Int)) -> (String, Int) {
          sum = 0;
          for c in input.value() {   // input.value(): list of ints (counts)
            sum += c;
          }
          emit(input.key(), sum);    // input.key(): the word
        }

  15. Find the number of occurrences of each word?
      Map: output (word, 1) for every word.
        // enumerate occurrences of each word with count of 1
        def MapFn: (String, String) -> (String, Int) {
          for w in input.value().split() {   // input.value() is a String
            emit(w, 1);
          }
        }
      Reduce: Sum counts for each word.
        // sum the total counts of each word
        def ReduceFn: (String, List(Int)) -> (String, Int) {
          emit(input.key(), sum([c for c in input.value()]));
        }
        // define your pipeline
        def main() {
          Table<String, String> table = read(table_path)
          Table<String, Int> output = table.MapFn().ReduceFn();
          write(output)
        }
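The word-count pseudocode can be tried locally with a small Python sketch. The `run_mapreduce` helper is a hypothetical single-process stand-in for a real MapReduce runtime (it is not any library's API), and `map_fn` / `reduce_fn` are illustrative translations of `MapFn` / `ReduceFn`. The input rows are the four documents from the example table.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Emit (word, 1) for every whitespace-separated token."""
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    """Sum the list of counts collected for one word."""
    yield word, sum(counts)


def run_mapreduce(table, mapper, reducer):
    """Hypothetical local runner: map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for key, value in table:                    # map phase
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)   # shuffle: same key, same list
    output = []
    for key, values in groups.items():          # reduce phase
        output.extend(reducer(key, values))
    return output


table = [
    (1, "hello world"),
    (2, "oh hi there world"),
    (3, "why hello there , world"),
    (4, "world ! how the hell are ya ?"),
]
counts = dict(run_mapreduce(table, map_fn, reduce_fn))
for word in ["hello", "world", "oh", "hi", "there"]:
    print(word, counts[word])
```

For the five words shown on the slides this gives hello 2, world 4, oh 1, hi 1, there 2. A real runtime would spread the map and reduce calls over many machines; only the grouping guarantee is modeled here.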

  18. (non)Clicker Question! Find the number of unique documents that each word occurs in?
      No using sets! (use reducers instead)
        // enumerate occurrences of each word with count of 1
        def MapFn1: String -> (String, Int) { ??? }
        def ReduceFn1: (String, List(Int)) -> (String, Int) { ??? }
        // sum the total counts of each word
        def ReduceFn2: (String, List(Int)) -> (String, Int) { ??? }
        // define your pipeline
        def main() {
          Table<String, String> table = read(table_path)
          Table<String, Int> output = table.MapFn1().ReduceFn1().ReduceFn2();
          write(output)
        }

  27. Documents D1 D2 D3 D4: hello world, why hello world ! how oh hi, hi just saying there , the hell are there world hello world ya ? ? ?
      Mappers (one per document) emit ((DocID, word), 1):
        D1: ((D1, hello), 1) ((D1, world), 1) … ((D1, hello), 1)
        D2, D3: …
        D4: ((D4, world), 1) … ((D4, ?), 1) ((D4, ?), 1)
      Reducers 1 to 4 (first reduce step): (hello, 1) (world, 1) (world, 1) (?, 1)
        Why can’t we use mappers for this step? Same keys won’t necessarily get processed together…
      Reducers 1 to 3 (second reduce step): (hello, 2) (world, 4) (?, 1)

  30. Find the number of unique documents that each word occurs in?
        // emit ((DocID, word), 1) for each occurrence of each word
        def MapFn1: (String, String) -> ((String, String), Int) {
          for w in input.value().split() {
            emit((input.key(), w), 1)
          }
        }
        // ignore the value list! (“unique”): one output per distinct (DocID, word) key
        def ReduceFn1: ((String, String), List(Int)) -> (String, Int) {
          emit(input.key()[1], 1)
        }
        // sum the total counts of each word
        def ReduceFn2: (String, List(Int)) -> (String, Int) {
          sum = 0;
          for c in input.value() { sum += c; }
          emit(input.key(), sum);
        }
        // define your pipeline
        def main() {
          Table<String, String> table = read(table_path)
          Table<String, Int> output = table.MapFn1().ReduceFn1().ReduceFn2();
          write(output)
        }
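A minimal Python sketch of the two-reducer approach follows. The `shuffle` helper is a hypothetical local stand-in for the framework's group-by step, and the three-document input is chosen for illustration. The first reducer ignores its value list, so a word repeated within one document is counted once.

```python
from collections import defaultdict


def shuffle(pairs):
    """Hypothetical local group-by: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def map_fn1(doc_id, text):
    # Key on (doc_id, word) so repeats within one document share a key.
    for word in text.split():
        yield (doc_id, word), 1


def reduce_fn1(key, values):
    # Ignore the value list: emit each (doc_id, word) key exactly once.
    doc_id, word = key
    yield word, 1


def reduce_fn2(word, ones):
    # Sum one entry per document that contains the word.
    yield word, sum(ones)


def unique_doc_counts(table):
    mapped = [pair for doc_id, text in table for pair in map_fn1(doc_id, text)]
    deduped = [pair for key, values in shuffle(mapped).items()
               for pair in reduce_fn1(key, values)]
    return dict(pair for word, ones in shuffle(deduped).items()
                for pair in reduce_fn2(word, ones))


table = [
    ("Doc1", "here are some words"),
    ("Doc2", "words words words"),
    ("Doc3", "here are words"),
]
result = unique_doc_counts(table)
print(result)
```

Here "words" appears five times in total but in three documents, so its count is 3.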

  37. Clicker Question! Find the number of unique documents that each word occurs in?
      Left version (unique documents a word occurs in):
        // enumerate occurrences of each word with count of 1
        def MapFn1: {
          for w in input.value().split() { emit((input.key(), w), 1) }
        }
        def ReduceFn1: {
          emit(input.key()[1], 1)
        }
      Right version (???):
        // enumerate occurrences of each word with count of 1
        def MapFn1: {
          for w in input.value().split() { emit(input.key(), w) }
        }
        def ReduceFn1: {
          for w in input.value() { emit(w, 1) }
        }
      Both versions end with the same second reducer:
        // sum the total counts of each word
        def ReduceFn2: (S, List(I)) -> (S, I) {
          sum = 0;
          for c in input.value() { sum += c; }
          emit(input.key(), sum);
        }
      Do these produce the same output? (a) Yes (b) No

  40. Clicker Question!
      Input (K: V):
        Doc1: here are some words
        Doc2: words words words
        Doc3: here are words
        def MapFn1: (S, S) -> (S, S) {
          for w in input.value().split() { emit(input.key(), w) }
        }
        def ReduceFn1: (S, List(S)) -> (S, I) {
          for w in input.value() { emit(w, 1) }
        }
        def ReduceFn2: (S, List(I)) -> (S, I) {
          sum = 0;
          for c in input.value() { sum += c; }
          emit(input.key(), sum);
        }
      What will this produce?
        (a) here:2, are:2, some:1, words:3
        (b) here:2, are:2, some:1, words:5
        (c) here:1, are:1, some:1, words:1
      Reducer is by DocId only, so just counts total occurrences.
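The DocID-keyed pipeline from this question can be traced with a short Python sketch. The `shuffle` helper is again a hypothetical local group-by rather than a real framework call. Because the first shuffle groups by document only, repeated words inside a document are never collapsed.

```python
from collections import defaultdict


def shuffle(pairs):
    """Hypothetical local group-by: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def map_fn1(doc_id, text):
    # Key is the document ID only; the word travels as the value.
    for word in text.split():
        yield doc_id, word


def reduce_fn1(doc_id, words):
    # Every word in the document's list is re-emitted, duplicates included.
    for word in words:
        yield word, 1


def reduce_fn2(word, ones):
    yield word, sum(ones)


table = [
    ("Doc1", "here are some words"),
    ("Doc2", "words words words"),
    ("Doc3", "here are words"),
]
mapped = [pair for doc_id, text in table for pair in map_fn1(doc_id, text)]
per_word = [pair for doc_id, words in shuffle(mapped).items()
            for pair in reduce_fn1(doc_id, words)]
totals = dict(pair for word, ones in shuffle(per_word).items()
              for pair in reduce_fn2(word, ones))
print(totals)
```

The result counts total occurrences (words: 5), not unique documents (words: 3).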

  43. Other MapReduce Functions
      • Sort
      • Unique
      • Sample
      • First
      • Filter
      • Join
      Joins are usually computed “under the hood” by most MR implementations (like in SQL). But you can imagine having to do them yourself…
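Two of the listed functions, Unique and Filter, can be sketched as ordinary mapper/reducer pairs. This is an illustrative sketch under assumptions, not how any particular MR implementation provides them: `run_mapreduce` is a hypothetical local runner, and the records and the `isalpha` predicate are made up for the example.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Hypothetical local runner: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return [pair for key, values in groups.items()
            for pair in reducer(key, values)]


# Unique: make the record itself the key, then emit each key once.
def unique_map(_doc_id, word):
    yield word, None


def unique_reduce(word, _values):
    yield word, None


# Filter: the mapper drops records that fail a predicate; the reducer passes through.
def filter_map(doc_id, word):
    if word.isalpha():
        yield doc_id, word


def identity_reduce(key, values):
    for value in values:
        yield key, value


records = [(1, "hello"), (1, "world"), (2, "hello"), (2, "!"), (3, "world")]
unique_words = [word for word, _ in run_mapreduce(records, unique_map, unique_reduce)]
kept = run_mapreduce(records, filter_map, identity_reduce)
print(sorted(unique_words))
print(kept)
```

Unique relies on the shuffle guarantee that equal keys meet at one reducer; Filter needs no such guarantee, so in practice it could run as a map-only step.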

  46. Real Life Application
      Is Charles Mingus a composer?
      “Mingus is a composer”

  50. Real Life Application
      Is Charles Mingus a 1950s American jazz composer?
      “Mingus is a 1950s American jazz composer”
      • … if Mingus is a composer worthy of our attention, it must be because…
      • Mingus dominated the scene back in the 1950s and 1960s.
      • Mingus was truly a product of America in all its historic complexities…
      • A virtuoso bassist and composer, Mingus irrevocably changed the face of jazz…

  51. Real Life Application
      ComposerX dominated the scene back in the 1950s and 1960s.
      ComposerX is a 1950s composer.

  53. Joins
      Facts:
        Subject        | Predicate  | Object
        Barack Obama   | won        | the electoral vote
        Kamala Lopez   | wrote      | an op-ed for HuffPo
        Charles Mingus | wrote      | jazz
        Barack Obama   | opposed    | the appropriations bill
        Barack Obama   | listens to | jazz
      Categories:
        Category                   | Entity
        Person                     | Barack Obama
        Person                     | Kamala Lopez
        Person                     | Charles Mingus
        Huffington Post Columnists | Barack Obama
        Huffington Post Columnists | Kamala Lopez
        US Presidents              | Barack Obama
        Jazz Composers             | Charles Mingus
      Desired output:
        Subject      | Predicate | Object              | Categories
        Barack Obama | won       | the electoral vote  | Person, US_Presidents, Huffington_Post_Columnists
        Kamala Lopez | wrote     | an op-ed for HuffPo | Person, Huffington_Post_Columnists, Actor

  58. Joins (tables: Facts, Categories)
        Select * from Facts, Categories
        Where Subject == Entity
        GroupBy Subject
      Result, one record per entity:
        Key: String (the Entity)
        Value: (list_of((String, String, String)), list_of((String, String)))
          first list: all the facts for that entity
          second list: all the categories for that entity

  61. Joins
        // rekey table by entity
        def MapFn1: (String, Obj) -> (String, Obj) {
          emit(input.value().entity(), input.value())
        }
        // rekey table by subject
        def MapFn2: (String, Obj) -> (String, Obj) {
          emit(input.value().subject(), input.value())
        }
        // define your pipeline
        def main() {
          Table<String, Obj> cats = read(table1_path).MapFn1()
          Table<String, Obj> facts = read(table2_path).MapFn2()
          output = cats.join(facts) .MapFn3(. . .
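To make the rekey-then-join idea concrete, here is a hedged Python sketch of doing the join by hand. It is an assumption-laden illustration, not the `join` that an MR implementation provides under the hood: `rekey_facts`, `rekey_categories`, and `join` are hypothetical names, and the predicate/object split of each fact row is a guess at the slide's column boundaries.

```python
from collections import defaultdict

facts = [
    ("Barack Obama", "won", "the electoral vote"),
    ("Kamala Lopez", "wrote", "an op-ed for HuffPo"),
    ("Charles Mingus", "wrote", "jazz"),
    ("Barack Obama", "opposed", "the appropriations bill"),
    ("Barack Obama", "listens to", "jazz"),
]
categories = [
    ("Person", "Barack Obama"),
    ("Person", "Kamala Lopez"),
    ("Person", "Charles Mingus"),
    ("Huffington Post Columnists", "Barack Obama"),
    ("Huffington Post Columnists", "Kamala Lopez"),
    ("US Presidents", "Barack Obama"),
    ("Jazz Composers", "Charles Mingus"),
]


def rekey_categories(row):
    """Counterpart of MapFn1: key each category row by its entity."""
    category, entity = row
    yield entity, row


def rekey_facts(row):
    """Counterpart of MapFn2: key each fact row by its subject."""
    subject, predicate, obj = row
    yield subject, row


def join(fact_rows, category_rows):
    """Group both rekeyed tables by key: key -> (facts, categories)."""
    grouped = defaultdict(lambda: ([], []))
    for row in fact_rows:
        for key, value in rekey_facts(row):
            grouped[key][0].append(value)
    for row in category_rows:
        for key, value in rekey_categories(row):
            grouped[key][1].append(value)
    return dict(grouped)


joined = join(facts, categories)
obama_facts, obama_categories = joined["Barack Obama"]
print(len(obama_facts), [category for category, _ in obama_categories])
```

Each value has the shape described earlier: a list of (subject, predicate, object) triples and a list of (category, entity) pairs for one entity.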

  62. Bottlenecks!

  63. Doc1 Doc2 … DocN
      Mappers: (DocID, Doc) -> (DocID, Sent)
      Sent1 Sent2 … SentM
      Mappers: (DocID, Sent) -> (Word, Count)
      Word1 Word2 … WordK
      Reducers: (Word, Count) -> Word, sum(Count) ✔

  65. Doc1 Doc2 … DocN
      Mappers: (DocID, Doc) -> (DocID, Sent)
        Clicker Question! In the best-case scenario, how much parallelization could we get here (maximum number of mappers)?
        (a) N (b) log(N) (c) 5
      Sent1 Sent2 … SentM
      Mappers: (DocID, Sent) -> (Word, Count)
      Word1 Word2 … WordK
      Reducers: (Word, Count) -> Word, sum(Count) ✔

  69. Doc1 Doc2 … DocN
      Mappers: (DocID, Doc) -> (DocID, Sent)
      Sent1 Sent2 … SentM
      Mappers: (DocID, Sent) -> (Word, Count)
        Clicker Question! How about here?
        (a) N (b) M (c) N*M
        Mapping doesn’t require the same keys to route to the same machine.
      Word1 Word2 … WordK
      Reducers: (Word, Count) -> Word, sum(Count) ✔
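The point that mappers need no key-based routing can be illustrated with a small Python sketch. It uses a thread pool as a stand-in for separate machines, and the two documents of two sentences each are made-up data (so N = 2, M = 2). Every (DocID, Sent) record is mapped independently; only the final sum groups by word.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Assumed toy input: N = 2 documents, M = 2 sentences per document.
docs = {
    "Doc1": ["hello world", "oh hi there world"],
    "Doc2": ["why hello there", "world how are ya"],
}


def split_doc(doc_id, sentences):
    """First mapper stage: (DocID, Doc) -> (DocID, Sent) records."""
    return [(doc_id, sent) for sent in sentences]


def count_words(record):
    """Second mapper stage: (DocID, Sent) -> list of (Word, 1) pairs."""
    doc_id, sent = record
    return [(word, 1) for word in sent.split()]


sentence_records = [rec for doc_id, sents in docs.items()
                    for rec in split_doc(doc_id, sents)]

# Each sentence record can go to its own worker, whatever its DocID.
with ThreadPoolExecutor(max_workers=len(sentence_records)) as pool:
    mapped = list(pool.map(count_words, sentence_records))

# Reduce: this is the step that needs equal words brought together.
totals = Counter()
for pairs in mapped:
    for word, count in pairs:
        totals[word] += count

print(len(sentence_records), dict(totals))
```

With this input there are N*M = 4 independent sentence records for the second mapper stage.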

  71. Clicker Question! Which is (likely to be) faster?
      Doc = list_of(Sentence); Sentence = list_of(Word)
      (a) Mapper1: (DocID, Doc) -> (DocID, Sent)
          Mapper2: (DocID, Sent) -> (Word, Count)
          Reducer: (Word, Count) -> Word, sum(Count)
      (b) Mapper: (DocID, Doc) -> (Word, Count)
          Reducer: (Word, Count) -> Word, sum(Count)
      (c) They are the same
