MapReduce Design Patterns

• This section is based on the book by Jimmy Lin and Chris Dyer.
• Now let's look at important program "design patterns" for MapReduce.
• Programmers can control program execution only through their implementation of the mapper, reducer, combiner, and partitioner.
• There are no explicit synchronization primitives.
• So how can a programmer control execution and data flow?

Taking Control of MapReduce

• Store and communicate partial results through complex data structures for keys and values.
• Run appropriate initialization code at the beginning of a task and termination code at the end of a task.
• Preserve state in mappers and reducers across multiple input splits and intermediate keys, respectively.
• Control the sort order of intermediate keys to control the processing order at reducers.
• Control the set of keys assigned to a reducer.
• Use a "driver" program.

(1) Local Aggregation

• Reduce the size of intermediate results passed from mappers to reducers.
  – Important for scalability: recall Amdahl's Law.
• Various options using the combiner function and the ability to preserve mapper state across multiple inputs.
• For example, consider Word Count with the document-based version of Map.

Word Count Baseline Algorithm

  map(docID a, doc d)
    for all term t in doc d do
      Emit(term t, count 1)

  reduce(term t, counts [c1, c2, ...])
    sum = 0
    for all count c in counts do
      sum += c
    Emit(term t, count sum)

• Problem: frequent terms are emitted many times with count 1.

Tally Counts Per Document

  map(docID a, doc d)
    H = new hashMap
    for all term t in doc d do
      H{t}++
    for all term t in H do
      Emit(term t, count H{t})

• Same Reduce function as before.
• Limitation: Map only aggregates counts within a single document.
• Depending on split size and document size, a Map task might receive many documents.
• Can we aggregate across all documents in the same Map task?
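The per-document tally above translates directly into Hadoop's Java API. The following is a minimal sketch, assuming each input value holds one whole document and that simple whitespace splitting is an acceptable tokenizer; the class name PerDocumentTallyMapper is an illustrative choice, not from the slides:

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Word Count Map that tallies counts per document: emits (term, n) once
  // per distinct term in the document instead of (term, 1) per occurrence.
  public class PerDocumentTallyMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {

      @Override
      protected void map(LongWritable docId, Text doc, Context context)
              throws IOException, InterruptedException {
          Map<String, Integer> counts = new HashMap<>();  // H = new hashMap
          for (String term : doc.toString().split("\\s+")) {
              if (!term.isEmpty()) {
                  counts.merge(term, 1, Integer::sum);  // H{t}++
              }
          }
          // Emit one aggregated count per distinct term in this document.
          for (Map.Entry<String, Integer> e : counts.entrySet()) {
              context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
          }
      }
  }

The baseline sum Reducer works unchanged with this Mapper, since it still receives plain (term, count) pairs.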

Tally Counts Across Documents

  Class Mapper {
    initialize() {
      H = new hashMap
    }

    map(docID a, doc d) {
      for all term t in doc d do
        H{t}++
    }

    close() {
      for all term t in H do
        Emit(term t, count H{t})
    }
  }

• Data structure H is a private member of the Mapper class.
  – Local to a single task, i.e., it does not introduce task synchronization issues.
• Initialize is called when the task starts, i.e., before all map calls.
  – Configure() in the old API, Setup() in the new API.
• Close is called after the last document from the Map task has been processed.
  – Close() in the old API, Cleanup() in the new API.

Design Pattern for Local Aggregation

• In-mapper combining
  – Done by preserving state across map calls in the same task (see the Hadoop sketch after these slides).
• Advantages over using combiners:
  – A combiner does not guarantee if, when, or how often it is executed.
  – A combiner combines data after it was generated; in-mapper combining avoids generating it in the first place!
• Drawbacks:
  – Introduces complexity and hence a higher probability of bugs.
  – Higher memory consumption for managing state.
    • Might have to write memory-management code to page data to disk.

(2) Counting of Combinations

• Needed for computing correlations, associations, and confusion matrices (how many times does a classifier confuse Y_i with Y_j).
• Co-occurrence matrix for a text corpus: how many times do two terms appear near each other?
• Main idea: compute partial counts for some combinations, then aggregate them.
  – At what granularity should Map work?

Pairs Design Pattern

  map(docID a, doc d)
    for all term w in doc d do
      for all term u NEAR w do
        Emit(pair (w, u), count 1)

  reduce(pair p, counts [c1, c2, ...])
    sum = 0
    for all count c in counts do
      sum += c
    Emit(pair p, count sum)

• Can use a combiner or in-mapper combining.
• Good: easy to implement and understand.
• Bad: huge intermediate-key space.
  – Quadratic in the number of distinct terms.

Stripes Design Pattern

  map(docID a, doc d)
    for all term w in doc d do
      H = new hashMap
      for all term u NEAR w do
        H{u}++
      Emit(term w, stripe H)

  reduce(term w, stripes [H1, H2, ...])
    Hout = new hashMap
    for all stripe H in stripes do
      Hout = ElementWiseSum(Hout, H)
    Emit(term w, stripe Hout)

• Can use a combiner or in-mapper combining.
• Good: much smaller intermediate-key space.
  – Linear in the number of distinct terms.
• Bad: more difficult to implement; Map needs to hold an entire stripe in memory.

Note About Stripes Map Code

• Pairs' Map code needs only a single sequential scan of the document, keeping the current term w and a "sliding window" of the nearby terms to its left and right.
• Stripes can do the same, but then it does not aggregate counts across multiple occurrences of the same term w in document d, i.e., it would mostly produce counts of 1 in the hash map.
• To aggregate across all occurrences of w in d, Stripes would have to repeatedly scan the document, once for each distinct term w in d.
  – Could create an index to find repeated occurrences of w faster.
• Or use a two-dimensional hash map H[w][u] in the Map function, allowing a single-scan solution at a higher memory cost.
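The in-mapper combining pseudocode from the "Tally Counts Across Documents" slide maps directly onto Hadoop's new API, where setup() and cleanup() play the roles of initialize() and close(). A minimal sketch under the same assumptions as before (one document per input value, whitespace tokenization; the class name is illustrative):

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // In-mapper combining for Word Count: the tally H lives across all
  // map() calls of one task and is flushed once in cleanup().
  public class InMapperCombiningMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {

      private Map<String, Integer> tally;  // the private member H from the slide

      @Override
      protected void setup(Context context) {  // initialize(): once, before all map() calls
          tally = new HashMap<>();
      }

      @Override
      protected void map(LongWritable docId, Text doc, Context context) {
          for (String term : doc.toString().split("\\s+")) {
              if (!term.isEmpty()) {
                  tally.merge(term, 1, Integer::sum);  // H{t}++
              }
          }
      }

      @Override
      protected void cleanup(Context context)  // close(): once, after the last input
              throws IOException, InterruptedException {
          for (Map.Entry<String, Integer> e : tally.entrySet()) {
              context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
          }
      }
  }

Note that this sketch ignores the memory drawback discussed above: a production version might flush and clear the tally whenever it grows past some threshold.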

Pairs versus Stripes

• With a combiner or in-mapper combining, Map would produce about the same amount of data in both cases.
  – A two-dimensional index Pairs[w][u] with per-task counts for each pair (w, u) is the same as a one-dimensional index of one-dimensional indexes (Stripes[w])[u].
• ...and both would also require about the same amount of memory to store the two-dimensional count data structure.

Pairs versus Stripes (cont.)

• Without a combiner or in-mapper combining, Pairs could produce significantly more mapper output.
  – ((w, u), 1) per pair for Pairs, versus per-document aggregates for Stripes.
• ...but it would need a lot less memory.
  – Pairs essentially needs no extra storage beyond the current "window" of nearby words, while Stripes has to store the hash map H.

Pairs versus Stripes (cont.)

• Does the number of keys matter?
  – Assume we use the same number of tasks; then Pairs just assigns more keys per task.
  – The master works with tasks, hence there is no conceptual difference between Pairs and Stripes.
• The more fine-grained keys of Pairs allow more flexibility in assigning keys to tasks.
  – Pairs can emulate Stripes' row-wise key assignment to tasks.
  – Stripes cannot emulate all Pairs assignments, e.g., a "checkerboard" pattern for two tasks.
• The greater number of distinct keys per task in Pairs tends to increase sorting cost, even if the total data size is the same.

Beyond Pairs and Stripes

• In general, it is not clear which approach is better.
  – Some experiments indicate Stripes wins for co-occurrence matrix computation.
• Pairs and Stripes are special cases of shapes for covering the entire matrix.
  – Could use sub-stripes, or partition the matrix horizontally and vertically into more square-like shapes, etc.
• Can also be applied to higher-dimensional arrays.
• Will see an interesting version of this idea for joins.

(3) Relative Frequencies

• Important for data mining.
• E.g., for each species and color, estimate the probability of the color for that species.
  – Probability of a Northern Cardinal being red: P(color = red | species = N.C.).
• Count f(N.C.) = the frequency of observations for N.C. (marginal).
• Count f(N.C., red) = the frequency of observations for red N.C.'s (joint event).
• Estimate P(red | N.C.) as f(N.C., red) / f(N.C.).
• Similarly: normalize the word co-occurrence vector for word w.

Bird Probabilities Using Stripes

• Use species as the intermediate key.
  – One stripe per species, e.g., stripe[N.C.].
  – (stripe[species])[color] stores f(species, color).
• Map: for each observation of (species S, color C) in an observation event, increment (stripe[S])[C].
  – Output (S, stripe[S]).
• Reduce: for each species S, add all stripes for S.
  – Result: stripeSum[S] with total counts for each color for S.
  – Can get f(S) by adding all color counts in stripeSum[S].
  – Emit (stripeSum[S])[C] / f(S) for each color C.
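A minimal sketch of the Reduce side of this stripes-based relative-frequency computation, assuming the Map side emits (Text species, MapWritable stripe) pairs with Text color keys and IntWritable counts; the class name RelativeFrequencyReducer is an illustrative choice:

  import java.io.IOException;
  import java.util.Map;

  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.MapWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapreduce.Reducer;

  // Stripes-based relative frequencies: element-wise sum of all color
  // stripes for a species, then divide each count by the marginal f(S).
  public class RelativeFrequencyReducer
          extends Reducer<Text, MapWritable, Text, DoubleWritable> {

      @Override
      protected void reduce(Text species, Iterable<MapWritable> stripes,
                            Context context)
              throws IOException, InterruptedException {
          MapWritable stripeSum = new MapWritable();  // stripeSum[S]
          long marginal = 0;                          // f(S)

          for (MapWritable stripe : stripes) {
              for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
                  int c = ((IntWritable) e.getValue()).get();
                  IntWritable sum = (IntWritable) stripeSum.get(e.getKey());
                  stripeSum.put(e.getKey(),
                          new IntWritable((sum == null ? 0 : sum.get()) + c));
                  marginal += c;  // f(S) = sum of all joint counts for S
              }
          }
          // Emit P(C | S) = f(S, C) / f(S) for each color C.
          for (Map.Entry<Writable, Writable> e : stripeSum.entrySet()) {
              double p = ((IntWritable) e.getValue()).get() / (double) marginal;
              context.write(new Text(species + "\t" + e.getKey()),
                            new DoubleWritable(p));
          }
      }
  }

Because each stripe already carries all joint counts for a species, the marginal f(S) can be obtained in the same pass by summing the stripe entries, exactly as the slide describes.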
