MapReduce Design Patterns

• This section is based on the book by Jimmy Lin and Chris Dyer
• Now let's look at important program "design patterns" for MapReduce
• The programmer can control program execution only through the implementation of mapper, reducer, combiner, and partitioner
• No explicit synchronization primitives
• So how can a programmer control execution and data flow?

Taking Control of MapReduce

• Store and communicate partial results through complex data structures for keys and values
• Run appropriate initialization code at the beginning of a task and termination code at the end of a task
• Preserve state in mappers and reducers across multiple input splits and intermediate keys, respectively
• Control the sort order of intermediate keys to control the processing order at the reducers
• Control the set of keys assigned to a reducer
• Use a "driver" program

(1) Local Aggregation

• Reduce the size of intermediate results passed from mappers to reducers
– Important for scalability: recall Amdahl's Law
• Various options using the combiner function and the ability to preserve mapper state across multiple inputs
• For example, consider Word Count with the document-based version of Map

Word Count Baseline Algorithm

    map(docID a, doc d)
      for all term t in doc d do
        Emit(term t, count 1)

    reduce(term t, counts [c1, c2,…])
      sum = 0
      for all count c in counts do
        sum += c
      Emit(term t, count sum)

• Problem: frequent terms are emitted many times with count 1

Tally Counts Per Document

    map(docID a, doc d)
      H = new hashMap
      for all term t in doc d do
        H{t}++
      for all term t in H do
        Emit(term t, count H{t})

• Same Reduce function as before
• Limitation: Map only aggregates counts within a single document
• Depending on split size and document size, a Map task might receive many documents
• Can we aggregate across all documents in the same Map task?
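To make the per-document tally concrete, here is a minimal Java sketch of the Map side, assuming Hadoop's newer org.apache.hadoop.mapreduce API and an input format such as KeyValueTextInputFormat that hands each map() call one (docID, document text) pair; the class name and the whitespace tokenizer are illustrative choices, not from the slides.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Tally counts per document: aggregate inside one map() call,
    // then emit one (term, count) pair per distinct term in the document.
    public class PerDocumentTallyMapper
        extends Mapper<Text, Text, Text, IntWritable> {

      @Override
      protected void map(Text docId, Text doc, Context context)
          throws IOException, InterruptedException {
        Map<String, Integer> h = new HashMap<>();   // H = new hashMap
        for (String term : doc.toString().split("\\s+")) {
          h.merge(term, 1, Integer::sum);           // H{t}++
        }
        for (Map.Entry<String, Integer> e : h.entrySet()) {
          context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
      }
    }

The Reduce side is unchanged from the baseline word count.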
Tally Counts Across Documents

    Class Mapper {
      initialize() {
        H = new hashMap
      }

      map(docID a, doc d) {
        for all term t in doc d do
          H{t}++
      }

      close() {
        for all term t in H do
          Emit(term t, count H{t})
      }
    }

• Data structure H is a private member of the Mapper class
• Initialize is called when the task starts, i.e., before all map calls
– Configure() in the old API
– Setup() in the new API
• Close is called after the last document of the Map task has been processed
– Close() in the old API
– Cleanup() in the new API

Design Pattern for Local Aggregation

• In-mapper combining
– Done by preserving state across map calls in the same task
– Local to a single task, i.e., does not introduce task synchronization issues
• Advantages over using combiners
– A combiner does not guarantee if, when, or how often it is executed
– A combiner combines data after it was generated; in-mapper combining avoids generating it in the first place!
• Drawbacks
– Introduces complexity and hence a greater likelihood of bugs
– Higher memory consumption for managing state
• Might have to write memory-management code to page data to disk

(2) Counting of Combinations

• Needed for computing correlations, associations, and confusion matrices (how many times does a classifier confuse Y_i with Y_j)
• Co-occurrence matrix for a text corpus: how many times do two terms appear near each other
• Main idea: compute partial counts for some combinations, then aggregate them
– At what granularity should Map work?

Pairs Design Pattern

    map(docID a, doc d)
      for all term w in doc d do
        for all term u NEAR w do
          Emit(pair (w, u), count 1)

    reduce(pair p, counts [c1, c2,…])
      sum = 0
      for all count c in counts do
        sum += c
      Emit(pair p, count sum)

(Figure: a document window of nearby terms w, v, u.)

• Can use a combiner or in-mapper combining
• Good: easy to implement and understand
• Bad: huge intermediate-key space
– Quadratic in the number of distinct terms

Stripes Design Pattern

    map(docID a, doc d)
      for all term w in doc d do
        H = new hashMap
        for all term u NEAR w do
          H{u}++
        Emit(term w, stripe H)

    reduce(term w, stripes [H1, H2,…])
      Hout = new hashMap
      for all stripe H in stripes do
        Hout = ElementWiseSum(Hout, H)
      Emit(term w, stripe Hout)

• Can use a combiner or in-mapper combining
• Good: much smaller intermediate-key space
– Linear in the number of distinct terms
• Bad: more difficult to implement, and Map needs to hold an entire stripe in memory

Note About Stripes Map Code

• Pairs' Map code only needs a single sequential scan of the document, keeping the current term w and a "sliding window" of the nearby terms to its left and right
• Stripes can do the same, but then it does not aggregate counts across multiple occurrences of the same term w in document d, i.e., it would mostly produce counts of 1 in the hash map
• To aggregate across all occurrences of w in d, Stripes would have to scan the document repeatedly, once for each distinct term w in d
– Could create an index to find repeated occurrences of w faster
• Or use a two-dimensional hash map H[w][u] in the Map function, allowing a single-scan solution at a higher memory cost
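As a concrete rendering of the Class Mapper pseudocode from "Tally Counts Across Documents" above, a hedged Java sketch follows: in the new Hadoop API, setup() and cleanup() take the roles of initialize()/Configure() and close()/Cleanup(); the class name and the tokenizer are again illustrative.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // In-mapper combining: the tally H survives across all map() calls
    // of one task, so counts are aggregated across documents.
    public class InMapperCombiningMapper
        extends Mapper<Text, Text, Text, IntWritable> {

      private Map<String, Integer> h;        // private member, local to this task

      @Override
      protected void setup(Context context) {
        h = new HashMap<>();                 // initialize(): runs before any map() call
      }

      @Override
      protected void map(Text docId, Text doc, Context context) {
        for (String term : doc.toString().split("\\s+")) {
          h.merge(term, 1, Integer::sum);    // H{t}++, aggregated across documents
        }
      }

      @Override
      protected void cleanup(Context context)  // close(): runs after the last map() call
          throws IOException, InterruptedException {
        for (Map.Entry<String, Integer> e : h.entrySet()) {
          context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
      }
    }

Note that h grows with the number of distinct terms the task sees, which is exactly the memory drawback listed under the design pattern above.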
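Before comparing the two patterns, here are hedged Java sketches of both Map functions; the window size of 2, the tab-separated string encoding of the pair key (a custom WritableComparable would be more idiomatic), and the use of MapWritable for stripes are all illustrative assumptions, and each class would normally live in its own file.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Pairs: emit ((w, u), 1) for every nearby pair of terms.
    public class PairsMapper extends Mapper<Text, Text, Text, IntWritable> {
      private static final int WINDOW = 2;   // assumption: "NEAR" = +/- 2 positions
      private static final IntWritable ONE = new IntWritable(1);

      @Override
      protected void map(Text docId, Text doc, Context context)
          throws IOException, InterruptedException {
        String[] t = doc.toString().split("\\s+");
        for (int i = 0; i < t.length; i++) {
          for (int j = Math.max(0, i - WINDOW);
               j <= Math.min(t.length - 1, i + WINDOW); j++) {
            if (j != i) context.write(new Text(t[i] + "\t" + t[j]), ONE);
          }
        }
      }
    }

    // Stripes: for each occurrence of w, emit one hash map of neighbor counts.
    public class StripesMapper extends Mapper<Text, Text, Text, MapWritable> {
      private static final int WINDOW = 2;

      @Override
      protected void map(Text docId, Text doc, Context context)
          throws IOException, InterruptedException {
        String[] t = doc.toString().split("\\s+");
        for (int i = 0; i < t.length; i++) {
          MapWritable stripe = new MapWritable();   // H = new hashMap
          for (int j = Math.max(0, i - WINDOW);
               j <= Math.min(t.length - 1, i + WINDOW); j++) {
            if (j == i) continue;
            Text u = new Text(t[j]);
            IntWritable c = (IntWritable) stripe.get(u);
            stripe.put(u, new IntWritable(c == null ? 1 : c.get() + 1));  // H{u}++
          }
          context.write(new Text(t[i]), stripe);    // Emit(w, stripe H)
        }
      }
    }

The reducers are direct translations of the pseudocode above: summing counts for Pairs, element-wise summing of MapWritables for Stripes.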
Pairs versus Stripes

• With a combiner or in-mapper combining, Map would produce about the same amount of data in both cases
– A two-dimensional index Pairs[w][u] with per-task counts for each pair (w, u) is the same as a one-dimensional index of one-dimensional indexes, (Stripes[w])[u]
• …and would also require about the same amount of memory to store the two-dimensional count data structure

Pairs versus Stripes (cont.)

• Without a combiner or in-mapper combining, Pairs could produce significantly more mapper output
– ((w, u), 1) per pair for Pairs, versus per-document aggregates for Stripes
• …but it would need a lot less memory
– Pairs essentially needs no extra storage beyond the current "window" of nearby words, while Stripes has to store the hash map H

Pairs versus Stripes (cont.)

• Does the number of keys matter?
– Assume we use the same number of tasks; then Pairs just assigns more keys per task
– The master works with tasks, hence no conceptual difference between Pairs and Stripes
• The more fine-grained keys of Pairs allow more flexibility in assigning keys to tasks
– Pairs can emulate Stripes' row-wise key assignment to tasks
– Stripes cannot emulate all Pairs assignments, e.g., a "checkerboard" pattern for two tasks
• The greater number of distinct keys per task in Pairs tends to increase sorting cost, even if the total data size is the same

Beyond Pairs and Stripes

• In general, it is not clear which approach is better
– Some experiments indicate that Stripes wins for co-occurrence matrix computation
• Pairs and Stripes are special cases of shapes for covering the entire matrix
– Could use sub-stripes, or partition the matrix horizontally and vertically into more square-like shapes, etc.
• Can also be applied to higher-dimensional arrays
• We will see an interesting version of this idea for joins

(3) Relative Frequencies

• Important for data mining
• E.g., for each species and color, estimate the probability of the color for that species
– Probability of a Northern Cardinal being red: P(color = red | species = N.C.)
• Count f(N.C.) = the frequency of observations for N.C. (the marginal)
• Count f(N.C., red) = the frequency of observations for red N.C.'s (the joint event)
• Estimate P(red | N.C.) as f(N.C., red) / f(N.C.)
• Similarly: normalize the word co-occurrence vector for word w

Bird Probabilities Using Stripes

• Use species as the intermediate key
– One stripe per species, e.g., stripe[N.C.]
– (stripe[species])[color] stores f(species, color)
• Map: for each observation event reporting (species S, color C), increment (stripe[S])[C]
– Output (S, stripe[S])
• Reduce: for each species S, add all stripes for S
– Result: stripeSum[S] with total counts for each color for S
– Can get f(S) by adding all color counts in stripeSum[S]
– Emit (stripeSum[S])[C] / f(S) for each color C
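To close the loop on the bird example, here is a hedged Java sketch of the Reduce side, assuming the Map side emits (species, stripe) pairs with color counts in a MapWritable, analogous to the Stripes sketch earlier; all class and variable names are illustrative.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sum the stripes for species S, derive the marginal f(S), and
    // emit the estimate f(S, C) / f(S) for every color C.
    public class RelativeFrequencyReducer
        extends Reducer<Text, MapWritable, Text, DoubleWritable> {

      @Override
      protected void reduce(Text species, Iterable<MapWritable> stripes,
                            Context context)
          throws IOException, InterruptedException {
        Map<String, Integer> sum = new HashMap<>();   // stripeSum[S]
        long total = 0;                               // f(S), the marginal
        for (MapWritable stripe : stripes) {
          for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
            int c = ((IntWritable) e.getValue()).get();
            sum.merge(e.getKey().toString(), c, Integer::sum);
            total += c;
          }
        }
        for (Map.Entry<String, Integer> e : sum.entrySet()) {
          // Key "species,color" with estimated P(color | species).
          context.write(new Text(species + "," + e.getKey()),
                        new DoubleWritable(e.getValue() / (double) total));
        }
      }
    }

Because f(S) is just the sum of all color counts in stripeSum[S], both the joint counts and the marginal are available within a single reduce() call.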