Pairs Design Pattern

  map(docID a, doc d)
    for all term w in doc d do
      for all term u NEAR w do
        Emit(pair (w, u), count 1)

  reduce(pair p, counts [c1, c2, …])
    sum = 0
    for all count c in counts do
      sum += c
    Emit(pair p, count sum)

[Figure: term co-occurrence matrix with rows/columns labeled w, v, u; each pair covers a single cell]

• Can use combiner or in-mapper combining
• Good: easy to implement and understand
• Bad: huge intermediate-key space (shuffling/sorting cost!)
  – Quadratic in number of distinct terms

Stripes Design Pattern

  map(docID a, doc d)
    for all term w in doc d do
      H = new hashMap
      for all term u NEAR w do
        H{u}++
      Emit(term w, stripe H)

  reduce(term w, stripes [H1, H2, …])
    Hout = new hashMap
    for all stripe H in stripes do
      Hout = ElementWiseSum(Hout, H)
    Emit(term w, stripe Hout)

[Figure: term co-occurrence matrix with rows/columns labeled w, v, u; each stripe covers an entire row]

• Can use combiner or in-mapper combining
• Good: much smaller intermediate-key space
  – Linear in number of distinct terms
• Bad: more difficult to implement; Map needs to hold an entire stripe in memory

Beyond Pairs and Stripes

• In general, it is not clear which approach is better
  – Some experiments indicate stripes win for co-occurrence matrix computation
• Pairs and stripes are special cases of shapes for covering the entire matrix
  – Could use sub-stripes, or partition the matrix horizontally and vertically into more square-like shapes, etc.
• Can also be applied to higher-dimensional arrays
• Will see an interesting version of this idea for joins

(3) Relative Frequencies

• Important for data mining
• E.g., for each species and color, compute the probability of the color for that species
  – Probability of a Northern Cardinal being red, P(color = red | species = N.C.)
• Count f(N.C.), the frequency of observations for N.C. (the marginal)
• Count f(N.C., red), the frequency of observations for red N.C.'s (the joint event)
• P(red | N.C.) = f(N.C., red) / f(N.C.)
• Similarly: normalize the word co-occurrence vector for word w by dividing it by w's frequency

Bird Probabilities Using Stripes

• Use species as the intermediate key
  – One stripe per species, e.g., stripe[N.C.]
  – (stripe[species])[color] stores f(species, color)
• Map: for each observation of (species S, color C) in an observation event, increment (stripe[S])[C]
  – Output (S, stripe[S])
• Reduce: for each species S, add all stripes for S
  – Result: stripeSum[S] with the total count for each color for S
  – Can get the marginal f(S) by adding up all the counts in stripeSum[S]
  – Get the probability P(color = C | species = S) as (stripeSum[S])[C] / f(S)

Discussion, Part 1

• Stripes are a great fit for relative-frequency computation
• All values for computing the final result are in the stripe
• Any smaller unit would miss some of the joint events needed for computing f(S), the marginal for the species
• So this would be a problem for the pairs pattern
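To make the pairs and stripes pseudocode above concrete, here is a minimal Python sketch that simulates the map, shuffle, and reduce phases in-process. It is not Hadoop code: the run() driver, the toy documents, and the choice of NEAR as a one-word window on either side (the slides leave NEAR unspecified) are all assumptions for illustration.

  from collections import Counter, defaultdict

  def pairs_map(doc):
      # Pairs mapper: emit ((w, u), 1) for every co-occurring pair.
      words = doc.split()
      for i, w in enumerate(words):
          # NEAR assumed to be a +/-1 word window
          for u in words[max(0, i - 1):i] + words[i + 1:i + 2]:
              yield (w, u), 1

  def pairs_reduce(pair, counts):
      yield pair, sum(counts)

  def stripes_map(doc):
      # Stripes mapper: emit (w, {u: count}), one stripe per occurrence of w.
      words = doc.split()
      for i, w in enumerate(words):
          H = Counter()
          for u in words[max(0, i - 1):i] + words[i + 1:i + 2]:
              H[u] += 1
          yield w, H

  def stripes_reduce(w, stripes):
      Hout = Counter()
      for H in stripes:          # element-wise sum of all stripes for w
          Hout.update(H)
      yield w, dict(Hout)

  def run(mapper, reducer, docs):
      # Simulated shuffle: group mapper output by key, then reduce each group.
      groups = defaultdict(list)
      for doc in docs:
          for k, v in mapper(doc):
              groups[k].append(v)
      return dict(kv for k, vs in groups.items() for kv in reducer(k, vs))

  docs = ["a b a", "b a b"]
  print(run(pairs_map, pairs_reduce, docs))      # {('a','b'): 4, ('b','a'): 4}
  print(run(stripes_map, stripes_reduce, docs))  # {'a': {'b': 4}, 'b': {'a': 4}}

Note how the two mappers emit the same information in different shapes: pairs produces many small records (one per matrix cell), stripes produces one associative array per term occurrence (one per matrix row), which is exactly the intermediate-key-space tradeoff listed above.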
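The stripes-based bird-probability computation above can likewise be sketched in a few lines of Python. The observation data is invented, and the shuffle is again simulated with a dictionary rather than a real MapReduce job.

  from collections import Counter, defaultdict

  # Invented sample input: (species, color) observation events.
  observations = [("N.C.", "red"), ("N.C.", "red"), ("N.C.", "brown"),
                  ("Blue Jay", "blue")]

  # Map: one single-entry stripe per observation; a real mapper would
  # accumulate a stripe per species across its whole input split.
  stripes = defaultdict(list)
  for species, color in observations:
      stripes[species].append(Counter({color: 1}))

  # Reduce: element-wise sum of the stripes gives f(species, color);
  # summing one stripe's counts gives the marginal f(species).
  for species, parts in stripes.items():
      stripe_sum = Counter()
      for H in parts:
          stripe_sum.update(H)
      f_s = sum(stripe_sum.values())            # marginal f(S)
      for color, joint in stripe_sum.items():
          print(species, color, joint / f_s)    # P(color | species)

Because the whole stripe for a species arrives at one reducer, both the joint counts and the marginal are available in a single Reduce invocation; this is the point made in Discussion, Part 1.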
Bird Probabilities Using Pairs

• Make sure all values f(S, color) for the same species end up in the same reduce task
  – Define a custom partitioning function on species
• Maintain state across different keys in the same reduce task
• This essentially simulates the stripes approach in the reduce task, creating big reduce tasks when there are many colors
• Can we do better?

Pairs-Based Solution, Take 1

• Intermediate key is (species, color)
• Map produces partial counts for each species-color combination in the input
• Reduce can compute f(species, color), the total count of each species-color combination
• But: cannot compute the marginal f(S)
  – Reduce needs to sum f(S, color) over all colors for species S

Discussion, Part 2

• The pairs-based algorithm would work better if the marginal f(S) were already known
  – The reducer computes f(species, color) and then outputs f(species, color) / f(species)
• We can compute the species marginals f(species) in a separate MapReduce job first
• Better: fold this into a single MapReduce job
  – Problem: it is easy to compute f(S) from all the f(S, color), but how do we compute f(S) before knowing the f(S, color)?

Bird Probabilities Using Pairs, Take 2

• Map: for each observation event, emit ((species S, color C), 1) and ((species S, dummyColor), 1) for each species-color combination encountered
• Use a custom partitioner that partitions based on the species component only
• Use a custom key comparator such that (S, dummyColor) sorts before all (S, C) for real colors C
  – The reducer thus computes f(S) before any of the f(S, C)
• The reducer keeps f(S) in state for the duration of the entire task
  – The reducer then computes f(S, C) for each C, outputting f(S, C) / f(S)
• Advantage: avoids having to manage all colors for a species together (see the sketch below)
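Here is a runnable in-process sketch of Take 2. The choice of "*" as the dummyColor is an assumption: with Python's default tuple ordering, (S, "*") sorts before every real (S, color) key, which is what the custom key comparator guarantees in an actual Hadoop job. The sample observations are invented.

  from collections import defaultdict

  observations = [("N.C.", "red"), ("N.C.", "red"), ("N.C.", "brown"),
                  ("Blue Jay", "blue")]

  # Map: emit ((S, C), 1) and ((S, "*"), 1) for each observation event.
  emitted = []
  for s, c in observations:
      emitted.append(((s, c), 1))
      emitted.append(((s, "*"), 1))

  # Simulated shuffle: partition on the species component only, with
  # combiner-style pre-aggregation. For simplicity each species gets its
  # own partition here; a real partitioner hashes species into a fixed
  # number of reduce tasks, so one task may see several species groups.
  partitions = defaultdict(lambda: defaultdict(int))
  for (s, c), n in emitted:
      partitions[s][(s, c)] += n

  # Reduce: within a partition, keys arrive in sorted order, so the
  # (S, "*") marginal is seen and stashed in task state before any (S, C).
  for s, counts in partitions.items():
      f_s = None
      for (species, color), n in sorted(counts.items()):
          if color == "*":
              f_s = n                         # marginal f(S) arrives first
          else:
              print(species, color, n / f_s)  # f(S, C) / f(S)

The reducer never has to buffer all colors for a species: it holds only the single marginal count in state, which is the memory advantage over the stripes-style simulation of Take 0. This trick of forcing the marginal to arrive first is named on the next slide.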
Order Inversion Design Pattern

• Occurs surprisingly often during data analysis
• Solution 1: use complex data structures that bring the right results together
  – The array structure used by the stripes pattern
• Solution 2: turn synchronization into an ordering problem
  – Key sort order enforces the computation order
  – The partitioner for the key space assigns the appropriate partial results to each reduce task
  – The reducer maintains task-level state across Reduce invocations
  – Works for the simpler pairs pattern, which uses simpler data structures and requires less reducer memory

(4) Secondary Sorting

• Recall the weather data: for simplicity, assume the observations are (date, stationID, temperature)
• Goal: for each station, create a time series of temperature measurements
• Per-station data: use stationID as the intermediate key
• Problem: reducers receive a huge number of (date, temp) pairs for each station
  – These have to be sorted by user code

Can Hadoop Do The Sorting?

• Use (stationID, date) as the intermediate key
  – Problem: records for the same station might end up in different reduce tasks
  – Solution: custom partitioner that uses only the stationID component of the key for partitioning
• General value-to-key conversion design pattern (see the sketch at the end of this section)
  – To partition by X and then sort each X-group by Y, make (X, Y) the key
  – Define a key comparator to order by the composite key (X, Y)
  – Define a partitioner and a grouping comparator for (X, Y) that consider only X for partitioning and grouping
    • The grouping part is necessary if all dates for a station should be processed in the same Reduce invocation (otherwise each station-date combination ends up in a different Reduce invocation)

Design Pattern Summary

• In-mapper combining: do the work of the combiner in the mapper
• Pairs and stripes: for keeping track of joint events
• Order inversion: convert the sequencing of a computation into a sorting problem
• Value-to-key conversion: scalable solution for secondary sorting, without writing your own sort code

Tools for Synchronization

• Cleverly constructed data structures for keys and values to bring data together
• Preserving state in mappers and reducers, together with the capability to add initialization and termination code for an entire task
• Sort order of intermediate keys to control the order in which reducers process keys
• Custom partitioner to control which reducer processes which keys

Issues and Tradeoffs

• Number of key-value pairs
  – Object creation overhead
  – Time for sorting and shuffling pairs across the network
• Size of each key-value pair
  – (De-)serialization overhead
• Local aggregation
  – Opportunities to perform local aggregation vary
  – Combiners can make a big difference
  – Combiners vs. in-mapper combining
  – RAM vs. disk vs. network

Now that we have seen important design patterns and MapReduce algorithms for simpler problems, let's look at some more complex problems.

Joins in MapReduce

• Data sets S = {s1, …, s|S|} and T = {t1, …, t|T|}
• Find all pairs (si, tj) that satisfy some predicate
• Examples
  – Pairs of similar or complementary function summaries
  – Facebook and Twitter posts by the same user or from the same location
• Typical goal: minimize job completion time
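The value-to-key conversion pattern for the weather example, simulated in-process in Python. The composite key is (stationID, date), the partitioner hashes only the stationID component, and grouping likewise looks only at stationID, so each Reduce invocation sees one station's readings already sorted by date. Station IDs, dates, and temperatures are made up for illustration.

  from itertools import groupby

  # Invented sample records: (date, stationID, temperature).
  records = [("2011-10-14", "ST2", 12.5), ("2011-10-12", "ST1", 9.0),
             ("2011-10-13", "ST1", 7.5), ("2011-10-12", "ST2", 11.0)]

  # Map: move the date from the value into the composite key.
  pairs = [((station, date), temp) for date, station, temp in records]

  def partition(key, num_reducers=2):
      # Custom partitioner: hash only the stationID component of the key.
      return hash(key[0]) % num_reducers

  # Simulated shuffle: within each partition, sort by the full
  # (stationID, date) key, then group on stationID alone -- the job of
  # the grouping comparator in real Hadoop.
  for r in range(2):
      part = sorted(kv for kv in pairs if partition(kv[0]) == r)
      for station, group in groupby(part, key=lambda kv: kv[0][0]):
          # Values arrive date-sorted: a ready-made time series,
          # with no sorting done in user code.
          series = [(date, temp) for (_, date), temp in group]
          print(station, series)

The framework's sort does all the work: user code never buffers or sorts a station's readings, which is why value-to-key conversion scales to stations with huge numbers of observations.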