Pairs Design Pattern

  map(docID a, doc d)
    for all term w in doc d do
      for all term u NEAR w do
        Emit(pair (w, u), count 1)

  reduce(pair p, counts [c1, c2, …])
    sum = 0
    for all count c in counts do
      sum += c
    Emit(pair p, count sum)

[Figure: term co-occurrence matrix with rows/columns labeled w, v, u; each pair covers a single cell]

• Can use combiner or in-mapper combining
• Good: easy to implement and understand
• Bad: huge intermediate-key space (shuffling/sorting cost!)
  – Quadratic in number of distinct terms

Stripes Design Pattern

  map(docID a, doc d)
    for all term w in doc d do
      H = new hashMap
      for all term u NEAR w do
        H{u}++
      Emit(term w, stripe H)

  reduce(term w, stripes [H1, H2, …])
    Hout = new hashMap
    for all stripe H in stripes do
      Hout = ElementWiseSum(Hout, H)
    Emit(term w, stripe Hout)

[Figure: term co-occurrence matrix with rows/columns labeled w, v, u; each stripe covers an entire row]

• Can use combiner or in-mapper combining
• Good: much smaller intermediate-key space
  – Linear in number of distinct terms
• Bad: more difficult to implement; Map needs to hold an entire stripe in memory

Beyond Pairs and Stripes

• In general, it is not clear which approach is better
  – Some experiments indicate stripes win for co-occurrence matrix computation
• Pairs and stripes are special cases of shapes for covering the entire matrix
  – Could use sub-stripes, or partition the matrix horizontally and vertically into more square-like shapes, etc.
• Can also be applied to higher-dimensional arrays
• Will see an interesting version of this idea for joins

(3) Relative Frequencies

• Important for data mining
• E.g., for each species and color, compute the probability of the color for that species
  – Probability of a Northern Cardinal being red, P(color = red | species = N.C.)
• Count f(N.C.), the frequency of observations for N.C. (the marginal)
• Count f(N.C., red), the frequency of observations for red N.C.'s (the joint event)
• P(red | N.C.) = f(N.C., red) / f(N.C.)
• Similarly: normalize the word co-occurrence vector for word w by dividing it by w's frequency

Bird Probabilities Using Stripes

• Use species as the intermediate key
  – One stripe per species, e.g., stripe[N.C.]
  – (stripe[species])[color] stores f(species, color)
• Map: for each observation of (species S, color C) in an observation event, increment (stripe[S])[C]
  – Output (S, stripe[S])
• Reduce: for each species S, add all stripes for S
  – Result: stripeSum[S] with the total count for each color for S
  – Can get the marginal f(S) by adding up all the counts in stripeSum[S]
  – Get the probability P(color = C | species = S) as (stripeSum[S])[C] / f(S)

Discussion, Part 1

• Stripes are a great fit for relative-frequency computation
• All values for computing the final result are in the stripe
• Any smaller unit would miss some of the joint events needed for computing f(S), the marginal for the species
• So this would be a problem for the pairs pattern
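To make the pairs and stripes pseudocode above concrete, here is a minimal Python sketch that simulates the map, shuffle, and reduce phases in-process. It is not Hadoop code: the run() driver, the toy documents, and the choice of NEAR as a one-word window on either side (the slides leave NEAR unspecified) are all assumptions for illustration.

  from collections import Counter, defaultdict

  def pairs_map(doc):
      # Pairs mapper: emit ((w, u), 1) for every co-occurring pair.
      words = doc.split()
      for i, w in enumerate(words):
          # NEAR assumed to be a +/-1 word window
          for u in words[max(0, i - 1):i] + words[i + 1:i + 2]:
              yield (w, u), 1

  def pairs_reduce(pair, counts):
      yield pair, sum(counts)

  def stripes_map(doc):
      # Stripes mapper: emit (w, {u: count}), one stripe per occurrence of w.
      words = doc.split()
      for i, w in enumerate(words):
          H = Counter()
          for u in words[max(0, i - 1):i] + words[i + 1:i + 2]:
              H[u] += 1
          yield w, H

  def stripes_reduce(w, stripes):
      Hout = Counter()
      for H in stripes:          # element-wise sum of all stripes for w
          Hout.update(H)
      yield w, dict(Hout)

  def run(mapper, reducer, docs):
      # Simulated shuffle: group mapper output by key, then reduce each group.
      groups = defaultdict(list)
      for doc in docs:
          for k, v in mapper(doc):
              groups[k].append(v)
      return dict(kv for k, vs in groups.items() for kv in reducer(k, vs))

  docs = ["a b a", "b a b"]
  print(run(pairs_map, pairs_reduce, docs))      # {('a','b'): 4, ('b','a'): 4}
  print(run(stripes_map, stripes_reduce, docs))  # {'a': {'b': 4}, 'b': {'a': 4}}

Note how the two mappers emit the same information in different shapes: pairs produces many small records (one per matrix cell), stripes produces one associative array per term occurrence (one per matrix row), which is exactly the intermediate-key-space tradeoff listed above.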
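The stripes-based bird-probability computation above can likewise be sketched in a few lines of Python. The observation data is invented, and the shuffle is again simulated with a dictionary rather than a real MapReduce job.

  from collections import Counter, defaultdict

  # Invented sample input: (species, color) observation events.
  observations = [("N.C.", "red"), ("N.C.", "red"), ("N.C.", "brown"),
                  ("Blue Jay", "blue")]

  # Map: one single-entry stripe per observation; a real mapper would
  # accumulate a stripe per species across its whole input split.
  stripes = defaultdict(list)
  for species, color in observations:
      stripes[species].append(Counter({color: 1}))

  # Reduce: element-wise sum of the stripes gives f(species, color);
  # summing one stripe's counts gives the marginal f(species).
  for species, parts in stripes.items():
      stripe_sum = Counter()
      for H in parts:
          stripe_sum.update(H)
      f_s = sum(stripe_sum.values())            # marginal f(S)
      for color, joint in stripe_sum.items():
          print(species, color, joint / f_s)    # P(color | species)

Because the whole stripe for a species arrives at one reducer, both the joint counts and the marginal are available in a single Reduce invocation; this is the point made in Discussion, Part 1.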
Bird Probabilities Using Pairs

• Make sure all values f(S, color) for the same species end up in the same reduce task
  – Define a custom partitioning function on species
• Maintain state across different keys in the same reduce task
• This essentially simulates the stripes approach in the reduce task, creating big reduce tasks when there are many colors
• Can we do better?

Pairs-Based Solution, Take 1

• Intermediate key is (species, color)
• Map produces partial counts for each species-color combination in the input
• Reduce can compute f(species, color), the total count of each species-color combination
• But: cannot compute the marginal f(S)
  – Reduce needs to sum f(S, color) over all colors for species S

Discussion, Part 2

• The pairs-based algorithm would work better if the marginal f(S) were already known
  – The reducer computes f(species, color) and then outputs f(species, color) / f(species)
• We can compute the species marginals f(species) in a separate MapReduce job first
• Better: fold this into a single MapReduce job
  – Problem: it is easy to compute f(S) from all the f(S, color), but how do we compute f(S) before knowing the f(S, color)?

Bird Probabilities Using Pairs, Take 2

• Map: for each observation event, emit ((species S, color C), 1) and ((species S, dummyColor), 1) for each species-color combination encountered
• Use a custom partitioner that partitions based on the species component only
• Use a custom key comparator such that (S, dummyColor) sorts before all (S, C) for real colors C
  – The reducer thus computes f(S) before any of the f(S, C)
• The reducer keeps f(S) in state for the duration of the entire task
  – The reducer then computes f(S, C) for each C, outputting f(S, C) / f(S)
• Advantage: avoids having to manage all colors for a species together (see the sketch below)
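Here is a runnable in-process sketch of Take 2. The choice of "*" as the dummyColor is an assumption: with Python's default tuple ordering, (S, "*") sorts before every real (S, color) key, which is what the custom key comparator guarantees in an actual Hadoop job. The sample observations are invented.

  from collections import defaultdict

  observations = [("N.C.", "red"), ("N.C.", "red"), ("N.C.", "brown"),
                  ("Blue Jay", "blue")]

  # Map: emit ((S, C), 1) and ((S, "*"), 1) for each observation event.
  emitted = []
  for s, c in observations:
      emitted.append(((s, c), 1))
      emitted.append(((s, "*"), 1))

  # Simulated shuffle: partition on the species component only, with
  # combiner-style pre-aggregation. For simplicity each species gets its
  # own partition here; a real partitioner hashes species into a fixed
  # number of reduce tasks, so one task may see several species groups.
  partitions = defaultdict(lambda: defaultdict(int))
  for (s, c), n in emitted:
      partitions[s][(s, c)] += n

  # Reduce: within a partition, keys arrive in sorted order, so the
  # (S, "*") marginal is seen and stashed in task state before any (S, C).
  for s, counts in partitions.items():
      f_s = None
      for (species, color), n in sorted(counts.items()):
          if color == "*":
              f_s = n                         # marginal f(S) arrives first
          else:
              print(species, color, n / f_s)  # f(S, C) / f(S)

The reducer never has to buffer all colors for a species: it holds only the single marginal count in state, which is the memory advantage over the stripes-style simulation of Take 0. This trick of forcing the marginal to arrive first is named on the next slide.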
Order Inversion Design Pattern

• Occurs surprisingly often during data analysis
• Solution 1: use complex data structures that bring the right results together
  – The array structure used by the stripes pattern
• Solution 2: turn synchronization into an ordering problem
  – Key sort order enforces the computation order
  – The partitioner for the key space assigns the appropriate partial results to each reduce task
  – The reducer maintains task-level state across Reduce invocations
  – Works for the simpler pairs pattern, which uses simpler data structures and requires less reducer memory

(4) Secondary Sorting

• Recall the weather data: for simplicity, assume the observations are (date, stationID, temperature)
• Goal: for each station, create a time series of temperature measurements
• Per-station data: use stationID as the intermediate key
• Problem: reducers receive a huge number of (date, temp) pairs for each station
  – These have to be sorted by user code

Can Hadoop Do The Sorting?

• Use (stationID, date) as the intermediate key
  – Problem: records for the same station might end up in different reduce tasks
  – Solution: custom partitioner that uses only the stationID component of the key for partitioning
• General value-to-key conversion design pattern (see the sketch at the end of this section)
  – To partition by X and then sort each X-group by Y, make (X, Y) the key
  – Define a key comparator to order by the composite key (X, Y)
  – Define a partitioner and a grouping comparator for (X, Y) that consider only X for partitioning and grouping
    • The grouping part is necessary if all dates for a station should be processed in the same Reduce invocation (otherwise each station-date combination ends up in a different Reduce invocation)

Design Pattern Summary

• In-mapper combining: do the work of the combiner in the mapper
• Pairs and stripes: for keeping track of joint events
• Order inversion: convert the sequencing of a computation into a sorting problem
• Value-to-key conversion: scalable solution for secondary sorting, without writing your own sort code

Tools for Synchronization

• Cleverly constructed data structures for keys and values to bring data together
• Preserving state in mappers and reducers, together with the capability to add initialization and termination code for an entire task
• Sort order of intermediate keys to control the order in which reducers process keys
• Custom partitioner to control which reducer processes which keys

Issues and Tradeoffs

• Number of key-value pairs
  – Object creation overhead
  – Time for sorting and shuffling pairs across the network
• Size of each key-value pair
  – (De-)serialization overhead
• Local aggregation
  – Opportunities to perform local aggregation vary
  – Combiners can make a big difference
  – Combiners vs. in-mapper combining
  – RAM vs. disk vs. network

Now that we have seen important design patterns and MapReduce algorithms for simpler problems, let's look at some more complex problems.

Joins in MapReduce

• Data sets S = {s1, …, s|S|} and T = {t1, …, t|T|}
• Find all pairs (si, tj) that satisfy some predicate
• Examples
  – Pairs of similar or complementary function summaries
  – Facebook and Twitter posts by the same user or from the same location
• Typical goal: minimize job completion time
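The value-to-key conversion pattern for the weather example, simulated in-process in Python. The composite key is (stationID, date), the partitioner hashes only the stationID component, and grouping likewise looks only at stationID, so each Reduce invocation sees one station's readings already sorted by date. Station IDs, dates, and temperatures are made up for illustration.

  from itertools import groupby

  # Invented sample records: (date, stationID, temperature).
  records = [("2011-10-14", "ST2", 12.5), ("2011-10-12", "ST1", 9.0),
             ("2011-10-13", "ST1", 7.5), ("2011-10-12", "ST2", 11.0)]

  # Map: move the date from the value into the composite key.
  pairs = [((station, date), temp) for date, station, temp in records]

  def partition(key, num_reducers=2):
      # Custom partitioner: hash only the stationID component of the key.
      return hash(key[0]) % num_reducers

  # Simulated shuffle: within each partition, sort by the full
  # (stationID, date) key, then group on stationID alone -- the job of
  # the grouping comparator in real Hadoop.
  for r in range(2):
      part = sorted(kv for kv in pairs if partition(kv[0]) == r)
      for station, group in groupby(part, key=lambda kv: kv[0][0]):
          # Values arrive date-sorted: a ready-made time series,
          # with no sorting done in user code.
          series = [(date, temp) for (_, date), temp in group]
          print(station, series)

The framework's sort does all the work: user code never buffers or sorts a station's readings, which is why value-to-key conversion scales to stations with huge numbers of observations.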