CS555: Distributed Systems [Fall 2019]
Dept. of Computer Science, Colorado State University

CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE]
Shrideep Pallickara
Computer Science, Colorado State University
September 26, 2019 L10.1
SLIDES CREATED BY: SHRIDEEP PALLICKARA

Frequently asked questions from the previous class survey

Topics covered in this lecture
¨ MapReduce

MAPREDUCE

MapReduce: Topics that we will cover
¨ Why?
¨ What it is and what it is not
¨ The core framework and original Google paper
¨ Development of simple programs using Hadoop
  ¤ The dominant MapReduce implementation

MapReduce
¨ It's a framework for processing data residing on a large number of computers
¨ Very powerful framework
  ¤ Excellent for some problems
  ¤ Challenging or not applicable in other classes of problems

What is MapReduce?
¨ More a framework than a tool
¨ You are required to fit your solution into the MapReduce framework (some folks shoehorn it)
¨ MapReduce is not a feature, but rather a constraint

What does this constraint mean?
¨ It makes problem solving both easier and harder
¨ Clear boundaries for what you can and cannot do
  ¤ You actually need to consider fewer options than you are used to
¨ But solving problems with constraints requires planning and a change in your thinking

But what does this get us?
¨ What do we gain in exchange for being confined to the MapReduce framework?
  ¤ Ability to process data on a large number of computers
  ¤ But, more importantly, without having to worry about concurrency, scale, fault tolerance, and robustness

A challenge in writing MapReduce programs
¨ Design!
  ¤ Good programmers can produce bad software due to poor design
  ¤ Good programmers can produce bad MapReduce algorithms
¨ Only in this case your mistakes will be amplified
  ¤ Your job may be distributed across 100s or 1000s of machines and operate on a petabyte of data

MapReduce: Origins of the design
¨ Process crawled data and logs of web requests
¨ Several computations work on this raw data to compute derived data
  ¤ Inverted indices
  ¤ Representation of graph structure of web documents
  ¤ Pages crawled per host
  ¤ Most frequent queries in a day …

Most computations are conceptually straightforward
¨ But data is large
¨ Computations must be scalable
  ¤ Distributed across thousands of machines
  ¤ To complete in a reasonable amount of time

Complexity of managing distributed computations can …
¨ Obscure simplicity of original computation
¨ Contributing factors:
  ① How to parallelize computation
  ② Distribute the data
  ③ Handle failures

MapReduce was developed to cope with this complexity
¨ Express simple computations
¨ Hide messy details of
  ¤ Parallelization
  ¤ Data distribution
  ¤ Fault tolerance
  ¤ Load balancing

MapReduce
¨ Programming model
¨ Associated implementation for
  ¤ Processing & generating large data sets

Programming model
¨ Computation takes a set of input key/value pairs
¨ Produces a set of output key/value pairs
¨ Express the computation as two functions:
  ¤ Map
  ¤ Reduce

Map
¨ Takes an input pair
¨ Produces a set of intermediate key/value pairs

MapReduce library
¨ Groups all intermediate values with the same intermediate key
¨ Passes them to the Reduce function
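
The grouping step that the MapReduce library performs between Map and Reduce can be sketched in a few lines of Python. This is only a single-process illustration: the function name `group_by_key` and the sample pairs are hypothetical, and a real implementation would partition and sort the pairs across many machines.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect every value that shares an intermediate key.

    Hypothetical helper: stands in for the grouping the library does
    before it calls the user's Reduce function.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Made-up intermediate pairs, as a Map function might emit them.
intermediate = [("a", "1"), ("b", "1"), ("a", "1")]
grouped = group_by_key(intermediate)
```

Each entry of `grouped` is what a single Reduce call would receive: one intermediate key and all of the values emitted for it.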

Reduce function
¨ Accepts intermediate key I and
  ¤ Set of values for that key
¨ Merges these values together to get
  ¤ Smaller set of values

Counting the number of occurrences of each word in a large collection of documents

    map(String key, String value)
        // key: document name
        // value: document contents
        for each word w in value
            EmitIntermediate(w, "1")

Counting the number of occurrences of each word in a large collection of documents

    reduce(String key, Iterator values)
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values
            result += ParseInt(v);
        Emit(AsString(result));

Sums together all counts emitted for a particular word

MapReduce specification object contains
¨ Names of
  ¤ Input
  ¤ Output
¨ Tuning parameters
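
To see the word-count map and reduce functions work end to end, here is a minimal single-process sketch in Python. It is not Hadoop and not Google's implementation: `run_mapreduce` is a hypothetical driver that stands in for the library, the two sample documents are made up, and splitting on whitespace is an assumed definition of "word". Counts are passed as strings to mirror the slides.

```python
from collections import defaultdict


def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield word, "1"


def reduce_fn(key, values):
    # key: a word, values: an iterable of counts (as strings)
    result = 0
    for v in values:
        result += int(v)
    yield str(result)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            intermediate[k2].append(v2)
    return {k2: list(reduce_fn(k2, vs)) for k2, vs in sorted(intermediate.items())}


# Made-up input: (document name, document contents) pairs.
docs = [("doc1", "the quick fox"), ("doc2", "the lazy dog the end")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
```

The user writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is the part a real framework would distribute, schedule, and make fault tolerant.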

Map and reduce functions have associated types drawn from different domains

    map     (k1, v1)        → list(k2, v2)
    reduce  (k2, list(v2))  → list(v2)

What's passed to-and-from user-defined functions
¨ Strings
¨ User code converts between
  ¤ String
  ¤ Appropriate types
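
The type signatures and the string conversion done by user code can be written down as a short Python sketch. The aliases `MapFn` and `ReduceFn`, the function `max_reduce`, and the sample key and values are all hypothetical names chosen for illustration; they are not part of any MapReduce API.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")  # input key domain
V1 = TypeVar("V1")  # input value domain
K2 = TypeVar("K2")  # intermediate/output key domain
V2 = TypeVar("V2")  # intermediate/output value domain

# map:    (k1, v1)       -> list(k2, v2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce: (k2, list(v2)) -> list(v2)
ReduceFn = Callable[[K2, Iterable[V2]], Iterable[V2]]


def max_reduce(key: str, values: Iterable[str]) -> List[str]:
    """Values arrive as strings; user code parses them and converts back."""
    largest = max(int(v) for v in values)
    return [str(largest)]


# Made-up key and values for illustration.
result = max_reduce("sensor-7", ["3", "10", "7"])
```

Note that comparing the raw strings would give the wrong answer here ("7" sorts after "10"), which is why the conversion to an appropriate type belongs in the user's function.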