MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat Presented by: Chaochao Yan 04/25/2018
MapReduce A programming model and an associated implementation for processing and generating large data sets.
Motivation: Large Scale Data Processing ❏ Want to process lots of data (>1 PB data) ❏ Want to use hundreds or thousands of CPUs ❏ Want to make this easy
MapReduce ❏ Automatic parallelization & distribution ❏ Fault-tolerant ❏ Provides status and monitoring tools ❏ Clean abstraction for programmers
Programming model ❏ Input & Output: each a set of key/value pairs ❏ Follows a divide-and-conquer pattern ❏ Programmer specifies two functions: map(in_key, in_value) -> list(out_key, intermediate_value) reduce(out_key, list(intermediate_value)) -> list(out_value)
WordCount Pseudo-code

    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1");

    reduce(String output_key, Iterator intermediate_values):
      // output_key: a word
      // intermediate_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += ParseInt(v);
      Emit(AsString(result));
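The pseudocode translates almost directly into a minimal, single-machine Python sketch. The helper names (map_fn, reduce_fn, word_count) and the in-memory grouping step are illustrative only, not part of the MapReduce API:

    from collections import defaultdict

    # Toy in-memory WordCount: map, group by key (the "shuffle"), then reduce.
    def map_fn(doc_name, doc_contents):
        for word in doc_contents.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        yield (word, sum(counts))

    def word_count(documents):
        intermediate = defaultdict(list)      # shuffle: group values by key
        for name, contents in documents.items():
            for key, value in map_fn(name, contents):
                intermediate[key].append(value)
        results = {}
        for key, values in intermediate.items():
            for out_key, out_value in reduce_fn(key, values):
                results[out_key] = out_value
        return results

    print(word_count({"doc1": "see spot run", "doc2": "see spot"}))
    # {'see': 2, 'spot': 2, 'run': 1}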
Illustrated WordCount (figure: the shuffle groups intermediate values by key, e.g. “see”: [“1”, “1”]) Picture from http://ranger.uta.edu/~sjiang/CSE6350-spring-18/lecture-8.pdf
Distributed and Parallel Computing ❏ map() tasks run distributed and in parallel, creating intermediate values from different input splits ❏ reduce() tasks also run distributed and in parallel, each working on a different output key ❏ All values are processed independently
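On a single machine, the same independence can be exploited with a process pool. A toy sketch only (multiprocessing stands in for the cluster, and the shuffle is done in memory):

    from collections import defaultdict
    from multiprocessing import Pool

    def map_fn(doc):                 # one map task per input document
        return [(w, 1) for w in doc.split()]

    def reduce_fn(item):             # one reduce task per distinct key
        word, counts = item
        return (word, sum(counts))

    if __name__ == "__main__":
        docs = ["see spot run", "run spot run", "see the cat"]
        with Pool() as pool:
            # map tasks run in parallel over different input splits
            mapped = pool.map(map_fn, docs)
            # shuffle: group all intermediate values by key
            groups = defaultdict(list)
            for pairs in mapped:
                for k, v in pairs:
                    groups[k].append(v)
            # reduce tasks run in parallel over different keys
            print(dict(pool.map(reduce_fn, groups.items())))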
Implementation Overview ❏ 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory ❏ Commodity networking hardware is used ❏ Storage is on local IDE disks ❏ GFS: distributed file system manages data ❏ Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
High-level MapReduce Pipeline Picture from http://mapreduce-tutorial.blogspot.com/2011/04/mapreduce-data-flow.html
High-level MapReduce Pipeline Picture from https://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0008.html
Question 1 Use Figure 1 to explain a MR program’s execution.
Figure 1 from the Google MapReduce paper (OSDI ’04)
Question 2 Describe how MR handles worker and master failures
Fault Tolerance
❏ Failures are detected via periodic heartbeats (the master pings each worker)
❏ Worker failure
  ❏ In-progress map and reduce tasks are rescheduled
  ❏ Completed map tasks are also re-executed (their output lives on the failed worker’s local disk)
  ❏ Completed reduce tasks do not need to be re-executed (their output is on GFS)
❏ Master failure
  ❏ Abort the computation (clients can retry the MapReduce operation)
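A toy sketch of the worker-failure rule, assuming invented data structures and an illustrative HEARTBEAT_TIMEOUT; the real master also tracks full task state machines and tells reducers when map-output locations change:

    import time

    HEARTBEAT_TIMEOUT = 10.0  # illustrative: seconds of silence before a worker is marked failed

    def handle_worker_failures(workers, tasks, now=None):
        """Re-schedule tasks owned by workers that stopped sending heartbeats.
        workers: dict worker_id -> last_heartbeat_time
        tasks:   list of dicts with 'type' ('map'/'reduce'), 'state', 'worker'
        """
        now = now if now is not None else time.time()
        failed = {w for w, last in workers.items() if now - last > HEARTBEAT_TIMEOUT}
        for task in tasks:
            if task["worker"] in failed:
                if task["state"] == "in_progress":
                    task["state"] = "idle"      # map or reduce: re-schedule on another worker
                elif task["state"] == "completed" and task["type"] == "map":
                    task["state"] = "idle"      # map output lived on the failed worker's local disk
                # completed reduce tasks keep their state: output is already on GFS
        return failed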
Question 3 Compared with traditional parallel programming models, such as multithreading and MPI, what are the major advantages of MapReduce? Ease of use, scalability, and reliability
Comparison with Traditional Models Picture from http://ranger.uta.edu/~sjiang/CSE6350-spring-18/lecture-8.pdf
Locality ❏ The master divides up tasks based on the location of the input data: it tries to schedule each map() task on a machine that holds a replica of its input, or as “near” to one as possible. ❏ Map task inputs are split into 16-64 MB pieces; the GFS chunk size is 64 MB.
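A hedged sketch of that scheduling preference; pick_worker, split_replicas, and topology are hypothetical names, and the rack-aware fallback is a simplification of what the master actually does:

    def pick_worker(split_replicas, idle_workers, topology):
        """Prefer an idle worker holding a replica of the input split, then a
        rack-local worker, then any idle worker. topology maps worker -> rack."""
        local = [w for w in idle_workers if w in split_replicas]
        if local:
            return local[0]
        replica_racks = {topology[w] for w in split_replicas}
        rack_local = [w for w in idle_workers if topology[w] in replica_racks]
        if rack_local:
            return rack_local[0]
        return idle_workers[0] if idle_workers else None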
Task Granularity And Pipelining Fine granularity tasks: many more map tasks than machines ❏ Minimizes time for fault recovery ❏ Can pipeline shuffling with map execution ❏ Better dynamic load balancing
Task Granularity And Pipelining Picture from https://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0009.html
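The split into R reduce tasks is driven by a partitioning function; the paper’s default is hash(key) mod R. A one-line Python equivalent, with crc32 as a stand-in for the unspecified hash:

    import zlib

    def partition(key, R):
        """hash(key) mod R decides which of the R reduce tasks (and hence which
        output file) receives this intermediate key."""
        return zlib.crc32(key.encode()) % R

    # With R = 4, every occurrence of "spot" lands in the same reduce partition.
    print(partition("spot", 4))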
Question 4 The implementation of MapReduce enforces a barrier between the Map and Reduce phases, i.e., no reducers can proceed until all mappers have completed their assigned workload. For higher efficiency, is it possible for a reducer to start its execution earlier, and why? (clue: think of availability of inputs to reducers)
Backup Tasks Slow workers significantly delay completion time ❏ Other jobs consuming resources on machine ❏ Bad disks w/ soft errors transfer data slowly ❏ Weird things: processor caches disabled (!!) Solution: Near end of phase, schedule backup tasks ❏ Whichever one finishes first "wins"
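A toy stand-in for the “first one wins” rule, racing two attempts of the same idempotent task on one machine; the real master instead schedules backup executions of the remaining in-progress tasks across the cluster:

    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

    def run_with_backup(task, attempts=2):
        """Schedule the same (idempotent) task twice and return whichever copy
        finishes first; the straggler's output is simply ignored."""
        pool = ThreadPoolExecutor(max_workers=attempts)
        futures = [pool.submit(task) for _ in range(attempts)]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        pool.shutdown(wait=False)   # a real master would also mark the backup task obsolete
        return next(iter(done)).result()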
Sort Performance ❏ 10^10 100-byte records (1 TB of data, 1,800 nodes)
Refinement
❏ Sorting guarantees within each reduce partition
❏ Combiner
  ❏ Partial reduce on the map side, before the shuffle
  ❏ Useful for saving network bandwidth
❏ User-defined counters
  ❏ Useful for debugging
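For associative and commutative reductions like counting, a combiner is essentially the reduce function applied to each map task’s local output before anything crosses the network. A minimal sketch (map_fn and combine are illustrative names):

    from collections import Counter

    def map_fn(doc):
        return [(w, 1) for w in doc.split()]

    def combine(pairs):
        """Pre-aggregate on the map side: ("see", 1) emitted 1000 times becomes
        ("see", 1000), so far less data is shuffled to the reducers."""
        counts = Counter()
        for word, n in pairs:
            counts[word] += n
        return list(counts.items())

    print(combine(map_fn("see spot run see see")))
    # [('see', 3), ('spot', 1), ('run', 1)]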