Shuffle Phase
- Executed only when the job has one or more reducers
- Transfers data from the mappers to the reducers
- Groups records by key so that each group can be processed locally in the reduce phase
01/23/2018
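The grouping step can be sketched in a few lines. This is a minimal illustration, assuming map outputs are plain Python (key, value) tuples; the function name `shuffle` and the sample data are hypothetical and not part of any framework.

```python
from collections import defaultdict

def shuffle(map_outputs):
    """Group the (key, value) pairs emitted by all mappers by key."""
    groups = defaultdict(list)
    for mapper_output in map_outputs:      # one list of pairs per mapper
        for key, value in mapper_output:
            groups[key].append(value)
    return dict(groups)

# Hypothetical output of two mappers.
map_outputs = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
]
grouped = shuffle(map_outputs)
```

After the call, every value that shares a key sits in one list, which is what lets a reducer handle a whole group without contacting other machines.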
Shuffle Phase
[Figure: map tasks Map 1, Map 2, Map 3, …, Map M connected to reduce tasks Reduce 1, Reduce 2, …, Reduce N]
Shuffle Phase (Map-side)
[Figure: Map i reads an input split, applies map to it, and partitions the resulting key-value pairs into partitions 0, 1, …, N-1, which are sent to Reduce 1, Reduce 2, …, Reduce N]
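A rough sketch of map-side partitioning, assuming a hash-based partition function and string keys. The names `partition` and `partition_map_output` are hypothetical, and sorting each partition by key is an assumption added here to mirror the sorted map output that the reduce side can then merge.

```python
import zlib

def partition(key, num_reducers):
    """Return the partition index (0 .. num_reducers - 1) for a key."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_map_output(pairs, num_reducers):
    """Split one mapper's output into num_reducers partitions, each sorted by key."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
    return partitions

# Hypothetical output of one mapper, split for three reducers.
pairs = [("b", 1), ("a", 1), ("c", 1), ("a", 1)]
parts = partition_map_output(pairs, 3)
```

Because the partition index depends only on the key, all records with the same key end up at the same reducer no matter which mapper produced them.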
Shuffle Phase (Reduce-side)
[Figure: Reduce j copies its partition (part 1, part 2, part 3, …, part M) from Map 1, Map 2, Map 3, …, Map M, sorts the copied key-value pairs, and then passes them to reduce]
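The copy and sort steps on the reduce side might look like the following sketch. It assumes each copied part is already sorted by key, so the sort becomes a merge; `copy_and_sort` and `group_by_key` are hypothetical helper names, and the parts are in-memory lists rather than data fetched over the network.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def copy_and_sort(parts):
    """Merge the already-sorted parts copied from each mapper into one sorted stream."""
    return heapq.merge(*parts, key=itemgetter(0))

def group_by_key(sorted_pairs):
    """Yield (key, [values]) for each run of equal keys in the sorted stream."""
    for key, run in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, [value for _, value in run]

parts = [
    [("a", 1), ("c", 1)],   # part copied from Map 1
    [("a", 1), ("b", 1)],   # part copied from Map 2
    [("c", 1)],             # part copied from Map 3
]
groups = list(group_by_key(copy_and_sort(parts)))
```

Merging sorted parts keeps equal keys adjacent, so the reducer can stream through the groups without holding the whole input in memory.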
Reduce Phase
- Apply the reduce function to each group of records that share the same key
[Figure: for each key k 1, k 2, k 3, …, k N, the group of (k, v) pairs with that key is passed to one reduce call, and the results of all reduce calls form the output]
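As an illustration of one reduce call per group, here is a small sketch. The reduce function `reduce_sum` and the driver `run_reduce` are hypothetical names, and summing is just one possible reduce function.

```python
def reduce_sum(key, values):
    """Example reduce function: emit one (key, total) record per group."""
    yield key, sum(values)

def run_reduce(groups, reduce_fn):
    """Call reduce_fn once per (key, values) group and collect its output."""
    output = []
    for key, values in groups:
        output.extend(reduce_fn(key, values))
    return output

# Hypothetical grouped input as produced by the shuffle phase.
groups = [("k1", [1, 2]), ("k2", [5]), ("k3", [1, 1, 1])]
result = run_reduce(groups, reduce_sum)
```

Writing the reduce function as a generator lets a single group emit zero, one, or many output records.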
Output Writing
- Materializes the final output to disk
- All results from one process (mapper or reducer) are stored in a subdirectory
- An OutputFormat is used to
  - Create any files in the output directory
  - Write the output records one by one to the output
  - Merge the results from all the tasks (if needed)
- While output writing runs in parallel, the final commit step runs on a single machine
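The write-then-commit pattern can be sketched on a local file system. This is not Hadoop's OutputFormat API; the functions `write_task_output` and `commit_job`, the `_temporary` directory name, and the `part-NNNNN` file names are assumptions chosen for illustration.

```python
import os
import tempfile

def write_task_output(output_dir, task_id, records):
    """Write one task's records to its own temporary subdirectory."""
    task_dir = os.path.join(output_dir, "_temporary", f"task_{task_id}")
    os.makedirs(task_dir, exist_ok=True)
    path = os.path.join(task_dir, f"part-{task_id:05d}")
    with open(path, "w", encoding="utf-8") as f:
        for key, value in records:          # records are written one at a time
            f.write(f"{key}\t{value}\n")
    return path

def commit_job(output_dir):
    """Single-process commit: move every part file into the output directory."""
    temp_root = os.path.join(output_dir, "_temporary")
    for task_name in sorted(os.listdir(temp_root)):
        task_dir = os.path.join(temp_root, task_name)
        for name in os.listdir(task_dir):
            os.replace(os.path.join(task_dir, name), os.path.join(output_dir, name))
        os.rmdir(task_dir)
    os.rmdir(temp_root)

output_dir = tempfile.mkdtemp()
write_task_output(output_dir, 0, [("a", 2)])   # tasks could run in parallel
write_task_output(output_dir, 1, [("b", 1)])
commit_job(output_dir)                          # runs once, on one machine
committed = sorted(os.listdir(output_dir))
```

Keeping each task's files in a private subdirectory until the commit means a failed or duplicated task never leaves partial files in the final output.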
MapReduce Examples
- Input: A log file
- Filter
- Aggregation
- Conversion
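The three examples could be sketched as follows. The log format (date, method, path, status code separated by spaces), the sample lines, and all function names are assumptions made for illustration; the filter and conversion jobs are shown as map-only, while the aggregation uses a reducer.

```python
from collections import defaultdict

# Hypothetical log lines: date, method, path, status code.
LOG = [
    "2018-01-23 GET /index.html 200",
    "2018-01-23 GET /missing 404",
    "2018-01-23 POST /login 200",
]

def filter_map(line):
    """Filter: emit only lines whose status code is not 200 (map-only job)."""
    if line.split()[3] != "200":
        yield line

def count_map(line):
    """Aggregation, map side: emit (status, 1) for every line."""
    yield line.split()[3], 1

def count_reduce(status, ones):
    """Aggregation, reduce side: total the counts for one status code."""
    return status, sum(ones)

def convert_map(line):
    """Conversion: rewrite a space-separated line as comma-separated (map-only job)."""
    yield ",".join(line.split())

filtered = [out for line in LOG for out in filter_map(line)]
converted = [out for line in LOG for out in convert_map(line)]

# Aggregation needs a shuffle: group the map output by status, then reduce.
groups = defaultdict(list)
for line in LOG:
    for status, one in count_map(line):
        groups[status].append(one)
counts = dict(count_reduce(s, v) for s, v in groups.items())
```

Only the aggregation groups records by key, which is why it is the one job here that needs reducers and therefore a shuffle phase.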
Advanced Issues
- Map failures
- Reduce failures
- Straggler problem
- Custom keys and values
- Efficient sorting on serialized data
- Pipeline MapReduce jobs