Hadoop MapReduce
MapReduce 2-in-1
- A programming paradigm
- A query execution engine
- A kind of functional programming
We focus on the MapReduce execution engine of Hadoop through YARN.
Logical View of MapReduce
In MapReduce, the input, intermediate, and output data are all modeled as sets of key-value pairs ⟨k, v⟩.
[Diagram: Input data ⟨k1, v1⟩ → Map → Intermediate ⟨k2, v2⟩ → Reduce → Output ⟨k3, v3⟩]
Map and Reduce Functions
Map function: maps a single input record to a set (possibly empty) of intermediate records.
  Map: ⟨k1, v1⟩ → {⟨k2, v2⟩}
Reduce function: reduces a set of intermediate records sharing the same key to a set (possibly empty) of output records.
  Reduce: ⟨k2, {v2}⟩ → {⟨k3, v3⟩}
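The signatures above can be made concrete with the classic word-count example. The sketch below is in Python for readability (Hadoop's real API is Java); the function names are illustrative, not Hadoop's:

```python
# Illustrative sketch of the map and reduce signatures using word count.
# map_fn and reduce_fn are hypothetical names, not Hadoop API methods.

def map_fn(key, value):
    """Map: <k1, v1> -> {<k2, v2>}. key = byte offset, value = line text."""
    for word in value.split():
        yield (word, 1)            # one intermediate pair per word occurrence

def reduce_fn(key, values):
    """Reduce: <k2, {v2}> -> {<k3, v3>}. values = all counts for one word."""
    yield (key, sum(values))       # one output record per distinct word

# One input record produces several intermediate records:
intermediate = list(map_fn(0, "the quick the"))
# intermediate == [("the", 1), ("quick", 1), ("the", 1)]
# One reduce call handles all values for one key:
output = list(reduce_fn("the", [1, 1]))
# output == [("the", 2)]
```

Note that both functions receive everything they need as arguments and keep no state between calls, which matters for the next slide.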
Functional Programming
MapReduce is functional programming: both the map and reduce functions are memoryless/stateless.
- They cannot keep an internal state
- They cannot remember previous records
- They must be deterministic (not randomized)
Why? To allow Hadoop to parallelize the execution:
- Execute them out of order
- Rerun failing tasks
Overview
[Diagram: the developer writes an MR driver program, which submits an MR job to the master node; the master node coordinates the slave nodes]
Job Execution Overview
Driver → Job submission → Job preparation → Map → Shuffle → Reduce → Cleanup
Job Submission
Execution location: the driver node.
The driver machine needs:
- Compatible Hadoop binaries
- Cluster configuration files
- Network access to the master node
It collects job information from the user:
- Input and output paths
- Map, reduce, and any other functions
- Any additional user configuration
All of this is packaged in a Hadoop Configuration.
Hadoop Configuration
A set of string key-value pairs (plus the JAR file of user-defined classes), serialized over the network to the master node:

Key      | Value
---------|--------------------------------------
Input    | hdfs://user/eldawy/README.txt
Output   | hdfs://user/eldawy/wordcount
Mapper   | edu.ucr.cs.cs167.eldawy.WordCount
Reducer  | …
…        | …
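Since every key and value is a string, the configuration can be pictured as a plain string map; a minimal Python sketch (the key names below mirror the slide and are illustrative, not Hadoop's exact property names):

```python
# Hypothetical sketch of a Hadoop Configuration: a string-to-string map
# that is serialized and shipped to the master node with the JAR file.
job_conf = {
    "input":  "hdfs://user/eldawy/README.txt",
    "output": "hdfs://user/eldawy/wordcount",
    "mapper": "edu.ucr.cs.cs167.eldawy.WordCount",
    # "reducer": ... (elided on the slide)
}

# Everything is stored as a string; typed getters parse values on access:
def get_int(conf, key, default):
    return int(conf.get(key, default))
```

The all-strings design keeps serialization trivial; typing is the reader's responsibility, as in Hadoop's `Configuration.getInt` and friends.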
Job Preparation
Runs on the master node; gets the job ready for parallel execution:
- Collects the JAR file that contains the user-defined functions, e.g., map and reduce
- Writes the JAR and configuration to HDFS to be accessible by the executors
- Looks at the input file(s) to decide how many map tasks are needed
- Makes some sanity checks
Finally, it pushes the BRB (Big Red Button).
Job Preparation
[Diagram: the master node calls InputFormat#getSplits() on the configuration; each resulting FileInputSplit (path, start, end) is assigned to one mapper: Split 1 → Mapper 1, Split 2 → Mapper 2, …, Split M → Mapper M. The JAR file is written to HDFS.]
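The effect of InputFormat#getSplits() can be sketched as dividing a file of known size into fixed-size splits, each described by (path, start, end); one map task is created per split. This Python sketch is a simplification (the real Java logic also considers HDFS block boundaries and multiple files):

```python
# Illustrative sketch of input splitting: a file of `file_size` bytes is cut
# into splits of at most `split_size` bytes; one map task per split.
def get_splits(path, file_size, split_size):
    splits = []
    start = 0
    while start < file_size:
        end = min(start + split_size, file_size)
        splits.append((path, start, end))   # (path, start, end) like FileInputSplit
        start = end
    return splits

# e.g., a 300 MB file with 128 MB splits yields 3 splits, hence 3 map tasks
splits = get_splits("hdfs://user/eldawy/README.txt", 300, 128)
```

This is how the master knows the number of map tasks before any of them runs.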
Map Phase
Runs in parallel on worker nodes. The M mappers:
- Read the input
- Apply the map function
- Apply the combine function (if configured)
- Store the map output
There is no guaranteed ordering for processing the input splits.
Map Phase
[Diagram: the master node assigns input splits IS 1 … IS M to the map tasks running on the worker nodes]
Map Task
1. Reads the job configuration and task information (mainly, the InputSplit)
2. Instantiates an object of the Mapper class
3. Instantiates a record reader for the assigned input split
4. Calls Mapper#setup(Context)
5. Reads records one by one from the record reader and passes them to the map function
6. The map function writes its output to the context
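The steps above amount to one simple loop. A Python sketch of what the Java map task does; `WordCountMapper` and `ListContext` are hypothetical stand-ins for Hadoop's Mapper, RecordReader iteration, and MapContext:

```python
# Illustrative sketch of a map task's lifecycle.
def run_map_task(mapper, reader, context):
    mapper.setup(context)                  # one-time initialization
    for key, value in reader:              # records from the assigned input split
        mapper.map(key, value, context)    # map writes its output via the context
    # (Hadoop also calls Mapper#cleanup(Context) here for one-time teardown)

class WordCountMapper:
    def setup(self, context):
        pass
    def map(self, key, value, context):
        for word in value.split():
            context.write(word, 1)

class ListContext:
    """Stand-in for MapContext: collects the map output."""
    def __init__(self):
        self.out = []
    def write(self, k, v):
        self.out.append((k, v))

ctx = ListContext()
run_map_task(WordCountMapper(), [(0, "hello world")], ctx)
# ctx.out == [("hello", 1), ("world", 1)]
```

Because the mapper only talks to the reader and the context, the same task can be rerun from scratch on failure.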
MapContext
- Keeps track of which input split is being read and which records are being processed
- Holds all the job configuration and some additional information about the map task
- Materializes the map output
Map Output
What happens to the map output depends on the number of reducers:
- 0 reducers: the map output is written directly to HDFS as the final answer
- 1+ reducers: the map output is passed to the shuffle phase
Shuffle Phase
- Executed only when there is at least one reducer
- Transfers data between the mappers and reducers
- Groups records by their keys to ensure local processing in the reduce phase
Shuffle Phase
[Diagram: every mapper (Map 1 … Map M) sends data to every reducer (Reduce 1 … Reduce N)]
Shuffle Phase (Map-side)
[Diagram: Mapper i reads its input split, applies the map function, and partitions its intermediate ⟨k, v⟩ pairs into N partitions (0 … N−1), one per reducer; partition j is later fetched by Reduce j]
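The map-side partitioning assigns each intermediate key to one of the N partitions so that all records with the same key reach the same reducer. Hadoop's default HashPartitioner does essentially `key.hashCode() mod N`; the sketch below substitutes a deterministic CRC32 for Java's hashCode:

```python
import zlib

# Sketch of map-side partitioning: key k goes to partition hash(k) mod N.
# CRC32 stands in for Java's hashCode so results are reproducible.
def partition(key, num_reducers):
    return zlib.crc32(str(key).encode()) % num_reducers
```

The only property that matters is determinism: every mapper must send equal keys to the same partition number.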
Shuffle Phase (Reduce-side)
[Diagram: Reducer j copies its partition (part 1 … part M) from every mapper (Map 1 … Map M), sorts the merged ⟨k, v⟩ pairs by key, and feeds them to the reduce function]
Reduce Phase
Apply the reduce function to each group of records with the same key.
[Diagram: sorted pairs with keys k1 … kN are grouped; each group ⟨ki, {v}⟩ is passed to one reduce call, and the results are written to the output]
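The reduce side (copy → sort → group → reduce) can be sketched end to end in a few lines of Python; `run_reduce_task` is a hypothetical name for what one reduce task does with the partitions it pulls from the mappers:

```python
from itertools import groupby
from operator import itemgetter

# Illustrative sketch of one reduce task: merge the map outputs copied from
# all mappers, sort by key, group runs of equal keys, apply the reduce function.
def run_reduce_task(partitions_from_mappers, reduce_fn):
    merged = [pair for part in partitions_from_mappers for pair in part]  # copy
    merged.sort(key=itemgetter(0))                                        # sort
    output = []
    for key, group in groupby(merged, key=itemgetter(0)):                 # group
        values = [v for _, v in group]
        output.extend(reduce_fn(key, values))                             # reduce
    return output

# Word count: this reducer's partition as produced by two mappers
out = run_reduce_task(
    [[("the", 1), ("quick", 1)], [("the", 1)]],
    lambda k, vs: [(k, sum(vs))],
)
# out == [("quick", 1), ("the", 2)]
```

Sorting before grouping is what lets Hadoop stream each key group through a single reduce call without holding the whole partition in memory.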
Output Writing
Materializes the final output to disk.
- All results from one process (mapper/reducer) are stored in a subdirectory
- An OutputFormat is used to:
  - Create any files in the output directory
  - Write the output records one by one to the output
  - Merge the results from all the tasks (if needed)
While the output writing runs in parallel, the final commit step runs on a single machine.
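Because each task writes independently, every reducer gets its own file in the output directory; Hadoop names these `part-r-NNNNN` (and `part-m-NNNNN` for map-only jobs). A small sketch of the convention:

```python
# Sketch of per-task output naming: reducer j writes its results to its own
# file, following Hadoop's part-r-NNNNN convention (zero-padded task id).
def output_file_name(output_dir, task_id):
    return f"{output_dir}/part-r-{task_id:05d}"

# output_file_name("hdfs://user/eldawy/wordcount", 0)
# == "hdfs://user/eldawy/wordcount/part-r-00000"
```

Giving each task its own file is what allows the writes to proceed in parallel with no coordination until the final commit.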
Advanced Issues
- Map failures
- Reduce failures
- The straggler problem
- Custom keys and values
- Efficient sorting on serialized data
- Pipelining MapReduce jobs