Data-intensive programming: MapReduce
Timo Aaltonen, Department of Pervasive Computing
Outline • MapReduce
Original Map and Reduce • map(foo, i) – Apply function foo to every item of i and return a list of the results – Example: map(square, [1, 2, 3, 4]) = [1, 4, 9, 16] • reduce(bar, i) – Apply the two-argument function bar • cumulatively to the items of i, • from left to right • to reduce the values in i to a single value – Example: reduce(sum, [1, 4, 9, 16]) = (((1+4)+9)+16) = 30
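The same idea expressed with Java streams (a minimal sketch for illustration only; not part of the lecture material or of Hadoop):

    import java.util.List;

    public class MapReduceIdea {
        public static void main(String[] args) {
            int result = List.of(1, 2, 3, 4).stream()
                    .map(x -> x * x)            // map(square, [1, 2, 3, 4]) -> [1, 4, 9, 16]
                    .reduce(0, Integer::sum);   // reduce(sum, [1, 4, 9, 16]) -> 30
            System.out.println(result);         // prints 30
        }
    }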
MapReduce • MapReduce is a programming model for distributed processing of large data sets • Scales nearly linearly – Twice as many nodes -> twice as fast – Achieved by exploiting data locality • Computation is moved close to the data • Simple programming model – Programmer only needs to write two functions: Map and Reduce
Map & Reduce • Map maps input data to (key, value) pairs • Reduce processes the list of values for a given key • The MapReduce framework (such as Hadoop) takes care of the rest – Distributes the job among nodes – Moves the data to and from the nodes – Handles node failures – etc.
[Figure, two slides: WordCount data flow for the lyrics "Sheena is a punk rocker ... now, Well she's a punk punk". MAP: mapper nodes A–D emit a (word, 1) pair for every word, e.g. (Sheena, 1), (is, 1), (a, 1), (punk, 1), (rocker, 1). SHUFFLE: the pairs are grouped by key and routed to reducer nodes 1–3. REDUCE: the values of each key are summed, e.g. Sheena 21, is 23, a 31, punk 37.]
MapReduce • Map(k1, v1) → list(k2, v2) • Reduce(k2, list(v2)) → list(v3)
Map & Reduce in Hadoop • In Hadoop, Map and Reduce functions can be written in – Java • org.apache.hadoop.mapreduce.lib – C++ using Hadoop Pipes – any language, using Hadoop Streaming • Also a number of third party programming frameworks for Hadoop MapReduce – For Java, Scala, Python, Ruby, PHP, …
Mapper Java example • The Mapper input types depend on the defined InputFormat – by default TextInputFormat • Key (LongWritable): position (byte offset) of the line in the file • Value (Text): the line itself • Output types here: Text (the word) and IntWritable (the count)

    public class WordCount {

      public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String txt = value.toString();
          String[] items = txt.split("\\s");          // split the line on whitespace
          for (int i = 0; i < items.length; i++) {
            Text word = new Text(items[i]);
            IntWritable one = new IntWritable(1);
            context.write(word, one);                 // emit (word, 1)
          }
        }
      }
Reducer Java example • Input types: Text, IntWritable (must match the Mapper output types) • Output types: Text, IntWritable

      public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable value : values) {
            sum = sum + value.get();                  // add up the 1s emitted for this word
          }
          context.write(key, new IntWritable(sum));   // emit (word, total count)
        }
      }
Job of the MapReduce example • The main driver program configures the MapReduce job and submits it to the Hadoop YARN cluster:

      public static void main(String[] args) throws Exception {
        …
        Job job = Job.getInstance();
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
• The key and value classes for the map output and the final (Reducer) output are defined:

    Map output key and value types:
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

    Final output key and value types:
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

    (If the map output types equal the final output types, the setMapOutput* calls can be omitted.)
Job of the MapReduce example • FileInputFormat feeds the input splits to the map instances from the wc-files directory:
        FileInputFormat.addInputPath(job, new Path("wc-files"));
• FileOutputFormat writes the result to the wc-output directory:
        FileOutputFormat.setOutputPath(job, new Path("wc-output"));
• waitForCompletion(true) submits the job and waits until the Hadoop runtime environment has executed it:
        job.waitForCompletion(true);
      } // main
    } // class
MapReduce WordCount Demo • Building the program:
    % mkdir wordcount_classes
    % javac -classpath `hadoop classpath` \
        -d wordcount_classes WordCount.java
    % jar -cvf wordcount.jar -C wordcount_classes/ .
• Submit WordCount to YARN:
    % hadoop jar wordcount.jar fi.tut.WordCount /data output
MapReduce WordCount Demo • The result: % hdfs dfs -ls output Found 2 items -rw-r--r-- 1 hduser supergroup 0 2016-09-08 11:46 output/_SUCCESS -rw-r--r-- 1 hduser supergroup 470923 2016-09-08 11:46 output/part-r-00000 % hdfs dfs -get output/part-r-00000 % tail -10 part-r-00000 zoology, 2 zu 2 À 2 à 4 çela, 1 …
Fault Tolerance and Speculative Execution • Faults are handled by restarting the failed tasks • All of this is managed in the background by the framework • The programmer does not need to manage side effects or process state • Speculative execution runs duplicate copies of slow (straggler) tasks on other nodes, which helps prevent bottlenecks
Combiners [Figure: on map node 1, the Combiner merges the map output pairs (A, 1), (A, 1), (A, 1) into (A, 3), which is then sent to the reduce node for key A; the pair (B, 1) goes to a different reducer.]
Combiners • Combiner can "compress" data on a mapper node before sending it forward • Combiner input/output types must equal the mapper output types • In Hadoop Java, Combiners use the Reducer interface: job.setCombinerClass(MyReducer.class);
Reducer as a Combiner • The Reducer can be used as a Combiner if it is commutative and associative – E.g. max is: max(1, 2, max(3, 4, 5)) = max(max(2, 4), max(1, 5, 3)) = 5, and this holds for any order of function applications – E.g. avg is not: avg(1, 2, avg(3, 4, 5)) = 2.333… ≠ avg(avg(2, 4), avg(1, 5, 3)) = 3 • Note: if the Reducer is not commutative and associative, Combiners can still be used – the Combiner just has to be different from the Reducer and designed for the specific case (see the sketch below)
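As an illustration of such a case-specific Combiner, here is a minimal sketch (not from the lecture material; the class names and the "key<TAB>integer" input format are assumptions) for computing per-key averages. The mapper emits Text values of the form "sum,count", which the Combiner can merge associatively and commutatively even though plain avg cannot:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class Average {

      // Mapper: emit the value together with a count of 1, as "value,1"
      public static class AvgMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] parts = value.toString().split("\t");   // assumed input: "key<TAB>integer"
          context.write(new Text(parts[0]), new Text(parts[1] + ",1"));
        }
      }

      // Combiner: merge partial (sum, count) pairs on the map node
      public static class AvgCombiner extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          long sum = 0, count = 0;
          for (Text v : values) {
            String[] p = v.toString().split(",");
            sum += Long.parseLong(p[0]);
            count += Long.parseLong(p[1]);
          }
          context.write(key, new Text(sum + "," + count));
        }
      }

      // Reducer: compute the final average from the merged (sum, count) pairs
      public static class AvgReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          long sum = 0, count = 0;
          for (Text v : values) {
            String[] p = v.toString().split(",");
            sum += Long.parseLong(p[0]);
            count += Long.parseLong(p[1]);
          }
          context.write(key, new DoubleWritable((double) sum / count));
        }
      }
    }

The driver would then call job.setCombinerClass(Average.AvgCombiner.class) and set the map output value class to Text and the final output value class to DoubleWritable.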
Adding a Combiner to WordCount [Figure: the map output pairs (walk, 1), (run, 1), (walk, 1) are combined on the map node into (run, 1), (walk, 2) before the shuffle.]
Hadoop Streaming • Map and Reduce functions can be implemented in any language with the Hadoop Streaming API • Input is read from standard input • Output is written to standard output • Input/output items are lines of the form key \t value – \t is the tab character • Reducer input lines are grouped (sorted) by key – one reducer instance may receive multiple keys (see the sketch below)
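A minimal sketch of the reducer side of this protocol for WordCount (illustration only; any language works, Java is used here just to match the deck's other examples, and the class name is made up). The mapper side would simply print one "word\t1" line per word:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class StreamingWordCountReducer {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String currentKey = null;
            long sum = 0;
            String line;
            // Input lines arrive on stdin sorted by key, as "word<TAB>count"
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                if (currentKey != null && !currentKey.equals(parts[0])) {
                    System.out.println(currentKey + "\t" + sum);   // key changed: emit the total
                    sum = 0;
                }
                currentKey = parts[0];
                sum += Long.parseLong(parts[1]);
            }
            if (currentKey != null) {
                System.out.println(currentKey + "\t" + sum);       // flush the last key
            }
        }
    }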
Run Hadoop Streaming • Debug using Unix pipes: cat sample.txt | ./mapper.py | sort | ./reducer.py • On Hadoop: hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \ -input sample.txt \ -output output \ -mapper ./mapper.py \ -reducer ./reducer.py
MapReduce Examples • These are from https://highlyscalable.wordpress.com/2012/02/01/ma preduce-patterns/ • Counting and Summing – WordCount • Filtering (“Grepping”), Parsing, and Validation – Problem : There is a set of records and it is required to collect all records that meet some condition or transform each record (independently from other records) into another representation – Solution : Mapper takes records one by one and emits accepted items or their transformed versions
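A minimal sketch of the filtering pattern (illustration only; the class name and the "ERROR" filter condition are made up). With job.setNumReduceTasks(0) this runs as a map-only job and the accepted records are written directly to the output:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit only the lines that meet the condition; drop everything else
            if (value.toString().contains("ERROR")) {
                context.write(value, NullWritable.get());
            }
        }
    }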
MapReduce Examples • Collating – Problem statement: There is a set of items and some function of one item. It is required to save all items that have the same function value into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is building an inverted index. – Solution: The Mapper computes the given function for each item and emits the function value as the key and the item itself as the value. The Reducer obtains all items grouped by function value and processes or saves them. For an inverted index, the items are terms (words) and the function value is the ID of the document where the term was found (see the figure and sketch below).
[Figure: inverted-index example – Page A contains the text "This page contains text" and Page B contains "My page contains too"; the reduced output index is: This: A; page: A, B; contains: A, B; text: A; My: B; too: B.]
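A minimal sketch of this pattern in Hadoop Java (illustration only; the class names and the assumption that each input line has the form "docId<TAB>text" are made up):

    import java.io.IOException;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InvertedIndex {

        // Mapper: the "function of one item" is the document ID, so emit (term, docId)
        public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split("\t", 2);   // "docId<TAB>text"
                if (parts.length < 2) {
                    return;                                         // skip malformed lines
                }
                for (String term : parts[1].split("\\s+")) {
                    context.write(new Text(term), new Text(parts[0]));
                }
            }
        }

        // Reducer: all document IDs containing a term arrive grouped together
        public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                Set<String> docs = new LinkedHashSet<>();           // deduplicate document IDs
                for (Text docId : values) {
                    docs.add(docId.toString());
                }
                context.write(key, new Text(String.join(", ", docs)));
            }
        }
    }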
Conclusions • Map tasks spread out the load (the processing logic is moved to the data) – a job may have hundreds, thousands, or even millions of mappers – fast for large amounts of data