comments basic mapreduce program design
play

Comments Basic MapReduce Program Design Tasks that can be performed - PDF document

Comments Basic MapReduce Program Design Tasks that can be performed independently on a Programming model might appear very limited data object, large number of them: Map But, map and reduce can do anything with their Tasks that


  1. Comments Basic MapReduce Program Design • Tasks that can be performed independently on a • Programming model might appear very limited data object, large number of them: Map • But, map and reduce can do anything with their • Tasks that require combining of multiple data input objects: Reduce – Could implement a Turing machine inside… • Sometimes it is easier to start program design – …which could compute anything, but… with Map, sometimes with Reduce • Select keys and values such that the right objects – …would not result in a good parallel implementation. end up together in the same Reduce invocation • Challenge: find best MapReduce implementation • Might have to partition a complex task into for a given problem multiple MapReduce sub-tasks 22 23 Choosing M and R Grep • M = number of map tasks, R = number of reduce tasks • Find all lines matching some pattern • Larger M, R: creates smaller tasks, enabling easier load • No need to combine anything balancing and faster recovery (many small tasks from failed machine) – Reduce is not needed, i.e., just identity function • Limitation: O(M+R) scheduling decisions and O(M  R) in- • Map takes line and outputs it if it matches the memory state at master pattern – Very small tasks not worth the startup cost • Recommendation: • Map could also take an entire document and emit – Choose M so that split size is approximately 64 MB all matching lines – Choose R a small multiple of the number of workers; – Not a good idea if there is a single large document, alternatively choose R a little smaller than #workers to finish reduce phase in one “wave” but works well if there are many documents 24 25 Reverse Web-Link Graph Inverted Index • For each URL, find all pages (URLs) pointing to it • For each word, create list of documents (incoming links) (document IDs) containing it • Problem: Web page has only outgoing links • Same as reverse Web-link graph problem • Need all (anySource, P) links for each page P – “Source URL” is now “document ID” – Suggests Reduce with P as the key, source as value – “Target URL” is now “word” • Map: for page source , create all ( target , source ) • Can augment this to create list of (document pairs for each link to a target found in page ID, position) pairs for each word • Reduce: since target is key, will receive all sources – Map emits (word, (document ID, position)) while pointing to that target parsing a document 26 27 1

  2. Distributed Sorting Distributed Sorting, Revisited • Can Map do pre-sorting and Reduce the • Quicksort-style partitioning merging? • For simplicity, consider case with 2 machines – Use set of input records as Map input – Goal: each machine sorts about half of the data – Map pre-sorts it and single reducer merges them • Assuming we can find the median record, – Does not scale! assign all smaller records to machine 1, all • We need to get multiple reducers involved others to machine 2 – What should we use as the intermediate key? • Sort locally on each machine, then “concatenate” output 28 29 Partitioning Sort in MapReduce Partitioning Sort in MapReduce • Consider 2 reducers for simplicity • MapReduce has class Partitioner<KEY, VALUE> • Run MapReduce job to find approximate median of – Method int getPartition(KEY key, VALUE value, int data numPartitions) allows assigning keys to partitions – Hadoop also offers InputSampler • Example for numPartitions = 2 • Writes the keys that define the partitions, to be used by TotalOrderPartitioner – Partition 1 gets all numbers less than median • Runs on client and downloads input data splits, hence only useful – Partition 2 gets all larger numbers if data is sampled from few splits, i.e., splits themselves should contain random data samples • What about concatenating the output? • Map outputs (sortKey, record) for an input record – Not necessary, except for many small files (big files are • All sortKey < median are assigned to reduce task 1, all broken up anyway) others to reduce task 2, using a partitioner • Reduce sorts its assigned set of records • Generalizes obviously to more reducers 30 31 package org.apache.hadoop.examples; import java.io.IOException; import java.net.URI; import java.util.*; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; MapReduce and Key Sorting import org.apache.hadoop.filecache.DistributedCache; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.BytesWritable; import org.apache.hadoop.io.Writable; import org.apache.hadoop.io.WritableComparable; import org.apache.hadoop.mapred.*; import org.apache.hadoop.mapred.lib.IdentityMapper; import org.apache.hadoop.mapred.lib.IdentityReducer; import org.apache.hadoop.mapred.lib.InputSampler; import org.apache.hadoop.mapred.lib.TotalOrderPartitioner; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; • MapReduce environment guarantees that for /** * This is the trivial map/reduce program that does absolutely nothing each reduce task the assigned set of intermediate * other than use the framework to fragment and sort the input values. * * To run: bin/hadoop jar build/hadoop-examples.jar sort keys is processed in key order * [-m <i>maps</i>] [-r <i>reduces</i>] * [-inFormat <i>input format class</i>] * [-outFormat <i>output format class</i>] – After receiving all (key2, val2) pairs from mappers, * [-outKey <i>output key class</i>] * [-outValue <i>output value class</i>] reducer sorts them by key2, then calls Reduce on each * [-totalOrder <i>pcnt</i> <i>num samples</i> <i>max splits</i>] * <i>in-dir</i> <i>out-dir</i> (key2, list(val2)) group in order */ Sort Code in Hadoop 1.0.3 Distribution; public class Sort<K,V> extends Configured implements Tool { • Can leverage this guarantee for partitioning sort private RunningJob jobResult = null; part 1: boilerplate code static int printUsage() { System.out.println("sort [-m <maps>] [-r <reduces>] " + – Reduce simply emits the records unchanged "[-inFormat <input format class>] " + "[-outFormat <output format class>] " + – No need for user sort code in Reduce function! "[-outKey <output key class>] " + "[-outValue <output value class>] " + "[-totalOrder <pcnt> <num samples> <max splits>] " + "<input> <output>"); ToolRunner.printGenericCommandUsage(System.out); return -1; 32 33 } 2

Recommend


More recommend