Mrs: MapReduce for Scientific Computing in Python
Andrew McNabb, Jeff Lund, and Kevin Seppi
Brigham Young University
November 16, 2012

Outline: MapReduce, MapReduce in Scientific Computing, Mrs Features, Performance and Case Studies

MapReduce
- Large-scale problems require parallel processing
- Communication in parallel processing is hard
- MapReduce abstracts away interprocess communication
- The user only has to identify which parts of the problem are embarrassingly parallel

MapReduce
[Figure: MapReduce data flow. Input splits feed map tasks, whose outputs feed reduce tasks.]

WordCount
wordcount.py

    import mrs

    class WordCount(mrs.MapReduce):
        def map(self, line_num, line_text):
            for word in line_text.split():
                yield (word, 1)

        def reduce(self, word, counts):
            yield sum(counts)

    if __name__ == '__main__':
        mrs.main(WordCount)

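What a framework does with these two methods can be sketched without Mrs at all. The following is a minimal serial illustration, not Mrs's implementation: `map_fn`, `reduce_fn`, and `run_serial` are hypothetical names, and the "shuffle" between the two phases is assumed to be a plain in-memory dictionary.

```python
from collections import defaultdict


def map_fn(line_num, line_text):
    """Emit (word, 1) for every word, mirroring WordCount.map."""
    for word in line_text.split():
        yield (word, 1)


def reduce_fn(word, counts):
    """Sum the counts for one word, mirroring WordCount.reduce."""
    yield sum(counts)


def run_serial(lines):
    """Hypothetical serial driver: map, group values by key, then reduce."""
    groups = defaultdict(list)
    for line_num, line_text in enumerate(lines):
        for key, value in map_fn(line_num, line_text):
            groups[key].append(value)  # the "shuffle": group values by key
    results = {}
    for key, values in groups.items():
        for total in reduce_fn(key, values):
            results[key] = total
    return results


counts = run_serial(["the quick brown fox", "the lazy dog", "the end"])
```

In a parallel run the map calls and the reduce calls would be spread across machines; the user-written functions stay the same.
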
Iterative MapReduce
[Figure: iterative MapReduce data flow. Each input feeds a chain of alternating map and reduce tasks: Input, Map, Reduce, Map, Reduce.]

Hadoop
- Hadoop is the most widely used open source MapReduce implementation
- Hadoop was designed for big data, not scientific computing
- Requires the use of HDFS and a dedicated cluster

MapReduce in Scientific Computing
What does an ideal MapReduce implementation look like in the context of scientific computing?

Ease of Development
- Rapid prototyping
- Testability
- Debuggability

Ease of Development
WordCount.java

    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
                StringTokenizer itr =
                    new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context
                    ) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new
                GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: wordcount <in> <out>");
                System.exit(2);
            }
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job,
                new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job,
                new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Ease of Deployment
- Dedicated cluster vs. supercomputers and private clusters
- Work with any filesystem
- Work with any scheduler

Ease of Deployment
pbs-hadoop.sh

    # Step 1: Find the network address.
    ADDR=$(/sbin/ip -o -4 addr list "$INTERFACE" \
        | sed -e 's;^.*inet \(.*\)/.*$;\1;')

    # Step 2: Set up the Hadoop configuration.
    export HADOOP_LOG_DIR=$JOBDIR/log
    mkdir $HADOOP_LOG_DIR
    export HADOOP_CONF_DIR=$JOBDIR/conf
    cp -R $HADOOP_HOME/conf $HADOOP_CONF_DIR
    sed -e "s/MASTER_IP_ADDRESS/$ADDR/g" \
        -e "s@HADOOP_TMP_DIR@$JOBDIR/tmp@g" \
        -e "s/MAP_TASKS/$MAP_TASKS/g" \
        -e "s/REDUCE_TASKS/$REDUCE_TASKS/g" \
        -e "s/TASKS_PER_NODE/$TASKS_PER_NODE/g" \
        <$HADOOP_HOME/conf/hadoop-site.xml \
        >$HADOOP_CONF_DIR/hadoop-site.xml

    # Step 3: Start daemons on the master.
    HADOOP="$HADOOP_HOME/bin/hadoop"
    $HADOOP namenode -format  # format the hdfs
    $HADOOP_HOME/bin/hadoop-daemon.sh start namenode
    $HADOOP_HOME/bin/hadoop-daemon.sh start jobtracker

    # Step 4: Start daemons on the slaves.
    ENV=". $HOME/.bashrc;
        export HADOOP_CONF_DIR=$HADOOP_CONF_DIR;
        export HADOOP_LOG_DIR=$HADOOP_LOG_DIR"
    pbsdsh -u bash -c "$ENV; $HADOOP datanode" &
    pbsdsh -u bash -c "$ENV; $HADOOP tasktracker" &
    sleep 15

    # Step 5: Run the user program.
    $HADOOP dfs -put $INPUT $HDFS_INPUT
    $HADOOP jar $PROGRAM ${ARGS[@]}
    $HADOOP dfs -get $HDFS_OUTPUT $OUTPUT

    # Step 6: Stop daemons on the slaves and master.
    kill %2  # kill tasktracker
    kill %1  # kill datanode
    $HADOOP_HOME/bin/hadoop-daemon.sh stop jobtracker
    $HADOOP_HOME/bin/hadoop-daemon.sh stop namenode

Other Issues
- Iterative performance
- Fault tolerance
- Interoperability

What is Mrs?
- Aims to be a simple-to-use MapReduce framework
- Implemented in pure Python
- Designed with scientific computing in mind

Why Python?
- Python is nearly ubiquitous
- Mrs needs no dependencies outside the standard library
- Familiarity and readability
- Easy interoperability
- Debugging and testing
- One downside: the GIL (global interpreter lock)

Iterative MapReduce
[Figure: iterative MapReduce data flow. Each input feeds a chain of alternating map and reduce tasks: Input, Map, Reduce, Map, Reduce.]

Iterative MapReduce: ReduceMap
[Figure: iterative MapReduce data flow with each reduce task and the map task that follows it combined into a single ReduceMap task: Input, Map, ReduceMap, ReduceMap.]

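The ReduceMap idea can be sketched in a few lines. This is an illustration under assumptions, not Mrs's API: `reduce_fn`, `map_fn`, and `reducemap` are hypothetical names, and the sketch assumes that every reduce output is consumed only by the next iteration's map. The likely benefit is that the intermediate reduce output stays inside one task instead of being handed off between two.

```python
def reduce_fn(key, values):
    """Example reduce: sum all values for a key."""
    yield sum(values)


def map_fn(key, value):
    """Example map for the next iteration: halve the value, keep the key."""
    yield (key, value / 2)


def reducemap(key, values):
    """Hypothetical fused task: feed each reduce output straight into map."""
    for reduced in reduce_fn(key, values):
        yield from map_fn(key, reduced)


# Running reduce and map as separate steps gives the same pairs as the fused task.
separate = [pair for r in reduce_fn("a", [1, 2, 3]) for pair in map_fn("a", r)]
fused = list(reducemap("a", [1, 2, 3]))
```
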
Automatic Serialization
- Serialization happens every time a task communicates with another machine
- Mrs automatically handles this with pickle
- Hadoop requires Writable classes everywhere

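As a reminder of what pickle provides, here is a minimal round trip using only the standard `pickle` module. It does not show Mrs internals; the key-value pair below is made up for illustration.

```python
import pickle

# A key-value pair as a task might emit it: plain Python objects,
# with no special serializable wrapper classes.
pair = ("particle-7", {"position": [0.5, -1.25], "best": 3.2})

# Serialize to bytes, as would be needed before sending to another machine.
payload = pickle.dumps(pair, protocol=pickle.HIGHEST_PROTOCOL)

# Rebuild an equal object from the bytes on the receiving side.
restored = pickle.loads(payload)
```
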
Debugging: Run Modes
- Serial
- Mock Parallel
- Parallel

Debugging: Random Number Generators
- Seeding random number generators makes results reproducible
- Each task needs a different seed
- Mrs has a random function that lets you create a random number generator from an arbitrary number of offset parameters
- Example: rand = self.random(id, iter)

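One way to get a distinct but reproducible generator per task is to derive the seed from a base seed plus the offsets. The sketch below is an assumption about how such a helper could work, not the implementation behind `self.random`: the name `task_random` and the SHA-256 hashing scheme are hypothetical.

```python
import hashlib
import random


def task_random(base_seed, *offsets):
    """Hypothetical helper: a generator determined by base_seed and offsets."""
    text = ",".join(str(x) for x in (base_seed,) + offsets)
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer seed.
    return random.Random(int.from_bytes(digest[:8], "big"))


# Same offsets give the same stream; different offsets give a different one.
a = task_random(42, 3, 10).random()
b = task_random(42, 3, 10).random()
c = task_random(42, 3, 11).random()
```

Because the stream depends only on the seed and the offsets, rerunning a single task, for example in a debugger, reproduces the same random numbers it saw in the full job.
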
Performance and Case Studies
Interpreter overhead does not preclude good performance for Mrs. We demonstrate on three different problems:
- Halton Sequence: CPU-bound benchmark
- Particle Swarm Optimization: CPU-bound application
- Walk Analysis: I/O-bound application

Performance and Case Studies
Optimization story:
- Make sure you have the right algorithm
- Careful profiling
- Run with PyPy
- Rewrite critical path in C

Monte Carlo Pi Estimation
- Halton sequence: Monte Carlo algorithm for computing the value of π by generating random points in a square
- Very little data, but computationally intense
- We can control how much computation each map task performs
[Figure: plot over the square from -0.5 to 0.5 on both axes.]

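The benchmark can be sketched in plain Python. This is an illustration, not the code that was benchmarked: the function names, the choice of bases 2 and 3 for the two coordinates, and the way points are split among tasks are all assumptions. Points fall in a unit square centered at the origin, and the fraction inside the inscribed circle of radius 0.5 approximates π/4.

```python
import math


def halton(index, base):
    """Return element `index` (1-based) of the Halton sequence in `base`."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def count_inside(start, count):
    """Map-like step: count how many of `count` points fall inside the circle."""
    inside = 0
    for i in range(start, start + count):
        x = halton(i, 2) - 0.5  # shift the unit square to be centered at 0
        y = halton(i, 3) - 0.5
        if x * x + y * y <= 0.25:  # circle of radius 0.5
            inside += 1
    return inside


def estimate_pi(total_points, points_per_task):
    """Reduce-like step: sum the per-task counts and scale by 4."""
    inside = sum(
        count_inside(start, min(points_per_task, total_points - start + 1))
        for start in range(1, total_points + 1, points_per_task)
    )
    return 4.0 * inside / total_points


pi_estimate = estimate_pi(total_points=20000, points_per_task=5000)
```

Here `points_per_task` plays the role of the knob mentioned on the slide: it sets how much computation each map-like call performs while the input stays tiny.
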
Monte Carlo Pi Estimation
Mrs using pure Python
[Figure: time (seconds) versus points per map task (10^0 to 10^10) for Hadoop (Java), Mrs (PyPy), and Mrs (CPython).]

Monte Carlo Pi Estimation
Python with inner loop in C (using ctypes)
[Figure: time (seconds) versus points per map task (10^0 to 10^11) for Hadoop (Java) and Mrs (CPython).]

Particle Swarm Optimization
- Inspired by simulations of flocking birds
- Particles interact while exploring
- Map: motion and function evaluation
- Reduce: communication
- CPU-bound problem
[Figure: accompanying plot; axis labels are not recoverable from the extracted text.]

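The split into map and reduce can be sketched serially as follows. Everything specific here is an assumption made for illustration and is not taken from the talk: the sphere objective, the coefficients 0.7 and 1.5, the global-best communication scheme, and the names `pso_map`, `pso_reduce`, and `run_pso`.

```python
import random


def sphere(position):
    """Objective to minimize: sum of squares, with minimum 0 at the origin."""
    return sum(x * x for x in position)


def pso_map(particle, swarm_best, rng):
    """Map: move one particle and evaluate the function at its new position."""
    pos, vel, best_pos, best_val = particle
    new_vel = [
        0.7 * v
        + 1.5 * rng.random() * (bp - p)
        + 1.5 * rng.random() * (sb - p)
        for p, v, bp, sb in zip(pos, vel, best_pos, swarm_best)
    ]
    new_pos = [p + v for p, v in zip(pos, new_vel)]
    value = sphere(new_pos)
    if value < best_val:
        best_pos, best_val = new_pos, value
    return (new_pos, new_vel, best_pos, best_val)


def pso_reduce(particles):
    """Reduce: communication, here sharing the best position found so far."""
    return min(particles, key=lambda p: p[3])[2]


def run_pso(num_particles=10, dims=2, iterations=50, seed=0):
    """Serial driver that alternates the map and reduce steps."""
    rng = random.Random(seed)
    particles = []
    for _ in range(num_particles):
        pos = [rng.uniform(-10, 10) for _ in range(dims)]
        particles.append((pos, [0.0] * dims, pos, sphere(pos)))
    swarm_best = pso_reduce(particles)
    initial_best = sphere(swarm_best)
    for _ in range(iterations):
        particles = [pso_map(p, swarm_best, rng) for p in particles]
        swarm_best = pso_reduce(particles)
    return initial_best, sphere(swarm_best)


initial_best, final_best = run_pso()
```

Each iteration is one map-reduce round, which is why this application exercises the iterative performance discussed earlier in the talk.
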