Mrs: MapReduce for Scientific Computing in Python

Andrew McNabb, Jeff Lund, and Kevin Seppi
Brigham Young University
November 16, 2012

Outline: MapReduce, MapReduce in Scientific Computing, Mrs Features, Performance and Case Studies

MapReduce
- Large-scale problems require parallel processing
- Communication in parallel processing is hard
- MapReduce abstracts away interprocess communication
- The user only has to identify which parts of the problem are embarrassingly parallel

MapReduce
[Figure: MapReduce dataflow. Inputs feed map tasks, whose outputs feed reduce tasks.]

WordCount
wordcount.py

    import mrs

    class WordCount(mrs.MapReduce):
        def map(self, line_num, line_text):
            for word in line_text.split():
                yield (word, 1)

        def reduce(self, word, counts):
            yield sum(counts)

    if __name__ == '__main__':
        mrs.main(WordCount)
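To make the data flow behind wordcount.py concrete, the following sketch imitates the map, group-by-key, and reduce phases in plain Python, without Mrs. The run_wordcount driver and its in-memory grouping are illustrative assumptions, not part of the Mrs API.

```python
from collections import defaultdict


def map_fn(line_num, line_text):
    """Emit (word, 1) for every word on the line."""
    for word in line_text.split():
        yield (word, 1)


def reduce_fn(word, counts):
    """Emit the total count for one word."""
    yield sum(counts)


def run_wordcount(lines):
    """Hypothetical serial driver: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for line_num, line_text in enumerate(lines):
        for key, value in map_fn(line_num, line_text):
            grouped[key].append(value)
    # Each key is reduced independently, which is what makes the
    # reduce phase easy to distribute across machines.
    return {key: next(reduce_fn(key, values)) for key, values in grouped.items()}


if __name__ == "__main__":
    print(run_wordcount(["a rose is a rose", "is it"]))
```

In a real framework the grouping step is where data moves between machines; here it is only a dictionary.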

Iterative MapReduce
[Figure: iterative MapReduce dataflow. Each input feeds a map task, followed by alternating reduce and map stages.]

Hadoop
- Hadoop is the most widely used open source MapReduce implementation
- Hadoop was designed for big data, not scientific computing
- Requires the use of HDFS and a dedicated cluster

MapReduce in Scientific Computing
What does an ideal MapReduce implementation look like in the context of scientific computing?

Ease of Development
- Rapid prototyping
- Testability
- Debuggability

Ease of Development
WordCount.java

    public class WordCount {

      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
          StringTokenizer itr =
              new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context
                           ) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new
            GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job,
            new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job,
            new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Ease of Deployment
- Dedicated cluster vs. supercomputers and private clusters
- Work with any filesystem
- Work with any scheduler

Ease of Deployment
pbs-hadoop.sh

    # Step 1: Find the network address.
    ADDR=$(/sbin/ip -o -4 addr list "$INTERFACE" \
        | sed -e 's;^.*inet \(.*\)/.*$;\1;')

    # Step 2: Set up the Hadoop configuration.
    export HADOOP_LOG_DIR=$JOBDIR/log
    mkdir $HADOOP_LOG_DIR
    export HADOOP_CONF_DIR=$JOBDIR/conf
    cp -R $HADOOP_HOME/conf $HADOOP_CONF_DIR
    sed -e "s/MASTER_IP_ADDRESS/$ADDR/g" \
        -e "s@HADOOP_TMP_DIR@$JOBDIR/tmp@g" \
        -e "s/MAP_TASKS/$MAP_TASKS/g" \
        -e "s/REDUCE_TASKS/$REDUCE_TASKS/g" \
        -e "s/TASKS_PER_NODE/$TASKS_PER_NODE/g" \
        <$HADOOP_HOME/conf/hadoop-site.xml \
        >$HADOOP_CONF_DIR/hadoop-site.xml

    # Step 3: Start daemons on the master.
    HADOOP="$HADOOP_HOME/bin/hadoop"
    $HADOOP namenode -format  # format the hdfs
    $HADOOP_HOME/bin/hadoop-daemon.sh start namenode
    $HADOOP_HOME/bin/hadoop-daemon.sh start jobtracker

    # Step 4: Start daemons on the slaves.
    ENV=". $HOME/.bashrc;
        export HADOOP_CONF_DIR=$HADOOP_CONF_DIR;
        export HADOOP_LOG_DIR=$HADOOP_LOG_DIR"
    pbsdsh -u bash -c "$ENV; $HADOOP datanode" &
    pbsdsh -u bash -c "$ENV; $HADOOP tasktracker" &
    sleep 15

    # Step 5: Run the User Program
    $HADOOP dfs -put $INPUT $HDFS_INPUT
    $HADOOP jar $PROGRAM ${ARGS[@]}
    $HADOOP dfs -get $HDFS_OUTPUT $OUTPUT

    # Step 6: Stop daemons on the slaves and master.
    kill %2  # kill tasktracker
    kill %1  # kill datanode
    $HADOOP_HOME/bin/hadoop-daemon.sh stop jobtracker
    $HADOOP_HOME/bin/hadoop-daemon.sh stop namenode

Other Issues
- Iterative performance
- Fault tolerance
- Interoperability

What is Mrs?
- Aims to be a simple-to-use MapReduce framework
- Implemented in pure Python
- Designed with scientific computing in mind

Why Python?
- Python is nearly ubiquitous
- Mrs needs no dependencies outside of the standard library
- Familiarity and readability
- Easy interoperability
- Debugging and testing
- One downside: the global interpreter lock (GIL)

Iterative MapReduce
[Figure: iterative MapReduce dataflow. Each input feeds a map task, followed by alternating reduce and map stages.]

Iterative MapReduce: ReduceMap
[Figure: iterative dataflow with ReduceMap. Each input feeds a map task, followed by a chain of ReduceMap stages.]
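One way to read the ReduceMap diagram is that each reduce output is passed straight into the next iteration's map function inside the same task, so the intermediate data is not stored and reloaded between two separate tasks. The sketch below illustrates that reading in plain Python; reduce_fn, map_fn, reduce_then_map, and reducemap are hypothetical names, and this is not Mrs's implementation.

```python
def reduce_fn(key, values):
    """Example reduce: sum the values for a key."""
    yield sum(values)


def map_fn(key, value):
    """Example map for the next iteration: re-key by parity of the value."""
    yield (value % 2, value)


def reduce_then_map(key, values):
    """Two separate tasks: materialize the reduce output, then map it."""
    reduced = list(reduce_fn(key, values))  # would be stored between tasks
    out = []
    for value in reduced:
        out.extend(map_fn(key, value))
    return out


def reducemap(key, values):
    """Fused task: each reduce output flows directly into map."""
    for value in reduce_fn(key, values):
        yield from map_fn(key, value)


if __name__ == "__main__":
    print(list(reducemap("k", [1, 2, 4])))
```

Both paths produce the same pairs; the fused version simply skips the intermediate hand-off.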

Automatic Serialization
- Serialization happens every time a task communicates with another machine
- Mrs automatically handles this with pickle
- Hadoop requires Writable classes everywhere
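As a rough illustration of what pickle-based serialization involves, the sketch below round-trips a key-value pair through bytes using only the standard library. The Particle class and the send and receive helpers are hypothetical; how Mrs actually invokes pickle is not shown in the slides.

```python
import pickle


class Particle:
    """Hypothetical value type a map task might emit."""

    def __init__(self, position, velocity):
        self.position = position
        self.velocity = velocity


def send(pair):
    """Serialize a (key, value) pair to bytes, as if for the network."""
    return pickle.dumps(pair, protocol=pickle.HIGHEST_PROTOCOL)


def receive(data):
    """Rebuild the (key, value) pair on the receiving side."""
    return pickle.loads(data)


if __name__ == "__main__":
    # No per-type serialization code is needed for the custom class.
    key, value = receive(send((7, Particle([0.5, -1.0], [0.1, 0.0]))))
    print(key, value.position, value.velocity)
```

The contrast with Hadoop is that an ordinary Python object is serialized without declaring a dedicated wrapper type for it.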

Debugging: Run Modes
- Serial
- Mock Parallel
- Parallel

Debugging: Random Number Generators
- Seeding random number generators makes results reproducible
- Each task needs a different seed
- Mrs has a random function that creates a random number generator from an arbitrary number of offset parameters
- Example: rand = self.random(id, iter)
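The idea of deriving a distinct but reproducible generator from a set of offsets can be sketched with the standard library alone. The task_random helper below is a hypothetical stand-in that hashes a base seed together with the offsets; it is not how Mrs implements self.random.

```python
import hashlib
import random


def task_random(base_seed, *offsets):
    """Return a random.Random seeded from base_seed and integer offsets.

    Hypothetical helper: hashing the whole tuple gives each combination
    of offsets its own reproducible stream.
    """
    text = ",".join(str(int(x)) for x in (base_seed,) + offsets)
    digest = hashlib.sha256(text.encode("ascii")).digest()
    return random.Random(int.from_bytes(digest, "big"))


if __name__ == "__main__":
    rand = task_random(42, 3, 10)  # e.g. particle id 3, iteration 10
    print(rand.random())
```

Rerunning a single task with the same offsets reproduces its random draws, which is what makes a failing task debuggable in isolation.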

Performance and Case Studies
Interpreter overhead does not preclude good performance for Mrs. We demonstrate on three different problems:
- Halton Sequence: CPU-bound benchmark
- Particle Swarm Optimization: CPU-bound application
- Walk Analysis: I/O-bound application

Performance and Case Studies
Optimization Story:
- Make sure you have the right algorithm
- Careful profiling
- Run with PyPy
- Rewrite critical path in C

Monte Carlo Pi Estimation
- Halton Sequence Monte Carlo algorithm for computing the value of π by generating random points in a square
- Very little data, but computationally intense
- We can control how much computation each map task performs
[Figure: sample points in a square, with both axes running from -0.5 to 0.5.]
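A minimal sketch of this estimate, assuming Halton bases 2 and 3 for the two coordinates and a circle of radius 0.5 centered in the square. The function names and the task split are illustrative and are not taken from the benchmark's actual code.

```python
import math


def halton(index, base):
    """Return element `index` (1-based) of the Halton sequence in `base`."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def count_inside(start, count):
    """Map-style task: count points start..start+count-1 inside the circle."""
    inside = 0
    for i in range(start, start + count):
        x = halton(i, 2) - 0.5
        y = halton(i, 3) - 0.5
        if x * x + y * y <= 0.25:
            inside += 1
    return inside


def estimate_pi(total_points, points_per_task):
    """Reduce-style step: sum the per-task counts and scale to estimate pi."""
    inside = 0
    for start in range(1, total_points + 1, points_per_task):
        count = min(points_per_task, total_points - start + 1)
        inside += count_inside(start, count)
    # The circle covers pi/4 of the unit square.
    return 4.0 * inside / total_points


if __name__ == "__main__":
    estimate = estimate_pi(100000, 10000)
    print(estimate, abs(estimate - math.pi))
```

The points_per_task argument plays the role of the knob described above: it changes how much work each task does without changing the total amount of data.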

Monte Carlo Pi Estimation
Mrs using pure Python
[Figure: time (seconds) versus points per map task (10^0 to 10^10) for Hadoop (Java), Mrs (PyPy), and Mrs (CPython).]

Monte Carlo Pi Estimation
Python with inner loop in C (using ctypes)
[Figure: time (seconds) versus points per map task (10^0 to 10^11) for Hadoop (Java) and Mrs (CPython).]

Particle Swarm Optimization
- Inspired by simulations of flocking birds
- Particles interact while exploring
- Map: motion and function evaluation
- Reduce: communication
- CPU-bound problem
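The split between map and reduce can be sketched as follows. This is a simplified serial version under several assumptions that are not from the slides: a sphere objective, a global-best communication step, and illustrative coefficients 0.7 and 1.5. All function names are hypothetical.

```python
import random


def sphere(position):
    """Objective to minimize: sum of squares."""
    return sum(x * x for x in position)


def pso_map(particle, global_best, rng):
    """Map: move one particle, evaluate it, and update its personal best."""
    pos, vel, best_pos, best_val = particle
    new_vel = [
        0.7 * v + 1.5 * rng.random() * (bp - x) + 1.5 * rng.random() * (gb - x)
        for x, v, bp, gb in zip(pos, vel, best_pos, global_best)
    ]
    new_pos = [x + v for x, v in zip(pos, new_vel)]
    value = sphere(new_pos)
    if value < best_val:
        best_pos, best_val = new_pos, value
    return (new_pos, new_vel, best_pos, best_val)


def pso_reduce(particles):
    """Reduce: communication, here sharing the best position found so far."""
    best = min(particles, key=lambda p: p[3])
    return best[2], best[3]


def run_pso(num_particles=10, dims=2, iterations=50, seed=0):
    """Serial driver that alternates the map and reduce steps."""
    rng = random.Random(seed)
    particles = []
    for _ in range(num_particles):
        pos = [rng.uniform(-5, 5) for _ in range(dims)]
        particles.append((pos, [0.0] * dims, pos, sphere(pos)))
    global_best, best_val = pso_reduce(particles)
    for _ in range(iterations):
        particles = [pso_map(p, global_best, rng) for p in particles]
        global_best, best_val = pso_reduce(particles)
    return global_best, best_val


if __name__ == "__main__":
    print(run_pso())
```

Each pso_map call touches only one particle, so the map step parallelizes naturally, while pso_reduce is the only point where particles exchange information.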
