Comments Basic MapReduce Program Design Tasks that can be performed - PDF document

Basic MapReduce Program Design
• Tasks that can be performed independently on a data object, large number of them: Map
• Tasks that require combining of multiple data objects: Reduce
• Sometimes it is easier to start program design with Map, sometimes with Reduce
• Select keys and values such that the right objects end up together in the same Reduce invocation
• Might have to partition a complex task into multiple MapReduce sub-tasks

Comments
• Programming model might appear very limited
• But, map and reduce can do anything with their input
  – Could implement a Turing machine inside…
  – …which could compute anything, but…
  – …would not result in a good parallel implementation.
• Challenge: find best MapReduce implementation for a given problem

Choosing M and R
• M = number of map tasks, R = number of reduce tasks
• Larger M, R: creates smaller tasks, enabling easier load balancing and faster recovery (many small tasks from failed machine)
• Limitation: O(M+R) scheduling decisions and O(M × R) in-memory state at master
  – Very small tasks not worth the startup cost
• Recommendation:
  – Choose M so that split size is approximately 64 MB
  – Choose R a small multiple of the number of workers; alternatively choose R a little smaller than #workers to finish reduce phase in one “wave”

Grep
• Find all lines matching some pattern
• No need to combine anything
  – Reduce is not needed, i.e., just identity function
• Map takes line and outputs it if it matches the pattern
• Map could also take an entire document and emit all matching lines
  – Not a good idea if there is a single large document, but works well if there are many documents

Reverse Web-Link Graph
• For each URL, find all pages (URLs) pointing to it (incoming links)
• Problem: Web page has only outgoing links
• Need all (anySource, P) links for each page P
  – Suggests Reduce with P as the key, source as value
• Map: for page source, create all (target, source) pairs for each link to a target found in page
• Reduce: since target is key, will receive all sources pointing to that target

Inverted Index
• For each word, create list of documents (document IDs) containing it
• Same as reverse Web-link graph problem
  – “Source URL” is now “document ID”
  – “Target URL” is now “word”
• Can augment this to create list of (document ID, position) pairs for each word
  – Map emits (word, (document ID, position)) while parsing a document
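The inverted-index construction above can be sketched in plain Java, without a Hadoop cluster. This is a minimal single-process simulation under stated assumptions, not Hadoop API code: the class name InvertedIndexSketch, its method names, the whitespace tokenization, and the "docId:position" string encoding of a posting are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process simulation of the inverted-index job.
class InvertedIndexSketch {

    /** Map: emit one (word, "docId:position") pair per word occurrence. */
    static List<String[]> map(String docId, String text) {
        List<String[]> pairs = new ArrayList<>();
        String[] words = text.trim().split("\\s+");
        for (int position = 0; position < words.length; position++) {
            if (!words[position].isEmpty()) { // skip the empty token of an empty document
                pairs.add(new String[] {words[position], docId + ":" + position});
            }
        }
        return pairs;
    }

    /** Shuffle and Reduce: group values by key; each group is a posting list. */
    static Map<String, List<String>> shuffleAndReduce(List<String[]> pairs) {
        Map<String, List<String>> index = new TreeMap<>();
        for (String[] pair : pairs) {
            index.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return index;
    }

    /** Run Map over every document, then group the intermediate pairs by word. */
    static Map<String, List<String>> build(Map<String, String> documents) {
        List<String[]> intermediate = new ArrayList<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            intermediate.addAll(map(doc.getKey(), doc.getValue()));
        }
        return shuffleAndReduce(intermediate);
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("d1", "map reduce map");
        docs.put("d2", "reduce sort");
        System.out.println(build(docs));
    }
}
```

Replacing "word" with the target URL and "document ID" with the source URL turns the same sketch into the reverse Web-link graph job; only the Map function's parsing changes.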

Distributed Sorting
• Can Map do pre-sorting and Reduce the merging?
  – Use set of input records as Map input
  – Map pre-sorts it and single reducer merges them
  – Does not scale!
• We need to get multiple reducers involved
  – What should we use as the intermediate key?

Distributed Sorting, Revisited
• Quicksort-style partitioning
• For simplicity, consider case with 2 machines
  – Goal: each machine sorts about half of the data
• Assuming we can find the median record, assign all smaller records to machine 1, all others to machine 2
• Sort locally on each machine, then “concatenate” output

Partitioning Sort in MapReduce
• Consider 2 reducers for simplicity
• Run MapReduce job to find approximate median of data
  – Hadoop also offers InputSampler
    • Writes the keys that define the partitions, to be used by TotalOrderPartitioner
    • Runs on client and downloads input data splits, hence only useful if data is sampled from few splits, i.e., splits themselves should contain random data samples
• Map outputs (sortKey, record) for an input record
• All sortKey < median are assigned to reduce task 1, all others to reduce task 2, using a partitioner
• Reduce sorts its assigned set of records
• Generalizes obviously to more reducers

Partitioning Sort in MapReduce (continued)
• MapReduce has class Partitioner<KEY, VALUE>
  – Method int getPartition(KEY key, VALUE value, int numPartitions) allows assigning keys to partitions
• Example for numPartitions = 2
  – Partition 1 gets all numbers less than median
  – Partition 2 gets all other numbers (greater than or equal to the median)
• What about concatenating the output?
  – Not necessary, except for many small files (big files are broken up anyway)

MapReduce and Key Sorting
• MapReduce environment guarantees that for each reduce task the assigned set of intermediate keys is processed in key order
  – After receiving all (key2, val2) pairs from mappers, reducer sorts them by key2, then calls Reduce on each (key2, list(val2)) group in order
• Can leverage this guarantee for partitioning sort
  – Reduce simply emits the records unchanged
  – No need for user sort code in Reduce function!

Sort Code in Hadoop 1.0.3 Distribution; part 1: boilerplate code

    package org.apache.hadoop.examples;

    import java.io.IOException;
    import java.net.URI;
    import java.util.*;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;
    import org.apache.hadoop.mapred.lib.InputSampler;
    import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    /**
     * This is the trivial map/reduce program that does absolutely nothing
     * other than use the framework to fragment and sort the input values.
     *
     * To run: bin/hadoop jar build/hadoop-examples.jar sort
     *            [-m maps] [-r reduces]
     *            [-inFormat input format class]
     *            [-outFormat output format class]
     *            [-outKey output key class]
     *            [-outValue output value class]
     *            [-totalOrder pcnt num samples max splits]
     *            in-dir out-dir
     */
    public class Sort<K,V> extends Configured implements Tool {
      private RunningJob jobResult = null;

      static int printUsage() {
        System.out.println("sort [-m <maps>] [-r <reduces>] " +
                           "[-inFormat <input format class>] " +
                           "[-outFormat <output format class>] " +
                           "[-outKey <output key class>] " +
                           "[-outValue <output value class>] " +
                           "[-totalOrder <pcnt> <num samples> <max splits>] " +
                           "<input> <output>");
        ToolRunner.printGenericCommandUsage(System.out);
        return -1;
      }
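The partitioning sort described in the slides above can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the Hadoop Sort example and not Hadoop's Partitioner class: the class PartitioningSortSketch, the array-based getPartition(int key, int[] splitPoints) signature, and the integer keys are hypothetical. Partitions are numbered from 0 here, whereas the slides count reduce tasks from 1. The per-partition sort stands in for the key ordering that the MapReduce framework provides before calling Reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical single-process simulation of a range-partitioned sort.
class PartitioningSortSketch {

    /**
     * Range partitioner. splitPoints must be sorted ascending. Keys below
     * splitPoints[0] go to partition 0; keys at or above the last split
     * point go to the last partition.
     */
    static int getPartition(int key, int[] splitPoints) {
        int partition = 0;
        while (partition < splitPoints.length && key >= splitPoints[partition]) {
            partition++;
        }
        return partition;
    }

    /** Route keys to reduce tasks, order each task's input, concatenate. */
    static List<Integer> sort(List<Integer> keys, int[] splitPoints) {
        // "Shuffle": one input list per reduce task.
        List<List<Integer>> reduceInputs = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) {
            reduceInputs.add(new ArrayList<>());
        }
        for (int key : keys) {
            reduceInputs.get(getPartition(key, splitPoints)).add(key);
        }
        // The framework's per-task key ordering is simulated by sorting each
        // input list; "Reduce" then emits the records unchanged.
        List<Integer> output = new ArrayList<>();
        for (List<Integer> input : reduceInputs) {
            Collections.sort(input);
            output.addAll(input);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(42, 7, 19, 88, 3, 55, 19);
        // A single split point (an assumed approximate median) gives 2 reduce tasks.
        System.out.println(sort(keys, new int[] {20}));
    }
}
```

Because every key in partition i is smaller than every key in partition i+1, concatenating the per-partition outputs in partition order yields a totally ordered result. A poorly chosen split point still produces correct output, but the load across reduce tasks becomes unbalanced.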
