large scale data mining mapreduce and beyond
play

Large-scale Data Mining: MapReduce and beyond Part 1: Basics - PowerPoint PPT Presentation

Large-scale Data Mining: MapReduce and beyond Part 1: Basics Spiros Papadimitriou, Google Jimeng Sun, IBM Research Rong Yan, Facebook Monday, August 23, 2010 Data everywhere 2 Monday, August 23, 2010 Data everywhere Flickr (3 billion


  1. Example – Programming model mapper employees.txt def getName (line): # LAST FIRST SALARY return line.split(‘\t’)[1] Smith John $90,000 Brown David $70,000 Johnson George $95,000 def addCounts (hist, name): Yates John $80,000 hist[name] = \ Miller Bill $65,000 Moore Jack $85,000 hist.get(name,default=0) + 1 Taylor Fred $75,000 return hist Smith David $80,000 Harris John $90,000 ... ... ... input = open(‘employees.txt’, ‘r’) ... ... ... intermediate = map(getName, input) Q: “What is the frequency of each first name?” 10 Monday, August 23, 2010

  2. Example – Programming model mapper employees.txt def getName (line): # LAST FIRST SALARY return line.split(‘\t’)[1] Smith John $90,000 Brown David $70,000 reducer Johnson George $95,000 def addCounts (hist, name): Yates John $80,000 hist[name] = \ Miller Bill $65,000 Moore Jack $85,000 hist.get(name,default=0) + 1 Taylor Fred $75,000 return hist Smith David $80,000 Harris John $90,000 ... ... ... input = open(‘employees.txt’, ‘r’) ... ... ... intermediate = map(getName, input) result = reduce(addCounts, \ Q: “What is the frequency intermediate, {}) of each first name?” 10 Monday, August 23, 2010

  3. Example – Programming model mapper employees.txt def getName (line): # LAST FIRST SALARY return (line.split(‘\t’)[1], 1) Smith John $90,000 Brown David $70,000 reducer Johnson George $95,000 def addCounts (hist, (name, c)): Yates John $80,000 hist[name] = \ Miller Bill $65,000 Moore Jack $85,000 hist.get(name,default=0) + c Taylor Fred $75,000 return hist Smith David $80,000 Harris John $90,000 ... ... ... input = open(‘employees.txt’, ‘r’) ... ... ... intermediate = map(getName, input) result = reduce(addCounts, \ Q: “What is the frequency intermediate, {}) of each first name?” 11 Monday, August 23, 2010

  4. Example – Programming model mapper employees.txt def getName (line): # LAST FIRST SALARY return (line.split(‘\t’)[1], 1) Smith John $90,000 Brown David $70,000 reducer Johnson George $95,000 def addCounts (hist, (name, c)): Yates John $80,000 hist[name] = \ Miller Bill $65,000 Moore Jack $85,000 hist.get(name,default=0) + c Taylor Fred $75,000 return hist Smith David $80,000 Harris John $90,000 ... ... ... input = open(‘employees.txt’, ‘r’) ... ... ... intermediate = map(getName, input) result = reduce(addCounts, \ Q: “What is the frequency Key-value iterators intermediate, {}) of each first name?” 11 Monday, August 23, 2010

  5. Example – Programming model Hadoop / Java public class HistogramJob extends Configured implements Tool { public static class FieldMapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,LongWritable> { private static LongWritable ONE = new LongWritable(1); private static Text firstname = new Text(); @Override public void map (LongWritable key, Text value, OutputCollector<Text,LongWritable> out, Reporter r) { firstname.set(value.toString().split(“\t”)[1]); output.collect(firstname, ONE); } } // class FieldMapper 12 Monday, August 23, 2010

  6. Example – Programming model Hadoop / Java public class HistogramJob extends Configured implements Tool { public static class FieldMapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,LongWritable> { private static LongWritable ONE = new LongWritable(1); private static Text firstname = new Text(); @Override public void map (LongWritable key, Text value, OutputCollector<Text,LongWritable> out, Reporter r) { firstname.set(value.toString().split(“\t”)[1]); output.collect(firstname, ONE); } non-boilerplate } // class FieldMapper 12 Monday, August 23, 2010

  7. Example – Programming model Hadoop / Java public class HistogramJob extends Configured implements Tool { public static class FieldMapper extends MapReduceBase implements Mapper<LongWritable,Text,Text,LongWritable> { typed… private static LongWritable ONE = new LongWritable(1); private static Text firstname = new Text(); @Override public void map (LongWritable key, Text value, OutputCollector<Text,LongWritable> out, Reporter r) { firstname.set(value.toString().split(“\t”)[1]); output.collect(firstname, ONE); } non-boilerplate } // class FieldMapper 12 Monday, August 23, 2010

  8. Example – Programming model Hadoop / Java public static class LongSumReducer extends MapReduceBase implements Mapper<LongWritable,Text,Text,LongWritable> { private static LongWritable sum = new LongWritable(); @Override public void reduce (Text key, Iterator<LongWritable> vals, OutputCollector<Text,LongWritable> out, Reporter r) { long s = 0; while (vals.hasNext()) s += vals.next().get(); sum.set(s); output.collect(key, sum); } } // class LongSumReducer 13 Monday, August 23, 2010

  9. Example – Programming model Hadoop / Java public static class LongSumReducer extends MapReduceBase implements Mapper<LongWritable,Text,Text,LongWritable> { private static LongWritable sum = new LongWritable(); @Override public void reduce (Text key, Iterator<LongWritable> vals, OutputCollector<Text,LongWritable> out, Reporter r) { long s = 0; while (vals.hasNext()) s += vals.next().get(); sum.set(s); output.collect(key, sum); } } // class LongSumReducer 13 Monday, August 23, 2010

  10. Example – Programming model Hadoop / Java public int run (String[] args) throws Exception { JobConf job = new JobConf(getConf(), HistogramJob.class); job.setJobName(“Histogram”); FileInputFormat.setInputPaths(job, args[0]); job.setMapperClass(FieldMapper.class); job.setCombinerClass(LongSumReducer.class); job.setReducerClass(LongSumReducer.class); // ... JobClient. runJob (job); return 0; } // run() public static main (String[] args) throws Exception { ToolRunner. run (new Configuration(), new HistogramJob(), args); } // main() } // class HistogramJob 14 Monday, August 23, 2010

  11. Example – Programming model Hadoop / Java public int run (String[] args) throws Exception { JobConf job = new JobConf(getConf(), HistogramJob.class); job.setJobName(“Histogram”); FileInputFormat.setInputPaths(job, args[0]); job.setMapperClass(FieldMapper.class); job.setCombinerClass(LongSumReducer.class); job.setReducerClass(LongSumReducer.class); // ... JobClient. runJob (job); return 0; } // run() public static main (String[] args) throws Exception { ToolRunner. run (new Configuration(), new HistogramJob(), args); } // main() } // class HistogramJob ~ 30 lines = 25 boilerplate (Eclipse) + 5 actual code 14 Monday, August 23, 2010

  12. MapReduce for…  Distributed clusters  Google’s original  Hadoop (Apache Software Foundation)  Hardware  SMP/CMP: Phoenix (Stanford)  Cell BE  Other  Skynet (in Ruby/DRB)  QtConcurrent  BashReduce  …many more 15 Monday, August 23, 2010

  13. MapReduce for…  Distributed clusters  Google’s original  Hadoop (Apache Software Foundation)  Hardware  SMP/CMP: Phoenix (Stanford)  Cell BE  Other  Skynet (in Ruby/DRB)  QtConcurrent  BashReduce  …many more 15 Monday, August 23, 2010

  14. Recap Quick-n-dirty script Hadoop vs ~5 lines of (non-bo on-boilerplate) code Single machine, Up to thousands of local drive machines and drives 16 Monday, August 23, 2010

  15. Recap Quick-n-dirty script Hadoop vs ~5 lines of (non-bo on-boilerplate) code Single machine, Up to thousands of local drive machines and drives What is hidden to achieve this:  Data partitioning, placement and replication  Computation placement (and replication)  Number of nodes (mappers / reducers)  … 16 Monday, August 23, 2010

  16. Recap Quick-n-dirty script Hadoop vs ~5 lines of (non-bo on-boilerplate) code Single machine, Up to thousands of local drive machines and drives What is hidden to achieve this:  Data partitioning, placement and replication  Computation placement (and replication)  Number of nodes (mappers / reducers)  … As a programmer, you don’t need to know what I’m about to show you next… 16 Monday, August 23, 2010

  17. Execution model: Flow Input file SPLIT 0 MAPPER Output file REDUCER SPLIT 1 PART 0 MAPPER REDUCER SPLIT 2 PART 1 MAPPER SPLIT 3 MAPPER 17 Monday, August 23, 2010

  18. Execution model: Flow Key/value Input file iterators SPLIT 0 MAPPER Output file REDUCER SPLIT 1 PART 0 MAPPER REDUCER SPLIT 2 PART 1 MAPPER SPLIT 3 MAPPER 17 Monday, August 23, 2010

  19. Execution model: Flow Key/value Input file iterators SPLIT 0 MAPPER Output file REDUCER SPLIT 1 PART 0 MAPPER REDUCER SPLIT 2 PART 1 MAPPER SPLIT 3 MAPPER Sequential scan 17 Monday, August 23, 2010

  20. Execution model: Flow Key/value Input file iterators Smith John $90,000 SPLIT 0 MAPPER Output file REDUCER SPLIT 1 PART 0 MAPPER REDUCER SPLIT 2 Yates John $80,000 PART 1 MAPPER SPLIT 3 MAPPER Sequential scan 17 Monday, August 23, 2010

  21. Execution model: Flow Key/value Input file iterators John 1 SPLIT 0 MAPPER Output file REDUCER SPLIT 1 PART 0 MAPPER REDUCER SPLIT 2 John 1 PART 1 MAPPER SPLIT 3 MAPPER Sequential scan 17 Monday, August 23, 2010

  22. Execution model: Flow Key/value Input file iterators John 1 SPLIT 0 MAPPER Output file REDUCER SPLIT 1 PART 0 MAPPER REDUCER SPLIT 2 John 1 PART 1 MAPPER SPLIT 3 MAPPER All-to-all, hash partitioning Sequential scan 17 Monday, August 23, 2010

  23. Execution model: Flow Key/value Input file iterators SPLIT 0 MAPPER Output file John 2 REDUCER SPLIT 1 PART 0 MAPPER REDUCER SPLIT 2 PART 1 MAPPER SPLIT 3 MAPPER Sort-merge All-to-all, hash partitioning Sequential scan 17 Monday, August 23, 2010

  24. Execution model: Flow Key/value Input file iterators SPLIT 0 MAPPER Output file REDUCER John 2 SPLIT 1 PART 0 MAPPER REDUCER SPLIT 2 PART 1 MAPPER SPLIT 3 MAPPER Sort-merge All-to-all, hash partitioning Sequential scan 17 Monday, August 23, 2010

  25. Execution model: Placement HOST 1 HOST 2 SPLIT 0 SPLIT 4 SPLIT 3 SPLIT 2 Replica 2/3 Replica 1/3 Replica 3/3 Replica 2/3 HOST 3 SPLIT 3 HOST 0 SPLIT 0 Replica 1/3 SPLIT 2 SPLIT 1 Replica 3/3 SPLIT 0 SPLIT 1 Replica 3/3 Replica 1/3 Replica 1/3 Replica 2/3 SPLIT 4 SPLIT 3 Replica 2/3 Replica 2/3 HOST 5 HOST 4 HOST 6 18 Monday, August 23, 2010

  26. Execution model: Placement HOST 1 HOST 2 SPLIT 0 SPLIT 4 SPLIT 3 SPLIT 2 Replica 2/3 Replica 1/3 Replica 3/3 Replica 2/3 MAPPER MAPPER HOST 3 SPLIT 3 HOST 0 SPLIT 0 Replica 1/3 SPLIT 2 SPLIT 1 Replica 3/3 SPLIT 0 SPLIT 1 Replica 3/3 Replica 1/3 Replica 1/3 Replica 2/3 MAPPER MAPPER SPLIT 4 SPLIT 3 Replica 2/3 Replica 2/3 HOST 5 HOST 4 HOST 6 Computation co-located with data (as much as possible) 18 Monday, August 23, 2010

  27. Execution model: Placement HOST 1 HOST 2 SPLIT 0 SPLIT 4 SPLIT 3 SPLIT 2 Replica 2/3 Replica 1/3 Replica 3/3 Replica 2/3 MAPPER MAPPER HOST 3 SPLIT 3 HOST 0 SPLIT 0 Replica 1/3 SPLIT 2 SPLIT 1 Replica 3/3 SPLIT 0 SPLIT 1 Replica 3/3 Replica 1/3 Replica 1/3 Replica 2/3 MAPPER MAPPER SPLIT 4 SPLIT 3 Replica 2/3 Replica 2/3 HOST 5 HOST 4 HOST 6 19 Monday, August 23, 2010

  28. Execution model: Placement HOST 1 HOST 2 SPLIT 0 SPLIT 4 SPLIT 3 SPLIT 2 Replica 2/3 Replica 1/3 Replica 3/3 Replica 2/3 MAPPER MAPPER HOST 3 SPLIT 3 HOST 0 SPLIT 0 REDUCER Replica 1/3 SPLIT 2 SPLIT 1 Replica 3/3 SPLIT 0 SPLIT 1 Replica 3/3 Replica 1/3 Replica 1/3 Replica 2/3 MAPPER MAPPER SPLIT 4 SPLIT 3 Replica 2/3 Replica 2/3 HOST 5 HOST 4 HOST 6 19 Monday, August 23, 2010

  29. Execution model: Placement HOST 1 HOST 2 SPLIT 0 SPLIT 4 SPLIT 3 SPLIT 2 Replica 2/3 Replica 1/3 Replica 3/3 Replica 2/3 MAPPER MAPPER HOST 3 SPLIT 3 HOST 0 SPLIT 0 REDUCER Replica 1/3 SPLIT 2 SPLIT 1 Replica 3/3 SPLIT 0 SPLIT 1 Replica 3/3 Replica 1/3 Replica 1/3 Replica 2/3 MAPPER MAPPER SPLIT 4 SPLIT 3 Replica 2/3 Replica 2/3 HOST 5 HOST 4 HOST 6 Rack/network-aware 19 Monday, August 23, 2010

  30. Execution model: Placement HOST 1 HOST 2 SPLIT 0 SPLIT 4 SPLIT 3 SPLIT 2 Replica 2/3 Replica 1/3 Replica 3/3 Replica 2/3 MAPPER MAPPER HOST 3 C C SPLIT 3 HOST 0 SPLIT 0 REDUCER Replica 1/3 SPLIT 2 SPLIT 1 Replica 3/3 SPLIT 0 SPLIT 1 Replica 3/3 Replica 1/3 Replica 1/3 Replica 2/3 MAPPER C MAPPER SPLIT 4 C SPLIT 3 Replica 2/3 Replica 2/3 HOST 5 HOST 4 HOST 6 C COMBINER Rack/network-aware 19 Monday, August 23, 2010

  31. MapReduce Summary 20 Monday, August 23, 2010

  32. MapReduce Summary  Simple programming model  Scalable, fault-tolerant  Ideal for (pre-)processing large volumes of data 20 Monday, August 23, 2010

  33. MapReduce Summary  Simple programming model  Scalable, fault-tolerant  Ideal for (pre-)processing large volumes of data ‘However, if the data center is the computer, it leads to the even more intriguing question “What is the equivalent of the ADD instruction for a data center?” […] If MapReduce is the first instruction of the “data center computer” , I can’t wait to see the rest of the instruction set, as well as the data center programming language, the data center operating system, the data center storage systems, and more.’ – David Patterson, “ The Data Center Is The Computer ”, CACM, Jan. 2008 20 Monday, August 23, 2010

  34. Outline  Introduction  MapReduce & distributed storage  Hadoop  HBase  Pig  Cascading  Hive  Summary 21 Monday, August 23, 2010

  35. Hadoop HBase Pig Hive Chukwa Zoo MapReduce HDFS Keeper Core Avro 22 Monday, August 23, 2010

  36. Hadoop HBase Pig Hive Chukwa Zoo MapReduce HDFS Keeper Hadoop’s stated mission (Doug Cutting interview): Core Avro Commoditize infrastructure for web-scale, data-intensive applications 22 Monday, August 23, 2010

  37. Who uses Hadoop?  Yahoo!  Facebook  Last.fm  Rackspace  Digg  Apache Nutch  … more in part 3 23 Monday, August 23, 2010

  38. Hadoop HBase Pig Hive Chukwa Zoo MapReduce HDFS Keeper Core Avro 24 Monday, August 23, 2010

  39. Hadoop HBase Pig Hive Chukwa Filesystems and I/O:  Abstraction APIs Zoo MapReduce HDFS  RPC / Persistence Keeper Core Avro 24 Monday, August 23, 2010

  40. Hadoop HBase Pig Hive Chukwa Zoo MapReduce HDFS Keeper Core Avro 25 Monday, August 23, 2010

  41. Hadoop HBase Pig Hive Chukwa Cross-language serialization:  RPC / persistence Zoo MapReduce HDFS  ~ Google ProtoBuf / FB Thrift Keeper Core Avro 25 Monday, August 23, 2010

  42. Hadoop HBase Pig Hive Chukwa Zoo MapReduce HDFS Keeper Core Avro 26 Monday, August 23, 2010

  43. Hadoop Distributed execution (batch)  Programming model HBase Pig Hive Chukwa  Scalability / fault-tolerance Zoo MapReduce HDFS Keeper Core Avro 26 Monday, August 23, 2010

  44. Hadoop HBase Pig Hive Chukwa Zoo MapReduce HDFS Keeper Core Avro 27 Monday, August 23, 2010

  45. Hadoop Distributed storage (read-opt.)  Replication / scalability HBase Pig Hive Chukwa  ~ Google filesystem (GFS) Zoo MapReduce HDFS Keeper Core Avro 27 Monday, August 23, 2010

  46. Hadoop HBase Pig Hive Chukwa Zoo MapReduce HDFS Keeper Core Avro 28 Monday, August 23, 2010

  47. Hadoop Coordination service  Locking / configuration HBase Pig Hive Chukwa  ~ Google Chubby Zoo MapReduce HDFS Keeper Core Avro 28 Monday, August 23, 2010

  48. Hadoop HBase Pig Hive Chukwa Zoo MapReduce HDFS Keeper Core Avro 29 Monday, August 23, 2010

  49. Hadoop HBase Pig Hive Chukwa Column-oriented, sparse store Zoo  Batch & random access MapReduce HDFS Keeper  ~ Google BigTable Core Avro 29 Monday, August 23, 2010

  50. Hadoop HBase Pig Hive Chukwa Zoo MapReduce HDFS Keeper Core Avro 30 Monday, August 23, 2010

  51. Hadoop HBase Pig Hive Chukwa Data flow language Zoo  Procedural SQL-inspired lang. MapReduce HDFS Keeper  Execution environment Core Avro 30 Monday, August 23, 2010

  52. Hadoop HBase Pig Hive Chukwa Zoo MapReduce HDFS Keeper Core Avro 31 Monday, August 23, 2010

  53. Hadoop HBase Pig Hive Chukwa Distributed data warehouse Zoo  SQL-like query language MapReduce HDFS Keeper  Data mgmt / query execution Core Avro 31 Monday, August 23, 2010

  54. Hadoop … … more HBase Pig Hive Chukwa Zoo MapReduce HDFS Keeper Core Avro 32 Monday, August 23, 2010

  55. MapReduce  Mapper: (k1, v1)  (k2, v2)[]  E.g., (void, textline : string)  (first : string, count : int)  Reducer: (k2, v2[])  (k3, v3)[]  E.g., (first : string, counts : int[])  (first : string, total : int)  Combiner: (k2, v2[])  (k2, v2)[]  Partition: (k2, v2)  int 33 Monday, August 23, 2010

  56. Mapper interface interface Mapper<K1, V1, K2, V2> {  void configure (JobConf conf); void map (K1 key, V1 value,  OutputCollector<K2, V2> out, Reporter reporter); void close();  }  Initialize in configure()  Clean-up in close()  Emit via out.collect(key,val) any time 34 Monday, August 23, 2010

  57. Reducer interface interface Reducer<K2, V2, K3, V3> {  void configure (JobConf conf); void reduce (  K2 key, Iterator<V2> values, OutputCollector<K3, V3> out, Reporter reporter);  void close(); }  Initialize in configure()  Clean-up in close()  Emit via out.collect(key,val) any time 35 Monday, August 23, 2010

  58. Some canonical examples  Histogram-type jobs:  Graph construction (bucket = edge)  K-means et al. (bucket = cluster center)  Inverted index:  Text indices  Matrix transpose  Sorting  Equi-join  More details in part 2 36 Monday, August 23, 2010

  59. Equi-joins “Reduce-side” (Smith, 7) MAP (Jones, 7) (Brown, 7) (Davis, 3) (Dukes, 5) (Black, 3) (Gruhl, 7) (Sales, 3) (Devel, 7) MAP (Acct., 5) 37 Monday, August 23, 2010

  60. Equi-joins “Reduce-side” (Smith, 7) 7: (  , (Smith)) MAP (Jones, 7) (Brown, 7) (Davis, 3) (Dukes, 5) (Black, 3) (Gruhl, 7) (Sales, 3) (Devel, 7) MAP (Acct., 5) 37 Monday, August 23, 2010

  61. Equi-joins “Reduce-side” (Smith, 7) 7: (  , (Smith)) MAP (Jones, 7) (Brown, 7) (Davis, 3) (Dukes, 5) (Black, 3) (Gruhl, 7) (Sales, 3) (Devel, 7) 7: (  , (Devel)) MAP (Acct., 5) 37 Monday, August 23, 2010

  62. Equi-joins “Reduce-side” (Smith, 7) 7: (  , (Smith)) MAP -OR- (Jones, 7) (7,  ): (Smith) (Brown, 7) (Davis, 3) (Dukes, 5) (Black, 3) (Gruhl, 7) (Sales, 3) (Devel, 7) 7: (  , (Devel)) MAP -OR- (Acct., 5) (7,  ): (Devel) 37 Monday, August 23, 2010

  63. Equi-joins “Reduce-side” (Smith, 7) MAP 7: (  , (Smith)) (Jones, 7) 7: (  , (Jones)) (Brown, 7) 7: (  , (Brown)) (Davis, 3) (Dukes, 5) (Black, 3) (Gruhl, 7) 7: (  , (Gruhl)) (Sales, 3) (Devel, 7) 7: (  , (Devel)) MAP (Acct., 5) 38 Monday, August 23, 2010

  64. Equi-joins “Reduce-side” (Smith, 7) MAP 7: (  , (Smith)) (Jones, 7) 7: (  , (Jones)) 7: {(  , (Smith)), (Brown, 7) 7: (  , (Brown)) (  , (Jones)), (Davis, 3) (  , (Brown)), SHUF. (  , (Gruhl)), (Dukes, 5) (  , (Devel)) } (Black, 3) (Gruhl, 7) 7: (  , (Gruhl)) (Sales, 3) (Devel, 7) 7: (  , (Devel)) MAP (Acct., 5) 38 Monday, August 23, 2010

  65. Equi-joins “Reduce-side” (Smith, 7) MAP 7: (  , (Smith)) (Jones, 7) 7: (  , (Jones)) 7: {(  , (Smith)), (Brown, 7) 7: (  , (Brown)) (  , (Jones)), (Davis, 3) (  , (Brown)), SHUF. (  , (Gruhl)), (Dukes, 5) (  , (Devel)) } (Black, 3) RED. (Gruhl, 7) 7: (  , (Gruhl)) (Smith, Devel), (Sales, 3) (Jones, Devel), (Brown, Devel), (Devel, 7) 7: (  , (Devel)) MAP (Gruhl, Devel) (Acct., 5) 38 Monday, August 23, 2010

  66. HDFS & MapReduce processes HOST 1 HOST 1 HOST 2 HOST 2 SPLIT 0 SPLIT 4 SPLIT 3 SPLIT 2 Replica 2/3 Replica 1/3 Replica 3/3 Replica 2/3 MAPPER MAPPER HOST 3 HOST 3 C C SPLIT 3 HOST 0 HOST 0 SPLIT 0 REDUCER Replica 1/3 SPLIT 2 SPLIT 1 Replica 3/3 SPLIT 0 SPLIT 1 Replica 3/3 Replica 1/3 Replica 1/3 Replica 2/3 MAPPER C MAPPER SPLIT 4 C SPLIT 3 Replica 2/3 Replica 2/3 HOST 5 HOST 4 HOST 6 39 Monday, August 23, 2010

  67. HDFS & MapReduce processes HOST 1 HOST 1 DATA HOST 2 HOST 2 DATA NODE SPLIT 0 SPLIT 4 NODE SPLIT 3 SPLIT 2 Replica 2/3 Replica 1/3 Replica 3/3 Replica 2/3 MAPPER MAPPER HOST 3 HOST 3 C DATA C SPLIT 3 HOST 0 HOST 0 NODE DATA SPLIT 0 REDUCER Replica 1/3 SPLIT 2 SPLIT 1 Replica 3/3 NODE SPLIT 0 SPLIT 1 Replica 3/3 Replica 1/3 Replica 1/3 Replica 2/3 MAPPER C MAPPER SPLIT 4 C SPLIT 3 Replica 2/3 Replica 2/3 HOST 5 DN HOST 4 HOST 6 DN DN 39 Monday, August 23, 2010

  68. HDFS & MapReduce processes HOST 1 HOST 1 DATA HOST 2 HOST 2 DATA NODE SPLIT 0 SPLIT 4 NODE SPLIT 3 SPLIT 2 Replica 2/3 Replica 1/3 Replica 3/3 Replica 2/3 MAPPER MAPPER HOST 3 HOST 3 C DATA C SPLIT 3 HOST 0 HOST 0 NODE DATA SPLIT 0 REDUCER Replica 1/3 SPLIT 2 SPLIT 1 Replica 3/3 NODE SPLIT 0 SPLIT 1 Replica 3/3 Replica 1/3 Replica 1/3 Replica 2/3 MAPPER C MAPPER SPLIT 4 C SPLIT 3 Replica 2/3 Replica 2/3 NAME HOST 5 DN NODE HOST 4 HOST 6 DN DN 39 Monday, August 23, 2010

Recommend


More recommend