MapReduce Andrew Crotty Alex Galakatos
What is MapReduce? MapReduce is a framework for: parallelizable problems, large datasets, cluster/grid computing
Background A Google project: many special-purpose computations had been implemented, and an abstraction was needed. MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
Map A user-defined function that takes input key/value pairs and returns intermediate key/value pairs, which are grouped by key and passed to Reduce
Reduce A user-defined function that takes an intermediate key and the corresponding set of values and returns a merged result (e.g., aggregates); the result is usually smaller than the input
Example Problem: count the number of word occurrences in a very large document. Solution: Map emits each word with an initial count of 1; Reduce emits the aggregated counts
Word Count: Map function
map(String text) {
  for (String word : text.split("\\s+")) {
    emit(word, 1);
  }
}
Word Count: Reduce function
reduce(String word, Iterator<Integer> counts) {
  int sum = 0;
  while (counts.hasNext()) {
    sum += counts.next();
  }
  emit(word, sum);
}
Shuffle Happens between the map and reduce phases: all intermediate values for a particular key are transferred to a single node, which creates high network load. Any problems with word count?
Combiner The word count map function produces many repetitive intermediate key/value pairs. The user can provide an optional function that performs partial merging before the shuffle; it must be commutative and associative, and its logic is usually the same as the reduce function
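A minimal Python sketch of the idea, assuming a toy in-memory list of intermediate pairs rather than Hadoop's actual combiner API: the partial merge reuses the summing logic of reduce, so far fewer pairs have to cross the network.

# Toy combiner: locally merge one mapper's repetitive (word, 1) pairs
# before they are shuffled over the network (illustrative, not Hadoop code).
from collections import defaultdict

def combine(pairs):
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count   # summation is commutative and associative
    return list(partial.items())

mapper_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
print(combine(mapper_output))    # [('the', 3), ('cat', 1)] -- fewer pairs to shuffle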
Execution Overview 1) Partition data 2) Map phase 3) Combiner phase (optional) 4) Shuffle data 5) Reduce phase 6) Return result
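As a rough illustration of these steps, here is a single-process Python sketch that walks through partition, map, combine, shuffle, and reduce for word count; the function and variable names are made up for this example and are not part of any framework API.

# Single-process sketch of the execution steps above (illustrative only;
# a real framework schedules these phases across a cluster).
from collections import defaultdict
from itertools import chain

def word_count(documents, n_partitions=2):
    # 1) Partition data
    partitions = [documents[i::n_partitions] for i in range(n_partitions)]
    combined = []
    for part in partitions:
        # 2) Map phase: emit (word, 1) for every word
        pairs = [(w, 1) for doc in part for w in doc.split()]
        # 3) Combiner phase (optional): partial merge per partition
        local = defaultdict(int)
        for w, c in pairs:
            local[w] += c
        combined.append(local.items())
    # 4) Shuffle: group all intermediate values by key
    grouped = defaultdict(list)
    for w, c in chain.from_iterable(combined):
        grouped[w].append(c)
    # 5) Reduce phase and 6) return result
    return {w: sum(cs) for w, cs in grouped.items()}

print(word_count(["the cat sat", "the dog sat"]))  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}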
Uses Distributed search, distributed sort, large-scale indexing, log file analysis, machine learning, and many more...
Advantages Simple programming model; can express many different problems; allows seamless horizontal scalability
Criticisms Lack of novelty; no performance enhancements; restricted framework
DBMS Complement, NOT a replacement. Useful for: 1) ETL and "read once" datasets 2) complex analytics 3) semi-structured data 4) quick-and-dirty analyses
Hadoop
What is Hadoop? Created in 2005 by Doug Cutting and Mike Cafarella; an open-source MapReduce implementation, written in Java and supported by Apache
HDFS Distributed file system; highly scalable and fault tolerant. Replication for availability and data locality, with rack-aware replica placement
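A toy Python sketch of rack-aware placement, assuming the commonly documented default policy for a replication factor of 3 (first replica on the writer's node, the other two on different nodes in one remote rack); the data structures here are illustrative, not HDFS internals.

# Toy sketch of rack-aware replica placement (not HDFS code).
import random

def place_replicas(writer_node, cluster):
    """cluster: dict mapping rack id -> list of node names."""
    writer_rack = next(r for r, nodes in cluster.items() if writer_node in nodes)
    first = writer_node                                        # local replica
    remote_rack = random.choice([r for r in cluster if r != writer_rack])
    second, third = random.sample(cluster[remote_rack], 2)     # two nodes, one remote rack
    return [first, second, third]

cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas("n1", cluster))   # e.g. ['n1', 'n4', 'n3']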
Amazon Web Services S3, EC2, Elastic MapReduce (managed Hadoop framework; runs "job flows"), and much more...
Elastic MapReduce Job Flows Java jar file, Streaming, Hive / Pig, HBase. Word count (streaming): write the map and reduce functions in Python, upload the input data and functions to S3; output is written back to S3
Mapper Reads from stdin and writes to stdout; splits each line and emits (word, 1)
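A sketch of such a streaming mapper in Python, assuming the usual Hadoop Streaming convention of tab-separated key/value pairs on stdout; the file name mapper.py is just an example.

#!/usr/bin/env python
# mapper.py -- reads lines from stdin, emits "word<TAB>1" on stdout
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%s" % (word, 1))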
Reducer Goes through the sorted words and sums the counts for each word
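A matching reducer sketch, assuming Hadoop Streaming has already sorted the mapper output by key so equal words arrive on consecutive lines; the file name reducer.py is again just an example.

#!/usr/bin/env python
# reducer.py -- reads "word<TAB>count" lines sorted by word from stdin,
# sums the counts for each word, and emits "word<TAB>total" on stdout
import sys

current_word, current_sum = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_sum += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_sum))
        current_word, current_sum = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_sum))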
Demo
Tupleware Distributed analytics framework that supports MapReduce-style programs, targeting machine learning/visualization use cases where the CPU is the bottleneck. Optimizes for CPU efficiency: cache-aware, register-aware, vectorized loops
Potential Projects SQL interpreter, language bindings, visualization, comparison benchmarks, and many more...
Questions?