[M AP R EDUCE /H ADOOP ] Shrideep Pallickara Computer Science - PDF document

CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science , Colorado State University CS 555: D ISTRIBUTED S YSTEMS [M AP R EDUCE /H ADOOP ] Shrideep Pallickara Computer Science Colorado State University CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Frequently asked questions from the previous class survey ¨ Types of tasks MapReduce is poor at ¨ Difference between Hadoop and Spark L11. 2 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA L11.1 S LIDES C REATED B Y : S HRIDEEP P ALLICKARA

CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science , Colorado State University Topics covered in this lecture ¨ Hadoop ¤ Phases of Map ¤ Phases of Reduce ¤ Examples L11. 3 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA H ADOOP CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University L11.2 S LIDES C REATED B Y : S HRIDEEP P ALLICKARA

CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science , Colorado State University Hadoop ¨ Java-based open-source implementation of MapReduce ¨ Created by Doug Cutting ¨ Origins of the name Hadoop ¤ Stuffed yellow elephant ¨ Includes HDFS [Hadoop Distributed File System] L11. 5 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA Hadoop: MapReduce Dataflow Input HDFS Sort Output HDFS Copy Map split 0 Merge Reduce Part 0 HDFS Replication split 1 Map Merge Reduce Part 1 HDFS Replication split 2 Map L11. 6 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA L11.3 S LIDES C REATED B Y : S HRIDEEP P ALLICKARA

CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science , Colorado State University In Hadoop a Map task has 4 phases ¨ Record reader ¨ Mapper ¨ Combiner ¨ Partitioner L11. 7 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA Map task phases: Record Reader ¨ Translates input splits into records ¨ Parse data into records, but does not parse the record itself ¨ Passes the data to the mapper in the form of a key/value pair ¤ key in this context is positional information ¤ value is the chunk of data that comprises a record L11. 8 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA L11.4 S LIDES C REATED B Y : S HRIDEEP P ALLICKARA

CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science , Colorado State University Map task phases: Map ¨ User-provided code is executed on each key/value pair from the record reader ¨ This user-code produces zero or more new key/value pairs, called the intermediate pairs ¤ key is what the data will be grouped on and value is the information pertinent to the analysis in the reducer ¤ Choice of key/value pairs is critical and not arbitrary L11. 9 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA Map task phases: Combiner ¨ Can group data in the map phase ¨ Takes the intermediate keys from the mapper and applies a user- provided method to aggregate values in the small scope of that one mapper ¨ Significantly reduces the amount of data that has to move over the network. ¤ Sending (“hello”, 3) requires fewer bytes than sending (“hello”, 1) three times over the network L11. 10 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA L11.5 S LIDES C REATED B Y : S HRIDEEP P ALLICKARA

CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science , Colorado State University Combiner function ¨ No guarantees on how many times Hadoop will call this on a map output record ¤ The combiner should, however, result in the same output from the reducer ¨ Combiners must be commutative and associative ¤ Sometimes they are also called distributive ¤ Commutative: Order of operands (5+2) = 2+5 n Division and subtraction are not commutative ¤ Associative: Order of operators 5 x (5x3) = (5x5)x3 ¤ Non-associative and non-commutative: Vector cross products and matrix multiplication (AB ≠ BA) respectively L11. 11 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA Map task phases: Partitioner ¨ Takes the intermediate key/value pairs from the mapper (or combiner) and splits them up into shards , one shard per reducer ¨ Default: key.hashCode() % (number of reducers) ¤ Randomly distributes the keyspace evenly over the reducers ¤ But still ensures that keys with the same value in different mappers end up at the same reducer L11. 12 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA L11.6 S LIDES C REATED B Y : S HRIDEEP P ALLICKARA

CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science , Colorado State University Map task phases: Partitioner ¨ Partitioner can be customized (e.g. for sorting) ¤ Changing the partitioner is rarely necessary ¨ The partitioned data is written to the local file system for each map and waits to be pulled by its respective reducer L11. 13 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA In Hadoop a Reduce task has 4 phases ¨ Shuffle ¨ Sort ¨ Reducer ¨ Output format L11. 14 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA L11.7 S LIDES C REATED B Y : S HRIDEEP P ALLICKARA

CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science , Colorado State University Reduce task phases: Shuffle and sort ¨ Shuffle ¤ Takes the output files written by all of the partitioners and downloads them to the local machine in which the reducer is running ¨ Sort ¤ Individual data pieces are then sorted by key into one larger data list ¤ Groups equivalent keys together so that their values can be iterated over easily in the reduce task L11. 15 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA Reduce task phases: Shuffle and sort ¨ This phase is not customizable and the framework handles everything automatically ¨ The only control a developer has is how the keys are sorted and grouped by specifying a custom Comparator object L11. 16 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA L11.8 S LIDES C REATED B Y : S HRIDEEP P ALLICKARA

CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science , Colorado State University Reduce task phases: Reducer ¨ Takes the grouped data as input and runs a reduce function once per key grouping ¨ The function is passed the key and an iterator over all of the values associated with that key ¤ A wide range of processing can happen in this function: data can be aggregated, filtered, and combined etc. ¨ Once the reduce function is done, it sends zero or more key/value pairs to the final step, the output format ¨ N.B.: map & reduce functions will change from job to job L11. 17 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA Reduce task phases: Output format ¨ Translates the final key/value pair from the reduce function and writes it out to a file using a record writer ¨ By default: ¤ Separate the key and value with a tab ¤ Separates records with a newline character ¨ Can typically be customized to provide richer output formats ¤ But in the end, the data is written out to HDFS, regardless of format L11. 18 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA L11.9 S LIDES C REATED B Y : S HRIDEEP P ALLICKARA

CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science , Colorado State University M AP R EDUCE EXAMPLE CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Word Count ¨ Word count over user-submitted comments on StackOverflow ¨ Content of the Text field will be retrieved and preprocessed ¤ Then count how many times we see each word ¨ Example record from this data set is: ¤ < row Id =" 8189677" PostId =" 6881722" Text =" Have you looked at Hadoop?" CreationDate =" 2011-07-30T07: 29: 33.343" UserId =" 831878" /> ¤ This record is the 8,189,677 th comment on Stack Overflow, and is associated with post number 6,881,722, and is by user number 831,878. L11. 20 CS555: Distributed Systems [Fall 2019] October 1, 2019 Dept. Of Computer Science , Colorado State University Professor: S HRIDEEP P ALLICKARA L11.10 S LIDES C REATED B Y : S HRIDEEP P ALLICKARA

[M AP R EDUCE /H ADOOP ] Shrideep Pallickara Computer Science - PDF document

CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science , Colorado State University CS 555: D ISTRIBUTED S YSTEMS [M AP R EDUCE /H ADOOP ] Shrideep Pallickara Computer Science Colorado State University CS555: Distributed Systems [Fall

Outline Out line 1. T il ed M ap ap R edu 1. iled educe ce . O pt n on TMR TMR 2. ptimi

The The Hadoop Di adoop Dist stri ributed buted Fi File le System System Konstantin

Cy Cypher pher-based based Graph ph Pattern ttern Ma Matc tching hing in in Gradoo adoop

Big ig Dat ata a an and Had adoop oop Venkatesh Vinayakarao venkateshv@cmi.ac.in

[H ADOOP ] Whats this hullabaloo about an elephant? No, not the one named Horton Who has fun

Systems H ADOOP Distributed File System Dr. Taieb Znati Computer Science Department University

Pl Planning nning to R o Red educe uce Lo Losses: ses: Hurrica rricane ne Mitiga igation

Investigating Al Alter ernative D e Data S a Sour ources es to Red educe ce R Res

T6 August 22, 2003 2:00PM R EDUCE R ISK U SING S ECURITY QA A UTOMATION T ECHNIQUES Alexander

to to reduce educe the e the exposur xposure to e to of Workers Compensation cost costs

Energy is about a lot Important to red educe t the e load (use l e les ess ene nergy) be

[T HREAD S AFETY & M AP R EDUCE ] Are you set on reinventing the wheel? Shunning libraries

[M AP R EDUCE ] Shrideep Pallickara Computer Science Colorado State University CS555:

Outline Motivation and Overview of Hadoop Architecture, Design & Implementation of the

Platelet reactivity outcomes Myocardial infarction Myocardial infarction Major

THE C OST OF U PDATES TO S HARED D ATA IN C ACHE -C OHERENT S YSTEMS G UOWEI Z HANG , W EBB H ORN ,

CPSC 121: Models of Computation Module 3: Representing Values in a Computer Module 3: Coming

VHDL Basics Madhav P. Desai Tel: x7423 Office hours: Wednesday 4pm-5pm email:

rs r rr r

CHEP 2009 [paper 28] Eric Grancher eric.grancher@cern.ch CERN IT Image courtesy of

Math you need to know 1 Linear algebra Linear algebra is mainly concerned with solving equations

Resolution in First-Order Logic Basic steps for proving a conclusion S given premises Premise 1 ,

Visualizing Multi-Int Data Craig Knoblock & Pedro Szekely University of Southern California

Multiresolution Modeling A Very Brief Introduction 1 Spring 2010 Multiresolution

[M AP R EDUCE /H ADOOP ] Shrideep Pallickara Computer Science - PDF document

CS555: Distributed Systems [Fall 2019] Dept. Of Computer Science , Colorado State University CS 555: D ISTRIBUTED S YSTEMS [M AP R EDUCE /H ADOOP ] Shrideep Pallickara Computer Science Colorado State University CS555: Distributed Systems [Fall

Outline Out line 1. T il ed M ap ap R edu 1. iled educe ce . O pt n on TMR TMR 2. ptimi

The The Hadoop Di adoop Dist stri ributed buted Fi File le System System Konstantin

Cy Cypher pher-based based Graph ph Pattern ttern Ma Matc tching hing in in Gradoo adoop

Big ig Dat ata a an and Had adoop oop Venkatesh Vinayakarao venkateshv@cmi.ac.in

[H ADOOP ] Whats this hullabaloo about an elephant? No, not the one named Horton Who has fun

Systems H ADOOP Distributed File System Dr. Taieb Znati Computer Science Department University

Pl Planning nning to R o Red educe uce Lo Losses: ses: Hurrica rricane ne Mitiga igation

Investigating Al Alter ernative D e Data S a Sour ources es to Red educe ce R Res

T6 August 22, 2003 2:00PM R EDUCE R ISK U SING S ECURITY QA A UTOMATION T ECHNIQUES Alexander

to to reduce educe the e the exposur xposure to e to of Workers Compensation cost costs

Energy is about a lot Important to red educe t the e load (use l e les ess ene nergy) be

[T HREAD S AFETY &amp; M AP R EDUCE ] Are you set on reinventing the wheel? Shunning libraries

[M AP R EDUCE ] Shrideep Pallickara Computer Science Colorado State University CS555:

Outline Motivation and Overview of Hadoop Architecture, Design &amp; Implementation of the

Platelet reactivity outcomes Myocardial infarction Myocardial infarction Major

THE C OST OF U PDATES TO S HARED D ATA IN C ACHE -C OHERENT S YSTEMS G UOWEI Z HANG , W EBB H ORN ,

CPSC 121: Models of Computation Module 3: Representing Values in a Computer Module 3: Coming

VHDL Basics Madhav P. Desai Tel: x7423 Office hours: Wednesday 4pm-5pm email:

rs r rr r

CHEP 2009 [paper 28] Eric Grancher eric.grancher@cern.ch CERN IT Image courtesy of

Math you need to know 1 Linear algebra Linear algebra is mainly concerned with solving equations

Resolution in First-Order Logic Basic steps for proving a conclusion S given premises Premise 1 ,

Visualizing Multi-Int Data Craig Knoblock &amp; Pedro Szekely University of Southern California

Multiresolution Modeling A Very Brief Introduction 1 Spring 2010 Multiresolution

[T HREAD S AFETY & M AP R EDUCE ] Are you set on reinventing the wheel? Shunning libraries

Outline Motivation and Overview of Hadoop Architecture, Design & Implementation of the

Visualizing Multi-Int Data Craig Knoblock & Pedro Szekely University of Southern California