Interesting Implementation Details
Backup tasks:
• Straggler = a machine that takes an unusually long time to complete one of the last tasks. E.g.:
  – A bad disk forces frequent correctable errors (30 MB/s → 1 MB/s)
  – The cluster scheduler has scheduled other tasks on that machine
• Stragglers are a main reason for slowdown
• Solution: pre-emptive backup execution of the last few remaining in-progress tasks
Map-Reduce Summary
• Hides scheduling and parallelization details
• However, supports only very limited queries
  – Difficult to write more complex tasks
  – Need multiple map-reduce operations
• Solution: Pig Latin!
Following slides provided by: Alan Gates, Yahoo! Research
What is Pig?
• An engine for executing programs on top of Hadoop
• It provides a language, Pig Latin, to specify these programs
• An Apache open source project: http://hadoop.apache.org/pig/
Map-Reduce
• Computation is moved to the data
• A simple yet powerful programming model
  – Map: every record handled individually
  – Shuffle: records collected by key
  – Reduce: key and iterator of all associated values
• User provides:
  – input and output (usually files)
  – map Java function
  – key to aggregate on
  – reduce Java function
• Opportunities for more control: partitioning, sorting, partial aggregations, etc.
Map Reduce Illustrated (word count)
[Diagram, built up over several slides: two input records are each handled by a map task; map outputs are shuffled by key to reduce tasks, which sum the counts.]
Input records:    "Romeo, Romeo, wherefore art thou Romeo?"  |  "What, art thou hurt?"
Map output:       Romeo,1  Romeo,1  wherefore,1  art,1  thou,1  Romeo,1  |  What,1  art,1  thou,1  hurt,1
Shuffle (by key): art,(1,1)  hurt,(1)  thou,(1,1)  Romeo,(1,1,1)  wherefore,(1)  what,(1)
Reduce output:    art,2  hurt,1  thou,2  Romeo,3  wherefore,1  what,1
Making Parallelism Simple
• Sequential reads = good read speeds
• In a large cluster failures are guaranteed; Map Reduce handles retries
• Good fit for batch processing applications that need to touch all your data:
  – data mining
  – model tuning
• Bad fit for applications that need to find one particular record
• Bad fit for applications that need to communicate between processes; oriented around independent units of work
Why use Pig?
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited sites by users aged 18–25.
[Dataflow: Load Users → Filter by age; Load Pages; Join on name → Group on url → Count clicks → Order by clicks → Take top 5]
In Map-Reduce
[The slide shows the complete Java MapReduce implementation of this task: a ~170-line MRExample program with mapper and reducer classes LoadPages, LoadAndFilterUsers, Join, LoadJoined, ReduceUrls, LoadClicks, and LimitClicks, wired into five chained Hadoop jobs via JobControl.]
170 lines of code, 4 hours to write
In Pig Latin
Users = load ‘users’ as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load ‘pages’ as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into ‘top5sites’;
9 lines of code, 15 minutes to write
But can it fly?
Essence of Pig
• Map-Reduce is too low-level to program in, SQL too high-level
• Pig Latin, a language intended to sit between the two:
  – Imperative
  – Provides standard relational transforms (join, sort, etc.)
  – Schemas are optional, used when available, and can be defined at runtime
  – User Defined Functions are first class citizens (see the sketch below)
  – Opportunities for an advanced optimizer, but optimizations by the programmer are also possible
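To illustrate the "UDFs are first class citizens" point, here is a minimal sketch of registering and calling user-defined functions; the jar name myudfs.jar and the functions IsGoodUrl and Score are hypothetical, not taken from the slides.

-- a minimal sketch; the jar and function names are hypothetical
register myudfs.jar;
Urls = load ‘urls’ as (url, category, pagerank);
Good = filter Urls by myudfs.IsGoodUrl(url);            -- a UDF used directly in a filter condition
Scored = foreach Good generate url, myudfs.Score(url);  -- a UDF used in a foreach ... generate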
How It Works
Script (e.g., A = load; B = filter; C = group; D = foreach)
  → Parser → Logical Plan (standard relational algebra)
  → Semantic Checks → Logical Plan
  → Logical Optimizer (standard plan optimizations) → Logical Plan
  → Logical-to-Physical Translator → Physical Plan (physical operators to be executed)
  → Physical-to-MR Translator → Map-Reduce Plan (physical operators broken into Map, Combine, and Reduce stages)
  → MapReduce Launcher → jar submitted to Hadoop
Cool Things We’ve Added In the Last Year
• Multiquery – ability to combine multiple group bys into a single MR job (0.3)
• Merge join – if data is already sorted on the join key, do the join via merge in the map phase (0.4)
• Skew join – hash join for data with skew in the join key; allows splitting of a key across multiple reducers to handle skew (0.4)
• Zebra – contrib project that provides columnar storage of data (0.4)
• Rework of Load and Store functions to make them much easier to write (0.7, branched but not released)
• Owl – a metadata service for the grid (committed, will be released in 0.8)
Fragment Replicate Join (aka “Broadcast Join”)
Users = load ‘users’ as (name, age);
Pages = load ‘pages’ as (user, url);
Jnd = join Pages by user, Users by name using “replicated”;
[Diagram, built up over several slides: Pages is split into blocks, one per map task (Map 1 reads Pages block 1, Map 2 reads Pages block 2), while the entire Users relation is replicated to every map task; the join completes in the map phase with no reduce.]
Hash Join
Users = load ‘users’ as (name, age);
Pages = load ‘pages’ as (user, url);
Jnd = join Users by name, Pages by user;
[Diagram, built up over several slides: each map task reads a block of one input (a Users block or a Pages block) and tags every record with its input of origin and join key, (1, user) for records from one input and (2, name) for records from the other. The shuffle groups records by join key, so Reducer 1 receives (1, fred), (2, fred), (2, fred) and Reducer 2 receives (1, jane), (2, jane), (2, jane); each reducer then joins the records for its keys.]
Skew Join
Users = load ‘users’ as (name, age);
Pages = load ‘pages’ as (user, url);
Jnd = join Pages by user, Users by name using “skewed”;
[Diagram, built up over several slides: as in the hash join, each map task reads a block of Pages or Users and tags records as (1, user) or (2, name), but a sampling/partitioning step (the S and P boxes) in each map splits heavily skewed keys across several reducers. For the skewed key fred, Reducer 1 receives (1, fred, p1), (1, fred, p2) and (2, fred), while Reducer 2 receives (1, fred, p3), (1, fred, p4) and (2, fred): the Pages records for fred are split across reducers, and the matching Users record is replicated to both.]
Merge Join
Users = load ‘users’ as (name, age);
Pages = load ‘pages’ as (user, url);
Jnd = join Pages by user, Users by name using “merge”;
[Diagram, built up over several slides: both inputs are already sorted on the join key (aaron … zach). Each map task reads a sorted block of Pages (aaron… in Map 1, amy… in Map 2) and seeks to the corresponding position in the sorted Users input, merging the two in the map phase; no reduce phase is needed.]
Multi-store script
A = load ‘users’ as (name, age, gender, city, state);
B = filter A by name is not null;
C1 = group B by age, gender;
D1 = foreach C1 generate group, COUNT(B);
store D1 into ‘bydemo’;
C2 = group B by state;
D2 = foreach C2 generate group, COUNT(B);
store D2 into ‘bystate’;
[Dataflow: load users → filter nulls; then two branches: group by age, gender → apply UDFs → store into ‘bydemo’, and group by state → apply UDFs → store into ‘bystate’.]
Multi-Store Map-Reduce Plan
[Diagram: the two branches run in a single map-reduce job. Map phase: filter → split → a local rearrange operator per branch. Reduce phase: demux → a package → foreach pipeline per branch, one feeding each store.]
What are people doing with Pig
• At Yahoo ~70% of Hadoop jobs are Pig jobs
• Being used at Twitter, LinkedIn, and other companies
• Available as part of the Amazon EMR web service and the Cloudera Hadoop distribution
• What users use Pig for:
  – Search infrastructure
  – Ad relevance
  – Model training
  – User intent analysis
  – Web log processing
  – Image processing
  – Incremental processing of large data sets
What We’re Working on this Year
• Optimizer rewrite
• Integrating Pig with metadata
• Usability – our current error messages might as well be written in actual Latin
• Automated usage info collection
• UDFs in Python
Research Opportunities
• Cost-based optimization – how does current RDBMS technology carry over to the MR world?
• Memory usage – given that data processing is very memory intensive and Java offers poor control of memory usage, how can Pig be written to use memory well?
• Automated Hadoop tuning – can Pig figure out how to configure Hadoop to best run a particular script?
• Indices, materialized views, etc. – how do these traditional RDBMS tools fit into the MR world?
• Human-time queries – analysts want access to the petabytes of data available via Hadoop, but they don’t want to wait hours for their jobs to finish; can Pig find a way to answer analysts’ questions in under 60 seconds?
• Map-Reduce-Reduce – can MR be made more efficient for multiple MR jobs?
• How should Pig integrate with workflow systems?
• See more: http://wiki.apache.org/pig/PigJournal
Learn More
• Visit our website: http://hadoop.apache.org/pig/
• Online tutorials
  – From Yahoo: http://developer.yahoo.com/hadoop/tutorial/
  – From Cloudera: http://www.cloudera.com/hadoop-training
• A couple of Hadoop books are available that include chapters on Pig; search at your favorite bookstore
• Join the mailing lists:
  – pig-user@hadoop.apache.org for user questions
  – pig-dev@hadoop.apache.org for developer issues
• Contribute your work; over 50 people have so far
Pig Latin Mini-Tutorial (will skip in class; please read in order to do homework 7)
Outline
Based entirely on “Pig Latin: A Not-So-Foreign Language for Data Processing,” by Olston, Reed, Srivastava, Kumar, and Tomkins, 2008.
Quiz section tomorrow: in CSE 403 (this is CSE, don’t go to EE1)
Pig-Latin Overview
• Data model = loosely typed nested relations
• Query model = an SQL-like, dataflow language
• Execution model:
  – Option 1: run locally on your machine
  – Option 2: compile into a sequence of map/reduce jobs, run on a cluster supporting Hadoop
Example
• Input: a table of urls: (url, category, pagerank)
• Compute the average pagerank of all sufficiently high pageranks, for each category
• Return the answers only for categories with sufficiently many such pages
First in SQL…
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10^6
…then in Pig-Latin
good_urls = FILTER urls BY pagerank > 0.2
groups = GROUP good_urls BY category
big_groups = FILTER groups BY COUNT(good_urls) > 10^6
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
Types in Pig-Latin
• Atomic: string or number, e.g. ‘Alice’ or 55
• Tuple: (‘Alice’, 55, ‘salesperson’)
• Bag: {(‘Alice’, 55, ‘salesperson’), (‘Betty’, 44, ‘manager’), …}
• Maps: we will try not to use these
Types in Pig-Latin
Bags can be nested!
• {(‘a’, {1,4,3}), (‘c’, { }), (‘d’, {2,2,5,3,2})}
Tuple components can be referenced by number
• $0, $1, $2, … (see the sketch below)
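To connect the type system to the running urls example, here is a minimal sketch (the file name, field names, and declared types are illustrative, not from the slides); it shows a schema declared at load time and the same field referenced both by name and by position.

urls = LOAD ‘urls’ AS (url: chararray, category: chararray, pagerank: double)
by_name = FILTER urls BY pagerank > 0.2   -- field referenced by name
by_pos = FILTER urls BY $2 > 0.2          -- the same field referenced by position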
Loading data
• Input data = FILES!
  – Heard that before?
• The LOAD command parses an input file into a bag of records
• Both the parser (= “deserializer”) and the output type are provided by the user
Loading data
queries = LOAD ‘query_log.txt’
          USING myLoad( )
          AS (userID, queryString, timeStamp)
Loading data
• USING userfunction( ) – is optional
  – Default deserializer expects a tab-delimited file
• AS type – is optional
  – Default is a record with unnamed fields; refer to them as $0, $1, … (see the sketch below)
• The return value of LOAD is just a handle to a bag
  – The actual reading is done in pull mode, or parallelized
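A minimal sketch of loading with the defaults just described (the file name is reused from the earlier example; the positional reference is illustrative): no USING clause, so the tab-delimited default deserializer applies, and no AS clause, so fields are addressed by position.

queries = LOAD ‘query_log.txt’            -- default deserializer: tab-delimited file
user_ids = FOREACH queries GENERATE $0    -- first field, since no schema was declared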
FOREACH
expanded_queries = FOREACH queries
                   GENERATE userId, expandQuery(queryString)
expandQuery( ) is a UDF that produces likely expansions.
Note: it returns a bag, hence expanded_queries is a nested bag.
FOREACH
expanded_queries = FOREACH queries
                   GENERATE userId, flatten(expandQuery(queryString))
Now we get a flat collection.
FLATTEN
Note that it is NOT a first-class function! (That’s one thing I don’t like about Pig-Latin.)
• First-class FLATTEN:
  – FLATTEN({{2,3},{5},{},{4,5,6}}) = {2,3,5,4,5,6}
  – Type: {{T}} → {T}
• Pig-Latin FLATTEN:
  – FLATTEN({4,5,6}) = 4, 5, 6
  – Type: {T} → T, T, …, T ?????
FILTER
Remove all queries from Web bots:
real_queries = FILTER queries BY userId neq ‘bot’
Better: use a complex UDF to detect Web bots:
real_queries = FILTER queries BY NOT isBot(userId)
JOIN
results: {(queryString, url, position)}
revenue: {(queryString, adSlot, amount)}
join_result = JOIN results BY queryString, revenue BY queryString
join_result: {(queryString, url, position, adSlot, amount)}