

  1. Declarative MapReduce

  2. Declarative Languages
  Describe what you want to do, not how to do it. The most popular example is SQL. Can we compile SQL queries into MapReduce program(s)?

  3. Relational Operators
  Projection (π): SELECT revenue – expenses AS profit FROM …
  Selection / Filter (σ): SELECT … WHERE cost > 5000
  Aggregate (Σ): SELECT SUM(cost) …
  Grouped aggregate: SELECT SUM(cost) … GROUP BY product_id
  Join (⋈): SELECT … FROM Employee, Department WHERE Employee.dept_id = Department.id

  4. Example: Log file
  Schema: host, logname, time, method, url, response, bytes, referer, useragent
  pppa006.compuserve.com  - 807256800 GET /images/launch-logo.gif 200 1713
  vcc7.langara.bc.ca      - 807256804 GET /shuttle/missions/missions.html 200 8677
  pppa006.compuserve.com  - 807256806 GET /history/apollo/images/apollo-logo1.gif 200 1173
  bettong.client.uq.oz.au - 807256900 GET /history/skylab/skylab.html 304 0
  bettong.client.uq.oz.au - 807256913 GET /images/ksclogosmall.gif 304 0
  202.32.48.43            - 807259091 GET /shuttle/resources/orbiters/atlantis.gif 404 0
  bettong.client.uq.oz.au - 807256913 GET /history/apollo/images/apollo-logo.gif 200 3047
  ad03-053.compuserve.com - 807257487 GET /cgi-bin/imagemap/countdown70?284,288 302 85
  hella.stm.it            - 807256914 GET /shuttle/missions/sts-70/images/DSC-95EC-0001.jpg 200 513911
  We will model a tuple as a map [String → Value], which can be implemented as a hash table, for example. E.g., tuple.host = "…"

  5. Projection
  Input: a tuple with a set of attributes. Output: a tuple with another set of attributes. Can be modeled as a map-only job.
  Example: add the day of the week based on the time.
  map(tuple, context) {
    date = new Date(tuple.time)
    tuple.day_of_week = date.getDayOfWeek()
    context.write(tuple)
  }
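The map-only projection above can be sketched in plain Python, modeling a tuple as a dict per slide 4. This is a minimal sketch, not Hadoop code: the function name `project_day_of_week` and the use of Python's `datetime` (interpreting `time` as a Unix timestamp in seconds, UTC) are my own assumptions.

```python
from datetime import datetime, timezone

def project_day_of_week(tuple_):
    """Map-only projection: copy the input tuple and add a derived attribute."""
    out = dict(tuple_)  # tuples are dicts (map String -> Value)
    # Assumption: 'time' is a Unix timestamp in seconds, interpreted in UTC.
    date = datetime.fromtimestamp(out["time"], tz=timezone.utc)
    out["day_of_week"] = date.strftime("%A")
    return out

# One log tuple from the sample log file on slide 4.
record = {"host": "pppa006.compuserve.com", "time": 807256800, "bytes": 1713}
projected = project_day_of_week(record)
```

Because each output tuple depends on exactly one input tuple, no shuffle or reduce phase is needed.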

  6. Selection (Filter)
  Input: a tuple with a set of attributes. Output: either the tuple, if it matches the predicate, or nothing if it does not. Can be modeled as a map-only job.
  Example: find records with response code 200.
  map(tuple, context) {
    response_code = tuple.response
    if (response_code == 200)
      context.write(tuple)
  }
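A Python sketch of the selection operator, again modeling tuples as dicts. The `emit` callback stands in for `context.write`; the name is illustrative, not from Hadoop.

```python
def select_map(tuple_, emit):
    """Map-only selection: emit the tuple only if it matches the predicate."""
    if tuple_["response"] == 200:
        emit(tuple_)

records = [
    {"url": "/images/launch-logo.gif", "response": 200},
    {"url": "/shuttle/resources/orbiters/atlantis.gif", "response": 404},
]
output = []
for t in records:
    select_map(t, output.append)
```

A non-matching tuple simply produces no output; there is no need for a sentinel or a reduce phase.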

  7. Aggregation
  Input: a relation with a set of tuples. Output: one value that aggregates an entire column. Can be modeled as one map-reduce job.
  Example: find the sum of bytes.
  // Configure the job with one reducer
  map(tuple, context) {
    context.write(1, tuple.bytes)
  }
  reduce(key, values[], context) {
    context.write(key, sum(values))
  }

  8. Aggregation (cont'd)
  A combiner can be used to speed up the processing.
  // Configure the job with one reducer
  map(tuple, context) {
    context.write(1, tuple.bytes)
  }
  combine/reduce(key, values[], context) {
    context.write(key, sum(values))
  }
  Note: Hadoop has a special key type, NullWritable, for this single-key scenario.
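The combiner trick can be simulated in Python: each map task pre-sums its local values with the same function that the single reducer later applies. This is a sketch of the dataflow, with the chunking into two "map tasks" chosen arbitrarily for illustration.

```python
from collections import defaultdict

def map_bytes(tuple_):
    # Every tuple goes to the same dummy key so one reducer sees all values.
    yield (1, tuple_["bytes"])

def combine_or_reduce(key, values):
    # sum() is associative and commutative, so the same function
    # works both as the combiner and as the final reducer.
    yield (key, sum(values))

records = [{"bytes": 1713}, {"bytes": 8677}, {"bytes": 1173}]

# Simulate two map tasks, each running the combiner on its local output.
partials = []
for chunk in (records[:2], records[2:]):
    groups = defaultdict(list)
    for r in chunk:
        for k, v in map_bytes(r):
            groups[k].append(v)
    for k, vs in groups.items():
        partials.extend(combine_or_reduce(k, vs))

# The single reducer then sums the (much smaller) combiner outputs.
final = dict(combine_or_reduce(1, [v for _, v in partials]))
```

The combiner shrinks each map task's output to one value per key, which is what saves shuffle traffic in the real Hadoop job.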

  9. Other Aggregate Functions
  The same technique can be used for any function that is associative and commutative. This includes min, max, sum, and count. It also includes all functions that can be derived from these functions, e.g., average and standard deviation.
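The point about derived functions deserves a concrete illustration: the average itself is not associative, but a (sum, count) pair is, so combiners can carry the pair and the average is computed only at the very end. The state representation below is my own sketch of that idea.

```python
def combine_avg_state(a, b):
    # (sum, count) pairs combine associatively and commutatively;
    # the average is derived from the pair only at the very end.
    return (a[0] + b[0], a[1] + b[1])

# Three arbitrary partitions, standing in for three map tasks.
partitions = [[200, 100], [300], [400, 500, 600]]

# Each "combiner" reduces its partition to one (sum, count) state.
states = [(sum(p), len(p)) for p in partitions]

# The "reducer" merges states in any order, then derives the average.
total = states[0]
for s in states[1:]:
    total = combine_avg_state(total, s)
average = total[0] / total[1]
```

Standard deviation works the same way with a (sum, sum of squares, count) triple.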

  10. Grouped Aggregation
  Input: a relation with a set of tuples. Output: one value that aggregates an entire column for each value of the group key. Can be modeled as one map-reduce job.
  Example: find the sum of bytes for each response code.
  map(tuple, context) {
    context.write(tuple.response, tuple.bytes)
  }
  combine/reduce(key, values[], context) {
    context.write(key, sum(values))
  }
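A Python sketch of grouped aggregation, where a `defaultdict` plays the role of the shuffle phase that groups map outputs by key. The data is taken from the sample log on slide 4.

```python
from collections import defaultdict

def map_fn(tuple_):
    # Group key is the response code; value is the byte count.
    yield (tuple_["response"], tuple_["bytes"])

def reduce_fn(key, values):
    yield (key, sum(values))

records = [
    {"response": 200, "bytes": 1713},
    {"response": 200, "bytes": 8677},
    {"response": 404, "bytes": 0},
]

# The shuffle phase groups all map outputs by key.
groups = defaultdict(list)
for r in records:
    for k, v in map_fn(r):
        groups[k].append(v)

# One reduce call per distinct key.
result = dict(kv for k, vs in groups.items() for kv in reduce_fn(k, vs))
```

Unlike the whole-column aggregate, no single-reducer configuration is needed here: the group key partitions the work naturally across reducers.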

  11. Equi-join
  Input: two relations and a join column. Output: one tuple combining each pair of input tuples whose join columns are equal. Can be modeled as one map-reduce job.
  Special case: self-join, where both inputs are the same relation.
  Example (self-join): given a log file, find log entries which originate from the same host and request the same URL.

  12. Self Equi-join
  Given a log file, find log entries which originate from the same host and request the same URL.
  map(tuple, context) {
    join_key = tuple.host + "|" + tuple.url
    context.write(join_key, tuple)
  }
  reduce(key, values[], context) {
    for (int i = 0 to values.length) {
      for (int j = i + 1 to values.length) {
        merged_tuple = values[i] ∪ values[j]  // union
        context.write(key, merged_tuple)
      }
    }
  }
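A Python sketch of the self equi-join. One difference from the pseudocode above: the composite key is a Python tuple `(host, url)` rather than a `"host|url"` string, which avoids any worry about the separator character appearing in the data.

```python
from collections import defaultdict

def self_join(tuples):
    """Emit every pair of log entries sharing the same (host, url)."""
    groups = defaultdict(list)
    for t in tuples:                         # map phase: key by (host, url)
        groups[(t["host"], t["url"])].append(t)
    pairs = []
    for key, values in groups.items():       # reduce phase: pair up values
        for i in range(len(values)):
            for j in range(i + 1, len(values)):
                pairs.append((key, values[i], values[j]))
    return pairs

log = [
    {"host": "a", "url": "/x", "time": 1},
    {"host": "a", "url": "/x", "time": 2},
    {"host": "b", "url": "/x", "time": 3},
]
pairs = self_join(log)
```

Starting the inner loop at `i + 1` emits each unordered pair exactly once and never pairs a tuple with itself.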

  13. Binary Equi-join
  Given two log files, find log entries which originate from the same host and request the same URL.
  map(tuple, context, order) {
    join_key = tuple.host + "|" + tuple.url
    tuple.input_order = order
    context.write(join_key, tuple)
  }
  map(tuple, context) {
    // Use MapContext#getInputSplit() to identify the input file
    if (context.inputPath == inputFile1)
      map(tuple, context, 1)
    else
      map(tuple, context, 2)
  }

  14. Binary Equi-join (cont'd)
  reduce(key, values[], context) {
    for (int i = 0 to values.length) {
      for (int j = i + 1 to values.length) {
        if (values[i].input_order != values[j].input_order) {
          merged_tuple = values[i] ∪ values[j]  // union
          context.write(key, merged_tuple)
        }
      }
    }
  }
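The two-input join can be sketched in Python by tagging each tuple with its input of origin before grouping, exactly as the `input_order` field does above. The helper name `binary_join` is my own.

```python
from collections import defaultdict

def binary_join(left, right):
    """Pair tuples from opposite inputs that share the same (host, url)."""
    groups = defaultdict(list)
    for order, relation in ((1, left), (2, right)):   # map phase: tag + key
        for t in relation:
            tagged = dict(t, input_order=order)
            groups[(t["host"], t["url"])].append(tagged)
    pairs = []
    for key, values in groups.items():                # reduce phase
        for i in range(len(values)):
            for j in range(i + 1, len(values)):
                # Only join across inputs, never within one input.
                if values[i]["input_order"] != values[j]["input_order"]:
                    pairs.append((key, values[i], values[j]))
    return pairs

left = [{"host": "a", "url": "/x", "time": 1}]
right = [{"host": "a", "url": "/x", "time": 2},
         {"host": "b", "url": "/y", "time": 3}]
joined = binary_join(left, right)
```

The `input_order` check is the only difference from the self-join: without it, two tuples from the same file with the same key would wrongly be paired.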

  15. Chaining of MapReduce Jobs
  Hadoop is designed so that the output of one MapReduce job can be fed as the input to another MapReduce job.
  SELECT day_of_week(time) AS dow, SUM(bytes)
  FROM logfile
  WHERE response = 200
  GROUP BY dow;
  Pipeline: Input → Select → Project → Grouped Aggregate
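The chained pipeline for this query can be sketched as three stages in Python, one per job in the diagram: a map-only selection, a map-only projection, and a map-reduce grouped aggregate. Timestamps are again assumed to be Unix seconds interpreted in UTC.

```python
from collections import defaultdict
from datetime import datetime, timezone

def run_query(logfile):
    # Job 1 (map-only): SELECT ... WHERE response = 200
    selected = [t for t in logfile if t["response"] == 200]
    # Job 2 (map-only): project day_of_week(time) AS dow
    projected = [
        dict(t, dow=datetime.fromtimestamp(t["time"], tz=timezone.utc)
                          .strftime("%A"))
        for t in selected
    ]
    # Job 3 (map-reduce): SUM(bytes) GROUP BY dow
    sums = defaultdict(int)
    for t in projected:
        sums[t["dow"]] += t["bytes"]
    return dict(sums)

log = [
    {"time": 807256800, "response": 200, "bytes": 1713},
    {"time": 807256804, "response": 200, "bytes": 8677},
    {"time": 807259091, "response": 404, "bytes": 0},
]
result = run_query(log)
```

In real Hadoop the two map-only stages would normally be fused into the map function of the final job, which is exactly the consolidation that Pig performs (slide 23).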

  16. Pig
  A system built on top of Hadoop (it now supports Spark as well). Provides a SQL/ETL-like query language termed Pig Latin and compiles Pig Latin programs into MapReduce programs.

  17. Examples
  Filter: return all the lines that have a user-specified response code, e.g., 200.
  log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes);
  ok_lines = FILTER log BY response == '200';
  STORE ok_lines INTO 'filtered_output';
  (Compiles to a map-only job.)

  18. Examples
  Grouped aggregate: find the total number of bytes per response code.
  log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
  grouped = GROUP log BY response;
  grouped_aggregate = FOREACH grouped GENERATE group, SUM(log.bytes);
  STORE grouped_aggregate INTO 'grouped_output';
  (Compiles to a map-reduce job.)

  19. Examples
  Grouped aggregate: find the average number of bytes per response code.
  log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
  grouped = GROUP log BY response;
  grouped_aggregate = FOREACH grouped GENERATE group, AVG(log.bytes);
  STORE grouped_aggregate INTO 'grouped_output';

  20. Examples
  Join: find pairs of requests that ask for the same URL, coming from the same source.
  log1 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
  log2 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
  joined = JOIN log1 BY (url, host), log2 BY (url, host);

  21. Examples
  Join: find pairs of requests that ask for the same URL, coming from the same source, and happened within an hour of each other.
  log1 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
  log2 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
  joined = JOIN log1 BY (url, host), log2 BY (url, host);
  filtered = FILTER joined BY ABS(log1::time - log2::time) < 3600000; -- one hour in milliseconds
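What this Pig script computes can be sketched in Python as a self-join followed by a predicate on the pair, assuming (as the `3600000` constant suggests) that timestamps are in milliseconds. The function name and test data are illustrative.

```python
from collections import defaultdict

def join_within_hour(log, window_ms=3600 * 1000):
    """Self-join log entries on (url, host), then keep only pairs whose
    timestamps (in milliseconds) are within one hour of each other."""
    groups = defaultdict(list)
    for t in log:                            # JOIN ... BY (url, host)
        groups[(t["url"], t["host"])].append(t)
    pairs = []
    for values in groups.values():
        for i in range(len(values)):
            for j in range(i + 1, len(values)):
                # FILTER ... BY ABS(time1 - time2) < 3600000
                if abs(values[i]["time"] - values[j]["time"]) < window_ms:
                    pairs.append((values[i], values[j]))
    return pairs

log = [
    {"host": "a", "url": "/x", "time": 0},
    {"host": "a", "url": "/x", "time": 1_000_000},   # ~17 minutes later
    {"host": "a", "url": "/x", "time": 10_000_000},  # ~2.8 hours later
]
pairs = join_within_hour(log)
```

Note that Pig applies the time filter after the join; here the two steps are fused into one loop, which is the kind of consolidation Pig's compiler performs anyway.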

  22. How It Works
  LOAD determines the input path and InputFormat.
  STORE determines the output path and OutputFormat.
  FILTER and FOREACH are translated into map-only jobs.
  Aggregates (GROUP) and JOIN are translated into map-reduce jobs.
  All are compiled into one or more MapReduce jobs.

  23. Additional Features
  Lazy execution: nothing is actually executed until a STORE command is reached.
  Consolidation of map-only jobs: map-only jobs (FILTER and FOREACH) can be consolidated into the next job's map function or the previous job's reduce function.

  24. A Complex Example
  log1 = LOAD 'logs.csv' USING PigStorage() AS (…);
  log2 = LOAD 'logs.csv' USING PigStorage() AS (…);
  joined = JOIN log1 BY (url, host), log2 BY (url, host);
  filtered = FILTER joined BY ABS(log1::time - log2::time) < 3600000;
  grouped = GROUP filtered BY log1::host;
  agg_groups = FOREACH grouped GENERATE group, COUNT(filtered);
  STORE agg_groups INTO 'final_result';

  25. Further Readings
  Pig home page: https://pig.apache.org
  Detailed documentation: http://pig.apache.org/docs/r0.17.0/
  The original Pig Latin paper: Olston, Christopher, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. "Pig Latin: a not-so-foreign language for data processing." In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1099-1110. ACM, 2008.
