

  1. Declarative MapReduce

  2. Declarative Languages
  Describe what you want to do, not how to do it. The most popular example is SQL. Can we compile SQL queries into MapReduce program(s)?

  3. Relational Operators
  Projection (π): SELECT revenue – expenses AS profit FROM …
  Selection / Filter (σ): SELECT … WHERE cost > 5000
  Aggregate (Σ): SELECT SUM(cost) …
  Grouped aggregate: SELECT SUM(cost) … GROUP BY product_id
  Join (⋈): SELECT … FROM Employee, Department WHERE Employee.dept_id = Department.id

  4. Example: Log file
  Schema: host, logname, time, method, url, response, bytes, referer, useragent
  pppa006.compuserve.com  - 807256800 GET /images/launch-logo.gif 200 1713
  vcc7.langara.bc.ca      - 807256804 GET /shuttle/missions/missions.html 200 8677
  pppa006.compuserve.com  - 807256806 GET /history/apollo/images/apollo-logo1.gif 200 1173
  bettong.client.uq.oz.au - 807256900 GET /history/skylab/skylab.html 304 0
  bettong.client.uq.oz.au - 807256913 GET /images/ksclogosmall.gif 304 0
  202.32.48.43            - 807259091 GET /shuttle/resources/orbiters/atlantis.gif 404 0
  bettong.client.uq.oz.au - 807256913 GET /history/apollo/images/apollo-logo.gif 200 3047
  ad03-053.compuserve.com - 807257487 GET /cgi-bin/imagemap/countdown70?284,288 302 85
  hella.stm.it            - 807256914 GET /shuttle/missions/sts-70/images/DSC-95EC-0001.jpg 200 513911
  We will model a tuple as a map [String → Value], which can be implemented as a hash table, for example. E.g., tuple.host = "…"

  5. Projection
  Input: a tuple with a set of attributes. Output: a tuple with another set of attributes. Can be modeled as a map-only job.
  Example: add the day of the week based on the time.
  map(tuple, context) {
    date = new Date(tuple.time)
    tuple.day_of_week = date.getDayOfWeek()
    context.write(tuple)
  }
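The map-only projection above can be sketched in plain Python, modeling a tuple as a dict per slide 4. This is a minimal sketch, not Hadoop code: the function name `project_day_of_week` and the use of Python's `datetime` (interpreting `time` as a Unix timestamp in seconds, UTC) are my own assumptions.

```python
from datetime import datetime, timezone

def project_day_of_week(tuple_):
    """Map-only projection: copy the input tuple and add a derived attribute."""
    out = dict(tuple_)  # tuples are dicts (map String -> Value)
    # Assumption: 'time' is a Unix timestamp in seconds, interpreted in UTC.
    date = datetime.fromtimestamp(out["time"], tz=timezone.utc)
    out["day_of_week"] = date.strftime("%A")
    return out

# One log tuple from the sample log file on slide 4.
record = {"host": "pppa006.compuserve.com", "time": 807256800, "bytes": 1713}
projected = project_day_of_week(record)
```

Because each output tuple depends on exactly one input tuple, no shuffle or reduce phase is needed.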

  6. Selection (Filter)
  Input: a tuple with a set of attributes. Output: either the tuple, if it matches the predicate, or nothing if it does not. Can be modeled as a map-only job.
  Example: find records with response code 200.
  map(tuple, context) {
    response_code = tuple.response
    if (response_code == 200)
      context.write(tuple)
  }
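A Python sketch of the selection operator, again modeling tuples as dicts. The `emit` callback stands in for `context.write`; the name is illustrative, not from Hadoop.

```python
def select_map(tuple_, emit):
    """Map-only selection: emit the tuple only if it matches the predicate."""
    if tuple_["response"] == 200:
        emit(tuple_)

records = [
    {"url": "/images/launch-logo.gif", "response": 200},
    {"url": "/shuttle/resources/orbiters/atlantis.gif", "response": 404},
]
output = []
for t in records:
    select_map(t, output.append)
```

A non-matching tuple simply produces no output; there is no need for a sentinel or a reduce phase.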

  7. Aggregation
  Input: a relation with a set of tuples. Output: one value that aggregates an entire column. Can be modeled as one map-reduce job.
  Example: find the sum of bytes.
  // Configure the job with one reducer
  map(tuple, context) {
    context.write(1, tuple.bytes)
  }
  reduce(key, values[], context) {
    context.write(key, sum(values))
  }

  8. Aggregation (cont'd)
  A combiner can be used to speed up the processing.
  // Configure the job with one reducer
  map(tuple, context) {
    context.write(1, tuple.bytes)
  }
  combine/reduce(key, values[], context) {
    context.write(key, sum(values))
  }
  Note: Hadoop has a special key type, NullWritable, for this single-key scenario.
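The combiner trick can be simulated in Python: each map task pre-sums its local values with the same function that the single reducer later applies. This is a sketch of the dataflow, with the chunking into two "map tasks" chosen arbitrarily for illustration.

```python
from collections import defaultdict

def map_bytes(tuple_):
    # Every tuple goes to the same dummy key so one reducer sees all values.
    yield (1, tuple_["bytes"])

def combine_or_reduce(key, values):
    # sum() is associative and commutative, so the same function
    # works both as the combiner and as the final reducer.
    yield (key, sum(values))

records = [{"bytes": 1713}, {"bytes": 8677}, {"bytes": 1173}]

# Simulate two map tasks, each running the combiner on its local output.
partials = []
for chunk in (records[:2], records[2:]):
    groups = defaultdict(list)
    for r in chunk:
        for k, v in map_bytes(r):
            groups[k].append(v)
    for k, vs in groups.items():
        partials.extend(combine_or_reduce(k, vs))

# The single reducer then sums the (much smaller) combiner outputs.
final = dict(combine_or_reduce(1, [v for _, v in partials]))
```

The combiner shrinks each map task's output to one value per key, which is what saves shuffle traffic in the real Hadoop job.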

  9. Other Aggregate Functions
  The same technique can be used for any function that is associative and commutative. This includes min, max, sum, and count. It also includes all functions that can be derived from these functions, e.g., average and standard deviation.
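The point about derived functions deserves a concrete illustration: the average itself is not associative, but a (sum, count) pair is, so combiners can carry the pair and the average is computed only at the very end. The state representation below is my own sketch of that idea.

```python
def combine_avg_state(a, b):
    # (sum, count) pairs combine associatively and commutatively;
    # the average is derived from the pair only at the very end.
    return (a[0] + b[0], a[1] + b[1])

# Three arbitrary partitions, standing in for three map tasks.
partitions = [[200, 100], [300], [400, 500, 600]]

# Each "combiner" reduces its partition to one (sum, count) state.
states = [(sum(p), len(p)) for p in partitions]

# The "reducer" merges states in any order, then derives the average.
total = states[0]
for s in states[1:]:
    total = combine_avg_state(total, s)
average = total[0] / total[1]
```

Standard deviation works the same way with a (sum, sum of squares, count) triple.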

  10. Grouped Aggregation
  Input: a relation with a set of tuples. Output: one value that aggregates an entire column for each value of the group key. Can be modeled as one map-reduce job.
  Example: find the sum of bytes for each response code.
  map(tuple, context) {
    context.write(tuple.response, tuple.bytes)
  }
  combine/reduce(key, values[], context) {
    context.write(key, sum(values))
  }
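A Python sketch of grouped aggregation, where a `defaultdict` plays the role of the shuffle phase that groups map outputs by key. The data is taken from the sample log on slide 4.

```python
from collections import defaultdict

def map_fn(tuple_):
    # Group key is the response code; value is the byte count.
    yield (tuple_["response"], tuple_["bytes"])

def reduce_fn(key, values):
    yield (key, sum(values))

records = [
    {"response": 200, "bytes": 1713},
    {"response": 200, "bytes": 8677},
    {"response": 404, "bytes": 0},
]

# The shuffle phase groups all map outputs by key.
groups = defaultdict(list)
for r in records:
    for k, v in map_fn(r):
        groups[k].append(v)

# One reduce call per distinct key.
result = dict(kv for k, vs in groups.items() for kv in reduce_fn(k, vs))
```

Unlike the whole-column aggregate, no single-reducer configuration is needed here: the group key partitions the work naturally across reducers.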

  11. Equi-join
  Input: two relations and a join column. Output: one tuple combining each pair of input tuples whose join columns are equal. Can be modeled as one map-reduce job.
  Special case: self-join, where both inputs are the same relation.
  Example (self-join): given a log file, find log entries which originate from the same host and request the same URL.

  12. Self Equi-join
  Given a log file, find log entries which originate from the same host and request the same URL.
  map(tuple, context) {
    join_key = tuple.host + "|" + tuple.url
    context.write(join_key, tuple)
  }
  reduce(key, values[], context) {
    for (int i = 0 to values.length) {
      for (int j = i + 1 to values.length) {
        merged_tuple = values[i] ∪ values[j]  // union
        context.write(key, merged_tuple)
      }
    }
  }
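A Python sketch of the self equi-join. One difference from the pseudocode above: the composite key is a Python tuple `(host, url)` rather than a `"host|url"` string, which avoids any worry about the separator character appearing in the data.

```python
from collections import defaultdict

def self_join(tuples):
    """Emit every pair of log entries sharing the same (host, url)."""
    groups = defaultdict(list)
    for t in tuples:                         # map phase: key by (host, url)
        groups[(t["host"], t["url"])].append(t)
    pairs = []
    for key, values in groups.items():       # reduce phase: pair up values
        for i in range(len(values)):
            for j in range(i + 1, len(values)):
                pairs.append((key, values[i], values[j]))
    return pairs

log = [
    {"host": "a", "url": "/x", "time": 1},
    {"host": "a", "url": "/x", "time": 2},
    {"host": "b", "url": "/x", "time": 3},
]
pairs = self_join(log)
```

Starting the inner loop at `i + 1` emits each unordered pair exactly once and never pairs a tuple with itself.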

  13. Binary Equi-join
  Given two log files, find log entries which originate from the same host and request the same URL.
  map(tuple, context, order) {
    join_key = tuple.host + "|" + tuple.url
    tuple.input_order = order
    context.write(join_key, tuple)
  }
  map(tuple, context) {
    // Use MapContext#getInputSplit() to identify the input file
    if (context.inputPath == inputFile1)
      map(tuple, context, 1)
    else
      map(tuple, context, 2)
  }

  14. Binary Equi-join (cont'd)
  reduce(key, values[], context) {
    for (int i = 0 to values.length) {
      for (int j = i + 1 to values.length) {
        if (values[i].input_order != values[j].input_order) {
          merged_tuple = values[i] ∪ values[j]  // union
          context.write(key, merged_tuple)
        }
      }
    }
  }
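The two-input join can be sketched in Python by tagging each tuple with its input of origin before grouping, exactly as the `input_order` field does above. The helper name `binary_join` is my own.

```python
from collections import defaultdict

def binary_join(left, right):
    """Pair tuples from opposite inputs that share the same (host, url)."""
    groups = defaultdict(list)
    for order, relation in ((1, left), (2, right)):   # map phase: tag + key
        for t in relation:
            tagged = dict(t, input_order=order)
            groups[(t["host"], t["url"])].append(tagged)
    pairs = []
    for key, values in groups.items():                # reduce phase
        for i in range(len(values)):
            for j in range(i + 1, len(values)):
                # Only join across inputs, never within one input.
                if values[i]["input_order"] != values[j]["input_order"]:
                    pairs.append((key, values[i], values[j]))
    return pairs

left = [{"host": "a", "url": "/x", "time": 1}]
right = [{"host": "a", "url": "/x", "time": 2},
         {"host": "b", "url": "/y", "time": 3}]
joined = binary_join(left, right)
```

The `input_order` check is the only difference from the self-join: without it, two tuples from the same file with the same key would wrongly be paired.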

  15. Chaining of MapReduce Jobs
  Hadoop is designed so that the output of one MapReduce job can be fed as the input to another MapReduce job.
  SELECT day_of_week(time) AS dow, SUM(bytes)
  FROM logfile
  WHERE response = 200
  GROUP BY dow;
  Pipeline: Input → Select → Project → Grouped Aggregate
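The chained pipeline for this query can be sketched as three stages in Python, one per job in the diagram: a map-only selection, a map-only projection, and a map-reduce grouped aggregate. Timestamps are again assumed to be Unix seconds interpreted in UTC.

```python
from collections import defaultdict
from datetime import datetime, timezone

def run_query(logfile):
    # Job 1 (map-only): SELECT ... WHERE response = 200
    selected = [t for t in logfile if t["response"] == 200]
    # Job 2 (map-only): project day_of_week(time) AS dow
    projected = [
        dict(t, dow=datetime.fromtimestamp(t["time"], tz=timezone.utc)
                          .strftime("%A"))
        for t in selected
    ]
    # Job 3 (map-reduce): SUM(bytes) GROUP BY dow
    sums = defaultdict(int)
    for t in projected:
        sums[t["dow"]] += t["bytes"]
    return dict(sums)

log = [
    {"time": 807256800, "response": 200, "bytes": 1713},
    {"time": 807256804, "response": 200, "bytes": 8677},
    {"time": 807259091, "response": 404, "bytes": 0},
]
result = run_query(log)
```

In real Hadoop the two map-only stages would normally be fused into the map function of the final job, which is exactly the consolidation that Pig performs (slide 23).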

  16. Pig
  A system built on top of Hadoop (it now supports Spark as well). Provides a SQL/ETL-like query language termed Pig Latin and compiles Pig Latin programs into MapReduce programs.

  17. Examples
  Filter: return all the lines that have a user-specified response code, e.g., 200.
  log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes);
  ok_lines = FILTER log BY response == '200';
  STORE ok_lines INTO 'filtered_output';
  (Compiles to a map-only job.)

  18. Examples
  Grouped aggregate: find the total number of bytes per response code.
  log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
  grouped = GROUP log BY response;
  grouped_aggregate = FOREACH grouped GENERATE group, SUM(log.bytes);
  STORE grouped_aggregate INTO 'grouped_output';
  (Compiles to a map-reduce job.)

  19. Examples
  Grouped aggregate: find the average number of bytes per response code.
  log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
  grouped = GROUP log BY response;
  grouped_aggregate = FOREACH grouped GENERATE group, AVG(log.bytes);
  STORE grouped_aggregate INTO 'grouped_output';

  20. Examples
  Join: find pairs of requests that ask for the same URL, coming from the same source.
  log1 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
  log2 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
  joined = JOIN log1 BY (url, host), log2 BY (url, host);

  21. Examples
  Join: find pairs of requests that ask for the same URL, coming from the same source, and happened within an hour of each other.
  log1 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
  log2 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
  joined = JOIN log1 BY (url, host), log2 BY (url, host);
  filtered = FILTER joined BY ABS(log1::time - log2::time) < 3600000; -- one hour in milliseconds
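What this Pig script computes can be sketched in Python as a self-join followed by a predicate on the pair, assuming (as the `3600000` constant suggests) that timestamps are in milliseconds. The function name and test data are illustrative.

```python
from collections import defaultdict

def join_within_hour(log, window_ms=3600 * 1000):
    """Self-join log entries on (url, host), then keep only pairs whose
    timestamps (in milliseconds) are within one hour of each other."""
    groups = defaultdict(list)
    for t in log:                            # JOIN ... BY (url, host)
        groups[(t["url"], t["host"])].append(t)
    pairs = []
    for values in groups.values():
        for i in range(len(values)):
            for j in range(i + 1, len(values)):
                # FILTER ... BY ABS(time1 - time2) < 3600000
                if abs(values[i]["time"] - values[j]["time"]) < window_ms:
                    pairs.append((values[i], values[j]))
    return pairs

log = [
    {"host": "a", "url": "/x", "time": 0},
    {"host": "a", "url": "/x", "time": 1_000_000},   # ~17 minutes later
    {"host": "a", "url": "/x", "time": 10_000_000},  # ~2.8 hours later
]
pairs = join_within_hour(log)
```

Note that Pig applies the time filter after the join; here the two steps are fused into one loop, which is the kind of consolidation Pig's compiler performs anyway.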

  22. How It Works
  LOAD determines the input path and InputFormat.
  STORE determines the output path and OutputFormat.
  FILTER and FOREACH are translated into map-only jobs.
  Aggregates (GROUP) and JOIN are translated into map-reduce jobs.
  All are compiled into one or more MapReduce jobs.

  23. Additional Features
  Lazy execution: nothing is actually executed until a STORE command is reached.
  Consolidation of map-only jobs: map-only jobs (FILTER and FOREACH) can be consolidated into the next job's map function or the previous job's reduce function.

  24. A Complex Example
  log1 = LOAD 'logs.csv' USING PigStorage() AS (…);
  log2 = LOAD 'logs.csv' USING PigStorage() AS (…);
  joined = JOIN log1 BY (url, host), log2 BY (url, host);
  filtered = FILTER joined BY ABS(log1::time - log2::time) < 3600000;
  grouped = GROUP filtered BY log1::host;
  agg_groups = FOREACH grouped GENERATE group, COUNT(filtered);
  STORE agg_groups INTO 'final_result';

  25. Further Readings
  Pig home page: https://pig.apache.org
  Detailed documentation: http://pig.apache.org/docs/r0.17.0/
  The original Pig Latin paper: Olston, Christopher, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. "Pig Latin: a not-so-foreign language for data processing." In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1099-1110. ACM, 2008.
