Declarative MapReduce (10/29/2018)

MapReduce Examples
Filter: Map.
Aggregate: Map, Reduce.
Grouped aggregate: Map, Reduce.
Equi-join: Map, Reduce.
Non-equi-join: Map, Reduce.

Declarative Languages
Describe what you want to do, not how to do it.
The most popular example is SQL.
Can we compile SQL queries into MapReduce program(s)?

Pig
A system built on top of Hadoop (now supports Spark as well).
Provides a SQL-ETL-like query language termed Pig Latin.
Compiles Pig Latin programs into MapReduce programs.

Examples
Filter: Return all the lines that have a user-specified response code, e.g., 200.

    -- PigStorage() splits fields on tabs by default; pass ',' for comma-delimited input.
    log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes);
    ok_lines = FILTER log BY response == '200';
    STORE ok_lines INTO 'filtered_output';

MapReduce stages: Map.
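To make the "Map" label concrete, here is a minimal Python sketch of the map function such a FILTER could compile to. It is an illustration only, not Pig's actual generated code: the helper names (`filter_map`, `run_map_only`, `load`) and the sample rows are hypothetical, and the input is assumed to be tab-separated.

```python
import csv
import io

FIELDS = ["host", "time", "method", "url", "response", "bytes"]

# Hypothetical sample input (tab-separated), standing in for logs.csv.
SAMPLE = (
    "host1\t1000\tGET\t/a.html\t200\t512\n"
    "host2\t2000\tGET\t/b.html\t404\t0\n"
    "host1\t3000\tGET\t/c.html\t200\t128\n"
)


def load(text):
    """Parse tab-separated lines into dicts keyed by field name."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(FIELDS, row)) for row in reader]


def filter_map(record, wanted_response="200"):
    """Map function: emit the record unchanged if it matches, otherwise emit nothing."""
    if record["response"] == wanted_response:
        yield record


def run_map_only(records, map_fn):
    """Apply map_fn to each record independently: no shuffle and no reduce."""
    for record in records:
        yield from map_fn(record)


ok_lines = list(run_map_only(load(SAMPLE), filter_map))
```

Each record is decided on its own, which is why no reduce stage is needed.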

Examples
Grouped aggregate: Find the total number of bytes per response code.

    log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
    grouped = GROUP log BY response;
    grouped_aggregate = FOREACH grouped GENERATE group, SUM(log.bytes);
    STORE grouped_aggregate INTO 'grouped_output';

MapReduce stages: Map, Reduce.
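The grouped sum can be sketched as one map function and one reduce function. This is a minimal in-memory simulation under stated assumptions: `run_mapreduce`, `sum_map`, `sum_reduce`, and the sample records are hypothetical names and data for illustration, not Pig internals.

```python
from collections import defaultdict


def sum_map(record):
    """Map: key each record by its response code and emit its byte count."""
    yield record["response"], record["bytes"]


def sum_reduce(response, byte_counts):
    """Reduce: add up all byte counts that share one response code."""
    yield response, sum(byte_counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map, shuffle (group by key), and reduce in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


# Hypothetical sample records with only the fields this query needs.
LOG = [
    {"response": "200", "bytes": 512},
    {"response": "404", "bytes": 0},
    {"response": "200", "bytes": 128},
]

totals = run_mapreduce(LOG, sum_map, sum_reduce)
```

The shuffle step is what GROUP ... BY response asks for: all values with the same key end up in one reduce call.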

Examples
Grouped aggregate: Find the average number of bytes per response code.

    log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
    grouped = GROUP log BY response;
    grouped_aggregate = FOREACH grouped GENERATE group, AVG(log.bytes);
    STORE grouped_aggregate INTO 'grouped_output';
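An average differs from a sum in one respect: partial averages cannot simply be averaged again. A common approach, sketched here under the assumption of a combiner that runs once per input split, is to carry (sum, count) pairs and divide only in the reducer. All function names and sample data are hypothetical.

```python
from collections import defaultdict


def avg_map(record):
    """Map: emit a partial (sum, count) pair for each record."""
    yield record["response"], (record["bytes"], 1)


def avg_combine(response, partials):
    """Combine: merge partial pairs locally, still as (sum, count)."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    yield response, (total, count)


def avg_reduce(response, partials):
    """Reduce: merge all partial pairs, then divide once at the end."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    yield response, total / count


def run_job(splits, map_fn, combine_fn, reduce_fn):
    """Simulate a job whose combiner runs once per input split."""
    shuffled = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for record in split:
            for key, value in map_fn(record):
                local[key].append(value)
        for key, values in local.items():
            for out_key, out_value in combine_fn(key, values):
                shuffled[out_key].append(out_value)
    output = []
    for key in sorted(shuffled):
        output.extend(reduce_fn(key, shuffled[key]))
    return output


# Hypothetical input, already divided into two splits.
SPLITS = [
    [{"response": "200", "bytes": 100}, {"response": "200", "bytes": 300}],
    [{"response": "200", "bytes": 500}, {"response": "404", "bytes": 0}],
]

averages = run_job(SPLITS, avg_map, avg_combine, avg_reduce)
```

For response 200 the per-split averages are 200 and 500; averaging those would give 350, while the (sum, count) pairs give the correct 900 / 3 = 300.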

Examples
Join: Find pairs of requests that ask for the same URL, coming from the same source.

    log1 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
    log2 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
    joined = JOIN log1 BY (url, host), log2 BY (url, host);
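One way to run an equi-join on MapReduce is a reduce-side join: the map tags each record with the input it came from and keys it by the join columns, and the reduce pairs up the two sides within each key. The sketch below assumes that strategy; Pig may choose a different one, and the names `join_map`, `join_reduce`, `run_join`, and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import product


def join_map(tag, record):
    """Map: key each record by (url, host) and tag it with its input name."""
    yield (record["url"], record["host"]), (tag, record)


def join_reduce(key, tagged_records):
    """Reduce: pair every log1 record with every log2 record for this key."""
    left = [rec for tag, rec in tagged_records if tag == "log1"]
    right = [rec for tag, rec in tagged_records if tag == "log2"]
    for l_rec, r_rec in product(left, right):
        yield l_rec, r_rec


def run_join(log1, log2):
    """Simulate map, shuffle by join key, and reduce in memory."""
    groups = defaultdict(list)
    for tag, records in (("log1", log1), ("log2", log2)):
        for record in records:
            for key, value in join_map(tag, record):
                groups[key].append(value)
    pairs = []
    for key in sorted(groups):
        pairs.extend(join_reduce(key, groups[key]))
    return pairs


# Hypothetical sample log, joined with itself as in the Pig script.
LOG = [
    {"host": "h1", "time": 1000, "url": "/a"},
    {"host": "h1", "time": 2000, "url": "/a"},
    {"host": "h2", "time": 3000, "url": "/a"},
]

joined = run_join(LOG, LOG)
```

Because this is a self-join, every request is also paired with itself, and each distinct pair appears in both orders.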

Examples
Join: Find pairs of requests that ask for the same URL, coming from the same source, and happened within an hour of each other.

    log1 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
    log2 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
    joined = JOIN log1 BY (url, host), log2 BY (url, host);
    filtered = FILTER joined BY ABS(log1::time - log2::time) < 3600000;

How it works
LOAD operation: Determines the input path and InputFormat.
STORE operation: Determines the output path and OutputFormat.
FILTER and FOREACH: Translated into map-only jobs.
GROUP (aggregation) and JOIN: Translated into map-reduce jobs.
All are compiled into one or more MapReduce jobs.

Additional Features
Lazy execution: Nothing actually gets executed until the STORE command is reached.
Consolidation of map-only jobs: Map-only jobs (FILTER and FOREACH) can be consolidated into the next job's map function or the previous job's reduce function.
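Both features can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not how Pig is implemented: the class `LazyRelation` and its methods are hypothetical. Each operator only records a step, and `store()` composes the recorded steps into a single map function before reading any input.

```python
class LazyRelation:
    """Records a chain of map-only steps; reads no input until store() is called."""

    def __init__(self, source, steps=()):
        self.source = source        # callable that returns an iterator of records
        self.steps = tuple(steps)   # recorded map-only steps, in order

    def filter(self, predicate):
        def step(record):
            if predicate(record):
                yield record
        return LazyRelation(self.source, self.steps + (step,))

    def foreach(self, transform):
        def step(record):
            yield transform(record)
        return LazyRelation(self.source, self.steps + (step,))

    def fused_map(self):
        """Compose all recorded steps into one map function."""
        def map_fn(record):
            current = [record]
            for step in self.steps:
                current = [out for rec in current for out in step(rec)]
            return current
        return map_fn

    def store(self, sink):
        """Trigger execution: a single pass over the input with the fused map."""
        map_fn = self.fused_map()
        for record in self.source():
            sink.extend(map_fn(record))


reads = []  # tracks which input records have actually been read


def source():
    # Hypothetical sample records.
    for record in [
        {"url": "/a", "response": "200"},
        {"url": "/b", "response": "404"},
        {"url": "/c", "response": "200"},
    ]:
        reads.append(record)
        yield record


log = LazyRelation(source)
ok_lines = log.filter(lambda r: r["response"] == "200")
urls = ok_lines.foreach(lambda r: r["url"])
reads_before_store = len(reads)  # nothing has been read yet

output = []
urls.store(output)
```

The FILTER and the FOREACH run inside one pass over the input rather than as two separate jobs, which is the point of consolidation.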

A Complex Example

    log1 = LOAD 'logs.csv' USING PigStorage() AS (…);
    log2 = LOAD 'logs.csv' USING PigStorage() AS (…);
    joined = JOIN log1 BY (url, host), log2 BY (url, host);
    filtered = FILTER joined BY ABS(log1::time - log2::time) < 3600000;
    grouped = GROUP filtered BY log1::host;
    agg_groups = FOREACH grouped GENERATE group, COUNT(filtered);
    STORE agg_groups INTO 'final_result';
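One plausible plan for this script uses two MapReduce jobs: the first performs the join, with the FILTER folded into its reduce function, and the second groups by host and counts. The Python sketch below simulates that plan in memory. It is an assumption about the plan, not Pig's actual output; the function names and sample data are hypothetical. Since both inputs are the same file, the first job reads the log once and pairs each key group with itself.

```python
from collections import defaultdict

HOUR_MS = 3600000  # the threshold used in the script, read as milliseconds


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map, shuffle (group by key), and reduce in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def job1_map(record):
    """Job 1 map: key each request by the join columns (url, host)."""
    yield (record["url"], record["host"]), record


def job1_reduce(key, records):
    """Job 1 reduce: self-join the group, with the FILTER folded in."""
    for left in records:
        for right in records:
            if abs(left["time"] - right["time"]) < HOUR_MS:
                yield left, right


def job2_map(pair):
    """Job 2 map: key each surviving pair by the host of its left record."""
    left, _right = pair
    yield left["host"], 1


def job2_reduce(host, ones):
    """Job 2 reduce: count the pairs for each host."""
    yield host, sum(ones)


# Hypothetical sample log with times in milliseconds.
LOG = [
    {"host": "h1", "url": "/a", "time": 0},
    {"host": "h1", "url": "/a", "time": 1000000},
    {"host": "h1", "url": "/a", "time": 10000000},
    {"host": "h2", "url": "/a", "time": 0},
    {"host": "h1", "url": "/b", "time": 0},
]

filtered = run_mapreduce(LOG, job1_map, job1_reduce)
agg_groups = run_mapreduce(filtered, job2_map, job2_reduce)
```

Two jobs are needed because the second grouping key (host) differs from the join key (url, host), so the data must be shuffled a second time.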

Further Readings
Pig home page: https://pig.apache.org
Detailed documentation: http://pig.apache.org/docs/r0.17.0/
The original Pig Latin paper: Olston, Christopher, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. "Pig Latin: a not-so-foreign language for data processing." In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1099-1110. ACM, 2008.
