

  1. Data-Intensive Distributed Computing, CS 431/631 (Fall 2020), Part 3: From MapReduce to Spark (1/2). Ali Abedi. These slides are available at https://www.student.cs.uwaterloo.ca/~cs451/ This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

  2. The datacenter is the computer! What’s the instruction set? Source: Google

  3. Map/Reduce Abstraction: Map/Reduce (with Combine/Partition) plays the role of the instruction set, and the cluster of computers plays the role of the CPU. We need a solution for both storage and computing.

  4. So you like programming in assembly? Source: Wikipedia (ENIAC). So is programming in MapReduce like programming in assembly?! How can we do better?

  5. What’s the solution? Design a higher-level language. Write a compiler.

  6. Hadoop is great, but it’s really waaaaay too low level! What we really need is SQL! What we really need is a scripting language! Yahoo and Facebook designed their own solutions on top of Hadoop to make it more flexible for their engineers.

  7. Hive (SQL) and Pig (scripts). Both are open-source projects today!

  8. The stack: Hive and Pig sit on top of MapReduce, which sits on top of HDFS. Pig and Hive programs are ultimately compiled into MapReduce jobs.

  9. Pig! Source: Wikipedia (Pig)

  10. Pig: Example. Task: Find the top 10 most visited pages in each category.

    Visits                          URL Info
    User  Url         Time         Url         Category  PageRank
    Amy   cnn.com     8:00         cnn.com     News      0.9
    Amy   bbc.com     10:00        bbc.com     News      0.8
    Amy   flickr.com  10:05        flickr.com  Photos    0.7
    Fred  cnn.com     12:00        espn.com    Sports    0.9

  Pig Slides adapted from Olston et al. (SIGMOD 2008)

  11. Pig: Example Script

    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);

    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;

    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts, 10);

    store topUrls into '/data/topUrls';

  Pig Slides adapted from Olston et al. (SIGMOD 2008)
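  What the script computes can be mimicked with a minimal plain-Python sketch (illustrative Python, not Pig; the sample rows are the ones from the Visits and URL Info tables on the previous slide, and all variable names here are my own):

```python
from collections import defaultdict

# Sample rows mirroring the Visits and URL Info tables.
visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
          ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
url_info = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
            ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]

# group visits by url, then count  (gVisits / visitCounts)
visit_counts = defaultdict(int)
for user, url, time in visits:
    visit_counts[url] += 1

# join with url_info on url, then group by category  (join / gCategories)
by_category = defaultdict(list)
for url, category, prank in url_info:
    if url in visit_counts:
        by_category[category].append((url, visit_counts[url]))

# top 10 urls per category by visit count  (topUrls)
top_urls = {cat: sorted(urls, key=lambda p: p[1], reverse=True)[:10]
            for cat, urls in by_category.items()}
```

  The point of Pig is that the same dataflow, written declaratively, is compiled into distributed MapReduce jobs instead of running in one process like this sketch.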

  12. Pig Query Plan: load visits → group by url → foreach url generate count; load urlInfo → join on url; then group by category → foreach category generate top(urls, 10). Pig Slides adapted from Olston et al. (SIGMOD 2008)

  13. Pig: MapReduce Execution. The plan compiles into a chain of three MapReduce jobs:

    Map 1 / Reduce 1: load visits, group by url, foreach url generate count
    Map 2 / Reduce 2: load urlInfo, join on url
    Map 3 / Reduce 3: group by category, foreach category generate top(urls, 10)

  Pig Slides adapted from Olston et al. (SIGMOD 2008)


  15. But isn’t Pig slower? Sure, but C can be slower than assembly too…

  16. The datacenter is the computer! What’s the instruction set? Okay, let’s fix this! Source: Google. Having to formulate every problem solely in terms of map and reduce is restrictive.

  17. MapReduce Workflows: HDFS → map → reduce → HDFS → map → reduce → HDFS → map → reduce → HDFS → … What’s wrong? There is a lot of disk I/O between jobs, which significantly slows down MapReduce workflows like this.

  18. Want MM? HDFS → map → map (✗) is not allowed; it must be HDFS → map → HDFS → map → HDFS (✔). It’s okay not to have a reduce, but the output of one map cannot go directly to another map.

  19. Want MRR? Similarly, map → reduce → reduce (✗) is not allowed; the output of one reduce must be written to HDFS before the next map → reduce job (✔). We cannot directly move the output of one reduce to another reduce.

  20. The datacenter is the computer! Let’s enrich the instruction set! Source: Google. Can we add more operations to make the instruction set more flexible?

  21. Spark: the answer to “What’s beyond MapReduce?” Brief history: developed at UC Berkeley AMPLab in 2009; open-sourced in 2010; became a top-level Apache project in February 2014.

  22. Spark vs. Hadoop [Google Trends chart: Spark overtakes Hadoop around September 2014]. Spark is more popular than Hadoop today.

  23. MapReduce

    List[(K1, V1)]
      map     f: (K1, V1) ⇒ List[(K2, V2)]
      reduce  g: (K2, Iterable[V2]) ⇒ List[(K3, V3)]
    List[(K3, V3)]

  This is the only mechanism we had in MapReduce.
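  This two-function contract can be simulated in-memory with a short Python sketch (illustrative only: `run_mapreduce` and the word-count example are my own names, not part of Hadoop's actual API, and the real system shuffles data across machines rather than sorting one list):

```python
from itertools import groupby

def run_mapreduce(records, mapper, reducer):
    # mapper: (k1, v1) -> list of (k2, v2); reducer: (k2, [v2]) -> list of (k3, v3)
    intermediate = [kv for rec in records for kv in mapper(*rec)]
    intermediate.sort(key=lambda kv: kv[0])  # stands in for the shuffle/sort phase
    output = []
    for key, group in groupby(intermediate, key=lambda kv: kv[0]):
        output.extend(reducer(key, [v for _, v in group]))
    return output

# Word count, the canonical MapReduce example.
docs = [("d1", "hello world"), ("d2", "hello spark")]
counts = run_mapreduce(
    docs,
    mapper=lambda doc_id, text: [(w, 1) for w in text.split()],
    reducer=lambda word, ones: [(word, sum(ones))])
```

  Everything, no matter the problem, has to be squeezed into exactly these two function shapes; that is the restriction the next slides relax.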

  24. Map-like Operations

    map            f: (T) ⇒ U                        RDD[T] ⇒ RDD[U]
    filter         f: (T) ⇒ Boolean                  RDD[T] ⇒ RDD[T]
    mapPartitions  f: (Iterator[T]) ⇒ Iterator[U]    RDD[T] ⇒ RDD[U]
    flatMap        f: (T) ⇒ TraversableOnce[U]       RDD[T] ⇒ RDD[U]

  But Spark provides many more operations (an enriched instruction set).
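  The semantics of these four transformations can be sketched on plain Python lists (a local analogy, not actual Spark RDDs; `map_partitions` below is a hypothetical helper that treats a list of lists as a list of partitions):

```python
data = [1, 2, 3, 4]

mapped = [x * 10 for x in data]             # map: one output per input (T => U)
filtered = [x for x in data if x % 2 == 0]  # filter: keep inputs where T => Boolean is true
flat = [y for x in data for y in range(x)]  # flatMap: zero or more outputs per input

def map_partitions(partitions, f):
    # mapPartitions: f sees a whole partition (as an iterator) at once, which in
    # real Spark lets you amortize per-partition setup such as opening a connection.
    return [out for part in partitions for out in f(iter(part))]

parts = [[1, 2], [3, 4]]                    # two "partitions"
sums = map_partitions(parts, lambda it: [sum(it)])  # one value per partition
```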

  25. Reduce-like Operations

    groupByKey                                               RDD[(K, V)] ⇒ RDD[(K, Iterable[V])]
    reduceByKey     f: (V, V) ⇒ V                            RDD[(K, V)] ⇒ RDD[(K, V)]
    aggregateByKey  seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U    RDD[(K, V)] ⇒ RDD[(K, U)]
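  What each of these computes per key can be sketched with plain Python dicts (illustrative only; real Spark applies seqOp within partitions and combOp across partitions, which is why aggregateByKey's result type U can differ from V):

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

# groupByKey: collect all values per key -> (K, Iterable[V])
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)

# reduceByKey: fold values per key with f: (V, V) => V, here addition
reduced = {}
for k, v in pairs:
    reduced[k] = reduced[k] + v if k in reduced else v

# aggregateByKey: a zero value plus seqOp (U, V) => U; here U = (sum, count),
# the classic building block for computing a per-key average.
agg = {}
for k, v in pairs:
    s, c = agg.get(k, (0, 0))
    agg[k] = (s + v, c + 1)
```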

  26. And many other operations!
