  1. CS 744: Resilient Distributed Datasets Shivaram Venkataraman Fall 2020

  2. ADMINISTRIVIA
     – Assignment 1: Due Sep 21, Monday at 10pm! Post questions on Piazza.
     – Assignment 2 (ML) will be released Sep 22
     – Final project details: next week

  3. MOTIVATION: Programmability
     Most real applications require multiple MR steps
     – Google indexing pipeline: 21 steps
     – Analytics queries (e.g. sessions, top K): 2-5 steps
     – Iterative algorithms (e.g. PageRank): 10's of steps
     Multi-step jobs create spaghetti code
     – 21 MR steps → 21 mapper and reducer classes

  4. MOTIVATION: Performance
     MR only provides one pass of computation
     – Must write out data to the file system in between steps
     Expensive for apps that need to reuse data
     – Multi-step algorithms (e.g. PageRank)
     – Interactive data mining

  5. Programmability: Google MapReduce WordCount

     #include "mapreduce/mapreduce.h"

     // User's map function
     class SplitWords: public Mapper {
      public:
       virtual void Map(const MapInput& input) {
         const string& text = input.value();
         const int n = text.size();
         for (int i = 0; i < n; ) {
           // Skip past leading whitespace
           while (i < n && isspace(text[i])) i++;
           // Find word end
           int start = i;
           while (i < n && !isspace(text[i])) i++;
           if (start < i)
             Emit(text.substr(start, i-start), "1");
         }
       }
     };
     REGISTER_MAPPER(SplitWords);

     // User's reduce function
     class Sum: public Reducer {
      public:
       virtual void Reduce(ReduceInput* input) {
         // Iterate over all entries with the same key and add the values
         int64 value = 0;
         while (!input->done()) {
           value += StringToInt(input->value());
           input->NextValue();
         }
         // Emit sum for input->key()
         Emit(IntToString(value));
       }
     };
     REGISTER_REDUCER(Sum);

     int main(int argc, char** argv) {
       ParseCommandLineFlags(argc, argv);
       MapReduceSpecification spec;
       for (int i = 1; i < argc; i++) {
         MapReduceInput* in = spec.add_input();
         in->set_format("text");
         in->set_filepattern(argv[i]);
         in->set_mapper_class("SplitWords");
       }
       // Specify the output files
       MapReduceOutput* out = spec.output();
       out->set_filebase("/gfs/test/freq");
       out->set_num_tasks(100);
       out->set_format("text");
       out->set_reducer_class("Sum");
       // Do partial sums within map
       out->set_combiner_class("Sum");
       // Tuning parameters
       spec.set_machines(2000);
       spec.set_map_megabytes(100);
       spec.set_reduce_megabytes(100);
       // Now run it
       MapReduceResult result;
       if (!MapReduce(spec, &result)) abort();
       return 0;
     }

  6. APACHE Spark Programmability: WordCount

     val file = spark.textFile("hdfs://...")
     val counts = file.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
     counts.save("out.txt")

  7. APACHE Spark
     Programmability: clean, functional API
     – Parallel transformations on collections
     – 5-10x less code than MR
     – Available in Scala, Java, Python and R
     Performance
     – In-memory computing primitives
     – Optimization across operators

  8. Spark Concepts
     Resilient distributed datasets (RDDs)
     – Immutable, partitioned collections of objects; changes to records are tracked via lineage
     – May be cached in memory for fast reuse
     Operations on RDDs
     – Transformations (build RDDs)
     – Actions (compute results)
     Restricted shared variables
     – Broadcast, accumulators
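The transformation/action split above can be illustrated with a minimal plain-Python sketch (this is not the Spark API; `TinyRDD` is a made-up class for illustration). Transformations only record a computation; the action forces it to run.

```python
# Plain-Python sketch (not the Spark API) of lazy transformations vs. actions.
class TinyRDD:
    def __init__(self, compute):
        self._compute = compute  # zero-arg function producing the data

    # Transformations: build a new TinyRDD, run nothing yet.
    def map(self, f):
        return TinyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, p):
        return TinyRDD(lambda: [x for x in self._compute() if p(x)])

    # Action: actually evaluates the whole chain.
    def collect(self):
        return self._compute()

rdd = TinyRDD(lambda: [1, 2, 3, 4])
doubled_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: 2 * x)
# Nothing has run yet; collect() forces evaluation.
print(doubled_evens.collect())  # -> [4, 8]
```

Because the chain is just recorded functions, Spark can inspect, optimize, and replay it — which is what makes lineage-based recovery possible.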

  9. Example: Log Mining
     Find error messages present in log files interactively (Example: HTTP server logs)

     lines = spark.textFile("hdfs://...")          // Base RDD
     errors = lines.filter(_.startsWith("ERROR"))  // Transformed RDD
     messages = errors.map(_.split('\t')(2))
     messages.cache()
     messages.filter(_.contains("foo")).count      // Action

     [figure: the driver sends tasks to workers; each worker reads its block (Block 1-3), filters and transforms it, and caches the resulting messages partition in memory]

  10. Example: Log Mining
      Find error messages present in log files interactively (Example: HTTP server logs)

      lines = spark.textFile("hdfs://...")
      errors = lines.filter(_.startsWith("ERROR"))
      messages = errors.map(_.split('\t')(2))
      messages.cache()
      messages.filter(_.contains("foo")).count
      messages.filter(_.contains("bar")).count
      . . .

      Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
      Result: search 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

      [figure: subsequent queries hit the cached partitions (Cache 1-3) on the workers]

  11. Fault Recovery

      messages = textFile(...).filter(_.startsWith("ERROR"))
                              .map(_.split('\t')(2))

      Each RDD tracks the lineage of transformations used to build it, so a lost partition can be recomputed from its parent:
      HDFS File → filter (func = _.startsWith(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD
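The recovery idea can be sketched in plain Python (a toy simulation, not Spark internals): instead of replicating the derived data, keep the parent partitions plus the transformation, and replay it for whichever partition is lost.

```python
# Plain-Python sketch (not Spark internals) of lineage-based recovery.
parents = [[1, 2, 3], [4, 5, 6]]   # two partitions of the base data
transform = lambda x: x * 10       # the recorded map function (the "lineage")

# Materialize the derived partitions (think: cached RDD partitions).
derived = [[transform(x) for x in p] for p in parents]

# A worker dies and partition 1 is lost...
derived[1] = None

# ...so we recompute just that one partition from its lineage,
# leaving the surviving partition untouched.
derived[1] = [transform(x) for x in parents[1]]
print(derived)  # -> [[10, 20, 30], [40, 50, 60]]
```

Only the lost partition is recomputed, which is why lineage-based recovery is cheaper than replicating every intermediate dataset.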

  12. Other RDD Operations
      Transformations (define a new RDD): map, flatMap, filter, sample, groupByKey, reduceByKey, mapValues, union, join, cross, cogroup, ...
      Actions (output a result): collect, count, reduce, take, fold, saveAsTextFile, saveAsHadoopFile, ...
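To make the semantics of `reduceByKey` concrete, here is a plain-Python sketch (not the Spark API; `reduce_by_key` is a hypothetical helper) that mirrors the word-count pipeline from the earlier Spark slide:

```python
# Plain-Python sketch of what reduceByKey does to (key, value) pairs.
def reduce_by_key(pairs, combine):
    acc = {}
    for k, v in pairs:
        # Merge each value into the running result for its key.
        acc[k] = combine(acc[k], v) if k in acc else v
    return sorted(acc.items())

lines = ["to be or", "not to be"]
words = [w for line in lines for w in line.split(" ")]   # flatMap
pairs = [(w, 1) for w in words]                          # map
counts = reduce_by_key(pairs, lambda a, b: a + b)        # reduceByKey(_ + _)
print(counts)  # -> [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In Spark the same merge runs per-partition first (like a combiner) and the partial results are then shuffled by key, but the observable result matches this local version.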

  13. DEPENDENCIES
      [figure: RDD operations and their dependencies on input files; intermediate files are written between stages]

  14. Job Scheduler (1)
      – Captures RDD dependency graph
      – Pipelines functions into "stages"; each stage runs as one set of tasks
      [figure: DAG of RDDs A-G; Stage 1: A → groupBy → B; Stage 2: C → map → D, union with E → F; Stage 3: B join F → G; shaded = cached partition]

  15. Job Scheduler (2)
      – Cache-aware for data reuse, locality
      – Partitioning-aware to avoid shuffles
      [same stage DAG figure; shaded = cached partition]
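The "pipelining" the scheduler does for narrow dependencies can be sketched in plain Python (a toy model, not Spark's scheduler; `pipeline` is a made-up helper): consecutive narrow transformations are fused so each partition is traversed once, with no intermediate dataset materialized between operators.

```python
# Plain-Python sketch of fusing narrow transformations into one stage.
def pipeline(*funcs):
    """Compose per-element functions so a stage makes a single pass."""
    def run(x):
        for f in funcs:
            x = f(x)
        return x
    return run

partition = [1, 2, 3]
# map(x => x + 1) followed by map(x => x * x), fused into one task body.
stage = pipeline(lambda x: x + 1, lambda x: x * x)
print([stage(x) for x in partition])  # -> [4, 9, 16]
```

A wide dependency (e.g. groupBy) cannot be fused this way, because an output record may need inputs from every partition; that is exactly where the scheduler cuts a stage boundary and shuffles.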

  16. CHECKPOINTING

      rdd = sc.parallelize(1 to 100, 2).map(x => 2*x)
      rdd.checkpoint()
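Why checkpoint at all, if lineage can already recompute lost data? Because after many transformations the lineage chain gets long and replaying it is expensive. A plain-Python sketch (not Spark's implementation; the file path is a stand-in for stable storage such as HDFS):

```python
# Plain-Python sketch of checkpointing: persist a dataset to stable
# storage so recovery reads it back instead of replaying a long lineage.
import json
import os
import tempfile

# Like sc.parallelize(1 to 100).map(x => 2*x), fully materialized.
data = [2 * x for x in range(1, 101)]

# "Checkpoint": write to (stand-in) stable storage.
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
with open(path, "w") as f:
    json.dump(data, f)

# On failure, recovery reloads the checkpoint; the lineage before this
# point no longer needs to be replayed.
with open(path) as f:
    recovered = json.load(f)
print(recovered[:3])  # -> [2, 4, 6]
```

The tradeoff is checkpointing cost (a full write to stable storage) versus recomputation cost, which is why it is explicit in the API rather than automatic.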

  17. SUMMARY
      Spark: generalizes the MR programming model
      – Supports in-memory computations with RDDs
      – Job scheduler: pipelining, locality-aware

  18. DISCUSSION https://forms.gle/4JDXfpRuVaXmQHxD8

  19. [whiteboard notes, partly illegible: with a plain reduce/count, every partition sends its partial result to the driver, which performs the final combine (e.g. 1+5+3+6); as the number of partitions D grows, the driver becomes a bottleneck, motivating reduction trees]

  20. When would reduction trees be better than using `reduce` in Spark? When would they not be?
      – Better when: per-partition compute is slow and the data transmitted at each level of the tree is small
      – Worse when: the overheads of the extra stages and task creation/scheduling outweigh the savings
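The reduction-tree idea can be sketched in plain Python (a toy model, not Spark's `treeReduce`; `tree_reduce` is a made-up helper): partition results are combined pairwise in rounds, so no single node has to absorb all D partial results.

```python
# Plain-Python sketch of a reduction tree over per-partition results.
def tree_reduce(parts, combine):
    vals = list(parts)
    while len(vals) > 1:                   # each round roughly halves the list
        nxt = []
        for i in range(0, len(vals) - 1, 2):
            nxt.append(combine(vals[i], vals[i + 1]))
        if len(vals) % 2 == 1:             # odd element carries to next round
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

partial_sums = [15, 40, 65, 90]            # per-partition partial results
print(tree_reduce(partial_sums, lambda a, b: a + b))  # -> 210
```

With D partitions this takes about log2(D) rounds, and each combiner handles only two inputs per round, versus one driver handling all D inputs at once with a flat reduce.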

  21. [whiteboard notes, partly illegible: debugging a "Device is full" error — a worker's local disk fills up (e.g. with shuffle data) even when HDFS has space; a simple tool such as dstat can monitor CPU, network, and disk usage per node; the job name/info helps identify the offending job]

  22. NEXT STEPS
      – Review form
      – Next week: Resource Management (Mesos, YARN, DRF)
      – Assignment 1 is due soon!
      [discussion notes: Spark beats MR when a job makes multiple passes over data that fits in memory, and vice versa; the tradeoff depends on disk vs. memory speed and the frequency of failures]
