Spark Stony Brook University CSE545, Spring 2019
Situations where MapReduce is not efficient
● Long pipelines sharing data
● Interactive applications
● Streaming applications
● Iterative algorithms (optimization problems)
DFS → Map → LocalFS → Network → Reduce → DFS → Map → ...
(Anytime MapReduce would need to write to and read from disk a lot.)
Spark’s Big Idea
Resilient Distributed Datasets (RDDs) -- a read-only, partitioned collection of records (like a DFS), but with a record of how the dataset was created as a combination of transformations from other dataset(s).
dfs://filename → RDD1 (DATA; created from dfs://filename) → transformation1() → RDD2 (DATA; transformation1 from RDD1) → transformation2() → RDD3 (can drop the data; transformation2 from RDD2)
Spark’s Big Idea
Resilient Distributed Datasets (RDDs) -- a read-only, partitioned collection of records (like a DFS), but with a record of how the dataset was created as a combination of transformations from other dataset(s).
● Enables rebuilding datasets on the fly.
● Intermediate datasets are not stored on disk (and kept in memory only if needed and there is enough space).
=> Faster communication and I/O.
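The lineage idea on this slide can be sketched in plain Python. This is a toy model, not Spark's API: each "RDD" records the function and parent that created it, so its data can be dropped and rebuilt on demand.

```python
# Toy sketch (NOT Spark's API) of RDD lineage: each dataset remembers
# how it was made, so its data can be dropped and recomputed on demand.
class ToyRDD:
    def __init__(self, compute, parent=None):
        self.compute = compute   # function: parent's records -> this RDD's records
        self.parent = parent     # lineage: the RDD this one was derived from
        self.data = None         # materialized records, or None if dropped

    def materialize(self):
        if self.data is None:    # rebuild from lineage if data was dropped
            parent_data = self.parent.materialize() if self.parent else None
            self.data = self.compute(parent_data)
        return self.data

# "Create" an RDD from stable storage, then a transformation from it.
rdd1 = ToyRDD(lambda _: ["a", "bb", "ccc"])
rdd2 = ToyRDD(lambda recs: [r.upper() for r in recs], parent=rdd1)

print(rdd2.materialize())  # ['A', 'BB', 'CCC']
rdd2.data = None           # drop the intermediate data ...
print(rdd2.materialize())  # ... and it is recreated from lineage
```

The point is that only the small lineage record must be kept reliably; the (large) data itself is disposable.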
The Big Idea
Resilient Distributed Datasets (RDDs) -- a read-only, partitioned collection of records (like a DFS), but with a record of how the dataset was created as a combination of transformations from other dataset(s).
An RDD is created either from “stable storage” or from other RDDs, via transformations: map, filter, join, ...
Spark’s Big Idea
dfs://filename → RDD1 (created from dfs://filename) → transformation1() → RDD2 (data dropped; transformation1 from RDD1) → transformation2() → RDD3 (DATA; transformation2 from RDD2)
Spark’s Big Idea
dfs://filename → RDD1 (created from dfs://filename) → transformation1() → RDD2 (will recreate data; transformation1 from RDD1) → transformation2() → RDD3
RDD2 → transformation3() → RDD4 (DATA; transformation3 from RDD2)
Original Transformations: RDD to RDD
Resilient Distributed Datasets (RDDs) -- a read-only, partitioned collection of records (like a DFS), but with a record of how the dataset was created as a combination of transformations from other dataset(s).
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.” NSDI 2012, April 2012.
Original Transformations: RDD to RDD
Original Actions: RDD to Value, Object, or Storage
Current Transformations and Actions
http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
Common transformations: filter, map, flatMap, reduceByKey, groupByKey
http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions
Common actions: collect, count, take
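The transformation/action split can be sketched with plain Python generators (again, a toy model, not Spark): transformations are lazy and describe work, while an action is what actually forces computation.

```python
# Sketch of lazy transformations vs. an eager action, in plain Python:
# generator expressions stand in for Spark's lazy filter/map, and
# list() stands in for an action like collect().
data = ["ERROR disk full", "INFO ok", "ERROR timeout"]

filtered = (line for line in data if line.startswith("ERROR"))  # lazy "filter"
mapped = (line.split()[1] for line in filtered)                 # lazy "map"

result = list(mapped)   # the "action": only now does any work happen
print(result)           # ['disk', 'timeout']
```

Until `list(...)` runs, nothing has been scanned; this mirrors how Spark builds up a lineage of transformations and executes it only when an action is called.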
An Example
Count errors in a log file (records: TYPE, MESSAGE, TIME):
lines → filter(_.startsWith(“ERROR”)) → errors → count()
An Example
Count errors in a log file (records: TYPE, MESSAGE, TIME):
lines → filter(_.startsWith(“ERROR”)) → errors → count()
Pseudocode:
lines = sc.textFile(“dfs:...”)
errors = lines.filter(_.startsWith(“ERROR”))
errors.count()
An Example
Collect times of HDFS-related errors:
lines → filter(_.startsWith(“ERROR”)) → errors
Pseudocode:
lines = sc.textFile(“dfs:...”)
errors = lines.filter(_.startsWith(“ERROR”))
errors.persist()
errors.count()
...
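Why persist `errors` before calling count? A plain-Python sketch of the idea (memoization standing in for Spark's in-memory persistence): the persisted dataset is computed once, and later actions reuse it instead of re-running the filter over the whole input.

```python
# Toy sketch of persist(): memoize the filtered dataset so repeated
# "actions" reuse it rather than recomputing from the source.
compute_count = 0   # how many times the filter actually ran

def compute_errors(lines):
    global compute_count
    compute_count += 1
    return [l for l in lines if l.startswith("ERROR")]

lines = ["ERROR a", "INFO b", "ERROR c"]
persisted = None    # the "persisted" copy; empty until first use

def errors():
    global persisted
    if persisted is None:        # compute once, then serve from "memory"
        persisted = compute_errors(lines)
    return persisted

count = len(errors())   # first action: computes the filter and persists it
first = errors()[0]     # second action: reuses the persisted data
print(count, first, compute_count)   # 2 ERROR a 1
```

Without persistence, every action on `errors` would trigger the filter (and, in Spark, a rescan of the input) again.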
An Example
Collect times of HDFS-related errors:
lines → filter(_.startsWith(“ERROR”)) → errors → filter(_.contains(“HDFS”)) → HDFS errors → map(_.split(‘\t’)(3)) → time fields → collect()
Pseudocode:
lines = sc.textFile(“dfs:...”)
errors = lines.filter(_.startsWith(“ERROR”))
errors.persist()
errors.count()
...
Persistence
● Can specify that an RDD “persists” in memory so other queries can use it.
● Can specify a priority for persistence; lower priority => moves to disk, if needed, earlier.
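The full chain above, sketched in plain Python (the slide's pseudocode is Scala-flavored Spark). The tab-separated field layout, with the time in column 3, is assumed from the slide's `split('\t')(3)`; the sample records are made up for illustration.

```python
# Plain-Python sketch of the slide's pipeline: filter ERROR lines,
# keep those mentioning HDFS, pull out the time field, then "collect".
# Assumed record layout: TYPE \t MESSAGE \t ... \t TIME (time in column 3).
log = [
    "ERROR\tHDFS write failed\tnode7\t12:01",
    "INFO\tjob started\tnode2\t12:02",
    "ERROR\ttimeout\tnode3\t12:03",
    "ERROR\tHDFS read failed\tnode1\t12:04",
]

errors = [l for l in log if l.startswith("ERROR")]     # filter(_.startsWith("ERROR"))
hdfs_errors = [l for l in errors if "HDFS" in l]       # filter(_.contains("HDFS"))
time_fields = [l.split("\t")[3] for l in hdfs_errors]  # map(_.split('\t')(3))
print(time_fields)                                     # ['12:01', '12:04'] -- the "collect"
```

In Spark, each list comprehension would be a lazy transformation on an RDD, and only `collect()` would pull the final time fields back to the driver.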