Apache Spark: A Unified Engine for Big Data Processing
Presented by: Huanyi Chen
Apache Spark: A Unified Engine for Big Data Processing
§ Engine?
§ Unified?
Apache Spark: A Unified Engine for Big Data Processing
§ Engine?
  § Converts one form of data into other, more useful forms
§ Unified?
  § Supports multiple types of conversions
Apache Spark: A Unified Engine for Big Data Processing
§ What is Apache Spark? (Engine)
§ How can it make multiple types of conversions over big data? (Unified)
What is Apache Spark?
§ A framework like MapReduce
§ Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs)
[Figure slides: sharing data between processing steps through disk I/O vs. keeping it in memory with RDDs]
Resilient Distributed Datasets (RDDs)
An RDD is a read-only, partitioned collection of records
§ Transformations
  § create RDDs (map, filter, join, etc.)
§ Actions
  § return a value to the application
  § or export data to a storage system
§ Persistence
  § Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage)
§ Partitioning
  § Users can ask that an RDD's elements be partitioned across machines based on a key in each record
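A minimal Scala sketch (not from the slides) of the four facilities above, using a hypothetical local SparkContext and hypothetical file paths:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical local setup; file names and the tab-separated key are illustrative only.
val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

val lines  = sc.textFile("logs.txt")                        // base RDD loaded from storage
val errors = lines.filter(_.contains("ERROR"))              // transformation: creates a new RDD
errors.persist()                                            // persistence: keep this RDD in memory for reuse

val pairs = errors.map(line => (line.split("\t")(0), 1))    // transformation to key/value records
val byKey = pairs.partitionBy(new HashPartitioner(8))       // partitioning: spread records across machines by key

println(byKey.count())                                      // action: returns a value to the application
byKey.saveAsTextFile("error-records")                       // action: exports data to a storage system
```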
Resilient Distributed Datasets (RDDs)
§ Lineage
  § An RDD carries enough information about how it was derived from other datasets to recompute its partitions after a failure
§ Narrow dependencies
  § each partition of the parent RDD is used by at most one partition of the child RDD
§ Wide dependencies
  § a partition of the parent RDD may be used by multiple child partitions
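A minimal sketch of the two dependency types, reusing the hypothetical `sc` and file paths from the sketch above; `reduceByKey` introduces the wide dependency and hence a shuffle:

```scala
// Narrow vs. wide dependencies in a small word-count lineage.
val words  = sc.textFile("docs.txt").flatMap(_.split(" "))  // narrow: each child partition reads one parent partition
val ones   = words.map(word => (word, 1))                   // narrow
val counts = ones.reduceByKey(_ + _)                        // wide: child partitions read from many parent partitions (shuffle)

counts.saveAsTextFile("word-counts")                        // action: triggers the job; the scheduler cuts
                                                            // the lineage graph into stages at wide dependencies
```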
Resilient Distributed Datasets (RDDs)
§ Example: run an action on RDD G
MapReduce vs. Spark
[Figure: MapReduce ecosystem vs. Spark ecosystem]
Higher-Level Libraries
SQL and DataFrames
SQL and DataFrames
§ DataFrames = RDDs + Schema = Tables
§ Spark SQL's DataFrame API supports inline definition of user-defined functions (UDFs), without the complicated packaging and registration process found in other database systems.
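A minimal sketch of "DataFrames = RDDs + Schema", using a hypothetical local SparkSession and toy records:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local session and toy data.
val spark = SparkSession.builder().appName("df-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// An RDD of records plus a schema (column names and types) gives a DataFrame,
// which behaves like a relational table.
val rdd = spark.sparkContext.parallelize(Seq(("Ann", "Sales", 34), ("Bo", "Engineering", 29)))
val df  = rdd.toDF("name", "dept", "age")

df.where($"age" > 30).groupBy($"dept").count().show()
```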
UDF in MySQL
UDF in Spark SQL
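A minimal sketch of an inline Spark SQL UDF, reusing the hypothetical `spark` session and `df` DataFrame from the sketch above:

```scala
import org.apache.spark.sql.functions.{col, udf}

// Inline UDF: an ordinary Scala function wrapped with udf(), with no separate
// compilation, packaging, or server-side installation step.
val squared = udf((x: Int) => x * x)
df.select(col("name"), squared(col("age")).alias("age_squared")).show()

// The same function can also be registered for use inside SQL text.
spark.udf.register("squared", (x: Int) => x * x)
df.createOrReplaceTempView("employees")
spark.sql("SELECT name, squared(age) FROM employees").show()
```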
Spark Streaming
Spark Streaming
§ Discretized stream processing model
§ Continuous operator processing model
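A minimal sketch of the discretized-stream model: the input is cut into small batches (here one second) and each batch is processed as an ordinary RDD computation. The socket source and port are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: one thread for the receiver, one for processing.
val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(1))           // 1-second micro-batches

val lines  = ssc.socketTextStream("localhost", 9999)        // hypothetical text source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                              // each batch runs as an RDD job

ssc.start()
ssc.awaitTermination()
```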
GraphX
GraphX
§ Cannot beat specialized graph-parallel systems on graph computation alone
§ But outperforms them on end-to-end graph analytics pipelines
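A minimal GraphX sketch of such a pipeline, reusing the hypothetical `sc` from the earlier sketches; the edge-list file is hypothetical:

```scala
import org.apache.spark.graphx.GraphLoader

// Loading (ETL), the graph computation, and post-processing all run in one engine,
// which is where the end-to-end pipeline advantage comes from.
val graph = GraphLoader.edgeListFile(sc, "followers.txt")   // build the graph from an edge list
val ranks = graph.pageRank(0.0001).vertices                 // RDD of (vertexId, rank)
ranks.sortBy(_._2, ascending = false).take(10).foreach(println)
```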
MLlib
MLlib
§ More than 50 common algorithms for distributed model training
§ Supports pipeline construction on Spark
§ Integrates well with other Spark libraries
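A minimal spark.ml pipeline sketch, reusing the hypothetical `spark` session; the toy training DataFrame stands in for data that could come from Spark SQL or Spark Streaming upstream:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Toy training data with "text" and "label" columns.
val training = spark.createDataFrame(Seq(
  (0L, "spark keeps data in memory", 1.0),
  (1L, "mapreduce writes to disk between steps", 0.0)
)).toDF("id", "text", "label")

// Feature extraction and the model chained into one pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model    = pipeline.fit(training)                       // distributed model training
```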
Why use Apache Spark?
§ Ecosystem
§ Competitive performance
§ Low cost of sharing data between steps
§ Low latency of MapReduce-style steps
§ Control over bottleneck resources
Apache Spark in 2016
§ Apache Spark applications range from finance to scientific data processing and combine libraries for SQL, machine learning, and graphs.
§ Between 2010 and 2016, Apache Spark grew to 1,000 contributors and thousands of deployments.
Apache Spark Today
Apache Spark: A Unified Engine for Big Data Processing
§ What is Apache Spark?
  § Apache Spark = MapReduce + RDDs
§ How can it make multiple types of conversions over big data?
  § Higher-level libraries enable Apache Spark to handle different types of big data workloads
“Try Apache Spark if you are new to the big data processing world.”
Huanyi Chen
Q&A
§ What issues will be caused by persisting data in memory? For example, garbage collection?
§ What are the Parallel Random Access Machine model and the Bulk Synchronous Parallel model? Can these two models capture any computation in the distributed world?
§ Will optimizing one library cause other libraries to lose performance?
§ Is using memory as storage really the next generation of storage?