  1. Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica Apache Spark: Hands-on Session A.A. 2017/18 Matteo Nardelli Laurea Magistrale in Ingegneria Informatica - II anno

  2. The reference Big Data stack. Layers: High-level Interfaces, Support / Integration, Data Processing, Data Storage, Resource Management

  3. Main reference for this lecture: H. Karau, A. Konwinski, P. Wendell, M. Zaharia, "Learning Spark", O'Reilly Media, 2015.

  4. Java 8: Lambda Expressions • You are usually trying to pass functionality as an argument to another method – e.g., what action should be taken when someone clicks a button • Lambda expressions enable you to treat functionality as a method argument, or code as data

  5. Java 8: Lambda Expressions Example: a social networking application. • You want to create a feature that enables an administrator to perform any kind of action, such as sending a message, on members of the social networking application that satisfy certain criteria • Suppose that members of this social networking application are represented by the following Person class:

public class Person {
    public enum Sex { MALE, FEMALE }
    String name;
    LocalDate birthday;
    Sex gender;
    String emailAddress;
    public int getAge() { ... }
    public void printPerson() { ... }
}

  6. Java 8: Lambda Expressions • Suppose that the members of your social networking application are stored in a List instance. Approach 1: Create Methods That Search for Members That Match One Characteristic

public static void invitePersons(List<Person> roster, int age) {
    for (Person p : roster) {
        if (p.getAge() >= age) {
            p.sendMessage();
        }
    }
}

  7. Java 8: Lambda Expressions Approach 2: Specify Search Criteria Code in a Local Class

public static void invitePersons(List<Person> roster, CheckPerson tester) {
    for (Person p : roster) {
        if (tester.test(p)) {
            p.sendMessage();
        }
    }
}

interface CheckPerson {
    boolean test(Person p);
}

class CheckEligiblePerson implements CheckPerson {
    public boolean test(Person p) {
        return p.getAge() >= 18 && p.getAge() <= 25;
    }
}

  8. Java 8: Lambda Expressions Approach 3: Specify Search Criteria Code in an Anonymous Class

invitePersons(
    roster,
    new CheckPerson() {
        public boolean test(Person p) {
            return p.getAge() >= 18 && p.getAge() <= 25;
        }
    }
);

  9. Java 8: Lambda Expressions Approach 4: Specify Search Criteria Code with a Lambda Expression

invitePersons(
    roster,
    (Person p) -> p.getAge() >= 18 && p.getAge() <= 25
);
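To tie the four approaches together, here is a minimal, self-contained sketch of Approach 4; the Person fields, the sendMessage() stub, and the sample roster are hypothetical fillers, not part of the original slides:

import java.time.LocalDate;
import java.time.Period;
import java.util.Arrays;
import java.util.List;

public class LambdaDemo {

    // Minimal Person class modeled on the slides; sendMessage() is a stub.
    static class Person {
        String name;
        LocalDate birthday;
        Person(String name, LocalDate birthday) { this.name = name; this.birthday = birthday; }
        int getAge() { return Period.between(birthday, LocalDate.now()).getYears(); }
        void sendMessage() { System.out.println("Invitation sent to " + name); }
    }

    // Functional interface from Approach 2: a lambda can stand in for it.
    interface CheckPerson {
        boolean test(Person p);
    }

    static void invitePersons(List<Person> roster, CheckPerson tester) {
        for (Person p : roster) {
            if (tester.test(p)) {
                p.sendMessage();
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data
        List<Person> roster = Arrays.asList(
                new Person("Alice", LocalDate.of(1999, 5, 10)),
                new Person("Bob", LocalDate.of(1980, 1, 2)));

        // Approach 4: the search criterion is passed as a lambda expression
        invitePersons(roster, p -> p.getAge() >= 18 && p.getAge() <= 25);
    }
}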

  10. Apache Spark

  11. Spark Cluster • Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in a Spark program (called the driver program). Cluster Manager Types • Standalone: a simple cluster manager included with Spark • Apache Mesos • Hadoop YARN

  12. Spark Cluster • You can start a standalone master server by executing (on the master node):

$ $SPARK_HOME/sbin/start-master.sh

• Similarly, you can start one or more workers and connect them to the master via (on the slave nodes):

$ $SPARK_HOME/sbin/start-slave.sh <master-spark-URL>

• It is also possible to start slaves from the master node:

# Starts a slave instance on each machine specified
# in the conf/slaves file on the master node
$ $SPARK_HOME/sbin/start-slaves.sh

• Spark has a WebUI reachable at http://localhost:8080
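For reference, the conf/slaves file is simply a list of worker hostnames, one per line, and the master URL passed to start-slave.sh has the form spark://<master-host>:7077 by default; the hostnames below are placeholders:

# $SPARK_HOME/conf/slaves (example content, hostnames are placeholders)
worker01.example.local
worker02.example.local

# On a worker node, connect to the master started above:
$ $SPARK_HOME/sbin/start-slave.sh spark://master.example.local:7077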

  13. Spark Cluster • You can stop the master server by executing (on the master node):

$ $SPARK_HOME/sbin/stop-master.sh

• Similarly, you can stop a worker via (on the slave nodes):

$ $SPARK_HOME/sbin/stop-slave.sh

• It is also possible to stop slaves from the master node:

# Stops the slave instance on each machine specified
# in the conf/slaves file on the master node
$ $SPARK_HOME/sbin/stop-slaves.sh

  14. Spark: Launching Applications

$ ./bin/spark-submit \
    --class <main-class> \
    --master <master-url> \
    [--conf <key>=<value>] \
    <application-jar> \
    [application-arguments]

--class: the entry point for your application (e.g., package.WordCount)
--master: the master URL for the cluster, e.g., "local", "spark://HOST:PORT", "mesos://HOST:PORT"
--conf: an arbitrary Spark configuration property
application-jar: path to a bundled jar including your application and all dependencies
application-arguments: arguments passed to the main method of your main class, if any
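As a concrete illustration, a submission to a standalone cluster might look like this; the class name, host names, jar, and input path are hypothetical:

$ ./bin/spark-submit \
    --class it.uniroma2.sabd.WordCount \
    --master spark://master.example.local:7077 \
    --conf spark.executor.memory=2g \
    target/wordcount-1.0.jar \
    hdfs://master.example.local:9000/input/books.txt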

  15. Spark programming model • The Spark programming model is based on parallelizable operators • Parallelizable operators are higher-order functions that execute user-defined functions in parallel • A data flow is composed of any number of data sources, operators, and data sinks connected through their inputs and outputs • The job description is based on a DAG (directed acyclic graph)

  16. Resilient Distributed Dataset (RDD) • Spark programs are written in terms of operations on RDDs • RDDs are built and manipulated through: – Coarse-grained transformations • map, filter, join, … – Actions • count, collect, save, …

  17. Resilient Distributed Dataset (RDD) • The primary abstraction in Spark: a distributed memory abstraction • Immutable, partitioned collection of elements – Like a LinkedList<MyObjects> – Operated on in parallel – Cached in memory across the cluster nodes • Each node of the cluster that is used to run an application contains at least one partition of the RDD(s) that is (are) defined in the application

  18. Resilient Distributed Dataset (RDD) • Stored in the main memory of the executors running in the worker nodes (when possible) or on the node's local disk (if there is not enough main memory) – Can persist in memory, on disk, or both • Allow the code invoked on them to be executed in parallel – Each executor of a worker node runs the specified code on its partition of the RDD – A partition is an atomic piece of information – Partitions of an RDD can be stored on different cluster nodes

  19. Resilient Distributed Dataset (RDD) • Immutable once constructed – i.e., the RDD content cannot be modified • Automatically rebuilt on failure (but no replication) – Track lineage information to efficiently recompute lost data – For each RDD, Spark knows how it has been constructed and can rebuild it if a failure occurs – This information is represented by means of a DAG connecting input data and RDDs • Interface – Clean language-integrated API for Scala, Python, Java, and R – Can be used interactively from the Scala console

  20. Resilient Distributed Dataset (RDD) • Applications suitable for RDDs – Batch applications that apply the same operation to all elements of a dataset • Applications not suitable for RDDs – Applications that make asynchronous fine-grained updates to shared state, e.g., the storage system for a web application

  21. Spark and RDDs • Spark manages scheduling and synchronization of the jobs • Manages the splitting of RDDs into partitions and allocates RDDs’ partitions to the nodes of the cluster • Hides the complexities of fault tolerance and slow machines • RDDs are automatically rebuilt in case of machine failure

  22. Spark and RDDs

  23. How to create RDDs • RDDs can be created by: – Parallelizing existing collections of the hosting programming language (e.g., collections and lists of Scala, Java, Python, or R) • Number of partitions specified by the user • API: parallelize – From (large) files stored in HDFS or any other file system • One partition per HDFS block • API: textFile – By transforming an existing RDD • Number of partitions depends on the transformation type • API: transformation operations (map, filter, flatMap)

  24. How to create RDDs • parallelize: turn a collection into an RDD • textFile: load a text file from the local file system, HDFS, or S3
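A minimal sketch of the two creation APIs using the Java API follows; the application name, the input path, and the number of partitions are arbitrary placeholders:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CreateRDDs {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CreateRDDs");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // parallelize: turn an existing collection into an RDD,
        // here explicitly split into 4 partitions
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> numbers = sc.parallelize(data, 4);

        // textFile: load a text file from the local file system, HDFS, or S3
        // (the path below is just a placeholder)
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

        System.out.println("numbers: " + numbers.count() + ", lines: " + lines.count());
        sc.close();
    }
}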

  25. Operations over RDD Transformations • Create a new dataset from an existing one • Lazy in nature: they are executed only when some action is performed • Example: map(), filter(), distinct() Actions • Return a value to the driver program or export data to a storage system after performing a computation • Example: count(), reduce(), collect() Persistence • For caching datasets in memory for future operations; option to store on disk, in RAM, or mixed • Functions: persist(), cache()
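A short sketch of the three kinds of operations with the Java API; it assumes a JavaSparkContext sc like the one created in the previous example, and the data values are arbitrary:

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class RddOperations {
    public static void demo(JavaSparkContext sc) {
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 2, 3, 4, 5, 6));

        // Transformations: lazy, nothing is computed yet
        JavaRDD<Integer> evens   = numbers.filter(x -> x % 2 == 0);
        JavaRDD<Integer> doubled = evens.map(x -> x * 2);
        JavaRDD<Integer> unique  = doubled.distinct();

        // Persistence: keep the result in memory, spilling to disk if needed
        unique.persist(StorageLevel.MEMORY_AND_DISK());

        // Actions: trigger the actual computation and return values to the driver
        long howMany = unique.count();
        int  sum     = unique.reduce((a, b) -> a + b);
        System.out.println("count=" + howMany + ", sum=" + sum + ", values=" + unique.collect());
    }
}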

  26. Operations over RDD: Transformations

  27. Operations over RDD: Transformations
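As a closing illustration of how transformations and actions chain together, here is a hedged word-count sketch in Java; the class name echoes the package.WordCount entry point mentioned for spark-submit, and the input and output paths are taken from the command-line arguments:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Transformations (lazy): split lines into words, pair each word with 1, sum by key
        JavaRDD<String> lines = sc.textFile(args[0]);
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairRDD<String, Integer> pairs = words.mapToPair(w -> new Tuple2<>(w, 1));
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

        // Action: materializes the whole pipeline and writes the result
        counts.saveAsTextFile(args[1]);
        sc.close();
    }
}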
