Spark RDD Operations: Transformations and Actions
RDD Processing Model
• RDD can be modeled using the Bulk Synchronous Parallel (BSP) model
[Figure: BSP model — processors 1..n each run independent local processing steps, separated by rounds of communication and a synchronization barrier]
RDD BSP
• In Spark RDD, you can generally think of these two rules:
▪ Narrow dependency ➔ Local processing
▪ Wide dependency ➔ Network communication
Local Processing in RDD
[Figure: local processing transforms RDD<T> into RDD<U>]
• A simple abstraction for local processing
• Based on functional programming
• LocalProcessing(input: Iterator<T>, output: Writer<U>) {
    … // output.write(U)
  }
Functional Programming
• RDD is a functional programming paradigm
• Which of these are functions?
[Figure: four mappings A, B, C, and D between an input set and an output set]
Function Limitations
• For one input, the function should return one output
• The function should be memoryless
▪ Should not remember past input
• The function should be stateless
▪ Should not change any state when called
• It is up to the developer to enforce these properties
Examples
Function1(x) {
  return x + 5;
}

Int sum;
Function2(x) {
  sum += x;
  return sum;
}

RNG random;
Function3(x) {
  return random.randomInt(0, x);
}

Map<String, Int> lookuptable;
Function4(x) {
  return lookuptable.get(x);
}
Network Communication
• a.k.a. shuffle operation
• Given a record 𝑠 and 𝑜 partitions:
▪ Assign the record to one of the partitions [0, 𝑜 − 1]
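• A minimal Scala sketch of hash-based assignment (illustrative, and similar in spirit to Spark's HashPartitioner; not the deck's definition):

  // Map a record's key to a partition in [0, numPartitions - 1];
  // the adjustment avoids negative results from hashCode.
  def assignPartition(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }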
RDD Operations
• Spark is rich with operations
• Often, you can implement the same logic in more than one way
• In the following part, we will explain how different RDD operations work
• The goal is to understand the performance implications of these operations and choose the most efficient one
RDD<T>#filter(func)
• func: T → Boolean
• Applies the predicate function on each record and produces that record only if the predicate returns true
• Result: RDD<T> with the same or fewer records than the input
• Local Processing {
    for-each (t in input) {
      if (func(t))
        output.write(t)
    }
  }
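• A short usage sketch in Scala (assuming a SparkContext named sc is available):

  // Keep only even numbers; the result has the same or fewer records.
  val numbers = sc.parallelize(1 to 10)
  val evens = numbers.filter(x => x % 2 == 0)
  evens.collect() // Array(2, 4, 6, 8, 10)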
RDD<T>#map(func)
• func: T → U
• Applies the map function to each record in the input to produce one output record
• Result: RDD<U> with the same number of records as the input
• Local Processing {
    for-each (t in input)
      output.write(func(t))
  }
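• A short usage sketch in Scala (assuming sc is a SparkContext):

  // Square each number; one output record per input record.
  val numbers = sc.parallelize(1 to 5)
  val squares = numbers.map(x => x * x)
  squares.collect() // Array(1, 4, 9, 16, 25)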
RDD<T>#flatMap(func)
• func: T → Iterator<V>
• Applies the map function to each record and adds all resulting values to the output RDD
• Result: RDD<V>
• This is the closest function to the Hadoop map function
• Local Processing {
    for-each (t in input) {
      Iterator<V> results = func(t);
      for (V result : results)
        output.write(result)
    }
  }
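• A short usage sketch in Scala (assuming sc is a SparkContext):

  // Split lines into words; each input record can yield zero or more outputs.
  val lines = sc.parallelize(Seq("hello world", "spark rdd"))
  val words = lines.flatMap(line => line.split(" "))
  words.collect() // Array(hello, world, spark, rdd)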
RDD<T>#mapPartitions(func)
• func: Iterator<T> → Iterator<U>
• Applies the map function to the list of records in one partition of the input and adds all resulting values to the output RDD
• Can be helpful in two situations
▪ If there is a costly initialization step in the function
▪ If many records can result in one record
• Result: RDD<U>
RDD<T>#mapPartitions(func)
• Local Processing {
    results = func(input)
    for-each (v in results)
      output.write(v);
  }
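• A short Scala sketch of the costly-initialization use case (DecimalFormat stands in for any expensive setup step):

  // The parser is constructed once per partition instead of once per record.
  val lines = sc.parallelize(Seq("1", "2", "3"), numSlices = 2)
  val parsed = lines.mapPartitions { iter =>
    val parser = new java.text.DecimalFormat("#") // costly setup, done once
    iter.map(line => parser.parse(line).intValue())
  }
  parsed.collect() // Array(1, 2, 3)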
RDD<T>#mapPartitionsWithIndex(func)
• func: (Integer, Iterator<T>) → Iterator<U>
• Similar to mapPartitions but provides a unique index for each partition
• To achieve this in Spark, the partition ID is passed to the function
RDD<T>#sample(r, f, s)
• r: Boolean: with replacement (true/false)
• f: Float: fraction [0,1]
• s: Long: seed for random number generation
• Returns RDD<T> with a sample of the records in the input RDD
• Can be implemented using mapPartitionsWithIndex as follows (see the sketch below)
▪ Initialize the random number generator based on the seed and partition index
▪ Select a subset of records as desired
▪ Return the sampled records
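• A minimal Scala sketch of that idea for sampling without replacement (not Spark's actual implementation, which uses specialized samplers):

  import org.apache.spark.rdd.RDD
  import scala.reflect.ClassTag
  import scala.util.Random

  // Each partition gets its own RNG seeded by (seed, partition index),
  // so the sample is deterministic for a fixed seed.
  def sampleRdd[T: ClassTag](rdd: RDD[T], fraction: Double, seed: Long): RDD[T] =
    rdd.mapPartitionsWithIndex { (index, iter) =>
      val rng = new Random(seed + index)
      iter.filter(_ => rng.nextDouble() < fraction)
    }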
RDD<T>#reduce(func)
• func: (T, T) → T
• Reduces all the records to a single value by repeatedly applying the given function
• The function should be associative and commutative
• Result: T
• This is an action
RDD<T>#reduce(func)
• mapPartitions {
    T result = input.next
    for-each (r in input)
      result = func(result, r)
    return result
  }
• Shuffle: assign all partial results to one partition
• Collect the partial results and apply the same function again
RDD<T>#reduce(func)
[Figure: each partition reduces its records locally with f, the partial results are transferred over the network to the driver machine, and f is applied again to produce the final result]
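• A short usage sketch in Scala (assuming sc is a SparkContext):

  // Addition is associative and commutative, so it is safe for reduce.
  val numbers = sc.parallelize(1 to 100)
  val total = numbers.reduce((a, b) => a + b) // 5050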
RDD<K,V>#reduceByKey(func)
• func: (V, V) → V
• Similar to reduce but applies the given function to each group of records with the same key separately
• Since there could be many groups, this operation is a transformation that can be followed by further transformations and actions
• Result: RDD<K,V>
• By default, the number of reducers is equal to the number of input partitions, but it can be overridden
RDD<K,V>#reduceByKey(func)
• mapPartitions {
    Map<K,V> results;
    for-each ((k,v) in input) {
      if (results.contains(k))
        results[k] = func(results[k], v);
      else
        results[k] = v;
    }
  }
• Shuffle by key: assign (k,v) to partition hash(k) mod n
• mapPartitions {
    // All input records have the same key
    V result = values.next
    for-each (v in values)
      result = func(result, v)
    output.write(k, result)
  }
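• A short usage sketch in Scala, the classic word count (assuming sc is a SparkContext):

  // Partial sums are combined per partition before the shuffle,
  // which is why reduceByKey is usually preferred over groupByKey.
  val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
  val counts = words.map(w => (w, 1)).reduceByKey((x, y) => x + y)
  counts.collect() // Array((a,3), (b,2), (c,1)) in some order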
RDD<T>#distinct()
• Removes duplicate values in the input RDD
• Returns RDD<T>
• Implemented as follows:
  map(x => (x, null))
    .reduceByKey((x, y) => x, numPartitions)
    .map(_._1)
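• A short usage sketch in Scala:

  // Duplicates collapse because reduceByKey keeps one value per key.
  val data = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
  data.distinct().collect() // Array(1, 2, 3) in some order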
Limitation of reduce methods
• Both reduce methods have a limitation: they have to return a value of the same type as the input
• Let us say we want to implement a program that operates on an RDD<Integer> and returns one of the following values
▪ 0: Input is empty
▪ 1: Input contains only odd values
▪ 2: Input contains only even values
▪ 3: Input contains a mix of even and odd values
RDD<T>#aggregate(zero, seqOp, combOp)
• zero: U – Zero value of type U
• seqOp: (U, T) → U – Combines the aggregate value with an input value
• combOp: (U, U) → U – Combines two aggregate values
• Like reduce, aggregate is an action
• Returns U
• Similarly, aggregateByKey is a transformation that takes RDD<K,V> and returns RDD<K,U>
RDD<T>#aggregate(zero, seqOp, combOp)
• mapPartitions {
    U partialResult = zero
    for-each (t in input)
      partialResult = seqOp(partialResult, t)
    return partialResult
  }
• Collect all partial results into one partition
• mapPartitions {
    U finalResult = input.next
    for-each (u in input)
      finalResult = combOp(finalResult, u)
    return finalResult
  }
RDD<T>#aggregate(zero, seqOp, combOp)
• Example: RDD<Integer> values
  Byte marker = values.aggregate(
    (Byte) 0,
    (result: Byte, x: Integer) => {
      if (x % 2 == 0) // Even
        return result | 2;
      else
        return result | 1;
    },
    (result1: Byte, result2: Byte) => result1 | result2
  );
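• A runnable Scala equivalent of that sketch (bit 1 marks odd values, bit 2 marks even values, matching the 0–3 encoding above):

  // zero = 0 (empty), seqOp sets a bit per value, combOp ORs partial markers.
  val values = sc.parallelize(Seq(1, 3, 4))
  val marker = values.aggregate(0)(
    (result, x) => result | (if (x % 2 == 0) 2 else 1),
    (r1, r2) => r1 | r2
  ) // 3: a mix of even and odd values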
RDD<T>#aggregate(zero, seqOp, combOp)
[Figure: each partition starts from the zero value z and applies seqOp (s) to its records locally; the partial results are transferred over the network to the driver machine, where combOp (c) merges them into the final result]
RDD<K,V>#groupByKey()
• Groups all values with the same key into the same partition
• Closest to the shuffle operation in Hadoop
• Returns RDD<K, Iterable<V>>
• Performance notice: by default, all values of a group are kept in memory, so this method can be very memory consuming
• Unlike the reduce and aggregate methods, this method does not run a combiner step, i.e., all records get shuffled over the network
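• A short Scala sketch contrasting groupByKey with reduceByKey for a simple count (reduceByKey is usually preferable because it combines before the shuffle):

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
  // All values cross the network and are materialized per key:
  val grouped = pairs.groupByKey().mapValues(_.sum)
  // Partial sums are combined locally before the shuffle:
  val reduced = pairs.reduceByKey(_ + _)
  // Both yield Array((a,2), (b,1)) in some order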
Further Readings
• List of common transformations and actions
▪ http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
• Spark RDD Scala API
▪ http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD