Spark: Resilient Distributed Datasets as a Workflow System
H. Andrew Schwartz
CSE545, Spring 2020
Big Data Analytics, The Class
Goal: Generalizations. A model or summarization of the data.
Data Frameworks: Hadoop File System, MapReduce, Spark, TensorFlow
Algorithms and Analyses: Similarity Search, Hypothesis Testing, Graph Analysis, Streaming, Recommendation Systems, Deep Learning
Where is MapReduce Inefficient?
● Long pipelines sharing data
● Interactive applications
● Streaming applications
● Iterative algorithms (optimization problems)
Per-job pipeline: DFS → Map → LocalFS → Network → Reduce → DFS → Map → ...
(Anytime MapReduce would need to write to and read from disk a lot.)
Spark’s Big Idea
Resilient Distributed Datasets (RDDs): a read-only, partitioned collection of records (like a DFS) but with a record of how the dataset was created as a combination of transformations from other dataset(s).

An RDD is created either from “stable storage” (e.g., a file in a DFS) or from other RDDs via transformations (map, filter, join, ...).

Lineage example:
dfs://filename → RDD1 → transformation1() → RDD2 → transformation2() → RDD3
Each RDD records its lineage: RDD1 was created from dfs://filename, RDD2 from transformation1 on RDD1, RDD3 from transformation2 on RDD2. Once RDD3 exists, RDD2 can drop its data; if that data is needed again (say, for RDD4 = transformation3(RDD2)), Spark will recreate it from the recorded lineage.

● Enables rebuilding datasets on the fly.
● Intermediate datasets need not be stored on disk (and are kept in memory only if needed and there is enough space) → faster communication and I/O.
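The lineage idea above can be sketched as a toy in plain Python (this is NOT Spark's API, just an illustration): each dataset remembers the transformation and parent that produced it, so materialized records can be dropped and rebuilt on demand. All names and sample records below are made up.

```python
# Toy illustration of RDD lineage (not Spark): each dataset records how it
# was derived, so dropped data can be recomputed on demand.
class ToyRDD:
    def __init__(self, source_fn, parent=None):
        self.source_fn = source_fn  # transformation (or loader) that produces this dataset
        self.parent = parent        # lineage: the dataset this one was derived from
        self.data = None            # materialized records; may be dropped at any time

    def compute(self):
        # If the data was dropped, recreate it by walking the lineage.
        if self.data is None:
            parent_data = self.parent.compute() if self.parent else None
            self.data = self.source_fn(parent_data)
        return self.data

    def drop(self):
        self.data = None  # safe: lineage lets us rebuild it later

# RDD1: loaded from "stable storage" (here, hypothetical in-memory records)
rdd1 = ToyRDD(lambda _: ["ERROR a", "INFO b", "ERROR c"])
# RDD2 = transformation1(RDD1): keep only ERROR records
rdd2 = ToyRDD(lambda recs: [r for r in recs if r.startswith("ERROR")], parent=rdd1)

rdd2.compute()
rdd2.drop()            # intermediate data discarded
print(rdd2.compute())  # rebuilt from lineage: ['ERROR a', 'ERROR c']
```

The design choice this mirrors: fault tolerance and memory savings come from recording *how* data was made (cheap) rather than replicating the data itself (expensive).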
(original) Transformations: RDD to RDD, operating over multiple records at a time.
(original) Actions: RDD to a value, object, or storage.

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing.” NSDI 2012, April 2012.
Current Transformations and Actions
Transformations: http://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
Common transformations: filter, map, flatMap, reduceByKey, groupByKey
Actions: http://spark.apache.org/docs/latest/rdd-programming-guide.html#actions
Common actions: collect, count, take
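To make the semantics of the common operations concrete, here are plain-Python stand-ins for each (not Spark code; the sample records are made up):

```python
# Plain-Python stand-ins for the common RDD operations (sample data is made up).
lines = ["to be", "or not", "to be"]

# map: exactly one output record per input record
mapped = [line.upper() for line in lines]

# flatMap: each input record may yield zero or more output records
words = [w for line in lines for w in line.split()]  # ['to','be','or','not','to','be']

# filter: keep only records matching a predicate
tos = [w for w in words if w == "to"]

# reduceByKey: combine all values that share a key (here: count each word)
pairs = [(w, 1) for w in words]
counts = {}
for k, v in pairs:
    counts[k] = counts.get(k, 0) + v

# actions return plain values: collect -> list, count -> int, take(n) -> first n
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark these are lazy: transformations only build lineage, and nothing runs until an action is called.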
Example: count errors in a log file (records: TYPE, MESSAGE, TIME)

lines → filter(_.startsWith(“ERROR”)) → errors → count()

Pseudocode:
lines = sc.textFile(“dfs:...”)
errors = lines.filter(_.startsWith(“ERROR”))
errors.count()
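The same pipeline, as a plain-Python stand-in for intuition (the log lines below are invented sample data):

```python
# Plain-Python stand-in for the error-count pipeline (sample log lines are made up).
lines = [
    "ERROR\tdisk failure\t12:01",
    "INFO\tjob started\t12:02",
    "ERROR\tHDFS timeout\t12:05",
]
errors = [l for l in lines if l.startswith("ERROR")]  # the filter() step
print(len(errors))  # the count() action -> 2
```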
Example 2: collect times of HDFS-related errors (records: TYPE, MESSAGE, TIME)

Pseudocode:
lines = sc.textFile(“dfs:...”)
errors = lines.filter(_.startsWith(“ERROR”))
errors.persist()
errors.count()
...
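One plausible continuation of the elided steps, again as a plain-Python stand-in (the tab-separated TYPE/MESSAGE/TIME layout and the sample lines are assumptions, not from the slide):

```python
# Plain-Python sketch of Example 2. ASSUMED log format: tab-separated
# TYPE, MESSAGE, TIME. Sample lines are made up.
lines = [
    "ERROR\tdisk failure\t12:01",
    "INFO\tjob started\t12:02",
    "ERROR\tHDFS timeout\t12:05",
]
errors = [l for l in lines if l.startswith("ERROR")]  # filter; persist() in Spark
                                                      # would cache this in memory
hdfs_times = [l.split("\t")[2] for l in errors if "HDFS" in l]
print(hdfs_times)  # -> ['12:05']
```

The point of persist() is that `errors` is reused by both count() and the HDFS query, so caching it avoids recomputing the filter from the original file.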