Spark Processing 101
September 10, 2015
Justin Sun
Overview
- What is Spark?
- SparkContext
- Resilient Distributed Datasets (RDDs)
- Transformations
- Actions
- Code Examples
- Resources
What is Spark?
- General cluster computing system for Big Data
- Supports in-memory processing
- APIs for Scala, Java, and Python
- Additional libraries:
  - Spark Streaming – Process live data streams
  - Spark SQL – SQL and DataFrames
  - MLlib – Machine learning
  - GraphX – Graph processing
SparkContext
- Starting point for working with Spark
- Specifies access to a cluster or the local machine
- Required if you write a standalone program
- Provided as 'sc' by the Spark shell

Scala:
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("Simple App")
    val sc = new SparkContext(conf)

Java:
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf conf = new SparkConf().setAppName("Simple App");
    JavaSparkContext sc = new JavaSparkContext(conf);
Resilient Distributed Datasets (RDDs)
- Main abstraction in Spark
- Fault-tolerant
- Supports parallel operations
- Create RDDs by
  - Calling sc.parallelize()
  - Reading in data from an external source
    - Text file – sc.textFile()
    - HDFS source
    - Cassandra
Transformations
- RDDs are immutable after creation: a transformation never modifies its input
- Enable parallel computations
- Input is an RDD, output is a pointer to a new RDD
- Can be chained together
- Arguments are functions or closures
- Lazy evaluation: nothing happens until an action is run
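Lazy evaluation is easier to see when it runs. The sketch below is not Spark code: it uses the standard java.util.stream API as a stand-in, because stream operations such as map() and filter() are also deferred until a terminal operation is called, much as Spark transformations wait for an action. The class name, the counter, and the sample data are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java analogy for lazy transformations (assumption: a Stream stands in for an RDD).
class LazyPipelineSketch {
    // Counts how many times the mapping function has actually been invoked.
    static final AtomicInteger calls = new AtomicInteger();

    // Chains two deferred steps; building the chain does not process any element.
    static Stream<Integer> buildPipeline(List<Integer> data) {
        return data.stream()
                .map(x -> { calls.incrementAndGet(); return x * 2; })
                .filter(x -> x > 4);
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = buildPipeline(Arrays.asList(1, 2, 3, 4));
        System.out.println("calls after building: " + calls.get()); // 0

        // collect() plays the role of a Spark action: only now is the work done.
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("calls after collect: " + calls.get());  // 4
        System.out.println(result);                                  // [6, 8]
    }
}
```

The counter stays at zero until collect() runs, which mirrors the last bullet: defining a chain of transformations costs nothing until an action asks for a result.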
Actions
- Program is run when an action is called
- Examples:
  - reduce()
  - collect()
  - count()
  - first()
  - take()
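To make the listed actions concrete, here is a rough standard-library analogy rather than Spark itself. Each method pairs one of the actions above with the closest java.util.stream terminal operation. The pairing, the class name, and the sample data are assumptions made for illustration; the real calls live on Spark's RDD classes and run across a cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java stand-ins for the actions on the slide (not the Spark API).
class ActionAnalogies {
    static final List<Integer> DATA = Arrays.asList(5, 3, 8, 1);

    // Like reduce(): combine all elements pairwise into one value.
    static int sum() { return DATA.stream().reduce(Integer::sum).get(); }

    // Like collect(): bring every element back as a local list.
    static List<Integer> all() { return DATA.stream().collect(Collectors.toList()); }

    // Like count(): number of elements.
    static long size() { return DATA.stream().count(); }

    // Like first(): the first element.
    static int first() { return DATA.stream().findFirst().get(); }

    // Like take(n): the first n elements as a local list.
    static List<Integer> take(int n) {
        return DATA.stream().limit(n).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(sum());    // 17
        System.out.println(all());    // [5, 3, 8, 1]
        System.out.println(size());   // 4
        System.out.println(first());  // 5
        System.out.println(take(2));  // [5, 3]
    }
}
```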
Visual Transformations
- DataBricks Visual Guide to Spark Transformations and Actions – http://training.databricks.com/visualapi.pdf
  - map()
  - filter()
  - flatMap()
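For readers without the PDF at hand, the difference between the three transformations can be sketched in plain Java. This is an analogy using java.util.stream, whose map(), filter(), and flatMap() have the same element-level meaning. It is not Spark code, and the class name and sample lines are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Element-level semantics of map, filter, and flatMap (java.util.stream stand-in).
class TransformationAnalogies {
    static final List<String> LINES = Arrays.asList("to be", "or not", "to be");

    // map: exactly one output element per input element.
    static List<Integer> lengths() {
        return LINES.stream().map(String::length).collect(Collectors.toList());
    }

    // filter: keeps only the elements for which the predicate is true.
    static List<String> startingWithTo() {
        return LINES.stream().filter(s -> s.startsWith("to")).collect(Collectors.toList());
    }

    // flatMap: each input yields zero or more outputs, flattened into one sequence.
    static List<String> words() {
        return LINES.stream()
                .flatMap(s -> Arrays.stream(s.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(lengths());         // [5, 6, 5]
        System.out.println(startingWithTo());  // [to be, to be]
        System.out.println(words());           // [to, be, or, not, to, be]
    }
}
```

map() preserves the element count, filter() can only shrink it, and flatMap() can grow it, which is why flatMap() is the usual choice for splitting lines into words.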
Code examples
- http://spark.apache.org/docs/latest/quick-start.html
Resources
- Spark website – http://spark.apache.org/docs/latest
- Quick Start – http://spark.apache.org/docs/latest/quick-start.html
- DataBricks Developer Resources – https://databricks.com/spark/developer-resources
- Spark YouTube channel – https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w
- edX.org Online Courses
  - CS100.1X – Introduction to Big Data with Apache Spark
  - CS190.1X – Scalable Machine Learning