
Spark Processing 101 September 10, 2015 Justin Sun



  1. Spark Processing 101 September 10, 2015 Justin Sun

  2. Overview
     - What is Spark?
     - SparkContext
     - Resilient Distributed Datasets (RDDs)
     - Transformations
     - Actions
     - Code Examples
     - Resources

  3. What is Spark?
     - General cluster computing system for Big Data
     - Supports in-memory processing
     - APIs for Scala, Java, and Python
     - Additional libraries:
       - Spark Streaming – Process live data streams
       - Spark SQL – SQL and DataFrames
       - MLlib – Machine learning
       - GraphX – Graph processing

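  4. SparkContext
     - Starting point for working with Spark
     - Specifies access to a cluster or the local machine
     - Required if you write a standalone program
     - Provided as 'sc' by the Spark shell
     - Scala:
         import org.apache.spark.{SparkConf, SparkContext}
         // The master URL is supplied at launch (for example by spark-submit)
         val conf = new SparkConf().setAppName("Simple App")
         val sc = new SparkContext(conf)
     - Java:
         import org.apache.spark.SparkConf;
         import org.apache.spark.api.java.JavaSparkContext;
         SparkConf conf = new SparkConf().setAppName("Simple App");
         JavaSparkContext sc = new JavaSparkContext(conf);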

  5. Resilient Distributed Datasets (RDDs)
     - Main abstraction in Spark
     - Fault-tolerant
     - Supports parallel operations
     - Create RDDs by
       - Calling sc.parallelize()
       - Reading in data from an external source
         - Text file – sc.textFile()
         - HDFS source
         - Cassandra
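
The two creation routes on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: the `local[*]` master, the app name, and the temporary sample file are assumptions made so the snippet runs on a single machine. It is written script-style, for pasting into the Spark shell with `:paste` or running as a Scala script, and `SparkContext.getOrCreate` reuses the shell's existing context if there is one.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: run in-process on all local cores rather than on a cluster
val conf = new SparkConf().setAppName("RDD Creation Sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Route 1: distribute an in-memory collection
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// Route 2: read a text file, one RDD element per line.
// A temporary file is written first so the example needs no external data.
val path = Files.createTempFile("spark-sketch", ".txt")
Files.write(path, "first line\nsecond line\n".getBytes("UTF-8"))
val lines = sc.textFile(path.toUri.toString)

val numberCount = numbers.count() // 5 elements
val lineCount = lines.count()     // 2 lines
```

An HDFS source would typically use the same `sc.textFile()` call with an `hdfs://` URI; Cassandra access goes through a separate connector library that is not shown here.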

  6. Transformations
     - RDDs are immutable after creation; a transformation returns a new RDD
     - Enable parallel computations
     - Input is an RDD, output is a pointer to an RDD
     - Can be chained together
     - Arguments are functions or closures
     - Lazy evaluation: nothing happens until an action is run
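
Chaining and lazy evaluation might look like the sketch below. The data, the app name, and the `local[*]` master are illustrative assumptions; the snippet is script-style, for the Spark shell or a Scala script.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local in-process master, chosen only for this sketch
val conf = new SparkConf().setAppName("Lazy Evaluation Sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val numbers = sc.parallelize(1 to 10)

// Two chained transformations. Each returns a new RDD that records how to
// compute its data; no job has run at this point.
val evenSquares = numbers
  .filter(n => n % 2 == 0) // closure passed as the argument
  .map(n => n * n)

// The action triggers the actual computation: 4 + 16 + 36 + 64 + 100
val total = evenSquares.reduce(_ + _)

// The source RDD is unchanged by the transformations
val originalCount = numbers.count()
```

Because `numbers` is immutable, the `filter` and `map` calls leave it with all ten elements; only `evenSquares` reflects the chained steps.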

  7. Actions
     - Program is run when an action is called
     - Examples:
       - reduce()
       - collect()
       - count()
       - first()
       - take()
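
As a rough illustration of the actions listed on this slide, the following sketch calls each of them on a small RDD. The sample words, app name, and `local[*]` master are assumptions for the example only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local in-process master, chosen only for this sketch
val conf = new SparkConf().setAppName("Actions Sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val words = sc.parallelize(Seq("spark", "rdd", "action", "transformation"))

val wordCount = words.count()   // number of elements
val firstWord = words.first()   // first element
val firstTwo = words.take(2)    // first two elements, as a local Array
val allWords = words.collect()  // every element, as a local Array

// reduce() combines elements pairwise; here it sums the word lengths
val totalChars = words.map(_.length).reduce(_ + _)
```

Note that `collect()` brings the whole dataset back to the driver, so it is usually reserved for small results; `take(n)` is a safer way to peek at a large RDD.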

  8. Visual Transformations
     - DataBricks Visual Guide to Spark Transformations and Actions – http://training.databricks.com/visualapi.pdf
     - map()
     - filter()
     - flatMap()
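
The difference between the three transformations named here can be seen on a two-line input. This is a small sketch with made-up sample text; the app name and `local[*]` master are likewise assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local in-process master, chosen only for this sketch
val conf = new SparkConf().setAppName("Transformations Sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val lines = sc.parallelize(Seq("to be or", "not to be"))

// map(): exactly one output element per input element
val lengths = lines.map(_.length).collect()

// filter(): keeps only the elements that satisfy the predicate
val withNot = lines.filter(_.contains("not")).collect()

// flatMap(): zero or more output elements per input, flattened into one RDD
val words = lines.flatMap(_.split(" ")).collect()
```

Here `map()` yields two lengths, `filter()` keeps one line, and `flatMap()` turns two lines into six words.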

  9. Code examples
     - http://spark.apache.org/docs/latest/quick-start.html

  10. Resources
      - Spark website – http://spark.apache.org/docs/latest
      - Quick Start – http://spark.apache.org/docs/latest/quick-start.html
      - DataBricks Developer Resources – https://databricks.com/spark/developer-resources
      - Spark YouTube channel – https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w
      - edX.org Online Courses
        - CS100.1X – Introduction to Big Data with Apache Spark
        - CS190.1X – Scalable Machine Learning
