Big Data Processing with Apache Spark Jay Urbain, PhD Credits: Resilient Distributed Datasets Resilient Distributed Datasets A Fault-T A Fault-Tolerant Abstraction for In-Memory Cluster Computing olerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica http://spark.apache.org/
Motivation
Example: MapReduce
Example: MapReduce
Example: MapReduce
Example: MapReduce
Example: MapReduce
Idea: cache data in-memory h"p://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf ¡ ¡
Example: MapReduce h"p://people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf ¡ ¡
Goal: In-Memory Data Sharing
Challenge
Challenge h"p://web.stanford.edu/~ouster/cgi-‑bin/papers/ramcloud.pdf ¡ ¡ h"p://piccolo.news.cs.nyu.edu/piccolo.pdf ¡ ¡
Solution: Resilient Distributed Datasets (RDDs)
RDD Recovery
Generality of RDDs
Tradeoffs
Tradeoffs
Tradeoffs
h"p://databricks.com/blog/2014/11/05/spark-‑officially-‑sets-‑a-‑new-‑record-‑in-‑large-‑scale-‑sorDng.html ¡ ¡
Programming API
Programming Spark • Written in Scala “ scah-lah ” (runs on JVM) • Can write applications in Scala, Java, Python, and R • Interactive: Scala, Python, R
h"p://mesos.apache.org/ ¡ ¡
Spark References • http://spark.apache.org/docs/latest/programming- guide.html • http://spark.apache.org/docs/latest/api/python/index.html
h"p://shop.oreilly.com/product/0636920028512.do ¡
h"p://shop.oreilly.com/product/0636920028512.do ¡
h"p://shop.oreilly.com/product/0636920028512.do ¡
h"p://shop.oreilly.com/product/0636920028512.do ¡
Recommend
More recommend