Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Authors: Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, University of California, Berkeley
NSDI'12 Awarded Best Paper!
Presented by Xiaofeng Wu, adapted from Matei's NSDI'12 presentation and other resources
Problems
• Current cluster computing frameworks
  • MapReduce
  • Dryad: builds distributed data-parallel programs from sequential building blocks
  • Pros: high-level operators; automatic work distribution; fault tolerance
• Problems when doing large-scale data analytics
  • They lack abstractions for leveraging distributed memory
  • They are inefficient at reusing intermediate results across multiple computations
    • iterative machine learning and graph algorithms, e.g., PageRank, K-means clustering, and logistic regression
    • interactive data mining, e.g., multiple ad-hoc queries on the same subset of the data
Examples
[Figure: an iterative job writes its output back to HDFS after every step (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → ...), and interactive use re-reads the input from HDFS for each of query 1, 2, and 3]
Substantial overheads: data replication, disk I/O, and serialization
Existing solutions
• Specialized frameworks
  • Pregel: a system for iterative graph computations that keeps intermediate data in memory
  • HaLoop: offers an iterative MapReduce interface
  • Problem: they do not provide abstractions for more general reuse
    • e.g., to let a user load several datasets into memory and run ad-hoc queries across them
• Existing in-memory storage on clusters
  • e.g., distributed shared memory, key-value stores, databases, Piccolo
  • Problems for data-intensive workloads
    • They only provide a low-level programming interface: just reads and updates to table cells
    • To provide fault tolerance, they must replicate data across machines or log updates across machines, copying large amounts of data over the cluster network
(Piccolo: a data-centric programming model for writing parallel in-memory applications in data centers.)
Comparison of RDDs with distributed shared memory
[Table from the paper: RDDs restrict writes to coarse-grained transformations, get consistency trivially from immutability, and recover from faults cheaply via lineage, with straggler mitigation via backup tasks; DSM allows fine-grained reads and writes but needs checkpoints and program rollback for fault recovery, and straggler mitigation is difficult]
Proposal
• Resilient distributed datasets (RDDs)
• Why are RDDs better?
  • optimized data placement via control over partitioning
  • let users explicitly persist intermediate results in memory
  • enable efficient data reuse in a broad range of applications
  • a rich set of operators: map, join, filter, ... (see the sketch below)
  • fault-tolerant by logging the transformations used to build a dataset (its lineage)
• Results: up to 20× faster than Hadoop for iterative applications, speeds up a real-world data analytics report by 40×, and can be used interactively to scan a 1 TB dataset with 5–7 s latency
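As a taste of the operator set, a minimal sketch in Scala (assuming a SparkContext named sc; the paths, field layouts, and variable names are illustrative, not from the paper):

  // Build RDDs from files, derive new RDDs with map/filter/join,
  // and pin an intermediate result in memory for reuse.
  val users  = sc.textFile("hdfs://.../users")
                 .map(l => (l.split("\t")(0), l))              // (userId, record)
  val clicks = sc.textFile("hdfs://.../clicks")
                 .map(l => (l.split("\t")(0), l))              // (userId, click)
  val errors = clicks.filter { case (_, c) => c.contains("ERROR") }
  val joined = users.join(errors)                              // (userId, (record, click))
  joined.persist()                                             // reuse across later actions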
Applications Not Suitable for RDDs
• Applications that make asynchronous fine-grained updates to shared state
  • e.g., a storage system for a web application or an incremental web crawler
• Reason: RDDs are best suited for batch applications that apply the same operation to all elements of a dataset
What is an RDD?
• A distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner
• An RDD is a read-only, partitioned collection of records
• RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs
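A minimal sketch of the two creation paths (assuming a SparkContext sc; the path is illustrative):

  // (1) from data in stable storage
  val fromStorage = sc.textFile("hdfs://.../input")
  // (2) by a deterministic operation on another RDD
  val derived = fromStorage.map(_.toUpperCase)
  // There is no way to mutate either dataset in place:
  // every operation returns a new, read-only RDD.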
[Figure: a "Driver" program coordinates several "Worker" nodes, each storing one or more RDD partitions in memory]
Adopted from: http://www.cs.tau.ac.il/~milo/courses/ds16slides/OmriZ.pptx
Question 1:
• (1) "...individual RDDs are immutable..."
  • What does it mean for an RDD to be "immutable"?
  • What benefits does this property of RDDs bring?
In-Memory Data Sharing
[Figure: with RDDs, iterations pass data through memory (Input → iter. 1 → iter. 2 → ...) instead of reading and writing HDFS each step, and after a one-time read and processing of the input, queries 1, 2, and 3 all run against the in-memory dataset]
Question 2:
• (2) When an RDD is being created (new data are being written into it), can the data in the RDD be read for computing before the RDD is completely created?
RDD Recovery
[Figure: the same dataflows as the in-memory sharing slide (Input → iter. 1 → iter. 2 → ...; Input → one-time processing → queries 1, 2, 3); if a partition is lost, it is rebuilt by re-running the transformations from the input]
Tradeoff Space
[Figure: a plane with granularity of updates (fine → coarse) on one axis and write throughput (low → high) on the other. Fine-grained systems (K-V stores, databases, RAMCloud) are bounded by network bandwidth and are best for transactional workloads; coarse-grained systems (HDFS, RDDs) operate at memory bandwidth and are best for batch workloads]
Spark Programming Interface
• A DryadLINQ- and FlumeJava-like API in the Scala language
• Usable interactively from the Scala interpreter
• Provides:
  • Resilient distributed datasets (RDDs) represented as objects
  • Operations on RDDs:
    • transformations (build new RDDs)
    • actions (compute and output results)
  • Control of each RDD's partitioning (layout across nodes) and persistence (storage in RAM, spilling to disk, etc.)
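A small sketch of the transformation/action split (sc is an assumed SparkContext): transformations are lazy and only record how to build the new RDD; an action triggers the actual computation.

  val nums    = sc.parallelize(1 to 1000000)  // distribute a local collection
  val squares = nums.map(x => x.toLong * x)   // transformation: no work happens yet
  val evens   = squares.filter(_ % 2 == 0)    // transformation: still no work
  val total   = evens.count()                 // action: runs the whole pipeline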
Example: Log Mining
• Problem description: suppose that a web service is experiencing errors and an operator wants to search terabytes of logs in the Hadoop filesystem (HDFS) to find the cause
[Table: transformations (map, filter, flatMap, join, groupByKey, reduceByKey, sort, partitionBy, ...) and actions (count, collect, reduce, lookup, save) available on RDDs in Spark]
Question 5: Explain Figure 1 about a lineage graph.

  lines = spark.textFile("hdfs://...")
  // transformations
  errors = lines.filter(_.startsWith("ERROR"))
  errors.persist() // persistence
  errors.count()   // action
  // Count errors mentioning MySQL:
  errors.filter(_.contains("MySQL")).count()
  // Return the time fields of errors mentioning
  // HDFS as an array (assuming time is field
  // number 3 in a tab-separated format):
  errors.filter(_.contains("HDFS"))
        .map(_.split("\t")(3)).collect()

Figure: the lineage graph for the RDDs in the third query (lines → filter(_.startsWith("ERROR")) → errors → filter(_.contains("HDFS")) → map(_.split("\t")(3))). If a partition of errors is lost, Spark can rebuild it by applying the filter on only the corresponding partition of lines.
Representing RDDs
• A simple graph-based representation; each RDD exposes:
  1. a set of partitions, which are atomic pieces of the dataset
  2. a set of dependencies on parent RDDs
  3. a function for computing the dataset based on its parents
  4. metadata about its partitioning scheme
  5. metadata about data placement
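The paper (Table 3) captures this representation as a common interface; the Scala sketch below paraphrases it with simplified type names, not Spark's actual internal signatures:

  trait Partition; trait Dependency; trait Partitioner

  trait RDD[T] {
    def partitions: Seq[Partition]                        // 1. atomic pieces of the dataset
    def dependencies: Seq[Dependency]                     // 2. links to parent RDDs
    def iterator(p: Partition,
                 parents: Seq[Iterator[_]]): Iterator[T]  // 3. compute a partition from its parents
    def partitioner: Option[Partitioner]                  // 4. partitioning-scheme metadata
    def preferredLocations(p: Partition): Seq[String]     // 5. data-placement hints (host names)
  }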
Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:

  messages = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))

  HadoopRDD (path = hdfs://...) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(...))
Question 3:
• (3) "This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data."
• "To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state."
• Why does using RDDs help to provide efficient fault tolerance? Or: why do coarse-grained transformations help with efficiency?
Question 4:
• (4) "In addition, programmers can call a persist method to indicate which RDDs they want to reuse in future operations."
• What is the consequence if a user does not explicitly request persistence of an RDD? (See the sketch below.)
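A sketch of the consequence (sc and the path are assumptions): without persist, every action re-evaluates the lineage from stable storage; with persist, the first action materializes the RDD and later actions reuse the cached partitions.

  val errors = sc.textFile("hdfs://.../logs").filter(_.startsWith("ERROR"))
  errors.count()                               // scans the file
  errors.filter(_.contains("MySQL")).count()   // scans the file again: recomputed

  errors.persist()                             // hint: keep in memory once computed
  errors.count()                               // computes once and caches
  errors.filter(_.contains("HDFS")).count()    // served from the cached partitions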
Example: PageRank
Example: PageRank
1. Start each page with a rank of 1
2. On each iteration, update each page's rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

  links = // RDD of (url, neighbors) pairs
  ranks = // RDD of (url, rank) pairs
  for (i <- 1 to ITERATIONS) {
    ranks = links.join(ranks).flatMap {
      case (url, (links, rank)) =>
        links.map(dest => (dest, rank / links.size))
    }.reduceByKey(_ + _)
  }
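For context, a self-contained version of the loop above (the input path, parsing, and iteration count are assumptions for illustration):

  val ITERATIONS = 10
  val links = sc.textFile("hdfs://.../links")        // lines of the form "url<TAB>dest"
    .map { l => val p = l.split("\t"); (p(0), p(1)) }
    .groupByKey()                                    // (url, neighbors)
    .persist()                                       // links is reused every iteration
  var ranks = links.mapValues(_ => 1.0)              // start each page at rank 1

  for (i <- 1 to ITERATIONS) {
    ranks = links.join(ranks).flatMap {
      case (_, (neighbors, rank)) =>
        neighbors.map(dest => (dest, rank / neighbors.size))
    }.reduceByKey(_ + _)
  }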
Optimizing Placement
• links & ranks are repeatedly joined
  [Figure: Links and Ranks_0 (as (url, neighbors) and (url, rank) pairs) feed a join producing Contribs_0, a reduce producing Ranks_1, then join → Contribs_1 → reduce → Ranks_2, and so on]
• Can co-partition them (e.g., hash both on URL) to avoid shuffles
• Can also use app knowledge, e.g., hash on DNS name

  links = links.partitionBy(new URLPartitioner())
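The slide references a URLPartitioner without showing it; below is one plausible implementation sketch (not the authors' code), hashing only the URL's host so pages from the same site, i.e., the same DNS name, land in the same partition:

  import org.apache.spark.Partitioner
  import java.net.URI

  class URLPartitioner(override val numPartitions: Int) extends Partitioner {
    override def getPartition(key: Any): Int = {
      val host = new URI(key.toString).getHost
      val h = if (host == null) key.hashCode else host.hashCode
      ((h % numPartitions) + numPartitions) % numPartitions  // non-negative bucket index
    }
  }
  // Usage, matching the slide:
  //   links = links.partitionBy(new URLPartitioner(128))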
PageRank Performance
[Chart: time per iteration — Hadoop: 171 s; Basic Spark: 72 s; Spark + controlled partitioning: 23 s]
Conclusion
• RDDs are an efficient, general-purpose, and fault-tolerant abstraction for sharing data in cluster applications
• Coarse-grained transformations let applications recover data efficiently using lineage
• Expressive enough for a wide range of parallel applications, including iterative computation