CNV/CC&V MEIC-A/MEIC-T/METI
Computação em Nuvem e Virtualização (Cloud Computing and Virtualization)
Big-Data Processing III (Stream Processing)
Prof. Luís Veiga, IST / INESC-ID Lisboa
https://fenix.tecnico.ulisboa.pt/disciplinas/AVExe7/2019-2020/2-semestre/
LV, JG 2015-20; sources: Spark, Flink
Agenda
- Spark: overview, RDD programming model, examples, RDD operations, fault tolerance, performance
- Spark Streaming: overview, discretized stream processing, windows, sliding windows, micro-batching
- Flink: overview, windowing (tumbling, sliding, custom, and time-based windows), watermarks, state management, versioning, fault tolerance, distributed snapshots, execution semantics
Spark
Motivation
Current popular programming models for clusters transform data flowing from stable storage to stable storage. E.g., MapReduce:
(figure: input blocks feed Map tasks, whose outputs feed Reduce tasks that write the output)
Benefit of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
Motivation
Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data:
- Iterative algorithms (many in machine learning)
- Interactive data mining tools (R, Excel, Python)
Spark makes working sets a first-class concept to efficiently support these apps.
Spark Goal
Provide distributed memory abstractions for clusters to support apps with working sets.
Retain the attractive properties of MapReduce:
- Fault tolerance (for crashes & stragglers)
- Data locality
- Scalability
Solution: augment the data flow model with "resilient distributed datasets" (RDDs).
Generality of RDDs
Spark's combination of data flow with RDDs unifies many proposed cluster programming models:
- General data flow models: MapReduce, Dryad, SQL
- Specialized models for stateful apps: Pregel (Bulk Synchronous Parallel), HaLoop (iterative MapReduce), Continuous Bulk Processing
Instead of specialized APIs for one type of app, give users first-class control of the distributed datasets.
Programming Model
Resilient distributed datasets (RDDs):
- Immutable collections partitioned across a cluster that can be rebuilt if a partition is lost
- Created by transforming data in stable storage using data flow operators (map, filter, groupBy, ...)
- Can be cached across parallel operations
Parallel operations on RDDs:
- reduce, collect, count, save, ...
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

    val lines = spark.textFile("hdfs://...")           // base RDD
    val errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
    val messages = errors.map(_.split('\t')(2))
    val cachedMsgs = messages.cache()                  // cached RDD

    cachedMsgs.filter(_.contains("foo")).count         // parallel operation
    cachedMsgs.filter(_.contains("bar")).count
    ...

(figure: the driver ships tasks to workers; each worker reads its HDFS block, and results stay cached in worker memory across queries)
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).
RDDs in More Detail
An RDD is an immutable, partitioned, logical collection of records:
- Need not be materialized; instead it contains enough information to rebuild the dataset from stable storage
- Partitioning can be based on a key in each record (using hash or range partitioning)
- Built using bulk transformations on other RDDs
- Can be cached for future reuse
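As an illustration of key-based partitioning, a minimal hedged sketch using Spark's HashPartitioner (the input path, key extraction, and partition count are ours, not from the slides):

    import org.apache.spark.HashPartitioner

    // Key each record by its first tab-separated field, then hash-partition
    // the pair RDD into 8 partitions so records with the same key co-locate.
    val pairs = spark.textFile("hdfs://...")
      .map(line => (line.split('\t')(0), line))
    val partitioned = pairs.partitionBy(new HashPartitioner(8))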
RDD Operations
Transformations (define a new RDD):
  map, filter, sample, union, groupByKey, reduceByKey, join, cache, ...
Parallel actions/operations (return a result to the driver):
  reduce, collect, count, countByKey, save, ...
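To make the distinction concrete, a minimal hedged sketch (the input path and data are hypothetical): transformations lazily define a new RDD, while an action actually triggers the computation and ships a result to the driver.

    // Transformations are lazy: nothing is read or computed here.
    val evens = spark.textFile("hdfs://.../numbers.txt")  // hypothetical input
      .map(_.toInt)          // transformation: defines a new RDD
      .filter(_ % 2 == 0)    // transformation: still no work done

    // Actions force evaluation and return results to the driver.
    val howMany = evens.count()    // runs the whole pipeline
    evens.cache()                  // mark for in-memory reuse by later actions
    val all = evens.collect()      // materializes an Array[Int] at the driver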
RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions, i.e., they track data dependencies in the data flow. Ex:

    val cachedMsgs = textFile(...).filter(_.contains("error"))
                                  .map(_.split('\t')(2))
                                  .cache()

Lineage: HdfsRDD (path: hdfs://...) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(...)) -> CachedRDD
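Spark can print this lineage at runtime via the RDD method toDebugString; applied to the cachedMsgs RDD above:

    // Prints the chain of parent RDDs, back to the RDD reading the file:
    // exactly the information used to recompute lost partitions.
    println(cachedMsgs.toDebugString)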
Benefits of RDD Model
- Consistency is easy due to immutability
- Inexpensive fault tolerance (log lineage/dependency information rather than replicating or checkpointing data)
- Locality-aware scheduling of tasks on partitions
- Despite being restricted (not as expressive as arbitrary queries), the model seems applicable to a broad variety of applications
Example: Logistic Regression
Goal: find the best line separating two sets of points.
(figure: scattered points of two classes, a random initial line, and the target separating line)
Logistic Regression Code

    val data = spark.textFile(...).map(readPoint).cache()

    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }

    println("Final w: " + w)
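The snippet leaves readPoint, D, and ITERATIONS undefined; a hedged sketch of what they might look like (the input format, a label followed by D whitespace-separated features, and the Vector class of the early Spark examples are our assumptions):

    val D = 10            // number of features (assumed)
    val ITERATIONS = 5    // number of gradient steps (assumed)

    case class Point(x: Vector, y: Double)   // y is the label, +1 or -1

    // Hypothetical parser: "label f1 f2 ... fD" per input line.
    def readPoint(line: String): Point = {
      val nums = line.split("\\s+").map(_.toDouble)
      Point(new Vector(nums.tail), nums.head)
    }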
Logistic Regression Performance
- Hadoop MapReduce: 127 s per iteration
- Spark: 174 s for the first iteration (loading and caching the data), ~6 s per further iteration (working set served from memory)
Example: MapReduce
The MapReduce data flow can be expressed using RDD transformations:

    val res = data.flatMap(rec => myMapFunc(rec))
                  .groupByKey()
                  .map { case (key, vals) => myReduceFunc(key, vals) }

Or with combiners:

    val res = data.flatMap(rec => myMapFunc(rec))
                  .reduceByKey(myCombiner)
                  .map { case (key, value) => myReduceFunc(key, value) }
Word Count in Spark

    val lines = spark.textFile("hdfs://...")
    val counts = lines.flatMap(_.split("\\s"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.save("hdfs://...")
Spark Streaming
Traditional data processing
E.g., log analysis using a batch processor:
- Web servers write logs; a periodic (custom) or continuous ingestion job loads them into storage (HDFS / S3)
- Batch job(s) for log analysis run against the stored logs and feed a serving layer
- A job scheduler (e.g., Oozie) triggers the analysis periodically, e.g., every 2 hrs
Latency from a log event to the serving layer is usually in the range of hours.
Log event analysis using a stream processor
Stream processors allow analyzing events with sub-second latency:
- Web servers forward events immediately to a high-throughput publish/subscribe bus
- The stream processor consumes events, processes them in real time, and updates the serving layer
Discretized Stream Processing
Run a streaming computation as a series of very small, deterministic batch jobs:
- Spark Streaming chops up the live data stream into micro-batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
Discretized Stream Processing
- Batch sizes as low as ½ second, end-to-end latency ~1 second (micro-batching)
- Potential for limited combination of batch processing and stream processing in the same system
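A minimal hedged sketch of setting the micro-batch interval in Spark Streaming (app name, host, and port are placeholders); the duration passed to StreamingContext is the "X seconds" above:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("MicroBatchSketch")
    // Each 1-second batch of input becomes one RDD processed by a batch job.
    val ssc = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
    lines.count().print()                                // per-batch record count

    ssc.start()
    ssc.awaitTermination()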
Example – Count hashtags

    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.countByValue()

(figure: for each batch @ t, t+1, t+2, ..., the tweets DStream is transformed by flatMap into hashTags, then map/reduceByKey produce tagCounts, e.g., [(#cat, 10), (#dog, 25), ...])
Example – Count hashtags over last 10 mins

    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

The sliding window operation takes a window length (Minutes(10)) and a sliding interval (Seconds(1)); countByValue counts over all the data in the current window.
(figure: batches t-1 ... t+3 of hashTags, with a sliding window feeding countByValue into tagCounts)
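Windowing can also be fused with the reduction itself; a hedged sketch using Spark Streaming's reduceByKeyAndWindow operator on the hashTags DStream from the example above:

    // Count each hashtag over a 10-minute window that slides every second.
    val windowedTagCounts = hashTags
      .map(tag => (tag, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(10), Seconds(1))

An overload also accepts an inverse reduce function (e.g., (a, b) => a - b), so each new window is computed incrementally from the previous one instead of from scratch.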
Fault-tolerance
- RDDs remember the sequence of operations (the dataflow) that created them from the original fault-tolerant input data
- Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant
- Data lost due to worker failure can be recomputed from the replicated input data
(figure: the tweets input RDD is replicated in memory; lost partitions of the derived hashTags RDD are recomputed on other workers via flatMap)
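For long lineages and stateful operators, Spark Streaming additionally supports periodic checkpointing to stable storage, which bounds recovery work; a one-line hedged sketch (the directory is a placeholder):

    // Persist periodic checkpoints so recovery need not replay the full lineage.
    ssc.checkpoint("hdfs://.../checkpoints")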
Flink
Apache Flink
Apache Flink is an open source stream processing framework:
- Low latency
- High throughput
- Stateful
- Distributed
Developed at the Apache Software Foundation; used in production.
Apache Flink
Real-world data is produced in a continuous fashion; systems like Flink embrace the streaming nature of data.
(figure: web servers publish events to a Kafka topic, which the stream processor consumes)
Apache Kafka: a reliable message queue / feed broker.
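To give a flavor of the Flink API, and of the tumbling windows listed in the agenda, a minimal hedged sketch in Flink's Scala API (circa Flink 1.x; host, port, and window size are placeholders):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time

    object WindowedWordCount {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Placeholder source: lines of text arriving on a socket.
        val text = env.socketTextStream("localhost", 9999)

        val counts = text
          .flatMap(_.toLowerCase.split("\\s+"))
          .map(word => (word, 1))
          .keyBy(_._1)                    // partition the stream by word
          .timeWindow(Time.seconds(10))   // 10-second tumbling windows
          .sum(1)                         // sum the counts per word

        counts.print()
        env.execute("Windowed word count")
      }
    }

With a single argument, timeWindow yields tumbling windows; adding a slide argument, timeWindow(size, slide), yields sliding windows.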
Overview of Flink Architecture
(figure: Flink architecture diagram)