Big-Data Processing III (Stream Processing)



  1. CNV/CC&V MEIC-A/MEIC-T/METI, Computação em Nuvem e Virtualização
  Big-Data Processing III (Stream Processing)
  Prof. Luís Veiga, IST / INESC-ID Lisboa
  https://fenix.tecnico.ulisboa.pt/disciplinas/AVExe7/2019-2020/2-semestre/
  LV, JG 2015-20; sources: Spark, Flink

  2. Agenda
  • Spark
    – overview, RDDs
    – programming model, examples
    – RDD operations
    – fault tolerance, performance
  • Spark Streaming
    – overview, discretized stream processing
    – windows, sliding windows, micro-batching
  • Flink
    – overview, windowing
    – tumbling windows, sliding windows, custom windows
    – time-based windows, watermarks
    – state management, versioning
    – fault tolerance, distributed snapshots, execution semantics

  3. Spark

  4. Motivation
  • Current popular programming models for clusters transform data flowing from stable storage to stable storage
  • E.g., MapReduce: Input → Map → Reduce → Output

  5. Motivation
  • Current popular programming models for clusters transform data flowing from stable storage to stable storage
  • E.g., MapReduce: Input → Map → Reduce → Output
  • Benefit of data flow: the runtime can decide where to run tasks and can automatically recover from failures

  6. Motivation
  • Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data:
    – iterative algorithms (many in machine learning)
    – interactive data mining tools (R, Excel, Python)
  • Spark makes working sets a first-class concept to efficiently support these apps

  7. Spark Goal
  • Provide distributed memory abstractions for clusters to support apps with working sets
  • Retain the attractive properties of MapReduce:
    – fault tolerance (for crashes & stragglers)
    – data locality
    – scalability
  • Solution: augment the data flow model with "resilient distributed datasets" (RDDs)

  8. Generality of RDDs
  • Spark's combination of data flow with RDDs unifies many proposed cluster programming models
    – general data flow models: MapReduce, Dryad, SQL
    – specialized models for stateful apps: Pregel (Bulk Synchronous Processing), HaLoop (iterative MapReduce), Continuous Bulk Processing
  • Instead of specialized APIs for one type of app, give users first-class control of the distributed datasets

  9. Programming Model
  • Resilient distributed datasets (RDDs)
    – immutable collections partitioned across a cluster that can be rebuilt if a partition is lost
    – created by transforming data in stable storage using data flow operators (map, filter, groupBy, …)
    – can be cached across parallel operations
  • Parallel operations on RDDs
    – reduce, collect, count, save, …

  10. Example: Log Mining
  Load error messages from a log into memory, then interactively search for various patterns:

    lines = spark.textFile("hdfs://...")          // base RDD
    errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()                 // cached RDD
    cachedMsgs.filter(_.contains("foo")).count    // parallel operation
    cachedMsgs.filter(_.contains("bar")).count
    ...

  [figure: the driver ships tasks to workers, which read input blocks and cache results in memory]
  Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

  11. RDDs in More Detail
  • An RDD is an immutable, partitioned, logical collection of records
    – need not be materialized, but rather contains enough information to allow rebuilding the dataset from stable storage
  • Partitioning can be based on a key in each record (using hash or range partitioning)
  • Built using bulk transformations on other RDDs
  • Can be cached for future reuse
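
  The key-based partitioning mentioned above can be sketched in a few lines of plain Scala; the helper below is purely illustrative (it is not Spark's Partitioner API), but it shows the essential property that records with the same key always land in the same partition:

```scala
// Illustrative sketch of hash partitioning, not Spark's Partitioner API:
// a key deterministically maps to one of numPartitions partitions.
def hashPartition[K](key: K, numPartitions: Int): Int =
  math.abs(key.hashCode % numPartitions)

val records = Seq(("error", 1), ("warn", 2), ("error", 3), ("info", 4))
val numPartitions = 4

// Group records by the partition their key hashes to; both "error"
// records necessarily end up in the same partition.
val partitioned: Map[Int, Seq[(String, Int)]] =
  records.groupBy { case (k, _) => hashPartition(k, numPartitions) }
```

  Range partitioning works analogously, mapping keys to partitions by comparing them against split points instead of hashing.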

  12. RDD Operations

    Transformations                Parallel actions/operations
    (define a new RDD)             (return a result to the driver)
    -----------------              -------------------------------
    map                            reduce
    filter                         collect
    sample                         count
    union                          countByKey
    groupByKey                     save
    reduceByKey                    …
    join
    cache
    …
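
  The transformation/action split can be mirrored on a local Scala collection: operations on a lazy view (like RDD transformations) only build up a plan, while a terminal operation such as sum (playing the role of an RDD action) forces evaluation and returns a result to the caller:

```scala
// "Transformations" on a lazy view build a plan; nothing is computed yet.
val data = (1 to 10).toList
val transformed = data.view.map(_ * 2).filter(_ > 10)

// The "action" forces evaluation and returns a concrete result,
// like an RDD action returning to the driver.
val result = transformed.sum // 12 + 14 + 16 + 18 + 20 = 80
```

  Spark transformations are likewise lazy; the scheduler only runs tasks when an action demands a result.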

  13. RDD Fault Tolerance
  • RDDs maintain lineage information that can be used to reconstruct lost partitions
    – i.e., track data dependencies in the data flow
  • Example:

    cachedMsgs = textFile(...).filter(_.contains("error"))
                              .map(_.split('\t')(2))
                              .cache()

    Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(…)) → MappedRDD (func: split(…)) → CachedRDD
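
  Lineage-based recovery can be sketched in plain Scala (a simplification, not Spark's internal representation): a derived dataset records its source partitions and the function applied, so any lost partition can be recomputed on demand from stable input:

```scala
// Simplified lineage record: stable source partitions plus the bulk
// transformation that derives this dataset from them.
case class Lineage[A, B](source: Seq[Seq[A]], func: Seq[A] => Seq[B]) {
  // Materialize one partition; calling this again after a simulated loss
  // re-derives exactly the same data from the source.
  def compute(partition: Int): Seq[B] = func(source(partition))
}

val input = Seq(
  Seq("error\tdisk\tfull", "info\tok"),
  Seq("error\tnet\tdown"))

// Mirrors the slide's cachedMsgs chain: filter on "error", then split.
val cachedMsgs = Lineage[String, String](
  input,
  lines => lines.filter(_.contains("error")).map(_.split('\t')(2)))

val partition0 = cachedMsgs.compute(0)
```

  Because the function is deterministic and the source is stable, recomputation after a failure yields the same partition contents, with no data replication needed.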

  14. Benefits of RDD Model
  • Consistency is easy due to immutability
  • Inexpensive fault tolerance (log lineage dependency information rather than replicating/checkpointing data)
  • Locality-aware scheduling of tasks on partitions
  • Despite being restricted (not as expressive as queries), the model seems applicable to a broad variety of applications

  15. Example: Logistic Regression
  Goal: find the best line separating two sets of points
  [figure: two point sets, with a random initial line converging to the target line]

  16. Logistic Regression Code

    val data = spark.textFile(...).map(readPoint).cache()
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    println("Final w: " + w)
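
  The slide's version needs a Spark cluster; the same gradient loop can be run locally as a minimal sketch, with the slide's Vector type replaced by Array[Double] helpers (Point, dot, and the sample data are illustrative stand-ins):

```scala
import scala.math.exp

// Illustrative stand-ins for the slide's readPoint/Vector machinery.
case class Point(x: Array[Double], y: Double)

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (ai, bi) => ai * bi }.sum

// Two linearly separable one-dimensional points with labels +1 / -1.
val data = Seq(Point(Array(2.0), 1.0), Point(Array(-2.0), -1.0))
var w = Array(0.1)

for (_ <- 1 to 100) {
  // Same per-point gradient term as on the slide:
  // (1 / (1 + exp(-y * (w . x))) - 1) * y * x
  val gradient = data
    .map { p =>
      val scale = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
      p.x.map(_ * scale)
    }
    .reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
  w = w.zip(gradient).map { case (wi, gi) => wi - gi }
}
```

  The point of the Spark version is that data is read and parsed once, cached, and then reused across all iterations, which is exactly what the performance numbers on the next slide reflect.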

  17. Logistic Regression Performance
  • 127 s / iteration
  • first iteration: 174 s; further iterations: 6 s (working set cached in memory)

  18. Example: MapReduce
  MapReduce data flow can be expressed using RDD transformations:

    res = data.flatMap(rec => myMapFunc(rec))
              .groupByKey()
              .map((key, vals) => myReduceFunc(key, vals))

  Or with combiners:

    res = data.flatMap(rec => myMapFunc(rec))
              .reduceByKey(myCombiner)
              .map((key, val) => myReduceFunc(key, val))

  19. Word Count in Spark

    val lines = spark.textFile("hdfs://...")
    val counts = lines.flatMap(_.split("\\s"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.save("hdfs://...")
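
  The same word-count logic runs on a local Scala collection with no cluster: flatMap splits lines into words, map pairs each word with 1, and a groupBy-based fold plays the role of reduceByKey:

```scala
val lines = Seq("to be or not", "to be")

val counts: Map[String, Int] = lines
  .flatMap(_.split("\\s+"))            // lines -> words
  .map(word => (word, 1))              // words -> (word, 1) pairs
  .groupBy(_._1)                       // group pairs by word
  .map { case (word, pairs) =>         // sum the 1s per word,
    word -> pairs.map(_._2).sum        // like reduceByKey(_ + _)
  }
```

  The distributed version differs only in where the grouping happens: reduceByKey shuffles pairs so that all pairs for a key meet on one node, combining partial sums along the way.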

  20. Spark Streaming

  21. Traditional data processing
  • E.g., log analysis using a batch processor
  • Latency from log event to serving layer is usually in the range of hours
  • Pipeline: web servers produce logs; periodic (custom) or continuous ingestion into storage (HDFS / S3); a job scheduler (e.g., Oozie, every 2 hrs) runs batch jobs for log analysis; results feed the serving layer

  22. Log event analysis using a stream processor
  • Stream processors allow analyzing events with sub-second latency
  • Pipeline: web servers forward events immediately to a high-throughput publish/subscribe bus; the stream processor processes events in real time and updates the serving layer

  23. Discretized Stream Processing
  Run a streaming computation as a series of very small, deterministic batch jobs:
  • Spark Streaming chops up the live data stream into micro-batches of X seconds
  • Spark treats each batch of data as an RDD and processes it using RDD operations
  • finally, the processed results of the RDD operations are returned in batches

  24. Discretized Stream Processing
  Run a streaming computation as a series of very small, deterministic batch jobs:
  • batch sizes as low as ½ second, latency ~1 second (micro-batching)
  • potential for limited combination of batch processing and stream processing in the same system
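
  The micro-batching idea above can be sketched in plain Scala: events carry a timestamp, the stream is chopped into fixed-size batches, and each batch is then processed as an ordinary (RDD-like) collection. The Event type and field names are illustrative:

```scala
case class Event(timeMs: Long, tag: String)

val batchSizeMs = 500L
val events = Seq(
  Event(0, "#cat"), Event(100, "#dog"),
  Event(600, "#cat"), Event(1100, "#cat"))

// Assign each event to the micro-batch covering its timestamp ...
val microBatches: Map[Long, Seq[Event]] =
  events.groupBy(e => e.timeMs / batchSizeMs)

// ... then run a normal batch computation (here, a count) per micro-batch.
val countsPerBatch: Map[Long, Int] =
  microBatches.map { case (batch, evs) => batch -> evs.size }
```

  Determinism per batch is what makes the fault-tolerance story work: a lost batch result can always be recomputed from its replicated input.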

  25. Example: Count hashtags

    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.countByValue()

  [figure: each batch (@ t, t+1, t+2, …) applies flatMap, map, and reduceByKey independently]
  tagCounts yields e.g. [(#cat, 10), (#dog, 25), ...]

  26. Example: Count hashtags over the last 10 minutes

    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

  window(window length, sliding interval) defines a sliding window operation

  27. Example: Count hashtags over the last 10 minutes

    val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

  [figure: at each slide (t-1, t, t+1, …), countByValue counts over all the data in the window]
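
  A sliding-window count can be sketched directly over timestamped tags, mirroring window(<length>, <slide>).countByValue() at a much smaller scale; the Tagged type, the numbers, and the window convention here are illustrative, not Spark Streaming's API:

```scala
case class Tagged(timeMs: Long, tag: String)

val windowLengthMs = 1000L
val stream = Seq(
  Tagged(0, "#cat"), Tagged(400, "#dog"),
  Tagged(900, "#cat"), Tagged(1400, "#cat"))

// Count tags whose timestamp falls in [endMs - windowLength, endMs);
// re-evaluating this at every sliding interval yields the windowed stream.
def windowCounts(endMs: Long): Map[String, Int] =
  stream
    .filter(t => t.timeMs >= endMs - windowLengthMs && t.timeMs < endMs)
    .groupBy(_.tag)
    .map { case (tag, ts) => tag -> ts.size }
```

  When the slide interval is much shorter than the window length, consecutive windows overlap heavily, which is why real systems compute windowed counts incrementally (adding the new slice, subtracting the expired one) rather than rescanning the whole window.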

  28. Fault-tolerance
  • RDDs remember the sequence of operations (dataflow) that created them from the original fault-tolerant input data
  • Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant
  • Data lost due to worker failure can be recomputed from the replicated input data: e.g., lost partitions of the hashTags RDD (derived by flatMap from the tweets input RDD) are recomputed on other workers

  29. Flink

  30. Apache Flink
  • Apache Flink is an open source stream processing framework
    – low latency
    – high throughput
    – stateful
    – distributed
  • Developed at the Apache Software Foundation, used in production

  31. Apache Flink
  • Real-world data is produced in a continuous fashion; systems like Flink embrace the streaming nature of data
  • Typical pipeline: web servers publish to a Kafka topic, which the stream processor consumes (Apache Kafka: reliable message queue/feed broker)

  32. Overview of Flink Architecture
  [figure: Flink architecture diagram]
