Stream Processing: Marco Serafini, COMPSCI 532, Lecture 5



  1. Stream Processing Marco Serafini COMPSCI 532 Lecture 5

  2. Stream vs. Batch Processing • Batch processing • Bounded input • Bounded one-shot computation • Bounded output • Stream processing • Unbounded input: data stream • Unbounded computation: always on • Unbounded output

  3. Advantages and Disadvantages • Advantages of stream processing • Many real-time datasets are data streams (e.g. IoT) • Near real-time results • No need to accumulate data before processing • Streaming operators typically require less memory • Disadvantages of stream processing • Need to deal with timing semantics • Some operators are harder to implement with streaming • Especially if we want operator state to be constant • E.g. finding the median • Stream algorithms are often approximations

  4. Streaming Computation • Dataflow Graph of (possibly stateful) operators • Data streams connecting them • Tuples in the stream: <key, [timestamp,] value> • FIFO channels • Partitioning • Operators are parallelized into subtasks based on key • Streams are split into partitions
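
A minimal sketch of key-based partitioning as described on this slide, not taken from the lecture. The helper names `partition_for` and `split_stream` are hypothetical, and the CRC32 hash is an assumption chosen only because it is stable across runs. Every tuple with the same key lands in the same partition, and arrival order is preserved within each partition (FIFO).

```python
import zlib


def partition_for(key: str, num_subtasks: int) -> int:
    """Map a key to a subtask index with a hash that is stable across runs."""
    # The built-in hash() is randomized per process for str, so use CRC32 instead.
    return zlib.crc32(key.encode("utf-8")) % num_subtasks


def split_stream(tuples, num_subtasks):
    """Split (key, timestamp, value) tuples into one FIFO list per subtask."""
    partitions = [[] for _ in range(num_subtasks)]
    for key, timestamp, value in tuples:
        partitions[partition_for(key, num_subtasks)].append((key, timestamp, value))
    return partitions


# Hypothetical input stream: tuples are <key, timestamp, value>.
stream = [("a", 1, 10), ("b", 2, 20), ("a", 3, 30), ("c", 4, 40)]
parts = split_stream(stream, num_subtasks=2)
```

Because routing depends only on the key, a stateful operator subtask sees all updates for the keys it owns.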

  5. Example: Streaming Inverted Index • What could the architecture look like? • What kind of streams would we have? • What kind of operators?

  6. Windowing • Windows create batches for bounded operations • Example: aggregation • Tuples replicated across multiple windows
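
A small sketch of why a tuple can be replicated across windows, assuming sliding windows with integer event timestamps. The functions `sliding_windows` and `windowed_sums`, and the size and slide values, are illustrative assumptions and not part of the slides.

```python
from collections import defaultdict


def sliding_windows(timestamp, size, slide):
    """Return every [start, end) window of the given size that contains timestamp."""
    # Latest window start at or before the timestamp (starts are multiples of slide).
    start = timestamp - (timestamp % slide)
    windows = []
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def windowed_sums(tuples, size, slide):
    """Aggregate values per (key, window); each tuple is added to all its windows."""
    sums = defaultdict(int)
    for key, timestamp, value in tuples:
        for window in sliding_windows(timestamp, size, slide):
            sums[(key, window)] += value
    return dict(sums)


# Windows of 10 time units sliding by 5: each tuple falls into two windows.
sums = windowed_sums([("k", 7, 1), ("k", 12, 2)], size=10, slide=5)
```

With `slide` equal to `size` the windows become tumbling and each tuple belongs to exactly one window.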

  7. Event and Processing Time • Event time • Time when the event occurs • Associated with the event itself • Immutable • Processing time • Time when the event is processed • Depends on system implementation • Mutable • Q: Which one is easier to program with?

  8. Watermarks • Event-time processing • Requires reordering since event time ≠ processing time • Example: process all events generated [from, to) • Low watermarks • How to know that I got all events until event-time T? • Watermark is a special message telling us that • Forwarded throughout dataflow graph • If an operator has multiple input channels, forward minimum (earliest) low watermark across inputs • Punctuations • Similar to watermarks
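
The rule "forward the minimum low watermark across inputs" can be sketched as follows. `WatermarkTracker` and its method names are hypothetical, and the sketch assumes watermarks on a single channel never move backwards.

```python
class WatermarkTracker:
    """Tracks the low watermark of an operator with several input channels."""

    def __init__(self, num_inputs):
        # No channel has reported yet, so nothing can be forwarded.
        self.per_input = [float("-inf")] * num_inputs
        self.emitted = float("-inf")

    def on_watermark(self, channel, watermark):
        """Record a watermark; return the new combined watermark or None."""
        self.per_input[channel] = max(self.per_input[channel], watermark)
        combined = min(self.per_input)  # earliest watermark across all inputs
        if combined > self.emitted:
            self.emitted = combined
            return combined  # forward downstream
        return None  # the slowest input has not advanced; forward nothing


tracker = WatermarkTracker(num_inputs=2)
forwarded = [
    tracker.on_watermark(0, 10),  # channel 1 has not reported yet
    tracker.on_watermark(1, 5),   # minimum is now 5
    tracker.on_watermark(1, 20),  # minimum is now 10 (channel 0)
    tracker.on_watermark(0, 15),  # minimum is now 15
]
```

The operator only advances as fast as its slowest input, which is why one stalled channel can hold back every downstream window.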

  9. Triggers • Watermarks can be too fast (when?) or too slow (when?) • Triggers are used to process a window based on a processing time signal • Problem: what to do with the window once triggered? • Discard? Accumulate? Retract?

  10. Spark Structured Streaming • Define a relational query on a stream • System incrementally updates output tables

  11. System Implementation

  12. Stream Processing Systems • “Pure” streaming systems • Tuple-at-a-time semantics • Examples: Apache Storm, Apache Flink • Micro-batching • Create small batches of inputs and then execute batched computation • Example: Spark Streaming • System implementation ≠ programming semantics

  13. Control vs. Data Messages • Control messages are injected in event stream • Checkpoint markers • Inserting them in stream helps take a consistent snapshot • Watermarks for windowing • Inserting them in stream allows triggering windows • Coordination barriers • Inserting them in stream allows marking events as before or after the barrier

  14. Implementing Batch on Streaming • DataSet abstraction in Flink • Example • Q: How to implement map-reduce on Flink? • A: Control messages • Mappers send an “eof” marker to each reducer when done • Reducers do not process until they receive markers from all mappers
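
The "eof" marker idea can be illustrated with a toy reducer that buffers input and only reduces once every mapper has signalled completion. This is a sketch of the control-message pattern under simple assumptions (in-memory buffering, a hypothetical `Reducer` class and `EOF` sentinel); it is not Flink's actual implementation or API.

```python
EOF = object()  # hypothetical control message marking the end of a mapper's output


class Reducer:
    """Buffers (key, value) pairs and reduces only after all mappers sent EOF."""

    def __init__(self, num_mappers, reduce_fn):
        self.pending = set(range(num_mappers))  # mappers that have not finished
        self.buffer = {}
        self.reduce_fn = reduce_fn
        self.result = None  # stays None until the input is known to be complete

    def receive(self, mapper_id, message):
        if message is EOF:
            self.pending.discard(mapper_id)
            if not self.pending:
                # Input is now bounded: run the batch reduce over the buffer.
                self.result = {k: self.reduce_fn(v) for k, v in self.buffer.items()}
            return
        key, value = message
        self.buffer.setdefault(key, []).append(value)


reducer = Reducer(num_mappers=2, reduce_fn=sum)
reducer.receive(0, ("word", 1))
reducer.receive(1, ("word", 1))
reducer.receive(0, EOF)
partial = reducer.result  # still None: mapper 1 may send more data
reducer.receive(1, ("other", 1))
reducer.receive(1, EOF)
```

The markers turn an unbounded channel into a bounded one from the reducer's point of view, which is what makes the one-shot batch computation possible.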

  15. Fault Tolerance • Streaming: stateful operators • Cannot rerun the whole stream from beginning • Spark Streaming: Lineage (Spark) + checkpoints • Flink: Periodic checkpointing of stateful operators • Export API to application to define state to be checkpointed • How to checkpoint?

  16. Distributed Checkpoints • Uncoordinated Checkpoint? Domino effect • Coordinated checkpoint • Consistent cut: no message received but not sent • Distributed checkpointing protocol (Chandy-Lamport)

  17. Chandy-Lamport Protocol • Assumptions • Originator process starts it • FIFO channels • One checkpoint at a time • Goal: checkpoint state + all in-flight messages • Algorithm • Originator checkpoints its state and sends checkpoint marker • Upon receiving checkpoint marker • Checkpoint and send checkpoint marker on each channel • Record subsequent messages on each channel until the checkpoint marker is received back on it
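
A single-threaded toy simulation of the algorithm above, assuming two processes whose state is an account balance and whose messages are money transfers. The `Process` class, its method names, and the scenario are illustrative assumptions. The point is that snapshot state plus recorded in-flight messages add up to the true total.

```python
from collections import deque

MARKER = "MARKER"  # checkpoint marker (control message)


class Process:
    def __init__(self, name, balance):
        self.name, self.balance = name, balance
        self.inbox = {}       # sender name -> FIFO channel (deque)
        self.peers = {}       # receiver name -> Process
        self.snapshot = None  # checkpointed local state
        self.recording = {}   # sender name -> in-flight messages recorded
        self.open = set()     # incoming channels still being recorded

    def connect(self, other):
        self.peers[other.name] = other
        other.inbox[self.name] = deque()

    def send(self, to, amount):
        self.balance -= amount
        self.peers[to].inbox[self.name].append(amount)

    def checkpoint(self):
        """Checkpoint local state and send a marker on each outgoing channel."""
        self.snapshot = self.balance
        self.recording = {sender: [] for sender in self.inbox}
        self.open = set(self.inbox)
        for peer in self.peers.values():
            peer.inbox[self.name].append(MARKER)

    def deliver(self, sender):
        """Process the next message on the FIFO channel from sender."""
        msg = self.inbox[sender].popleft()
        if msg == MARKER:
            if self.snapshot is None:
                self.checkpoint()      # first marker seen: checkpoint now
            self.open.discard(sender)  # stop recording this channel
        else:
            self.balance += msg
            if self.snapshot is not None and sender in self.open:
                self.recording[sender].append(msg)  # message was in flight


p, q = Process("p", 100), Process("q", 50)
p.connect(q)
q.connect(p)
p.send("q", 10)
q.send("p", 5)
p.checkpoint()   # p is the originator
q.send("p", 7)   # sent before q has seen the marker
q.deliver("p")   # the transfer of 10
q.deliver("p")   # marker: q checkpoints, channel p->q recorded as empty
for _ in range(3):
    p.deliver("q")  # 5 and 7 are recorded, then q's marker closes the channel
total = p.snapshot + q.snapshot + sum(p.recording["q"]) + sum(q.recording["p"])
```

The FIFO assumption matters here: the marker from `p` arrives after the transfer of 10, so that transfer is part of `q`'s checkpointed state and is not recorded as in flight.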

  18. Load Balancing • How to balance load in tuple-at-a-time system? • Redistribute keys • Move key from one server to another • Requires migrating operator state • Replicate and aggregate • Multiple copies of the same operator compute partial results • A downstream operator aggregates them • Power of both choices
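
One way to read "replicate and aggregate" together with "power of both choices" is the following sketch: each key has two candidate workers, every tuple goes to the currently less loaded of the two, and a downstream step merges the partial counts. The function names and the choice of CRC32 and Adler-32 as the two hash functions are assumptions made for illustration.

```python
import zlib
from collections import Counter


def candidates(key, num_workers):
    """Two candidate workers per key, from two different hash functions."""
    data = key.encode("utf-8")
    return zlib.crc32(data) % num_workers, zlib.adler32(data) % num_workers


def route(keys, num_workers):
    """Send each tuple to the less loaded of its key's two candidate workers."""
    load = [0] * num_workers
    partial = [Counter() for _ in range(num_workers)]  # per-worker partial counts
    for key in keys:
        first, second = candidates(key, num_workers)
        worker = first if load[first] <= load[second] else second
        load[worker] += 1
        partial[worker][key] += 1
    return load, partial


def aggregate(partial):
    """Downstream operator: merge the partial results of all workers."""
    total = Counter()
    for counts in partial:
        total.update(counts)  # Counter.update adds counts together
    return total


keys = ["hot"] * 8 + ["cold"] * 2  # skewed input: one key dominates
load, partial = route(keys, num_workers=4)
totals = aggregate(partial)
```

State for a key is split across at most two workers, so the aggregation step only has to merge two partial results per key.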

  19. Lambda Architecture • [Diagram labels: Incoming data stream; Persistent store (e.g. Kafka); Batch Processing: periodically recreate accurate results; Stream Processing: update real-time approximate results; Result of analysis (e.g. Index)] • Compromise between accuracy and freshness • Requires maintaining 2 platforms and 2 implementations

  20. Regulating the Data Flow • Backpressure • If receiver operator cannot process inputs fast enough… • ... Block or slow down senders (recursively if needed) • Intermediate buffer pools (queues) • Decouple communication from consuming messages • Conflicting requirements • Throughput: batch output messages, don’t send one by one • Latency: send messages asap • Tradeoff: Send when either • Max batch size reached (e.g. 1 kB) or • Timeout (e.g. 5 milliseconds)
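
The size-or-timeout tradeoff can be sketched with a small output buffer. `OutputBuffer` is a hypothetical class; the caller passes the current time explicitly so the behaviour is deterministic, and the thresholds mirror the slide's examples (1 kB, 5 milliseconds).

```python
class OutputBuffer:
    """Batches outgoing messages; flushes on max batch size or on timeout."""

    def __init__(self, max_bytes, timeout_s, send):
        self.max_bytes = max_bytes
        self.timeout_s = timeout_s
        self.send = send    # callback that ships one batch downstream
        self.pending = []
        self.size = 0
        self.oldest = None  # arrival time of the oldest buffered message

    def add(self, message: bytes, now: float):
        if not self.pending:
            self.oldest = now
        self.pending.append(message)
        self.size += len(message)
        if self.size >= self.max_bytes:  # throughput: the batch is full
            self.flush()

    def tick(self, now: float):
        """Called periodically; bounds the latency of a partially filled batch."""
        if self.pending and now - self.oldest >= self.timeout_s:
            self.flush()

    def flush(self):
        if self.pending:
            self.send(list(self.pending))
            self.pending.clear()
            self.size = 0


sent = []
buf = OutputBuffer(max_bytes=1000, timeout_s=0.005, send=sent.append)
buf.add(b"x" * 600, now=0.000)  # 600 B buffered, nothing sent yet
buf.add(b"y" * 600, now=0.001)  # 1200 B >= 1 kB: flushed as one batch
buf.add(b"z", now=0.002)        # small message waits in the buffer
buf.tick(now=0.004)             # only 2 ms old: keep waiting
buf.tick(now=0.008)             # 6 ms old: flushed by timeout
```

A larger `max_bytes` favours throughput, a smaller `timeout_s` favours latency.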

  21. Exercise

  22. Exercise: Online Store • Two input streams • One has ad clicks: <user_ID, time, ad_ID> • The other has ad impressions: <ad_ID, time, user_ID> • Design a dataflow application that • Correlates ad impressions with user clicks (for billing) • Correlated = click happens within 10 seconds after ad • Questions • Which operators? How are they partitioned? • Watermarks? • Triggers?

  23. Possible Implementation • Streaming operator • Receives both streams • Partitioned by ad_ID • Join by ad_ID and return <ad_ID, user_ID, ad_time> • Windowing: session • When an ad appears, gather all clicks for 5 minutes • Watermarks • Both input streams emit a low watermark every second • Earliest low watermark triggers the window • Triggers: aggregate and retract
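
A simplified sketch of one subtask of such a join operator. It uses the exercise's 10-second correlation rule rather than the 5-minute gathering window, ignores retractions, and assumes a watermark T means all events with time up to and including T have arrived. `ClickJoin` and its method names are hypothetical.

```python
from collections import defaultdict

WINDOW_S = 10  # a click correlates if it occurs within 10 s after the impression


class ClickJoin:
    """Joins impressions with clicks for the ad_IDs routed to this subtask."""

    def __init__(self):
        self.impressions = defaultdict(list)  # (ad_id, user_id) -> impression times
        self.clicks = defaultdict(list)       # (ad_id, user_id) -> click times

    def on_impression(self, ad_id, time, user_id):
        self.impressions[(ad_id, user_id)].append(time)

    def on_click(self, user_id, time, ad_id):
        self.clicks[(ad_id, user_id)].append(time)

    def on_watermark(self, watermark):
        """Emit <ad_ID, user_ID, ad_time> for impressions whose window is complete."""
        results = []
        for (ad_id, user_id), times in self.impressions.items():
            still_open = []
            for ad_time in times:
                if ad_time + WINDOW_S > watermark:
                    still_open.append(ad_time)  # matching clicks may still arrive
                elif any(ad_time <= c <= ad_time + WINDOW_S
                         for c in self.clicks[(ad_id, user_id)]):
                    results.append((ad_id, user_id, ad_time))
            self.impressions[(ad_id, user_id)] = still_open
        # Clicks at or before watermark - WINDOW_S cannot match any open impression.
        for pair in self.clicks:
            self.clicks[pair] = [c for c in self.clicks[pair]
                                 if c > watermark - WINDOW_S]
        return sorted(results)


join = ClickJoin()
join.on_impression("ad1", 100, "u1")
join.on_impression("ad1", 100, "u2")
join.on_click("u1", 105, "ad1")  # 5 s after the impression: correlated
join.on_click("u2", 120, "ad1")  # 20 s after the impression: too late
early = join.on_watermark(105)   # window [100, 110] is not complete yet
billed = join.on_watermark(130)
```

Waiting for the watermark trades latency for completeness; a processing-time trigger could emit early results at the cost of having to retract or update them later.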
