CS 744: DATAFLOW
Shivaram Venkataraman
Fall 2020
ADMINISTRIVIA
- Assignment 2 grades are up! (on Canvas)
- Midterm grading in progress
- Course project proposal comments: peer feedback and instructor feedback, this week (Thursday)
- AEFIS feedback (next slide)
AEFIS FEEDBACK
- Improve writing on the slides, speak slower (better organization)
- Get a better internet connection? Better microphone? (Let me know how this sounds.)
- More office hour slots
- Discussion groups: same group each time? Also add prof. input
- More time for Midterm exam, more guidance on deliverables
- More homework/hands-on experience vs. too many evaluation components?
COURSE STACK (top to bottom)
- Applications: Machine Learning, SQL, Streaming (stream processing), Graph
- Computational Engines: Spark, MapReduce
- Scalable Storage Systems: GFS
- Resource Management: Mesos, DRF
- Datacenter Architecture
DATAFLOW MODEL (?)
- A DAG of operators (as in Spark, PyTorch)
MOTIVATION
Streaming Video Provider (e.g. ESPN.com)
- How much to bill each advertiser? (each video has some ads)
- Need per-user, per-video viewing sessions
- Handle out of order data (e.g. a phone that was offline)
Goals
- Handle unbounded, out-of-order data
- Easy to program
- Balance correctness, latency and cost
  - Correctness: how accurate your results are
  - Latency: how much delay till results are available
APPROACH
API Design (for developers writing applications)
- Separate user-facing model (the Dataflow Model) from execution (the framework)
Decompose queries into
- What is being computed
- Where in time is it computed
- When is it materialized
- How does it relate to earlier results
The framework can then choose how to execute, e.g. for viewing events:
1. Batch (MapReduce): process one day of events at a time
2. Streaming: process events as and when they arrive, producing output as data is processed
TERMINOLOGY
- Unbounded/bounded data: unbounded data is constantly arriving
- Streaming/Batch execution (see previous slide)
- Timestamps
  - Event time: time at which the event occurs (w.r.t. user input), e.g. the time at which an ad in a video was viewed on ESPN.com
  - Processing time: time at which an event is processed, e.g. the time at which an ad-view event is processed to update the dashboard
WINDOWING
- Windows are logical constructs over event time, applied across keys (e.g. 10am to 11am)
- Fixed windows: do not overlap
- Sliding windows: consecutive windows overlap with each other
- Sessions
[Figure: fixed, sliding and session windows laid out across keys]
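A minimal sketch of how a timestamp could be mapped to fixed and sliding windows. The class and method names (`WindowAssign`, `fixedWindow`, `slidingWindows`) are hypothetical and not part of any Dataflow SDK; timestamps are plain `long` values in whatever unit the caller picks (minutes since midnight in the usage note).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: windows are half-open intervals [start, end) stored as long[2].
class WindowAssign {
    // Fixed windows: each timestamp falls in exactly one window of length `size`.
    static long[] fixedWindow(long ts, long size) {
        long start = ts - Math.floorMod(ts, size);
        return new long[] {start, start + size};
    }

    // Sliding windows: a timestamp falls in every window of length `size`
    // whose start is a multiple of `period` and which still covers ts.
    static List<long[]> slidingWindows(long ts, long size, long period) {
        List<long[]> out = new ArrayList<>();
        long lastStart = ts - Math.floorMod(ts, period);
        for (long start = lastStart; start > ts - size; start -= period) {
            out.add(new long[] {start, start + size});
        }
        return out;
    }
}
```

With minutes since midnight, an event at 10:05 (605) lands in the fixed window 10:00 to 11:00, and in two sliding windows when the size is 60 and the period is 30 (starting at 10:00 and 9:30).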
WATERMARK or SKEW
- Watermark: the system has processed all events up to a given event time (e.g. 12:02:30)
- The watermark is not easy to know, so you use heuristics
- Skew: the gap between event time and processing time (ideal: no gap)
- Event time lags processing time; e.g. after 10 mins, most devices catch up
[Figure: event time vs. processing time, showing the ideal line and the event-time skew]
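One simple heuristic, sketched under the assumption that events are rarely more than a fixed lag behind the newest event seen so far. This is not how any particular system computes its watermark; `HeuristicWatermark`, `allowedLag` and the method names are hypothetical.

```java
// Illustrative heuristic: watermark = newest event time seen minus an assumed lag.
class HeuristicWatermark {
    private final long allowedLag;              // assumed bound on lateness (e.g. 10 minutes)
    private long maxEventTime = Long.MIN_VALUE; // newest event time observed so far

    HeuristicWatermark(long allowedLag) {
        this.allowedLag = allowedLag;
    }

    // Skew of a single event: how far processing time trails event time.
    static long skew(long eventTime, long processingTime) {
        return processingTime - eventTime;
    }

    void observe(long eventTime) {
        maxEventTime = Math.max(maxEventTime, eventTime);
    }

    // Current guess: "all events up to this event time have been seen".
    long current() {
        return maxEventTime == Long.MIN_VALUE ? Long.MIN_VALUE : maxEventTime - allowedLag;
    }

    // An event older than the watermark violates the heuristic, i.e. it is late.
    boolean isLate(long eventTime) {
        return eventTime < current();
    }
}
```

A larger `allowedLag` makes late data rarer but delays results; a smaller one does the opposite, which is the correctness vs. latency trade-off from the goals.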
API
- ParDo: Map in MapReduce, or flatMap in Spark
- GroupByKey: Reduce in MapReduce
- Windowing
  - AssignWindow: buckets a tuple into windows, based on the windowing strategy
  - MergeWindow: merges windows (e.g. sessions)
EXAMPLE
Assign tuples to sessions
- Each tuple is assigned a window based on its timestamp
- Windows that overlap are merged
- GroupByKey
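The session example can be sketched for a single key as follows. This assumes each tuple is first given the window [timestamp, timestamp + gap) and that overlapping windows are then merged; `SessionWindows`, `sessions` and `gap` are hypothetical names, not SDK calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: session windows for the timestamps of ONE key.
class SessionWindows {
    static List<long[]> sessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<long[]> merged = new ArrayList<>();
        for (long ts : sorted) {
            // Assign step: every tuple starts with its own window [ts, ts + gap).
            long[] w = {ts, ts + gap};
            if (!merged.isEmpty() && w[0] < merged.get(merged.size() - 1)[1]) {
                // Merge step: the window overlaps the previous session, so extend it.
                long[] last = merged.get(merged.size() - 1);
                last[1] = Math.max(last[1], w[1]);
            } else {
                merged.add(w);
            }
        }
        return merged;
    }
}
```

For timestamps 10, 25, 100, 12 and a gap of 30, the first three tuples collapse into one session [10, 55) and the last forms its own session [100, 130). Out-of-order arrival does not change the result because merging depends only on event timestamps.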
TRIGGERS AND INCREMENTAL PROCESSING
Windowing: where in event time data are grouped
Triggering: when in processing time groups are emitted
Strategies (example: a first pane with sum 5, then new data with sum 6)
- Discarding: second output = 6
- Accumulating: second output = 11
- Accumulating & Retracting: second output = retract 5, then 11
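The three strategies can be compared with a small sketch. It assumes the aggregation is an integer sum and represents a retraction as a negated value, which is a simplification chosen for illustration; `RefinementModes` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: paneSums[i] is the sum of the NEW data seen before firing i.
class RefinementModes {
    // Discarding: each firing emits only the new data, then forgets it.
    static List<Integer> discarding(int[] paneSums) {
        List<Integer> out = new ArrayList<>();
        for (int s : paneSums) {
            out.add(s);
        }
        return out;
    }

    // Accumulating: each firing emits the running total so far.
    static List<Integer> accumulating(int[] paneSums) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        for (int s : paneSums) {
            total += s;
            out.add(total);
        }
        return out;
    }

    // Accumulating & retracting: before each new total, retract the previous one
    // (shown here as the negated previous total).
    static List<Integer> accumulatingAndRetracting(int[] paneSums) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        for (int i = 0; i < paneSums.length; i++) {
            if (i > 0) {
                out.add(-total);
            }
            total += paneSums[i];
            out.add(total);
        }
        return out;
    }
}
```

For pane sums 5 and 6 the outputs are [5, 6], [5, 11] and [5, -5, 11]. A downstream consumer that simply adds up everything it receives gets the right total (11) under discarding and under retracting, but double counts (16) under plain accumulating.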
RUNNING EXAMPLE
    PCollection<KV<String, Integer>> input = IO.read(...);
    PCollection<KV<String, Integer>> output = input
      .apply(Sum.integersPerKey());
- Single sum for each key
GLOBAL WINDOWS, ACCUMULATE
    PCollection<KV<String, Integer>> output = input
      .apply(Window.trigger(Repeat(AtPeriod(1, MINUTE)))
                   .accumulating())
      .apply(Sum.integersPerKey());
- The trigger fires every 1 minute of processing time
- Each output accumulates the earlier result (e.g. 12 + 10 = 22)
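A minimal simulation of a periodic, accumulating trigger over a global window for a single key. The input values are made up for illustration, and the sketch assumes a firing happens at every period boundary even when no new data arrived; `PeriodicAccumulating` and `run` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: procTimes must be sorted ascending (arrival order).
class PeriodicAccumulating {
    static List<Integer> run(long[] procTimes, int[] values, long period) {
        List<Integer> panes = new ArrayList<>();
        int total = 0;
        long nextFire = period;
        for (int i = 0; i < procTimes.length; i++) {
            // Fire for every period boundary that passed before this element arrived.
            while (procTimes[i] >= nextFire) {
                panes.add(total);
                nextFire += period;
            }
            total += values[i]; // accumulating: state is never cleared
        }
        panes.add(total); // firing at the boundary after the last element
        return panes;
    }
}
```

With processing times 10, 40, 70, 130 seconds, values 5, 7, 10, 11 and a 60 second period, the emitted panes are 12, 22 and 33: each one repeats everything emitted before plus the new data.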
GLOBAL WINDOWS, COUNT, DISCARDING
    PCollection<KV<String, Integer>> output = input
      .apply(Window.trigger(Repeat(AtCount(2)))
                   .discarding())
      .apply(Sum.integersPerKey());
- The trigger fires after every 2 elements; the emitted values are then discarded
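A sketch of a count-based trigger with discarding, assuming the count is tracked separately for each key. The class `CountTriggerDiscarding`, its `run` method and the `key=sum` output format are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: keys[i] and values[i] describe the i-th element in arrival order.
class CountTriggerDiscarding {
    static List<String> run(String[] keys, int[] values, int count) {
        Map<String, int[]> state = new HashMap<>(); // key -> {partial sum, elements seen}
        List<String> panes = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            int[] s = state.computeIfAbsent(keys[i], k -> new int[2]);
            s[0] += values[i];
            s[1] += 1;
            if (s[1] == count) {
                panes.add(keys[i] + "=" + s[0]);
                // Discarding: clear the state so the next pane holds only new data.
                s[0] = 0;
                s[1] = 0;
            }
        }
        return panes;
    }
}
```

For keys a, b, a, a, b with values 5, 7, 3, 4, 3 and a count of 2, the panes are `a=8` and `b=10`. The trailing value 4 for key a stays buffered because its pane has not yet reached two elements.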
FIXED WINDOWS, MICRO BATCH
    PCollection<KV<String, Integer>> output = input
      .apply(Window.into(FixedWindows.of(2, MINUTES))
                   .trigger(Repeat(AtWatermark()))
                   .accumulating())
      .apply(Sum.integersPerKey());
- Example outputs per window: 12:00 to 12:02 = 5, 12:02 to 12:04 = 14, 12:04 to 12:06 = 3
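A sketch of fixed windows that fire when the watermark passes the end of the window, in accumulating mode, for a single key. It assumes the watermark is supplied by the caller and that a late element makes its window fire again with the updated sum. `FixedWindowWatermark` and its methods are hypothetical names; time is in minutes past 12:00.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: sums are kept forever (accumulating), windows are [start, start + size).
class FixedWindowWatermark {
    private final long size;
    private final TreeMap<Long, Integer> sums = new TreeMap<>(); // window start -> sum
    private final List<String> output = new ArrayList<>();
    private long watermark = Long.MIN_VALUE;

    FixedWindowWatermark(long size) {
        this.size = size;
    }

    void onElement(long eventTime, int value) {
        long start = eventTime - Math.floorMod(eventTime, size);
        int sum = sums.merge(start, value, Integer::sum);
        if (start + size <= watermark) {
            emit(start, sum); // late data: the window already fired, so fire again
        }
    }

    void onWatermark(long newWatermark) {
        for (Map.Entry<Long, Integer> e : sums.entrySet()) {
            long end = e.getKey() + size;
            if (end > watermark && end <= newWatermark) {
                emit(e.getKey(), e.getValue()); // the watermark just passed this window
            }
        }
        watermark = Math.max(watermark, newWatermark);
    }

    List<String> output() {
        return output;
    }

    private void emit(long start, int sum) {
        output.add("[" + start + "," + (start + size) + ")=" + sum);
    }
}
```

Feeding it the elements (1, 5) and (2, 7), watermark 2, elements (3, 3), (3, 4), (4, 3), watermark 4, a late element (1, 9), and finally watermark 6 yields `[0,2)=5`, `[2,4)=14`, `[0,2)=14`, `[4,6)=3`. The third output is the refinement caused by the late element.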
SUMMARY/LESSONS
Design for unbounded data: Don't rely on completeness
Be flexible, diverse use cases
- Billing
- Recommendation
- Anomaly detection
Windowing, Trigger API to simplify programming on unbounded data
DISCUSSION https://forms.gle/jwHjTBbR49vyQASq6
Discussion notes
- (a) Streaming with fixed windows: a window fires every time the watermark passes it (assume the watermark is given)
- Micro-batch: partial sums are emitted per batch, so fewer outputs and worse latency than streaming
- Ingest: events carry an event timestamp and get a processing time when they enter the system (e.g. via Pub/Sub or Apache Kafka)
- Query results are updated over time and persisted to disk
Consider you are implementing a micro-batch streaming API on top of Apache Spark. What are some of the bottlenecks/challenges you might have in building such a system?
NEXT STEPS
- Next class: Naiad
- Course project proposal peer feedback