Have Your Cake & Eat It Too: Further Dispelling the Myths of the Lambda Architecture
Tyler Akidau, Staff Software Engineer
Google Docs version of slides (with animations) available at: http://goo.gl/eX5kxa
MillWheel - Stream Processing System
Streaming Flume - High-level API
Cloud Dataflow - Data Processing Service
Google Cloud Dataflow (service diagram: Optimize, Schedule, GCS)
MillWheel - Slava Chernyak, Josh Haberman, Reuven Lax, Daniel Mills, Paul Nordstrom, Sam McVeety, Sam Whittle, and more... Streaming Flume - Robert Bradshaw, Daniel Mills, and more... Cloud Dataflow - Robert Bradshaw, Craig Chambers, Reuven Lax, Daniel Mills, Frances Perry, and more...
Cloud Dataflow is unreleased. Things may change.
Agenda 1 Lambda vs Streaming 2 Strong Consistency 3 Reasoning About Time
Lambda vs Streaming 1
http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
The Lambda Architecture
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
The Evolution of Streaming
What does it take? Strong Consistency Tools for Reasoning About Time
Strong Consistency 2
Consistent Storage
Why consistency is important • Mostly correct is not good enough • Required for exactly-once processing • Required for repeatable results • Cannot replace batch without it
How? • Sequencers (e.g. BigTable) • Leases (e.g. Spanner) • Federation of storage silos (e.g. Samza, Dataflow) • RDDs (e.g. Spark)
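To make the exactly-once idea concrete, here is a minimal hypothetical sketch (the class, method, and persistence call are invented for illustration, not MillWheel's or Dataflow's actual code): duplicate deliveries are detected by remembering which record IDs have already been applied, and that dedup set is committed atomically with the user state.

    // Hypothetical sketch: exactly-once state updates via record-ID deduplication.
    // All names are illustrative; a real system would persist state and dedup IDs
    // together in a single consistent write (e.g. one BigTable row mutation).
    import java.util.HashSet;
    import java.util.Set;

    class ConsistentCounter {
      private long count = 0;                                  // user state
      private final Set<String> appliedIds = new HashSet<>();  // dedup state

      synchronized void process(String recordId, long value) {
        if (appliedIds.contains(recordId)) {
          return; // a retry of an already-applied record: ignore it
        }
        count += value;
        appliedIds.add(recordId);
        commitAtomically(count, appliedIds); // hypothetical persistence call
      }

      private void commitAtomically(long newCount, Set<String> ids) {
        // In a real system this is one atomic, strongly consistent write; omitted here.
      }
    }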
http://research.google.com/pubs/pub41378.html
Reasoning About Time 3
Event Time vs Stream Time
Batch vs Streaming
Approaches
Dataflow API
Event Time - When Events Happened
Stream Time - When Events Are Processed
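A minimal illustration of the distinction (types and field names are hypothetical, not part of the talk): each record carries the time at which it actually happened, while stream time is simply whenever the pipeline happens to process it.

    // Hypothetical sketch: a record's event time vs. the stream time at which it is processed.
    class LoggedRequest {
      final String url;
      final long eventTimeMillis; // when the request actually happened

      LoggedRequest(String url, long eventTimeMillis) {
        this.url = url;
        this.eventTimeMillis = eventTimeMillis;
      }

      // Stream time is "now", i.e. when the pipeline observes the record.
      long skewMillis() {
        return System.currentTimeMillis() - eventTimeMillis;
      }
    }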
Batch vs Streaming
Batch MapReduce
Batch: Fixed Windows (diagram: the day split into hourly windows from [10:00 - 11:00) through [23:00 - 0:00), each processed as a batch with MapReduce)
Batch: User Sessions (diagram: activity for users Joan, Larry, Ingo, Amanda, Cheryl, and Arthur grouped into per-user sessions with MapReduce over hourly batches)
Streaming (diagram: unbounded input arriving continuously along a 10:00 - 16:00 time axis)
Confounding characteristics of data streams • Unordered • Unbounded • Of varying event time skew
Event Time Skew (diagram: skew between the stream time and event time axes)
Approaches
Approaches to reasoning about time 1. Time-Agnostic Processing 2. Approximation 3. Stream Time Windowing 4. Event Time Windowing
1. Time-Agnostic Processing - Filters (diagram: elements filtered as they flow along the stream-time axis)
Example Input: Web server traffic logs
Example Output: All traffic from specific domains
Pros: Straightforward; Efficient
Cons: Limited utility
1. Time-Agnostic Processing - Hash Join (diagram: two streams joined as they flow along the stream-time axis)
Example Input: Query & Click traffic
Example Output: Joined stream of Query + Click pairs
Pros: Straightforward; Efficient
Cons: Limited utility
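As a hedged sketch of such a time-agnostic hash join (class and method names are invented for illustration): buffer whichever side of the join arrives first and emit a pair as soon as its partner shows up; no notion of time is required.

    // Hypothetical sketch of a streaming hash join keyed by query ID.
    import java.util.HashMap;
    import java.util.Map;

    class QueryClickJoiner {
      private final Map<String, String> pendingQueries = new HashMap<>();
      private final Map<String, String> pendingClicks = new HashMap<>();

      void onQuery(String queryId, String queryText) {
        String click = pendingClicks.remove(queryId);
        if (click != null) {
          emit(queryId, queryText, click);    // partner already arrived
        } else {
          pendingQueries.put(queryId, queryText); // buffer until the click shows up
        }
      }

      void onClick(String queryId, String clickedUrl) {
        String query = pendingQueries.remove(queryId);
        if (query != null) {
          emit(queryId, query, clickedUrl);
        } else {
          pendingClicks.put(queryId, clickedUrl);
        }
      }

      private void emit(String queryId, String query, String click) {
        System.out.println(queryId + ": " + query + " -> " + click);
      }
    }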
2. Approximation via Online Algorithms (diagram: approximate results computed as elements flow along the stream-time axis)
Example Input: Twitter hashtags
Example Output: Approximate top N hashtags per prefix
Pros: Efficient
Cons: Inexact; Complicated algorithms
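One well-known online algorithm of this kind is the Misra-Gries heavy-hitters sketch; the following minimal implementation (illustrative only, not the algorithm used in the talk) keeps at most k counters, and the keys that survive approximate the most frequent hashtags seen so far.

    // Hypothetical sketch: approximate top hashtags via the Misra-Gries algorithm.
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    class ApproxTopHashtags {
      private final int k;
      private final Map<String, Long> counters = new HashMap<>();

      ApproxTopHashtags(int k) { this.k = k; }

      void observe(String hashtag) {
        if (counters.containsKey(hashtag) || counters.size() < k) {
          counters.merge(hashtag, 1L, Long::sum);
        } else {
          // No room for a new key: decrement all counters, dropping any that hit zero.
          Iterator<Map.Entry<String, Long>> it = counters.entrySet().iterator();
          while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() <= 1) it.remove();
            else e.setValue(e.getValue() - 1);
          }
        }
      }

      Map<String, Long> approximateTop() { return counters; }
    }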
3. Windowing by Stream Time (diagram: the stream chopped into windows along the stream-time axis)
Example Input: Web server request traffic
Example Output: Per-minute rate of received requests
Pros: Straightforward; Results reflect contents of stream
Cons: Results don't reflect events as they happened; if approximating event time, usefulness varies
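A minimal sketch of stream-time windowing (names are hypothetical): bucket each request by the minute in which it arrives, ignoring when it actually happened.

    // Hypothetical sketch: per-minute request counts keyed by arrival (stream) time.
    import java.util.HashMap;
    import java.util.Map;

    class PerMinuteArrivalCounter {
      private final Map<Long, Long> countsByMinute = new HashMap<>();

      void onRequest() {
        long arrivalMinute = System.currentTimeMillis() / 60_000; // stream time, not event time
        countsByMinute.merge(arrivalMinute, 1L, Long::sum);
      }
    }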
4. Windowing by Event Time - Fixed Windows (diagram: elements shuffled from the stream-time axis into fixed windows on the event-time axis)
Example Input: Twitter hashtags
Example Output: Top N hashtags by prefix per hour
Pros: Reflects events as they occurred
Cons: More complicated buffering; Completeness issues
4. Windowing by Event Time - Sessions (diagram: elements shuffled from the stream-time axis into per-user sessions on the event-time axis)
Example Input: User activity stream
Example Output: Per-session group of activities
Pros: Reflects events as they occurred
Cons: More complicated buffering; Completeness issues
Dataflow API
What are you computing? Where in event time? When in stream time?
What = Aggregation API Where = Windowing API When = Watermarks + Triggers API
Aggregation API
PCollection<KV<String, Double>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(new Sum());
Aggregation API (diagram: per-key input elements combined by Sum into aggregate totals)
Streaming Mode (diagram: the same unbounded input plotted by stream time and by event time, 10:00 - 10:06)
Windowing API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new FixedWindows(2, MINUTES)))
    .apply(new Sum());
Windowing API (diagram: elements assigned to two-minute event-time windows by FixedWindows, then combined per window by Sum)
Watermarks • f(S) -> E • S = a point in stream time (i.e. now) • E = the point in event time up to which input data is complete as of S
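As one illustrative heuristic (not Dataflow's actual watermark implementation), a watermark can be estimated by assuming events arrive at most some bounded delay after they occur, so it trails the largest event time seen by that bound.

    // Hypothetical sketch of a heuristic watermark with a bounded-lateness assumption.
    class HeuristicWatermark {
      private final long maxDelayMillis;
      private long maxEventTimeSeen = Long.MIN_VALUE;

      HeuristicWatermark(long maxDelayMillis) {
        this.maxDelayMillis = maxDelayMillis;
      }

      void observe(long eventTimeMillis) {
        maxEventTimeSeen = Math.max(maxEventTimeSeen, eventTimeMillis);
      }

      // E = the point in event time up to which input is believed complete as of now (S).
      long currentWatermarkMillis() {
        return maxEventTimeSeen == Long.MIN_VALUE
            ? Long.MIN_VALUE
            : maxEventTimeSeen - maxDelayMillis;
      }
    }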
Event Time Skew (diagram: skew between stream time and event time, revisited in the watermark context)
Watermarks (diagram: the same windowed sums, with the watermark determining when each two-minute window's result can be considered complete)
Watermark Caveats Too slow = more latency Too fast = late data
Triggers When in stream time to emit?
Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new FixedWindows(2, MINUTES))
                 .trigger(new AtWatermark()))
    .apply(new Sum());
(diagram: per-window sums plotted on an event time vs. stream time grid from 10:00 to 10:06; each window's result is emitted once when the watermark passes it, and one late datum arrives near stream time 13:00, after its window has already fired)
A Better Strategy 1. Once per stream time minute 2. At watermark 3. Once per record for two weeks
(diagram: the same grid under the improved strategy; each window emits speculative results once per stream-time minute, a result at the watermark, and an updated result when the late datum finally arrives)
Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new FixedWindows(2, MINUTES))
                 .trigger(new SequenceOf(
                     new RepeatUntil(              // 1. once per stream-time minute
                         new AtPeriod(1, MINUTE),  //    until the watermark passes
                         new AtWatermark()),
                     new AtWatermark(),            // 2. at the watermark
                     new RepeatUntil(              // 3. once per late record,
                         new AfterCount(1),        //    for two weeks of event time
                         new AfterDelay(14, DAYS, TimeDomain.EVENT_TIME)))))
    .apply(new Sum());
Lambda vs Streaming • Low-latency, approximate results • Complete, correct results as soon as possible • Ability to deal with changes upstream
One Last Thing... What if I want sessions?
Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new Sessions(1, MINUTE))
                 .trigger(new SequenceOf(
                     new RepeatUntil(              // 1. once per stream-time minute
                         new AtPeriod(1, MINUTE),  //    until the watermark passes
                         new AtWatermark()),
                     new AtWatermark(),            // 2. at the watermark
                     new RepeatUntil(              // 3. once per late record,
                         new AfterCount(1),        //    for two weeks of event time
                         new AfterDelay(14, DAYS, TimeDomain.EVENT_TIME)))))
    .apply(new Sum());
(diagram: the same grid with session windows defined by one-minute inactivity gaps; early per-minute and watermark results are emitted, and the late datum causes neighboring sessions to merge and an updated result to be emitted)
Summary
• Lambda is great
• Streaming by itself is better :-)
• Strong Consistency = Correctness
• Streaming = Aggregation + Windowing + Triggers
• Tools For Reasoning About Time = Power + Flexibility
Thank you! Questions? Questions about this talk: takidau@google.com (Tyler Akidau) Questions about Cloud Dataflow: cloude@google.com (Eric Schmidt)