End-to-end Exactly-once Aggregation over Ad Streams Amiraj Dhawan - PowerPoint PPT Presentation

End-to-end Exactly-once Aggregation over Ad Streams Amiraj Dhawan Amit Ramesh

Yelp’s Mission Connecting people with great local businesses.

Outline ● Background & context ● Business requirements ● Design iterations ● Exactly-once aggregation ● What’s next?

Local Ads ● Work done within the Local Ads group ● Manage a few 100K ad campaigns daily ● Mom and pop stores to national chains ● Pipelines receive a few thousand msgs/sec ● Pipelines in production for more than a year

Local Ads – Consumer facing

Local Ads – Advertiser facing

Local Ads – Ad Campaign Management

Distilled Business Requirements ● Aggregate events over a day period ● Slice aggregates along defined dimensions ● Provide partial aggregates as day progresses ● Make aggregates as accurate as possible Day Dimension Dimension Dimension Aggregate Aggregate Aggregate 1 2 N 1 2 M

An Illustrative Example ● Count ad clicks over a day period ● Provide click counts by ad campaign ● Provide partial click counts as day progresses Day Campaign ID Number of clicks Day Campaign ID Number of clicks 4/17/2019 23265 35 4/17/2019 23265 42

Stream Processing 101 Input Stream(s) Output Stream(s) Stream Processing Engine Database

Windowed operations Tumbling window Sliding window

Why not... ∑ Processing pipeline Day Campaign ID Number of clicks 4/17/2019 23265 35

Why not... ∑ Processing pipeline Day Campaign ID Number of clicks 4/17/2019 23265 35 ● Need partial click counts as day progresses! ● Stateful operation

How about... ∆’s Processing pipeline Day Campaign ID Number of clicks 4/17/2019 23265 35

How about... ∆’s Processing pipeline Day Campaign ID Number of clicks 4/17/2019 23265 35 ● Cassandra has a Counter column type ● Integer type with increment and decrement

However... ● Counter is not meant to be idempotent ● Good for approximate metrics (likes/follows) ● Reported discrepancies of up to 5% ● Discrepancies due to being distributed ● No plans to make it idempotent

Alright... ∑ t ∑ t + ∆ Processing pipeline Day Campaign ID Number of clicks 4/17/2019 23265 35 ● Use Cassandra for the current count ● Increment in Spark and update Cassandra

Kafka 101 Partitions ● Data is in partitions 10 9 8 7 6 5 4 3 2 1 0 ● Partition is ordered Offsets 10 9 8 7 6 5 4 3 2 1 0 ● Consumers track their own progress 10 9 8 7 6 5 4 3 2 1 0

Spark Streaming 101 ● Micro-batching ● No pipelining ● App manages offset commits

Putting them together Kafka Offset Commit Stage 1 Stage 3 Stage 2 ∆ ∑ t + ∆ ∑ t ∑ t ∑ t + ∆

In the words of Ken Arnold Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure. Imagine asking people, "If the probability of something happening is one in ten to the thirteenth, how often would it happen?" Your natural human sense would be to answer, "Never." That is an infinitely large number in human terms. But if you ask a physicist, she would say, "All the time. In a cubic foot of air, those things happen all the time." When you design distributed systems, you have to say, "Failure happens all the time." So when you design, you design for failure. It is your number one concern.

Failure Modes Kafka Offset Commit Stage 1 Stage 3 Stage 2 ∆ ∑ t + ∆ ∑ t ∑ t ∑ t + ∆

At Least + At Most = Exactly-once ● Should be able to distinguish processed data ● Versioning rows is one way to do it ● Versions need to be monotonically increasing ● Data in Kafka partitions are already ordered ● Versioning can leverage data order

Basic Idea Day Campaign Number of Version ID clicks 4/17/2019 5 3 2 Offset 5 4 3 2 1 0 ID: 5 ID: 5 ID: 5 ID: 5 ID: 5 ID: 5 CLK CLK CLK CLK Commit

Basic Idea Day Campaign Number of Version ID clicks 4/17/2019 5 1 2 4/17/2019 9 2 1 Offset 5 4 3 2 1 0 ID: 5 ID: 9 ID: 5 ID: 5 ID: 9 ID: 9 CLK CLK CLK CLK Commit

Basic Idea 5 4 3 2 1 0 ID: 5 ID: 9 ID: 9 ID: 5 ID: 5 ID: 9 CLK CLK CLK CLK Day Campaign Number of Version ID clicks Partition 0 4/17/2019 5 2 P0: 2 P1: 3 5 4 3 2 1 0 4/17/2019 9 3 P0: 0 P1: 1 ID: 5 ID: 9 ID: 5 ID: 5 ID: 9 ID: 9 CLK CLK CLK CLK Partition 1

Exactly-once Aggregation Kafka Offset Commit Stage 2 Stage 3 Stage 1 Ver ∆ ∑ t + ∆ , ∑ t , Ver t Ver t+1 ∑ t ∑ t , Ver t ∑ t + ∆ , Ver t+1

Exactly-once Aggregation Kafka Offset Commit Stage 2 Stage 3 Stage 1 Ver ∆ ∑ t + ∆ , ∑ t , Ver Ver t+1 ∑ t ∑ t , Ver t ∑ t + ∆ , Ver t+1

Generalization ● Aggregation logic is in the pipeline ● Logic can be arbitrarily complex ● Does not have to be a mathematical function ● Strings, sets, lists, maps, etc.

What’s next? ● Windowed joins ○ As a specialization of aggregation ○ Allows for arbitrary business rules in joins ● Deduplication within aggregation ○ Input streams can typically have duplicates

We're Hiring! www.yelp.com/careers/

fb.com/YelpEngineers @YelpEngineering engineeringblog.yelp.com github.com/yelp

End-to-end Exactly-once Aggregation over Ad Streams Amiraj Dhawan - PowerPoint PPT Presentation

End-to-end Exactly-once Aggregation over Ad Streams Amiraj Dhawan Amit Ramesh Yelps Mission Connecting people with great local businesses. Outline Background & context Business requirements Design

WITH C++ Prof. Amr Goneid AUC Part 9. Streams & Files Prof. amr Goneid, AUC 1 Streams

Graphs Walks and Paths Bridges of Knigsberg Cross each bridge exactly once ?! Impossible!

EXACTLY ONCE STATEFUL STREAMS THE EASY WAY COLIN MACNAUGTHON NEEVE RESEACH INTRODUCTIONS

Stream Bank Stabilization in Open Space Streams in open space There are approximately 35

CSE 143 Streams as C++ Classes Streams are C++ classes Streams have lots of built-in

Elmwood Park: Electricity Aggregation Developing an Opt-In Municipal Aggregation Program to

simplifying the customer experience through account aggregation Sim Sangha Business Development

The Axiomatic Method in Social Choice Theory: Preference Aggregation, Judgment Aggregation, Graph

Frequency Counts Frequency Counts over over Data Streams Data Streams Gurmeet Singh Manku

SSA Form & SSA-form: x 17-4 Each name is defined exactly once. Dead Code Elimination

EOS: E Exactly xactly- -O Once E nce E- -S Service Middleware ervice Middleware EOS:

Streams and File I/O Fundamentals of Computer Science Outline Overview of Streams and File

Data Streams Many large sources of data are generated as streams of updates: IP Network

Comparing Data Streams Using Hamming Norms Graham Cormode, Mayur Datar, Piotr Indyk, S.

Data Streams Many large sources of data are generated as streams of updates: IP Network

Stream Algorithmics Albert Bifet March 2012 Data Streams Big Data & Real Time Data Streams

Escaping Saddle Points with Adaptive Gradient Methods Matthew Staib 1 , Sashank Reddi 2 ,

The Changing Landscape of Unmanned Systems Larry Osborn EVP & Chief Strategy Officer

University of Tennessee Knoxville, Tennessee Smokey Mountain National Park Durable Materials for

A Quantum Money solution to the Blockchain Scalability Problem Andrea Coladangelo, Or Sattath

VERITAS Observations Maria Krause Alexis Popkow of the Cygnus Region for the VERITAS

Par$$oning & Clustering Big Graphs George Karypis Department

An Approach for Detecting Learning Styles in Learning Management Systems Sabine Graf Kinshuk

Kubernetes+GlusterFS: Lightning Ver. Mohamed Ashiq Liazudeen & Jos A. Rivera

End-to-end Exactly-once Aggregation over Ad Streams Amiraj Dhawan - PowerPoint PPT Presentation

End-to-end Exactly-once Aggregation over Ad Streams Amiraj Dhawan Amit Ramesh Yelps Mission Connecting people with great local businesses. Outline Background & context Business requirements Design

WITH C++ Prof. Amr Goneid AUC Part 9. Streams &amp; Files Prof. amr Goneid, AUC 1 Streams

Graphs Walks and Paths Bridges of Knigsberg Cross each bridge exactly once ?! Impossible!

EXACTLY ONCE STATEFUL STREAMS THE EASY WAY COLIN MACNAUGTHON NEEVE RESEACH INTRODUCTIONS

Stream Bank Stabilization in Open Space Streams in open space There are approximately 35

CSE 143 Streams as C++ Classes Streams are C++ classes Streams have lots of built-in

Elmwood Park: Electricity Aggregation Developing an Opt-In Municipal Aggregation Program to

simplifying the customer experience through account aggregation Sim Sangha Business Development

The Axiomatic Method in Social Choice Theory: Preference Aggregation, Judgment Aggregation, Graph

Frequency Counts Frequency Counts over over Data Streams Data Streams Gurmeet Singh Manku

SSA Form &amp; SSA-form: x 17-4 Each name is defined exactly once. Dead Code Elimination

EOS: E Exactly xactly- -O Once E nce E- -S Service Middleware ervice Middleware EOS:

Streams and File I/O Fundamentals of Computer Science Outline Overview of Streams and File

Data Streams Many large sources of data are generated as streams of updates: IP Network

Comparing Data Streams Using Hamming Norms Graham Cormode, Mayur Datar, Piotr Indyk, S.

Data Streams Many large sources of data are generated as streams of updates: IP Network

Stream Algorithmics Albert Bifet March 2012 Data Streams Big Data &amp; Real Time Data Streams

Escaping Saddle Points with Adaptive Gradient Methods Matthew Staib 1 , Sashank Reddi 2 ,

The Changing Landscape of Unmanned Systems Larry Osborn EVP &amp; Chief Strategy Officer

University of Tennessee Knoxville, Tennessee Smokey Mountain National Park Durable Materials for

A Quantum Money solution to the Blockchain Scalability Problem Andrea Coladangelo, Or Sattath

VERITAS Observations Maria Krause Alexis Popkow of the Cygnus Region for the VERITAS

Par$$oning &amp; Clustering Big Graphs George Karypis Department

An Approach for Detecting Learning Styles in Learning Management Systems Sabine Graf Kinshuk

Kubernetes+GlusterFS: Lightning Ver. Mohamed Ashiq Liazudeen &amp; Jos A. Rivera

WITH C++ Prof. Amr Goneid AUC Part 9. Streams & Files Prof. amr Goneid, AUC 1 Streams

SSA Form & SSA-form: x 17-4 Each name is defined exactly once. Dead Code Elimination

Stream Algorithmics Albert Bifet March 2012 Data Streams Big Data & Real Time Data Streams

The Changing Landscape of Unmanned Systems Larry Osborn EVP & Chief Strategy Officer

Par$$oning & Clustering Big Graphs George Karypis Department

Kubernetes+GlusterFS: Lightning Ver. Mohamed Ashiq Liazudeen & Jos A. Rivera