Personalizing Netflix with Streaming datasets Shriya Arora Senior - PowerPoint PPT Presentation

Personalizing Netflix with Streaming datasets
Shriya Arora, Senior Data Engineer, Personalization Analytics
@shriyarora

What is this talk about?
● Helping you decide if a streaming pipeline fits your ETL problem
● If it does, how to decide which streaming solution to pick
What is this talk NOT about?
● X streaming engine is the BEST, go use that one!
● Batch is dead, must stream everything!

What is Netflix’s Mission? Entertaining you by allowing you to stream content anywhere, anytime

What is Netflix’s Mission? Entertaining you by allowing you to stream personalized content anywhere, anytime

How much data do we process to have a personalized Netflix for everyone?
● 100M+ active members
● 125M hours/day
● 190 countries with unique catalogs
● 450B unique events/day
● 700+ Kafka topics
Image credit: http://www.bigwisdom.net/

DEA Personalization at a (very) high level
[Diagram: User watches a video on Netflix, then data flows through Netflix servers]

Data Infrastructure
[Diagram components: Application instances; Keystone Ingestion Pipeline; Raw data (S3/HDFS); Batch processing (Spark/Pig/Hive/MR); Stream Processing (Spark, Flink …); Processed data (Tables/Indexers)]

Why have data later when you can have it now?

Business wins
● Algorithms can be trained with the latest data

Business wins
● Innovation in marketing of new launches
● Creates opportunity for new kinds of algorithms

Technical wins
● Save on storage costs
  ○ Raw data in its original form has to be persisted
● Faster turnaround time on error correction
  ○ Long-running batch jobs can incur significant delays when they fail
● Real-time auditing on key personalization metrics
● Integrate with other real-time systems
  ○ Additional infrastructure is required to make ‘online’ systems be available offline

How to pick a Stream Processing Engine?
Problem Scope/Requirements
  ○ Event-based streaming or micro-batches?
  ○ What features will be the most important for the problem?
  ○ Do you want to implement Lambda?
[Diagram: Lambda architecture with Data source/Message source, Stream layer, Batch layer, and Serving layer]
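The difference between event-based streaming and micro-batches can be illustrated with a minimal, engine-agnostic sketch. The function names and the count-based batching are assumptions made for illustration; real micro-batch engines typically cut batches by time interval rather than by event count.

```python
from typing import Callable, Iterable, Iterator, List


def process_per_event(
    events: Iterable[dict], handle: Callable[[dict], dict]
) -> Iterator[dict]:
    """Event-based streaming: each event is handled as soon as it arrives."""
    for event in events:
        yield handle(event)


def process_micro_batches(
    events: Iterable[dict],
    handle_batch: Callable[[List[dict]], list],
    batch_size: int,
) -> Iterator[list]:
    """Micro-batching: events are buffered and handled in small groups.

    Batching by count is a simplification (hypothetical); it stands in for
    the time-based batch interval used by real micro-batch engines.
    """
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield handle_batch(batch)
            batch = []
    if batch:  # flush the final, partially filled batch
        yield handle_batch(batch)
```

Per-event handling keeps latency low for every record, while micro-batching trades some latency for the chance to amortize per-call overhead across a group of events.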

How to pick a Stream Processing Engine?
Existing Internal Technologies
  ○ Infrastructure support: What are other teams using?
  ○ ETL eco-system: Will it fit in with the existing sources and sinks?
What’s your team’s learning curve?
  ○ What do you use for batch?
  ○ What language is the team most fluent in?

Our problem: Source of Play / Source of Discovery
Anatomy of a Netflix Homepage:
● Billboard
● Rows
● Video Rankings (ordering of shows within a row)

[Figure: two plots with axes Percentage of plays vs. Time; labels: Source of Discovery, Source of Play, Continue Watching, Trending now]

What we need to solve for Source of Discovery:
● High throughput
  ○ ~100M events/day
● Talk to live micro-services via thick clients
● Integrate with the Netflix platform eco-system
● Small State
● Allow for side inputs of slowly changing data
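To put the ~100M events/day requirement into per-second terms, a quick back-of-the-envelope calculation helps. This is only the daily average; the slides do not state peak rates, and real traffic is unlikely to be uniform across the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_events_per_second(events_per_day: float) -> float:
    """Average rate implied by a daily event volume, assuming uniform arrival."""
    return events_per_day / SECONDS_PER_DAY


# ~100M events/day works out to roughly 1,157 events per second on average.
print(round(average_events_per_second(100_000_000)))
```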

Source-of-Discovery pipeline: Data Flow
[Diagram components: Playback sessions; Message Bus; Streaming app; CLIENT JAR; Discovery service; Video Metadata; Other side inputs; Enriched sessions; Backup]

Source-of-Discovery pipeline: Tech stack
[Image credit: By Source, Fair use, https://en.wikipedia.org/w/index.php?curid=47175041]

Getting streaming ETL to work
● Getting Data from Live sources
  ○ Every event (session) enriched with attributes from past history
  ○ Making a call to the micro-service via a thick client
● Side inputs
  ○ Get metadata about shows from the content service
  ○ Slowly changing data, optimize to call less frequently
● Dependency Isolation
  ○ Shading jars is fun (said no one ever)
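One common way to "call less frequently" for slowly changing side inputs is a time-to-live cache in front of the service call. The sketch below is an illustration under assumptions, not the pipeline's actual code: `TtlCache`, `enrich`, the `video_id` and `title` fields, and the injected `fetch` callable are all hypothetical stand-ins for the content-service client.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Caches side-input lookups and refreshes an entry once it is older than the TTL."""

    def __init__(
        self,
        fetch: Callable[[str], dict],  # hypothetical call to the content service
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: no service call
        value = self._fetch(key)  # missing or stale: call the service
        self._entries[key] = (now, value)
        return value


def enrich(session: dict, metadata_cache: TtlCache) -> dict:
    """Attach show metadata to a playback session (field names are hypothetical)."""
    metadata = metadata_cache.get(session["video_id"])
    return {**session, "title": metadata["title"]}
```

The TTL is the knob that trades freshness of the metadata against load on the upstream service; a longer TTL suits data that changes rarely.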

Getting streaming ETL to work (cont.)
● Data Recovery
  ○ Kafka TTLs are aggressive
  ○ Raw data stored in HDFS for finite time for replay
● Out of order events
  ○ Late arriving data must be attributed correctly
● Increased Monitoring, Alerts
  ○ Because recovery is non-trivial, prevent data-loss
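Attributing late-arriving data correctly usually means bucketing each event by the time it happened rather than the time it arrived. A minimal sketch of that idea follows; the hourly window size and the `event_time` and `row` field names are assumptions for illustration. Unlike a real stream processor, it keeps every window open forever instead of bounding lateness with watermarks.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 3600  # hypothetical hourly tumbling window


def window_start(event_time: int, window_seconds: int = WINDOW_SECONDS) -> int:
    """Start of the tumbling window that contains event_time (epoch seconds)."""
    return event_time - (event_time % window_seconds)


def count_plays_by_window(events: Iterable[dict]) -> Dict[Tuple[int, str], int]:
    """Count plays per (window, row), keyed on event time, not arrival order."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for event in events:  # events are iterated in arrival order
        key = (window_start(event["event_time"]), event["row"])
        counts[key] += 1
    return dict(counts)
```

Because the key is derived from `event_time`, an event that shows up after later events have already been processed still lands in the window in which it actually occurred.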

Challenges with Streaming
● Pioneer Tax
  ○ Conventional ETL is batch
  ○ Training ML models on streaming data is new ground
● Outages and Pager Duty ;)
  ○ Batch failures have to be addressed urgently, Streaming failures have to be addressed immediately.
● Fault-tolerant infrastructure
  ○ Monitoring, Alerts
  ○ Rolling deployments
There are two kinds of pain...

Questions? Stay in touch! @NetflixData
