Netflix: Petabyte Scale Analytics

Netflix: Petabyte Scale Analytics Infrastructure in the Cloud
Daniel C. Weeks, Tom Gianos

Overview ● Data at Netflix ● Netflix Scale ● Platform Architecture ● Data Warehouse ● Genie ● Q&A

Data at Netflix

Our Biggest Challenge is Scale

Netflix Key Business Metrics ● 86+ million members ● Global ● 1000+ devices supported ● 125+ million hours / day

Netflix Key Platform Metrics ● 500B Events ● 60 PB DW ● Read 3 PB ● Write 500 TB

Big Data Platform Architecture

Data Pipelines
Event Data: Cloud apps → Kafka → Ursula → S3 (5 min)
Dimension Data: Cassandra → SSTables → Aegisthus → S3 (Daily)

Interface: Big Data Portal, Big Data API
Tools: Transport, Visualization, Quality, Workflow Vis, Job/Cluster Vis
Service: Orchestration, Metadata
Compute
Storage: S3, Parquet

Production: ~2300 d2.4xl ● Ad-hoc: ~1200 d2.4xl ● Other

S3 Data Warehouse

Why S3? • Lots of 9’s • Features not available in HDFS • Decouple Compute and Storage

Decoupled Scaling Warehouse Size HDFS Capacity All Clusters 3x Replication No Buffer
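As a rough illustration of the capacity argument on this slide, the arithmetic can be sketched as follows. The function name and the cluster count are hypothetical; only the 60 PB warehouse size, 3x replication, and "no buffer" come from the slides.

```python
def hdfs_raw_capacity_pb(warehouse_pb, replication=3, clusters=1, buffer_fraction=0.0):
    """Raw disk (in PB) needed if every cluster keeps its own full HDFS copy.

    Illustrative helper, not part of any Netflix tool.
    """
    per_cluster = warehouse_pb * replication * (1.0 + buffer_fraction)
    return per_cluster * clusters


# 60 PB warehouse, 3x replication, no buffer, one cluster.
one_cluster = hdfs_raw_capacity_pb(60)

# The same warehouse copied onto two clusters (cluster count is an assumption).
two_clusters = hdfs_raw_capacity_pb(60, clusters=2)

print(one_cluster, two_clusters)
```

With the warehouse in S3, compute clusters can be sized for the workload instead of for this storage figure.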

Decouple Compute / Storage Production Ad-hoc S3

Tradeoffs - Performance
• Split Calculation (Latency)
  – Impacts job start time
  – Executes off cluster
• Table Scan (Latency + Throughput)
  – Parquet seeks add latency
  – Read overhead and available throughput
• Performance Converges with Volume and Complexity

Tradeoffs - Performance

Metadata • Metacat: Federated Metadata Service • Hive Thrift Interface • Logical Abstraction

Partitioning - Less is More
Database: data_science, etl, telemetry, ab_test
Table: country_d, catalog_d, playback_f, search_f
Partition: date=20161101, date=20161102, date=20161103, date=20161104

Partition Locations
Database: data_science, Table: playback_f
date=20161101: s3://<bucket>/hive/warehouse/data_science.db/playback_f/dateint=20161101/…
date=20161102: s3://<bucket>/hive/warehouse/data_science.db/playback_f/dateint=20161102/…

Parquet

Parquet File Format
Column Oriented
● Store column data contiguously
● Improve compression
● Column projection
Strong Community Support
● Spark, Presto, Hive, Pig, Drill, Impala, etc.
● Works well with S3

[Figure: Parquet file layout. Each Row Group contains Column Chunks, and each Column Chunk is made up of Dict Pages and Data Pages. The Footer holds file metadata (schema, version, etc.), RowGroup Metadata (row count, size, etc.), and per-column Column Chunk Metadata [encoding, size, min, max].]

Staging Data • Partition by low cardinality fields • Sort by high cardinality predicate fields
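A minimal sketch of why sorting by a high cardinality predicate field helps, assuming a reader that skips any row group whose min/max statistics exclude the value being searched for. It uses plain Python lists rather than a Parquet library; the function names, group size, and synthetic ids are all illustrative.

```python
import random


def row_group_stats(values, rows_per_group):
    """Split values into fixed-size row groups and return (min, max) per group."""
    groups = [values[i:i + rows_per_group]
              for i in range(0, len(values), rows_per_group)]
    return [(min(g), max(g)) for g in groups]


def groups_to_read(stats, target):
    """Count row groups whose [min, max] range could contain the target."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)


random.seed(0)
# 10,000 distinct synthetic ids standing in for a high cardinality column.
ids = random.sample(range(1_000_000), 10_000)
target = ids[0]

original_stats = row_group_stats(ids, 1_000)
sorted_stats = row_group_stats(sorted(ids), 1_000)

# Unsorted data: ranges overlap heavily, so the target usually falls inside
# many groups. Sorted data: ranges are disjoint, so exactly one group matches.
print("original:", groups_to_read(original_stats, target))
print("sorted:  ", groups_to_read(sorted_stats, target))
```

Because the ids are distinct, the sorted layout always needs exactly one row group for a point lookup, while the unsorted layout typically needs most of them.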

Staging Data Original Sorted

Filtered Original Processed

Parquet Tuning Guide http://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide

A Nascent Data Platform Gateway

Need Somewhere to Test Prod Gateway Prod Test Gateway Test

More Users = More Resources Prod Prod Gateway Prod Gateway Prod Gateways Prod Prod Gateway Test Gateway Test Gateways

Clusters for Specific Purposes Prod Prod Gateway Prod Gateway Prod Gateways Prod Prod Gateway Test Gateway Test Gateways Prod Prod Gateway Backfill Gateway Backfill Gateways

User Base Matures
• "R?"
• "There’s a bug in Presto 0.149, need 0.150"
• "I want Spark 1.6.1"
• "I need Spark 2.0"
• "My job is slow, I need more resources"

No one is happy

Genie to the Rescue Prod Test Backfill

Problems Netflix Data Platform Faces
• For Administrators
  – Coordination of many moving parts
    • ~15 clusters
    • ~45 different client executables and versions for those clusters
  – Heavy load
    • ~45-50k jobs per day
  – Hundreds of users with different problems
• For Users
  – Don’t want to know details
  – All clusters and client applications need to be available for use
  – Need to provide tools to make doing their jobs easy

Genie for the Platform Administrator

An administrator wants a tool to… • Simplify configuration management and deployment • Minimize impact of changes to users • Track and respond to problems with system quickly • Scale client resources as load increases

Genie Configuration Data Model
• Cluster: metadata about a cluster – [sched:sla, type:yarn, ver:2.7.1]
• Command: executable(s) – [type:spark-submit, ver:1.6.0]
• Application: dependencies for an executable
Relationships: Cluster 1 to 0..* Command; Command 1 to 0..* Application
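The three-level model above can be sketched with plain dataclasses. This is an illustration of the relationships on the slide, not Genie's actual classes; the class fields, the resource names, and the Application's tags are assumptions, while the Cluster and Command tags are the ones shown on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Application:
    """Dependencies for an executable."""
    name: str
    tags: set = field(default_factory=set)


@dataclass
class Command:
    """An executable, linked to 0..* Applications."""
    name: str
    tags: set = field(default_factory=set)
    applications: list = field(default_factory=list)


@dataclass
class Cluster:
    """Metadata about a cluster, linked to 0..* Commands."""
    name: str
    tags: set = field(default_factory=set)
    commands: list = field(default_factory=list)


# Names and the Application tags below are hypothetical.
spark = Application("spark-1.6.0", {"type:spark", "ver:1.6.0"})
spark_submit = Command("spark-submit", {"type:spark-submit", "ver:1.6.0"}, [spark])
prod = Cluster("prod", {"sched:sla", "type:yarn", "ver:2.7.1"}, [spark_submit])

print(prod.name, [c.name for c in prod.commands])
```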

Search Resources

Administration Use Cases

Updating a Cluster • Start up a new cluster • Register Cluster with Genie • Run tests • Move tags from old to new cluster in Genie – New cluster begins taking load immediately • Let old jobs finish on old cluster • Shut down old cluster • No down time!
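The tag move at the heart of this procedure can be sketched with an in-memory registry. This assumes jobs are routed to whichever cluster carries all the requested tags; the registry structure, function names, and cluster names are hypothetical and do not represent Genie's API.

```python
def resolve(registry, criteria):
    """Return the names of clusters whose tags include every criterion."""
    return sorted(name for name, tags in registry.items() if criteria <= tags)


def move_tags(registry, old, new, tags):
    """Move the given tags from cluster `old` to cluster `new`."""
    registry[old] -= tags
    registry[new] |= tags


# The new cluster is registered but does not yet carry the routing tag.
registry = {
    "prod-old": {"type:yarn", "sched:sla"},
    "prod-new": {"type:yarn"},
}
criteria = {"type:yarn", "sched:sla"}

before = resolve(registry, criteria)
move_tags(registry, "prod-old", "prod-new", {"sched:sla"})
after = resolve(registry, criteria)

print(before, "->", after)
```

Clients keep submitting the same criteria throughout; only the registry changes, which is why the switch needs no down time on the client side.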

Load Balance Between Clusters • Different loads at different times of day • Copy tags from one cluster to another to split load • Remove tags when done • Transparent to all clients!

Update Application Binaries • Copy new binaries to central download location • Genie cache will invalidate old binaries on next invocation and download new ones • Instant change across entire Genie cluster
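One way such invalidation could work is sketched below. The slides do not say how Genie detects a stale binary, so the version-stamp comparison, the class, and the in-memory dictionary standing in for the central download location are all assumptions.

```python
class BinaryCache:
    """Caches binaries locally and re-fetches when the central copy changes."""

    def __init__(self, central):
        # central: path -> (version_stamp, content); stands in for the
        # central download location.
        self.central = central
        self.local = {}
        self.downloads = 0

    def get(self, path):
        """Return the binary, downloading it if missing or out of date."""
        remote_stamp, _ = self.central[path]
        cached = self.local.get(path)
        if cached is None or cached[0] != remote_stamp:
            self.local[path] = self.central[path]
            self.downloads += 1
        return self.local[path][1]


central = {"spark.tgz": (1, b"v1")}
cache = BinaryCache(central)

first = cache.get("spark.tgz")    # first invocation downloads
second = cache.get("spark.tgz")   # served from the cache

central["spark.tgz"] = (2, b"v2")  # administrator copies a new binary
third = cache.get("spark.tgz")     # next invocation picks it up

print(first, second, third, cache.downloads)
```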

Genie for Users

User wants a tool to… • Discover a cluster to run job on • Run the job client • Handle all dependencies and configuration • Monitor the job • View history of jobs • Get job results

Submitting a Job
{
  …
  "clusterCriteria": [ "type:yarn", "sched:sla" ],
  "commandCriteria": [ "type:spark", "ver:1.6.0" ]
  …
}
Diagram labels: Clusters, Commands
1. https://analyticsforinsights.files.wordpress.com/2015/04/superman-data-scientist-graphic.jpg
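A simplified sketch of how a request like this might be resolved, assuming a cluster or command matches when its tags include every requested criterion. The `select` function, the dictionary layout, the resource names, and the `sched:adhoc` tag are hypothetical; Genie's real selection logic may differ.

```python
def select(clusters, cluster_criteria, command_criteria):
    """Return (cluster_name, command_name) for the first match, else None."""
    for cluster in clusters:
        if not set(cluster_criteria) <= cluster["tags"]:
            continue
        for command in cluster["commands"]:
            if set(command_criteria) <= command["tags"]:
                return cluster["name"], command["name"]
    return None


spark_160 = {"name": "spark-submit-1.6.0", "tags": {"type:spark", "ver:1.6.0"}}

clusters = [
    {"name": "adhoc", "tags": {"type:yarn", "sched:adhoc"}, "commands": [spark_160]},
    {"name": "prod", "tags": {"type:yarn", "sched:sla", "ver:2.7.1"}, "commands": [spark_160]},
]

request = {
    "clusterCriteria": ["type:yarn", "sched:sla"],
    "commandCriteria": ["type:spark", "ver:1.6.0"],
}

match = select(clusters, request["clusterCriteria"], request["commandCriteria"])
print(match)
```

The user names only the properties the job needs; which physical cluster satisfies them is decided at submission time.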

Genie Job Data Model
[Diagram: entities Job Request, Job, Job Metadata, Job Execution, Cluster, Command, and Application; links are labeled 1, with one 0..* link.]

Job Request

Python Client Example

Job History

Job Output

Wrapping Up

Data Warehouse • S3 for Scale • Decouple Compute & Storage • Parquet for Speed

Genie at Netflix • Runs the OSS code • Runs ~45k jobs per day in production • Runs on ~25 i2.4xl instances at any given time • Keeps ~3 months of jobs (~3.1 million) in history

Resources • http://netflix.github.io/genie/ – Work in progress for 3.0.0 • https://github.com/Netflix/genie – Demo instructions in README • https://hub.docker.com/r/netflixoss/genie-app/ – Docker Container

Questions?
