Netflix: Petabyte Scale Analytics Infrastructure in the Cloud


  1. Netflix: Petabyte Scale Analytics Infrastructure in the Cloud - Daniel C. Weeks, Tom Gianos

  2. Overview ● Data at Netflix ● Netflix Scale ● Platform Architecture ● Data Warehouse ● Genie ● Q&A

  3. Data at Netflix

  4. Our Biggest Challenge is Scale

  5. Netflix Key Business Metrics: 86+ million global members · 1000+ devices supported · 125+ million hours / day

  6. Netflix Key Platform Metrics: 500B events · 60 PB DW · Read 3 PB · Write 500 TB

  7. Big Data Platform Architecture

  8. Data Pipelines: event data flows from cloud apps through Kafka and Ursula into S3 every 5 minutes; dimension data flows from Cassandra SSTables through Aegisthus into S3 tables daily

  9. Platform layers: Interface (Big Data Portal, Big Data API), Tools (Transport, Visualization, Quality, Workflow Vis, Job/Cluster Vis), Services (Orchestration, Metadata, Compute), Storage (Parquet on S3)

  10. Cluster footprint: Production ~2300 d2.4xl · Ad-hoc ~1200 d2.4xl · Other

  11. S3 Data Warehouse

  12. Why S3? • Lots of 9’s • Features not available in HDFS • Decouple Compute and Storage

  13. Decoupled Scaling (chart: warehouse size vs. HDFS capacity across all clusters at 3x replication, with no buffer)

  14. Decouple Compute / Storage (diagram: Production and Ad-hoc clusters sharing S3)

  15. Tradeoffs - Performance • Split Calculation (Latency) – Impacts job start time – Executes off cluster • Table Scan (Latency + Throughput) – Parquet seeks add latency – Read overhead and available throughput • Performance Converges with Volume and Complexity

  16. Tradeoffs - Performance

  17. Metadata • Metacat: Federated Metadata Service • Hive Thrift Interface • Logical Abstraction

  18. Partitioning - Less is More
      Databases: data_science, etl, telemetry, ab_test
      Tables: country_d, catalog_d, playback_f, search_f
      Partitions: date=20161101, date=20161102, date=20161103, date=20161104

  19. Partition Locations: data_science.playback_f
      date=20161101 → s3://<bucket>/hive/warehouse/data_science.db/playback_f/dateint=20161101/…
      date=20161102 → s3://<bucket>/hive/warehouse/data_science.db/playback_f/dateint=20161102/…
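The logical-to-physical mapping above can be sketched as a small helper. `partition_path` and `warehouse_root` are illustrative names, not Netflix code, and `<bucket>` is kept as the placeholder the slide uses:

```python
# Hypothetical sketch: mapping a logical (database, table, partition) triple
# to a Hive-style S3 location, as shown on the slide above.

def partition_path(warehouse_root: str, database: str, table: str, dateint: int) -> str:
    """Build the S3 key prefix for one date partition of a table."""
    return f"{warehouse_root}/{database}.db/{table}/dateint={dateint}/"

root = "s3://<bucket>/hive/warehouse"
print(partition_path(root, "data_science", "playback_f", 20161101))
# s3://<bucket>/hive/warehouse/data_science.db/playback_f/dateint=20161101/
```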

  20. Parquet

  21. Parquet File Format Column Oriented ● Store column data contiguously ● Improve compression ● Column projection Strong Community Support ● Spark, Presto, Hive, Pig, Drill, Impala, etc. ● Works well with S3

  22. Parquet file layout: each file holds row groups; each row group holds one column chunk per column; each column chunk holds an optional dictionary page plus data pages. The footer stores file metadata (schema, version, etc.), per-row-group metadata (row count, size, etc.), and per-column-chunk metadata [encoding, size, min, max].

  23. Staging Data • Partition by low cardinality fields • Sort by high cardinality predicate fields
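The payoff of these two rules can be illustrated in plain Python, with no Parquet library involved: sorting by the high-cardinality predicate field tightens each row group's min/max statistics, so a reader can skip more groups. The row-group size of 4 and the `device_ids` values are invented for the sketch:

```python
# Illustrative sketch (not Parquet code) of min/max-based row-group pruning.

def row_group_stats(values, group_size):
    """Split values into row groups and record each group's (min, max)."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g)) for g in groups]

def groups_to_scan(stats, predicate_value):
    """A reader must scan any group whose [min, max] range may contain the value."""
    return sum(1 for lo, hi in stats if lo <= predicate_value <= hi)

device_ids = [7, 42, 3, 99, 58, 11, 73, 25]           # unsorted staging data
unsorted_stats = row_group_stats(device_ids, 4)        # wide, overlapping ranges
sorted_stats = row_group_stats(sorted(device_ids), 4)  # narrow, disjoint ranges

print(groups_to_scan(unsorted_stats, 42))  # both groups' ranges contain 42
print(groups_to_scan(sorted_stats, 42))    # only one group can contain 42
```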

  24. Staging Data (diagram: original vs. sorted layout)

  25. Filtered (diagram: reads against original vs. processed data)

  26. Parquet Tuning Guide: http://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide

  27. A Nascent Data Platform Gateway

  28. Need Somewhere to Test (diagram: Prod Gateway → Prod cluster; Test Gateway → Test cluster)

  29. More Users = More Resources (diagram: multiple Prod Gateways and Test Gateways in front of their clusters)

  30. Clusters for Specific Purposes (diagram: Prod, Test, and Backfill Gateways, each in front of its own clusters)

  31. User Base Matures (diagram of competing user requests: "There's a bug in Presto 0.149, need 0.150" · "I want Spark 1.6.1" · "I need Spark 2.0" · "My job is slow, I need more resources" · "R?")

  32. No one is happy

  33. Genie to the Rescue (diagram: Genie routing jobs to Prod, Test, and Backfill clusters)

  34. Problems Netflix Data Platform Faces • For Administrators – Coordination of many moving parts • ~15 clusters • ~45 different client executables and versions for those clusters – Heavy load • ~45-50k jobs per day – Hundreds of users with different problems • For Users – Don’t want to know details – All clusters and client applications need to be available for use – Need to provide tools to make doing their jobs easy

  35. Genie for the Platform Administrator

  36. An administrator wants a tool to… • Simplify configuration management and deployment • Minimize impact of changes to users • Track and respond to problems with system quickly • Scale client resources as load increases

  37. Genie Configuration Data Model: Cluster (metadata about cluster, e.g. [sched:sla, type:yarn, ver:2.7.1]) 1 → 0..* Command (executables, e.g. [type:spark-submit, ver:1.6.0]) 1 → 0..* Application (dependencies for an executable)

  38. Search Resources

  39. Administration Use Cases

  40. Updating a Cluster • Start up a new cluster • Register Cluster with Genie • Run tests • Move tags from old to new cluster in Genie – New cluster begins taking load immediately • Let old jobs finish on old cluster • Shut down old cluster • No down time!

  41. Load Balance Between Clusters • Different loads at different times of day • Copy tags from one cluster to another to split load • Remove tags when done • Transparent to all clients!

  42. Update Application Binaries • Copy new binaries to central download location • Genie cache will invalidate old binaries on next invocation and download new ones • Instant change across entire Genie cluster
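The binary-cache idea on this slide can be sketched in stdlib Python: an agent re-downloads an application binary only when its checksum no longer matches the centrally published copy. `checksum` and `needs_refresh` are hypothetical helper names, not Genie internals:

```python
import hashlib

def checksum(data: bytes) -> str:
    """Content hash used to detect a changed central binary."""
    return hashlib.sha256(data).hexdigest()

def needs_refresh(cached: bytes, central: bytes) -> bool:
    """True when the centrally published binary differs from the local cache."""
    return checksum(cached) != checksum(central)

print(needs_refresh(b"spark-1.6.0-bin", b"spark-1.6.0-bin"))  # False: cache valid
print(needs_refresh(b"spark-1.6.0-bin", b"spark-2.0.0-bin"))  # True: re-download
```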

  43. Genie for Users

  44. User wants a tool to… • Discover a cluster to run job on • Run the job client • Handle all dependencies and configuration • Monitor the job • View history of jobs • Get job results

  45. Submitting a Job (diagram: criteria matched against Clusters and Commands)
      { … "clusterCriteria": [ "type:yarn", "sched:sla" ], "commandCriteria": [ "type:spark", "ver:1.6.0" ] … }
      Image credit: https://analyticsforinsights.files.wordpress.com/2015/04/superman-data-scientist-graphic.jpg
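A minimal sketch of this tag-matching idea (simplified, not Genie's actual resolution code): a resource is eligible when its registered tag set contains every criteria tag. The same mechanism is why the tag moves described on slides 40 and 41 reroute work with no client changes. The cluster names and tag sets below are invented:

```python
# Simplified tag-based resource resolution, in the spirit of Genie's criteria.

clusters = {
    "prod-yarn-a": {"type:yarn", "sched:sla", "ver:2.7.1"},
    "test-yarn":   {"type:yarn", "sched:test"},
}

def resolve(registrations, criteria):
    """Return names whose tag set contains all requested criteria tags."""
    return [name for name, tags in registrations.items() if criteria <= tags]

print(resolve(clusters, {"type:yarn", "sched:sla"}))   # ['prod-yarn-a']

# Moving the sla tag to a freshly tested cluster reroutes new jobs immediately:
clusters["prod-yarn-b"] = {"type:yarn", "sched:sla", "ver:2.7.2"}
clusters["prod-yarn-a"].discard("sched:sla")
print(resolve(clusters, {"type:yarn", "sched:sla"}))   # ['prod-yarn-b']
```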

  46. Genie Job Data Model (diagram: Job Request, Job, Job Metadata, and Job Execution related 1:1; a Job resolves to one Cluster, one Command, and 0..* Applications)

  47. Job Request

  48. Python Client Example
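The slide's client screenshot is not reproduced in this transcript. As a hedged stand-in, the sketch below builds a job-request body by hand using the criteria field names from the "Submitting a Job" slide; the name, user, and argument values are made up, and the real Python client wraps this in its own API rather than raw dictionaries:

```python
import json

# Hypothetical job-request body using the slide's criteria fields.
job_request = {
    "name": "example-spark-job",   # illustrative values, not from the deck
    "user": "jdoe",
    "clusterCriteria": ["type:yarn", "sched:sla"],
    "commandCriteria": ["type:spark", "ver:1.6.0"],
}

print(json.dumps(job_request, indent=2))
```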

  49. Job History

  50. Job Output

  51. Wrapping Up

  52. Data Warehouse • S3 for Scale • Decouple Compute & Storage • Parquet for Speed

  53. Genie at Netflix • Runs the OSS code • Runs ~45k jobs per day in production • Runs on ~25 i2.4xl instances at any given time • Keeps ~3 months of jobs (~3.1 million) in history

  54. Resources • http://netflix.github.io/genie/ – Work in progress for 3.0.0 • https://github.com/Netflix/genie – Demo instructions in README • https://hub.docker.com/r/netflixoss/genie-app/ – Docker container

  55. Questions?
