Microservices at Netflix Scale: First Principles, Tradeoffs, Lessons Learned


  1. Microservices at Netflix Scale First Principles, Tradeoffs, Lessons Learned Ruslan Meshenberg @rusmeshenberg

  2. Microservices: all benefits, no costs?

  3. Netflix is the world’s leading Internet television network with over 81 million members in over 190 countries enjoying more than 125 million hours of TV shows and movies per day, including original series, documentaries and feature films.

  4. Ruslan Meshenberg Director, Platform Engineering • Runtime Systems • Container Runtime • Persistence and Databases • Real Time Data Infrastructure

  5. Netflix runs on microservices

  6. Netflix journey to microservices

  7. Our journey took 7 years https://media.netflix.com/en/company-blog/completing-the-netflix-cloud-migration

  8. Data Center - Monolith RDBMS

  9. August 2008

  10. First Principles

  11. Buy vs. Build ● Use or contribute to OSS technologies first ● Only build what you have to

  12. Services should be stateless* ● Must not rely on sticky sessions ● Prove by Chaos testing *Except the Persistence / Caching layers
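A minimal sketch of what "must not rely on sticky sessions" can look like in practice. The names `SessionStore` and `StatelessInstance` are hypothetical, and the in-memory store is only a stand-in for the persistence / caching layer; the point is that any instance can serve any request because no per-user state lives inside an instance.

```python
from itertools import cycle


class SessionStore:
    """Hypothetical stand-in for the persistence / caching layer."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class StatelessInstance:
    """A service instance that keeps no per-user state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # shared external state, not instance-local

    def handle(self, user_id):
        # Read, update, and write back state held outside the instance.
        count = self.store.get(user_id, 0) + 1
        self.store.put(user_id, count)
        return self.name, count


store = SessionStore()
instances = [StatelessInstance(f"b{i}", store) for i in range(3)]
router = cycle(instances)  # round-robin routing, no sticky sessions

# The same user is served by a different instance on every request.
results = [next(router).handle("user-42") for _ in range(6)]
```

Because the counter lives in the shared store, the sequence of counts stays correct even though each request lands on a different instance.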

  13. Scale out vs. scale up ● If you keep scaling up, you’ll hit a limit ● Horizontal scaling gives you a longer runway

  14. Redundancy and Isolation For Resiliency ● Make more than one of anything ● Isolate the blast radius for any given failure

  15. Automate destructive testing ● Simian Army ● Started with Chaos Monkey
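The idea of automated destructive testing can be sketched as follows. This is not Chaos Monkey's actual implementation; `Instance`, `kill_random_instance`, and `call_with_failover` are hypothetical names for a toy pool in which one redundant copy is terminated at random and the caller is expected to keep working.

```python
import random


class Instance:
    """Toy service instance that can be terminated."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def call_with_failover(instances, request):
    """Try each instance in turn; succeed if any redundant copy is alive."""
    for inst in instances:
        try:
            return inst.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("all instances are down")


def kill_random_instance(instances, rng):
    """Hypothetical chaos step: terminate one live instance at random."""
    victim = rng.choice([i for i in instances if i.alive])
    victim.alive = False
    return victim


rng = random.Random(0)  # seeded so the experiment is repeatable
pool = [Instance(f"svc-{i}") for i in range(3)]

victim = kill_random_instance(pool, rng)
response = call_with_failover(pool, "GET /titles")
```

If the request still succeeds after the kill, the redundancy is doing its job; if it fails, the test has found a dependency on a single instance.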

  16. First Principles In Action

  17. Stateless services [diagram: Service A calling any of five interchangeable Service B instances]

  18. Verify stateless

  19. Data – from RDBMS to Cassandra ● NoSQL at scale ● Open Source ● Multi-Regional ● Multi-directional ● Available ● Partition Tolerance ● Tunable Consistency*

  20. Multi-Regional Replication [diagram: Region A and Region B, each with Zones A, B and C; each client writes with Local Quorum (typical); bi-directional replication between the regions, labeled 500ms; nightly compare & repair]
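The quorum arithmetic behind tunable consistency can be sketched as below. The replication factor of 3 per region (one replica per zone) is an assumption read off the three-zone diagram, not something the slides state. A quorum is a majority of replicas, and a read is guaranteed to overlap the latest acknowledged write when the read and write replica counts together exceed the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3  # assumption: one replica in each of Zones A, B and C of a region
local_quorum = quorum(rf)

# Quorum reads plus quorum writes overlap in at least one replica.
strong = reads_see_latest_write(local_quorum, local_quorum, rf)

# Reading and writing a single replica each trades consistency for latency.
weak = reads_see_latest_write(1, 1, rf)
```

With three replicas, a local quorum is two, so one zone in the region can be lost while quorum reads and writes still succeed.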

  21. Last, but not least - Billing

  22. Microservices – Benefits

  23. Our Priorities 1. Innovation 2. Reliability 3. Efficiency

  24. Innovation: tight coupling doesn’t work ● Team A, Team B, Team C, … all feed a single shared Develop, Test, Release pipeline

  25. Innovation: Loose coupling ● Team A: Develop, Test, Deploy, Support ● Team B: Develop, Test, Deploy, Support ● Team C: Develop, Test, Deploy, Support

  26. End-end ownership [diagram: a cycle of Architect, Design, Develop, Review, Test, Deploy, Run, Support]

  27. End-end ownership + velocity [diagram: the same Architect, Design, Develop, Review, Test, Deploy, Run, Support cycle repeated once per team, across many teams]

  28. Separation of concerns [diagram: layers UI, Personalization, Mid-tier and Infrastructure; labels Feature A, Feature B, Feature C, Feature D, A/B Test E, A/B Test F, Feature H sit above and leverage Infrastructure, which provides Availability, Scalability, Security]

  29. Microservices – Costs

  30. Microservices is an org change! Org changes are hard!

  31. Evolving the organization

  32. Central infrastructure investment

  33. Migration doesn’t happen overnight ● Living in the hybrid world ● Supporting 2 tech stacks ● Double the maintenance ● Multi-master data replication

  34. Microservices - Lessons Learned

  35. IPC is crucial for loose coupling ● Common language between the services ● Establishes the contract of interaction

  36. Caching to protect DBs ● 1. Read from Cache ● 2. On cache miss call service ● 3. Service calls DB and responds ● 4. Service updates the cache [diagram: Client Application with a Client Library (EVCache Client, Service Client), the Cache, service instances (S), and DBs]
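The four steps on this slide can be sketched as a small cache-aside flow. This does not use the EVCache client API; plain dictionaries stand in for the cache and the DB, and `ClientLibrary` and `Service` are hypothetical names chosen to mirror the diagram.

```python
class Service:
    """Backing service that owns DB access and refreshes the cache."""

    def __init__(self, db, cache):
        self.db = db
        self.cache = cache
        self.db_reads = 0  # counts how often the DB is actually hit

    def get(self, key):
        self.db_reads += 1
        value = self.db[key]      # 3. Service calls DB and responds
        self.cache[key] = value   # 4. Service updates the cache
        return value


class ClientLibrary:
    """Client-side library that checks the cache before calling the service."""

    def __init__(self, cache, service):
        self.cache = cache
        self.service = service

    def get(self, key):
        if key in self.cache:             # 1. Read from cache
            return self.cache[key]
        return self.service.get(key)      # 2. On cache miss call service


db = {"title:1": "House of Cards"}  # stand-in for the database
cache = {}                          # stand-in for the cache tier
service = Service(db, cache)
client = ClientLibrary(cache, service)

first = client.get("title:1")   # miss: goes to the service and the DB
second = client.get("title:1")  # hit: served from the cache
```

Only the first read reaches the DB; repeated reads of the same key are absorbed by the cache, which is the protection the slide title refers to.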

  37. Operational visibility matters ● If you can’t see it, you can’t improve it

  38. Will your Telemetry scale? [diagram: a loop of Observe, Orient, Decide, Act]

  39. [diagram: two tiers, Edge and Middle Tier & Platform; components shown: ELB, Zuul, API, Playback, EVCache, Cassandra]

  40. Reliability Matters ● We strive for 4 9’s of availability ● That leaves only 52 minutes of downtime per YEAR ● Netflix outages lead to …
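The 52-minute figure follows directly from the availability target. A quick check, assuming a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


four_nines = downtime_minutes_per_year(0.9999)   # about 52.6 minutes
three_nines = downtime_minutes_per_year(0.999)   # about 525.6 minutes
```

Each additional nine cuts the yearly budget by a factor of ten, so moving from three nines to four shrinks it from nearly nine hours to under an hour.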

  41. Disappointment

  42. Outrage

  43. Withdrawal

  44. Humor

  45. Cascading failures ● A chain of services, each at 99% availability ● 99%^500 ≈ 0.657%
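The compounding effect can be checked with a short calculation. It assumes, for illustration only, that failures are independent and that a request needs every service in the chain to succeed; under those assumptions 0.99 raised to the 500th power is roughly 0.0066, or about 0.66%.

```python
def chain_availability(per_service: float, n_services: int) -> float:
    """Availability of a request that depends on n independent services."""
    return per_service ** n_services


three_hops = chain_availability(0.99, 3)      # about 0.9703
five_hundred = chain_availability(0.99, 500)  # about 0.0066
```

Even a short chain loses availability quickly, which is why the deck pairs this slide with fault injection and isolation of the blast radius.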

  46. Microservice failure FIT Fault-Injection Test Framework

  47. Regional fail-over

  48. Regional fail-over

  49. A word on containers ● Containers change the level of encapsulation from VM to process ● Containers can help deliver great developer experience ● To run containers in production at scale …

  50. Requires something like this: [architecture diagram; component labels: Titus UI, Titus API, Titus Master, Job Management & Scheduler, Fenzo, Rhea, Mesos Master, Titus Agent, mesos agent, Titus executor, docker, containers, logging agent, metrics agent, metadata proxy, zfs, VPC networking driver, Docker Registry, S3, Cassandra, Zookeeper, CI/CD, AWS Integration, EC2 Autoscaling API, Amazon VM’s]

  51. Microservices - Resources

  52-57. http://netflix.github.com

  58. Wrap up

  59. Microservices bring great value to development velocity, availability and other dimensions

  60. Microservices at scale require organizational change and centralized infrastructure investment

  61. Be aware of your situation and what works for you

  62. Questions? Ruslan Meshenberg @rusmeshenberg
