Keeping Movies Running Amid Thunderstorms: Fault-tolerant Systems @ Netflix


  1. Keeping Movies Running Amid Thunderstorms: Fault-tolerant Systems @ Netflix. Sid Anand (@r39132), QCon SF 2011. Thursday, November 17, 2011

  2. Backgrounder: Netflix Then and Now

  3. Netflix Then and Now
      Netflix prior to circa 2009:
      • Users watched DVDs at home. Peak days: Friday, Saturday, Sunday
      • Users returned DVDs and updated their Qs. Peak days: Sunday, Monday
      • We shipped the next DVDs. Peak days: Monday, Tuesday
      • Scheduled site downtimes on alternate Wednesdays
      Netflix post circa 2009:
      • Users watch streaming at home. Peak days: Friday, Saturday, Sunday
      • Off-peak days see many orders of magnitude more traffic than prior to 2009
      • User expectation is that streaming is always available
      • No scheduled site downtimes
      • Fault tolerance is a top design concern

  4. Netflix DC Architecture: A Simple System

  5. Netflix’s DC Architecture: Components
      • 1 Netscaler H/W load balancer
      • ~20 “WWW” Apache+Tomcat servers
      • 3 Oracle DBs and 1 MySQL DB
      • Cache servers
      • Cinematch recommendation system

  6. Netflix’s DC Architecture: Types of Production Issues
      • Java garbage collection problems, which would result in slower WWW pages
      • Deadlocks in our multi-threaded Java application, which would cause web page loads to time out
      • Transaction locking in the DB, which would result in similar web page load timeouts
      • Under-optimized SQL or DB, which would cause slower web pages (e.g. the DB optimizer picks a sub-optimal execution plan)

  7. Netflix’s DC Architecture
      Architecture Pros:
      • As serious as these sound, they were typically single-system failure scenarios
      • Single-system failures are relatively easy to resolve
      Architecture Cons:
      • Not horizontally scalable
      • We’re constrained by what can fit on a single box
      • Not conducive to high-velocity development and deployment

  8. Netflix’s Cloud Architecture: A Less Simple System

  9. Netflix’s Cloud Architecture: Components
      • Many (~100) applications, organized in clusters
      • Clusters can be at different levels in the call stack
      • Clusters can call each other
      (Diagram: ELB, NES, NMTS, NBES and IAAS clusters stacked by level, with Discovery alongside.)

  10. Netflix’s Cloud Architecture: Levels
      • NES: Netflix Edge Services
      • NMTS: Netflix Mid-tier Services
      • NBES: Netflix Back-end Services
      • IAAS: AWS IAAS Services
      • Discovery: helps services discover NMTS and NBES services

  11. Netflix’s Cloud Architecture: Components (NES), Overview
      • Any service that browsers and streaming devices connect to over the internet
      • They sit behind AWS Elastic Load Balancers (a.k.a. ELB)
      • They call clusters at lower levels

  12. Netflix’s Cloud Architecture: Components (NES), Examples
      • API Servers
        • Support the video browsing experience
        • Also allow users to modify their Q
      • Streaming Control Servers
        • Support streaming video playback
        • Authenticate your Wii, PS3, etc.
        • Download DRM to the Wii, PS3, etc.
        • Return a list of CDN URLs to the Wii, PS3, etc.

  13. Netflix’s Cloud Architecture: Components (NMTS), Overview
      • Can call services at the same or lower levels
        • Other NMTS, NBES, IAAS
        • Not NES
      • Exposed through our Discovery service

  14. Netflix’s Cloud Architecture: Components (NMTS), Examples
      • Netflix Queue Servers: modify items in the users’ movie queue
      • Viewing History Servers: record and track all streaming movie watching
      • SIMS Servers: compute and serve user-to-user and movie-to-movie similarities

  15. Netflix’s Cloud Architecture: Components (NBES), Overview
      • A back-end, usually 3rd party, open-source service
      • Leaf in the call tree. Cannot call anything else
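
The calling rules on the NES, NMTS and NBES overview slides can be written down as a small lookup table. The sketch below is illustrative only and not Netflix code: the names `ALLOWED_CALLS`, `may_call` and `first_illegal_hop` are hypothetical, and two entries are assumptions the slides do not state outright (that NES clusters do not call other NES clusters, and that IAAS services call nothing in this model).

```python
# Which levels a cluster at a given level may call, per the overview slides.
ALLOWED_CALLS = {
    # "They call clusters at lower levels" (NES-to-NES is assumed disallowed).
    "NES": {"NMTS", "NBES", "IAAS"},
    # "Can call services at the same or lower levels ... Not NES".
    "NMTS": {"NMTS", "NBES", "IAAS"},
    # "Leaf in the call tree. Cannot call anything else".
    "NBES": set(),
    # Assumption: IAAS services are treated as leaves too.
    "IAAS": set(),
}


def may_call(caller: str, callee: str) -> bool:
    """Return True if a cluster at level `caller` may call level `callee`."""
    return callee in ALLOWED_CALLS[caller]


def first_illegal_hop(path):
    """Return the first (caller, callee) pair that breaks the rules, or None."""
    for caller, callee in zip(path, path[1:]):
        if not may_call(caller, callee):
            return (caller, callee)
    return None


if __name__ == "__main__":
    # A typical user-issued call: edge, two mid-tier hops, then a back end.
    print(first_illegal_hop(["NES", "NMTS", "NMTS", "NBES"]))
    # A mid-tier service calling back up to the edge is not allowed.
    print(first_illegal_hop(["NES", "NMTS", "NES"]))
```

A table like this could back a lint check on service dependencies, although the slides do not say whether Netflix enforced the rules mechanically.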

  16. Netflix’s Cloud Architecture: Components (NBES), Examples
      • Cassandra Clusters: Cassandra, our new cloud database, stores all sorts of data to support application needs
      • Zookeeper Clusters: our distributed lock service and sequence generator
      • Memcached Clusters: typically cache things that we store in S3 but need to access quickly or often

  17. Netflix’s Cloud Architecture: Components (IAAS), Examples
      • AWS S3: large-sized data (e.g. video encodes, application logs, etc.) is stored here, not in Cassandra
      • AWS SQS: Amazon’s message queue, used to send events (e.g. Facebook network updates are processed asynchronously over SQS)

  18. Netflix’s Cloud Architecture: Types of Production Issues
      • A user-issued call will pass through multiple levels during normal operation
      • We are now exposed to multi-system coincident failures, a.k.a. coordinated failures

  19. Netflix’s Cloud Architecture
      Architecture Pros:
      • Horizontally scalable at every level
      • Should give us maximum availability
      • Supports high-velocity development and deployment
      Architecture Cons:
      • A user-issued call will pass through multiple levels (a.k.a. hops) during normal operation
      • Latency can be a concern
      • We are now exposed to multi-system coincident failures, a.k.a. coordinated failures
      • A lot of moving parts
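
To see why multiple hops are listed as a con, it may help to work through how per-hop figures combine. The sketch below assumes that every hop is required, that hops are called one after another, and that hops fail independently. The slides give no numbers, so the values are hypothetical and the function names are made up for illustration. Coordinated failures break the independence assumption, so real behavior can differ from this simple product.

```python
from math import prod


def end_to_end_availability(hop_availabilities):
    """Availability of a call that needs every hop, assuming independent hops."""
    return prod(hop_availabilities)


def end_to_end_latency_ms(hop_latencies_ms):
    """Total latency of a call whose hops are made one after another."""
    return sum(hop_latencies_ms)


if __name__ == "__main__":
    # Hypothetical figures: four hops (NES, NMTS, NMTS, NBES), each 99.9% available.
    availability = end_to_end_availability([0.999] * 4)
    print(f"end-to-end availability: {availability:.4%}")  # a little below 99.9%

    # Hypothetical per-hop latencies in milliseconds.
    print("end-to-end latency (ms):", end_to_end_latency_ms([5, 10, 10, 20]))
```

Under these assumptions each added hop multiplies in another availability factor and adds another latency term, so the end-to-end figures only get worse as the call path grows.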

  20. Issue 1: Capacity Planning

  21. Issue 1
      • Service X and Service Y, each made up of 2 instances, call Service A, also made up of 2 instances
      • If either of these services expects a large increase in traffic, it needs to let the owner of Service A know
      • Service A can then scale up ahead of the traffic increase
      (Diagram: Service A grows from 2 to 6 instances. Disaster avoided?)

  22. Issue 1
      • A given application owner may need to contact 20 other application owners each time he expects to get a large increase in traffic
        • Too much human coordination
      • A few options
        • Some service owners vastly over-provision for their application
          • Not cost effective
        • Auto-scaling
          • We want to generalize the model first proved by our Streaming Control Server (a.k.a. NCCP) team
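
The manual coordination described on the two Issue 1 slides comes down to simple arithmetic that the owner of Service A has to redo whenever a caller announces more traffic. A minimal sketch follows. The request rates, per-instance capacity and headroom factor are hypothetical numbers chosen for illustration, and `instances_needed` is not a Netflix function.

```python
import math


def instances_needed(expected_rps, rps_per_instance, headroom=0.25):
    """Instances required to serve `expected_rps` with some spare headroom.

    `headroom` is the fraction of extra capacity kept in reserve (an assumption).
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return math.ceil(expected_rps * (1 + headroom) / rps_per_instance)


if __name__ == "__main__":
    # Hypothetical forecasts sent to the owner of Service A by its callers.
    forecast_rps = {"X": 400, "Y": 560}
    total = sum(forecast_rps.values())
    print("Service A instances needed:", instances_needed(total, rps_per_instance=200))
```

With 20 callers instead of 2, the forecast table depends on 20 owners each remembering to send an update, which is the human coordination cost that auto-scaling is meant to remove.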

  23. ELB AutoScaling Interlude: How to use an ELB
      • An Elastic Load Balancer (ELB) routes traffic to your EC2 instances
        • e.g. of an ELB: nccp-wii-11111111.us-east-1.elb.amazonaws.com
      • Netflix maps a CNAME to this ELB
        • e.g. nccp.wii.netflix.com
      • Netflix then registers the API Service’s EC2 instances with this ELB
      • The ELB periodically polls attached EC2 instances to ensure the instances are healthy

  24. ELB AutoScaling Interlude: Taking this a bit further
      • The NCCP servers can publish metrics to AWS CloudWatch
      • We can set up an alarm in CloudWatch on a metric (e.g. CPU)
      • We can associate an auto-scale policy with that alarm (e.g. if CPU > 60%, add 3 more instances)
      • When a metric goes above a limit, an alarm is triggered, causing auto-scaling, which grows our pool

  25. ELB AutoScaling Interlude
      (Diagram, as a flow:)
      • NCCP EC2 instances publish CPU data to CloudWatch (Alarms)
      • CloudWatch alarms trigger ASG policies in the Auto Scaling Service (Policies)
      • EC2 instances are added/removed

  26. ELB AutoScaling Interlude
      • Scale Out Event: average CPU > 60% for 5 minutes
      • Scale In Event: average CPU < 30% for 5 minutes
      • Cool Down Period: 10 minutes
      • Auto-Scale Alerts: DLAutoScaleEvents
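
The thresholds above can be tried out in a small offline simulation. This is a sketch under stated assumptions, not the AWS implementation and not Netflix's configuration: it makes no AWS calls, and it interprets "for 5 minutes" as five consecutive one-minute average-CPU samples that all breach the threshold. The 60%, 30%, 5-minute, 10-minute and "add 3 instances" values come from the slides, while `remove_instances` and `min_instances` are hypothetical because the slides do not give them. The CPU samples are fixed inputs, so the model ignores the effect that adding instances would have on CPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    scale_out_above: float = 60.0  # percent CPU, from the slides
    scale_in_below: float = 30.0   # percent CPU, from the slides
    sustain_minutes: int = 5       # from the slides
    cooldown_minutes: int = 10     # from the slides
    add_instances: int = 3         # from the "add 3 more instances" example
    remove_instances: int = 1      # assumption: not stated in the slides
    min_instances: int = 2         # assumption: not stated in the slides


def simulate(cpu_by_minute, start_instances, policy=Policy()):
    """Return the pool size after each one-minute average-CPU sample."""
    instances = start_instances
    last_action = None  # minute of the most recent scaling action
    sizes = []
    for minute in range(len(cpu_by_minute)):
        # The trailing window of samples that the alarm looks at.
        window = cpu_by_minute[max(0, minute - policy.sustain_minutes + 1):minute + 1]
        window_full = len(window) == policy.sustain_minutes
        cooling = last_action is not None and minute - last_action < policy.cooldown_minutes
        if window_full and not cooling:
            if all(cpu > policy.scale_out_above for cpu in window):
                instances += policy.add_instances
                last_action = minute
            elif all(cpu < policy.scale_in_below for cpu in window):
                shrunk = max(policy.min_instances, instances - policy.remove_instances)
                if shrunk != instances:
                    instances = shrunk
                    last_action = minute
        sizes.append(instances)
    return sizes


if __name__ == "__main__":
    # Twenty minutes of sustained 70% CPU, starting from a pool of 2 instances.
    print(simulate([70.0] * 20, start_instances=2))
```

In this model the pool grows once the first five samples have all breached the threshold, then holds through the cool-down period before it can grow again, which illustrates how the cool-down prevents one sustained spike from triggering a scaling action every minute.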

  28. Issue 1: Summary
      • We would like to have auto-scaling at all levels.

  29. Issue 2: Thundering herds to NMTS
