
Chaos Kong: Endowing Netflix with Antifragility (Luke Kosewski)


  1. Chaos Kong: Endowing Netflix with Antifragility. Luke Kosewski, Traffic & Chaos Engineering

  2. This is a Case Study We’ll Be Doing TOGETHER

  3. This is What AWS Failover Looks Like us-west-2 us-east-1 eu-west-1

  4. Failover is Run By This Guy A Traffic Engineer

  5. Failover is Run By This Guy A Traffic Engineer

  6. A Traffic Engineer’s Environment ● Netflix control plane

  7. A Traffic Engineer’s Environment ● Netflix control plane ● Primarily in 3 AWS regions (EU, us-east-1, us-west-2)

  8. A Traffic Engineer’s Environment ● Netflix control plane ● Primarily in 3 AWS regions (EU, us-east-1, us-west-2) ● They look like this: us-west-2 us-east-1 eu-west-1

  9. Traffic’s Teammates traffic@netflix.com / chaos@netflix.com Intuition Traffic Chaos Justin Reynolds Niosha Behnam & myself (management) Lorin Hochstein, Aaron Blohowiak & Ali Basiri Casey Rosenthal

  10. Our Relationship Chaos Traffic Flow High Availability

  11. Storytime with Luke Once upon a time... (August 2013)

  12. 3 SREs at Netflix

  13. 3 SREs at Netflix 10s of services

  14. 3 SREs at Netflix 10s of services 100s of devs

  15. Disaster

  16. Active-Active

  17. Opportunity

  18. Flow

  19. Fail Out of US-East-1: Case Study ➢ Outage! ➢ Scaling-up ➢ Proxying ➢ DNS design and cutover ➢ Improvisation

  20. Fail Out of US-East-1: Case Study ➢ Outage! ➢ Scaling-up ➢ Proxying ➢ DNS design and cutover ➢ Improvisation

  21. January 14, 2016 Stream Starts per Second – us-east region

  22. January 14, 2016 Stream Starts per Second – us-east region

  23. Fail Out of US-East-1: Case Study ➢ Outage! ➢ Scaling-up ➢ Proxying ➢ DNS design and cutover ➢ Improvisation

  24. Diurnal Scaling

  25. Y’all Ready for This?

  26. What to Scale?

  27. What to Scale? ● Anything absorbing incoming traffic

  28. What to Scale? ● Anything absorbing incoming traffic ● Large stateless services

  29. What to Scale? ● Anything absorbing incoming traffic ● Large stateless services ● Required stateful services (carefully)

  30. That’s Better

  31. How to Scale?

  32. Two More Fallbacks ● “Time of Day” estimation

  33. Two More Fallbacks ● “Time of Day” estimation ● largest observed value in the last 24h as an intercept
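The two fallbacks named on slides 32 and 33 can be sketched roughly as follows. This is only one reading of the slide text, not the talk's actual implementation: the `Sample` shape, the function names, the 10-minute tolerance, and the order in which the fallbacks are tried are all assumptions made for illustration. The sketch treats "Time of Day" estimation as "reuse the value observed at this time yesterday" and treats the 24-hour maximum as a conservative last resort.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Hypothetical sample shape: (timestamp, observed load such as requests per second).
Sample = Tuple[datetime, float]


def time_of_day_estimate(
    history: List[Sample],
    now: datetime,
    tolerance: timedelta = timedelta(minutes=10),
) -> Optional[float]:
    """Return the value observed closest to this time yesterday, if any."""
    target = now - timedelta(hours=24)
    near = [
        (abs(ts - target), value)
        for ts, value in history
        if abs(ts - target) <= tolerance
    ]
    if not near:
        return None
    # Pick the sample whose timestamp is nearest to the target time.
    return min(near, key=lambda pair: pair[0])[1]


def max_last_24h(history: List[Sample], now: datetime) -> Optional[float]:
    """Return the largest value observed in the last 24 hours, if any."""
    day = timedelta(hours=24)
    window = [v for ts, v in history if timedelta(0) <= now - ts <= day]
    return max(window) if window else None


def fallback_estimate(history: List[Sample], now: datetime) -> Optional[float]:
    """Try the time-of-day estimate first, then the 24h maximum (assumed order)."""
    estimate = time_of_day_estimate(history, now)
    if estimate is not None:
        return estimate
    return max_last_24h(history, now)
```

Under these assumptions the 24-hour maximum over-provisions rather than under-provisions, which is the safer error when a region is about to absorb another region's traffic.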

  34. How Much?

  35. Ooze

  36. Nimble

  37. Fail Out of US-East-1: Case Study ➢ Outage! ➢ Scaling-up ➢ Proxying ➢ DNS design and cutover ➢ Improvisation

  38. What do I mean by that?

  39. Why We Proxy Stream Starts per Second - EU

  40. How do We Proxy? ● Archaius dynamic properties – regionally scoped ● Zuul proxy with dynamic filters (Groovy)
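As a rough illustration of the idea on slide 40, a regionally scoped property can drive a per-request routing decision at the edge. The sketch below is plain Python and does not use the Archaius or Zuul APIs; the `RegionalProperties` class, the `failover.proxy.target` property name, and the `route` function are all hypothetical names chosen for this example.

```python
from typing import Dict, Optional


class RegionalProperties:
    """Toy stand-in for a dynamic property store scoped by region (hypothetical)."""

    def __init__(self) -> None:
        self._values: Dict[str, Dict[str, str]] = {}

    def set(self, region: str, name: str, value: str) -> None:
        self._values.setdefault(region, {})[name] = value

    def clear(self, region: str, name: str) -> None:
        self._values.get(region, {}).pop(name, None)

    def get(self, region: str, name: str) -> Optional[str]:
        return self._values.get(region, {}).get(name)


# Hypothetical property name; not taken from the talk.
PROXY_TARGET = "failover.proxy.target"


def route(local_region: str, props: RegionalProperties) -> str:
    """Return the region that should serve a request arriving in local_region.

    If a proxy target is set for the local region, the edge forwards the
    request there; otherwise the request is served locally.
    """
    target = props.get(local_region, PROXY_TARGET)
    if target and target != local_region:
        return target
    return local_region
```

Because the property is scoped to one region, setting it for us-east-1 would leave requests arriving in the other regions untouched, and clearing it would return us-east-1 to serving its own traffic.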

  41. Fail Out of US-East-1: Case Study ➢ Outage! ➢ Scaling-up ➢ Proxying ➢ DNS design and cutover ➢ Improvisation

  42. Traditional DNS

  43. Netflix’s DNS as a DB

  44. Failover

  45. Are We Done? Stream Starts per Second - EU

  46. Nope Stream Starts per Second - EU

  47. Fail Out of US-East-1: Case Study ➢ Outage! ➢ Scaling-up ➢ Proxying ➢ DNS design and cutover ➢ Improvisation

  48. Recap: Proxying

  49. Recap: Proxying

  50. The Crowd Goes Wild Stream Starts per Second - EU

  51. This is What Success Feels Like

  52. Positive Feedback Loop The more we practice, the better and more daring we get

  53. Other Takeaways

  54. Thank You and Questions Luke Kosewski – luke@netflix.com Traffic & Chaos Engineering

  55. Summary of NFLX github/techblog links
      Active/Active
      ● http://techblog.netflix.com/2013/12/active-active-for-multi-regional.html
      ● http://techblog.netflix.com/2016/03/global-cloud-active-active-and-beyond.html
      Archaius
      ● https://github.com/Netflix/archaius
      ● http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html
      Zuul
      ● https://github.com/Netflix/zuul
      ● http://techblog.netflix.com/2013/06/announcing-zuul-edge-service-in-cloud.html
      SPS
      ● http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html
