
LEARNING TO BEND BUT NOT BREAK AT


  1. LEARNING TO BEND BUT NOT BREAK AT

  2. Whoops, something went wrong… Netflix Streaming Error We’re having trouble playing this title right now. Please try again later or select a different title.

  3. Functional Sharding RPC tuning Shard A Shard B Shard C Client Server Bulkheads & Fallbacks

  4. How to stay up in spite of change and turmoil? How to fail well? How to help teams build more resilient systems? Non-Critical Service Owner. Critical Service Owner. Chaos Engineer.

  5. Service Criticality Driver_free_car.jpg, CC BY-SA 3.0, BP63Vincente 2015, Wikimedia

  6. Service Criticality KPI = Playback Starts Per Second ( SPS ) Service A Service F Service B Service C Service G Non-critical Service D Service E Critical

  7. Non-Critical Service Owner. Critical Service Owner. Chaos Engineer.

  8. Badging

  9. My service is non-critical, who needs Chaos? How do you know your service is non-critical?

  10. https://github.com/Netflix/Hystrix Insights Bulkheads Circuit Breakers Timeouts Fallbacks
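
The circuit breaker and fallback ideas named on this slide can be sketched in a few lines. This is a minimal illustration in Python, not Hystrix's implementation or API: the `CircuitBreaker` class, its parameters, and the `fetch_badges` function are all hypothetical names chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Trips open after consecutive failures and serves the fallback while open."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the cool-down has passed, let a trial call through again.
        return self.clock() - self.opened_at < self.reset_after_s

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # short-circuit: do not touch the dependency
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Usage: a non-critical dependency that is down degrades to an empty result.
def fetch_badges() -> list:
    raise TimeoutError("badging service unavailable")  # simulated outage


breaker = CircuitBreaker(failure_threshold=2)
badges = breaker.call(fetch_badges, fallback=list)  # falls back to []
```

The fallback only helps if callers can actually live with the degraded value, which is the point the Badging slides that follow make.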

  11. Badging Service (Non-Critical) API Service Fallback Badging Service

  12. Surprise! Badging is Critical! API Service Fallback Badging Service

  13. Gaps in Traditional Testing ● Environmental factors may differ between test and production (config, data, etc.) ● Systems behave differently under load than they do in a single unit or integration test ● Users react differently to failures than you expect.

  14. How to fail well? ● Functioning fallbacks. ● Use Chaos to close gaps in traditional testing methods. Non-Critical Service Owner. Critical Service Owner. Chaos Engineer.

  15. Critical Service Owner. Non-Critical Service Owner. Chaos Engineer.

  16. Protect your service (and your customers)

  17. How can I decrease the blast radius of failures? How about functional sharding!

  18. Playback Service Architecture API Service API Service API Service Playback Service URL Service

  19. CRITICAL NON-CRITICAL Customer Experience or Streaming Performance Impact Impact

  20. Playback Service Functional Shards API Service API Service API Service Critical Playback Service Non-Critical Playback Service Critical URL Service Non-Critical URL Service
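
Functional sharding as drawn here amounts to routing each request to a cluster chosen by what the request is for, not by who sent it. The sketch below assumes a made-up classification of request types; the names `CRITICAL_REQUEST_TYPES`, `Request`, `pick_shard`, and the shard names are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass

# Hypothetical classification: request types assumed to affect playback starts.
CRITICAL_REQUEST_TYPES = {"start_playback", "license"}


@dataclass(frozen=True)
class Request:
    request_type: str
    customer_id: int


def pick_shard(request: Request) -> str:
    """Route by function, so trouble in one shard stays out of the other."""
    if request.request_type in CRITICAL_REQUEST_TYPES:
        return "playback-critical"
    return "playback-noncritical"


# Usage: a prefetch request lands on the non-critical shard.
shard = pick_shard(Request(request_type="prefetch", customer_id=42))
```

Because the routing key is the request's function, a surge or failure in non-critical traffic is confined to the non-critical shard.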

  21. CC BY-NC 2.5, Randall Munroe, xkcd.com

  22. Experimenting with Shards API Service API Service API Service Critical Playback Service Non-Critical Playback Service Non-Critical URL Service URL Service

  23. Customer Behavior Insights API Service API Service API Service Critical Playback Service Non-Critical Playback Service 25% More Traffic Non-Critical URL Service URL Service

  24. How do I confirm my system is tuned properly? Inject latency, of course!

  25. Dependency Tuning Playback Service → Customer Tag Service ● Retries ● Timeouts ● Load balancing strategies ● Concurrency limits ● Circuit breakers
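
Of the tuning knobs listed, a concurrency limit is the easiest to show in isolation. The following is a rough sketch of a semaphore-based bulkhead, assuming calls beyond the limit should be rejected immediately instead of queued; `Bulkhead` and its parameter are hypothetical names, not an API from the talk or from Hystrix.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class Bulkhead:
    """Caps concurrent calls to one dependency; excess calls get the fallback."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        # Non-blocking acquire: reject right away when every slot is busy.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return primary()
        finally:
            self._slots.release()


# Usage: at most 10 in-flight calls to the (hypothetical) tag lookup.
tag_bulkhead = Bulkhead(max_concurrent=10)
tags = tag_bulkhead.call(lambda: ["kids"], fallback=list)
```

Rejecting early keeps a slow dependency from tying up every caller thread, at the cost of serving more fallbacks under load.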

  26. Calendar*, CC BY 2.0, Dafne Cholete 2011, Flickr

  27. Playback Service → Customer Tag Service Playback Service Customer Tag Service

  28. Latency Injection - Round 1 Playback Service Customer Tag Service

  29. Latency Injection - Round 2 Playback Service Customer Tag Service

  30. Latency Injection - Round 2 300ms timeout Playback Service 350ms Out of time!! 1. Customer Tag Service 2. URL Service
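
One reading of this slide is that two sequential downstream calls together took 350 ms against a 300 ms timeout. The arithmetic can be checked with a small helper. Only the 300 ms and 350 ms totals come from the slide; the per-call split (250 ms + 100 ms) and the function names are invented for illustration.

```python
from typing import Sequence


def overrun_ms(caller_timeout_ms: int, sequential_call_ms: Sequence[int]) -> int:
    """How far the summed latency of sequential calls exceeds the caller's timeout."""
    return max(0, sum(sequential_call_ms) - caller_timeout_ms)


def fits_budget(caller_timeout_ms: int, sequential_call_ms: Sequence[int]) -> bool:
    """True when the sequential calls finish within the caller's timeout."""
    return overrun_ms(caller_timeout_ms, sequential_call_ms) == 0


# Hypothetical split: Customer Tag Service with injected latency, then URL Service.
calls_ms = [250, 100]
print(fits_budget(300, calls_ms))  # False
print(overrun_ms(300, calls_ms))   # 50
```

Each call can sit under its own timeout while the sequence still blows the caller's budget, which is why per-dependency tuning alone does not settle the question.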

  31. Continuous Experimentation FTW! ● Fewer changes between experiments make it easier to isolate the regression. ● Fine-grained experiments scope the investigation (as opposed to outages where there are lots of red-herrings) .

  32. How to stay up in spite of change and turmoil? ● Functional sharding for fault isolation. ● Tune RPC calls. ● Use Chaos to validate config and resiliency strategies. Critical Service Owner. Non-Critical Service Owner. Chaos Engineer.

  33. Chaos Engineer. Non-Critical Service Owner. Critical Service Owner.

  34. How do you help teams build more resilient systems? We need to do more of the heavy lifting. Perhaps the Principles of Chaos can help!

  35. Principles of Chaos ● Minimize Blast Radius ● Build a Hypothesis around Steady State Behavior ● Vary Real-world Events ● Run Experiments in Production ● Automate Experiments to Run Continuously https://principlesofchaos.org/

  36. Test v. Production Rock-em, CC BY-SA 2.0, Ariel Waldmane 2009, Flickr

  37. How can we Minimize Blast Radius? Safety, safety, safety!!

  38. Kill Switch

  39. Canary Strategy Service A Service B Service C 0.5% 0.5% Service B (Control) Service B (Experiment)
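
A canary split like the one on this slide (0.5% of traffic to a control cluster, 0.5% to an experiment cluster) can be approximated by hashing a stable key into a bucket. This is a sketch under assumptions: the slide does not say how traffic is selected, and `assign_group`, its parameters, and the use of a customer ID as the key are hypothetical.

```python
import hashlib


def assign_group(customer_id: str, experiment_id: str, fraction: float = 0.005) -> str:
    """Deterministically place a customer in control, experiment, or baseline."""
    digest = hashlib.sha256(f"{experiment_id}:{customer_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < fraction:
        return "control"
    if bucket < 2 * fraction:
        return "experiment"
    return "baseline"


# Usage: the same customer always gets the same answer for a given experiment.
group = assign_group(customer_id="customer-123", experiment_id="latency-api-prod")
```

Keeping control and experiment the same size means their error rates can be compared directly, and salting the hash with the experiment ID avoids picking the same customers for every experiment.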

  40. Limit Impact Runs In Progress: Experiment | Cluster | Status; Latency | api-prod | In Progress; Latency | dredd-prod | In Progress; Failure | api-prod | Queued

  41. Limit When Experiments can Run Safety First during the Holidays

  42. Ensure Failures are Addressed

  43. Fail Open 1. Control errors too high. 2. Errors in chaos code unrelated to the experiment in question. 3. Platform components crashing (monitoring, worker nodes, etc).
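
The three fail-open conditions can be read as a single stop check that an experiment runner evaluates while an experiment is in flight. The sketch below is hypothetical throughout: the `ExperimentHealth` fields, the `should_fail_open` name, and the 5% control error threshold are assumptions, since the slide gives no numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentHealth:
    control_error_rate: float  # errors / requests in the control group
    chaos_code_errors: int     # errors raised by the chaos tooling itself
    platform_healthy: bool     # monitoring, worker nodes, etc. are up


def should_fail_open(health: ExperimentHealth,
                     max_control_error_rate: float = 0.05) -> bool:
    """Stop injecting and pass traffic through if any fail-open condition holds."""
    return (
        health.control_error_rate > max_control_error_rate  # 1. control errors too high
        or health.chaos_code_errors > 0                     # 2. errors in chaos code
        or not health.platform_healthy                      # 3. platform components crashing
    )


# Usage: an elevated control error rate ends the experiment.
stop = should_fail_open(ExperimentHealth(0.12, 0, True))
```

All three conditions describe situations where the experiment can no longer produce a trustworthy result, so the safe default is to stop affecting customers.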

  44. How should we Build a Hypothesis around Steady State? Observability is key! Add effective monitoring, analysis, and insights.

  45. Insights

  46. Automated Canary Analysis (ACA) https://medium.com/netflix-techblog/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69

  47. ChAP ACA Configurations Validate the experiment itself Validate the real-time monitoring didn’t miss anything Check for service failures even if they didn’t cause an impact in KPIs See if your service is approaching an unhealthy state

  48. How do you Vary Real-world Events in an automated fashion? By carefully designing and prioritizing your experiments, of course!

  49. Understand the Service Under Test Dependency Insights: ● Timeouts ● Retries ● % of Requests Involved ● Requests Per Second ● Latency ● Hystrix Commands ○ Fallbacks ○ Timeouts

  50. Evaluate Safety NOT SAFE TO FAIL!!!

  51. Can more automation eventually lead to fewer experiments?

  52. Prioritize Experiments Retries Traffic Percentage Failure Latency Experiment Type Aging
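
The criteria on this slide (retries, traffic percentage, experiment type, aging) suggest a scoring function over candidate experiments. The slide does not give the formula, so the sketch below is only one plausible shape: the `Candidate` fields, the `priority` weighting, and the example values are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    dependency: str
    experiment_type: str       # "failure" or "latency"
    traffic_percentage: float  # share of cluster requests touching the dependency
    retries: int               # configured retries on the call
    days_since_last_run: int   # aging: how stale the last result is


def priority(c: Candidate) -> float:
    """Hypothetical score: more traffic, more retries, and staler results rank higher."""
    return c.traffic_percentage * (1 + c.retries) + c.days_since_last_run


def prioritize(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Return candidates ordered from highest to lowest priority."""
    return sorted(candidates, key=priority, reverse=True)


# Usage with made-up candidates.
queue = prioritize([
    Candidate("tag-lookup", "failure", 10.0, 0, 30),
    Candidate("license", "latency", 80.0, 1, 0),
])
```

The aging term is what keeps low-traffic dependencies from being starved: their score grows every day they go untested.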

  53. Generate Experiments Failure Failure Latency Latency

  54. Is it time to Run Experiments in Production? Here we go!

  55. What happened? 14 Vulnerabilities 0 Outages Tooling Gaps Confidence

  56. Example Finding Playback Service No Fallback! 376 ms Latency License Service

  57. 88.85% of cluster traffic Circuit Breaker Timeouts 10 threads Thread Pool Rejections

  58. Fully validated fix in tool before rollout!

  59. After a day's worth of data, the results are looking fantastic. Every negative metric [for that Hystrix command] had a drastic improvement, and some by an order of magnitude. --Robert Reta, Playback Licensing

  60. What else can be safer?

  61. How do you help teams build more resilient systems? ● Apply the “Principles of Chaos” to tooling. ● Manage the heavy lifting. Chaos Engineer. Non-Critical Service Owner. Critical Service Owner.

  62. You Must be This Tall to Ride?

  63. How to stay up in spite of change and turmoil? ● Functional sharding for fault isolation. ● Tune RPC calls. ● Use Chaos to validate config and resiliency strategies. How to fail well? ● Functioning fallbacks. ● Use Chaos to close gaps in traditional testing methods. How to help teams build more resilient systems? ● Apply the “Principles of Chaos” to tooling. ● Manage the heavy lifting. Non-Critical Service Owner. Critical Service Owner. Chaos Engineer.

  64. You Can Either Curl Up In A Ball And Die… Or You Can Stand Up And Say, “We’re Different. We’re The Strong Ones, And You Can’t Break Us!” Haley Tucker Senior Software Engineer Chaos Engineering @hwilson1204
