Designing Services for Resilience Experiments: Lessons from Netflix


  1. Designing Services for Resilience Experiments: Lessons from Netflix Nora Jones, Senior Chaos Engineer @nora_js

  7. So, how can teams design services for resilience testing?
     ● Failure Injection Enabled
     ● RPC enabled
     ● Fallback Paths
       ○ And ways to discover them
     ● Proper monitoring
       ○ Key business metrics to look for
     ● Proper timeouts
       ○ And ways to discover them

  8. Known Ways to Increase Confidence in Resilience

  9. Known Ways to Increase Confidence in Resilience ● Unit Tests

  10. Known Ways to Increase Confidence in Resilience ● Integration Tests

  11. New Ways to Increase Confidence in Resilience ● Chaos Experiments

  12. SPS: Key Business Metric

  13. Chaos Engineering: Netflix’s ChAP 100% API Personalization

  14. Chaos Engineering: Netflix’s ChAP 98% Gateway API Personalization 1% API Control

  16. Chaos Engineering: Netflix’s ChAP 98% Gateway API Personalization 1% API Control 1% API Exp
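The traffic split on the ChAP slides (a small control slice, an equally small experiment slice, and the rest left alone) can be sketched as deterministic bucketing. This is a minimal illustration, not ChAP's actual routing: the function names, the hash-based bucketing, and the use of a request key are all assumptions; only the 1% / 1% split comes from the slides.

```python
import hashlib

# Percentages taken from the slides: 1% control, 1% experiment, rest untouched.
CONTROL_PCT = 1
EXPERIMENT_PCT = 1


def assign_cluster(request_key: str) -> str:
    """Deterministically map a request key to 'control', 'experiment', or 'baseline'.

    Hypothetical helper: hashing the key gives a stable bucket in 0..99, so the
    same key always lands in the same cluster for the length of an experiment.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < CONTROL_PCT:
        return "control"
    if bucket < CONTROL_PCT + EXPERIMENT_PCT:
        return "experiment"
    return "baseline"


def split_counts(keys):
    """Count how many keys land in each cluster."""
    counts = {"control": 0, "experiment": 0, "baseline": 0}
    for key in keys:
        counts[assign_cluster(key)] += 1
    return counts
```

Keeping control and experiment the same size makes their metrics directly comparable, and failures are injected only into the experiment slice.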

  18. Monitoring

  19. Monitoring SHORTED
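One way to read the monitoring slides is that the key business metric (SPS, stream starts per second) is compared between the control and experiment clusters, and the experiment is stopped early when the experiment cluster falls too far behind. The sketch below assumes that reading; the function name, the mean-based comparison, and the 5% threshold are hypothetical, not values from the talk.

```python
def should_short(control_sps, experiment_sps, max_relative_drop=0.05):
    """Return True when experiment SPS drops too far below control SPS.

    control_sps / experiment_sps: equally sized lists of SPS samples taken over
    the same time window. max_relative_drop is an assumed threshold.
    """
    if not control_sps or len(control_sps) != len(experiment_sps):
        raise ValueError("need equally sized, non-empty samples")
    control_mean = sum(control_sps) / len(control_sps)
    experiment_mean = sum(experiment_sps) / len(experiment_sps)
    if control_mean == 0:
        return False  # nothing to compare against
    drop = (control_mean - experiment_mean) / control_mean
    return drop > max_relative_drop
```

Comparing against a live control cluster rather than a fixed baseline keeps normal daily traffic swings from triggering a stop.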

  20. 1. Have Failure Injection Testing Enabled.

  21. Sample Failure Injection Library https://github.com/norajones/FailureInjectionLibrary

  22. Types of Chaos Failures
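The slides do not list the failure types, so the sketch below assumes the two kinds most commonly injected: failing a call outright and adding latency to it. It is a self-contained illustration and does not reproduce the API of the sample library linked above; `InjectionRule`, `InjectedFailure`, and `call_with_injection` are hypothetical names.

```python
import random
import time
from dataclasses import dataclass


class InjectedFailure(Exception):
    """Raised instead of making the real call when a failure is injected."""


@dataclass
class InjectionRule:
    target: str                 # name of the dependency to affect (hypothetical)
    kind: str                   # "failure" or "latency"
    probability: float = 1.0    # fraction of matching calls to affect
    delay_seconds: float = 0.0  # only used when kind == "latency"


def call_with_injection(target, func, rules, rng=random.random, sleep=time.sleep):
    """Run func(), first applying any injection rule that matches target."""
    for rule in rules:
        if rule.target != target or rng() >= rule.probability:
            continue
        if rule.kind == "failure":
            raise InjectedFailure(f"injected failure for {target}")
        if rule.kind == "latency":
            sleep(rule.delay_seconds)  # delay, then fall through to the real call
    return func()
```

Passing `rng` and `sleep` as parameters keeps the wrapper testable without real randomness or real waiting.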

  24. Criteria & API

  25. Automating Creation of Chaos Experiments

  26. 2. Have Good Monitoring in Place for Configuration Changes.

  32. Have Good Monitoring in Place
      ● RPC Enabled
        ○ Associated Hystrix Commands
          ■ Associated Fallbacks
      ● Timeouts
      ● Retries
      ● All in One Place!

  33. RPC/Ribbon
      ● Java library managing REST clients to/from different services
      ● Fast failing/fallback capability

  34. RPC/Ribbon Timeouts

  35. RPC Timeouts At what point does the service give up?

  36. Retries Immediately retrying an operation after it fails is usually not a great idea.

  37. Retries Understand the logic between your timeouts and your retries.
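A common alternative to immediate retries is exponential backoff with jitter, and the interaction with timeouts becomes visible once the worst-case total time is written down. The sketch below is an illustration under assumed defaults (base delay, cap, and function names are hypothetical), not the retry policy used at Netflix.

```python
import random


def backoff_delays(max_retries, base_delay=0.1, max_delay=2.0, rng=random.random):
    """Full-jitter backoff: the wait before retry i is uniform in
    [0, min(max_delay, base_delay * 2**i)] seconds."""
    return [
        rng() * min(max_delay, base_delay * (2 ** attempt))
        for attempt in range(max_retries)
    ]


def worst_case_seconds(per_attempt_timeout, max_retries, base_delay=0.1, max_delay=2.0):
    """Upper bound on total time: every attempt times out and every wait
    takes its maximum value."""
    attempts = 1 + max_retries
    waits = sum(min(max_delay, base_delay * (2 ** a)) for a in range(max_retries))
    return attempts * per_attempt_timeout + waits
```

With a 1 second per-attempt timeout and 2 retries, the worst case is three timed-out attempts plus the backoff waits, which is the number a caller's own timeout has to be compared against.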

  38. Circuit Breakers/Fallback Paths

  39. Hystrix Commands/Fallback Paths If your service is non-critical, ensure that there are fallback paths in place.

  40. Fallback Strategies Static Content Fallback Cache Service

  41. Fallback Strategies Know what your fallback strategy is and how to get that information.
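A fallback chain can be sketched as below. It assumes the strategies on the previous slide include serving from a cache and serving static content; the function name and the `(value, source)` return shape are hypothetical. Returning the source is one way to make the fallback path discoverable in monitoring.

```python
def call_with_fallback(primary, cache, key, static_default):
    """Try the primary call; on failure fall back to the cache, then to static content.

    primary: callable taking key and returning a value (may raise).
    cache: dict-like store of previously successful responses.
    Returns (value, source) where source is 'primary', 'cache', or 'static'.
    """
    try:
        value = primary(key)
        cache[key] = value  # remember the last good response for later fallbacks
        return value, "primary"
    except Exception:
        if key in cache:
            return cache[key], "cache"
        return static_default, "static"
```

Counting how often each source is returned shows whether a fallback is actually being exercised, which is what a chaos experiment on that dependency would check.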

  42. 3. Ensure Synergy between Hystrix Timeouts, RPC timeouts, and retry logic.
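One concrete reading of "synergy" is that the timeout wrapping a call should not be shorter than the time the RPC layer needs to exhaust its retries; otherwise the later retries can never complete. The check below sketches that rule. The field names are hypothetical and are not real Hystrix or Ribbon property names, and the rule itself is an assumption about what the slide means.

```python
from dataclasses import dataclass


@dataclass
class ClientConfig:
    name: str
    hystrix_timeout_ms: int  # timeout on the wrapping command (hypothetical field)
    rpc_timeout_ms: int      # timeout of a single RPC attempt (hypothetical field)
    max_retries: int         # retries after the first attempt
    has_fallback: bool


def find_antipatterns(cfg):
    """Return human-readable findings for one client configuration."""
    findings = []
    worst_case_rpc_ms = cfg.rpc_timeout_ms * (1 + cfg.max_retries)
    if cfg.hystrix_timeout_ms < worst_case_rpc_ms:
        findings.append(
            f"{cfg.name}: wrapping timeout {cfg.hystrix_timeout_ms} ms is shorter "
            f"than worst-case RPC time {worst_case_rpc_ms} ms"
        )
    if not cfg.has_fallback:
        findings.append(f"{cfg.name}: no fallback configured")
    return findings
```

For example, a 1000 ms wrapping timeout around a 500 ms RPC timeout with 2 retries is flagged, because three attempts can take up to 1500 ms.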

  43. ChAP’s Monocle

  46. There isn’t always money in microservices

  47. Criticality Score

  48. Criticality Score RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score
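The formula on this slide can be computed directly once an RPS range bucket is defined. The bucket boundaries below are hypothetical, since the slides do not give the real ranges; only the product of the three factors comes from the source.

```python
import bisect

# Hypothetical RPS range boundaries; the slides do not state the real buckets.
RPS_BUCKET_BOUNDS = [10, 100, 1000, 10000]


def rps_bucket(rps):
    """Return a 1-based bucket index: 1 for rps < 10, up to 5 for rps >= 10000."""
    return bisect.bisect_right(RPS_BUCKET_BOUNDS, rps) + 1


def criticality_score(rps, retries, hystrix_commands):
    """RPS range bucket * number of retries * number of Hystrix commands."""
    return rps_bucket(rps) * retries * hystrix_commands
```

With these assumed buckets, `criticality_score(250, 2, 3)` returns 18 (bucket 3 × 2 retries × 3 commands). Because the score is a plain product, a dependency with zero retries or zero Hystrix commands scores 0 under this formula.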

  52. Chaos Success Stories

  53. “We ran a chaos experiment which verifies that our fallback path works and it successfully caught an issue in the fallback path and the issue was resolved before it resulted in any availability incident!”

  55. “While [failing calls] we discovered an increase in license requests for the experiment cluster even though fallbacks were all successful. ...This likely means that whoever was consuming the fallback was retrying the call, causing an increase in license requests.”

  56. Don’t lose sight of your company’s customers.

  57. Takeaways
      ● Designing for resiliency testability is a shared responsibility.
      ● Configuration changes can cause outages.
      ● Have explicit monitoring in place on antipatterns in configuration changes.

  58. Questions? @nora_js
