Pitfalls in Measuring SLOs


  1. Pitfalls in Measuring SLOs Danyel Fisher @fisherdanyel

  2. An Outage Danyel Fisher @fisherdanyel

  3. Danyel Fisher @fisherdanyel

  4. Danyel Fisher @fisherdanyel

  5. What do you do when things break? How bad was this break? Danyel Fisher @fisherdanyel

  6. Danyel Fisher @fisherdanyel

  7. We need to improve quality! Build new features! Danyel Fisher @fisherdanyel

  8. Management: How broken is “too broken”? Engineering: What does “good enough” mean? Clients and Users: Combatting alert fatigue Danyel Fisher @fisherdanyel

  9. A telemetry system produces events that correspond to real-world use. We can describe some of these events as eligible. We can describe some of them as good. Danyel Fisher @fisherdanyel

  10. Given an event, is it eligible? Is it good? Eligible: “Had an HTTP status code” Good: “... that was a 200, and was served under 500 ms” Danyel Fisher @fisherdanyel
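
To make slides 9–10 concrete, here is a minimal sketch of the eligible/good classification. The 200 status and 500 ms threshold come from the slide; the dict-shaped events and the field names status_code and duration_ms are assumptions for illustration.

    # Sketch: classify a telemetry event as eligible and/or good (assumed dict-shaped events).
    def is_eligible(event: dict) -> bool:
        # Eligible: the event had an HTTP status code at all.
        return event.get("status_code") is not None

    def is_good(event: dict) -> bool:
        # Good: an eligible event that returned 200 and was served in under 500 ms.
        return (
            is_eligible(event)
            and event["status_code"] == 200
            and event.get("duration_ms", float("inf")) < 500
        )

    events = [
        {"status_code": 200, "duration_ms": 320},   # eligible and good
        {"status_code": 200, "duration_ms": 900},   # eligible, but too slow
        {"status_code": 500, "duration_ms": 120},   # eligible, bad status
        {"duration_ms": 50},                        # no status code: not eligible
    ]
    eligible = [e for e in events if is_eligible(e)]
    good = [e for e in eligible if is_good(e)]
    print(len(good), "/", len(eligible), "eligible events were good")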

  11. Danyel Fisher @fisherdanyel

  12. Danyel Fisher @fisherdanyel

  13. Minimum quality ratio over a period of time. Number of bad events allowed. Danyel Fisher @fisherdanyel
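
The two framings on slide 13 are equivalent: a minimum quality ratio over a window implies a fixed number of bad events allowed in that window (the error budget). A small illustration; the 99.9% target and the event counts are example numbers, not figures from the talk.

    # Error budget: a target ratio over a window implies an allowed count of bad events.
    target = 0.999                  # example: 99.9% of eligible events must be good
    eligible_events = 1_000_000     # example: eligible events seen in the window

    allowed_bad = int(eligible_events * (1 - target))
    print(f"Budget for the window: {allowed_bad} bad events")   # -> 1000

    observed_bad = 750              # example measurement
    remaining = allowed_bad - observed_bad
    print(f"Budget remaining: {remaining} events ({remaining / allowed_bad:.0%})")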

  14. Deploy faster Room for experimentation Opportunity to tighten SLO Danyel Fisher @fisherdanyel

  15. We always store incoming user data: 99.99% (~4.3 minutes of budget). Default dashboards usually load in < 1 s: 99.9% (45 minutes). Queries often return in < 10 s: 99% (7.3 hours). Danyel Fisher @fisherdanyel
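
The budget column on slide 15 is just arithmetic on the window length. A sketch assuming a 30-day window; the exact figures depend on the window length chosen, so they differ slightly from the slide's rounding.

    # Downtime budget for an availability target over a 30-day window (assumed window length).
    WINDOW_MINUTES = 30 * 24 * 60   # 43,200 minutes

    for target in (0.9999, 0.999, 0.99):
        budget_min = WINDOW_MINUTES * (1 - target)
        if budget_min < 60:
            print(f"{target:.2%}: ~{budget_min:.1f} minutes of budget")
        else:
            print(f"{target:.2%}: ~{budget_min / 60:.1f} hours of budget")
    # 99.99% -> ~4.3 minutes, 99.9% -> ~43.2 minutes, 99% -> ~7.2 hours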

  16. User Data Throughput We blew through three months’ budget in those 12 minutes. Danyel Fisher @fisherdanyel

  17. We dropped customer data Danyel Fisher @fisherdanyel

  18. We dropped customer data We rolled it back (manually) We communicated to customers We halted deploys Danyel Fisher @fisherdanyel

  19. We checked in code that didn’t build. We had experimental CI build wiring. Our scripts deployed empty binaries. There was no health check and rollback. Danyel Fisher @fisherdanyel

  20. We stopped writing new features We prioritized stability We mitigated risks Danyel Fisher @fisherdanyel

  21. SLOs allowed us to characterize what went wrong, how badly it went wrong, and how to prioritize repair Danyel Fisher @fisherdanyel

  22. Learning from SLOs Danyel Fisher @fisherdanyel

  23. Final point A one-line description of it Danyel Fisher @fisherdanyel

  24. ● Design Thinking ● Expressing and Viewing ● Burndown Alerts and Responding ● Learning from our Experiences ● Success Stories Danyel Fisher @fisherdanyel

  25. Design Thinking and Task Analysis Understand user goals and needs Learn from informants and experts Collaborate with internal team Collect feedback and ideas externally Danyel Fisher @fisherdanyel

  26. Displays and Views Danyel Fisher @fisherdanyel

  27. See where the burndown was happening, explain why, and remediate Danyel Fisher @fisherdanyel

  28. Expressing SLOs. Event based: “How many events had a duration < 500 ms” Time based: “How many 5 minute periods had a P95(duration) < 500 ms” Danyel Fisher @fisherdanyel
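
The two expressions on slide 28 count different things over the same telemetry. A sketch of both, assuming events carry a timestamp in seconds (ts) and a duration_ms field; the field names are illustrative.

    from collections import defaultdict

    # Event-based SLI: fraction of events with duration < 500 ms.
    def event_based_sli(events):
        good = sum(1 for e in events if e["duration_ms"] < 500)
        return good / len(events)

    # Time-based SLI: fraction of 5-minute periods whose P95(duration) < 500 ms.
    def time_based_sli(events, period_s=300):
        buckets = defaultdict(list)
        for e in events:
            buckets[int(e["ts"] // period_s)].append(e["duration_ms"])
        good_periods = 0
        for durations in buckets.values():
            durations.sort()
            p95 = durations[int(0.95 * (len(durations) - 1))]  # simple nearest-rank P95
            if p95 < 500:
                good_periods += 1
        return good_periods / len(buckets)

    events = [{"ts": i * 10, "duration_ms": 100 + (i % 7) * 120} for i in range(200)]
    print(event_based_sli(events), time_based_sli(events))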

  29. How do we express SLOs? Good events Bad events How often Time range Danyel Fisher @fisherdanyel

  30. How do we express SLOs? Good events Bad events How often Time range Danyel Fisher @fisherdanyel

  31. How do we express SLOs? Eligible: $name is “run_trigger_detailed” Good: $app.error does not exist Good events Bad events How often Time range Danyel Fisher @fisherdanyel

  32. How do we express SLOs? Good events Bad events How often Time range Danyel Fisher @fisherdanyel

  33. How do we express SLOs? Good events Bad events How often Time range Danyel Fisher @fisherdanyel
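
Pulling slides 29–33 together: an SLO needs a definition of eligible and good events, a target ratio (“how often”), and a time range. A minimal sketch of such a spec, using the eligible/good expressions from slide 31; the dict structure and function names are my own illustration, not a real product API.

    # Hypothetical SLO spec mirroring the four ingredients on the slides.
    slo_spec = {
        "name": "run_trigger_detailed succeeds",
        "eligible": lambda e: e.get("name") == "run_trigger_detailed",   # $name is "run_trigger_detailed"
        "good": lambda e: "app.error" not in e,                          # $app.error does not exist
        "target": 0.999,          # "how often": required good/eligible ratio
        "window_days": 30,        # "time range": rolling window the ratio is judged over
    }

    def compliance(events, spec):
        eligible = [e for e in events if spec["eligible"](e)]
        if not eligible:
            return None
        good = sum(1 for e in eligible if spec["good"](e))
        return good / len(eligible)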

  34. Status of an SLO Danyel Fisher @fisherdanyel

  35. How have we done? Danyel Fisher @fisherdanyel

  36. Danyel Fisher @fisherdanyel

  37. Where did it go? Danyel Fisher @fisherdanyel

  38. When did the errors happen? Danyel Fisher @fisherdanyel

  39. When did the errors happen? Danyel Fisher @fisherdanyel

  40. What went wrong? High dimensional data High cardinality data Danyel Fisher @fisherdanyel

  41. Why did it happen? Danyel Fisher @fisherdanyel

  42. Why did it happen? Danyel Fisher @fisherdanyel

  43. Why did it happen? Danyel Fisher @fisherdanyel

  44. See where the burndown was happening, explain why, and remediate Danyel Fisher @fisherdanyel

  45. User Feedback “The Bubble Up in the SLO page is really powerful at highlighting what is contributing the most to missing our SLIs, it has definitely confirmed our assumptions.” Danyel Fisher @fisherdanyel

  46. User Feedback “Your customers have to be happy... we have to have an understanding of the customer experience. … To the millisecond we knew what our percentage was of success versus failure.” -Josh Hull, Site Reliability Engineering Lead, Clover Health Danyel Fisher @fisherdanyel

  47. User Feedback “The historical SLO chart also confirms a fix for a performance issue we did greatly contributed to the SLO compliance by showing a nice upward trend line. :)” Danyel Fisher @fisherdanyel

  48. User Feedback “I’d love to drive alerts off our SLOs. Right now we don’t have anything to draw us in and have some alerts on the average error rate but they’re a little spiky to be useful. It would be great to get a better sense of when the budget is going and define alerts that way.” Danyel Fisher @fisherdanyel

  49. Burndown Alerts Danyel Fisher @fisherdanyel

  50. How is my system doing? Am I over budget? When will my alarm fail? Danyel Fisher @fisherdanyel

  51. When will I fail? User goal: get alerts on time to budget exhaustion, in human-digestible units. 24 hours: “I’ll take a look in the morning” 4 hours: “All hands on deck!” Danyel Fisher @fisherdanyel
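
A burn alert built around slide 51's question can be sketched as: estimate the recent burn rate, divide the remaining budget by it, and page on human-digestible horizons. The 24-hour and 4-hour thresholds come from the slide; the function names, signatures, and example numbers are illustrative assumptions.

    # Sketch: estimate hours until the error budget is exhausted and pick an alert level.
    def hours_to_exhaustion(budget_remaining: float, bad_events_last_hour: float) -> float:
        """budget_remaining: bad events still allowed in the current window."""
        if bad_events_last_hour <= 0:
            return float("inf")          # not burning at all right now
        return budget_remaining / bad_events_last_hour

    def alert_level(hours_left: float) -> str:
        if hours_left <= 4:
            return "page: all hands on deck!"
        if hours_left <= 24:
            return "notify: take a look in the morning"
        return "ok"

    print(alert_level(hours_to_exhaustion(budget_remaining=120, bad_events_last_hour=10)))  # 12 h -> notify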

  52. Danyel Fisher @fisherdanyel

  53. How is my system doing? Am I over budget? When will my alarm fail? Danyel Fisher @fisherdanyel

  54. Implementing Burn Alerts Run a 30 day query at a 5 minute resolution every minute Danyel Fisher @fisherdanyel
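
Slide 54 reads as a simple evaluation loop: every minute, recompute the 30-day SLO at 5-minute granularity and feed the result to the burn-alert check. A sketch of that loop; run_slo_query and check_burn_alert are placeholders for the real query backend and alerting code (assumptions), while the 30-day / 5-minute / every-minute numbers come from the slide.

    import time

    RESOLUTION_S = 5 * 60          # 5-minute buckets
    WINDOW_S = 30 * 24 * 3600      # 30-day rolling window
    EVAL_INTERVAL_S = 60           # re-evaluate every minute

    def burn_alert_loop(run_slo_query, check_burn_alert):
        """run_slo_query(window_s, resolution_s) -> per-bucket good/eligible counts (assumed shape)."""
        while True:
            buckets = run_slo_query(WINDOW_S, RESOLUTION_S)
            check_burn_alert(buckets)   # e.g. compute remaining budget and time to exhaustion
            time.sleep(EVAL_INTERVAL_S)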

  55. Danyel Fisher @fisherdanyel

  56. Caching is Fun! Danyel Fisher @fisherdanyel

  57. Fun with Caching Vital to cache results … but not incomplete results … … at what resolution of cache? Danyel Fisher @fisherdanyel

  58. Flappy Alerts: “It’ll expire at 3:55” “Wait, make that 4:05” “Nope, 3:55 again!” (We added a 10%ish buffer.) Danyel Fisher @fisherdanyel
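
One reading of slide 58's “10%ish buffer”: only update (or re-fire) the predicted exhaustion time when the new estimate moves outside a band around the previous one, so small oscillations between 3:55 and 4:05 don't flap the alert. This is a reconstruction of the idea under that assumption, not the actual implementation.

    # Hysteresis sketch: ignore small wobbles in the predicted time-to-exhaustion.
    BUFFER = 0.10   # ~10% band around the last reported estimate

    def should_update(previous_hours: float, new_hours: float) -> bool:
        if previous_hours == float("inf"):
            return new_hours != float("inf")
        change = abs(new_hours - previous_hours) / previous_hours
        return change > BUFFER

    print(should_update(4.0, 4.1))   # ~2.5% move: keep the old estimate, no flapping
    print(should_update(4.0, 3.0))   # 25% move: update and (re)alert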

  59. Recovering from Bankruptcy A failure a month ago brought us to -169% and still hasn’t aged out? That means we don’t get alerts anymore Customer workaround: delete and re-create the SLO, thus blowing the cache Danyel Fisher @fisherdanyel

  60. Learning from Experience Danyel Fisher @fisherdanyel

  61. Volume is important Tolerate at least dozens of bad events per day Danyel Fisher @fisherdanyel
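
Slide 61's rule of thumb is easy to sanity-check arithmetically: the daily budget is traffic times (1 - target), so low-volume services need looser targets to leave “dozens” of allowed failures. The traffic volumes and targets below are made-up examples to show the arithmetic.

    # Daily bad-event budget = daily eligible events * (1 - target).
    for events_per_day, target in [(1_000_000, 0.9999), (100_000, 0.999), (5_000, 0.999)]:
        budget = events_per_day * (1 - target)
        print(f"{events_per_day:>9,} events/day at {target:.2%}: ~{budget:.0f} bad events allowed/day")
    # 1M @ 99.99% -> ~100/day; 100k @ 99.9% -> ~100/day; 5k @ 99.9% -> ~5/day (too tight to be useful)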

  62. Faults

  63. SLOs for Customer Service Remember that user having a bad day? Danyel Fisher @fisherdanyel

  64. Blackouts are easy … but brownouts are much more interesting Danyel Fisher @fisherdanyel

  65. Danyel Fisher @fisherdanyel

  66. Timeline 1:29 am SLO alerts. “Maybe it’s just a blip” 1.5% brownout for 20 minutes Danyel Fisher @fisherdanyel

  67. Timeline 1:29 am SLO alerts. “Maybe it’s just a blip” 4:21 am Minor incident. “It might be an AWS problem” 6:25 am SLO alerts again. “Could it be ALB compat?” 9:55 am “Why is our system uptime dropping to zero?” It’s out of memory We aren’t alerting on that crash Danyel Fisher @fisherdanyel

  68. Timeline 1:29 am SLO alerts. “Maybe it’s just a blip” 1.5% brownout for 20 minutes 4:21 am Minor incident. “It might be an AWS problem” Danyel Fisher @fisherdanyel

  69. Timeline 1:29 am SLO alerts. “Maybe it’s just a blip” 1.5% brownout for 20 minutes 4:21 am Minor incident. “It might be an AWS problem” 6:25 am SLO alerts again. “Could it be ALB compat?” Danyel Fisher @fisherdanyel

  70. Timeline 1:29 am SLO alerts. “Maybe it’s just a blip” 1.5% brownout for 20 minutes 4:21 am Minor incident. “It might be an AWS problem” 6:25 am SLO alerts again. “Could it be ALB compat?” 9:55 am “Why is our system uptime dropping to zero?” It’s out of memory We aren’t alerting on that crash Danyel Fisher @fisherdanyel

  71. Timeline 1:29 am SLO alerts. “Maybe it’s just a blip” 4:21 am Minor incident. “It might be an AWS problem” 6:25 am SLO alerts again. “Could it be ALB compat?” 9:55 am “Why is our system uptime dropping to zero?” It’s out of memory We aren’t alerting on that crash Danyel Fisher @fisherdanyel

  72. Timeline 1:29 am SLO alerts. “Maybe it’s just a blip” 1.5% brownout for 20 minutes 4:21 am Minor incident. “It might be an AWS problem” 6:25 am SLO alerts again. “Could it be ALB compat?” 9:55 am “Why is our system uptime dropping to zero?” It’s out of memory We aren’t alerting on that crash 10:32 am Fixed Danyel Fisher @fisherdanyel

  73. Danyel Fisher @fisherdanyel

  74. Danyel Fisher @fisherdanyel
