
The Art of SLOs: In the midst of chaos, there is also opportunity



  1. The Art of SLOs: "In the midst of chaos, there is also opportunity reliability" (Sun Tzu, The Art of War) https://cre.page.link/art-of-slos-slides

  2. Welcome! Don't be shy… say hello to your neighbours

  3. Group Agreements ⁄ We’re here to learn ⁄ Please ask questions (raise your hand) ⁄ One speaker at a time ⁄ Assume positive intent ⁄ “Why am I speaking?”

  4. Agenda ⁄ Terminology ⁄ Why your services need SLOs ⁄ Spending your error budget ⁄ Choosing a good SLI ⁄ Developing SLOs and SLIs

  5. Service Level Indicator: A quantifiable measure of service reliability

  6. Service Level Objectives: Set a reliability target for an SLI

  7. Users? Customers? Customers are users who directly pay for a service

  8. Services Need SLOs

  9. Don't believe us? "Since introducing SLOs, the relationship between our operations and development teams has subtly but markedly improved." (Ben McCormack, Evernote; The Site Reliability Workbook, Chapter 3) "... it is difficult to do your job well without clearly defining well. SLOs provide the language we need to define well." (Theo Schlossnagle, Circonus; Seeking SRE, Chapter 21)

  10. The most ➊ important feature of any system is its reliability

  11. Developers (Agility) vs. Operators (Stability): How do you incentivize reliability?

  12. A principled way to agree on the desired reliability of a service

  13. What does "reliable" mean? Think about Netflix, Google Search, Gmail, Twitter… How do you tell if they are ‘working’?

  14. Objective Agreement [figure: a customer issues “HTTP GET / …”; a latency axis marked 0 ms, 200 ms and 300 ms, annotated “Ugh”]

  15. With me so far?

  16. When do we need to make a service more reliable?

  17. 100%: "100% is the wrong reliability target for basically everything" (Benjamin Treynor Sloss, VP 24x7, Google; Site Reliability Engineering, Introduction)

  18. 😢😌 SLOs should capture the performance and availability levels that, if barely met, would keep the typical customer of a service happy. “meets SLO targets” ⇒ “happy customers”; “sad customers” ⇒ “misses SLO targets”

  19. Measure SLI achieved & try to be slightly over target... [figure labels: SLO, SLI, Target]

  20. …but don’t be too much better or users will depend on it! [figure labels: SLI, Target; comic: "Workflow", Randall Munroe, XKCD. Source: https://xkcd.com/1172/]

  21. Error Budgets: An SLO implies an acceptable level of unreliability. This is a budget that can be allocated.

  22. Implementation Mechanics: Evaluate SLO performance over a set window, e.g. 28 days. Remaining budget drives prioritization of engineering effort.

  23. ITIL Approximation: Service in SLO → most operational work is a standard change. Service close to being out of SLO → revert to normal change. (No, I don't understand the difference between "standard" and "normal" either…)

  24. What should we spend our error budget on?

  25. Error budgets can accommodate ⁄ releasing new features ⁄ expected system changes ⁄ inevitable failure in hardware, networks, etc. ⁄ planned downtime ⁄ risky experiments

  26. Benefits of error budgets ⁄ Common incentive for devs and SREs: Find the right balance between innovation and reliability ⁄ Dev team can manage the risk themselves: They decide how to spend their error budget ⁄ Unrealistic reliability goals become unattractive: These goals dampen the velocity of innovation ⁄ Dev team becomes self-policing: The error budget is a valuable resource for them ⁄ Shared responsibility for system uptime: Infrastructure failures eat into the error budget

  27. Still with me?

  28. Activity: Reliability Principles

  29. Dear Colleagues, The negative press from our recent outage has convinced me that we all need to take the reliability of our services more seriously. In this open letter, I want to lay down three reliability principles to guide your future decision making.

  30. The first principle concerns our users. We let them down, but they deserve better. They deserve to be happy when using our services! Our business must ...
      1. ... rebuild user trust by making a financial commitment to reliability.
      2. ... find ways to help our users tolerate or enjoy future outages.
      3. ... meet our users' expectations of reliability before building features.
      4. ... build the features that make our users happy faster.
      5. ... never suffer another outage, ever again!

  31. The second principle concerns the way we build our services. We have to change our development process to incorporate reliability. Our business must...
      1. … choose to fail fast and catch errors early through rapid iteration.
      2. … have Ops engage in the design of new features to reduce risk.
      3. … only release new features publicly when they are shown to be reliable.
      4. … build and release software in small, controlled steps.
      5. … reduce feature iteration speed when our systems are unreliable.

  32. The third principle concerns our operational practices. What we're doing today isn't working. Our Ops teams are burned out and our incident rate is too high. We have to do things differently to improve! Our business must...
      1. … share responsibility for reliability between Ops and Dev teams.
      2. … tie operational response and team priorities to a reliability goal.
      3. … make our systems more resilient to failure to cut operational load.
      4. … give Ops a veto on all releases to prevent failures reaching our users.
      5. … route negative complaints on Twitter directly to Ops pagers.

  33. To put these principles into practice, we are going to borrow some ideas from Google! The next step is to define some SLOs for our services and begin tracking our performance against them. Thanks for reading! Eleanor Exec, CEO

  34. Break!

  35. Choosing a Good SLI

  36.

  37. [figure: unhappy users plotted over time]

  38. [figure: two plots of a metric over time, labelled BAD and GOOD]

  39. [figure: BAD and GOOD metric plots over time] BAD: Variance obscures metric deterioration

  40. [figure: BAD and GOOD metric plots over time] GOOD: Metric deterioration correlates with outage

  41. [figure: BAD and GOOD metric plots over time] BAD (?): Metric provides poor signal-to-noise ratio. GOOD (✓): Metric provides good signal-to-noise ratio.

  42. SLI SLO

  43. SLI: (good events / valid events) × 100%

  44. 3–5 SLIs* (*per user journey)

  45. SLI SLO

  46. What performance does the business need?

  47. User expectations are strongly tied to past performance

  48. Continuous ? Improvement

  49. Information overload?

  50. Developing SLOs and SLIs

  51. ?

  52. Our Game: Fang Faction [architecture diagram: Load Balancer, Web Server, API Server, Game Servers, User Profiles, Leaderboards, Leaderboard Generation]

  53. [mock-up of https://fangfactiongame.com/profile/someuser] SomeUser's Profile. Faction Name: Tribe of Frog. Leader Name: SomeUser. Email Address: user@example.com. Faction Score: 31337. [Update button] Midwest Canyon: 1. Tri-Bool 65535; 2. Tri Repetae 61995; 3. Triassic Five 52391; 4. Tricksy Hobbits 37164; 5. Tribe of Frog 31337; 6. Trite Examples 29243

  54. Loading a Profile Page [architecture diagram: Load Balancer, Web Server, API Server, Game Servers, User Profiles, Leaderboards, Leaderboard Generation]
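
The SLI ratio on slide 43 and the error-budget mechanics on slides 21 and 22 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something taken from the deck: the names `WindowCounts`, `sli_percent`, `allowed_bad_events` and `budget_remaining_fraction` are hypothetical, the 99.9% target and the event counts are made up, and treating a window with no valid events as 100% is an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Event counts for one SLO evaluation window (for example, 28 days)."""
    good_events: int
    valid_events: int


def sli_percent(counts: WindowCounts) -> float:
    """SLI = good events / valid events x 100%."""
    if counts.valid_events == 0:
        # Assumption: with no valid events nothing failed, so report 100%.
        return 100.0
    return 100.0 * counts.good_events / counts.valid_events


def allowed_bad_events(counts: WindowCounts, slo_percent: float) -> float:
    """Error budget for the window, expressed as a number of bad events."""
    return counts.valid_events * (100.0 - slo_percent) / 100.0


def budget_remaining_fraction(counts: WindowCounts, slo_percent: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, <= 0 is exhausted."""
    budget = allowed_bad_events(counts, slo_percent)
    bad = counts.valid_events - counts.good_events
    if budget == 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 1.0 if bad == 0 else float("-inf")
    return 1.0 - bad / budget


if __name__ == "__main__":
    # Hypothetical numbers: one million valid requests, 500 of them bad.
    window = WindowCounts(good_events=999_500, valid_events=1_000_000)
    target = 99.9  # hypothetical SLO target, in percent
    print(f"SLI: {sli_percent(window):.3f}%")
    print(f"Budget remaining: {budget_remaining_fraction(window, target):.1%}")
```

With these made-up numbers the SLI is about 99.95% against a 99.9% target, so roughly half of the window's error budget is still available to spend on the kinds of work listed on slide 25.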
