
Detangling complex systems with compassion & production excellence

Liz Fong-Jones (@lizthegrey), #VelocityConf San Jose, June 13, 2019. With illustrations by @emilywithcurls.


  1. Detangling complex systems with compassion & production excellence. Liz Fong-Jones (@lizthegrey), #VelocityConf San Jose, June 13, 2019. With illustrations by @emilywithcurls!

  2. Production is increasingly complex. 2 @lizthegrey at #VelocityConf

  3. There is no 100% uptime.

  4. Our strategies need to evolve.

  5. Co "bought" DevOps.

  6. Ordering the alphabet soup...

  7. Noisy alerts. Grumpy engineers.

  8. Walls of meaningless dashboards.

  9. Incidents take forever to fix.

  10. Everyone bugs the "expert".

  11. Deploys are unpredictable.

  12. There's no time to do projects...

  13. and when there's time, there's no plan.

  14. The team is struggling to hold on.

  15. What's Co missing?

  16. Co forgot who operates systems.

  17. Tools aren't magical.

  18. Invest in people, culture, & process.

  19. Enter the art of Production Excellence.

  20. Make systems more reliable & friendly.

  21. ProdEx takes planning.

  22. Measure and act on what matters.

  23. Involve everyone.

  24. Build everyone's confidence. Encourage asking questions.

  25. How do we get started?

  26. Know when it's too broken.

  27. & be able to debug together when it is.

  28. Eliminate (unnecessary) complexity.

  29. Our systems are always failing.

  30. What if we measure "too broken"?

  31. We need Service Level Indicators.

  32. Think in terms of events in context.

  33. Is this event good or bad?

  34. Are users grumpy? Ask your PM.

  35. What threshold buckets events?

  36. HTTP Code 200? Latency < 300ms?

  37. How many eligible events did we see?

  38. Availability: Good / Eligible Events

  39. Set a target Service Level Objective.

  40. Use a window and target percentage.

  41. 99.9% of events good in past 30 days.

  42. A good SLO barely keeps users happy.

  43. Drive alerting with SLOs.

  44. Is my service on fire?

  45. Error budget: allowed unavailability

  46. How long until I run out?

  47. Page if it's hours. Ticket if it's days.

  48. Data-driven business decisions.

  49. Is it safe to do this risky experiment?

  50. Should we invest in more reliability?

  51. Perfect SLO > Good SLO >>> No SLO

  52. Measure what you can today.

  53. Iterate to meet user needs.

  54. Only alert on what matters.

  55. SLIs & SLOs are only half the picture...

  56. Our outages are never identical.

  57. Failure modes can't be predicted.

  58. Support debugging novel cases. In production.

  59. Allow forming & testing hypotheses.

  60. Dive into data to ask new questions.

  61. Our services must be observable.

  62. Can you examine events in context?

  63. Can you explain the variance?

  64. Can you mitigate impact & debug later?

  65. SLOs and Observability go together.

  66. But they alone don't create collaboration.

  67. Debugging is not a solo activity.

  68. Debugging is for everyone.

  69. Collaboration is interpersonal.

  70. Operations must be sustainable.

  71. We learn better when we document.

  72. Fix hero culture. Share knowledge.

  73. Reward curiosity and teamwork.

  74. Learn from the past. Reward your future self.

  75. Outages don't repeat, but they rhyme.

  76. Risk analysis helps us plan.

  77. Quantify risks by frequency & impact.

  78. Which risks are most significant?

  79. Address risks that threaten the SLO.

  80. Make the business case to fix them.

  81. And prioritize completing the work.

  82. Lack of observability is systemic risk.

  83. So is lack of collaboration.

  84. Season the alphabet soup with ProdEx.

  85. Production Excellence brings teams closer together. Measure. Debug. Collaborate. Fix. lizthegrey.com; @lizthegrey
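
Slides 32 to 41 describe a Service Level Indicator as the ratio of good events to eligible events, checked against a target over a window. The following is a minimal Python sketch of that idea, not code from the talk. The `Event` record, its field names, and the helper functions are hypothetical; the HTTP 200 and 300 ms thresholds come from slide 36 and the 99.9% target from slide 41. The caller is assumed to pass only events from the chosen window (for example, the past 30 days).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    """One request seen by the service (hypothetical record shape)."""
    status: int            # HTTP status code
    latency_ms: float      # time to serve the request
    eligible: bool = True  # assumption: False for traffic users don't care about


def is_good(event: Event, max_latency_ms: float = 300.0) -> bool:
    """Bucket an event as good: HTTP 200 and faster than the latency threshold."""
    return event.status == 200 and event.latency_ms < max_latency_ms


def availability(events: Iterable[Event]) -> float:
    """SLI: good events divided by eligible events."""
    eligible = [e for e in events if e.eligible]
    if not eligible:
        return 1.0  # assumption: no eligible traffic counts as meeting the SLI
    good = sum(1 for e in eligible if is_good(e))
    return good / len(eligible)


def meets_slo(events: Iterable[Event], target: float = 0.999) -> bool:
    """SLO check: is the SLI at or above the target for this window?"""
    return availability(events) >= target
```

For instance, a window with 998 fast HTTP 200 responses and 2 HTTP 500 responses has an availability of 0.998, which falls short of a 99.9% target.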

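Slides 45 to 47 treat the error budget as the unavailability an SLO allows, and suggest alerting on how soon it will run out: page if it is hours, ticket if it is days. The sketch below is one way that rule might look in Python, under stated assumptions. The function names are hypothetical, the burn rate is assumed constant, and the cut-offs (page within 24 hours, ticket within 7 days) are illustrative choices, since the slides give no exact numbers.

```python
import math


def error_budget(eligible_events: float, slo_target: float) -> float:
    """Number of bad events the SLO tolerates in the window."""
    return eligible_events * (1.0 - slo_target)


def hours_until_exhausted(budget_remaining: float, bad_events_per_hour: float) -> float:
    """Time left before the budget is spent, assuming the current burn rate holds."""
    if bad_events_per_hour <= 0:
        return math.inf  # nothing is burning the budget
    return max(budget_remaining, 0.0) / bad_events_per_hour


def alert_action(hours_left: float,
                 page_within_hours: float = 24.0,        # assumed cut-off for "hours"
                 ticket_within_hours: float = 24.0 * 7   # assumed cut-off for "days"
                 ) -> str:
    """Page if exhaustion is hours away, ticket if it is days away, else stay quiet."""
    if hours_left <= page_within_hours:
        return "page"
    if hours_left <= ticket_within_hours:
        return "ticket"
    return "none"
```

With 600 bad events of budget left, burning 100 per hour leaves 6 hours and would page; burning 10 per hour leaves 60 hours and would open a ticket. The same remaining-budget figure can inform the question on slide 49 of whether a risky experiment is affordable.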

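Slides 77 to 79 recommend quantifying risks by frequency and impact, then addressing the ones that threaten the SLO. A rough Python sketch of such a ranking follows. It is an illustration only: the `Risk` fields, the time-based yearly budget, and the 25% share used to flag a risk are all assumptions, and it simplifies by measuring impact in user-weighted minutes rather than events.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Risk:
    """One failure scenario (hypothetical fields for illustration)."""
    name: str
    incidents_per_year: float          # frequency
    minutes_per_incident: float        # time to detect plus time to repair
    fraction_of_users_affected: float  # 0.0 to 1.0

    @property
    def expected_bad_minutes_per_year(self) -> float:
        """Frequency times impact, weighted by the share of users hit."""
        return (self.incidents_per_year
                * self.minutes_per_incident
                * self.fraction_of_users_affected)


def rank_risks(risks: List[Risk]) -> List[Risk]:
    """Most significant risks first."""
    return sorted(risks, key=lambda r: r.expected_bad_minutes_per_year, reverse=True)


def annual_budget_minutes(slo_target: float) -> float:
    """Minutes of full unavailability a yearly SLO would allow."""
    return (1.0 - slo_target) * 365 * 24 * 60


def threatens_slo(risk: Risk, slo_target: float, share: float = 0.25) -> bool:
    """Flag a risk expected to consume more than `share` of the yearly budget."""
    return risk.expected_bad_minutes_per_year > share * annual_budget_minutes(slo_target)
```

As a usage example with made-up numbers, a bad deploy that happens 12 times a year, lasts 30 minutes, and hits half of users costs 180 expected minutes per year. Against a 99.9% target (about 525.6 minutes of yearly budget) that single risk exceeds a quarter of the budget, which helps make the business case on slide 80.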