heretical resilience

heretical resilience (to repair is human), Ryn Daniels


  1. heretical resilience (to repair is human) Ryn Daniels - @rynchantress, QCon New York 2018

  3. my side of the story AKA: A Dramatic blargh Retelling of The Time I Nearly Broke Etsy Dot Com

  5. apache versions

  6. apache versions

  9. blargh

  10. blargh

  14. blargh

  15. blargh

  26. blargh

  27. blargh

  28. The Post-mortem
      aka: What the heck actually just happened?

  29. The Post-mortem
      aka: What the heck actually just happened?
      aka: what did we learn?

  30. how did the site stay up?

  33. Lesson 1: Always keep 7 servers out of config management, just in case.

  34. Lesson 1: Consider fallbacks for automation

  35. distrusting your automation
      • How will you detect problems?
      • How easily can you test your automation?
      • Can you turn the automation off?
      • Do you remember how to do the thing manually?

  36. How did we respond so fast?

  38. blargh

  39. Lesson 2: Create a Slack Team in charge of maintaining a proper amount of slack in case of incidents.

  40. Lesson 2: maintain adaptive capacity

  41. twiddling your thumbs
      • How do people ask each other for help?
      • Which teams have more or less slack?
      • What happens after work gets rearranged?

  42. what couldn't we see?

  47. Lesson 3: Buy a couple botnets to DDoS your monitoring tools every now and then.

  48. Lesson 3: understand the dependencies in your tooling

  49. watching the world burn
      • What do your monitoring/automation/orchestration tools depend on?
      • Who watches the watchers?
      • How do you communicate internally and externally?
      • Do you have backup tools?

  50. what actually went wrong with chef?

  52. Lesson 4: Always label your dragons.

  53. Lesson 4: make informed decisions about which yaks to shave.

  54. choosing your yaks wisely
      • Which teams have sufficient slack?
      • Can a problem be avoided if not solved?
      • What are the tradeoffs and opportunity costs?
      • Who has the precision yak razors?

  55. who digs into the weird things?

  56. Lesson 4.5: Hire the person who created the primary language your site is written in. (This always scales.)

  57. Lesson 4.5: Develop depth of inter-team relationships

  58. finding your own rasmus
      • Which areas only have one (or two) people who understand them?
      • How is information shared within your organization?
      • What behaviors are rewarded?

  59. what happened afterwards?

  61. Lesson 5: Give people ill-fitting clothing when they mess up.

  62. Lesson 5: encourage organizational learning

  63. a warning to others
      • How do people respond to incidents?
      • What happens after an incident?
      • How are remediation items prioritized?
      • What happens to the bandaid solutions?

  65. technology can be robust.* only humans can be resilient.
      *for some already-known, pre-defined subset of problems

  67. 1. understand your automation
      2. maintain adaptive capacity
      3. know your dependencies
      4. build cross-team relationships
      5. always be learning

  69. Thank you!
