Preparing for the Unexpected - Samuel Parkinson

  1. Preparing for the Unexpected. Samuel Parkinson (samuel.parkinson@ft.com). #qconlondon #prepfortheunexpected. Photo by Hush Naidoo on Unsplash.

  3. Let’s start with a story

  4. What’s the worst thing that could happen?

  10. *************

  12. The FT.com zone was missing

  14. FT.com has over 5,100 subdomains 😭

  15. This impacted the whole company

  18. 😲

  19. We have never prepared for such an incident

  20. It’s a classic data loss situation

  22. Our provider had a partial backup

  23. But critical records we used for DNS load balancing were missing 👼

  24. About 10 people worked to resolve the incident

  25. And over 30 people were online to follow along

  26. Most were not called, but still volunteered their time

  28. 4h 30m. The first hour was a total outage.

  29. Lack of panic in the moment

  30. It was a slick operation and we recovered

  31. It took restoring from a backup and manual entry to get there

  32. We were focused on recovery, not what happened

  33. People were joining the incident to learn

  36. This is where we are today

  40. Photo by Victor Garcia on Unsplash

  41. Photo by Markus Spiske on Unsplash

  42. 0. How do we do on-call?
      1. Our incident management challenges
      2. Making out-of-hours sustainable
      3. The results and takeaways

  44. Internal Products, FT Core, Enterprise Services, Customer Products, Operations & Reliability, FT Group Products

  45. We are Customer Products

  46. 45 engineers and counting 📉

  47. And we own about 180 systems

  49. Split into 9 teams

  52. Operations monitor our entire estate 24/7

  54. Our systems are a drop in the pond

  55. You build it, you run it

  56. Supporting our systems out-of-hours

  57. This is our approach to DevOps

  58. Our engineers wear many hats. Photo by Joshua Coleman on Unsplash.

  59. We’re putting on our incident management hat

  60. How do we do support out-of-hours?

  61. Our engineers volunteer to be part of the out-of-hours team

  62. We don’t have shifts

  64. Which means we could all be unavailable

  65. What do we care about?

  66. We’re talking about our business capabilities

  68. What is an incident at the FT?

  69. Customer Products has two really important business capabilities

  70. 1. Our users can always read the news

  71. 2. Journalists must be able to publish the news

  72. If either of these goes wrong, we declare an incident

  75. 0. How do we do on-call?
      1. Our incident management challenges
      2. Making out-of-hours sustainable
      3. The results and takeaways

  76. What were our challenges?

  77. We were not immediately productive on call

  78. We were not immediately productive on call: we had an engineering mindset in an operations situation

  79. We were not immediately productive on call: because we don’t have any SRE or DevOps specialists

  80. “I always start with the impact and the comms, they kinda jump in at the Tech.”

  81. We were not immediately productive on call: our incident management process wasn’t second nature

  82. We had very few incidents in the first half of the year

  84. And we were down to 5 people on the out-of-hours support team

  85. So we needed to make the out-of-hours team sustainable

  86. 0. How do we do on-call?
      1. Our incident management challenges
      2. Making out-of-hours sustainable
      3. The results and takeaways

  87. We surveyed engineers about helping out during an incident

  88. There were many people on the fence

  89. There were many people on the fence (chart labels: 7 people, 3 people, 6 people)

  90. And they told us why

  91. “I will need much more confidence in systems and domains knowledge.”

  92. “If I were to have a better understanding of how it works and what I would need to do, I would very likely join.”

  93. We set out to convince people to join our out-of-hours team

  94. We built and ran incident workshops

  95. So our engineers are better prepared to take on incidents

  96. And we wrote a generic runbook for our microservices

  97. So engineers know what they can do and can apply it to our ~180 systems

  98. We set out in the last 6 months of 2019 to address the situation

  99. Building your incident workshop

  100. Building your incident workshop: Don’t Panic!
