
Learning From Failure | @_pkill | indeedhi.re/2wKa2Mm

What would catastrophic failure look like in your organization? Try and picture this.

Tightly coupled: Systems that are tightly coupled


  1. Our cognitive biases are useful adaptations, but they often lead us astray during incident response. You don’t have to eliminate them, but be aware of them.

  2. Availability heuristic: Relying only on the ideas that come to mind when making decisions in uncertain situations.

  3. Focusing effect: The tendency to place too much importance on one aspect of an event.

  4. Illusory correlation: Inaccurately perceiving a relationship between two unrelated events.

  5. Confirmation bias: The tendency to search for, interpret, focus on or discard evidence in a way that confirms one's preconceptions.

  6. The incident lifecycle revisited
     [Diagram. Stage labels: Detection, Diagnosis, Mitigation, Recovery, Cleanup, Retro, Prevention. Annotations: "Responder joined later", "New symptom emerges", "Dev helped with identifying DNS issue in Slack".]

  7. The Retrospective Process

  8. The Retrospective Process: Learning from incidents
     Retrospective start.
     [Process diagram. Stage labels: Debriefings, Synthesis & Analysis, Retro report compiled, Retro report released, Remediation meetings, Urgent remediations addressed.]

  9. The Retrospective Process: Learning from incidents
     Testimony is most accurate within two weeks of return to normalization.
     [Process diagram, as on slide 8.]

  10. Debriefing

  11. Debrief attendees
      + Debrief facilitator
      + Debrief facilitator trainee
      + Scribe
      + Incident owner
      + Incident participants
      + Retrospective owner
      + Subject matter experts

  12. Qualities of debrief facilitators
      + Impartial: Not involved in the incident
      + Curious: Asks questions
      + Attentive: Listens
      + Respectful: Improves psychological safety
      + Thorough: Captures all relevant testimony
      + Patient: Mediates heated debate
      + Uses shared language: Sufficiently technical

  13. Debrief agenda
      1. Facilitator reviews the timeline
      2. Facilitator interviews attendees
      3. Call for clarifying questions

  14. What questions should a facilitator ask?

  15. What was happening just before the incident? During the incident?

  16. Was there a call for assistance? How was it known who to contact?

  17. How could this incident have been worse?

  18. How did we arrive at the decision to turn off the healthchecking in the load balancer?

  19. Debriefing tips
      + Start debriefs as soon as possible
      + Before the debrief
      + Send out questions to participants
      + Assess the comfort level of participants
      + Commit someone to scribe or record
      + Conduct 1:1 debriefs if necessary

  20. The Retrospective Process: Learning from incidents
      Interviews, narratives, contributing factors, latent threats, impact, remediation items.
      [Process diagram, as on slide 8.]

  21. Avoid counterfactuals
      + “...made a mistake by…”
      + “The developer carelessly…”
      + “... suboptimal decision-making...”
      + “... should have been obvious…”
      + “Could have prevented the outage…”
      + “... failed to verify the change...”
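One way to apply the list above when drafting a retro report is to scan the text for this kind of wording before review. The following is a minimal sketch, not something from the talk: the pattern list is an assumption derived from the slide's examples, and `find_counterfactuals` is a hypothetical helper name.

```python
import re

# Assumed pattern list, loosely based on the phrases on the slide.
COUNTERFACTUAL_PATTERNS = [
    r"made a mistake",
    r"carelessly",
    r"suboptimal",
    r"should have",
    r"could have",
    r"failed to",
]
_PATTERN = re.compile("|".join(COUNTERFACTUAL_PATTERNS), re.IGNORECASE)


def find_counterfactuals(text):
    """Return (line_number, matched_phrase) pairs for flagged wording."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            hits.append((number, match.group(0).lower()))
    return hits


draft = (
    "The deploy began at 14:02.\n"
    "The developer should have verified the change."
)
print(find_counterfactuals(draft))  # [(2, 'should have')]
```

A flagged line is only a prompt to rephrase in terms of what actually happened and what information was available at the time; simple pattern matching will miss many counterfactuals and flag some harmless sentences.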

  22. Root cause analysis is a fairy tale

  23. Root cause is also an imprecise concept.

  24. So many choices... Wikipedia’s definition of root cause:
      1. Initiating cause
      2. Most basic cause
      3. Earliest cause
      4. Deepest cause

  25. Wikipedia’s definition of root cause:
      1. Initiating: Non-critical healthcheck dependency commit?
      2. Most basic: Filesystem exhaustion on build server?
      3. Earliest: The Big Bang??
      4. Deepest: The Human Condition???

  26. Root cause analysis is too narrow in scope to maximize learning.

  27. Root cause analysis is too narrow in scope to maximize learning. It leaves important contributions unexplored.

  28. Root cause analysis is not blame-aware.

  29. The Five Whys is also problematic

  30. [Diagram: a chain of five "Why?" steps.] Is the root cause hiding here somewhere?

  31. [Diagram: the chain of five "Why?" steps, with two regions labeled "Universe of other contributions".]

  32. Fixating on root cause is an easy trap to fall into.

  33. Causal analysis and diagnosis are supremely important activities.

  34. What should we do instead? Locate contributing factors.

  35. Contributing factors
      + Artifact publishing script didn’t handle a certain exception
      + Builder used non-atomic filesystem writes
      + Filesystem filled up to 100%
      + Non-critical healthcheck dependency marked as REQUIRED
      + No fail-open pool in the DNS traffic director
      + Corrupt data artifact loaded into webapp without verification
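Two of the factors above, non-atomic filesystem writes and loading an artifact without verification, have well-known mitigations. The sketch below illustrates them under stated assumptions: it is not the builder or webapp code from the incident, and `write_atomically` and `load_verified` are hypothetical names.

```python
import hashlib
import os
import tempfile


def write_atomically(path, data):
    """Write bytes so that readers never observe a partially written file.

    Returns the SHA-256 hex digest of the data for later verification.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Rename over the destination; atomic on POSIX filesystems.
        os.replace(tmp_path, path)
    except BaseException:
        # For example, a full disk: drop the partial temp file and leave
        # any previously published artifact untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return hashlib.sha256(data).hexdigest()


def load_verified(path, expected_sha256):
    """Read an artifact and refuse to return it if its digest differs."""
    with open(path, "rb") as f:
        data = f.read()
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise ValueError(
            f"digest mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
    return data
```

In this sketch the publisher would record the digest returned by `write_atomically` alongside the artifact, and the consumer would pass it to `load_verified` before using the data. How the digest is distributed is left open here.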

  36. The Retrospective Process: Learning from incidents
      Write report and assemble deliverables.
      [Process diagram, as on slide 8.]

  37. Retrospective deliverables
      1. Contributing factors
      2. Remaining threats
      3. Remediation items
      4. Command line history
      5. Chat transcripts
      6. Graphs
      7. Retrospective report

  38. The Retrospective Process: Learning from incidents
      Promote this material far and wide in your organization. Add this to your incident library.
      [Process diagram, as on slide 8.]

  39. The Retrospective Process: Learning from incidents
      These happen on the team level. This is where remediation owners are determined.
      [Process diagram, as on slide 8.]

  40. Remediation meetings
      + Execution is team dependent
      + Dive deep into the retrospective report
      + Assign owners for remediation items
      + Discuss finer points of the contributing factors
      + Can continue in perpetuity

  41. We don’t deeply know our systems.

  42. System as imagined vs. system as found

  43. System as imagined vs. system as found
      + System as imagined: urgency: "Weak: Failure of this dependency would result in minor functionality loss"
      + System as found: urgency: "Required: Failure of this dependency would result in complete system outage"
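The gap between the two urgency declarations above can be made concrete with a small model of how a healthcheck might aggregate dependency status. This is a sketch under assumptions, not the healthcheck framework from the talk: `Urgency`, `DependencyStatus`, and `instance_is_healthy` are hypothetical names, and the rule that only REQUIRED dependencies can fail the check is assumed.

```python
from dataclasses import dataclass
from enum import Enum


class Urgency(Enum):
    WEAK = "weak"          # failure means minor functionality loss
    REQUIRED = "required"  # failure means the instance cannot serve


@dataclass(frozen=True)
class DependencyStatus:
    name: str
    urgency: Urgency
    healthy: bool


def instance_is_healthy(statuses):
    """Report healthy unless a REQUIRED dependency is failing."""
    return all(s.healthy for s in statuses if s.urgency is Urgency.REQUIRED)


# System as imagined: the failing dependency is declared WEAK.
imagined = [
    DependencyStatus("database", Urgency.REQUIRED, healthy=True),
    DependencyStatus("recommendations", Urgency.WEAK, healthy=False),
]
# System as found: the same dependency is declared REQUIRED.
found = [
    DependencyStatus("database", Urgency.REQUIRED, healthy=True),
    DependencyStatus("recommendations", Urgency.REQUIRED, healthy=False),
]

print(instance_is_healthy(imagined))  # True
print(instance_is_healthy(found))     # False
```

With the same failure, a single changed declaration takes the instance out of rotation. The dependency names are invented for illustration.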

  44. Failure: The best opportunity to gain an understanding about how our systems behave is through failure.

  45. Chaos testing: Test in ALL environments with the goal of validating your hypothesis. Discovering things you didn’t know about your systems is a consequence.

  46. Failure Myth #5: Safety can be measured by the number of accidents that occur
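A hypothesis-driven experiment of this kind can be sketched in a few lines. The example below is an illustration only: it assumes a toy traffic director with a fail-open rule (one of the gaps listed among the contributing factors), and `select_pool` and `run_experiment` are hypothetical names rather than any real tool's API.

```python
def select_pool(health):
    """Return the backends to route to, given a name -> healthy mapping.

    Assumed fail-open rule: if no backend reports healthy, route to all
    of them rather than to none.
    """
    healthy = sorted(name for name, ok in health.items() if ok)
    if healthy:
        return healthy
    return sorted(health)


def run_experiment(baseline, injected):
    """Overlay injected health results on the baseline and pick a pool."""
    perturbed = {**baseline, **injected}
    return select_pool(perturbed)


baseline = {"web-1": True, "web-2": True, "web-3": True}

# Hypothesis 1: losing one backend removes only that backend.
assert run_experiment(baseline, {"web-2": False}) == ["web-1", "web-3"]

# Hypothesis 2: a fleet-wide healthcheck failure does not empty the pool.
all_failing = {name: False for name in baseline}
assert run_experiment(baseline, all_failing) == ["web-1", "web-2", "web-3"]
```

Each assertion states the hypothesis before the failure is injected. A failed assertion would be the discovery the slide describes: something about the system that was not known beforehand.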
