
DATA-DRIVEN POSTMORTEMS ILAN RABINOVITCH, DATADOG @IRABINOVITCH



  1. DATA-DRIVEN POSTMORTEMS ILAN RABINOVITCH, DATADOG @IRABINOVITCH

  2. $ finger ilan@datadog [datadoghq.com] Name: Ilan Rabinovitch Role: Director, Technical Community Interests: * Monitoring and Metrics * Large scale web operations * FL/OSS Community Events

  3. Datadog Overview • SaaS-based infrastructure and app monitoring • Open Source Agent • Time series data (metrics and events) • Processing nearly a trillion data points per day • Intelligent Alerting • We’re hiring! (www.datadoghq.com/careers/)

  4. “THE PROBLEMS WE WORK ON AT DATADOG ARE HARD AND OFTEN DON'T HAVE OBVIOUS, CLEAN-CUT SOLUTIONS, SO IT'S USEFUL TO CULTIVATE YOUR TROUBLESHOOTING SKILLS, NO MATTER WHAT ROLE YOU WORK IN.” Internal Datadog Developer Guide

  5. “THE ONLY REAL MISTAKE IS THE ONE FROM WHICH WE LEARN NOTHING.” - Henry Ford

  6. POSTMORTEM “AN ANALYSIS OR DISCUSSION OF AN EVENT HELD SOON AFTER IT HAS OCCURRED, ESPECIALLY IN ORDER TO DETERMINE WHY IT WAS A FAILURE.” OXFORD ENGLISH DICTIONARY

  7. WHAT IS DEVOPS? ▸ Culture ▸ Automation ▸ Metrics ▸ Sharing DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES

  8. OUR FOCUS AREA ▸ Culture ▸ Sharing DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES

  9. BLAMELESS POSTMORTEMS

  10. CULTURE & SHARING RESOURCES BLAMELESS POSTMORTEMS ▸ Blameless Postmortems by John Allspaw http://bit.ly/etsy-blameless ▸ The Human Side of Postmortems by Dave Zwieback http://bit.ly/human-postmortem

  11. CULTURE & SHARING ARE GREAT, BUT WHAT ABOUT METRICS?

  12. Follow @honest_update on Twitter

  13. COLLECTING DATA IS CHEAP; NOT HAVING IT WHEN YOU NEED IT CAN BE EXPENSIVE SO INSTRUMENT ALL THE THINGS!

  14. METRICS 4 QUALITIES OF GOOD METRICS ▸ Well-understood ▸ Granular ▸ Tagged by scope ▸ Long-lived
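The four qualities on this slide can be illustrated with a small data structure. This is a minimal sketch, not Datadog's agent or API: `MetricPoint`, `filter_by_scope`, the metric name, and the tag values are all hypothetical. It shows why tagging by scope matters during a postmortem: the same metric can be sliced by environment or availability zone after the fact.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetricPoint:
    """One observation of a metric (hypothetical structure, not a Datadog API)."""
    name: str          # well-understood: a descriptive, documented name
    timestamp: float   # granular: one point per short collection interval
    value: float
    tags: frozenset = field(default_factory=frozenset)  # tagged by scope


def filter_by_scope(points, required_tags):
    """Return the points that carry every tag in required_tags."""
    required = frozenset(required_tags)
    return [p for p in points if required <= p.tags]


# Illustrative data only; names and values are made up.
points = [
    MetricPoint("web.request.latency", 1000.0, 0.12,
                frozenset({"env:prod", "az:us-east-1a"})),
    MetricPoint("web.request.latency", 1000.0, 0.48,
                frozenset({"env:prod", "az:us-east-1b"})),
    MetricPoint("web.request.latency", 1000.0, 0.09,
                frozenset({"env:staging", "az:us-east-1a"})),
]

prod_points = filter_by_scope(points, ["env:prod"])
worst = max(prod_points, key=lambda p: p.value)
print(sorted(worst.tags))
```

The fourth quality, long-lived, is a retention decision rather than a code property: the points have to still exist when the postmortem is written.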

  15. RECURSE UNTIL YOU FIND THE TECHNICAL CAUSE

  16. IF YOU’RE STILL RESPONDING TO THE INCIDENT, IT’S NOT TIME FOR A POSTMORTEM

  17. HUMAN DATA DATA COLLECTION: WHO? ▸ Everyone! ▸ Responders ▸ Identifiers ▸ Affected Users

  18. HUMAN DATA DATA COLLECTION: WHAT? ▸ Their perspective ▸ What they did ▸ What they thought ▸ Why they thought/did it

  19. TECHNICAL ISSUES HAVE NON-TECHNICAL CAUSES HUMAN ELEMENT

  20. JOYENT US-EAST-1 POST-MORTEM 2014 … we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously … Joyent Postmortem http://bit.ly/joyent-post
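The kind of stricter input validation described in the quote above could look roughly like the following. This is an illustrative sketch only and not Joyent's tooling: `validate_reboot_request`, `UnsafeRebootError`, the server names, and the 25% limit are all assumptions made for the example.

```python
class UnsafeRebootError(ValueError):
    """Raised when a reboot request is too broad to run safely."""


def validate_reboot_request(targets, all_servers, control_plane, max_fraction=0.25):
    """Check a reboot request before anything is rebooted.

    Hypothetical guard: rejects unknown hosts, empty requests, requests that
    would take down the whole control plane, and requests covering more than
    max_fraction of the fleet.
    """
    targets = set(targets)
    all_servers = set(all_servers)
    control_plane = set(control_plane)

    unknown = targets - all_servers
    if unknown:
        raise UnsafeRebootError(f"unknown servers: {sorted(unknown)}")
    if not targets:
        raise UnsafeRebootError("no targets given; refusing to guess")
    if control_plane and control_plane <= targets:
        raise UnsafeRebootError("request would reboot every control plane server at once")
    if len(targets) > max_fraction * len(all_servers):
        raise UnsafeRebootError(
            f"{len(targets)} of {len(all_servers)} servers exceeds the "
            f"{max_fraction:.0%} limit for a single request"
        )
    return sorted(targets)


# Made-up fleet: two control plane hosts and eight compute nodes.
control = ["cp1", "cp2"]
fleet = control + [f"n{i}" for i in range(1, 9)]

print(validate_reboot_request(["n1", "n2"], fleet, control))
try:
    validate_reboot_request(fleet, fleet, control)
except UnsafeRebootError as err:
    print("rejected:", err)
```

The point of the design is that the tool, rather than the operator's attention under stress, enforces the limit.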

  21. “WRITING IS NATURE’S WAY OF LETTING YOU KNOW HOW SLOPPY YOUR THINKING IS.” RICHARD GUINDON

  22. “ONE PICTURE IS WORTH TEN THOUSAND WORDS” CHINESE PROVERB

  23. HUMAN DATA DATA COLLECTION: WHEN? ▸ As soon as possible. ▸ Memory drops sharply within 20 minutes ▸ Susceptibility to “false memory” increases

  24. HUMAN DATA DATA SKEW/CORRUPTION ▸ Stress ▸ Sleep deprivation ▸ Burnout

  25. HUMAN DATA DATA SKEW/CORRUPTION ▸ Blame/Fear of punitive action ▸ Bias ▸ Anchoring ▸ Hindsight ▸ Outcome ▸ Availability ▸ Recency

  26. HOW WE DO POSTMORTEMS AT DATADOG

  27. DATADOG POSTMORTEMS A FEW NOTES ▸ Postmortems emailed company-wide ▸ Scheduled recurring postmortem meetings

  28. DATADOG’S POSTMORTEM TEMPLATE (1/5) SUMMARY: WHAT HAPPENED? ▸ Describe what happened here at a high-level -- think of it as an abstract in a scientific paper. ▸ What was the impact on customers? ▸ What was the severity of the outage? ▸ What components were affected? ▸ What ultimately resolved the outage?

  29. DATADOG’S POSTMORTEM TEMPLATE (2/5) HOW WAS THE OUTAGE DETECTED? ▸ We want to make sure we detected the issue early and would catch the same issue if it were to repeat. ▸ Did we have a metric that showed the outage? ▸ Was there a monitor on that metric? ▸ How long did it take for us to declare an outage?
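The detection questions on this slide reduce to a few timestamps. A minimal sketch, assuming the incident start, the first monitor alert (if any), and the moment the outage was declared have already been recorded; `detection_timings` and the example times are hypothetical.

```python
from datetime import datetime, timedelta


def detection_timings(started, declared, first_alert=None):
    """Summarize how quickly an outage was detected and declared.

    first_alert is None when no monitor fired, which is itself a finding.
    """
    if declared < started:
        raise ValueError("outage declared before it started")
    if first_alert is not None and first_alert < started:
        raise ValueError("alert fired before the outage started")
    return {
        "had_monitor_alert": first_alert is not None,
        "time_to_detect": None if first_alert is None else first_alert - started,
        "time_to_declare": declared - started,
    }


# Illustrative timestamps only.
timings = detection_timings(
    started=datetime(2016, 1, 1, 10, 0),
    first_alert=datetime(2016, 1, 1, 10, 4),
    declared=datetime(2016, 1, 1, 10, 15),
)
for key, value in timings.items():
    print(f"{key}: {value}")
```

A missing `first_alert` answers "Was there a monitor on that metric?" with a no and usually becomes an action item.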

  30. DATADOG’S POSTMORTEM TEMPLATE (3/5) HOW DID WE RESPOND? ▸ Who was the incident owner & who else was involved? ▸ Slack archive links and timeline of events! ▸ What went well? ▸ What didn’t go so well?

  31. *Names changed

  32. CHATOPS ARCHIVES FTW! *Names changed

  33. TRACK LEARNINGS AS YOU GO *Names changed

  34. DATADOG’S POSTMORTEM TEMPLATE (4/5) WHY DID IT HAPPEN? ▸ Deep dive into the cause ▸ Examples from this incident: ▸ http://bit.ly/dd-statuspage ▸ http://bit.ly/alq-postmortem

  35. DATADOG’S POSTMORTEM TEMPLATE (5/5) HOW DO WE PREVENT IT IN THE FUTURE? ▸ Link to Github issues and Trello cards ▸ Now? ▸ Next? ▸ Later? ▸ Follow up notes
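The Now / Next / Later split on this slide can be kept as structured data so follow-ups stay visible. This is a sketch under assumptions, not Datadog's actual template: `ActionItem`, `group_by_horizon`, `render`, the item titles, and the example.com links are all hypothetical placeholders for real GitHub issues or Trello cards.

```python
from dataclasses import dataclass

HORIZONS = ("now", "next", "later")


@dataclass
class ActionItem:
    title: str
    horizon: str        # one of HORIZONS
    tracker_link: str   # placeholder for a GitHub issue or Trello card URL
    done: bool = False


def group_by_horizon(items):
    """Bucket action items by horizon, preserving their input order."""
    grouped = {h: [] for h in HORIZONS}
    for item in items:
        if item.horizon not in grouped:
            raise ValueError(f"unknown horizon: {item.horizon!r}")
        grouped[item.horizon].append(item)
    return grouped


def render(grouped):
    """Render the grouped items as plain text for the postmortem document."""
    lines = []
    for horizon in HORIZONS:
        lines.append(f"{horizon.capitalize()}:")
        for item in grouped[horizon]:
            mark = "x" if item.done else " "
            lines.append(f"  [{mark}] {item.title} ({item.tracker_link})")
    return "\n".join(lines)


# Made-up follow-ups with placeholder links.
items = [
    ActionItem("Add monitor on queue depth", "now", "https://example.com/issues/1", done=True),
    ActionItem("Validate config before deploy", "next", "https://example.com/issues/2"),
    ActionItem("Re-architect ingestion path", "later", "https://example.com/issues/3"),
]
grouped = group_by_horizon(items)
print(render(grouped))
```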

  36. *Names changed

  37. DATADOG’S POSTMORTEM TEMPLATE RECAP: ▸ What happened (summary)? ▸ How did we detect it? ▸ How did we respond? ▸ Why did it happen (deep dive)? ▸ Actionable next steps!

  38. KEEP LEARNING MORE RESOURCES ▸ The Infinite Hows - John Allspaw http://bit.ly/infinite-hows ▸ “Blameless” Postmortems don’t work - J Paul Reed http://bit.ly/blameless-dont-work ▸ Monitoring 101 - Alexis Lê-Quôc http://dtdg.co/monitoring-101-data

  39. QUESTIONS? @IRABINOVITCH LET’S TALK! @DATADOGHQ
