

  1. ACM.org Highlights For Scientists, Programmers, Designers, and Managers: • Learning Center - https://learning.acm.org • View past TechTalks & Podcasts with top inventors, innovators, entrepreneurs, & award winners • Access to O’Reilly Learning Platform – technical books, courses, videos, tutorials & case studies • Access to Skillsoft Training & ScienceDirect – vendor certification prep, technical books & courses • Ethical Responsibility – https://ethics.acm.org Popular Publications & Research Papers: • Communications of the ACM - http://cacm.acm.org • Queue Magazine - http://queue.acm.org • Digital Library - http://dl.acm.org Major Conferences, Events, & Recognition: • https://www.acm.org/conferences • https://www.acm.org/chapters • https://awards.acm.org By the Numbers: • 2,200,000+ content readers • 1,800,000+ DL research citations • $1,000,000 Turing Award prize • 100,000+ global members • 1160+ Fellows • 700+ chapters globally • 170+ yearly conferences globally • 100+ yearly awards • 70+ Turing Award Laureates

  2. OOPS! Learning from surprise at Netflix Lorin Hochstein Sr. Software Engineer, Netflix

  3. Let’s talk about outages! @lhochstein

  4. At Netflix, we call them incidents

  5. Incidents are scary!

  6. The system did something we didn’t expect…

  7. …and a bad thing happened!

  8. Uncertainty makes people nervous

  9. We want closure

  10. How can we be confident this won’t happen again?

  11. We do an incident review

  12. Why did this happen?

  13. Do a root cause analysis

  14. Identify action items that will prevent reoccurrence

  15. We can now move past it

  16. Until the next one…

  17. … which is completely different

  18. We can get more out of incidents than preventing the last one

  20. Learning isn’t proportional to the impact of an incident

  21. We can learn just as much from “incidents” where there is no business impact!

  22. An operational surprise

  23. OOPSies

  24. OOPS

  29. https://twitter.com/FakeRyanGosling/status/1106714429247221761

  31. A play in three acts: 1. What we hope to learn from OOPSies 2. What to ask when looking into how an OOPS happened 3. How to write up an OOPS

  32. I. What we hope to learn

  33. Fools learn from experience. I prefer to learn from the experience of others. – Otto von Bismarck (attributed)

  34. Identify gaps

  35. Tooling gaps

  38. Consider a cluster of servers Server group

  39. The size is configurable Server group EC2 128 Desired

  40. Netflix traffic varies over time

  41. Autoscaling sizes for you Server group Metrics EC2 Autoscaler Min 20 128 Desired Max 1000

  42. One day…

  43. Min 12 Desired 128 Max 1000

  44. Min 12 Desired 256 Max 1000

  45. One day… Server group Metrics EC2 Autoscaler Min 20 256 Desired Max 1000

  46. 1. EC2: Bring up new instances 256 Desired

  47. 2. Autoscaler fires: 256 → 128 128 Desired

  48. 3. EC2: Terminate instances 128 Desired

  49. Server group User sees green → gray Metrics EC2 Autoscaler Min 20 128 Desired Max 1000

  51. Operational expertise gaps

  52. Resource gaps

  53. Beware the law of stretched systems!

  54. Every system is stretched to operate at its capacity

  55. Beware the law of fluency!

  56. It is hard to tell when a skilled engineer starts to become overloaded

  57. Build shared understanding

  58. It came as a surprise that X calls Y’s endpoint

  59. Facilitate skill transfer

  60. Learn by watching experts in action

  61. II. What to ask

  62. Do an investigation afterwards

  63. (but don’t call it that)

  64. “How did we get here?”

  65. How did X seem reasonable in the moment?

  66. What were all of the things that had to be true for the surprise to happen?

  67. Capture perspectives from multiple people

  68. III. How to write it up

  69. Narrative description

  70. Tell a good story

  71. Imagine a new team member reading it

  72. Contributing factors

  73. Front50 provides an inconsistent view of application permissions, which triggered endless retries

  74. A similar feature was already in use, so enabling it here seemed low-risk

  75. X was out sick when the feature was deployed

  76. Mitigators

  77. Spinnaker's staging stack was not impacted, which gave us a backdoor way to monitor and make changes

  78. Demand Engineering has tooling & experience in changing the size of many server groups automatically, which was sufficient to undo most bad changes

  79. Risks

  80. The regression occurred in an area of Spinnaker that is difficult to test

  81. Misconfigured pools and queues

  82. Difficulties in handling

  83. Observability blind spots: lack of metrics around connection pool or Redis command usage made it difficult to determine how Redis usage had changed

  85. clouddriver was rolled out at 3pm and we were paged at 5:30pm, so it was not immediately clear that the issue had to do with the deployment

  86. If you only remember three things… • Any operational surprise is a potential opportunity for learning • Ask questions that answer “how did we get here?” • Tell a good story

  87. I want to learn more about learning more! • Etsy Debrief Facilitation Guide • The Field Guide To Understanding ‘Human Error’ by Sidney Dekker • http://resiliencepapers.club

  88. The Learning Continues… TechTalk Discourse: https://on.acm.org TechTalk Inquiries: learning@acm.org TechTalk Archives: https://learning.acm.org/techtalks Learning Center: https://learning.acm.org Professional Ethics: https://ethics.acm.org Queue Magazine: https://queue.acm.org
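
The autoscaling surprise on slides 38 to 49 can be sketched as a small simulation. This is a toy model, not Netflix, Spinnaker, or AWS code: the `ServerGroup` class, the `autoscaler_target` policy, and the load and per-instance capacity figures are all hypothetical, chosen only so that the autoscaler lands back on 128. It assumes, as the slides appear to describe, that a user manually raised the desired size from 128 to 256 and that the autoscaler later overwrote that value from its metrics.

```python
from dataclasses import dataclass, field


@dataclass
class ServerGroup:
    """Toy model of a server group with min/desired/max sizes (hypothetical)."""
    min_size: int
    max_size: int
    desired: int
    running: int
    log: list = field(default_factory=list)

    def set_desired(self, value: int, who: str) -> None:
        # Clamp the request to the [min_size, max_size] bounds.
        clamped = max(self.min_size, min(self.max_size, value))
        self.log.append(f"{who}: desired {self.desired} -> {clamped}")
        self.desired = clamped

    def reconcile(self) -> None:
        # Stand-in for EC2: launch or terminate until running == desired.
        delta = self.desired - self.running
        if delta > 0:
            self.log.append(f"EC2: launch {delta}")
        elif delta < 0:
            self.log.append(f"EC2: terminate {-delta}")
        self.running = self.desired


def autoscaler_target(load: int, capacity_per_instance: int) -> int:
    # Assumed policy: just enough instances to carry the load (ceiling division).
    return -(-load // capacity_per_instance)


def simulate_manual_resize() -> ServerGroup:
    group = ServerGroup(min_size=20, max_size=1000, desired=128, running=128)
    group.set_desired(256, who="user")   # manual resize, 128 -> 256
    group.reconcile()                    # 1. EC2 brings up new instances
    # 2. The autoscaler fires; its metrics still call for 128 instances.
    target = autoscaler_target(load=12_800, capacity_per_instance=100)
    group.set_desired(target, who="autoscaler")
    group.reconcile()                    # 3. EC2 terminates the new instances
    return group


if __name__ == "__main__":
    for line in simulate_manual_resize().log:
        print(line)
```

Running the script prints the four events in order: the user's resize, the launch of 128 instances, the autoscaler's reset to 128, and the termination of those same instances. The point of the sketch is that nothing in the model warns the user that a manual change to the desired size is only valid until the next autoscaler evaluation, which is the kind of tooling gap the talk describes.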
