ACM.org Highlights For Scientists, Programmers, Designers, and Managers: • Learning Center - https://learning.acm.org • View past TechTalks & Podcasts with top inventors, innovators, entrepreneurs, & award winners • Access to O’Reilly Learning Platform – technical books, courses, videos, tutorials & case studies • Access to Skillsoft Training & ScienceDirect – vendor certification prep, technical books & courses • Ethical Responsibility – https://ethics.acm.org By the Numbers Popular Publications & Research Papers • • 2,200,000+ content readers Communications of the ACM - http://cacm.acm.org • 1,800,000+ DL research citations • Queue Magazine - http://queue.acm.org • $1,000,000 Turing Award prize • Digital Library - http://dl.acm.org • 100,000+ global members • 1160+ Fellows Major Conferences, Events, & Recognition • 700+ chapters globally • https://www.acm.org/conferences • 170+ yearly conferences globally • https://www.acm.org/chapters • 100+ yearly awards • https://awards.acm.org • 70+ Turing Award Laureates
OOPS! Learning from surprise at Netflix Lorin Hochstein Sr. Software Engineer, Netflix
Let’s talk about outages! @lhochstein
At Netflix, we call them incidents @lhochstein
Incidents are scary! @lhochstein
The system did something we didn’t expect… @lhochstein
…and a bad thing happened! @lhochstein
Uncertainty makes people nervous @lhochstein
We want closure @lhochstein
How can we be confident this won’t happen again? @lhochstein
We do an incident review @lhochstein
Why did this happen? @lhochstein
Do a root cause analysis @lhochstein
Identify action items that will prevent reoccurrence @lhochstein
We can now move past it @lhochstein
Until the next one… @lhochstein
… which is completely different @lhochstein
We can get more out of incidents than preventing the last one @lhochstein
@lhochstein
Learning isn’t proportional to impact of an incident @lhochstein
We can learn just as much from “incidents” where there is no business impact! @lhochstein
An operational surprise @lhochstein
OOPSies @lhochstein
OOPS @lhochstein
@lhochstein
@lhochstein
@lhochstein
@lhochstein
https://twitter.com/FakeRyanGosling/status/1106714429247221761 @lhochstein
@lhochstein
A play in three acts 1. What we hope to learn from OOPSies 2. What to ask when looking into how an OOPS happened 3. How to write up an OOPS
I. What we hope to learn @lhochstein
Fools learn from experience. I prefer to learn from the experience of others. – Otto von Bismarck (attributed) @lhochstein
Identify gaps @lhochstein
Tooling gaps @lhochstein
@lhochstein
@lhochstein
Consider a cluster of servers Server group
The size is configurable Server group EC2 128 Desired
Netflix traffic varies over time
Autoscaling sizes for you Server group Metrics EC2 Autoscaler Min 20 128 Desired Max 1000
One day… @lhochstein
1000 12 128 Max Desired Min
1000 12 256 Max Desired Min
One day… Server group Metrics EC2 Autoscaler Min 20 256 Desired Max 1000
1. EC2: Bring up new instances 256 Desired
2. Autoscaler fires: 256 → 128 128 Desired
2. EC2: terminate instances 128 Desired
Server group User sees green → gray Metrics EC2 Autoscaler Min 20 128 Desired Max 1000
@lhochstein
Operational expertise gaps @lhochstein
Resource gaps @lhochstein
Beware the law of stretched systems! @lhochstein
Every system is stretched to operate at its capacity @lhochstein
Beware the law of fluency ! @lhochstein
Hard to tell when a skilled engineer starts to become overloaded @lhochstein
Build shared understanding @lhochstein
It came as a surprise that X calls Y’s endpoint @lhochstein
Facilitate skill transfer @lhochstein
Learn by watching experts in action @lhochstein
II. What to ask @lhochstein
Do an investigation afterwards @lhochstein
(but don’t call it that) @lhochstein
“How did we get here?” @lhochstein
How did X seem reasonable in the moment? @lhochstein
What were all of the things that had to be true for the surprise to happen? @lhochstein
Capture perspectives from multiple people @lhochstein
III. How to write it up @lhochstein
Narrative description @lhochstein
Tell a good story @lhochstein
Imagine new team member reading it @lhochstein
Contributing factors @lhochstein
Front50 provides an inconsistent view of application permissions, this triggered endless retries @lhochstein
Similar feature was already in use, so enabling it here seemed low-risk @lhochstein
X was out sick when the feature was deployed @lhochstein
Mitigators @lhochstein
Spinnaker's staging stack was not impacted, which gave us a backdoor way to monitor and make changes @lhochstein
Demand Engineering has tooling & experience in changing size of many server groups automatically, which was sufficient to undo most bad changes @lhochstein
Risks @lhochstein
The regression occurred in an area of Spinnaker that is difficult to test @lhochstein
Misconfigured pools and queues @lhochstein
Difficulties in handling @lhochstein
Observability blind spots: lack of metrics around connection pool or redis command usage made it difficult to determine redis usage change @lhochstein
@lhochstein
clouddriver was rolled out at 3pm and we were paged at 5:30pm, so not immediately clear that issue had to do with deployment @lhochstein
If you only remember three things… • Any operational surprise is a potential opportunity for learning • Ask questions that answer “how did we get here?” • Tell a good story @lhochstein
I want to learn more about learning more! • Etsy Debrief Facilitation Guide • The Field Guide To Understanding ‘Human Error’ by Sidney Dekker • http://resiliencepapers.club @lhochstein
The Learning Continues… TechTalk Discourse: https://on.acm.org TechTalk Inquiries: learning@acm.org TechTalk Archives: https://learning.acm.org/techtalks Learning Center: https://learning.acm.org Professional Ethics: https://ethics.acm.org Queue Magazine: https://queue.acm.org
Recommend
More recommend