Detangling complex systems with compassion & production excellence
Liz Fong-Jones (@lizthegrey)
#VelocityConf San Jose, June 13, 2019
w/ illustrations by @emilywithcurls!
Production is increasingly complex.
There is no 100% uptime.
Our strategies need to evolve.
Co "bought" DevOps.
Ordering the alphabet soup...
Noisy alerts. Grumpy engineers.
Walls of meaningless dashboards.
Incidents take forever to fix.
Everyone bugs the "expert".
Deploys are unpredictable.
There's no time to do projects...
and when there's time, there's no plan.
The team is struggling to hold on.
What's Co missing?
Co forgot who operates systems.
Tools aren't magical.
Invest in people, culture, & process.
Enter the art of Production Excellence.
Make systems more reliable & friendly.
ProdEx takes planning.
Measure and act on what matters.
Involve everyone.
Build everyone's confidence. Encourage asking questions.
How do we get started?
Know when it's too broken.
& be able to debug together when it is.
Eliminate (unnecessary) complexity.
Our systems are always failing.
What if we measure too broken?
We need Service Level Indicators
Think in terms of events in context.
Is this event good or bad?
Are users grumpy? Ask your PM.
What threshold buckets events?
HTTP Code 200? Latency < 300ms?
How many eligible events did we see?
Availability: Good / Eligible Events
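
The SLI slides above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: the `Event` shape, the `eligible` flag, and the choice to treat an empty set as fully available are all assumptions made for the example. The thresholds (HTTP 200, latency under 300 ms) come from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    status: int             # HTTP status code of the response
    latency_ms: float       # request duration in milliseconds
    eligible: bool = True   # hypothetical flag: False for traffic we exclude (e.g. health checks)


def is_good(event: Event, max_latency_ms: float = 300.0) -> bool:
    """An event is good if it returned HTTP 200 faster than the latency threshold."""
    return event.status == 200 and event.latency_ms < max_latency_ms


def availability(events: List[Event]) -> float:
    """Availability = good events / eligible events."""
    eligible = [e for e in events if e.eligible]
    if not eligible:
        return 1.0  # assumption: no eligible traffic means nothing failed
    good = sum(1 for e in eligible if is_good(e))
    return good / len(eligible)


events = [
    Event(200, 120.0),                  # good
    Event(200, 450.0),                  # bad: too slow
    Event(500, 80.0),                   # bad: server error
    Event(200, 90.0),                   # good
    Event(200, 50.0, eligible=False),   # excluded from the ratio
]
print(availability(events))  # 0.5 (2 good out of 4 eligible)
```
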
Set a target Service Level Objective.
Use a window and target percentage.
99.9% of events good in past 30 days.
A good SLO barely keeps users happy.
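
As a rough sketch of "a window and a target percentage", the check below evaluates the 99.9% over 30 days objective from the slide against a list of timestamped events. The `(timestamp, is_good)` tuple format and the function name `slo_met` are assumptions for illustration only.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

SLO_TARGET = 0.999                 # 99.9% of events good (from the slide)
SLO_WINDOW = timedelta(days=30)    # past 30 days (from the slide)


def slo_met(events: List[Tuple[datetime, bool]], now: datetime) -> bool:
    """events holds (timestamp, is_good) pairs for eligible events (hypothetical format)."""
    window_start = now - SLO_WINDOW
    in_window = [good for ts, good in events if window_start < ts <= now]
    if not in_window:
        return True  # assumption: no traffic in the window means the SLO is not violated
    return sum(in_window) / len(in_window) >= SLO_TARGET


now = datetime(2019, 6, 13)
recent = now - timedelta(days=1)
old = now - timedelta(days=45)

events = [(recent, True)] * 9995 + [(recent, False)] * 5   # 99.95% good inside the window
events += [(old, False)] * 100                             # outside the window, ignored
print(slo_met(events, now))                                # True

events += [(recent, False)] * 20                           # now about 99.75% good
print(slo_met(events, now))                                # False
```
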
Drive alerting with SLOs.
Is my service on fire?
Error budget: allowed unavailability
How long until I run out?
Page if it's hours. Ticket if it's days.
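
One way to turn "how long until I run out?" into an alerting decision is sketched below. The slides give only "hours" and "days", so the cutoffs here (page under 24 hours, ticket under 7 days), the function names, and the traffic numbers are assumptions chosen for the example.

```python
def error_budget(slo_target: float, eligible_events: int) -> float:
    """Number of bad events the SLO tolerates over the window."""
    return (1.0 - slo_target) * eligible_events


def hours_until_exhausted(budget_remaining: float, bad_events_per_hour: float) -> float:
    """Time left at the current burn rate; infinite if nothing is burning."""
    if bad_events_per_hour <= 0:
        return float("inf")
    return max(budget_remaining, 0.0) / bad_events_per_hour


# Hypothetical cutoffs: the slides only say "hours" versus "days".
PAGE_IF_UNDER_HOURS = 24.0
TICKET_IF_UNDER_HOURS = 7 * 24.0


def alert_action(hours_left: float) -> str:
    if hours_left < PAGE_IF_UNDER_HOURS:
        return "page"
    if hours_left < TICKET_IF_UNDER_HOURS:
        return "ticket"
    return "none"


budget = error_budget(0.999, 1_000_000)   # about 1000 bad events allowed in the window
remaining = budget - 400                  # 400 bad events already spent
for rate in (150.0, 10.0, 0.5):           # bad events per hour right now
    print(rate, alert_action(hours_until_exhausted(remaining, rate)))
```

At 150 bad events per hour the remaining budget lasts about 4 hours, so someone gets paged; at 10 per hour it lasts about 60 hours, which is a ticket; at 0.5 per hour it lasts well beyond the window and nobody is interrupted.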
Data-driven business decisions.
Is it safe to do this risky experiment?
Should we invest in more reliability?
Perfect SLO > Good SLO >>> No SLO
Measure what you can today.
Iterate to meet user needs.
Only alert on what matters.
SLIs & SLOs are only half the picture...
Our outages are never identical.
Failure modes can't be predicted.
Support debugging novel cases. In production.
Allow forming & testing hypotheses.
Dive into data to ask new questions.
Our services must be observable.
Can you examine events in context?
Can you explain the variance?
Can you mitigate impact & debug later?
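
To make "examine events in context" and "explain the variance" concrete, here is a toy sketch that breaks the bad-event rate down by one field at a time and picks the field whose values differ most. The event fields (`region`, `build`) and the max-minus-min heuristic are assumptions for illustration; this is not how any particular observability tool works.

```python
from collections import defaultdict
from typing import Dict, List


def bad_rate_by(events: List[dict], field: str) -> Dict[str, float]:
    """Fraction of bad events for each value of the given field."""
    totals: Dict[str, int] = defaultdict(int)
    bad: Dict[str, int] = defaultdict(int)
    for e in events:
        key = e.get(field, "unknown")
        totals[key] += 1
        if not e["good"]:
            bad[key] += 1
    return {k: bad[k] / totals[k] for k in totals}


def most_explanatory_field(events: List[dict], fields: List[str]) -> str:
    """Toy heuristic: the field with the widest gap in bad rate between its values."""
    def spread(field: str) -> float:
        rates = bad_rate_by(events, field).values()
        return max(rates) - min(rates)
    return max(fields, key=spread)


# Hypothetical events: every failure is on build v2, regardless of region.
events = [
    {"region": "us", "build": "v1", "good": True},
    {"region": "us", "build": "v2", "good": False},
    {"region": "eu", "build": "v1", "good": True},
    {"region": "eu", "build": "v2", "good": False},
    {"region": "eu", "build": "v1", "good": True},
]
print(bad_rate_by(events, "build"))                           # {'v1': 0.0, 'v2': 1.0}
print(most_explanatory_field(events, ["region", "build"]))    # build
```
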
SLOs and Observability go together.
But they alone don't create collaboration.
Debugging is not a solo activity.
Debugging is for everyone.
Collaboration is interpersonal.
Operations must be sustainable.
We learn better when we document.
Fix hero culture. Share knowledge.
Reward curiosity and teamwork.
Learn from the past. Reward your future self.
Outages don't repeat, but they rhyme.
Risk analysis helps us plan.
Quantify risks by frequency & impact.
Which risks are most significant?
Address risks that threaten the SLO.
Make the business case to fix them.
And prioritize completing the work.
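
The risk analysis steps above can be sketched as a small calculation: estimate each risk's expected bad minutes per year from its frequency and impact, rank the risks, and flag those that would consume a large share of the error budget. The risk names, the numbers, the 99.9% yearly target, and the 25% cutoff are all made up for the example.

```python
from dataclasses import dataclass
from typing import List

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass
class Risk:
    name: str
    incidents_per_year: float     # frequency
    minutes_per_incident: float   # impact: time to detect plus time to repair
    fraction_of_users: float      # impact: share of users affected (0 to 1)

    @property
    def expected_bad_minutes(self) -> float:
        return self.incidents_per_year * self.minutes_per_incident * self.fraction_of_users


def rank_risks(risks: List[Risk]) -> List[Risk]:
    """Most significant risk first."""
    return sorted(risks, key=lambda r: r.expected_bad_minutes, reverse=True)


def threatens_slo(risk: Risk, slo_target: float = 0.999, budget_share: float = 0.25) -> bool:
    """Hypothetical rule: flag a risk that alone eats more than budget_share of the yearly budget."""
    budget_minutes = (1.0 - slo_target) * MINUTES_PER_YEAR   # about 525.6 minutes at 99.9%
    return risk.expected_bad_minutes > budget_share * budget_minutes


risks = [
    Risk("database failover", 2, 20, 1.0),      # 40 expected bad minutes per year
    Risk("bad deploy", 12, 30, 0.5),            # 180 expected bad minutes per year
    Risk("single zone outage", 1, 120, 0.33),   # about 39.6 expected bad minutes per year
]
for r in rank_risks(risks):
    print(r.name, round(r.expected_bad_minutes, 1), threatens_slo(r))
```

In this made-up data only the bad-deploy risk crosses the cutoff, which is the kind of number that supports a business case for fixing it first.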
Lack of observability is systemic risk.
So is lack of collaboration.
Season the alphabet soup with ProdEx
Production Excellence brings teams closer together. Measure. Debug. Collaborate. Fix. lizthegrey.com; @lizthegrey