Pitfalls in Measuring SLOs Danyel Fisher @fisherdanyel

An Outage

What do you do when things break?
How bad was this break?

We need to improve quality!
Build new features!

Management, Engineering, Clients and Users
How broken is “too broken”?
What does “good enough” mean?
Combatting alert fatigue

A telemetry system produces events that correspond to real world use.
We can describe some of these events as eligible.
We can describe some of them as good.

Given an event, is it eligible? Is it good?
Eligible: “Had an http status code”
Good: “... that was a 200, and was served under 500 ms”
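The eligible/good split above can be sketched in a few lines of Python. This is only an illustration: the event shape (a dict with `status` and `duration_ms` keys) and the function names are assumptions, not something the slides specify.

```python
from typing import Optional


def is_eligible(event: dict) -> bool:
    """Eligible: the event had an HTTP status code (hypothetical 'status' field)."""
    return event.get("status") is not None


def is_good(event: dict) -> bool:
    """Good: eligible, status was 200, and it was served in under 500 ms."""
    return (
        is_eligible(event)
        and event["status"] == 200
        and event.get("duration_ms", float("inf")) < 500
    )


def compliance(events: list) -> Optional[float]:
    """Share of eligible events that were good, or None if nothing was eligible."""
    eligible = [e for e in events if is_eligible(e)]
    if not eligible:
        return None
    good = sum(1 for e in eligible if is_good(e))
    return good / len(eligible)


events = [
    {"status": 200, "duration_ms": 120},  # good
    {"status": 200, "duration_ms": 900},  # eligible, but too slow
    {"status": 500, "duration_ms": 40},   # eligible, but an error
    {"duration_ms": 5},                   # no status code: not eligible
]
print(compliance(events))
```

Events without a status code are left out of the denominator entirely, so they neither help nor hurt the ratio.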

Minimum quality ratio over a period of time
Number of bad events allowed

Deploy faster
Room for experimentation
Opportunity to tighten SLO

We always store incoming user data: 99.99% (~4.3 minutes)
Default dashboards usually load in < 1 s: 99.9% (45 minutes)
Queries often return in < 10 s: 99% (7.3 hours)
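The time figures next to each target read like the amount of total failure each target tolerates. A rough way to reproduce them, assuming a 30-day window (the slide does not state the window, and its figures look rounded), is sketched below; `budget_minutes` is a hypothetical helper name.

```python
WINDOW_DAYS = 30  # assumption: the slide does not state the window length


def budget_minutes(target_percent: float, window_days: int = WINDOW_DAYS) -> float:
    """Minutes of complete failure a target tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_percent) / 100.0


for target in (99.99, 99.9, 99.0):
    minutes = budget_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes ({minutes / 60:.2f} hours)")
```

With these assumptions the results come out to about 4.3 minutes, 43 minutes, and 7.2 hours, close to the slide's rounded 4.3 minutes, 45 minutes, and 7.3 hours.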

User Data Throughput
We blew through three months’ budget in those 12 minutes.
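The "three months" figure can be sanity-checked with simple arithmetic. The sketch below assumes the 12 minutes were a complete outage, measured against a 99.99% target over a 30-day window with roughly even traffic; none of those details are stated on this slide, and `windows_consumed` is a hypothetical name.

```python
def windows_consumed(outage_minutes: float,
                     target_percent: float,
                     window_days: int = 30) -> float:
    """How many windows' worth of error budget a total outage uses up."""
    budget = window_days * 24 * 60 * (100.0 - target_percent) / 100.0
    return outage_minutes / budget


# Assumed inputs: 12 minutes of total failure against a 99.99% target.
print(round(windows_consumed(12, 99.99), 2))
```

Under those assumptions the outage uses a little under three 30-day budgets, which is consistent with the slide's claim.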

We dropped customer data
We rolled it back (manually)
We communicated to customers
We halted deploys

We checked in code that didn’t build.
We had experimental CI build wiring.
Our scripts deployed empty binaries.
There was no health check and rollback.

We stopped writing new features
We prioritized stability
We mitigated risks

SLOs allowed us to characterize what went wrong, how badly it went wrong, and how to prioritize repair

Learning from SLOs

● Design Thinking
● Expressing and Viewing
● Burndown Alerts and Responding
● Learning from our Experiences
● Success Stories

Design Thinking and Task Analysis
Understand user goals and needs
Learn from informants and experts
Collaborate with internal team
Collect feedback and ideas externally

Displays and Views

See where the burndown was happening, explain why, and remediate

Expressing SLOs
Event based: “How many events had a duration < 500 ms”
Time based: “How many 5 minute periods had a P95(duration) < 500 ms”
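The two formulations can give quite different answers on the same data. The sketch below computes both from a list of `(timestamp_seconds, duration_ms)` pairs; that event shape, the nearest-rank percentile, and the function names are assumptions made for illustration.

```python
import math
from collections import defaultdict

THRESHOLD_MS = 500
PERIOD_S = 5 * 60  # 5 minute periods


def event_based(events):
    """Fraction of events with duration < 500 ms (assumes a non-empty list)."""
    good = sum(1 for _, duration in events if duration < THRESHOLD_MS)
    return good / len(events)


def p95(values):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def time_based(events):
    """Fraction of 5 minute periods whose P95(duration) < 500 ms.

    Periods with no events at all are not counted in this sketch.
    """
    buckets = defaultdict(list)
    for timestamp, duration in events:
        buckets[timestamp // PERIOD_S].append(duration)
    good = sum(1 for durations in buckets.values() if p95(durations) < THRESHOLD_MS)
    return good / len(buckets)


# Period 0: 20 fast events. Period 1: 18 fast events and 2 slow ones.
events = [(i, 100) for i in range(20)]
events += [(300 + i, 100) for i in range(18)]
events += [(330, 900), (331, 900)]

print(event_based(events), time_based(events))
```

Here 95% of events are good, but only half of the periods are, because two slow requests are enough to push one period's P95 over the threshold.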

How do we express SLOs?
Eligible: $name is “run_trigger_detailed”
Good: $app.error does not exist
Good events
Bad events
How often
Time range

Status of an SLO

How have we done?

Where did it go?

When did the errors happen?

What went wrong?
High dimensional data
High cardinality data

Why did it happen?

See where the burndown was happening, explain why, and remediate

User Feedback
“The Bubble Up in the SLO page is really powerful at highlighting what is contributing the most to missing our SLIs, it has definitely confirmed our assumptions.”

User Feedback
“Your customers have to be happy... we have to have an understanding of the customer experience. … To the millisecond we knew what our percentage was of success versus failure.”
- Josh Hull, Site Reliability Engineering Lead, Clover Health

User Feedback
“The historical SLO chart also confirms a fix for a performance issue we did greatly contributed to the SLO compliance by showing a nice upward trend line. :)”

User Feedback
“I’d love to drive alerts off our SLOs. Right now we don’t have anything to draw us in and have some alerts on the average error rate but they’re a little spiky to be useful. It would be great to get a better sense of when the budget is going and define alerts that way.”

Burndown Alerts

How is my system doing?
Am I over budget?
When will my alarm fail?

When will I fail?
User goal: get alerts to exhaustion time
Human-digestible units
24 hours: “I’ll take a look in the morning”
4 hours: “All hands on deck!”
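One simple way to get an exhaustion time is to extrapolate the recent burn rate linearly. The slides do not say how the estimate was actually produced, so the sketch below is an assumption throughout, including the function names and the mapping of the 24 hour and 4 hour horizons to responses.

```python
def hours_to_exhaustion(budget_remaining: float, burn_per_hour: float) -> float:
    """Linear extrapolation: hours until the remaining budget reaches zero.

    Both arguments use the same unit, for example bad events.
    """
    if budget_remaining <= 0:
        return 0.0           # already exhausted
    if burn_per_hour <= 0:
        return float("inf")  # not burning, so never exhausted at this rate
    return budget_remaining / burn_per_hour


def urgency(hours_left: float) -> str:
    """Map exhaustion time to the two human-digestible horizons on the slide."""
    if hours_left <= 4:
        return "page: all hands on deck"
    if hours_left <= 24:
        return "notify: take a look in the morning"
    return "ok"


print(urgency(hours_to_exhaustion(budget_remaining=120, burn_per_hour=40)))
print(urgency(hours_to_exhaustion(budget_remaining=120, burn_per_hour=6)))
```

Expressing the alert in hours rather than a raw error rate is what makes the two horizons easy to act on.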

How is my system doing?
Am I over budget?
When will my alarm fail?

Implementing Burn Alerts
Run a 30 day query at a 5 minute resolution every minute

Caching is Fun!

Fun with Caching
Vital to cache results …
… but not incomplete results …
… at what resolution of cache?
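A minimal sketch of "cache results, but not incomplete results": store a bucket only once its time range has fully elapsed. The 5 minute bucket size, the `BucketCache` class, and the `compute` callback are all assumptions for illustration. The sketch also ignores late-arriving events, which could leave even a closed bucket incomplete.

```python
import time

BUCKET_S = 5 * 60  # assumed cache resolution: one entry per 5 minute bucket


class BucketCache:
    """Caches per-bucket results, but only for buckets that have fully elapsed."""

    def __init__(self, compute, now=time.time):
        self._compute = compute  # hypothetical callback: bucket_start -> result
        self._now = now          # injectable clock, useful for testing
        self._cache = {}

    def get(self, bucket_start: int):
        if bucket_start in self._cache:
            return self._cache[bucket_start]
        result = self._compute(bucket_start)
        # Only store the result if the bucket's end time is in the past;
        # a bucket that is still filling would be cached with partial data.
        if bucket_start + BUCKET_S <= self._now():
            self._cache[bucket_start] = result
        return result
```

With this shape, a 30 day query re-run every minute only has to recompute the newest, still-open bucket; everything older is served from the cache.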

Flappy Alerts
“It’ll expire at 3:55”
“Wait, make that 4:05”
“Nope, 3:55 again!”
(We added a 10%ish buffer)
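The slide says only that a roughly 10% buffer was added, not how it was applied. One plausible reading is hysteresis: the alert fires when the predicted exhaustion time drops below the threshold, and clears only once the prediction rises above the threshold plus the buffer. The sketch below illustrates that interpretation; the `BurnAlert` class is hypothetical.

```python
class BurnAlert:
    """Exhaustion-time alert with a clear-side buffer to damp flapping."""

    def __init__(self, threshold_hours: float, buffer: float = 0.10):
        self.threshold = threshold_hours
        # Assumed use of the buffer: require 10% of headroom before clearing.
        self.clear_at = threshold_hours * (1 + buffer)
        self.firing = False

    def update(self, hours_to_exhaustion: float) -> bool:
        """Feed in the latest estimate; returns whether the alert is firing."""
        if self.firing:
            if hours_to_exhaustion > self.clear_at:
                self.firing = False
        elif hours_to_exhaustion < self.threshold:
            self.firing = True
        return self.firing


alert = BurnAlert(threshold_hours=4.0)
estimates = [4.2, 3.9, 4.1, 3.95, 4.3, 4.5]
states = [alert.update(h) for h in estimates]
print(states)
```

Without the buffer, the estimates hovering around 4 hours would toggle the alert on nearly every evaluation; with it, the alert stays on until the estimate clearly recovers.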

Recovering from Bankruptcy
A failure a month ago brought us to -169% and still hasn’t aged out?
That means we don’t get alerts anymore
Customer workaround: delete and re-create the SLO, thus blowing the cache

Learning from Experience

Volume is important
Tolerate at least dozens of bad events per day
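The volume point comes down to arithmetic: at low traffic, a strict target leaves room for only a handful of bad events, so a single failure can swing the budget. A rough sketch follows; the helper names are hypothetical, and treating "dozens" as 24 is an assumption.

```python
def allowed_bad_per_day(events_per_day: float, target_percent: float) -> float:
    """Bad events per day that a target tolerates at a given traffic level."""
    return events_per_day * (100.0 - target_percent) / 100.0


def min_daily_volume(target_percent: float, min_bad_per_day: float = 24) -> float:
    """Daily traffic needed before the target tolerates `min_bad_per_day` failures."""
    return min_bad_per_day * 100.0 / (100.0 - target_percent)


# At 1,000 events/day a 99.9% target tolerates about one bad event per day.
print(round(allowed_bad_per_day(1_000, 99.9), 2))
# Traffic needed for a 99.9% target to tolerate two dozen bad events per day.
print(round(min_daily_volume(99.9)))
```

When the tolerated count is that small, either the target or the set of eligible events may need rethinking before the SLO is useful for alerting.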

Faults

SLOs for Customer Service
Remember that user having a bad day?

Blackouts are easy … but brownouts are much more interesting

Timeline
1:29 am SLO alerts. “Maybe it’s just a blip”
    1.5% brownout for 20 minutes
4:21 am Minor incident. “It might be an AWS problem”
6:25 am SLO alerts again. “Could it be ALB compat?”
9:55 am “Why is our system uptime dropping to zero?”
    It’s out of memory
    We aren’t alerting on that crash
10:32 am Fixed
