A DARK AND STORMY NIGHT: TALES OF OPERABILITY ANTI-PATTERNS

@kiranb Kiran Bhattaram

It was a dark and stormy night; the rain fell in torrents — except at occasional intervals, when it was checked by a violent gust of wind which swept up the streets (for it is in London that our scene lies), rattling along the housetops, and fiercely agitating the scanty flame of the lamps that struggled against the darkness. BULWER-LYTTON

DEFINITIONS What is operability? ▸ The ability to keep a system in a safe and reliable functioning condition, according to pre-defined operational requirements.

DEFINITIONS Characteristics of operability ▸ safety & reliability ▸ ease of upgrades ▸ scalability ▸ observability ▸ grace under pressure ▸ usability ▸ cultural practices around incidents ▸ AND MORE

DEFINITIONS Characteristics of an operable system ▸ Converge towards a stable state. ▸ Give operators visibility and tools. ▸ Designed to be usable and unsurprising.

DEFINITIONS Agenda ▸ Robustness ▸ Observability ▸ Usability ▸ Review!

1. ROBUSTNESS

STORY 1 THE TALE OF THE SYSTEM THAT COULDN’T GIVE ANYTHING UP

ROBUSTNESS Define your critical path.

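ROBUSTNESS Harvest, Yield and Scalable Tolerant Systems ▸ $\text{yield} = \dfrac{\text{successful requests}}{\text{total requests}}$ (not the same as uptime); traded away by dropping requests ▸ $\text{harvest} = \dfrac{\text{data available}}{\text{total data}}$; traded away by degrading the response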
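As a rough illustration of the two ratios on this slide, here is a minimal Python sketch. The function names and the sample numbers are assumptions made for the example; they do not come from the talk or from the Brewer & Fox paper.

```python
def yield_fraction(successful_requests: int, total_requests: int) -> float:
    """Fraction of incoming requests that were answered successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic is treated as full yield
    return successful_requests / total_requests


def harvest_fraction(data_available: int, total_data: int) -> float:
    """Fraction of the complete data set reflected in a response."""
    if total_data == 0:
        return 1.0  # assumption: nothing to serve is treated as full harvest
    return data_available / total_data


# Hypothetical numbers: 1000 requests arrive and 50 are dropped.
print(yield_fraction(950, 1000))  # 0.95
# Hypothetical numbers: a response is built from 3 of 4 shards.
print(harvest_fraction(3, 4))     # 0.75
```

Dropping a request lowers yield and leaves harvest untouched, while answering from partial data does the opposite, which is why the two can be traded off separately.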

ROBUSTNESS Controlling yield: load shedding upstream requests ▸ categories of load shedders: ▸ # of requests ▸ # of concurrent requests (protect against the long tail) ▸ overall fleet utilization (keep x% of workers for core traffic)
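The second and third categories above can be combined in one small sketch: a limiter on concurrent requests that keeps a share of the slots free for critical-path traffic. The class `ConcurrencyShedder` and its parameters `max_in_flight` and `reserved_for_critical` are hypothetical names chosen for this example, not an API from the talk.

```python
import threading


class ConcurrencyShedder:
    """Sheds requests once too many are in flight.

    Non-critical requests may only use part of the capacity, so a few
    slots always remain available for critical-path traffic.
    """

    def __init__(self, max_in_flight: int, reserved_for_critical: int):
        if not 0 <= reserved_for_critical <= max_in_flight:
            raise ValueError("reserved_for_critical must be within max_in_flight")
        self._max = max_in_flight
        self._reserved = reserved_for_critical
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self, critical: bool) -> bool:
        """Return True if the request may proceed, False if it is shed."""
        limit = self._max if critical else self._max - self._reserved
        with self._lock:
            if self._in_flight >= limit:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Call once for every successful try_acquire when the request ends."""
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)
```

A handler would call `try_acquire` on entry, return an overload response if it gets `False`, and call `release` in a `finally` block otherwise.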

ROBUSTNESS Controlling harvest: circuit breakers ▸ stop calling a dependency if it seems down! ▸ what do you return? ▸ cached data ▸ nil ▸ or propagate the error upstream
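One way to picture the circuit breaker described above is the following sketch. It is a simplified, single-threaded illustration: the class name, the `failure_threshold` and `reset_after_s` parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Stops calling a dependency after repeated failures."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, skip the dependency until the cool-down has passed.
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()
            # Cool-down over: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        # A success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The `fallback` callable is where the choice from the slide is made: it can return cached data, return `None`, or raise so that the error propagates upstream.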

ROBUSTNESS Controlling harvest: circuit breakers & compartmentalization http://idighardware.com/2013/10/fire-doors-everything-you-always-wanted-to-know-but-were-afraid-to-ask/

ROBUSTNESS Putting it all together: giving things up ▸ Combine harvest/yield degradation in different ways to protect the critical path ▸ Monitor any degradation! ▸ Dark launch your rate limiters to check what they’d block.
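Dark launching a rate limiter could look roughly like the sketch below: the limiter counts what it would have blocked but only rejects traffic once `enforce` is switched on. The fixed-window algorithm, the class name, and every parameter here are assumptions for illustration; the talk does not specify an implementation.

```python
import time
from collections import defaultdict


class DarkLaunchRateLimiter:
    """Fixed-window per-client limiter with an observe-only mode."""

    def __init__(self, limit_per_window, window_s=1.0, enforce=False, clock=time.monotonic):
        self.limit = limit_per_window
        self.window_s = window_s
        self.enforce = enforce   # False = dark launch: count, never block
        self.would_block = 0     # metric to watch before enforcing
        self._clock = clock
        self._window = None
        self._counts = defaultdict(int)  # client id -> requests in current window

    def allow(self, client_id: str) -> bool:
        window = int(self._clock() // self.window_s)
        if window != self._window:
            # A new window started: forget the previous counts.
            self._window = window
            self._counts.clear()
        self._counts[client_id] += 1
        over_limit = self._counts[client_id] > self.limit
        if over_limit:
            self.would_block += 1
        return not (over_limit and self.enforce)
```

Exporting `would_block` as a metric during the dark phase shows which clients the limit would hit before anyone is actually rejected.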

ROBUSTNESS Robustness, in review Converge to a stable state. ▸ know how the system sheds load ▸ know how it reacts to downstream failures

2. OBSERVABILITY

STORY 2 THE TALE OF THE FRACTAL QUEUE

OBSERVABILITY Instrument EVERYTHING ▸ especially with queues ▸ percentiles, not averages ▸ don’t intermingle logs (keep a searchable trace ID on requests)
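To see why percentiles are preferred over averages, consider this small sketch with made-up latency numbers. The `percentile` helper uses the nearest-rank method, which is one of several common definitions and is an assumption of this example.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [10] * 98 + [2000] * 2

print(statistics.mean(latencies_ms))  # 49.8
print(percentile(latencies_ms, 50))   # 10
print(percentile(latencies_ms, 99))   # 2000
```

The mean describes a request that never happened, while p50 and p99 show both the typical case and the long tail.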

OBSERVABILITY Over-collect data, but build dashboards carefully ▸ work metrics ▸ is the system doing the thing it’s supposed to? ▸ resource metrics ▸ how are the components of the system behaving? ▸ build your dashboard with work metrics first.

STORY 4 THE TALE OF THE 64 ALERT WEEK

OBSERVABILITY Don’t normalize deviance

OBSERVABILITY Knowing what to alert on ▸ Monitor the alert volume of your system! ▸ Pages should be actionable and represent user pain.

OBSERVABILITY Observability: what we learned ▸ Kiran has a special vendetta against unmonitored queues. ▸ Building good dashboards: work metrics & resource metrics. ▸ Monitor alert volume, too!

3. USABILITY

USABILITY A quick side note: Nielsen Heuristics 1. Visibility of system status 2. Match between system and the real world 3. User control and freedom 4. Consistency and standards 5. Error prevention 6. Recognition vs. recall 7. Flexibility and efficiency of use 8. Aesthetic and minimalist design 9. Help users recognize, diagnose, and recover from errors 10. Help and documentation

STORY 5 THE TALE OF THE SPECIAL SNOWFLAKE SERVICE

USABILITY Heuristic 4. Consistency and Standards ▸ pattern-matching across similar systems is really valuable! ▸ Choose boring technology: spend your innovation tokens wisely!

USABILITY Heuristic 3. User control and freedom ▸ Tooling is a part of the service! ▸ relatedly: deploy mechanisms affect availability! ▸ Give operators the ability to change operational parameters.

STORY 6 THE TALE OF THE OPS SPELL BOOK

USABILITY Heuristic 6. Recognition vs. recall ▸ Keep checklists minimal and heavily automated. ▸ long flowcharts in a runbook are :( ▸ relatedly: scripting user communications is helpful.

USABILITY Heuristic 1. Visibility of system status ▸ which of these are changes to production? ▸ config changes ▸ deploys ▸ utility script runs ▸ failovers ▸ adding/decreasing capacity

STORY 7 THE TALE OF THE AMBIGUOUS ERROR MESSAGE

USABILITY Heuristic 9. Help users recognize, diagnose, and recover from errors ▸ error messages are a crucial part of your interface ▸ Writing a good alert message: ▸ expressed in plain language, precisely indicate the problem, and constructively suggest a solution (runbooks!) ▸ (ex.) CRITICAL: Served 5% 5xx results in the last 5 minutes! <link to runbook>
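A message in the shape of the example above could be generated by a helper like this sketch. The function name, its parameters, and the runbook URL are hypothetical; only the wording of the message follows the slide.

```python
def format_alert(severity, error_count, total_count, window_minutes, runbook_url):
    """Build a plain-language alert that states the problem and links a runbook."""
    rate = 100.0 * error_count / total_count if total_count else 0.0
    return (
        f"{severity}: Served {rate:.0f}% 5xx results "
        f"in the last {window_minutes} minutes! Runbook: {runbook_url}"
    )


# Hypothetical counts and URL, for illustration only.
print(format_alert("CRITICAL", 50, 1000, 5, "https://runbooks.example.com/5xx"))
```

Putting the measured value, the time window, and the runbook link in the message itself means the operator does not have to recall any of them at 3 a.m.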

USABILITY Usability, in review ▸ Operational experience matters! Consider: ▸ whether the system follows general conventions. ▸ how it alerts operators to errors clearly and unambiguously. ▸ how minimal and usable the tooling is.

REVIEW Review ▸ Robustness ▸ Does your system converge to a stable state? ▸ Observability ▸ Can you infer what the internal state of the system looks like? ▸ Usability ▸ Do your operators have control over the state of the system? Do you adhere to general standards?

STORY THE LAST THE TALE OF THE SAD QUEUE :(

STORY THE LAST A DARK AND STORMY NIGHT

REVIEW Resources ▸ Harvest, Yield, and Scalable Tolerant Systems (Brewer & Fox) ▸ How Complex Systems Fail (Cook) ▸ "Going solid": a model of system dynamics and consequences for patient safety (Cook) ▸ Nielsen’s Usability Heuristics ▸ Choose Boring Technology (Dan McKinley) ▸ Site Reliability Engineering: How Google Runs Production Systems ▸ Stripe’s (upcoming) rate limiting blog post ▸ Collection of postmortems (Dan Luu)

REVIEW On Designing and Deploying Internet-Scale Services, James Hamilton ▸ list of best practices, from design, to upgrades, to incident response

THANKS! Thanks to Ines Sombra, Charity Majors, Alyssa Frazee, Rachel Sanders, and Andy Bonventre for review!

APPENDIX STUFF I COULDN’T GET TO

OBSERVABILITY decouple deploys from releases ▸ get a minimal version in dark-reads into production asap ▸ corollary: have good kill switches! ▸ Know what rollbacks look like
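One reading of "dark reads" with a kill switch is sketched below: the existing read path stays authoritative, the new path is exercised in the shadows, and a flag turns it off without a deploy. This interpretation, the function name, and the `dark_read_enabled` flag are assumptions made for the example.

```python
def read_with_dark_path(key, old_read, new_read, flags, mismatches):
    """Serve from the existing path; optionally exercise the new one in the dark."""
    result = old_read(key)
    if flags.get("dark_read_enabled", False):  # kill switch: flip off to stop dark reads
        try:
            shadow = new_read(key)
            if shadow != result:
                mismatches.append(key)  # record the difference, never surface it
        except Exception:
            mismatches.append(key)      # failures in the new path stay invisible to users
    return result


# Hypothetical stores: the new system disagrees on key "b".
old_store = {"a": 1, "b": 2}
new_store = {"a": 1, "b": 3}
flags = {"dark_read_enabled": True}
mismatches = []
print(read_with_dark_path("b", old_store.get, new_store.get, flags, mismatches))  # 2
print(mismatches)  # ['b']
```

Because callers only ever see the old result, the new code can ship long before it is released, and rolling back is a flag change.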

OBSERVABILITY collect operational metrics in this shadow phase ▸ Gain historical knowledge of what the system’s healthy state looks like. ▸ Tweak your alerts and SLAs. ▸ Gameday the system! Write runbooks!
