Designing Services for Resilience Experiments: Lessons from Netflix
Nora Jones, Senior Chaos Engineer | @nora_js
So, how can teams design services for resilience testing?
● Failure Injection Enabled
● RPC enabled
● Fallback Paths
  ○ And ways to discover them
● Proper monitoring
  ○ Key business metrics to look for
● Proper timeouts
  ○ And ways to discover them
Known Ways to Increase Confidence in Resilience
● Unit Tests
● Integration Tests

New Ways to Increase Confidence in Resilience
● Chaos Experiments
SPS (stream starts per second): Key Business Metric
Chaos Engineering: Netflix’s ChAP
[Diagram: the Gateway routes 98% of traffic to the production API and Personalization path, 1% to an API control cluster, and 1% to an API experiment cluster]
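To make the split concrete, here is a minimal Java sketch of ChAP-style traffic routing, assuming a simple random split; the class name, cluster names, and parameters are illustrative, not Netflix's actual implementation.

```java
// Hypothetical sketch of ChAP-style traffic splitting: route a small, equal
// slice of requests to a control cluster and an experiment cluster so their
// key metrics (e.g. SPS) can be compared directly.
import java.util.concurrent.ThreadLocalRandom;

public class TrafficSplitter {
    enum Cluster { PRODUCTION, CONTROL, EXPERIMENT }

    private final double experimentFraction; // e.g. 0.01 for 1%

    public TrafficSplitter(double experimentFraction) {
        this.experimentFraction = experimentFraction;
    }

    /** Pick a cluster for one request: 1% experiment, 1% control, rest production. */
    public Cluster route() {
        double r = ThreadLocalRandom.current().nextDouble();
        if (r < experimentFraction) {
            return Cluster.EXPERIMENT;   // failure is injected only here
        } else if (r < 2 * experimentFraction) {
            return Cluster.CONTROL;      // identical traffic, no injection
        }
        return Cluster.PRODUCTION;       // the remaining ~98%
    }

    public static void main(String[] args) {
        TrafficSplitter splitter = new TrafficSplitter(0.01);
        for (int i = 0; i < 5; i++) {
            System.out.println("request " + i + " -> " + splitter.route());
        }
    }
}
```

Routing an equal slice to an untouched control cluster is what lets the experiment's key business metric be compared against identical traffic, rather than against the whole fleet.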
Monitoring
[Graph: the key business metric for the control and experiment clusters; the experiment is automatically SHORTED (halted) when the metric deviates]
1. Have Failure Injection Testing Enabled.
Sample Failure Injection Library https://github.com/norajones/FailureInjectionLibrary
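As a rough illustration of what a failure-injection hook can look like, here is a minimal, hypothetical Java wrapper (not the library linked above) that injects latency or an exception into a configurable fraction of downstream calls so fallback paths get exercised.

```java
// Hypothetical failure-injection wrapper: for a configured fraction of calls
// it injects added latency and/or a thrown exception.
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public class FailureInjector {
    private final double failureRate;   // fraction of calls that throw
    private final long addedLatencyMs;  // latency injected before every call

    public FailureInjector(double failureRate, long addedLatencyMs) {
        this.failureRate = failureRate;
        this.addedLatencyMs = addedLatencyMs;
    }

    /** Wrap a downstream call with injected latency and/or failure. */
    public <T> T call(Callable<T> downstream) throws Exception {
        if (addedLatencyMs > 0) {
            Thread.sleep(addedLatencyMs);                     // inject latency
        }
        if (ThreadLocalRandom.current().nextDouble() < failureRate) {
            throw new RuntimeException("injected failure");   // inject error
        }
        return downstream.call();                             // normal path
    }

    public static void main(String[] args) throws Exception {
        FailureInjector injector = new FailureInjector(0.5, 100);
        for (int i = 0; i < 4; i++) {
            try {
                System.out.println(injector.call(() -> "ok"));
            } catch (RuntimeException e) {
                System.out.println("fell back: " + e.getMessage());
            }
        }
    }
}
```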
Types of Chaos Failures
Criteria & API
Automating Creation of Chaos Experiments
2. Have Good Monitoring in Place for Configuration Changes.
Have Good Monitoring in Place
● RPC Enabled
  ○ Associated Hystrix Commands
    ■ Associated Fallbacks
● Timeouts
● Retries
● All in One Place!
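A minimal sketch of what gathering this configuration "all in one place" might look like; the record and field names are assumptions for illustration, not Netflix's internal model (requires Java 16+ for records).

```java
// Gather a client's resilience configuration side by side so mismatches
// between timeouts, retries, and fallbacks are easy to spot.
import java.util.List;

public class ResilienceConfigReport {
    record HystrixCommandConfig(String name, boolean hasFallback, long timeoutMs) {}
    record RpcClientConfig(String targetService, long timeoutMs, int retries,
                           List<HystrixCommandConfig> commands) {}

    public static void main(String[] args) {
        RpcClientConfig client = new RpcClientConfig(
                "personalization", 800, 2,
                List.of(new HystrixCommandConfig("GetRecommendations", true, 3000)));

        // Surface the facts an operator needs in one view.
        System.out.printf("%s: rpcTimeout=%dms retries=%d%n",
                client.targetService(), client.timeoutMs(), client.retries());
        for (HystrixCommandConfig cmd : client.commands()) {
            System.out.printf("  %s: fallback=%b hystrixTimeout=%dms%n",
                    cmd.name(), cmd.hasFallback(), cmd.timeoutMs());
        }
    }
}
```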
RPC/Ribbon
● Java library managing REST clients to/from different services
● Fast-failing/fallback capability
RPC/Ribbon Timeouts
RPC Timeouts
At what point does the service give up?
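Ribbon's own configuration isn't shown here; as a stand-in, this plain java.net.http sketch shows the two points at which a client "gives up": the connect timeout and the per-request timeout (values are illustrative).

```java
// Not Ribbon's API: a plain HttpClient example of per-attempt timeouts.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class RpcTimeoutExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(200))   // give up connecting after 200 ms
                .build();

        HttpRequest request = HttpRequest.newBuilder(URI.create("http://example.com/"))
                .timeout(Duration.ofMillis(800))          // give up on the response after 800 ms
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```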
Retries
Immediately retrying an operation after a failure is not usually a great idea.
Retries
Understand the interaction between your timeouts and your retries.
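A minimal retry sketch with exponential backoff (illustrative values, not a Netflix library): the number of attempts and the per-attempt timeout together determine the worst-case latency, which is exactly the timeout/retry interaction the slide points at.

```java
// Retry with exponential backoff instead of retrying immediately.
import java.util.concurrent.Callable;

public class RetryWithBackoff {
    static <T> T callWithRetries(Callable<T> op, int maxAttempts, long initialBackoffMs)
            throws Exception {
        long backoff = initialBackoffMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) break;
                Thread.sleep(backoff);   // back off instead of hammering the dependency
                backoff *= 2;            // exponential backoff
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        try {
            String result = callWithRetries(() -> {
                if (Math.random() < 0.7) throw new RuntimeException("transient failure");
                return "ok";
            }, 3, 100);
            System.out.println(result);
        } catch (Exception e) {
            System.out.println("all retries failed: " + e.getMessage());
        }
    }
}
```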
Circuit Breakers/Fallback Paths
Hystrix Commands/Fallback Paths
If your service is non-critical, ensure that there are fallback paths in place.
Fallback Strategies
● Static Content
● Fallback Cache
● Fallback Service
Fallback Strategies
Know what your fallback strategy is and how to get that information.
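A minimal HystrixCommand sketch with a static-content fallback, assuming the Hystrix library is on the classpath; the command name, group key, and payload are made up for illustration.

```java
// When run() fails, Hystrix executes getFallback() instead.
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class RecommendationsCommand extends HystrixCommand<String> {
    public RecommendationsCommand() {
        super(HystrixCommandGroupKey.Factory.asKey("Personalization"));
    }

    @Override
    protected String run() throws Exception {
        // Normally this would be the remote call to the downstream service.
        throw new RuntimeException("downstream unavailable");
    }

    @Override
    protected String getFallback() {
        // Static-content fallback: an unpersonalized default list.
        return "[\"popular-title-1\",\"popular-title-2\"]";
    }

    public static void main(String[] args) {
        System.out.println(new RecommendationsCommand().execute());  // prints the fallback
    }
}
```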
3. Ensure Synergy between Hystrix Timeouts, RPC timeouts, and retry logic.
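A back-of-the-envelope check of that synergy, with illustrative numbers rather than Netflix's defaults: the Hystrix command timeout has to cover the worst case of (retries + 1) RPC attempts, otherwise it aborts calls the RPC layer is still legitimately retrying.

```java
// Worked arithmetic for the timeout/retry budget (illustrative values).
public class TimeoutBudget {
    public static void main(String[] args) {
        long rpcConnectTimeoutMs = 200;   // per-attempt connect timeout
        long rpcReadTimeoutMs    = 800;   // per-attempt read timeout
        int  retries             = 2;     // retries after the first attempt

        long worstCasePerAttempt = rpcConnectTimeoutMs + rpcReadTimeoutMs;
        long worstCaseTotal      = (retries + 1) * worstCasePerAttempt;

        System.out.println("Worst-case RPC time: " + worstCaseTotal + " ms");
        // A Hystrix timeout below this (say 1000 ms) would cut off calls that
        // the RPC layer was still retrying, making the retries pointless.
    }
}
```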
ChAP’s Monocle
There isn’t always money in microservices
Criticality Score
RPS stats range bucket * number of retries * number of Hystrix Commands = Criticality Score
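A worked example of the formula; the RPS bucket boundaries below are assumptions for illustration, not Monocle's actual configuration. The point is that a service's criticality grows with its traffic, its retries, and its number of Hystrix commands.

```java
// Illustrative criticality score: bucket(RPS) * retries * Hystrix commands.
public class CriticalityScore {
    // Map requests-per-second into a coarse bucket (assumed thresholds).
    static int rpsBucket(double rps) {
        if (rps < 100)    return 1;
        if (rps < 1_000)  return 2;
        if (rps < 10_000) return 3;
        return 4;
    }

    static long score(double rps, int retries, int hystrixCommands) {
        return (long) rpsBucket(rps) * retries * hystrixCommands;
    }

    public static void main(String[] args) {
        // A high-traffic client with 3 retries and 12 Hystrix commands scores
        // far higher than a low-traffic one with 1 retry and 2 commands.
        System.out.println(score(20_000, 3, 12)); // 4 * 3 * 12 = 144
        System.out.println(score(50, 1, 2));      // 1 * 1 * 2  = 2
    }
}
```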
Chaos Success Stories
“We ran a chaos experiment which verifies that our fallback path works, and it successfully caught an issue in the fallback path; the issue was resolved before it resulted in any availability incident!”
“While [failing calls] we discovered an increase in license requests for the experiment cluster even though fallbacks were all successful. ...This likely means that whoever was consuming the fallback was retrying the call, causing an increase in license requests.”
Don’t lose sight of your company’s customers.
Takeaways (@nora_js)
● Designing for resiliency testability is a shared responsibility.
● Configuration changes can cause outages.
● Have explicit monitoring in place for antipatterns in configuration changes.
Questions? @nora_js