LEARNING TO BEND BUT NOT BREAK AT NETFLIX
Whoops, something went wrong… Netflix Streaming Error We’re having trouble playing this title right now. Please try again later or select a different title.
Functional Sharding · RPC Tuning · Bulkheads & Fallbacks (diagram: Client → Server across Shard A, Shard B, Shard C)
How to stay up in spite of change and turmoil? How to fail well? How to help teams build more resilient systems? Non-Critical Service Owner. Critical Service Owner. Chaos Engineer.
Service Criticality Driver_free_car.jpg, CC BY-SA 3.0, BP63Vincente 2015, Wikimedia
Service Criticality. KPI = Playback Starts Per Second (SPS). (diagram: Services A–G partitioned into Critical and Non-Critical)
Non-Critical Service Owner. Critical Service Owner. Chaos Engineer.
Badging
My service is non-critical, who needs Chaos? How do you know your service is non-critical?
https://github.com/Netflix/Hystrix: Insights, Bulkheads, Circuit Breakers, Timeouts, Fallbacks
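The primitives Hystrix provides can be sketched in a few lines. This is a hypothetical Python illustration of the timeout/fallback/circuit-breaker pattern, not the actual Hystrix (Java) API: after enough consecutive failures the breaker opens and calls go straight to the fallback, shedding load from the failing dependency.

```python
class CircuitBreaker:
    """Minimal sketch of a circuit breaker; names are hypothetical."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def open(self):
        # Once enough consecutive failures accumulate, stop calling
        # the dependency and go straight to the fallback.
        return self.failures >= self.failure_threshold

    def record(self, success):
        self.failures = 0 if success else self.failures + 1


def call_with_fallback(command, fallback, breaker):
    """Run a dependency call with a fallback, guarded by the breaker."""
    if breaker.open:
        return fallback()            # shed load: skip the dependency entirely
    try:
        result = command()
        breaker.record(success=True)
        return result
    except Exception:                # timeout or dependency failure
        breaker.record(success=False)
        return fallback()
```

A non-critical service like Badging would supply a static fallback here, so the API Service degrades gracefully when Badging is down.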
Badging Service (Non-Critical): API Service → Badging Service, with a fallback.
Surprise! Badging is Critical! (API Service → Badging Service, with a fallback)
Gaps in Traditional Testing ● Environmental factors may differ between test and production (config, data, etc.) ● Systems behave differently under load than they do in a single unit or integration test ● Users react differently to failures than you expect.
How to fail well? ● Functioning fallbacks. ● Use Chaos to close gaps in traditional testing methods. Non-Critical Service Owner. Critical Service Owner. Chaos Engineer.
Critical Service Owner. Non-Critical Service Owner. Chaos Engineer.
Protect your service (and your customers)
How can I decrease the blast radius of failures? How about functional sharding!
Playback Service Architecture (diagram: three API Service instances call the Playback Service and the URL Service)
CRITICAL: Customer Experience or Streaming Impact. NON-CRITICAL: Performance Impact.
Playback Service Functional Shards (diagram: API Services route to a Critical Playback Service and a Non-Critical Playback Service, and to a Critical URL Service and a Non-Critical URL Service)
CC BY-NC 2.5, Randall Munroe, xkcd.com
Experimenting with Shards (diagram: API Services → Critical Playback Service, Non-Critical Playback Service, and the Non-Critical URL Service split out for experiments)
Customer Behavior Insights (diagram: API Services → Critical Playback Service, Non-Critical Playback Service, URL Services; 25% more non-critical traffic observed)
How do I confirm my system is tuned properly? Inject latency, of course!
Dependency Tuning (Playback Service → Customer Tag Service): ● Retries ● Timeouts ● Load balancing strategies ● Concurrency limits ● Circuit breakers
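The knobs above can be captured in a single client configuration. This is a hypothetical sketch of such a config for the Playback Service's Customer Tag Service client; all names and values are illustrative, not Netflix's actual settings.

```python
from dataclasses import dataclass


@dataclass
class RpcClientConfig:
    """Illustrative RPC tuning knobs for one dependency (hypothetical)."""
    connect_timeout_ms: int = 50
    request_timeout_ms: int = 300        # overall deadline for the call
    max_retries: int = 1                 # retries eat into the deadline
    concurrency_limit: int = 10          # bulkhead: max in-flight requests
    circuit_breaker_error_pct: int = 50  # trip when error rate exceeds 50%
    load_balancing: str = "least_loaded" # vs. round_robin, random


# Example: the Customer Tag Service client from the slides.
customer_tag_client = RpcClientConfig(request_timeout_ms=300, max_retries=1)
```

Latency injection (next slides) is how you find out whether these values actually fit together under load.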
Calendar*, CC BY 2.0, Dafne Cholete 2011, Flickr
Playback Service → Customer Tag Service Playback Service Customer Tag Service
Latency Injection - Round 1 Playback Service Customer Tag Service
Latency Injection - Round 2 Playback Service Customer Tag Service
Latency Injection - Round 2: the Playback Service has a 300ms timeout, but the first call (1. Customer Tag Service) takes 350ms under injected latency, leaving no time for 2. URL Service. Out of time!!
Continuous Experimentation FTW! ● Fewer changes between experiments make it easier to isolate a regression. ● Fine-grained experiments scope the investigation (as opposed to outages, where there are lots of red herrings).
How to stay up in spite of change and turmoil? ● Functional sharding for fault isolation. ● Tune RPC calls. ● Use Chaos to validate config and resiliency strategies. Critical Service Owner. Non-Critical Service Owner. Chaos Engineer.
Chaos Engineer. Non-Critical Service Owner. Critical Service Owner.
How do you help teams build more resilient systems? We need to do more of the heavy lifting. Perhaps the Principles of Chaos can help!
Principles of Chaos ● Minimize Blast Radius ● Build a Hypothesis around Steady State Behavior ● Vary Real-world Events ● Run Experiments in Production ● Automate Experiments to Run Continuously https://principlesofchaos.org/
Test v. Production. Rock-em, CC BY-SA 2.0, Ariel Waldman 2009, Flickr
How can we Minimize Blast Radius? Safety, safety, safety!!
Kill Switch
Canary Strategy (diagram: Service A → Service B → Service C; 0.5% of traffic routed to Service B (Control), 0.5% to Service B (Experiment))
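The split above can be sketched as hash-based bucketing: 0.5% of customers go to a control cluster, 0.5% to an experiment cluster, and the other 99% stay on the baseline. This is a hypothetical routing function, not ChAP's actual implementation; hashing the customer ID keeps each customer in the same cluster for the whole experiment.

```python
import hashlib


def route(customer_id, control_pct=0.5, experiment_pct=0.5):
    """Deterministically bucket a customer into baseline/control/experiment."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    # Map the hash onto 0.00-99.99 so percentages read naturally.
    bucket = int.from_bytes(digest[:4], "big") % 10000 / 100
    if bucket < control_pct:
        return "control"
    if bucket < control_pct + experiment_pct:
        return "experiment"
    return "baseline"
```

Because control and experiment clusters are the same size and drawn the same way, any difference between them is attributable to the injected failure, not to traffic skew.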
Limit Impact - Runs In Progress

Experiment  Cluster     Status
Latency     api-prod    In Progress
Latency     dredd-prod  In Progress
Failure     api-prod    Queued
Limit When Experiments can Run Safety First during the Holidays
Ensure Failures are Addressed
Fail Open 1. Control errors too high. 2. Errors in chaos code unrelated to the experiment in question. 3. Platform components crashing (monitoring, worker nodes, etc).
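The three fail-open conditions above can be folded into one abort check that the experiment runner evaluates continuously. A minimal sketch; the thresholds and the monitor interface are hypothetical, but the conditions mirror the slide.

```python
def should_abort(control_error_rate, error_budget,
                 chaos_code_errors, platform_healthy):
    """Fail open: abort the experiment whenever any safety condition trips."""
    if control_error_rate > error_budget:
        return True   # 1. control errors too high: something else is wrong
    if chaos_code_errors > 0:
        return True   # 2. bugs in the chaos tooling itself, not the experiment
    if not platform_healthy:
        return True   # 3. monitoring or worker nodes crashed: we are blind
    return False
```

When `should_abort` returns True, the kill switch fires and traffic is restored, so a broken experiment can never do more damage than the failure it was simulating.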
How should we Build a Hypothesis around Steady State? Observability is key! Add effective monitoring, analysis, and insights.
Insights
Automated Canary Analysis (ACA) https://medium.com/netflix-techblog/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69
ChAP ACA Configurations: ● Validate the experiment itself. ● Validate the real-time monitoring didn't miss anything. ● Check for service failures even if they didn't cause an impact in KPIs. ● See if your service is approaching an unhealthy state.
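At its core, each ACA check compares a metric between the control and experiment clusters. This is a deliberately simplistic sketch of one such check against the SPS steady-state KPI; Kayenta (linked above) does real statistical canary judgment, and the tolerance here is a made-up placeholder.

```python
def aca_check(control_sps, experiment_sps, tolerance=0.05):
    """Flag the experiment if its KPI deviates too far from control.

    Hypothetical sketch: a real canary judge uses statistical tests,
    not a fixed percentage threshold.
    """
    deviation = abs(experiment_sps - control_sps) / control_sps
    return "pass" if deviation <= tolerance else "fail"
```

Running several such checks (KPIs, error rates, system health) is what lets the experiment itself, not just the on-call engineer, decide whether steady state held.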
How do you Vary Real-world Events in an automated fashion? By carefully designing and prioritizing your experiments, of course!
Understand the Service Under Test. Dependency Insights: ● Timeouts ● Retries ● % of Requests Involved ● Requests Per Second ● Latency ● Hystrix Commands ○ Fallbacks ○ Timeouts
Evaluate Safety NOT SAFE TO FAIL!!!
Can more automation eventually lead to fewer experiments?
Prioritize Experiments: Retries · Traffic Percentage · Experiment Type (Failure / Latency) · Aging
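A priority score built from the factors above might look like this. The weights are entirely hypothetical; the point is only that busier dependencies, retry-amplified paths, failure experiments, and experiments that haven't run recently ("aging") all push a candidate toward the front of the queue.

```python
def priority(traffic_pct, has_retries, experiment_type, days_since_last_run):
    """Score an experiment candidate; higher runs sooner. Weights are made up."""
    score = traffic_pct                    # busier dependencies first
    if has_retries:
        score += 10   # retries can amplify failures, so test them sooner
    if experiment_type == "failure":
        score += 5    # rank failure experiments above latency ones
    score += min(days_since_last_run, 30)  # aging: stale results lose value
    return score


# Example candidates (dependency names are illustrative).
experiments = [
    {"dep": "license", "traffic_pct": 88.85, "has_retries": True,
     "experiment_type": "failure", "days_since_last_run": 30},
    {"dep": "badging", "traffic_pct": 2.0, "has_retries": False,
     "experiment_type": "latency", "days_since_last_run": 5},
]
experiments.sort(
    key=lambda e: priority(e["traffic_pct"], e["has_retries"],
                           e["experiment_type"], e["days_since_last_run"]),
    reverse=True,
)
```

Sorting by score is what turns "vary real-world events" into an ordered, automatable queue instead of a manual choice.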
Generate Experiments: Failure and Latency experiments per dependency.
Is it time to Run Experiments in Production? Here we go!
What happened? 14 Vulnerabilities. 0 Outages. Tooling Confidence Gaps.
Example Finding: Playback Service → License Service with 376 ms injected latency. No Fallback!
88.85% of cluster traffic. Circuit Breaker. Timeouts. Thread Pool (10 threads). Rejections.
Fully validated fix in tool before rollout!
After a day's worth of data, the results are looking fantastic. Every negative metric [for that Hystrix command] had a drastic improvement, and some by an order of magnitude. --Robert Reta, Playback Licensing
What else can be safer?
How do you help teams build more resilient systems? ● Apply the "Principles of Chaos" to tooling. ● Manage the heavy lifting. Chaos Engineer. Non-Critical Service Owner. Critical Service Owner.
You Must be This Tall to Ride?
How to stay up in spite of change and turmoil?
● Functional sharding for fault isolation.
● Tune RPC calls.
● Use Chaos to validate config and resiliency strategies.
How to fail well?
● Functioning fallbacks.
● Use Chaos to close gaps in traditional testing methods.
How to help teams build more resilient systems?
● Apply the "Principles of Chaos" to tooling.
● Manage the heavy lifting.
Non-Critical Service Owner. Critical Service Owner. Chaos Engineer.
You Can Either Curl Up In A Ball And Die… Or You Can Stand Up And Say, “We’re Different. We’re The Strong Ones, And You Can’t Break Us!” Haley Tucker Senior Software Engineer Chaos Engineering @hwilson1204