Availability, Latency and Cost: Withstanding Regional Outages

Availability, Latency and Cost: Withstanding Regional Outages @aaronblohowiak aaronb@netflix.com

What to expect ● Why? ● Overview! ● Algebraic Models ○ Availability! ○ Latency! ○ Cost! ● Architecture! @aaronblohowiak

You never let a serious crisis go to waste. And what I mean by that is it's an opportunity to do things you think you could not do before. - Rahm Emanuel

Good, not great. 1. Instability 2. Infrequency 3. GOTO 1. @aaronblohowiak

Source: https://martinfowler.com/bliki/FrequencyReducesDifficulty.html

One of my favorite soundbites is: if it hurts, do it more often. - Martin Fowler

Operational Burden 1. Alerts 2. Canaries 3. WoW Metrics @aaronblohowiak

From Burden to Advantage @aaronblohowiak

In general, freedom and rapid recovery is better than trying to prevent error. We are in a creative business, not a safety-critical business. - jobs.netflix.com/culture

Overview

Problem Description Number of Regions @aaronblohowiak

100% Capacity @aaronblohowiak

Problem Description Number of Regions @aaronblohowiak

N+1 Architecture @aaronblohowiak

100% 1+0 (no spare) @aaronblohowiak

100% 100% 1+1 = 200% @aaronblohowiak

2+1 = 150% ?!?!?!?!?! 50% 50% 50% @aaronblohowiak
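The N+1 percentages on these slides can be reproduced with a small sketch. It assumes, as the slides appear to, that traffic is balanced evenly and that each region is sized to carry 1/N of total peak demand. The function names are illustrative and do not come from the talk.

```python
from fractions import Fraction


def region_capacity(n: int) -> Fraction:
    """Capacity of one region as a fraction of total peak demand (assumed 1/N)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return Fraction(1, n)


def total_capacity(n: int, spares: int = 1) -> Fraction:
    """Capacity provisioned across all N + spares regions."""
    return (n + spares) * region_capacity(n)


def survives_one_outage(n: int, spares: int = 1) -> bool:
    """True if the regions left after losing one can still carry 100% of demand."""
    remaining_regions = n + spares - 1
    return remaining_regions * region_capacity(n) >= 1


for n, spares in [(1, 0), (1, 1), (2, 1), (4, 1)]:
    print(
        f"{n}+{spares}: "
        f"each region {float(region_capacity(n)):.0%}, "
        f"total {float(total_capacity(n, spares)):.0%}, "
        f"survives one outage: {survives_one_outage(n, spares)}"
    )
```

Under these assumptions 1+1 provisions 200% and 2+1 provisions 150%, while 1+0 cannot absorb the loss of its only region.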

2+1 Overview @aaronblohowiak

Excess Risk @aaronblohowiak

Algebraic Models

All models are wrong but some are useful - George Box

Availability

Distribution of Change Number of Regions Balance of Traffic Empirical Risk @aaronblohowiak

Latency

Which Latency?

Normal vs Failover

Latency ??? Availability Cost @aaronblohowiak

If you’re successful, hourly demand maps to population by longitude. - Blohowiak’s Third Law

Measuring Latency @aaronblohowiak

2+1 50% 50% 50% @aaronblohowiak

In N+1 Architecture, minimal failover overhead is 1/N. Cost = 100% + 1/N If costs are pure throughput @aaronblohowiak
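One way to see where the 1/N comes from, sketched under the assumption that traffic is spread evenly across all N+1 regions and that a failed region's traffic is redistributed evenly over the N survivors:

```latex
\begin{align*}
\text{normal load per region}   &= \frac{1}{N+1} \\
\text{failover load per region} &= \frac{1}{N} \\
\frac{\text{failover load}}{\text{normal load}} &= \frac{N+1}{N} = 1 + \frac{1}{N} \\
\text{total provisioned capacity} &= (N+1)\cdot\frac{1}{N} = 100\% + \frac{1}{N}
\end{align*}
```

For 2+1 this gives 1/2, matching the 150% total on the earlier slide.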

Throughput Portion 100% Database Portion

2+1 All data everywhere >150%

Database Portion (DBP) Region Replication Factor (RRF) @aaronblohowiak

In RRF=All: T is Throughput Cost, T = (1 - DBP) * (1 + 1/N). D is DB Cost, D = DBP * (N + 1). Total = T + D @aaronblohowiak
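A worked instance of the RRF=All model for a 2+1 deployment. The value DBP = 0.2 is a hypothetical chosen only for illustration; it does not come from the talk.

```latex
\begin{align*}
N &= 2, \qquad \mathrm{DBP} = 0.2 \\
T &= (1 - \mathrm{DBP})\left(1 + \tfrac{1}{N}\right) = 0.8 \times 1.5 = 1.2 \\
D &= \mathrm{DBP}\,(N + 1) = 0.2 \times 3 = 0.6 \\
\text{Total} &= T + D = 1.8
\end{align*}
```

With these assumed inputs the total is 180% of the single-region baseline, above the 150% that a pure-throughput workload would cost.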

In RRF=2: T is Throughput Cost, T = (1 - DBP) * (1 + 1/N). D is DB Cost, D = DBP * 2. Total = T + D @aaronblohowiak

Cost Summary ● 50% throughput overhead plus tripled database cost for 3-region RRF=all. ● 25% throughput overhead plus doubled database cost for 5-region RRF=2, plus a lot of complexity. @aaronblohowiak
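The two cost models can be compared side by side with a minimal sketch of the formulas from the preceding slides. The function names and the sample DBP values are illustrative assumptions, not figures from the talk.

```python
def throughput_cost(dbp: float, n: int) -> float:
    """T = (1 - DBP) * (1 + 1/N): throughput portion plus failover headroom."""
    return (1 - dbp) * (1 + 1 / n)


def database_cost(dbp: float, copies: int) -> float:
    """D = DBP * copies, where copies is the number of regions holding the data."""
    return dbp * copies


def total_cost_rrf_all(dbp: float, n: int) -> float:
    """RRF=All: every one of the N + 1 regions holds a full copy of the data."""
    return throughput_cost(dbp, n) + database_cost(dbp, n + 1)


def total_cost_rrf_2(dbp: float, n: int) -> float:
    """RRF=2: the data is held in exactly two regions."""
    return throughput_cost(dbp, n) + database_cost(dbp, 2)


# Hypothetical database portions, chosen only to show the trend.
for dbp in (0.0, 0.1, 0.3, 0.5):
    three_region_all = total_cost_rrf_all(dbp, n=2)  # 2+1, RRF=All
    five_region_two = total_cost_rrf_2(dbp, n=4)     # 4+1, RRF=2
    print(
        f"DBP={dbp:.0%}: 3-region RRF=All {three_region_all:.0%}, "
        f"5-region RRF=2 {five_region_two:.0%}"
    )
```

At DBP = 0 the two layouts reduce to the pure-throughput figures of 150% and 125%; as DBP grows, the database multiplier (3x versus 2x) dominates the difference.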

Architecture

Multi-Site Fault Isolation ● No cross-region Requests! ● Stateless or Async* Replication! ○ Cache Replication! ● Change One Region at a Time! @aaronblohowiak

To shard or not to shard? That is the question. ● Steering ● Rebalancing & Rehoming ● Cost ● Satellites ● Graph vs Multi-tenant @aaronblohowiak

How to RRF=2 with 1/N overhead? ● Central Savior ● Ring ● Custom Hashing @aaronblohowiak

Central Savior @aaronblohowiak

Ring Regions @aaronblohowiak

One More Thing @aaronblohowiak

What percentage of your outages come from regional failures? @aaronblohowiak

Many of the availability benefits come from isolation, not regions. @aaronblohowiak

What percentage of your outages come from database failures? @aaronblohowiak

Maybe for you and your org having logical stacks makes the most sense. @aaronblohowiak

Closing Thoughts @aaronblohowiak

Questions? @aaronblohowiak
