
Availability, Latency and Cost: Withstanding Regional Outages



  1. Availability, Latency and Cost: Withstanding Regional Outages @aaronblohowiak aaronb@netflix.com

  2. What to expect ● Why? ● Overview! ● Algebraic Models ○ Availability! ○ Latency! ○ Cost! ● Architecture! @aaronblohowiak

  3. Why?

  4. You never let a serious crisis go to waste. And what I mean by that is it's an opportunity to do things you think you could not do before. - Rahm Emanuel

  5. Good, not great. @aaronblohowiak

  6. Good, not great. 1. Instability @aaronblohowiak

  7. Good, not great. 1. Instability 2. Infrequency @aaronblohowiak

  8. Good, not great. 1. Instability 2. Infrequency 3. GOTO 1. @aaronblohowiak

  9. Source: https://martinfowler.com/bliki/FrequencyReducesDifficulty.html

  10. One of my favorite soundbites is: if it hurts, do it more often. - Martin Fowler

  11. Operational Burden 1. Alerts @aaronblohowiak

  12. Operational Burden 1. Alerts 2. Canaries @aaronblohowiak

  13. Operational Burden 1. Alerts 2. Canaries 3. WoW Metrics @aaronblohowiak

  14. From Burden to Advantage @aaronblohowiak

  15. In general, freedom and rapid recovery is better than trying to prevent error. We are in a creative business, not a safety-critical business. - jobs.netflix.com/culture

  16. Overview

  17. Problem Description Number of Regions @aaronblohowiak

  18. @aaronblohowiak

  19. @aaronblohowiak

  20. 100% Capacity @aaronblohowiak

  21. Problem Description Number of Regions @aaronblohowiak

  22. N+1 Architecture @aaronblohowiak

  23. 100% 1+0 (no spare) @aaronblohowiak

  24. 100% 100% 1+1 @aaronblohowiak

  25. 100% 100% 1+1 = 200% @aaronblohowiak

  26. 2+1 50% 50% 50% @aaronblohowiak

  27. 2+1 = 150% 50% 50% 50% @aaronblohowiak

  28. 2+1 = 150% ?!?!?!?!?! 50% 50% 50% @aaronblohowiak
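The arithmetic on the N+1 slides above can be sketched in a few lines. This is a minimal illustration, assuming traffic is split evenly across regions and that the N surviving regions must absorb all demand after any single region fails. The function name `n_plus_one` is hypothetical and does not come from the talk.

```python
from fractions import Fraction


def n_plus_one(n):
    """Return (per-region capacity, total provisioned capacity) for an N+1 layout.

    Both values are fractions of peak global demand. Assumes an even traffic
    split and a single-region failure (illustrative assumptions).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    per_region = Fraction(1, n)    # each region is sized to carry 1/n of demand
    total = per_region * (n + 1)   # n + 1 regions are deployed in total
    return per_region, total


for n in (1, 2, 4):
    per_region, total = n_plus_one(n)
    print(f"{n}+1: each region {float(per_region):.0%}, total {float(total):.0%}")
```

Under these assumptions, 1+1 provisions 200% in total and 2+1 provisions 150%, which matches the figures on the slides.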

  29. 2+1 Overview @aaronblohowiak

  30. @aaronblohowiak

  31. @aaronblohowiak

  32. @aaronblohowiak

  33. Excess Risk @aaronblohowiak

  34. @aaronblohowiak

  35. @aaronblohowiak

  36. @aaronblohowiak

  37. Algebraic Models

  38. All models are wrong but some are useful - George Box

  39. Availability

  40. Distribution of Change Number of Regions Balance of Traffic @aaronblohowiak

  41. Distribution of Change Number of Regions Balance of Traffic @aaronblohowiak

  42. Distribution of Change Number of Regions Balance of Traffic @aaronblohowiak

  43. @aaronblohowiak

  44. Distribution of Change Number of Regions Balance of Traffic @aaronblohowiak

  45. Distribution of Change Number of Regions Balance of Traffic @aaronblohowiak

  46. @aaronblohowiak

  47. Distribution of Change Number of Regions Balance of Traffic Empirical Risk @aaronblohowiak

  48. Latency

  49. Which Latency?

  50. Normal vs Failover

  51. Latency ??? Availability Cost @aaronblohowiak

  52. If you’re successful, hourly demand maps to population by longitude. - Blohowiak’s Third Law

  53. Measuring Latency @aaronblohowiak

  54. @aaronblohowiak

  55. @aaronblohowiak

  56. @aaronblohowiak

  57. Measuring Latency @aaronblohowiak

  58. Measuring Latency @aaronblohowiak

  59. Cost

  60. 2+1 50% 50% 50% @aaronblohowiak

  61. @aaronblohowiak

  62. In N+1 Architecture, minimal failover overhead is 1/N. @aaronblohowiak

  63. In N+1 Architecture, minimal failover overhead is 1/N. Cost = 100% + 1/N @aaronblohowiak

  64. In N+1 Architecture, minimal failover overhead is 1/N. Cost = 100% + 1/N If costs are pure throughput @aaronblohowiak

  65. 100%

  66. Throughput Portion 100% Database Portion

  67. 2+1 @aaronblohowiak

  68. 2+1 All data everywhere

  69. 2+1 All data everywhere >150%

  70. Database Portion (DBP) and Region Replication Factor (RRF) @aaronblohowiak

  71. In RRF=All: T is Throughput Cost, T = (1 - DBP) * (1 + 1/N). D is DB Cost, D = DBP * (N + 1). Total = T + D. @aaronblohowiak
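The RRF=All model on this slide translates directly into code. The sketch below assumes DBP is the fraction of a single-region baseline cost spent on the database, and that the result is expressed as a multiple of that baseline. The function name `cost_rrf_all` and the 30% example value are hypothetical.

```python
def cost_rrf_all(n, dbp):
    """Total cost of an N+1 deployment with all data replicated to every region.

    n   -- number of regions needed to serve peak load (the N in N+1)
    dbp -- database portion of the baseline cost, between 0 and 1 (assumed meaning)
    Returns cost as a multiple of the single-region baseline.
    """
    if n < 1 or not 0 <= dbp <= 1:
        raise ValueError("need n >= 1 and 0 <= dbp <= 1")
    throughput = (1 - dbp) * (1 + 1 / n)   # T = (1 - DBP) * (1 + 1/N)
    database = dbp * (n + 1)               # D = DBP * (N + 1)
    return throughput + database


# Hypothetical example: 2+1 with 30% of baseline cost in the database.
print(round(cost_rrf_all(2, 0.30), 3))
```

With no database cost the 2+1 result is 1.5 (the 150% from the earlier slides), and it climbs toward 3.0 as the database portion approaches 100%, which is consistent with the ">150%" noted for "all data everywhere".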

  72. @aaronblohowiak

  73. In RRF=2: T is Throughput Cost, T = (1 - DBP) * (1 + 1/N). D is DB Cost, D = DBP * 2. Total = T + D. @aaronblohowiak

  74. @aaronblohowiak

  75. @aaronblohowiak

  76. Cost Summary ● 50% throughput overhead plus tripled database cost for 3-region RRF=all. ● 25% throughput overhead plus doubled database cost for 5-region RRF=2, plus a lot of complexity. @aaronblohowiak
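The two configurations in the cost summary can be compared with the formulas from the earlier slides. This is a rough sketch: generalizing the database term to D = DBP * RRF is an assumption that extends the RRF=2 slide, the name `total_cost` is hypothetical, and the 30% database portion is an invented example value, not a figure from the talk.

```python
def total_cost(n, dbp, rrf=None):
    """Cost of an N+1 layout as a multiple of the single-region baseline.

    n   -- regions needed to serve peak load (the N in N+1)
    dbp -- database portion of baseline cost, between 0 and 1 (assumed meaning)
    rrf -- region replication factor; None means a copy in every region
    """
    copies = n + 1 if rrf is None else rrf
    if n < 1 or not 0 <= dbp <= 1 or not 1 <= copies <= n + 1:
        raise ValueError("invalid parameters")
    throughput = (1 - dbp) * (1 + 1 / n)   # T = (1 - DBP) * (1 + 1/N)
    database = dbp * copies                # D = DBP * RRF (assumed generalization)
    return throughput + database


DBP = 0.30  # hypothetical database portion
three_region_all = total_cost(2, DBP)          # 3 regions (2+1), RRF=all
five_region_rrf2 = total_cost(4, DBP, rrf=2)   # 5 regions (4+1), RRF=2
print(f"3-region RRF=all: {three_region_all:.3f}x baseline")
print(f"5-region RRF=2:   {five_region_rrf2:.3f}x baseline")
```

Setting `dbp` to 0 recovers the pure throughput overheads from the summary (1.5x and 1.25x), and setting it to 1 recovers the tripled and doubled database costs. The model says nothing about the added operational complexity of the 5-region option.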

  77. Architecture

  78. Multi-Site Fault Isolation ● No cross-region Requests! ● Stateless or Async* Replication! ○ Cache Replication! ● Change One Region at a Time! @aaronblohowiak

  79. To shard or not to shard? That is the question. @aaronblohowiak

  80. To shard or not to shard? That is the question. Steering ● @aaronblohowiak

  81. To shard or not to shard? That is the question. Steering ● Rebalancing & Rehoming ● @aaronblohowiak

  82. To shard or not to shard? That is the question. Steering ● Rebalancing & Rehoming ● Cost ● @aaronblohowiak

  83. To shard or not to shard? That is the question. Steering ● Rebalancing & Rehoming ● Cost ● Satellites ● @aaronblohowiak

  84. To shard or not to shard? That is the question. Steering ● Rebalancing & Rehoming ● Cost ● Satellites ● Graph vs Multi-tenant ● @aaronblohowiak

  85. How to RRF=2 with 1/N overhead? Central Savior ● Ring ● Custom Hashing ● @aaronblohowiak

  86. Central Savior @aaronblohowiak

  87. Central Savior @aaronblohowiak

  88. Ring Regions @aaronblohowiak

  89. Ring Regions @aaronblohowiak

  90. Ring Regions @aaronblohowiak

  91. One More Thing @aaronblohowiak

  92. What percentage of your outages come from regional failures? @aaronblohowiak

  93. Many of the availability benefits come from isolation, not regions. @aaronblohowiak

  94. What percentage of your outages come from database failures? @aaronblohowiak

  95. Maybe for you and your org having logical stacks makes the most sense. @aaronblohowiak

  96. Closing Thoughts @aaronblohowiak

  97. Questions? @aaronblohowiak
