  1. Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure. Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat … and a cast of hundreds at Google.

  2. Network availability is the biggest challenge facing large content and cloud providers today.

  3. Why? The push towards higher 9s of availability. At four 9s availability ❖ outage budget is 4 minutes per month. At five 9s availability ❖ outage budget is 24 seconds per month.
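
A quick back-of-the-envelope check of those budgets, as a sketch only; it assumes a four-week (28-day) month, which roughly reproduces the slide’s figures:

```python
# Downtime budget per month for a given number of nines,
# assuming a four-week (28-day) month.
def downtime_budget_seconds(nines: int, month_days: int = 28) -> float:
    availability = 1 - 10 ** (-nines)            # e.g. 4 nines -> 0.9999
    month_seconds = month_days * 24 * 60 * 60
    return month_seconds * (1 - availability)

print(downtime_budget_seconds(4) / 60)  # ~4 minutes per month
print(downtime_budget_seconds(5))       # ~24 seconds per month
```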

  4. How do providers achieve these levels? By learning from failures.

  5. Paper’s Focus: What has Google learnt from failures? Why is high network availability a challenge? What are the characteristics of network availability failures? What design principles can achieve high availability?

  6. Why is high network availability a challenge? Velocity of evolution, scale, management complexity.

  7. Evolution: Network hardware evolves continuously. [Chart: datacenter fabric capacity over time, from 4 Post through Firehose 1.0, Firehose 1.1, Watchtower, and Saturn to Jupiter]

  8. Evolution: So does network software. [Timeline 2006–2014: Google Global Cache, Watchtower, B4, BwE, Andromeda, Freedome, Jupiter, gRPC, QUIC]

  9. Evolution: New hardware and software can ❖ introduce bugs ❖ disrupt existing software. Result: failures!

  10. Scale and Complexity: [Diagram: data centers interconnected by the B4 WAN; B2 connects to other ISPs]

  11. Scale and Complexity: Design differences. B4 and data centers ❖ use merchant silicon chips ❖ centralized control planes. B2 ❖ vendor gear ❖ decentralized control plane.

  12. Scale and Complexity: Design differences. These differences increase management complexity and pose availability challenges.

  13. The Management Plane: Management plane software manages network evolution.

  14. Management Plane Operations: Drain or undrain (temporarily remove from service) links, switches, routers, services; upgrade B4 or data center control plane software; connect a new data center to B2 and B4. Many operations require multiple steps and can take hours or days.

  15. The Management Plane: Low-level abstractions for management operations ❖ command-line interfaces to high-capacity routers. A small mistake by an operator can impact a large part of the network.

  16. Why is high network availability a challenge? What are the characteristics of network availability failures? Duration, severity, prevalence; root-cause categorization.

  17. Key Takeaway: Content provider networks evolve rapidly. The way we manage evolution can impact availability. We must make it easy and safe to evolve the network daily.

  18. We analyzed over 100 post-mortem reports written over a two-year period.

  19. What is a Post-mortem? A blame-free process: a carefully curated description of a previously unseen failure that had significant availability impact. Helps learn from failures.

  20. What a Post-mortem Contains: Description of the failure, with detailed timeline; root-cause(s) confirmed by reproducing the failure; discussion of fixes and follow-up action items.

  21. Failure Examples and Impact: Examples ❖ entire control plane fails ❖ upgrade causes backbone traffic shift ❖ multiple top-of-rack switches fail. Impact ❖ data center goes offline ❖ WAN capacity falls below demand ❖ several services fail concurrently.

  22. Key Quantitative Results: 70% of failures occur when a management plane operation is in progress (evolution impacts availability). Failures are everywhere: all three networks and three planes see comparable failure rates (no silver bullet). 80% of failure durations are between 10 and 100 minutes (need fast recovery).

  23. Root causes: Lessons learned from root causes motivate availability design principles.

  24. Why is high network availability a challenge? What are the characteristics of network availability failures? What design principles can achieve high availability? Re-think the management plane; avoid and mitigate large failures; evolve or die.

  25. Re-think the Management Plane

  26. Availability Principle: Minimize Operator Intervention. Operator types wrong CLI command or runs wrong script; a backbone router fails.

  27. Availability Principle: Necessary for upgrade-in-place. To upgrade part of a large device ❖ a line card, a block of a Clos fabric ❖ proceed while the rest of the device carries traffic ❖ enables higher availability.

  28. Availability Principle: Assess risk continuously. Risky! Ensure residual capacity > demand. Early risk assessments were manual ➔ high packet loss.
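
A minimal sketch of the check that “assess risk continuously” implies: before and during an upgrade-in-place, recompute residual capacity and compare it against demand. The class, numbers, and headroom factor are illustrative assumptions, not Google’s actual tooling.

```python
# Illustrative continuous risk assessment for an upgrade-in-place:
# keep checking that the capacity left while part of the device is
# drained still exceeds the traffic it must carry.
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    capacity_gbps: float
    draining: bool = False

def residual_capacity(blocks: list[Block]) -> float:
    return sum(b.capacity_gbps for b in blocks if not b.draining)

def safe_to_proceed(blocks: list[Block], demand_gbps: float, headroom: float = 1.2) -> bool:
    # Require residual capacity to exceed demand with some headroom,
    # since demand can change while the operation is in progress.
    return residual_capacity(blocks) >= demand_gbps * headroom

fabric = [Block("block-1", 640), Block("block-2", 640), Block("block-3", 640)]
fabric[0].draining = True                        # upgrading block-1 in place
print(safe_to_proceed(fabric, demand_gbps=900))  # True: 1280 >= 1080
```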

  29. Re-think the Management Plane: From the operator’s “intent” (“I want to upgrade this router”), management plane software generates management operations, device configurations, and tests to verify the operation.

  30. Re-think the Management Plane: A management plane run-time takes the generated management operations, device configurations, and verification tests; it applies the configuration, performs the management operation, and verifies the operation, with automated risk assessment throughout. Minimize operator intervention; assess risk continuously.
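
One way to read that slide as code: a hypothetical run-time that executes the apply/perform/verify loop with a risk check before each step. The function names and steps are assumptions used for illustration, not Google’s management plane.

```python
# Hypothetical management-plane run-time: the operator supplies intent,
# software generates the steps, and the run-time performs and verifies
# each step, assessing risk before every action.
def run_operation(steps, assess_risk, verify):
    for step in steps:
        if not assess_risk(step):
            raise RuntimeError(f"aborting before '{step}': risk too high")
        print(f"performing: {step}")
        if not verify(step):
            raise RuntimeError(f"verification failed after '{step}', rolling back")

steps = ["drain router", "push new config", "upgrade software", "undrain router"]
run_operation(
    steps,
    assess_risk=lambda s: True,   # e.g. a residual-capacity check, as sketched above
    verify=lambda s: True,        # e.g. reachability and protocol-session checks
)
```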

  31. Avoid and Mitigate Large Failures

  32. Availability Principles: Contain failure radius; fail open. B4 and data centers have a dedicated control-plane network ❖ failure of this can bring down the entire control plane.

  33. Fail Open: When the centralized control plane fails, preserve the forwarding state of all switches ❖ fail-open the entire data center, so data center traffic keeps flowing. Exceedingly tricky!
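
A sketch of the fail-open idea from a switch’s point of view: on losing the controller, keep the last programmed forwarding state rather than clearing it. The class and method names are illustrative assumptions.

```python
# Illustrative fail-open behavior on a switch: losing the controller
# should not erase forwarding state, or the whole data center goes dark.
class SwitchAgent:
    def __init__(self):
        self.forwarding_table = {}   # prefix -> next hop, programmed by controller
        self.controller_alive = True

    def on_controller_update(self, table):
        self.forwarding_table = table

    def on_controller_timeout(self):
        self.controller_alive = False
        # Fail open: keep forwarding with the last known-good state
        # instead of flushing the table ("fail closed").
        print("controller lost; preserving forwarding state:", self.forwarding_table)

sw = SwitchAgent()
sw.on_controller_update({"10.0.0.0/8": "port-1"})
sw.on_controller_timeout()
```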

  34. Availability Principle: Design fallback strategies. A bug can cause state inconsistency between control plane components ➔ capacity reduction in WAN or data center.

  35. Design Fallback Strategies: A large section of the B4 WAN fails, so demand exceeds capacity.

  36. Design Fallback Strategies: Fall back to B2! Large traffic volumes from many data centers can be shifted from B4 to B2.

  37. Design Fallback Strategies: When centralized traffic engineering fails… ❖ … fall back to IP routing. Big Red Buttons ❖ for every new software upgrade, design controls so the operator can initiate fallback to a “safe” version.
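
A “big red button” can be as simple as an operator-settable control that forces the system onto its known-safe path. The sketch below shows a traffic-engineering client falling back to plain IP routing when the button is set or the TE service is unhealthy; the environment variable and function names are hypothetical.

```python
# Hypothetical "big red button": an operator-settable control that forces
# fallback from centralized traffic engineering to plain IP routing.
import os

def choose_routes(te_server_healthy: bool) -> str:
    big_red_button = os.environ.get("DISABLE_CENTRAL_TE") == "1"
    if big_red_button or not te_server_healthy:
        # Fallback path: ignore TE paths and rely on IP routing (e.g. IS-IS/BGP).
        return "ip-routing"
    return "te-paths"

print(choose_routes(te_server_healthy=True))    # "te-paths"
os.environ["DISABLE_CENTRAL_TE"] = "1"          # operator hits the big red button
print(choose_routes(te_server_healthy=True))    # "ip-routing"
```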

  38. Evolve or Die!

  39. We cannot treat a change to the network as an exceptional event.

  40. Evolve or Die: Make change the common case. Make it easy and safe to evolve the network daily ❖ forces management automation ❖ permits small, verifiable changes.

  41. Conclusion: Content provider networks evolve rapidly. The way we manage evolution can impact availability. We must make it easy and safe to evolve the network daily.

  42. Evolve or Die: High-Availability Design Principles Drawn from Google’s Network Infrastructure. Presentation template from SlidesCarnival.

  43. Older Slides

  44. Popular root-cause categories: Cabling error, interface card failure, cable cut…

  45. Popular root-cause categories: Operator types wrong CLI command, runs wrong script.

  46. Popular root-cause categories: Incorrect demand or capacity estimation for upgrade-in-place.

  47. Upgrade in place

  48. Assessing Risk Correctly: Residual capacity? Varies by interconnect. Demand? Can change dynamically.

  49. Popular root-cause categories: Hardware or link layer failures in the control plane network.

  50. Popular root-cause categories: Two control plane components have inconsistent views of control plane state, caused by a bug.

  51. Popular root-cause categories: Running out of memory, CPU, OS resources (threads)…

  52. Lessons from Failures: The role of evolution in failures ▸ rethink the management plane. The prevalence of large, severe failures ▸ prevent and mitigate large failures. Long failure durations ▸ recover fast.

  53. High-level Management Plane Abstractions: “Intent”: I want to upgrade this router. Why is this difficult? Modern high-capacity routers ❖ carry Tb/s of traffic ❖ have hundreds of interfaces ❖ interface with associated optical equipment ❖ run a variety of control plane protocols (MPLS, IS-IS, BGP), all of which have network-wide impact ❖ have high-capacity fabrics with complicated dynamics ❖ have configuration files which run into hundreds of thousands of lines.

  54. High-level Management Plane Abstractions: From the operator’s “intent” (“I want to upgrade this router”), management plane software generates management operations, device configurations, and tests to verify the operation.

  55. Management Plane Automation: Management plane software takes the management operations, device configurations, and verification tests; it applies the configuration, performs the management operation, and verifies the operation. Minimize operator intervention; assess risk continuously.

  56. Large Control Plane Failures: [Diagram: a single centralized control plane]

  57. Contain the blast radius: partition into multiple centralized control planes. Smaller failure impact, but increased complexity.

  58. Fail-Open: When the centralized control plane fails, preserve the forwarding state of all switches ❖ fail-open the entire fabric.

  59. Defensive Control-Plane Design: “One piece of this large update seems wrong!!” [Components: Topology Modeler, TE Server, BwE, Gateway]

  60. Trust but Verify: “Let me check the correctness of the update…” [Components: Topology Modeler, TE Server, BwE, Gateway]
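
“Trust but verify” amounts to sanity-checking an update from an upstream component before acting on it. Below is a small illustrative check that a TE server might run on a topology update; the field names and threshold are made-up assumptions, not the actual B4 components.

```python
# Illustrative "trust but verify": before applying a large update from an
# upstream component, check that it is internally plausible.
def verify_topology_update(current_links: dict, update: dict,
                           max_fraction_removed: float = 0.5) -> bool:
    removed = set(current_links) - set(update)
    # A single update claiming most of the network disappeared is more
    # likely an upstream bug than reality: reject it and keep the old state.
    if current_links and len(removed) / len(current_links) > max_fraction_removed:
        return False
    return all(capacity > 0 for capacity in update.values())

current = {"a-b": 100, "b-c": 100, "c-d": 100, "d-a": 100}
bad_update = {"a-b": 100}                            # claims 3 of 4 links vanished
print(verify_topology_update(current, bad_update))   # False: do not apply
```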
