Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure
Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley, and Amin Vahdat … and a cast of hundreds at Google
Network availability is the biggest challenge facing large content and cloud providers today
Why? The push towards higher 9s of availability
At four 9s availability ❖ Outage budget is 4 minutes per month
At five 9s availability ❖ Outage budget is 24 seconds per month
How do providers achieve these levels? By learning from failures
Paper’s What has Google Learnt from Focus Failures? Why is high What are the What design network characteristics principles can availability a of network achieve high challenge? availability availability? failures? 5
Why is high network availability a challenge?
❖ Velocity of Evolution
❖ Scale
❖ Management Complexity
Evolution: Network hardware evolves continuously
[Figure: data center fabric capacity over time, across generations: 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter]
Evolution: So does network software
[Figure: software timeline, 2006 to 2014: Google Global Cache, Watchtower, B4, BwE, Andromeda, Freedome, Jupiter, gRPC, QUIC]
Evolution: New hardware and software can
❖ Introduce bugs
❖ Disrupt existing software
Result: Failures!
Scale and Complexity
[Figure: Google's networks: data centers, the B4 inter-data-center WAN, and the B2 backbone connecting to other ISPs]
Scale and Complexity: Design Differences
B4 and Data Centers ❖ Use merchant silicon chips ❖ Centralized control planes
B2 ❖ Vendor gear ❖ Decentralized control plane
Scale and Complexity: Design Differences
These differences increase management complexity and pose availability challenges
The Management Plane: Management plane software manages network evolution
Management Plane Operations
❖ Connect a new data center to B2 and B4
❖ Upgrade B4 or data center control plane software
❖ Drain or undrain links, switches, routers, services (drain: temporarily remove from service)
Many operations require multiple steps and can take hours or days
The Management Plane: Low-level abstractions for management operations
❖ Command-line interfaces to high-capacity routers
A small mistake by an operator can impact a large part of the network
Why is high network availability a challenge?
What are the characteristics of network availability failures?
❖ Duration, severity, prevalence
❖ Root-cause categorization
Key Takeaway
Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily
We analyzed over 100 post-mortem reports written over a two-year period
What is a Post-mortem?
A blame-free process: a carefully curated description of a previously unseen failure that had significant availability impact
Helps learn from failures
What a Post-Mortem Contains
❖ Description of failure, with detailed timeline
❖ Root-cause(s) confirmed by reproducing the failure
❖ Discussion of fixes, follow-up action items
Failure Examples and Impact
Examples ❖ Entire control plane fails ❖ Upgrade causes backbone traffic shift ❖ Multiple top-of-rack switches fail
Impact ❖ Data center goes offline ❖ WAN capacity falls below demand ❖ Several services fail concurrently
Key Quantitative Results
❖ 70% of failures occur when a management plane operation is in progress → Evolution impacts availability
❖ Failures are everywhere: all three networks and all three planes see comparable failure rates → No silver bullet
❖ 80% of failure durations are between 10 and 100 minutes → Need fast recovery
Root Causes: Lessons learned from root causes motivate availability design principles
Why is high network availability a challenge?
What are the characteristics of network availability failures?
What design principles can achieve high availability?
❖ Re-think the Management Plane
❖ Avoid and Mitigate Large Failures
❖ Evolve or Die
Re-think the Management Plane
Availability Principle: Minimize Operator Intervention
Example root cause: an operator types the wrong CLI command or runs the wrong script, and a backbone router fails
Availability Principle: Necessary for upgrade-in-place
To upgrade part of a large device (a line card, a block of a Clos fabric)… proceed while the rest of the device carries traffic
❖ Enables higher availability
Availability Principle: Assess Risk Continuously
Upgrade-in-place is risky! Must ensure residual capacity > demand
Early risk assessments were manual → high packet loss
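As a loose illustration (not Google's actual tooling), a continuous risk check before a drain could compare residual capacity against peak demand; the Link structure, capacities, and 1.2x headroom factor below are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Link:
    name: str
    capacity_gbps: float
    drained: bool = False

def residual_capacity(links: list[Link]) -> float:
    """Capacity still in service (drained links excluded)."""
    return sum(l.capacity_gbps for l in links if not l.drained)

def safe_to_drain(links: list[Link], to_drain: Link,
                  peak_demand_gbps: float, headroom: float = 1.2) -> bool:
    """Permit the drain only if remaining capacity still exceeds peak
    demand with headroom; rerun continuously, since demand shifts."""
    remaining = residual_capacity(links) - to_drain.capacity_gbps
    return remaining >= peak_demand_gbps * headroom

# Hypothetical interconnect: three 400 Gb/s links, 700 Gb/s peak demand.
links = [Link("dc1-b4-1", 400.0), Link("dc1-b4-2", 400.0), Link("dc1-b4-3", 400.0)]
print(safe_to_drain(links, links[0], peak_demand_gbps=700.0))  # False: 800 < 840
```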
Re-think the Management Plane
"Intent": I want to upgrade this router
→ Management plane software generates management operations, device configurations, and tests to verify the operation
Re-think the Management Plane
The management plane run-time consumes the generated operations, device configurations, and verification tests: it applies the configuration, performs the management operation, and verifies the operation, with automated risk assessment throughout
→ Minimize Operator Intervention ❖ Assess Risk Continuously
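A minimal sketch of the apply → perform → verify loop this slide describes, with an automated risk gate and rollback; the Device class and helper names are illustrative assumptions, not the management plane's real API:

```python
class Device:
    """Toy stand-in for a managed router or switch."""
    def __init__(self, name):
        self.name = name
        self.config = "v1"

    def apply_config(self, config):
        self.config = config

def risk_acceptable(device):
    # Placeholder gate: in practice, compare residual capacity to demand.
    return True

def config_is_active(device):
    return device.config == "v2"

def run_operation(device, new_config, old_config, tests):
    if not risk_acceptable(device):
        raise RuntimeError("risk too high; refusing to start")
    device.apply_config(new_config)          # 1. apply the generated config
    for test in tests:                       # 2. perform + 3. verify
        if not test(device):
            device.apply_config(old_config)  # automated rollback on failure
            raise RuntimeError(f"{test.__name__} failed; rolled back")

router = Device("br01.dc1")
run_operation(router, new_config="v2", old_config="v1", tests=[config_is_active])
print(router.config)  # "v2": operation applied and verified
```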
Avoid and Mitigate Large Failures
Availability Principles: Contain Failure Radius ❖ Fail Open
B4 and data centers have a dedicated control-plane network
❖ Failure of this network can bring down the entire control plane
Fail Open
[Figure: centralized control plane, data center, and traffic flow]
Preserve forwarding state of all switches ❖ Fail-open the entire data center
Exceedingly tricky!
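One way to picture fail-open, as a hedged sketch: a switch agent that keeps its last programmed forwarding state when controller heartbeats stop, rather than wiping it. The agent class, heartbeat timeout, and table layout below are all hypothetical:

```python
import time

class SwitchAgent:
    """Toy switch agent illustrating fail-open behavior."""
    HEARTBEAT_TIMEOUT_S = 3.0

    def __init__(self):
        self.forwarding_table = {}          # prefix -> next hop
        self.last_heartbeat = time.monotonic()

    def on_controller_update(self, table):
        # Normal operation: the centralized controller programs state.
        self.forwarding_table = table
        self.last_heartbeat = time.monotonic()

    def controller_reachable(self):
        return time.monotonic() - self.last_heartbeat < self.HEARTBEAT_TIMEOUT_S

    def next_hop(self, prefix):
        # Fail open: even when the controller is unreachable, keep forwarding
        # with the last-known-good state instead of dropping traffic.
        return self.forwarding_table.get(prefix)

agent = SwitchAgent()
agent.on_controller_update({"10.0.0.0/8": "port-1"})
# ... controller dies; next_hop still answers from preserved state ...
print(agent.next_hop("10.0.0.0/8"))  # "port-1"
```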
Availability Principle: Design Fallback Strategies
A bug can cause state inconsistency between control plane components ➔ capacity reduction in WAN or data center
Design Fallback Strategies
A large section of the B4 WAN fails, so demand exceeds capacity
Design Fallback Strategies
Can shift large traffic volumes from many data centers from B4 to B2 ❖ Fallback to B2!
Design Fallback Strategies
When centralized traffic engineering fails… ❖ … fall back to IP routing
Big Red Buttons ❖ For every new software upgrade, design controls so an operator can initiate fallback to a "safe" version (see the sketch below)
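A toy sketch of a big red button, assuming a hypothetical controller object and version strings; the point is that fallback is a single, pre-designed action rather than an improvised repair:

```python
class TEController:
    """Toy controller with a pre-designed fallback ('big red button')."""
    def __init__(self, safe_version, new_version):
        self.safe_version = safe_version
        self.active = new_version

    def big_red_button(self):
        # One well-tested action that reverts to the known-safe behavior,
        # e.g., from centralized TE back to plain IP routing.
        self.active = self.safe_version

te = TEController(safe_version="ip-routing", new_version="te-v2")
te.big_red_button()
print(te.active)  # "ip-routing"
```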
Evolve or Die!
We cannot treat a change to the network as an exceptional event
Evolve or Die: Make change the common case
Make it easy and safe to evolve the network daily
❖ Forces management automation
❖ Permits small, verifiable changes
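To make the "small, verifiable changes" point concrete, here is a hedged sketch of a staged rollout that canaries a change and halts at the first failed check; the stage names and callbacks are illustrative:

```python
def staged_rollout(stages, apply_change, verify):
    """Apply a small change stage by stage, halting on the first failure."""
    for stage in stages:            # e.g., canary first, then wider pods
        apply_change(stage)
        if not verify(stage):       # small change -> cheap to verify
            raise RuntimeError(f"stage {stage} failed verification; halting")

staged_rollout(
    stages=["canary", "pod-1", "pod-2"],
    apply_change=lambda s: print(f"applied change to {s}"),
    verify=lambda s: True,          # stand-in for real post-change checks
)
```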
Conclusion
Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily
Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure
Presentation template from SlidesCarnival
Older Slides
Popular root-cause categories: Cabling error, interface card failure, cable cut…
Popular root-cause categories: Operator types wrong CLI command, runs wrong script
Popular root-cause categories: Incorrect demand or capacity estimation for upgrade-in-place
Upgrade in place
Assessing Risk Correctly
❖ Residual capacity? Varies by interconnect
❖ Demand? Can change dynamically
Popular root-cause categories: Hardware or link-layer failures in the control plane network
Popular root-cause categories: Two control plane components have inconsistent views of control plane state, caused by a bug
Popular root-cause categories: Running out of memory, CPU, or OS resources (threads)…
Lessons from Failures
❖ The role of evolution in failures ▸ Rethink the Management Plane
❖ The prevalence of large, severe failures ▸ Prevent and mitigate large failures
❖ Long failure durations ▸ Recover fast
High-level Management Plane Abstractions
"Intent": I want to upgrade this router
Why is this difficult? Modern high-capacity routers:
❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols (MPLS, IS-IS, BGP), all of which have network-wide impact
❖ Have high-capacity fabrics with complicated dynamics
❖ Have configuration files that run into hundreds of thousands of lines
High-level Management Plane Abstractions
"Intent": I want to upgrade this router → management plane software generates management operations, device configurations, and tests to verify the operation
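As a hypothetical illustration of this pipeline, an intent could be compiled into operations, configurations, and tests roughly like this (the Plan structure and all strings are invented for the sketch):

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    operations: list = field(default_factory=list)
    configs: dict = field(default_factory=dict)
    tests: list = field(default_factory=list)

def compile_intent(intent: dict) -> Plan:
    """Translate a high-level intent into low-level steps, configs, tests."""
    if intent["action"] == "upgrade_router":
        r = intent["router"]
        return Plan(
            operations=[f"drain {r}", f"upgrade {r}", f"undrain {r}"],
            configs={r: f"{r}.conf.new"},
            tests=[f"{r} forwards traffic", f"{r} BGP sessions up"],
        )
    raise ValueError(f"unknown intent: {intent['action']}")

plan = compile_intent({"action": "upgrade_router", "router": "br01.dc1"})
print(plan.operations)  # ['drain br01.dc1', 'upgrade br01.dc1', 'undrain br01.dc1']
```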
Management Plane Automation
Management plane software takes the operations, device configurations, and verification tests: apply configuration, perform management operation, verify operation
→ Minimize Operator Intervention ❖ Assess Risk Continuously
Large Control Plane Failures
[Figure: a single centralized control plane managing the entire fabric]
Contain the blast radius
[Figure: two centralized control planes, each managing part of the fabric]
Smaller failure impact, but increased complexity
Fail-Open
[Figure: centralized control plane]
Preserve forwarding state of all switches ❖ Fail-open the entire fabric
Defensive Control-Plane Design
"One piece of this large update seems wrong!!"
[Figure: Topology Modeler, TE Server, BwE, Gateway]
Trust but Verify
"Let me check the correctness of the update…"
[Figure: Topology Modeler, TE Server, BwE, Gateway]
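A hedged sketch of such a sanity check: before accepting a topology update, a component could reject changes that touch an implausibly large fraction of links. The 10% threshold and set-based model are assumptions for illustration, not the actual B4 verification logic:

```python
def verify_update(current_links: set, proposed_links: set,
                  max_change_fraction: float = 0.10) -> bool:
    """Reject updates that change an implausibly large share of links."""
    changed = len(current_links ^ proposed_links)   # symmetric difference
    return changed <= max_change_fraction * max(len(current_links), 1)

current = {f"link-{i}" for i in range(100)}
print(verify_update(current, current - {"link-0"}))  # True: one link removed
print(verify_update(current, set()))                 # False: everything vanished
```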