B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google's Software-Defined WAN
Chi-yao Hong (“Chi”), Subhasree Mandal, Mohammad Al-Fares, Min Zhu, Richard Alimi, Kondapa Naidu B., Chandan Bhagat, Sourabh Jain, Jay Kaimal, Shiyu Liang, Kirill Mendelev, Steve Padgett, Faro Rabe, Saikat Ray, Malveeka Tewari, Matt Tierney, Monika Zahn, Jonathan Zolla, Joon Ong, Amin Vahdat. On behalf of many others in Google Network Infrastructure and Network SREs.
[Timeline figure, 2011-2018: the first-generation B4 network, a "copy network" built on Saturn (99% availability), evolves through J-POP (99.9% availability) toward Stargate, a highly available, massive-scale network (99.99% availability), carrying >100x more traffic.]
Previous B4 paper published in SIGCOMM 2013
Background: B4 with SDN Traffic Engineering (TE)
Deployed in 2012. Site-level tunnels (tunnels & tunnel splits).
[Architecture diagram: a central TE controller computes tunnels and tunnel splits over a 12-site topology from a demand matrix (via Google BwE), and programs them through per-site domain controllers.]
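To make "tunnels & tunnel splits" concrete, here is a minimal sketch of the TE output the central controller hands to per-site domain controllers. This is my illustration; the names and the split representation are assumptions, not B4's actual data model.

```python
# Minimal sketch (assumed names/types, not B4's real data model): the
# central TE controller's output for one site-level flow is a set of
# site-level tunnels (paths) plus a split ratio across them.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tunnel:
    sites: tuple[str, ...]        # site-level path, e.g. ("A", "C", "B")

@dataclass
class TunnelGroup:
    src: str
    dst: str
    splits: dict                  # Tunnel -> fraction of demand (sums to 1)

# Example: 75% of A->B demand on the direct path, 25% detoured via site C.
te_output = TunnelGroup(
    src="A", dst="B",
    splits={Tunnel(("A", "B")): 0.75,
            Tunnel(("A", "C", "B")): 0.25},
)
```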
Background: B4 with SDN Traffic Engineering (TE)
Deployed in 2012. Key Takeaways:
❏ High efficiency: lower per-byte cost compared with B2 (Google's global backbone running RSVP TE on vendor gear)
❏ Deterministic convergence: fast, global TE optimization and failure handling
❏ Rapid software iteration: ~1 month for developing and deploying a median-size software feature
But it also comes with new challenges
Grand Challenge #1: High Availability Requirements
B4 initially had 99% availability in 2013.
Service Class | Availability SLO | Application Examples
SC4 | 99.99% | Search ads, DNS, WWW
SC3 | 99.95% | Photo service backend, Email
SC2 | 99.9% | Ads database replication
SC1 | 99% | Search index copies, logs
SC0 | N/A | Bulk transfer
Very demanding goal, given:
● inherent unreliability of long-haul links
● necessary management operations
Grand Challenge #2: Scale Requirements
Our bandwidth requirement doubled every ~9 months
traffic increased by >100x in 5 years
Grand Challenge #2: Scale Requirements
Scale increased across dimensions:
● #Cluster prefixes: 8x
● #B4 sites: 3x
● #Control domains: 16x
● #Tunnels: 60x
Other challenges: no disruption to existing traffic; maintain high cost efficiency and high feature velocity
To meet these demanding requirements, we’ve had to aggressively develop many point solutions
Lessons Learned:
1. Flat topology scales poorly and hurts availability
2. Solving the capacity asymmetry problem in hierarchical topology is key to achieving high availability at scale
3. Scalable switch forwarding rule management is essential to hierarchical TE
[Diagram: Saturn, the first-generation B4 site fabric: CF chassis facing the WAN (5.12/6.4 Tbps to other B4 sites) over BF chassis facing the clusters (5.12 Tbps); sites interconnected by the B4 WAN.]
Scaling option #1: add more chassis (up to 8 chassis per Saturn fabric)
Scaling option #2: build multiple B4 sites in close proximity. Drawbacks:
● Slower central TE controller
● Limited switch table space
● Complicated capacity planning and job allocation
Jumpgate: Two-layer Topology
[Diagram: a Jumpgate site delivers 80 Tbps toward WAN / clusters / sidelinks; each supernode is a two-stage Clos of 16 spine switches and 32 edge switches.]
Jumpgate: Two-layer Topology
● Supports horizontal scaling by adding more supernodes to a site
● Supports vertical scaling by upgrading a supernode in place to a new generation
● Improves availability with granular, per-supernode control domains
Lessons Learned:
1. Flat topology scales poorly and hurts availability
2. Solving the capacity asymmetry problem in hierarchical topology is key to achieving high availability at scale
3. Scalable switch forwarding rule management is essential to hierarchical TE
[Example: sites A, B, C connected by supernode-level links. The site-level abstraction advertises the sum of supernode-level link capacities, so each site-level link reports capacity 16.]
[Same example with degraded supernode-level links (capacities of 2 and 1): the abstraction still advertises the sum, 14, for the bottleneck site-level link, but uniform hashing across the asymmetric supernode-level links can actually use only 8. Abstract loss: 43% = (14-8)/14.]
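A minimal sketch of why the abstraction loses capacity, using the slide's numbers (my illustration, assuming eight supernode-level links and that site-level traffic is hashed uniformly across them):

```python
# Uniform hashing sends total/len(links) to every member link, so the
# weakest link saturates first and caps the whole site-level link.
def usable_under_uniform_split(link_caps):
    return len(link_caps) * min(link_caps)

caps = [2, 2, 1, 1, 2, 2, 2, 2]            # asymmetric supernode-level capacities
advertised = sum(caps)                      # abstraction advertises 14
usable = usable_under_uniform_split(caps)   # but only 8 is deliverable
loss = (advertised - usable) / advertised
print(advertised, usable, f"{loss:.0%}")    # 14 8 43%
```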
[CDF over site-level links and topology events of site-level link capacity loss due to topology abstraction, as a fraction of total capacity (log10 scale): 2% capacity loss at the median due to striping inefficiency, and 100% capacity loss in 18% of cases.]
Solution = Sidelinks + Supernode-level TE
[Same example with sidelinks added between supernodes of the same site: supernode-level TE rebalances traffic at each supernode, sending ● 57% toward the next site and ● 43% toward the self site (over sidelinks), recovering the capacity lost to asymmetry.]
Solution = Sidelinks + Supernode-level TE
But multi-layer TE (site-level & supernode-level) turns out to be challenging!
Design Proposals:
● Hierarchical tunneling (site-level tunnels + supernode-level sub-tunnels): two layers of IP encapsulation lead to inefficient hashing
● Supernode-level TE (supernode-level tunnels): scaling challenges, increases path allocation run time by 188x
Tunnel Split Group (TSG):
● Supernode-level traffic splits; no packet encapsulation; calculated per site-level link
● Maximizes admissible demand subject to fairness and link capacity constraints
[Example: Site A (4 supernodes) sends to Site B (2 supernodes); assume balanced ingress traffic of x per supernode, 4x total.]
Greedy Exhaustive Waterfill Algorithm: iteratively allocate each flow on its direct path (w/o sidelinks), or alternatively on its indirect paths (w/ sidelinks at the source site), until no flow can be allocated further.
● Provably forwarding-loop free
● Takes less than 1 second to run
● Low abstraction capacity loss
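A minimal sketch of the waterfill idea (my simplification, not the production algorithm): demand is allocated in small increments, preferring each supernode's direct link and spilling over to one-hop sidelink detours through sibling supernodes only when the direct link is full.

```python
def tsg_waterfill(direct_cap, sidelink_cap, demand, step=0.01):
    """direct_cap[i]: supernode i's link capacity toward the next site.
    sidelink_cap[(i, j)]: sidelink capacity from supernode i to sibling j.
    demand[i]: ingress demand at supernode i.
    Returns {path: rate}, where a path is (i,) direct or (i, j) via sidelink.
    Incremental round-robin allocation approximates max-min fairness."""
    direct_cap = dict(direct_cap)        # avoid mutating the caller's dicts
    sidelink_cap = dict(sidelink_cap)
    remaining = dict(demand)
    alloc = {}
    progress = True
    while progress:
        progress = False
        for i, rem in remaining.items():
            if rem <= 0:
                continue
            # Prefer the direct path (no sidelinks).
            take = min(step, rem, direct_cap[i])
            if take > 0:
                alloc[(i,)] = alloc.get((i,), 0.0) + take
                direct_cap[i] -= take
                remaining[i] -= take
                progress = True
                continue
            # Otherwise spill onto an indirect path: a sidelink at the
            # source site, then the sibling's direct link.
            for j in direct_cap:
                take = min(step, rem,
                           sidelink_cap.get((i, j), 0.0), direct_cap[j])
                if take > 0:
                    alloc[(i, j)] = alloc.get((i, j), 0.0) + take
                    sidelink_cap[(i, j)] -= take
                    direct_cap[j] -= take
                    remaining[i] -= take
                    progress = True
                    break
    return alloc
```

Because indirect paths use sidelinks only at the source site before heading downstream, traffic never doubles back, which is the intuition behind the loop-freedom property.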
[Same CDF with TSGs in place: site-level link capacity loss due to topology abstraction drops to <2%, where it previously reached 100% (log10 scale).]
TSG Sequencing Problem
[Diagram: current TSGs vs. target TSGs across supernodes A1, A2, B1, B2.]
Bad properties during update: forwarding loops and blackholes.
Dependency Graph based TSG Update
1. Map target TSGs to a supernode dependency graph
2. Apply TSG updates in reverse topological ordering*
* Shares ideas with work on IGP updates:
● Francois & Bonaventure, "Avoiding Transient Loops during IGP Convergence in IP Networks," INFOCOM '05
● Vanbever et al., "Seamless Network-wide IGP Migrations," SIGCOMM '11
Properties: loop-free with no extra blackholes; requires no packet tagging; one or two steps in >99.7% of TSG ops
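A minimal sketch of step 2 (assumed representation: each supernode maps to the set of supernodes its target TSG forwards to). Reprogramming supernodes in reverse topological order means a node switches to its target splits only after everything downstream of it already has, which rules out transient loops and blackholes:

```python
from graphlib import TopologicalSorter  # Python 3.9+

def tsg_update_order(target_next_hops):
    """target_next_hops[u]: supernodes that u forwards to under the target
    TSGs. Treating those downstream nodes as 'prerequisites' makes
    static_order() emit them first, i.e., reverse topological order.
    Raises graphlib.CycleError if the target TSGs are not loop-free."""
    return list(TopologicalSorter(target_next_hops).static_order())

# Toy example for the A1/A2/B1/B2 slide: A-side supernodes forward to
# B-side ones, so B1 and B2 are reprogrammed before A1 and A2.
order = tsg_update_order({"A1": {"B1"}, "A2": {"B1", "B2"},
                          "B1": set(), "B2": set()})
print(order)  # e.g. ['B1', 'B2', 'A1', 'A2']
```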
Lessons Learned:
1. Flat topology scales poorly and hurts availability
2. Solving the capacity asymmetry problem in hierarchical topology is key to achieving high availability at scale
3. Scalable switch forwarding rule management is essential to hierarchical TE
Multi-stage Hashing across Switches in a Clos Supernode
1. Ingress traffic at edge switches:
  a. Site-level tunnel split
  b. TSG site-level split (to self-site or next-site)
2. At spine switches:
  a. TSG supernode-level split
  b. Egress edge switch split
3. Egress traffic at edge switches:
  a. Egress port/trunk split
Enables hierarchical TE at scale: overall throughput improved by >6%.
[Diagram: a B4 site supernode with 16 spine and 32 edge switches.]
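A minimal sketch of the idea (hypothetical names; real switches realize this with ECMP/WCMP group tables, not Python): each stage resolves only its own weighted split with a stage-local hash, so no switch ever needs a rule per end-to-end path, which is what keeps per-switch table usage bounded.

```python
import hashlib

def weighted_pick(flow_key, stage, choices):
    """choices: list of (next_hop, weight). Hashing on (flow, stage) keeps
    a flow on one choice per stage while spreading flows by weight."""
    h = int(hashlib.md5(f"{flow_key}/{stage}".encode()).hexdigest(), 16)
    point = h % sum(w for _, w in choices)
    for hop, weight in choices:
        if point < weight:
            return hop
        point -= weight

def forward(flow_key, tunnel_splits, tsg_site_split, tsg_sn_split, ports):
    # 1a/1b: ingress edge switch picks a site-level tunnel, then the TSG's
    #        site-level split (self-site via sidelink, or next site).
    tunnel = weighted_pick(flow_key, "tunnel", tunnel_splits)
    site = weighted_pick(flow_key, "tsg-site", tsg_site_split[tunnel])
    # 2a: spine switch picks the destination supernode per the TSG.
    #     (2b, the egress-edge-switch split, is elided for brevity.)
    supernode = weighted_pick(flow_key, "tsg-sn", tsg_sn_split[site])
    # 3a: egress edge switch picks the physical port/trunk.
    return weighted_pick(flow_key, f"port-{supernode}", ports[supernode])
```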
[Summary timeline, 2011-2018: the flat-topology Saturn copy network with SDN TE tunneling and two service classes (99% availability) evolves through J-POP (99.9% availability) to Jumpgate's two-layer topology with TSGs, hierarchical TE, efficient switch rule management, and more service classes, toward Stargate: a highly available, massive-scale network (99.99% availability) carrying >100x more traffic.]
Conclusions
❏ A highly available WAN with plentiful bandwidth offers unique benefits to many cloud services (e.g., Spanner)
❏ Future work: limit the blast radius of rare yet catastrophic failures
❏ Reduce dependencies across components
❏ Network operation via per-QoS canary
B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google's Software-Defined WAN
Before | After
Copy network with 99% availability | Highly available network with 99.99% availability
Inter-DC WAN with moderate number of sites | 100x more traffic, 60x more tunnels
Saturn: flat site topology & per-site domain TE controller | Jumpgate: hierarchical topology & granular TE control domain
Site-level tunneling | Site-level tunneling in conjunction with supernode-level TE ("Tunnel Split Group")
Tunnel splits implemented at ingress switches | Multi-stage hashing across switches in Clos supernode
[Backup slide: switch pipeline within a supernode (16 spine / 32 edge switches) in a B4 site: ACL (flow match) → ECMP (port hashing) → Encap (+tunnel IP).]