Faild

[Diagram: controller above a switch holding a FIB and an ARP table; hosts A, B, C below]

FIB                                  ARP table
Destination prefix   Next hop IP     IP address   MAC address
192.168.0.0/24       10.0.1.A        10.0.1.A     xx:xx:xx:xx:xx:a
192.168.0.0/24       10.0.2.A        10.0.2.A     xx:xx:xx:xx:xx:a
192.168.0.0/24       10.0.1.B        10.0.1.B     xx:xx:xx:xx:xx:c
192.168.0.0/24       10.0.2.B        10.0.2.B     xx:xx:xx:xx:xx:a
192.168.0.0/24       10.0.1.C        10.0.1.C     xx:xx:xx:xx:xx:c
192.168.0.0/24       10.0.2.C        10.0.2.C     xx:xx:xx:xx:xx:c

hosts send health status to controller
‣ on drain, update ARP entry
‣ balance virtual next hops across available servers
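The drain step on this slide can be sketched roughly as follows. This is a minimal illustration, not the controller described in the talk: the function name `rebalance_on_drain`, the dictionary representation of the ARP table, and the least-loaded tie-breaking rule are all assumptions. It only shows the idea that the FIB stays fixed while ARP entries of a drained host are spread over the remaining hosts.

```python
def rebalance_on_drain(arp, drained):
    """Return a copy of `arp` with the drained host's entries moved elsewhere.

    `arp` maps a virtual next-hop IP to the host currently serving it
    (a hypothetical stand-in for the MAC address in the ARP table).
    """
    remaining = sorted(set(arp.values()) - {drained})
    if not remaining:
        raise ValueError("cannot drain the last host")
    # Count how many virtual next hops each remaining host already serves.
    load = {host: 0 for host in remaining}
    for host in arp.values():
        if host in load:
            load[host] += 1
    updated = dict(arp)
    for ip in sorted(arp):
        if arp[ip] == drained:
            # Assumed policy: give the entry to the least-loaded host,
            # breaking ties by host name.
            target = min(remaining, key=lambda h: (load[h], h))
            updated[ip] = target
            load[target] += 1
    return updated


arp = {
    "10.0.1.A": "a", "10.0.2.A": "a",
    "10.0.1.B": "b", "10.0.2.B": "b",
    "10.0.1.C": "c", "10.0.2.C": "c",
}
after = rebalance_on_drain(arp, "b")
```

With two virtual next hops per host, draining B leaves A and C with three entries each, so the load stays balanced without touching the set of next hops the switch hashes over.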
isn’t this just consistent hashing?
yes, but we can extend the mechanism and avoid resets entirely
Faild

[Diagram: controller above a switch holding a FIB and an ARP table; hosts A, B, C below]

FIB                                  ARP table
Destination prefix   Next hop IP     IP address   MAC address        current host   previous host
192.168.0.0/24       10.0.1.A        10.0.1.A     xx:xx:xx:xx:a:a    a              a
192.168.0.0/24       10.0.2.A        10.0.2.A     xx:xx:xx:xx:a:a    a              a
192.168.0.0/24       10.0.1.B        10.0.1.B     xx:xx:xx:xx:b:c    c              b
192.168.0.0/24       10.0.2.B        10.0.2.B     xx:xx:xx:xx:b:a    a              b
192.168.0.0/24       10.0.1.C        10.0.1.C     xx:xx:xx:xx:c:c    c              c
192.168.0.0/24       10.0.2.C        10.0.2.C     xx:xx:xx:xx:c:c    c              c

embed mapping history in MAC address
‣ append previous target as part of MAC address
‣ still results in resets, but…
‣ …conveys necessary information down to the host
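One way to picture the encoding is a pair of helpers that pack and unpack the two host identifiers in the last two octets of the MAC address. This is a sketch under stated assumptions: the helper names, the one-byte host identifiers, and the `02:00:00:00` prefix are hypothetical, and the octet order (current target in the fifth octet, previous target in the sixth) is an assumption modelled on the xx:xx:xx:xx:c:b example used for host processing.

```python
from typing import Tuple

# Hypothetical locally administered prefix standing in for "xx:xx:xx:xx".
PREFIX = bytes([0x02, 0x00, 0x00, 0x00])


def encode_mac(current: int, previous: int) -> str:
    """Pack the current and previous target ids into the last two octets."""
    for value in (current, previous):
        if not 0 <= value <= 0xFF:
            raise ValueError("host id must fit in one octet")
    octets = PREFIX + bytes([current, previous])
    return ":".join(f"{b:02x}" for b in octets)


def decode_mac(mac: str) -> Tuple[int, int]:
    """Return (current, previous) from a MAC built by encode_mac."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected six octets")
    return octets[4], octets[5]


# Host ids 0x0c ("c") and 0x0b ("b") are illustrative values only.
mac = encode_mac(current=0x0C, previous=0x0B)
```

Because the history travels in the destination MAC, the receiving host can recover both targets from the frame itself without any per-flow state in the switch.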
Host processing

[Flowchart: packet arrives at host C; hosts A, B, C shown below]
Destination MAC address: xx:xx:xx:xx:c:b (current target c, previous target b)
‣ Match previous? (c != b)
‣ SYN packet?
‣ Local socket?
‣ outcomes: Process or Redirect
Host processing

[Flowchart: host C redirects the packet to host B, which runs the same checks; hosts A, B, C shown below]
Destination MAC address: xx:xx:xx:xx:b:b
‣ Match previous? (b == b)
‣ SYN packet?
‣ Local socket?
‣ outcomes: Process or Redirect
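The checks in the flowchart can be read as the small decision function below. The branch order is an interpretation of the slides rather than something they spell out: a packet is processed locally when the current and previous targets match, when it is a SYN, or when a local socket already exists, and is otherwise redirected to the previous target. The function name and the string host identifiers are hypothetical.

```python
from typing import Optional, Tuple


def handle_packet(current: str, previous: str, is_syn: bool,
                  has_local_socket: bool) -> Tuple[str, Optional[str]]:
    """Return ("process", None) or ("redirect", previous_target)."""
    # "Match previous?": no remapping is recorded for this next hop.
    if current == previous:
        return ("process", None)
    # "SYN packet?": new connections belong to the current target.
    if is_syn:
        return ("process", None)
    # "Local socket?": the connection was already established here.
    if has_local_socket:
        return ("process", None)
    # Otherwise the flow predates the remapping: detour to the old host.
    return ("redirect", previous)


# Mid-flow packet arriving at C with MAC ...:c:b and no socket on C.
first_hop = handle_packet("c", "b", is_syn=False, has_local_socket=False)
# After the redirect the packet carries ...:b:b and is handled on B.
second_hop = handle_packet("b", "b", is_syn=False, has_local_socket=False)
```

Under this reading only flows that were established before a drain take the detour, which is why the expected case leaves all the work to the switches.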
Host processing

[Figure: CDF of round trip time; x-axis: Round Trip Time [µs], 40 to 180; y-axis: Cumulative probability; series: Steady state, Draining; annotation: median difference 14 µs]

Low latency
‣ expected case: switches do all heavy lifting
‣ worst case: detour routing costs 14 µs

Negligible impact on CPU utilization
‣ impact only when refilling
‣ peak CPU utilization below 0.3%
Host processing

[Figure: estimated PDF of CPU utilization; x-axis: CPU utilization [%], 0.0 to 0.5; y-axis: Estimated PDF; series: Steady state, Drain, Refill]

Low latency
‣ expected case: switches do all heavy lifting
‣ worst case: detour routing costs 20 µs

Negligible impact on CPU utilization
‣ impact only when refilling (transient)
‣ peak CPU utilization below 0.3%
Timeline

2012 · 2014 · 2016 · 2018
‣ deployed globally
‣ 3x10^14 requests per day
we suspect it works
Assumption #1 hash buckets are equally loaded
Hashing

[Figure: requests per second (2k to 4k) over time (0 to 30 min)]

Implications for capacity planning
‣ you are bound by most loaded host in a cluster
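The capacity-planning point can be made concrete with a little arithmetic. The sketch below assumes each host saturates at the same request rate and that traffic splits in fixed proportions; the function name and the 4000 requests per second limit are illustrative values, not figures from the talk.

```python
def cluster_capacity(per_host_limit, load_shares):
    """Total request rate at which the most loaded host reaches its limit.

    `load_shares` holds the relative share of traffic each host receives.
    """
    if not load_shares or max(load_shares) <= 0:
        raise ValueError("need at least one host with positive load")
    return per_host_limit * sum(load_shares) / max(load_shares)


# Three hosts, each assumed to saturate at 4000 requests per second.
even = cluster_capacity(4000, [1, 1, 1])
skewed = cluster_capacity(4000, [4, 3, 3])
```

With an even split the cluster can absorb 12000 requests per second, but a 40/30/30 split caps it at 10000 because the busiest host saturates first.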
Uneven hashing

[Figure, two panels: normalized bucket load (0 to 1.8) against rank of nexthop (0 to 250)]

Inject synthetic, equally distributed traffic

Significant skew
‣ most loaded bucket 6 times more loaded than the least loaded

Behaviour can depend on number of nexthops
‣ some buckets received no traffic for specific number of configured nexthops
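For readers who want to reproduce this kind of plot from their own counters, the two quantities involved could be computed along these lines. This is a sketch, assuming "normalized bucket load" means each bucket's count divided by the mean count and that buckets are ranked from most to least loaded; the function names and sample counts are made up for illustration.

```python
def normalized_loads(counts):
    """Per-bucket load divided by the mean, ranked from most to least loaded."""
    mean = sum(counts) / len(counts)
    if mean == 0:
        raise ValueError("no traffic observed")
    return sorted((c / mean for c in counts), reverse=True)


def skew_ratio(counts):
    """Most loaded bucket relative to the least loaded one."""
    lowest = min(counts)
    # A bucket that received no traffic makes the ratio unbounded.
    return float("inf") if lowest == 0 else max(counts) / lowest


# Hypothetical per-nexthop packet counts.
sample = [6, 1, 2, 3]
ranked = normalized_loads(sample)
ratio = skew_ratio(sample)
```

A perfectly even hash would give a flat line at 1.0 and a ratio of 1; the sample above has a ratio of 6, and an empty bucket is reported as infinite skew.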
Assumption #2 switches hash identically
Hash polarization
Hash polarization

Vendors were told hash polarization was bad
‣ in many cases you can’t configure seed
‣ in one case, you can configure the seed, but vendor additionally uses boot order of linecards to add entropy
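Hash polarization itself is easy to demonstrate with a toy two-tier model, which may help explain why vendors add per-switch entropy. The sketch below is an assumption-laden illustration: keyed BLAKE2b from Python's `hashlib` stands in for a switch's hash function, the seeds and flow identifiers are synthetic, and real hardware hashes differ.

```python
import hashlib


def pick_link(flow: bytes, seed: int, n_links: int) -> int:
    """Choose an outgoing link from a keyed hash of the flow identifier."""
    digest = hashlib.blake2b(flow, key=seed.to_bytes(8, "big"),
                             digest_size=8).digest()
    return int.from_bytes(digest, "big") % n_links


def downstream_usage(flows, upstream_seed, downstream_seed, n_links=2):
    """Link usage at the downstream switch reached via upstream link 0."""
    usage = [0] * n_links
    for flow in flows:
        # Only flows the upstream switch sent to link 0 arrive here.
        if pick_link(flow, upstream_seed, n_links) == 0:
            usage[pick_link(flow, downstream_seed, n_links)] += 1
    return usage


# Synthetic flow identifiers (documentation address range, made-up ports).
flows = [f"198.51.100.{i % 250}:{1024 + i}".encode() for i in range(1000)]
same_seed = downstream_usage(flows, upstream_seed=1, downstream_seed=1)
different_seed = downstream_usage(flows, upstream_seed=1, downstream_seed=2)
```

When both tiers share a seed, every flow reaching the downstream switch hashes to the same link again and the other link stays idle; with a different seed the same flows spread over both links.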
Assumption #3 packets in a flow use same network path
Nope, things break

Fragmentation
‣ returning ICMP packets hash on outer header
‣ took draft to IETF in 2014

ECN
‣ some middleboxes hash on TOS field
‣ ended up turning ECN negotiation off, breaks anycast too
‣ still looking for vendor(s) behind this, affected multiple ISPs

SYN proxies
‣ recent trend in enterprise appliances
‣ route lookup after connection handoff results in new path
‣ one vendor fixed implementation
paper has lots more stuff
‣ SYN cookie handling
‣ ARP reconfiguration measurements
‣ evaluation of switch and host draining
‣ switch controller details
‣ host-side implementation quirks
‣ ECMP skew results
‣ switch memory
‣ real flow measurements
‣ vendors that don’t test their products
‣ …
NSDI the value is not in the implementation
NSDI the value is in the design