Adapting TCP for Reconfigurable Datacenter Networks
Matthew K. Mukerjee*†, Christopher Canel*, Weiyang Wang○, Daehyeok Kim*‡, Srinivasan Seshan*, Alex C. Snoeren○
*Carnegie Mellon University, ○UC San Diego, †Nefeli Networks, ‡Microsoft Research
February 26, 2020
Reconfigurable Datacenter Network (RDCN)
A packet switch provides all-to-all connectivity; a circuit switch provides higher-bandwidth connectivity between certain racks. Circuit technologies include 60 GHz wireless, free-space optics, and optical circuit switches.
[Figure: Rack 1 … Rack N, each with Server 1 … Server M, connect via ToR switches to both the packet network and the circuit switch; available circuit bandwidth varies over time. [Liu, NSDI '14]]
The RDCN is a black box: flows are not segregated between the two networks.
2010: RDCNs speed up DC workloads
Hybrid networks achieve higher performance on datacenter workloads.
[Figure: performance of a packet network vs. a hybrid network (c-Through) vs. a full bisection bandwidth network. [Wang, SIGCOMM '10]]
Today's RDCNs reconfigure 10x as often
Advances in circuit switch technology have led to a 10x reduction in reconfiguration delay ⇒ today, circuits can reconfigure much more frequently.
[Figure: available bandwidth over time; circuit durations of ~10 ms in 2010 vs. ~180 µs today. [Porter, SIGCOMM '13]]
Better for datacenters: more flexibility to support dynamic workloads.
Better for hosts: less data must be available to saturate the higher-bandwidth network.
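To make "less data must be available" concrete, a back-of-the-envelope sketch; the 80 Gb/s circuit rate is borrowed from the experiments later in this deck, not from this slide:

```python
# How much data must a host have queued to keep one circuit busy for
# its entire lifetime? Circuit durations are from this slide; the
# 80 Gb/s circuit rate is assumed from the later experiment slides.

CIRCUIT_GBPS = 80

def data_per_circuit_mb(circuit_seconds):
    """Megabytes needed to fully utilize one circuit of this duration."""
    bits = CIRCUIT_GBPS * 1e9 * circuit_seconds
    return bits / 8 / 1e6

print(data_per_circuit_mb(10e-3))   # 2010: 10 ms circuits  -> 100.0 MB
print(data_per_circuit_mb(180e-6))  # today: 180 us circuits -> 1.8 MB
```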
Short-lived circuits pose a problem for TCP
16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s.
No TCP variant makes use of the high-bandwidth circuits.
[Bar chart: average circuit utilization (%) for each TCP variant; every variant lands between 26% and 55%.]
TCP cannot ramp up during short circuits
[Sequence plot: achieved bandwidth (BW) = slope. What we expect: 8x BW during the 180 µs circuit. Reality: ~1x BW before, during, and after the circuit.]
What is the problem?
All TCP variants are designed to adapt to changing network conditions
• E.g., congestion, bottleneck links, RTT
But bandwidth fluctuations in modern RDCNs are an order of magnitude more frequent (10x shorter circuit duration) and more substantial (10x higher bandwidth) than TCP is designed to handle
• RDCNs break TCP's implicit assumption of relatively stable network conditions
This requires an order-of-magnitude shift in how fast TCP reacts
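A rough sanity check of the mismatch, assuming standard congestion avoidance (+1 segment per RTT); the 40 µs RTT, 9000-byte packets, and 45-packet circuit BDP come from the buffer-sizing slide later in this deck, while the ~6-segment packet-network cwnd is derived here, not stated:

```python
# Estimate how long TCP congestion avoidance needs to ramp cwnd from
# the packet-network BDP to the circuit BDP. All figures are either
# from later slides or derived assumptions, as noted.

RTT_US = 40        # assumed RTT (used in the later BDP arithmetic)
PACKET_CWND = 6    # ~packet-network BDP: 10 Gb/s * 40 us / 9000 B
CIRCUIT_CWND = 45  # circuit BDP: 80 Gb/s * 40 us / 9000 B

rtts_needed = CIRCUIT_CWND - PACKET_CWND  # +1 segment per RTT
ramp_us = rtts_needed * RTT_US
print(f"~{ramp_us} us to ramp up vs. a 180 us circuit")  # ~1560 us
```

On these assumptions, ramping up takes roughly 1.6 ms, nearly 10x longer than the circuit itself lasts.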
This talk: Our 2-part solution
In-network: use information about upcoming circuits to transparently "trick" TCP into ramping up more aggressively
• High utilization, at the cost of tail latency
At endhosts: a new TCP variant, reTCP, that explicitly reacts to circuit state changes
• Mitigates the tail latency penalty
The two techniques can be deployed separately, but work best together
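A minimal sketch of the reTCP idea as stated on this slide: react explicitly to circuit state changes by scaling cwnd. The hook names and the scale factor here are illustrative assumptions, not the actual implementation:

```python
# Sketch: a congestion-control module that jumps cwnd multiplicatively
# when told the circuit state changed, instead of waiting many RTTs of
# additive increase. The signaling mechanism is left abstract.

class ReTCPSketch:
    def __init__(self, cwnd, scale=2.0):
        self.cwnd = cwnd
        self.scale = scale  # hypothetical per-flow scale factor

    def on_circuit_up(self):
        # Circuit bandwidth just appeared: ramp up immediately.
        self.cwnd *= self.scale

    def on_circuit_down(self):
        # Circuit gone: shrink back so the small packet-network
        # buffers are not flooded (this is what mitigates tail latency).
        self.cwnd /= self.scale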
Naïve idea: Enlarge switch buffers
What we want: TCP's congestion window (cwnd) to parallel the bandwidth fluctuations
First attempt: make cwnd large all the time. How? Use large ToR buffers
[Figure: available bandwidth over time; desired cwnd vs. the cwnd produced by large, static buffers.]
Naïve idea: Enlarge switch buffers
[Diagram: Sender → ToR buffer → packet switch (low BDP) or circuit switch (high BDP) → ToR buffer → Receiver.]
Larger ToR buffers increase utilization of the high-BDP circuit network, but at the cost of latency.
Large queues increase utilization…
Static buffer size (packets):     4  |  8  | 16  | 32  |  64 | 128
Avg. circuit utilization (%):    21  | 31  | 49  | 77  | 100 | 100
16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s.
…but result in high latency
[Two plots, sweeping static buffer size: median latency (µs) and 99th-percentile latency (µs) vs. average circuit utilization (%); higher utilization comes with sharply higher median and tail latency.]
How can we improve this latency?
16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s.
Use large buffers only when circuit is up
Dynamic buffer resizing: before a circuit begins, transparently enlarge ToR buffers
Full circuit utilization, with a latency degradation only during the ramp-up period
[Figure: available bandwidth over time; desired cwnd vs. cwnd with large, static buffers vs. cwnd with dynamic buffers, which resize just before the circuit.]
Resize ToR buffers before circuit begins
[Animation: the network signals "Circuit coming!" in advance; the ToR buffer grows, letting the sender's cwnd ramp up before the circuit arrives, so TCP can immediately use the circuit's full bandwidth; after the circuit ends, the buffer shrinks back.]
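One way the in-network mechanism could look, as a hypothetical controller loop; the `tor.set_buffer_size()` API and schedule format are invented for illustration, while the 16/50-packet sizes and the advance-resize idea come from the next slide:

```python
# Sketch of the in-network half: enlarge ToR buffers tau seconds before
# each scheduled circuit, shrink them back at teardown.

import time

SMALL_PKTS, LARGE_PKTS = 16, 50

def run_resizer(tor, schedule, tau_s):
    """tor: hypothetical switch handle; schedule: [(start_s, end_s), ...]
    with times on the time.monotonic() clock; tau_s: advance notice."""
    for start, end in schedule:
        time.sleep(max(0, start - tau_s - time.monotonic()))
        tor.set_buffer_size(LARGE_PKTS)  # prebuffer: let cwnd ramp early
        time.sleep(max(0, end - time.monotonic()))
        tor.set_buffer_size(SMALL_PKTS)  # circuit gone: cap latency again
```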
Configuring dynamic buffer resizing
How long in advance should ToR buffers resize (𝝊)?
• Long enough for TCP to grow cwnd to the circuit BDP
How large should ToR buffers grow?
• Circuit BDP = 80 Gb/s ⨉ 40 µs = 400 KB ≈ 45 9000-byte packets
For our configuration, the ToR buffers must hold ~40 packets to achieve 90% utilization, which requires 1800 µs of prebuffering
We resize ToR buffers between sizes of 16 and 50 packets
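The sizing arithmetic from this slide, spelled out; treating the 40 µs figure as the RTT is an assumption:

```python
# Circuit BDP in packets, using the numbers stated on this slide.

CIRCUIT_BPS = 80e9   # circuit bandwidth
RTT_S = 40e-6        # 40 us, assumed to be the RTT
MTU_BYTES = 9000

bdp_bytes = CIRCUIT_BPS * RTT_S / 8     # 400,000 bytes (400 KB)
bdp_packets = bdp_bytes / MTU_BYTES     # ~44.4 -> ~45 packets
print(bdp_bytes, bdp_packets)

# Note: 1800 us of prebuffering / 40 us RTT = 45 RTTs, consistent with
# cwnd growing by ~1 segment per RTT up to the circuit BDP.
print(1800 / 40)
```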
How long in advance to resize, 𝝊?
[Sequence plots: ToR buffer size steps from 16 to 50 packets 𝝊 before the 180 µs circuit; achieved bandwidth (BW) = slope. With enough advance notice, the slope reaches 8x BW during the circuit instead of 1x BW; with no prebuffering, utilization stays at 49%.]
16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s; small buffers: 16 packets; large buffers: 50 packets.