Adapting TCP for Reconfigurable Datacenter Networks

  1. Adapting TCP for Reconfigurable Datacenter Networks. Matthew K. Mukerjee*†, Christopher Canel*, Weiyang Wang○, Daehyeok Kim*‡, Srinivasan Seshan*, Alex C. Snoeren○. *Carnegie Mellon University, ○UC San Diego, †Nefeli Networks, ‡Microsoft Research. February 26, 2020

  2. Reconfigurable Datacenter Network (RDCN): a packet network provides all-to-all connectivity between racks, while a circuit switch (e.g., 60 GHz wireless, free-space optics, or optical circuits) provides much higher bandwidth between certain racks at a time. [Diagram: the ToR switch of each rack connects to both a packet switch and a circuit switch; plot of available bandwidth over time.] The RDCN is a black box to hosts: flows are not segregated between the two networks. [Liu, NSDI ’14]

  3. 2010: RDCNs speed up DC workloads. Hybrid networks achieve higher performance on datacenter workloads. [Figure: performance of a packet network vs. a hybrid network (c-Through) vs. a full-bisection-bandwidth network.] [Wang, SIGCOMM ’10]

  4. Today’s RDCNs reconfigure 10x as often. Advances in circuit switch technology have led to a 10x reduction in reconfiguration delay, so today circuits can reconfigure much more frequently. [Figure: available bandwidth over time in 2010 (≈10 ms circuits) vs. today (≈180 µs circuits).] Better for datacenters: more flexibility to support dynamic workloads. Better for hosts: less data must be available to saturate the higher-bandwidth network. [Porter, SIGCOMM ’13]

  5. Short-lived circuits pose a problem for TCP. 16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s. No TCP variant makes use of the high-bandwidth circuits. [Bar chart: average circuit utilization (%) per TCP variant; every variant achieves only roughly 26-55% utilization.]

  6. TCP cannot ramp up during short circuits. [Figure: data sent over time, so achieved bandwidth (BW) = slope, across no-circuit/circuit/no-circuit phases with a 180 µs circuit; we expect 8x BW during the circuit, but in reality TCP achieves only ~1x BW.]

  7. What is the problem? All TCP variants are designed to adapt to changing network conditions, e.g., congestion, bottleneck links, RTT. But bandwidth fluctuations in modern RDCNs are an order of magnitude more frequent (10x shorter circuit durations) and more substantial (10x higher bandwidth) than TCP is designed to handle. RDCNs break the implicit assumption of relatively stable network conditions. This requires an order-of-magnitude shift in how fast TCP reacts.

  8. This talk: Our 2-part solution. In-network: use information about upcoming circuits to transparently “trick” TCP into ramping up more aggressively; high utilization, at the cost of tail latency. At endhosts: a new TCP variant, reTCP, that explicitly reacts to circuit state changes; mitigates the tail latency penalty. The two techniques can be deployed separately, but work best together.

  9. Naïve idea: Enlarge switch buffers. What we want: TCP’s congestion window (cwnd) to parallel the BW fluctuations. First attempt: make cwnd large all the time. How? Use large ToR buffers. [Figure: available bandwidth and cwnd over time, comparing the desired cwnd with the cwnd produced by large, static buffers.]

  10. Naïve idea: Enlarge switch buffers. [Diagram: sender-side and receiver-side ToR buffers connected by both a low-BDP packet switch and a high-BDP circuit switch.]

  11. Naïve idea: Enlarge switch buffers. Larger ToR buffers increase utilization of the high-BDP circuit network. [Same diagram, with more bandwidth achieved over the circuit switch.]

  12. Naïve idea: Enlarge switch buffers. [Same diagram, highlighting the latency cost that the larger buffers introduce.]

  13. Large queues increase utilization… [Bar chart: average circuit utilization (%) vs. static buffer size in packets: 4 → 21%, 8 → 31%, 16 → 49%, 32 → 77%, 64 → 100%, 128 → 100%.] 16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s.

  14. …but result in high latency. [Plots: median and 99th-percentile latency (µs) vs. average circuit utilization (%) for static buffers of varying size.] How can we improve this latency? 16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s.

  15. Use large buffers only when the circuit is up. Dynamic buffer resizing: before a circuit begins, transparently enlarge the ToR buffers. This gives full circuit utilization, with a latency degradation only during the ramp-up period. [Figure: available bandwidth and cwnd over time, comparing the desired cwnd, large static buffers, and dynamic buffers; the buffers are resized shortly before the circuit starts.] A sketch of the resizing policy follows below.
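
The resizing policy is small enough to sketch. Below is a minimal sketch, assuming a controller that knows the circuit schedule ahead of time; the buffer sizes (16 and 50 packets) and the idea of enlarging the buffer τ microseconds early come from the talk, while the class and function names are hypothetical.

```python
# Minimal sketch of dynamic ToR buffer resizing, assuming the switch/controller
# knows the circuit schedule in advance. Names are hypothetical.

SMALL_BUFFER_PKTS = 16   # default ToR queue capacity while the circuit is down
LARGE_BUFFER_PKTS = 50   # enlarged capacity used around and during a circuit
TAU_US = 1800            # how long before circuit start to enlarge the buffer

class ToRQueue:
    """An egress queue on a ToR switch whose capacity can be changed at runtime."""
    def __init__(self):
        self.capacity = SMALL_BUFFER_PKTS

    def resize(self, packets):
        self.capacity = packets

def apply_resizing(circuit_schedule, queues, now_us):
    """circuit_schedule: iterable of (src_rack, dst_rack, start_us, end_us)."""
    for src, dst, start_us, end_us in circuit_schedule:
        q = queues[(src, dst)]
        if start_us - TAU_US <= now_us < end_us:
            q.resize(LARGE_BUFFER_PKTS)   # prebuffer so TCP can grow cwnd early
        else:
            q.resize(SMALL_BUFFER_PKTS)   # keep queues short to protect latency
```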

  16. Resize ToR buffers before the circuit begins. [Animation frame: “Circuit coming!”; the ToR buffers are enlarged ahead of the circuit.]

  17. Resize ToR buffers before the circuit begins. [Animation frame: with the enlarged buffers, the sender’s cwnd grows and packets queue in the ToR buffer while the circuit is still down.]

  18. Resize ToR buffers before the circuit begins. [Animation frame: the circuit comes up and the queued packets begin draining over the high-bandwidth circuit.]

  19. Resize ToR buffers before the circuit begins. [Animation frame: the circuit remains up and cwnd has grown large enough to keep it utilized.]

  20. Configuring dynamic buffer resizing. How long in advance should ToR buffers resize (τ)? Long enough for TCP to grow cwnd to the circuit BDP. How large should ToR buffers grow? Circuit BDP = 80 Gb/s ⨉ 40 µs = 45 9000-byte packets. For our configuration, the ToR buffers must hold ~40 packets to achieve 90% utilization, which requires 1800 µs of prebuffering. We resize ToR buffers between sizes of 16 and 50 packets. The arithmetic is sketched below.
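
The sizing arithmetic can be checked directly. A short sketch follows, assuming the 40 µs term in the slide's calculation is the round-trip time used for the circuit BDP; the link rate, packet size, and buffer sizes are taken from the slide.

```python
import math

# Worked check of the buffer sizing on slide 20. The 80 Gb/s rate, 9000-byte
# packets, and 16/50-packet buffer sizes come from the slides; the 40 us term
# is assumed here to be the round-trip time used for the circuit BDP.

CIRCUIT_RATE_BPS = 80e9      # circuit network bandwidth
RTT_S = 40e-6                # assumed RTT used in the BDP calculation
PACKET_BYTES = 9000          # jumbo-frame packet size

bdp_bytes = CIRCUIT_RATE_BPS / 8 * RTT_S            # 400,000 bytes in flight
bdp_packets = math.ceil(bdp_bytes / PACKET_BYTES)   # ceil(44.4) = 45 packets

print(f"circuit BDP ≈ {bdp_packets} packets")        # matches the slide's 45
print("resize range: 16 -> 50 packets; ~1800 us of prebuffering for ~90% util.")
```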

  21. How long in advance to resize (τ)? [Figure: data sent over time (achieved bandwidth = slope) and ToR buffer size over time around a 180 µs circuit, for several resize times; the resulting circuit utilizations are 98%, 91%, 65%, and 49%.] Resizing presents a utilization/latency trade-off: too early causes extra queuing, too late causes low utilization. 16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s; small buffers: 16 packets; large buffers: 50 packets.

  22. 1800 µs of prebuffering yields 91% utilization. [Bar chart: average circuit utilization (%) vs. resize time from 0 to 3000 µs; utilization rises from 49% with no prebuffering to 98% at 3000 µs, reaching 91% at 1800 µs.] 16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s; small buffers: 16 packets; large buffers: 50 packets.

  23. Latency degradation during ramp-up. [Plots: median and 99th-percentile latency (µs) vs. average circuit utilization (%) for static buffers (varying size) and dynamic buffers (varying τ); the 99th-percentile plot is annotated with a 2.3x increase for dynamic buffers.] We cannot use large queues for so long. Can we get the same high utilization with shorter prebuffering? 16 flows from rack 1 to rack 2; packet network: 10 Gb/s; circuit network: 80 Gb/s; small buffers: 16 packets; large buffers: 50 packets.

  24. This talk: Our 2-part solution. In-network: use information about upcoming circuits to transparently “trick” TCP into ramping up more aggressively; high utilization, at the cost of tail latency. At endhosts: a new TCP variant, reTCP, that explicitly reacts to circuit state changes; mitigates the tail latency penalty. The two techniques can be deployed separately, but work best together.

  25. This talk: Our 2-part solution. In-network: use information about upcoming circuits to transparently “trick” TCP into ramping up more aggressively; high utilization, at the cost of tail latency. At endhosts: a new TCP variant, reTCP, that explicitly reacts to circuit state changes; mitigates the tail latency penalty. The two techniques can be deployed separately, but work best together.

  26. reTCP: Rapidly grow cwnd before a circuit. 1) Communicate circuit state to the sender’s TCP. 2) The sender’s TCP reacts by multiplicatively increasing/decreasing cwnd. [Figure: available bandwidth and cwnd over time, comparing the desired cwnd, large static buffers, dynamic buffers, and dynamic buffers + reTCP.]

  27. reTCP: Explicit circuit state feedback. Reuse the existing ECN-Echo (ECE) bit. [Animation frame: “Circuit coming!”; ACKs returning through the ToR still carry ECE = 0.]

  28. reTCP: Explicit circuit state feedback. Reuse the existing ECN-Echo (ECE) bit. [Animation frame: before the circuit starts, the ToR begins marking ACKs with ECE = 1; the sender sees the 0 → 1 transition and increases cwnd.]

  29. reTCP: Explicit circuit state feedback. Reuse the existing ECN-Echo (ECE) bit. [Animation frame: while the circuit is up, ACKs continue to carry ECE = 1.]

  30. reTCP: Explicit circuit state feedback. Reuse the existing ECN-Echo (ECE) bit. [Animation frame: as the circuit ends, the ToR marks ACKs with ECE = 0 again; the sender sees the 1 → 0 transition and decreases cwnd.] A sketch of this marking rule follows below.
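
Slides 27-30 imply a simple marking rule at the ToR. The sketch below assumes the switch sets the ECE bit in ACKs whenever the circuit for that rack pair is imminent or up; the exact flip times (how far ahead to set the bit, and when to clear it) are my reading of the animation, not stated values.

```python
# Sketch of the switch-side ECE marking implied by slides 27-30. The ToR sets
# the ECN-Echo (ECE) bit in ACKs flowing back to the sender while the circuit
# is imminent or active, and clears it otherwise. Timing details are assumed.

def mark_ack(ack, now_us, circuit_start_us, circuit_end_us, lead_us):
    """Set ack['ece'] = 1 while the circuit is imminent or active, else 0."""
    circuit_soon_or_up = (circuit_start_us - lead_us) <= now_us < circuit_end_us
    ack["ece"] = 1 if circuit_soon_or_up else 0
    return ack

# Example: 100 us before a circuit that runs from t = 1000 us to t = 1180 us
ack = mark_ack({}, now_us=900, circuit_start_us=1000, circuit_end_us=1180, lead_us=150)
assert ack["ece"] == 1   # sender will see a 0 -> 1 transition and ramp up
```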

  31. Single multiplicative increase/decrease. On 0 → 1 transitions: cwnd = cwnd ⨉ 𝛽. On 1 → 0 transitions: cwnd = cwnd / 𝛽. 𝛽 depends on the ratio of the circuit BDP to the ToR queue capacity: circuit network BDP: 45 packets; small ToR queue capacity: 16 packets. We use 𝛽 = 2. More advanced forms of feedback are possible. A sketch of the sender-side rule follows below.
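
The sender-side rule is small enough to sketch directly. The following is a minimal, illustrative version of slide 31's update, assuming transition detection happens per ACK; names are hypothetical, and the real variant would sit inside the kernel's TCP congestion-control code alongside normal loss and ECN handling.

```python
# Minimal sketch of reTCP's sender-side rule from slide 31: multiply cwnd by
# beta on an ECE 0 -> 1 transition and divide by beta on 1 -> 0, with beta = 2.

BETA = 2.0

class ReTCPSender:
    def __init__(self, init_cwnd_pkts=16):
        self.cwnd = float(init_cwnd_pkts)
        self.last_ece = 0

    def on_ack(self, ece_bit):
        if self.last_ece == 0 and ece_bit == 1:
            self.cwnd *= BETA    # circuit coming: jump toward the circuit BDP
        elif self.last_ece == 1 and ece_bit == 0:
            self.cwnd /= BETA    # circuit ending: fall back toward the packet-network BDP
        self.last_ece = ece_bit
        # ordinary congestion avoidance keeps adjusting cwnd between transitions
```

Per the slide, 𝛽 is chosen from the ratio of the circuit BDP (45 packets) to the small ToR queue capacity (16 packets), and the evaluation uses 𝛽 = 2.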
