

  1. CompSci 514: Computer Networks Lecture 14 Datacenter Transport protocols II Xiaowei Yang

  2. Roadmap • Clos topology • Datacenter TCP • Re-architecting datacenter networks and stacks for low latency and high performance – Best Paper award, SIGCOMM’17

  3. Motivation for Clos topology • A Clos topology aims to achieve the performance of a crossbar switch • When the number of ports n is large, it is hard to build a single n×n crossbar switch

  4. Clos topology • A multi-stage switching network • A path from any input port to any output port • Each switch has a small number of ports
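
To make the port-count argument concrete, here is a minimal sketch (my own illustration, not from the lecture) that counts switches and hosts in a k-ary fat-tree, a common folded-Clos datacenter topology built entirely from small k-port switches.

```python
# A minimal sketch (not from the slides) of sizing a k-ary fat-tree, a folded-Clos
# topology: k pods, each with k/2 edge and k/2 aggregation switches, plus (k/2)^2
# core switches; every switch has only k ports.

def fat_tree_sizes(k: int) -> dict:
    assert k % 2 == 0, "k must be even"
    edge = agg = k * (k // 2)          # k pods, k/2 switches per layer per pod
    core = (k // 2) ** 2
    hosts = k * (k // 2) * (k // 2)    # each edge switch serves k/2 hosts
    return {"edge": edge, "agg": agg, "core": core,
            "switches": edge + agg + core, "hosts": hosts}

if __name__ == "__main__":
    # With 48-port switches: 27,648 hosts from 2,880 small switches,
    # instead of one (infeasible) 27,648-port crossbar.
    print(fat_tree_sizes(48))
```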

  5. Roadmap • Clos topology • Datacenter TCP • Re-architecting datacenter networks and stacks for low latency and high performance – Best Paper award, SIGCOMM’17

  6. Datacenter Impairments • Incast • Queue Buildup • Buffer Pressure

  7. Queue Buildup • Big flows build up queues. Ø Increased latency for short flows. • Measurements in a Bing cluster: Ø For 90% of packets: RTT < 1ms Ø For 10% of packets: 1ms < RTT < 15ms (Figure: Sender 1 and Sender 2 share a switch queue toward the Receiver.)

  8. Data Center Transport Requirements 1. High Burst Tolerance – Incast due to Partition/Aggregate is common. 2. Low Latency – Short flows, queries 3. High Throughput – Continuous data updates, large file transfers The challenge is to achieve these three together.

  9. Tension Between Requirements • High burst tolerance + high throughput vs. low latency: Ø Deep buffers: queuing delays increase latency. Ø Shallow buffers: bad for bursts & throughput. Ø Reduced RTOmin (SIGCOMM '09): doesn't help latency. Ø AQM – RED: average queue not fast enough for incast. • Objective: low queue occupancy & high throughput → DCTCP

  10. The DCTCP Algorithm

  11. Review: The TCP/ECN Control Loop • ECN = Explicit Congestion Notification • The switch sets a 1-bit ECN mark on packets when congested; the receiver echoes the mark back to the sender. (Figure: Sender 1 and Sender 2 send through a marking switch to the Receiver.)
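
For contrast with DCTCP below, here is a minimal sketch (illustration only, not from the slides) of the conventional TCP reaction to ECN: on seeing any ECN-Echo within a window, the sender halves its congestion window once, as it would for a loss, regardless of how many packets were marked.

```python
# A minimal sketch of the standard TCP/ECN reaction (one cut per RTT, size-blind).

def tcp_ecn_on_rtt_end(cwnd: float, any_packet_marked: bool) -> float:
    if any_packet_marked:
        return max(1.0, cwnd / 2)   # multiplicative decrease, same as for a loss
    return cwnd + 1.0               # otherwise additive increase

print(tcp_ecn_on_rtt_end(100.0, True))   # 50.0
print(tcp_ecn_on_rtt_end(100.0, False))  # 101.0
```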

  12. Small Queues & TCP Throughput: The Buffer Sizing Story • Bandwidth-delay product rule of thumb: – A single flow needs a buffer of C × RTT for 100% throughput. (Figure: cwnd sawtooth over buffer size B; throughput 100%.)

  13. Small Queues & TCP Throughput: The Buffer Sizing Story • Bandwidth-delay product rule of thumb: – A single flow needs a buffer of C × RTT for 100% throughput. • Appenzeller rule of thumb (SIGCOMM '04): – Large # of flows: C × RTT / √N is enough. (Figure: cwnd sawtooth over buffer size B; throughput 100%.)

  14. Small Queues & TCP Throughput: The Buffer Sizing Story • Bandwidth-delay product rule of thumb: – A single flow needs a buffer of C × RTT for 100% throughput. • Appenzeller rule of thumb (SIGCOMM '04): – Large # of flows: C × RTT / √N is enough. • Can't rely on stat-mux benefit in the DC. – Measurements show typically 1-2 big flows at each server, at most 4.

  15. Small Queues & TCP Throughput: The Buffer Sizing Story • Bandwidth-delay product rule of thumb: – A single flow needs a buffer of C × RTT for 100% throughput. • Appenzeller rule of thumb (SIGCOMM '04): – Large # of flows: C × RTT / √N is enough. • Can't rely on stat-mux benefit in the DC. – Measurements show typically 1-2 big flows at each server, at most 4. • Real rule of thumb: low variance in sending rate → small buffers suffice.
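
A quick numeric illustration of the two buffer-sizing rules above (my own numbers, not from the slides), assuming a 10 Gbps link and a 100 µs datacenter RTT; it also shows why the √N saving evaporates when each server carries only 1-2 big flows.

```python
# Buffer sizing under the BDP and Appenzeller rules (assumed link speed and RTT).
from math import sqrt

C = 10e9 / 8     # link capacity in bytes/sec (10 Gbps)
RTT = 100e-6     # round-trip time in seconds

bdp = C * RTT    # rule of thumb for one flow: B = C x RTT

def appenzeller(n: int) -> float:
    """Appenzeller rule: with N flows, B = C x RTT / sqrt(N) suffices."""
    return bdp / sqrt(n)

print(f"BDP rule (1 flow):      {bdp / 1e3:.0f} KB")                    # ~125 KB
print(f"Appenzeller, N = 10000: {appenzeller(10000) / 1e3:.2f} KB")     # ~1.3 KB
print(f"Appenzeller, N = 2:     {appenzeller(2) / 1e3:.0f} KB")         # ~88 KB: little stat-mux gain
```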

  16. Two Key Ideas 1. React in proportion to the extent of congestion, not its presence. ✓ Reduces variance in sending rates, lowering queuing requirements. Example – ECN marks 1 0 1 1 1 1 0 1 1 1: TCP cuts window by 50%, DCTCP cuts by 40%; ECN marks 0 0 0 0 0 0 0 0 0 1: TCP cuts window by 50%, DCTCP cuts by 5%. 2. Mark based on instantaneous queue length. ✓ Fast feedback to better deal with bursts.

  17. Data Center TCP Algorithm • Switch side: – Mark packets (set CE) when Queue Length > K; don't mark otherwise. • Sender side: – Maintain a running average of the fraction of packets marked (α). In each RTT: α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the last RTT. – Adaptive window decrease: W ← W·(1 − α/2). – Note: the decrease factor is between 1 and 2 (α = 1 halves the window, like TCP).
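
A minimal sketch of the sender-side update just described (my own illustration, not the paper's code); the EWMA weight g = 1/16 is the value suggested in the DCTCP paper.

```python
# Sender-side DCTCP update: per-RTT EWMA of the marked fraction, proportional cut.

class DctcpSender:
    def __init__(self, cwnd: float = 10.0, g: float = 1.0 / 16):
        self.cwnd = cwnd      # congestion window (packets)
        self.alpha = 0.0      # running estimate of the fraction of marked packets
        self.g = g

    def on_rtt_end(self, acked: int, marked: int) -> None:
        """Called once per RTT with counts of ACKed and ECN-marked packets."""
        frac = marked / acked if acked else 0.0
        # EWMA over the per-RTT marked fraction: alpha <- (1 - g)*alpha + g*F
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked:
            # Cut in proportion to the extent of congestion: W <- W*(1 - alpha/2)
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1.0  # otherwise grow as in standard congestion avoidance

# Example: a lightly congested RTT (1 of 10 packets marked) barely cuts the
# window, whereas standard TCP would halve it.
s = DctcpSender(cwnd=100)
s.on_rtt_end(acked=10, marked=1)
print(round(s.cwnd, 1), round(s.alpha, 4))
```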

  18. Working with delayed ACKs (Figure 10 from the paper: two-state ACK generation state machine.)
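
A rough sketch of what that two-state ACK machine does, based on my reading of the paper's Figure 10 (not its code); the delayed-ACK factor m = 2 is an assumption.

```python
# DCTCP receiver with delayed ACKs: keep one state bit for the last CE codepoint;
# when CE changes, immediately ACK the pending packets echoing the OLD state so the
# sender still gets an accurate count of marked vs. unmarked packets.

class DctcpReceiver:
    def __init__(self, m: int = 2):
        self.ce_state = 0      # last observed CE codepoint (the "CE = 0/1" state)
        self.pending = 0       # packets received but not yet ACKed
        self.m = m             # delayed-ACK factor (assumed: ACK every 2nd packet)

    def on_packet(self, ce: int) -> list:
        """Return (ack type, ECN-Echo bit) pairs emitted for this packet."""
        acks = []
        if ce != self.ce_state:
            if self.pending:
                acks.append(("immediate", self.ce_state))  # flush with old state
                self.pending = 0
            self.ce_state = ce
        self.pending += 1
        if self.pending >= self.m:
            acks.append(("delayed", self.ce_state))        # normal delayed ACK
            self.pending = 0
        return acks

# Example: CE codepoints 0, 1, 1, 0, 0 trigger one immediate ACK and two delayed ACKs.
r = DctcpReceiver()
for ce in [0, 1, 1, 0, 0]:
    print(ce, r.on_packet(ce))
```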

  19. DCTCP in Action • Setup: Win 7 hosts, Broadcom 1Gbps switch • Scenario: 2 long-lived flows, K = 30KB (Figure: queue length in Kbytes over time.)

  20. Why it Works 1. High Burst Tolerance ✓ Large buffer headroom → bursts fit. ✓ Aggressive marking → sources react before packets are dropped. 2. Low Latency ✓ Small buffer occupancies → low queuing delay. 3. High Throughput ✓ ECN averaging → smooth rate adjustments, low variance.

  21. Analysis • How low can DCTCP maintain queues without loss of throughput? • How do we set the DCTCP parameters? Ø Need to quantify queue size oscillations (stability). (Figure: window-size sawtooth oscillating between (W*+1)(1−α/2) and W*+1, around W*.)

  22. Analysis • How low can DCTCP maintain queues without loss of throughput? • How do we set the DCTCP parameters? Ø Need to quantify queue size oscillations (stability). (Figure: the same sawtooth; packets sent in the RTT where the window exceeds W* are marked.)

  23. Analysis • Queue occupancy: Q(t) = N·W(t) − C×RTT • The key observation is that with synchronized senders, the queue size exceeds the marking threshold K for exactly one RTT in each period of the sawtooth, before the sources receive ECN marks and reduce their window sizes accordingly. • S(W1, W2) = (W2² − W1²)/2 is the number of packets sent while the window grows from W1 to W2. • Critical window size when ECN marking occurs: W* = (C×RTT + K)/N

  24. • Fraction of marked packets: α = S(W*, W*+1) / S((W*+1)(1−α/2), W*+1) • This gives α²(1 − α/4) = (2W*+1)/(W*+1)² ≈ 2/W*, hence α ≈ √(2/W*) • Single-flow oscillation: D = (W*+1) − (W*+1)(1−α/2) = (W*+1)·α/2 • Queue amplitude: A = N·D = N(W*+1)·α/2 ≈ (1/2)·√(2N(C×RTT+K)) • Period of oscillation: T_C = D = (1/2)·√(2(C×RTT+K)/N) (in RTTs) • Using Q(t) = N·W(t) − C×RTT: Q_max = N(W*+1) − C×RTT = K + N

  25. Analysis • How low can DCTCP maintain queues without loss of throughput? • How do we set the DCTCP parameters? Ø Need to quantify queue size oscillations (stability). • Q_min = Q_max − A = K + N − (1/2)·√(2N(C×RTT + K)) • Requiring Q_min > 0 gives a marking threshold of roughly K > (C×RTT)/7, i.e., about 85% less buffer than TCP's bandwidth-delay product.
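
To see how the formulas on slides 23-25 play out, here is a small worked example with assumed numbers: a 1 Gbps link, 100 µs RTT, 1500-byte packets, N = 2 flows, and K = 20 packets.

```python
# Worked example of the DCTCP steady-state analysis (assumed parameter values).
from math import sqrt

C = 1e9 / (8 * 1500)    # link capacity in packets per second (1 Gbps, 1500 B pkts)
RTT = 100e-6            # round-trip time in seconds
N = 2                   # number of synchronized long flows
K = 20                  # marking threshold (packets)

bdp = C * RTT                         # C x RTT in packets (~8.3)
W_star = (bdp + K) / N                # critical window when marking starts
alpha = sqrt(2 / W_star)              # steady-state marked fraction
A = 0.5 * sqrt(2 * N * (bdp + K))     # amplitude of queue oscillation (packets)
T_C = 0.5 * sqrt(2 * (bdp + K) / N)   # period of oscillation (RTTs)
Q_max = K + N                         # peak queue (packets)
Q_min = Q_max - A                     # trough of the queue sawtooth (~17 > 0, so
                                      # the queue never drains and throughput stays high)

print(f"W* = {W_star:.1f} pkts, alpha = {alpha:.2f}")
print(f"A = {A:.1f} pkts, T_C = {T_C:.1f} RTTs")
print(f"Q_max = {Q_max} pkts, Q_min = {Q_min:.1f} pkts")
```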

  26. Evaluation • Implemented in the Windows stack. • Real hardware, 1Gbps and 10Gbps experiments – 90-server testbed – Broadcom Triumph: 48 1G ports, 4MB shared memory – Cisco Cat4948: 48 1G ports, 16MB shared memory – Broadcom Scorpion: 24 10G ports, 4MB shared memory • Numerous micro-benchmarks – Throughput and Queue Length – Fairness and Convergence – Multi-hop – Incast – Static vs Dynamic Buffer Mgmt – Queue Buildup – Buffer Pressure • Cluster traffic benchmark

  27. Cluster Traffic Benchmark • Emulate traffic within 1 rack of a Bing cluster – 45 1G servers, 10G server for external traffic • Generate query and background traffic – Flow sizes and arrival times follow distributions seen in Bing • Metric: flow completion time for queries and background flows. (RTOmin = 10ms for both TCP & DCTCP.)

  28. Baseline (Figures: Background Flows, Query Flows)

  29. Baseline (Figures: Background Flows, Query Flows) ✓ Low latency for short flows.

  30. Baseline (Figures: Background Flows, Query Flows) ✓ Low latency for short flows. ✓ High throughput for long flows.

  31. Baseline (Figures: Background Flows, Query Flows) ✓ Low latency for short flows. ✓ High throughput for long flows. ✓ High burst tolerance for query flows.

  32. Scaled Background & Query – 10x Background, 10x Query (Figures: Query, Short messages)

  33. Conclusions • DCTCP satisfies all our requirements for Data Center packet transport. ✓ Handles bursts well ✓ Keeps queuing delays low ✓ Achieves high throughput • Features: ✓ Very simple change to TCP and a single switch parameter. ✓ Based on mechanisms already available in silicon.

  34. Comments • Real-world data • A novel idea • Comprehensive evaluation • Did not compare with the alternative of eliminating RTOmin and using microsecond-granularity RTT measurements • Deadline-based scheduling research

  35. Discussion • How does DCTCP differ from TCP? • Will DCTCP work well on the Internet? Why? • Is there a tradeoff between generality and performance?

  36. Re-architecting datacenter networks and stacks for low latency and high performance Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W. Moore, Gianni Antichi, and Marcin Wójcik

  37. Motivation • Low latency • High throughput

  38. Design assumptions • Clos topology • The designer can change end-system protocol stacks as well as switches

  39. • https://www.youtube.com/watch?v=OI3mh1Vx8xI

  40. Discussion • Will NDP work well on the Internet? Why? • Is there a tradeoff between generality and performance? • Will it work well on non-Clos topologies?

  41. Summary • How to overcome the transport challenges in DC networks • DCTCP – Uses the fraction of CE-marked packets to estimate the extent of congestion – Smooths sending rates, keeping queues short • NDP – Start at line rate, spray packets across paths, trim payloads to headers when queues fill
