CompSci 514: Computer Networks Lecture 14 Datacenter Transport protocols II Xiaowei Yang
Roadmap • Clos topology • Datacenter TCP • Re-architecting datacenter networks and stacks for low latency and high performance – Best Paper award, SIGCOMM’17
Motivation for Clos topology • A Clos topology aims to match the performance of a single crossbar switch • When the number of ports n is large, it is impractical to build a single n×n crossbar switch
Clos topology • A multi-stage switching network • A path from any input port to any output port • Each switch has a small number of ports
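To make the port arithmetic concrete, here is a minimal sketch (my own illustration, not from the slides) that sizes a k-ary fat tree, the folded three-stage Clos commonly used in datacenters; the formulas assume the standard fat-tree construction built from k-port switches.

```python
# A minimal sketch (not from the slides): sizing a k-ary fat tree,
# the folded 3-stage Clos commonly used in datacenters.
def fat_tree_capacity(k: int):
    """Return (hosts, switches) for a fat tree built from k-port switches
    (k must be even)."""
    assert k % 2 == 0
    hosts = k ** 3 // 4            # k pods * (k/2 edge switches) * (k/2 hosts each)
    edge = agg = k * k // 2        # k pods * k/2 switches per layer
    core = (k // 2) ** 2
    return hosts, edge + agg + core

# Example: 48-port switches -> 27,648 hosts from 2,880 small switches,
# instead of one (infeasible) 27,648-port crossbar.
print(fat_tree_capacity(48))
```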
Roadmap • Clos topology • Datacenter TCP • Re-architecting datacenter networks and stacks for low latency and high performance – Best Paper award, SIGCOMM’17
Datacenter Impairments • Incast • Queue Buildup • Buffer Pressure
Queue Buildup • Big flows build up queues. Ø Increased latency for short flows. • Measurements in a Bing cluster: Ø For 90% of packets: RTT < 1 ms Ø For 10% of packets: 1 ms < RTT < 15 ms [Figure: two senders sharing a switch queue to one receiver]
Data Center Transport Requirements 1. High Burst Tolerance – Incast due to Partition/Aggregate is common. 2. Low Latency – Short flows, queries 3. High Throughput – Continuous data updates, large file transfers The challenge is to achieve these three together.
Tension Between Requirements • High throughput vs. low latency vs. high burst tolerance – Deep buffers: queuing delays increase latency – Shallow buffers: bad for bursts and throughput – AQM (RED): average-queue signal not fast enough for incast – Reduced RTOmin (SIGCOMM '09): doesn't help latency • Objective: low queue occupancy and high throughput → DCTCP
The DCTCP Algorithm
Review: The TCP/ECN Control Loop • ECN = Explicit Congestion Notification • The switch sets a 1-bit ECN mark on packets when congested; the receiver echoes the mark back to the sender, which reduces its window. [Figure: two senders, a marking switch, and a receiver]
Small Queues & TCP Throughput: The Buffer Sizing Story
• Bandwidth-delay product rule of thumb: – A single flow needs B = C × RTT of buffering for 100% throughput.
• Appenzeller rule of thumb (SIGCOMM '04): – With a large number N of flows, B = C × RTT / √N is enough.
• Can't rely on this statistical-multiplexing benefit in the DC. – Measurements show typically 1-2 big flows at each server, at most 4.
• Real rule of thumb: low variance in sending rate → small buffers suffice.
[Figure: congestion-window sawtooth vs. buffer size B, throughput at 100%]
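As a quick illustration of the buffer-sizing arithmetic (example numbers, not from the slides):

```python
# Illustrative buffer-sizing arithmetic (example parameters, not from the slides).
C_bps = 10e9          # link capacity: 10 Gbps
rtt_s = 250e-6        # datacenter RTT: 250 microseconds
N = 2                 # typical number of big flows per server in the DC

bdp_bytes = C_bps * rtt_s / 8
print(f"BDP (single-flow rule):        {bdp_bytes/1e3:.0f} KB")
print(f"Appenzeller rule (BDP/sqrt N): {bdp_bytes/(N**0.5)/1e3:.0f} KB")
# With only 1-2 big flows, sqrt(N) barely helps -- the DC cannot rely on
# statistical multiplexing to shrink buffers; low rate variance has to do it.
```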
Two Key Ideas
1. React in proportion to the extent of congestion, not its presence. ü Reduces variance in sending rates, lowering queuing requirements.
   ECN marks 1 0 1 1 1 1 0 1 1 1 → TCP: cut window by 50%; DCTCP: cut window by 40%
   ECN marks 0 0 0 0 0 0 0 0 0 1 → TCP: cut window by 50%; DCTCP: cut window by 5%
2. Mark based on instantaneous queue length. ü Fast feedback to better deal with bursts.
Data Center TCP Algorithm
Switch side: – Mark packets (set CE) when instantaneous queue length > K. [Figure: queue of capacity B with marking threshold K — don't mark below K, mark above K]
Sender side: – Maintain a running average of the fraction of packets marked (α). In each RTT: α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the most recent RTT.
Ø Adaptive window decrease: W ← W × (1 − α/2)
– Note: the window is divided by a factor between 1 and 2 (standard TCP always divides by 2).
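A minimal sketch of the sender-side update described above, assuming g = 1/16 and a congestion window measured in packets; this is an illustration of the rule, not the paper's implementation.

```python
# DCTCP sender-side sketch (assumes g = 1/16, window in packets).
class DctcpSender:
    def __init__(self, cwnd: float = 10.0, g: float = 1.0 / 16):
        self.cwnd = cwnd
        self.alpha = 0.0   # running estimate of the fraction of marked packets
        self.g = g         # EWMA gain

    def on_rtt_end(self, acked: int, marked: int) -> None:
        """Called once per RTT with counts of ACKed and ECN-marked packets."""
        frac = marked / acked if acked else 0.0
        # alpha <- (1 - g) * alpha + g * F
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked > 0:
            # Cut in proportion to the extent of congestion, not its presence.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1.0   # standard additive increase

# Example: 80% of packets marked in one RTT barely moves alpha at first,
# so the first cut is gentle; persistent marking drives alpha upward and
# the cut toward the TCP-like 50%.
s = DctcpSender()
s.on_rtt_end(acked=10, marked=8)
print(round(s.alpha, 3), round(s.cwnd, 2))
```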
Working with delayed ACKs [Figure 10: two-state ACK generation state machine — the receiver keeps one state per CE codepoint, sends one delayed ACK per m packets while the codepoint is unchanged, and sends an immediate ACK with the old ECN-Echo value whenever the codepoint changes, so the sender can recover the exact fraction of marked packets.]
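A minimal sketch of that two-state receiver, assuming a delayed-ACK factor of m = 2; details are simplified relative to the paper's state machine.

```python
# DCTCP receiver ACK-generation sketch (assumes m = 2; simplified).
class DctcpReceiver:
    def __init__(self, m: int = 2):
        self.ce_state = 0      # last CE codepoint seen (0 or 1)
        self.pending = 0       # packets received since the last ACK
        self.m = m

    def on_packet(self, ce: int):
        """Return None, or ('ack', ece_flag) if an ACK should be sent now."""
        if ce != self.ce_state:
            # CE changed: immediately ACK any pending packets with the *old*
            # ECN-Echo value, so the sender's marked-packet count stays exact.
            ack = ("ack", self.ce_state) if self.pending else None
            self.ce_state, self.pending = ce, 1
            return ack
        self.pending += 1
        if self.pending >= self.m:          # normal delayed ACK
            self.pending = 0
            return ("ack", self.ce_state)
        return None
```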
DCTCP in Action • Setup: Windows 7 hosts, Broadcom 1 Gbps switch • Scenario: 2 long-lived flows, K = 30 KB [Figure: instantaneous queue length (KBytes) over time for TCP vs. DCTCP]
Why it Works 1. High Burst Tolerance ü Large buffer headroom → bursts fit. ü Aggressive marking → sources react before packets are dropped. 2. Low Latency ü Small buffer occupancies → low queuing delay. 3. High Throughput ü ECN averaging → smooth rate adjustments, low variance.
Analysis • How low can DCTCP maintain queues without loss of throughput? • How do we set the DCTCP parameters? Ø Need to quantify queue-size oscillations (stability). [Figure: window-size sawtooth oscillating between (W*+1)(1−α/2) and W*+1; the packets sent in the one RTT where the window exceeds W* are the ones that get marked]
Analysis
• Queue occupancy: Q(t) = N·W(t) − C × RTT
• Key observation: with synchronized senders, the queue exceeds the marking threshold K for exactly one RTT in each sawtooth period, before the sources receive ECN marks and reduce their windows.
• Packets sent while the window grows from W1 to W2 (one more per RTT): S(W1, W2) = (W2² − W1²)/2
• Critical window size when ECN marking starts: W* = (C × RTT + K)/N
• Fraction of marked packets: α = S(W*, W* + 1) / S((W* + 1)(1 − α/2), W* + 1)
• This simplifies to α²(1 − α/4) = (2W* + 1)/(W* + 1)² ≈ 2/W*, so α ≈ √(2/W*)
• Single-flow oscillation: D = (W* + 1) − (W* + 1)(1 − α/2) = (W* + 1)·α/2
• Queue amplitude: A = N·D = N(W* + 1)·α/2 ≈ (N/2)·√(2W*) = (1/2)·√(2N(C × RTT + K))
• Period of the oscillation: T_C = D = (1/2)·√(2(C × RTT + K)/N) (in RTTs)
• Using Q(t) = N·W(t) − C × RTT: Q_max = N(W* + 1) − C × RTT = K + N
Analysis (continued)
• Minimum queue: Q_min = Q_max − A = K + N − (1/2)·√(2N(C × RTT + K))
• Choose K so that Q_min stays above zero: the queue never drains, so throughput is preserved while queues stay small.
• Result: ~85% less buffer than TCP.
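Plugging example numbers into the formulas above (parameters are my own choices, not from the slides) shows the queue staying well above zero:

```python
# Illustrative numbers for the DCTCP analysis (example parameters):
# 10 Gbps link, 100 us RTT, N = 2 synchronized flows, 1500-byte packets,
# marking threshold K = 65 packets.
from math import sqrt

C = 10e9 / (1500 * 8)   # link capacity in packets per second
RTT = 100e-6            # seconds
N = 2
K = 65                  # marking threshold in packets

bdp = C * RTT                          # C x RTT in packets
W_star = (bdp + K) / N                 # window when marking starts
A = 0.5 * sqrt(2 * N * (bdp + K))      # queue oscillation amplitude (packets)
Q_max = K + N
Q_min = Q_max - A                      # stays > 0, so throughput is preserved

print(f"BDP={bdp:.0f} pkts  W*={W_star:.0f}  A={A:.1f}  "
      f"Qmax={Q_max}  Qmin={Q_min:.1f} pkts")
```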
Evaluation • Implemented in Windows stack. • Real hardware, 1Gbps and 10Gbps experiments – 90 server testbed – Broadcom Triumph 48 1G ports – 4MB shared memory – Cisco Cat4948 48 1G ports – 16MB shared memory – Broadcom Scorpion 24 10G ports – 4MB shared memory • Numerous micro-benchmarks – Throughput and Queue Length – Fairness and Convergence – Multi-hop – Incast – Static vs Dynamic Buffer Mgmt – Queue Buildup – Buffer Pressure • Cluster traffic benchmark
Cluster Traffic Benchmark • Emulate traffic within 1 rack of a Bing cluster – 45 1G servers, one 10G server for external traffic • Generate query and background traffic – Flow sizes and arrival times follow distributions seen in Bing • Metric: flow completion time for queries and background flows. (RTOmin = 10 ms for both TCP & DCTCP.)
Baseline [Figure: flow completion times for background flows and query flows, TCP vs. DCTCP] ü Low latency for short flows. ü High throughput for long flows. ü High burst tolerance for query flows.
Scaled Background & Query: 10x background, 10x query [Figure: completion times for query traffic and short messages under the scaled load]
Conclusions • DCTCP satisfies all our requirements for data center packet transport. ü Handles bursts well ü Keeps queuing delays low ü Achieves high throughput • Features: ü Very simple change to TCP and a single switch parameter. ü Based on ECN mechanisms already available in switch silicon.
Comments • Real-world data • A novel idea • Comprehensive evaluation • Did not compare against the alternative of eliminating RTOmin via microsecond-granularity RTT measurement • Opens up deadline-aware scheduling research
Discussion • How does DCTCP differ from TCP? • Will DCTCP work well on the Internet? Why? • Is there a tradeoff between generality and performance?
Re-architecting datacenter networks and stacks for low latency and high performance Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W. Moore, Gianni Antichi, and Marcin Wójcik
Motivation • Low latency • High throughput
Design assumptions • Clos Topology • Designer can change end system protocol stacks as well as switches
• NDP talk video: https://www.youtube.com/watch?v=OI3mh1Vx8xI
Discussion • Will NDP work well on the Internet? Why or why not? • Is there a tradeoff between generality and performance? • Will it work well on a non-Clos topology?
Summary • How to overcome the transport challenges in DC networks • DCTCP – Uses the fraction of CE-marked packets to estimate the extent of congestion – Smooths sending rates, keeping queues short • NDP – Start at full rate, spray packets across paths, trim payloads to headers when queues build
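A minimal sketch (my own illustration, not from the NDP paper) of two of those ideas: per-packet spraying across equal-cost paths and payload trimming when a queue is full, so the header still reaches the receiver and can trigger a pull-based retransmission.

```python
# Toy model of NDP-style spraying and trimming (illustrative only).
import random
from collections import deque

PATHS = 4
QUEUE_LIMIT = 8          # packets per switch queue (assumed)

queues = [deque() for _ in range(PATHS)]

def send(pkt_id: int) -> None:
    """Sender sprays each packet onto a random path (no per-flow hashing)."""
    path = random.randrange(PATHS)
    q = queues[path]
    if len(q) < QUEUE_LIMIT:
        q.append(("data", pkt_id))      # full packet forwarded
    else:
        q.append(("header", pkt_id))    # payload trimmed, header kept
        # The receiver sees the trimmed header, learns the packet existed,
        # and pulls a retransmission instead of waiting for a timeout.
        # (Real NDP places trimmed headers in a separate high-priority
        # queue; that detail is omitted here.)

for i in range(100):
    send(i)
trimmed = sum(1 for q in queues for kind, _ in q if kind == "header")
print(f"{trimmed} of 100 packets trimmed")
```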