

  1. Lecture 18: Congestion Control in Data Center Networks

  2. Overview • Why is the problem different from that in the Internet? • What are possible solutions?

  3. DC Traffic Patterns
     • Incast applications
       – Clients send queries to servers
       – Responses are synchronized
     • Few overlapping long flows
       – According to DCTCP’s measurements

  4. Data Center TCP (DCTCP) • Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan • Microsoft Research and Stanford University

  5. Data Center Packet Transport
     • Large purpose-built DCs
       – Huge investment: R&D, business
     • Transport inside the DC
       – TCP rules (99.9% of traffic)
     • How’s TCP doing?

  6. TCP in the Data Center
     • We’ll see that TCP does not meet the demands of apps.
       – Suffers from bursty packet drops, Incast [SIGCOMM ‘09], ...
       – Builds up large queues: adds significant latency, and wastes precious buffers (especially bad with shallow-buffered switches).
     • Operators work around TCP problems.
       – Ad hoc, inefficient, often expensive solutions
       – No solid understanding of consequences, tradeoffs

  7. Roadmap
     • What’s really going on?
       – Interviews with developers and operators
       – Analysis of applications
       – Switches: shallow-buffered vs deep-buffered
       – Measurements
     • A systematic study of transport in Microsoft’s DCs
       – Identify impairments
       – Identify requirements
     • Our solution: Data Center TCP

  8. Case Study: Microsoft Bing
     • Measurements from a 6,000-server production cluster
     • Instrumentation passively collects logs
       – Application-level
       – Socket-level
       – Selected packet-level
     • More than 150TB of compressed data over a month

  9. Partition/Aggregate Application Structure
     • A query fans out from a top-level aggregator (TLA) to mid-level aggregators (MLAs) to worker nodes, and the responses are aggregated back up.
     • Time is money: strict deadlines (SLAs) at each level, e.g. 250 ms at the TLA, 50 ms at the MLAs, 10 ms at the workers.
     • A missed deadline means a lower-quality result.
     [Diagram: TLA → MLAs → worker nodes, with per-level deadlines; the example queries and answers are Picasso quotes.]

  10. Generality of Partition/Aggregate
     • The foundation for many large-scale web applications.
       – Web search, social network composition, ad selection, etc.
     • Example: Facebook, where Partition/Aggregate ~ Multiget
       – Aggregators: web servers
       – Workers: memcached servers (memcached protocol)

  11. Workloads
     • Partition/Aggregate (Query): delay-sensitive
     • Short messages [50KB–1MB] (Coordination, Control state): delay-sensitive
     • Large flows [1MB–50MB] (Data update): throughput-sensitive

  12. Impairments • Incast • Queue Buildup • Buffer Pressure

  13. Incast
     • Synchronized mice collide.
       – Caused by Partition/Aggregate.
       – The dropped responses sit out a TCP timeout, with RTO_min = 300 ms.
     [Diagram: Workers 1–4 responding to one Aggregator.]
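
A rough worked check, using the deadlines from the Partition/Aggregate slide, shows why a single timeout is fatal:

    RTO_min = 300 ms  >  250 ms (TLA)  >  50 ms (MLA)  >  10 ms (worker)

so one retransmission timeout alone already exceeds even the loosest deadline in the pipeline.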

  14. Incast Really Happens
     • [Plot: MLA query completion time (ms) over the course of a day.]
     • Requests are jittered over a 10 ms window, and the 99.9th percentile is tracked. Jittering trades off the median against the high percentiles.
     • Jittering was switched off around 8:30 am.
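
As a concrete illustration of the jittering workaround measured here, a minimal sketch (the function and argument names are hypothetical; the slide only tells us the window is 10 ms):

```python
import random
import threading

JITTER_WINDOW_S = 0.010  # 10 ms window, as on this slide

def send_with_jitter(send_query, workers):
    """Give each otherwise-synchronized request a random start offset.

    Spreading the fan-out over the window trades median latency (every request
    now waits a little) for better high percentiles (fewer synchronized
    response bursts, hence fewer incast timeouts).
    """
    for w in workers:
        delay = random.uniform(0, JITTER_WINDOW_S)
        threading.Timer(delay, send_query, args=(w,)).start()
```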

  15. Incast: Goodput collapses as senders increase

  16. Incast: Synchronized timeouts

  17. Queue Buildup
     • Big flows build up queues.
       – Increased latency for short flows.
     • Measurements in the Bing cluster:
       – For 90% of packets: RTT < 1 ms
       – For 10% of packets: 1 ms < RTT < 15 ms
     [Diagram: Sender 1 and Sender 2 sharing the path to a Receiver.]

  18. Data Center Transport Requirements
     1. High Burst Tolerance
        – Incast due to Partition/Aggregate is common.
     2. Low Latency
        – Short flows, queries
     3. High Throughput
        – Continuous data updates, large file transfers
     The challenge is to achieve these three together.

  19. Tension Between Requirements
     • Deep buffers: tolerate bursts, but queuing delays increase latency.
     • Shallow buffers: low delay, but bad for bursts and throughput.
     • Reduced RTO_min (SIGCOMM ‘09): helps Incast, but doesn’t help latency.
     • AQM (RED): marking on the average queue is not fast enough for Incast.
     • Objective: low queue occupancy and high throughput; this is what DCTCP targets.

  20. The DCTCP Algorithm

  21. Review: The TCP/ECN Control Loop
     • ECN = Explicit Congestion Notification
     • The switch sets a 1-bit ECN mark on packets instead of dropping them; the receiver echoes the mark back to the sender, which reduces its window.
     [Diagram: Sender 1 and Sender 2 → switch (ECN mark, 1 bit) → Receiver.]
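
For contrast with DCTCP below, here is a minimal sketch of the conventional loop; the function names are hypothetical and the marking policy is simplified (RED actually marks probabilistically based on the average queue):

```python
# Conventional TCP/ECN control loop (sketch): the switch marks instead of
# dropping, the receiver echoes the mark in its ACKs, and the sender halves
# its window at most once per window of data, regardless of how many packets
# were marked.

def switch_forward(pkt, avg_queue, mark_threshold):
    """AQM side: set the 1-bit Congestion Experienced mark when the average
    queue is high (simplified from RED's probabilistic marking)."""
    if avg_queue > mark_threshold:
        pkt["ce"] = True
    return pkt

def receiver_ack(pkt):
    """Receiver side: echo the mark back to the sender."""
    return {"ecn_echo": pkt.get("ce", False)}

def sender_on_ack(cwnd, ack, reacted_this_window):
    """Sender side: any echoed mark halves cwnd, at most once per window."""
    if ack["ecn_echo"] and not reacted_this_window:
        return cwnd / 2, True
    return cwnd, reacted_this_window
```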

  22–25. Small Queues & TCP Throughput: The Buffer Sizing Story
     • Bandwidth-delay product rule of thumb: a single flow needs a buffer of one bandwidth-delay product to sustain 100% throughput.
     • Appenzeller et al. rule of thumb (SIGCOMM ‘04): with a large number of flows, a much smaller buffer is enough.
     • Can’t rely on this statistical-multiplexing benefit in the DC: measurements show typically 1–2 big flows at each server, at most 4.
     • Real rule of thumb: low variance in sending rate → small buffers suffice.
     [Figure: TCP cwnd sawtooth against buffer size B, with throughput at 100%.]
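
Written out, the two sizing rules referenced above are usually stated as follows (C is link capacity, RTT the round-trip time, N the number of long flows; the notation is mine, not the slides'):

```latex
% Bandwidth-delay product rule of thumb: a single flow needs a full BDP of
% buffering to keep the link at 100% utilization.
B = C \times \mathrm{RTT}

% Appenzeller et al., SIGCOMM 2004: with N desynchronized long flows, a much
% smaller buffer is enough.
B = \frac{C \times \mathrm{RTT}}{\sqrt{N}}
```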

  26. Two Key Ideas
     1. React in proportion to the extent of congestion, not its presence.
        ✓ Reduces variance in sending rates, lowering queuing requirements.
        ECN marks 1 0 1 1 1 1 0 1 1 1 → TCP cuts window by 50%; DCTCP cuts by 40%
        ECN marks 0 0 0 0 0 0 0 0 0 1 → TCP cuts window by 50%; DCTCP cuts by 5%
     2. Mark based on instantaneous queue length.
        ✓ Fast feedback to better deal with bursts.
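
A small sketch (illustrative only, not the authors' code) that reproduces the numbers above: the cut scales with the fraction of marked packets; the real algorithm additionally smooths this fraction over time, as the next slide shows.

```python
def tcp_cut(cwnd, marks):
    """Standard TCP/ECN: halve the window if anything was marked."""
    return cwnd * 0.5 if any(marks) else cwnd

def dctcp_cut(cwnd, marks):
    """DCTCP idea: cut in proportion to the fraction of marked packets."""
    frac = sum(marks) / len(marks)
    return cwnd * (1 - frac / 2)

heavy = [1, 0, 1, 1, 1, 1, 0, 1, 1, 1]   # 80% marked -> DCTCP cuts by 40%
light = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]   # 10% marked -> DCTCP cuts by 5%

for marks in (heavy, light):
    print(tcp_cut(1000, marks), dctcp_cut(1000, marks))
# prints roughly 500, 600 for the heavy case and 500, 950 for the light case
```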

  27. Data Center TCP Algorithm
     • Switch side:
       – Mark packets (set ECN) when the instantaneous queue length > K.
       [Diagram: buffer of size B with marking threshold K; don’t mark below K, mark above K.]
     • Sender side:
       – Maintain a running average of the fraction of packets marked (α). In each RTT:
         α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the last RTT.
       – Adaptive window decrease:
         cwnd ← cwnd × (1 − α/2)
       – Note: the decrease factor is between 1 and 2.
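
A minimal sender-side sketch of these rules in code (the parameter names g and K follow the DCTCP paper; the class name, scaffolding, and the additive-increase line are my assumptions, not the slides'):

```python
class DctcpSender:
    def __init__(self, cwnd=10.0, g=1.0 / 16):
        self.cwnd = cwnd     # congestion window, in packets
        self.alpha = 0.0     # running estimate of the fraction of marked packets
        self.g = g           # EWMA gain (the paper suggests g = 1/16)

    def on_rtt_end(self, acked, marked):
        """Call once per RTT with the number of ACKed and ECN-marked packets."""
        frac = marked / acked if acked else 0.0
        # alpha <- (1 - g) * alpha + g * F
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked:
            # react in proportion to the extent of congestion
            self.cwnd *= (1 - self.alpha / 2)
        else:
            # no congestion signal: ordinary additive increase
            self.cwnd += 1.0

# Switch side, conceptually: mark on the instantaneous queue length.
def should_mark(queue_bytes, K=30 * 1024):   # K = 30KB, as in the demo on the next slide
    return queue_bytes > K
```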

  28. DCTCP in Action
     • [Plot: instantaneous queue length (KBytes) over time.]
     • Setup: Windows 7, Broadcom 1Gbps switch
     • Scenario: 2 long-lived flows, K = 30KB

  29. Why it Works
     1. High Burst Tolerance
        ✓ Large buffer headroom → bursts fit.
        ✓ Aggressive marking → sources react before packets are dropped.
     2. Low Latency
        ✓ Small buffer occupancies → low queuing delay.
     3. High Throughput
        ✓ ECN averaging → smooth rate adjustments, low variance.

  30. Evaluation
     • Implemented in the Windows stack.
     • Real hardware, 1Gbps and 10Gbps experiments
       – 90-server testbed
       – Broadcom Triumph: 48 1G ports, 4MB shared memory
       – Cisco Cat4948: 48 1G ports, 16MB shared memory
       – Broadcom Scorpion: 24 10G ports, 4MB shared memory
     • Numerous micro-benchmarks
       – Throughput and queue length
       – Fairness and convergence
       – Multi-hop
       – Incast
       – Static vs dynamic buffer management
       – Queue buildup
       – Buffer pressure
     • Cluster traffic benchmark

  31. Cluster Traffic Benchmark
     • Emulate traffic within one rack of a Bing cluster
       – 45 1G servers, plus a 10G server for external traffic
     • Generate query and background traffic
       – Flow sizes and arrival times follow distributions seen in Bing
     • Metric: flow completion time for queries and background flows.
     • We use RTO_min = 10 ms for both TCP and DCTCP.

  32–35. Baseline
     • [Plots: flow completion times for background flows and query flows.]
     ✓ Low latency for short flows.
     ✓ High throughput for long flows.
     ✓ High burst tolerance for query flows.

  36. Scaled Background & Query: 10x Background, 10x Query

  37. Scalability

  38. Conclusions
     • DCTCP satisfies all our requirements for data center packet transport.
       ✓ Handles bursts well
       ✓ Keeps queuing delays low
       ✓ Achieves high throughput
     • Features:
       ✓ Very simple change to TCP and a single switch parameter.
       ✓ Based on mechanisms already available in silicon.

  39. Discussion
     • What if traffic patterns change? E.g., many overlapping flows.
     • What do you like/dislike?
