
On ParaVirtualizing TCP: Congestion Control in Xen Virtual Machines



  1. On ParaVirtualizing TCP: Congestion Control in Xen Virtual Machines. Luwei Cheng, Cho-Li Wang, Francis C.M. Lau, Department of Computer Science, The University of Hong Kong. Xen Project Developer Summit 2013, Edinburgh, UK, October 24-25, 2013.

  2. Outline
     - Motivation: Physical datacenter vs. Virtualized datacenter; Incast congestion
     - Understand the Problem: Pseudo-congestion; Sender-side vs. Receiver-side
     - PVTCP, a ParaVirtualized TCP: Design, Implementation, Evaluation
     - Questions & Comments

  3. Outline
     - Motivation: Physical datacenter vs. Virtualized datacenter; Incast congestion
     - Understand the Problem: Pseudo-congestion; Sender-side vs. Receiver-side
     - PVTCP, a ParaVirtualized TCP: Design, Implementation, Evaluation
     - Questions & Comments

  4. Physical datacenter vs. Virtualized datacenter
     [Figure: core switch, ToR switches and servers in a rack; in the virtualized case each server in the rack hosts several VMs.]
     - Physical datacenter: a set of physical machines. Network delays: propagation delays of the physical network/switches.
     - Virtualized datacenter: a set of virtual machines. Network delays: additional delays due to virtualization overhead.

  5. Virtualization brings “delays”
     [Figure: several VMs multiplexed by the hypervisor onto shared pCPUs; the extra delay arises between VMs.]
     1. I/O virtualization overhead (PV or HVM)
        - Guest VMs are unable to directly access the hardware.
        - Additional data movement between dom0 and domUs.
        - HVM: passthrough I/O can avoid it.
     2. VM scheduling delays
        - Multiple VMs share one physical core.

  6. Virtualization brings “delays” (cont’d)
     - Delays of I/O virtualization (PV guests): < 1ms. Average RTT: 0.147ms for PM → PM vs. 0.374ms for 1VM → 1VM.
     - VM scheduling delays: tens of milliseconds (queuing delays in the scheduler). Peak RTT: about 30ms for 1VM → 2VMs and about 60ms for 1VM → 3VMs.
     - VM scheduling delays are the dominant factor in network RTT.

  7. Network delays in public clouds [INFOCOM’10] [HPDC’10]

  8. Incast network congestion
     - A special form of network congestion, typically seen in distributed processing applications (scatter-gather) with barrier-synchronized request workloads.
     - The limited buffer space of the switch output port can easily be overfilled by simultaneous transmissions.
     - Application-level throughput (goodput) can be orders of magnitude lower than the link capacity. [SIGCOMM’09]

  9. Solutions for physical clusters
     - Prior works: increase switch buffer size, limited transmit, reduce the duplicate-ACK threshold, disable slow-start, randomize the timeout value, Reno/NewReno/SACK. None of them can fully eliminate the throughput collapse.
     - The dominant factor: once packet loss happens, whether the sender can learn of it as soon as possible. In case of “tail loss”, the sender can only count on the retransmit timer’s firing.
     - Two representative papers: Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems [FAST’08]; Understanding TCP Incast Throughput Collapse in Datacenter Networks [WREN’09].

  10. Solutions for physical clusters (cont’d)
     - Significantly reducing RTOmin has been shown to be a safe and effective approach. [SIGCOMM’09]
     - Even with ECN support in the hardware switch, a small RTOmin still shows apparent advantages. [DCTCP, SIGCOMM’10]
     - RTOmin in a virtual cluster? Not well studied.

  11. Outline
     - Motivation: Physical datacenter vs. Virtualized datacenter; Incast congestion
     - Understand the Problem: Pseudo-congestion; Sender-side vs. Receiver-side
     - PVTCP, a ParaVirtualized TCP: Design, Implementation, Evaluation
     - Questions & Comments

  12. Pseudo-congestion
     NO network congestion, yet RTT spikes.
     [Figure: three 30ms VMs share one pCPU (3 VMs per core). Red points: measured RTTs. Blue points: calculated RTO values, for RTOmin = 200ms, 100ms, 10ms and 1ms.]
     - TCP’s low-pass filter: RTO = SRTT + 4 × RTTVAR, lower-bounded by RTOmin (Retransmit TimeOut).
     - A small RTOmin → frequent spurious RTOs.
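
     To make the arithmetic on this slide concrete, here is a minimal standalone C sketch of the standard estimator (SRTT/RTTVAR smoothing with the result clamped at RTOmin, in the spirit of RFC 6298). The RTT trace, the 1ms RTOmin and the 60ms spike are hypothetical values chosen to mirror the scheduling-delay scenario above; they are not measurements from the talk.

```c
/*
 * Sketch of the standard RTO estimator (RFC 6298 style), fed with a
 * hypothetical RTT trace where one sample jumps to ~60ms because the peer VM
 * was descheduled. It illustrates why a small RTOmin lets the timer fire long
 * before such a delayed ACK can arrive (a spurious RTO).
 */
#include <stdio.h>

static double srtt   = 0.0;   /* smoothed RTT (ms) */
static double rttvar = 0.0;   /* RTT variance (ms) */

static double update_rto(double sample_ms, double rto_min_ms)
{
    if (srtt == 0.0) {                       /* first measurement */
        srtt   = sample_ms;
        rttvar = sample_ms / 2.0;
    } else {
        double err = sample_ms - srtt;
        rttvar = 0.75 * rttvar + 0.25 * (err < 0 ? -err : err);
        srtt   = 0.875 * srtt + 0.125 * sample_ms;
    }
    double rto = srtt + 4.0 * rttvar;
    return rto < rto_min_ms ? rto_min_ms : rto;   /* lower bound: RTOmin */
}

int main(void)
{
    /* Hypothetical trace: sub-ms RTTs, then one 60ms spike (scheduling delay). */
    double trace[] = { 0.4, 0.3, 0.4, 0.3, 0.4, 60.0 };
    double rto = 0.0;

    for (int i = 0; i < 6; i++) {
        /* The RTO armed *before* the spike is what matters: with RTOmin = 1ms
         * it is only ~1-2ms, so the timer expires well before the 60ms "RTT"
         * completes even though nothing was lost. With RTOmin = 200ms the
         * timer would have survived the spike. */
        if (rto > 0.0 && trace[i] > rto)
            printf("sample %.1fms > armed RTO %.2fms -> spurious timeout\n",
                   trace[i], rto);
        rto = update_rto(trace[i], 1.0 /* RTOmin = 1ms */);
        printf("RTT sample %5.1fms  ->  SRTT %6.2fms  RTO %6.2fms\n",
               trace[i], srtt, rto);
    }
    return 0;
}
```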

  13. Pseudo-congestion (cont’d)
     - A small RTOmin: serious spurious RTOs when RTTs vary widely.
     - A big RTOmin: throughput collapse under heavy network congestion.
     - “Adjusting RTOmin is a tradeoff between timely response and premature timeouts, and there is NO optimal balance between the two.” -- Allman and Paxson [SIGCOMM’99]
     - Virtualized datacenters → a new instantiation of this tradeoff.

  14. Sender-side vs. Receiver-side
     To transmit 4000 1MB data blocks:

       Freq.      3VMs → 1VM (delays hit the sender VMs)   1VM → 3VMs (delays hit the receiver VMs)
       1× RTOs    1086                                      677
       2× RTOs    0                                         673
       3× RTOs    0                                         196
       4× RTOs    0                                         30

     - Scheduling delays at the sender VM: an RTO only happens once at a time.
     - Scheduling delays at the receiver VM: successive RTOs are normal.

  15. A micro-view with tcpdump
     snd.una: the first sent but unacknowledged byte. snd.nxt: the next byte that will be sent.
     [Left figure: time (ms) vs. sequence number, traced from the sender VM, while the sender VM is preempted. The sender VM has been stopped; an ACK arrives before the sender VM wakes up; the RTO happens just after the sender VM wakes up.]
     [Right figure: time (ms) vs. ACK number, traced from the receiver VM, while the receiver VM is preempted. The receiver VM has been stopped; the RTO happens twice before the receiver VM wakes up.]
     - When the sender VM is preempted: the ACK’s arrival time is not delayed, but it is received too late. From TCP’s perspective, the RTO should not be triggered.
     - When the receiver VM is preempted: the generation and the return of the ACKs are delayed, so RTOs must happen on the sender’s side.

  16. The sender-side problem: OS reasons
     [Figure: the TCP receiver returns ACKs over the physical network; the driver domain buffers them while the sender VM waits in the scheduling queue behind VM2 and VM3 (VM scheduling latency). When the sender VM runs again: (1) the timer IRQ is handled first, the retransmit timer has long expired, and TCP declares an RTO (spurious); (2) only then does the network IRQ deliver the buffered ACKs, which would have cleared the timer.]
     - After the VM wakes up, both TIMER and NET events are pending.
     - The RTO happens just before the ACK enters the VM.
     - The reasons lie in common OS design: the timer interrupt is executed before other interrupts, and network processing comes a little later (in the bottom half).
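
     The timeline on this slide can be replayed with a toy simulation. The timestamps below (timer armed at 5ms, ACK buffered at 8ms, VM resumed at 63ms) are hypothetical; the only assumption carried over from the slide is that the pending timer interrupt is serviced before the network bottom half.

```c
/*
 * Toy, self-contained replay of the wake-up ordering described above
 * (not the slides' code; all timestamps are made up for illustration).
 */
#include <stdio.h>

int main(void)
{
    const double rto_expiry_ms   = 5.0;   /* retransmit timer armed before preemption */
    const double ack_buffered_ms = 8.0;   /* ACK reaches the driver domain (host side) */
    const double vm_wakeup_ms    = 63.0;  /* sender VM runs again after ~60ms off-CPU  */

    /* Step 1: the timer IRQ is handled first; the timer is long overdue. */
    if (vm_wakeup_ms >= rto_expiry_ms)
        printf("t=%.0fms: timer IRQ -> RTO fires (spurious: nothing was lost)\n",
               vm_wakeup_ms);

    /* Step 2: only afterwards does the network bottom half hand TCP the ACK
     * that has been sitting in the host's buffer since t=8ms. */
    printf("t=%.0fms: net IRQ -> ACK (buffered since t=%.0fms) processed,\n"
           "          too late to stop the retransmit timer\n",
           vm_wakeup_ms, ack_buffered_ms);
    return 0;
}
```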

  17. To detect spurious RTOs
     - Two well-known detection algorithms: F-RTO and Eifel.
       - Eifel performs much worse than F-RTO in some situations, e.g. with bursty packet loss [CCR’03].
       - F-RTO is implemented in Linux.
     [Figure: F-RTO’s detection rate for 3VMs → 1VM and for 1VM → 3VMs; in both cases the detection rate is low.]
     - F-RTO interacts badly with delayed ACK (ACK coalescing).
       - Reducing the delayed ACK timeout value does NOT help; disabling delayed ACK seems to be helpful.

  18. Delayed ACK vs. CPU overhead
     [Figure: CPU utilization of the sender VM and the receiver VM, with and without delayed ACK.]
     Disabling delayed ACK → significant CPU overhead.

  19. Delayed ACK vs. CPU overhead (cont’d)

                             delack-200ms   delack-1ms   w/o delack
       Total ACKs (test 1)   229,650        244,757      2,832,260
       Total ACKs (test 2)   252,278        262,274      2,832,179

     Disabling delayed ACK: 11~13× more ACKs are sent.
     [Figure: CPU utilization of the sender VM and the receiver VM in each configuration. Disabling delayed ACK → significant CPU overhead.]

  20. Outline
     - Motivation: Physical datacenter vs. Virtualized datacenter; Incast congestion
     - Understand the Problem: Pseudo-congestion; Sender-side vs. Receiver-side
     - PVTCP, a ParaVirtualized TCP: Design, Implementation, Evaluation
     - Questions & Comments

  21. PVTCP – A ParaVirtualized TCP
     - Observation: spurious RTOs only happen when the sender/receiver VM has just experienced a scheduling delay.
     - Main idea: if we can detect such moments and make the guest OS aware of them, there is a chance to handle the problem.
     “The more information about current network conditions available to a transport protocol, the more efficiently it can use the network to transfer its data.” -- Allman and Paxson [SIGCOMM’99]

  22. Detect the VM’s wakeup moment
     [Figure: three 30ms VMs share one pCPU (3 VMs per core). While a VM is running, virtual timer IRQs arrive every 1ms (HZ=1000) and the guest OS advances jiffies one tick at a time. While the VM is NOT running, no ticks are delivered; when the hypervisor’s one-shot timer resumes the VM, the guest clock is caught up in one step, e.g. jiffies += 60.]
     An acute increase of the system clock (jiffies) → the VM just woke up.
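
     A minimal sketch of this detection idea, assuming the guest can hook its periodic tick path. The helper names (pvtcp_note_tick, pvtcp_just_woke_up) and the 5-tick threshold are hypothetical, not the actual PVTCP implementation.

```c
/* Detect a wakeup by watching for an abrupt jump in the guest clock. */
#include <stdbool.h>
#include <stdio.h>

#define WAKEUP_JUMP_TICKS 5            /* more than a few missed ticks at HZ=1000 */

static unsigned long last_seen_jiffies;
static unsigned long wakeup_stamp;      /* jiffies value at the detected wakeup */

/* Hypothetical hook, called from the guest's timer tick path. */
static void pvtcp_note_tick(unsigned long now)
{
    if (last_seen_jiffies != 0 && now - last_seen_jiffies > WAKEUP_JUMP_TICKS) {
        wakeup_stamp = now;             /* clock jumped: the VM was off-CPU */
        printf("tick %lu: jiffies jumped by %lu -> VM just woke up\n",
               now, now - last_seen_jiffies);
    }
    last_seen_jiffies = now;
}

/* True if the VM woke up within the last 'window' ticks. */
static bool pvtcp_just_woke_up(unsigned long now, unsigned long window)
{
    return wakeup_stamp != 0 && now - wakeup_stamp < window;
}

int main(void)
{
    /* While running: one tick per ms. While descheduled (~60ms): no ticks,
     * then the clock is caught up in one step when the VM resumes. */
    unsigned long trace[] = { 1, 2, 3, 63, 64, 65 };
    for (int i = 0; i < 6; i++)
        pvtcp_note_tick(trace[i]);

    printf("just woke up at tick 64? %s\n",
           pvtcp_just_woke_up(64, 5) ? "yes" : "no");
    return 0;
}
```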

  23. PVTCP – the sender VM is preempted
     - Spurious RTOs can be avoided; no need to detect them at all.
     [Figure: the same timeline as slide 16, plus the TCP retransmit timer’s start time and expiry time. Instead of firing the long-overdue retransmit timer the moment the sender VM wakes up, the ACKs buffered in the driver domain are processed first, so the spurious RTO never happens.]
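
     A sketch of how the sender-side fix can be wired into the retransmit-timer path under the assumptions above. pvtcp_just_woke_up is stubbed in from the previous sketch, and GRACE_TICKS plus the hook name retransmit_timer_fired are illustrative, not the real patch.

```c
/* If the retransmit timer fires right after a detected wakeup, defer it
 * briefly instead of declaring a loss, so buffered ACKs can clear it first. */
#include <stdbool.h>
#include <stdio.h>

#define GRACE_TICKS 2   /* brief deferral so buffered ACKs are handled first */

/* Stub of the detector sketched after slide 22 (hypothetical). */
static bool pvtcp_just_woke_up(unsigned long now, unsigned long window)
{
    const unsigned long wakeup_stamp = 63;   /* pretend the VM resumed at tick 63 */
    return now >= wakeup_stamp && now - wakeup_stamp < window;
}

static void rearm(unsigned long expiry) { printf("re-arm timer for tick %lu\n", expiry); }
static void do_rto(void)                { printf("genuine RTO: retransmit and back off\n"); }

/* Hypothetical hook for the moment the TCP retransmission timer expires. */
static void retransmit_timer_fired(unsigned long now)
{
    if (pvtcp_just_woke_up(now, GRACE_TICKS + 1)) {
        /* Overdue only because the VM was off-CPU: push the expiry slightly
         * into the future; the ACKs buffered by the driver domain will
         * normally cancel the timer before it fires again. */
        rearm(now + GRACE_TICKS);
        return;
    }
    do_rto();   /* no recent wakeup: treat it as a real timeout */
}

int main(void)
{
    retransmit_timer_fired(63);   /* fires right after wakeup -> deferred */
    retransmit_timer_fired(80);   /* fires well after wakeup  -> real RTO */
    return 0;
}
```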
