Reducing Web Latency: The Virtue of Gentle Aggression


  1. Reducing Web Latency: The Virtue of Gentle Aggression. Tobias Flach, Nandita Dukkipati, Andreas Terzis, Barath Raghavan, Yuchung Cheng, Neal Cardwell, Ankur Jain, Shuai Hao, Ethan Katz-Bassett, and Ramesh Govindan. USC & Google. August 14, 2013.

  2. We can improve Google’s response time by 23%. Across billions of client requests, we improved the mean response time by 23%. We achieved this by speeding up ONLY 6% of the transfers, all of which experienced packet loss. The improvement is in the tail: we halved latency at the 99th percentile. For latency-sensitive services, faster transfers mean a better user experience.

  3. Ways to Reduce Latency: The State of the Art. [Diagram label: high loss, high delay]

  4. Ways to Reduce Latency: The State of the Art. Improve the proximity of services to the user. Leverage multi-stage connections. [Diagram labels: high loss, shorter delay; low loss, multiplexed]

  5. Evaluating TCP Performance. Analyzed billions of flows carrying Web traffic between Google and clients. [Diagram labels: high loss, shorter delay; low loss, multiplexed]

  6. Transfers With Loss Are Too Slow. Loss makes Web transfers 5 times slower. Delays are caused by TCP loss detection and recovery. 6% of transfers between Google and clients are lossy. [Delay graph]

  7. Retransmission Timeouts Are Expensive. 77% of losses are recovered by retransmission timeouts. Retransmission timeouts can be 200 times larger than the RTT. This is caused by high RTT variance or a lack of RTT samples.
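
A minimal sketch of how a standard TCP sender derives its RTO, which may help illustrate the two causes named on slide 7. It follows the RFC 6298 update rules (the first sample sets SRTT = R and RTTVAR = R/2, later samples use gains 1/8 and 1/4, and RTO = SRTT + 4 * RTTVAR) but omits the clock-granularity term. The class name `RtoEstimator` and the 200 ms floor `min_rto` are assumptions for illustration and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 1 / 8  # RFC 6298 gain for the smoothed RTT
BETA = 1 / 4   # RFC 6298 gain for the RTT variance
K = 4          # RFC 6298 variance multiplier


@dataclass
class RtoEstimator:
    """Hypothetical helper; the name and min_rto default are assumptions."""
    min_rto: float = 0.2      # assumed floor in seconds
    initial_rto: float = 1.0  # RFC 6298 value used before any RTT sample
    srtt: Optional[float] = None
    rttvar: Optional[float] = None

    def on_rtt_sample(self, r: float) -> None:
        """Fold one RTT measurement (seconds) into SRTT and RTTVAR."""
        if self.srtt is None:
            # First sample: no history yet, so the variance is seeded at R/2.
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - r)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * r

    def rto(self) -> float:
        """Current retransmission timeout in seconds."""
        if self.srtt is None:
            return self.initial_rto
        return max(self.min_rto, self.srtt + K * self.rttvar)
```

Under these assumptions, a connection with no samples falls back to the 1 s initial RTO, which on a hypothetical 5 ms path is 200 times the RTT. A single outlier sample also inflates the RTO through the RTTVAR term.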

  8. Tail Drops Are Expensive. A (single) tail packet drop is very common. Tail packets are twice as likely to be dropped as packets early in a burst. 35% of lossy bursts observe only one packet loss.

  9. Our Motivation and Goal. Loss significantly slows down transfers, due to frequent recovery via slow RTOs, which are caused by tail loss. Our goal: approach the ideal of loss detection and recovery without delay, without making the protocol too aggressive.

  10. Design Space. Columns give the level of aggression (decreasing slightly, increased slightly, increased greatly); rows give the phase. Startup / short flows: IW 10. Steady state: TCP Vegas, CUBIC, Relentless / Decongestion. Loss recovery: Moderation. Loss timeout: DDoS Defense by Offense.

  11. Setting. Public network (to the Frontend Server): controlling server only; preference for solutions without client changes and with middlebox compatibility. Private network (Frontend Server to Backend Server): controlling client and server; latency-sensitive traffic is a small portion of the traffic mix.

  12. Setting. Public network (to the Frontend Server), Reactive: trigger fast retransmit by retransmitting the tail packet early. Private network (Frontend Server to Backend Server), Proactive: avoid retransmissions through packet duplication. Corrective: add redundancy to enable recovery without retransmission, or trigger fast retransmit.

  14. Reactive. The receiver does not know about the loss and therefore cannot send signals back. [Diagram labels: segments 1-3; wait time until RTO; 1]

  15. Reactive. Transmit a new packet or retransmit the previous (tail) packet after two RTTs. This can trigger a selective acknowledgement indicating loss. Speeds up loss detection. [Diagram labels: segments 1-3; wait for two RTTs; 3; 1 2; fast retransmit]

  16. Reactive: Detecting Masked Losses. We cannot ignore the case where a packet loss is recovered by the Reactive probe. Count ACKs and reduce the congestion window if only one ACK for the tail packet is received. [Diagram labels: segments 1-3; wait for two RTTs; 3]

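  17. Reactive: Detecting Masked Losses. One ACK only: loss → reduce congestion window. Two ACKs: no loss. [Diagram: two timelines, each sending segments 1-3, receiving ACK 2, waiting for two RTTs, and probing with segment 3; one timeline receives a single ACK 3, the other receives two]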
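
The Reactive rules on slides 15 to 17 can be sketched as three small decisions. This is an illustrative model only, not the Linux implementation: the function names are hypothetical, segments are plain integers, and ACK loss is ignored.

```python
from typing import List


def probe_delay(rtt: float) -> float:
    """Time to wait for ACKs before probing: two RTTs, as on the slides."""
    return 2 * rtt


def choose_probe(unsent: List[int], in_flight: List[int]) -> int:
    """Pick the probe: a new segment if one is queued, else the tail segment."""
    if unsent:
        return unsent[0]
    # Nothing new to send, so retransmit the last (tail) segment in flight.
    return in_flight[-1]


def classify_tail_acks(acks_for_tail: int) -> str:
    """Interpret ACKs seen for the tail after it was sent twice (original + probe)."""
    if acks_for_tail == 2:
        # Both copies arrived, so nothing was lost.
        return "no loss"
    if acks_for_tail == 1:
        # One copy was lost and the other masked it: still a congestion signal.
        return "loss: reduce congestion window"
    if acks_for_tail == 0:
        # Neither copy was acknowledged; this sketch leaves recovery to the RTO.
        return "no ACK: wait for RTO"
    raise ValueError("at most two ACKs are expected for two copies")
```

For example, `choose_probe([], [1, 2, 3])` selects segment 3, matching the diagram where the tail is resent after two RTTs.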

  20. Proactive. Avoid almost all retransmissions through packet duplication. [Diagram labels: segments 1-3; wait time until RTO; 3]

  21. Proactive. Avoid almost all retransmissions through packet duplication. Duplicates are used if the original transmission was lost. Avoids loss detection and recovery. [Diagram: segments sent as 1, 1 (DUP), 2, 2 (DUP), 3, 3 (DUP)]
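
A toy model of the duplication idea on slide 21, assuming a receiver that keeps the first copy of each sequence number it sees. The packet tuple layout and the names `duplicate` and `receive` are hypothetical, and the loss pattern is passed in explicitly rather than simulated.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (sequence number, copy index, payload); copy 0 is the original, copy 1 the duplicate
Packet = Tuple[int, int, bytes]


def duplicate(segments: List[bytes]) -> List[Packet]:
    """Emit every segment twice, in the order 1, 1 (DUP), 2, 2 (DUP), ..."""
    return [
        (seq, copy, data)
        for seq, data in enumerate(segments, start=1)
        for copy in (0, 1)
    ]


def receive(packets: Iterable[Packet], dropped: Set[Tuple[int, int]]) -> Dict[int, bytes]:
    """Deliver packets not in `dropped`, keeping the first copy per sequence number."""
    delivered: Dict[int, bytes] = {}
    for seq, copy, data in packets:
        if (seq, copy) in dropped:
            continue  # this copy was lost in the network
        delivered.setdefault(seq, data)  # a later copy of the same seq is ignored
    return delivered
```

A segment is missing at the receiver only when both of its copies are dropped, which is why no loss detection is needed in the common case. The cost is that every segment is sent twice.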

  22. A/B Experiment Setup. Baseline: Default on both the Frontend Server and Backend Server connections. Experiment: Reactive and Proactive. Experimented in a production environment serving billions of queries (millions of queries are sampled).

  23. Impact of Reactive and Proactive. 15-day experiment, 2.6 million queries sampled: mean response time reduced by 23%; 99th percentile response time reduced by 47%. Impact of Proactive: retransmission rates on the backend connection dropped from 0.99% to 0.09%. Impact of Reactive: almost 50% of retransmission timeouts on the frontend connection are converted to fast retransmits.

  24. Corrective: The Middle Way. Proactive avoids loss detection and recovery, but has 100% overhead. Reactive speeds up loss detection, but still requires recovery. Corrective sits between the two.

  25. Corrective: Forward Error Correction in TCP. Add redundancy to enable recovery without retransmission. [Diagram labels: segments 1-3; wait time until RTO; 1]

  26. Corrective: Forward Error Correction in TCP. Encodes previously transmitted segments in a few coded segments. XOR coding can recover a single packet loss at the receiver. Recovery status is signaled to the sender to enforce congestion control or fast retransmit. Speeds up loss detection and recovery. [Diagram labels: segments 1-3; ENCODED; no loss detection required]
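
A minimal sketch of the XOR coding idea on slide 26: one coded segment is the XOR of the data segments, so the receiver can rebuild any single missing segment. It assumes equal-length segments and one coded segment per group. The function names are hypothetical, and the real Corrective module's packet format and signaling are not modeled.

```python
from functools import reduce
from typing import Dict, List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length segments")
    return bytes(x ^ y for x, y in zip(a, b))


def encode(segments: List[bytes]) -> bytes:
    """Build one coded segment as the XOR of all data segments."""
    return reduce(xor_bytes, segments)


def recover(received: Dict[int, bytes], coded: bytes, total: int) -> Optional[Dict[int, bytes]]:
    """Return all `total` segments, rebuilding at most one missing index.

    Returns None when more than one segment is missing, because a single
    XOR-coded segment cannot repair that.
    """
    missing = [i for i in range(total) if i not in received]
    if not missing:
        return dict(received)
    if len(missing) > 1:
        return None
    # XOR of the coded segment with every received segment leaves the lost one.
    rebuilt = reduce(xor_bytes, received.values(), coded)
    result = dict(received)
    result[missing[0]] = rebuilt
    return result
```

In this model a single loss is repaired at the receiver without waiting for a retransmission, while two losses in the same group still need the normal TCP recovery path.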

  27. Evaluation: Corrective. Network emulator. Synthetic workloads (fixed-size single queries). Web page downloads (complex multi-resource queries).

  28. Loading nytimes.com with Corrective. Tail latency reduced by more than 20%. But: performance is slightly worse on loss-free connections.

  29. Dealing with Middleboxes. Protocol changes need to account for middlebox interference. We designed our modules for middlebox compatibility or graceful fallback to standard TCP.

  30. Dealing with Middleboxes. Unknown option in data packet is stripped → require option in all packets. ACK number is rewritten for unseen sequences → resend lost segment to update middlebox state. Modified retransmission payload is rejected → detect tampering through checksum.

  31. Conclusion. A measurement study analyzed billions of flows in Google’s production environment. Analysis of loss patterns motivated three designs to improve latency: Reactive, Proactive, and Corrective. Reactive and Proactive improved Google’s mean response time by 23%. Reactive and Corrective are IETF Internet Drafts. Reactive is implemented and enabled by default in Linux 3.10.

  32. Reducing Web Latency: The Virtue of Gentle Aggression. Tobias Flach, Nandita Dukkipati, Andreas Terzis, Barath Raghavan, Yuchung Cheng, Neal Cardwell, Ankur Jain, Shuai Hao, Ethan Katz-Bassett, and Ramesh Govindan. USC & Google. August 14, 2013.

  33. Additional Slides

  34. Why aren’t you just using a more aggressive RTO value? Delays can be the result of delayed ACKs. A more aggressive RTO increases the risk of spurious retransmissions. It severely impacts TCP performance due to a potentially larger number of unnecessary retransmissions and reduction of the congestion window.

  35. Why are you doing Corrective on the Transport Layer? Application layer: applications can selectively protect important data parts; a reliable transport protocol would recover redundant data; does not know which packets are prone to loss. Transport layer: has the necessary data to configure and tune Corrective (e.g. packets with higher loss probability, congestion window size, loss rate, RTT); additional protocol complexity.

  36. Design Space. Columns give the level of aggression (decreasing slightly, increased slightly, increased greatly); rows give the phase. Startup / short flows: IW 10, Proactive. Steady state: TCP Vegas, CUBIC, Relentless / Decongestion. Recovery: Moderation, Corrective. Timeout: DDoS Defense by Offense, Reactive.
