

  1. CompSci 514: Computer Networks Lecture 13 TCP incast and Solutions Xiaowei Yang

  2. Roadmap • Midterm I summary • Midterm II date change: 12/15 – Make up lecture for hurricane – Many can’t make it on that day • Today – Other Data Center network topologies – What’s TCP incast – Solutions

  3. Other Datacenter network topologies

  4. http://www.infotechlead.com/2013/03/28/gartner-data-center-spending-to-grow-3-7-to-146-bi Charles E. Leiserson's 60th Birthday Symposium

  5. What to build? This question has spawned a cottage industry in the computer networking research community. • “Fat-tree” [SIGCOMM 2008] • VL2 [SIGCOMM 2009, CoNEXT 2013] • DCell [SIGCOMM 2008] • BCube [SIGCOMM 2009] • Jellyfish [NSDI 2012]

  6. “Fat-tree” (SIGCOMM 2008) Isomorphic to a butterfly network except at the top level. Bisection width n/2, oversubscription ratio 1
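The fat-tree numbers can be sanity-checked with a little arithmetic. Below is a minimal sketch (mine, not from the deck) that computes the standard sizes of a 3-tier k-ary fat-tree built from k-port switches; the function name and the choice of k = 48 are illustrative.

```python
def fat_tree_parameters(k):
    """Sizes of a canonical 3-tier fat-tree built entirely from k-port switches."""
    assert k % 2 == 0, "k must be even"
    hosts = k ** 3 // 4            # k pods x (k/2 edge switches) x (k/2 hosts each)
    edge = agg = k * (k // 2)      # k pods x k/2 switches per tier
    core = (k // 2) ** 2
    # Each tier carries the same aggregate capacity, so the design is not
    # oversubscribed (ratio 1) and the bisection width is half the host count.
    return {"hosts": hosts, "edge": edge, "agg": agg, "core": core,
            "bisection_width": hosts // 2, "oversubscription": 1.0}

print(fat_tree_parameters(48))   # 48-port switches -> 27,648 hosts
```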

  7. VL2 (SIGCOMM 2009, CoNEXT 2013) Called a Clos network. Oversubscription ratio 1 (but 1 Gbps links at the leaves, 10 Gbps elsewhere)

  8. DCell (SIGCOMM 2008) A “clique of cliques”; servers forward packets. n servers in DCell_0, (n+1)n servers in DCell_1, ((n+1)n + 1)(n+1)n in DCell_2. Oversubscription ratio 1. [Figure: a DCell_1 with n = 4]
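The recursive server count is easy to compute. A small sketch (my illustration) using the recurrence t_k = t_{k-1}(t_{k-1} + 1), where t_0 = n is the size of a DCell_0:

```python
def dcell_servers(n, k):
    """Number of servers in a DCell_k whose DCell_0 building block holds n servers.

    A DCell_k is assembled from (t_{k-1} + 1) copies of DCell_{k-1}, with one link
    between every pair of copies, so t_k = t_{k-1} * (t_{k-1} + 1).
    """
    t = n
    for _ in range(k):
        t = t * (t + 1)
    return t

# With n = 4, as in the slide's DCell_1 figure:
print(dcell_servers(4, 1))   # 20  = (n+1)n
print(dcell_servers(4, 2))   # 420 = ((n+1)n + 1)(n+1)n
```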

  9. BCube (SIGCOMM 2009) A mesh of stars (analogous to a mesh of trees). Bisection width between 0.25n and 0.35n

  10. Jellyfish (NSDI 2012) Random connections, contrasted with the “fat-tree”/butterfly. Bisection width Θ(n)

  11. http://www.itdisasters.com/category/networking-rack/

  12. Datacenter Transport Protocols

  13. Data Center Packet Transport • Large purpose-built DCs – Huge investment: R&D, business • Transport inside the DC – TCP rules (99.9% of traffic) • How’s TCP doing?

  14. TCP does not meet demands of DC applications • Goal: – Low latency – High throughput • TCP does not meet these demands. – Incast: suffers from bursty packet drops – Large queues: add significant latency; waste precious buffers, esp. bad with shallow-buffered switches

  15. Case Study: Microsoft Bing • Measurements from a 6,000-server production cluster • Instrumentation passively collects logs – Application-level – Socket-level – Selected packet-level • More than 150TB of compressed data over a month

  16. Partition/Aggregate Application Structure [Figure: a query fans out from a top-level aggregator (TLA) to mid-level aggregators (MLAs) and then to worker nodes, with deadlines of 250ms, 50ms, and 10ms at successive levels; the workers return Picasso quotes as example results.] • Time is money – Strict deadlines (SLAs) – Missed deadline → lower quality result

  17. Generality of Partition/Aggregate • The foundation for many large-scale web applications – Web search, social network composition, ad selection, etc. • Example: Facebook – Partition/Aggregate ≈ Multiget – Aggregators: web servers – Workers: memcached servers (memcached protocol)
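To make the pattern concrete, here is a minimal sketch (my illustration, not code from the deck) of an aggregator fanning a query out to workers and keeping only the answers that beat the deadline; the worker names, the `query_worker` stub, and the 10ms deadline are assumptions.

```python
import concurrent.futures

DEADLINE_S = 0.010   # hypothetical per-level deadline (10 ms, as at the leaf level)

def query_worker(worker, query):
    """Hypothetical stand-in for an RPC to one worker (e.g. a memcached multiget shard)."""
    return f"{worker}: partial result for {query!r}"

def aggregate(workers, query):
    """Fan the query out, keep whatever arrives before the deadline, drop the rest.

    A missed deadline does not fail the request; the aggregator simply returns a
    lower-quality (partial) result, which is the trade-off the slides describe.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(query_worker, w, query) for w in workers]
        done, late = concurrent.futures.wait(futures, timeout=DEADLINE_S)
        for f in late:                     # late workers: ignore their answers
            f.cancel()
        return [f.result() for f in done]

print(aggregate([f"worker-{i}" for i in range(4)], "picasso quotes"))
```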

  18. Workloads • Partition/Aggregate (Query): delay-sensitive • Short messages [50KB–1MB] (coordination, control state): delay-sensitive • Large flows [1MB–50MB] (data update): throughput-sensitive

  19. Workload characterization [Figure 3: CDFs of the time between arrival of new work for the aggregator (queries) and between background flows between servers (update and short message).]

  20. Workload characterization [Figure 4: PDF of the flow size distribution for background traffic. The PDF of total bytes shows the probability that a randomly selected byte comes from a flow of a given size.]

  21. Workload characterization [Figure 5: Distribution of the number of concurrent connections, for all flows and for flows > 1MB.]

  22. Impairments • Incast • Queue Buildup • Buffer Pressure

  23. Incast • Synchronized mice collide – Caused by Partition/Aggregate [Figure: workers 1–4 respond to the aggregator at once; a dropped response leaves its flow stuck in a TCP timeout, with RTOmin = 300 ms]
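A back-of-the-envelope sketch (mine, with assumed numbers) of why the synchronized responses overflow a shallow switch port buffer and end up waiting out a retransmission timeout:

```python
# Assumed, illustrative numbers -- not measurements from the deck.
workers = 40                       # fan-in of one aggregation step
window_pkts = 2                    # packets each worker has in flight when responses collide
pkt_bytes = 1500
port_buffer_bytes = 100 * 1024     # shallow-buffered ToR port (~100 KB, assumed)

arriving = workers * window_pkts * pkt_bytes
print(f"arriving burst: {arriving} B vs port buffer: {port_buffer_bytes} B")
if arriving > port_buffer_bytes:
    # The tail of the burst is dropped; with only a couple of packets per flow there are
    # often too few duplicate ACKs for fast retransmit, so the sender waits out RTOmin.
    print("burst overflows the buffer -> drops -> some flows stall for a full RTO")
```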

  24. Incast Really Happens [Figure: MLA query completion time (ms) over a day] • Requests are jittered over a 10ms window; the 99.9th percentile is being tracked • Jittering trades off the median against high percentiles • Jittering switched off around 8:30 am
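The jittering workaround amounts to adding a small random delay in front of each request. A minimal sketch (my illustration; the window size comes from the slide, the `send_fn` callback is hypothetical):

```python
import random
import time

JITTER_WINDOW_S = 0.010   # 10 ms window, as described on the slide

def send_with_jitter(send_fn, request):
    """Delay each request by a uniform offset in [0, 10 ms) so the workers' responses
    do not all collide at the aggregator's switch port. This reduces incast timeouts
    (better 99.9th percentile) at the cost of a worse median, as the slide notes."""
    time.sleep(random.uniform(0, JITTER_WINDOW_S))
    return send_fn(request)
```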

  25. Queue Buildup [Figure: two senders share a bottleneck link to one receiver] • Big flows build up queues – Increased latency for short flows • Measurements in the Bing cluster – For 90% of packets: RTT < 1ms – For 10% of packets: 1ms < RTT < 15ms

  26. Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication Vijay Vasudevan, Amar Phanishayee, Hiral Shah, Elie Krevat, David Andersen, Greg Ganger, Garth Gibson, Brian Mueller* Carnegie Mellon University, *Panasas Inc.

  27. Datacenter TCP Request-Response [Figure: a client issues a request through a switch to several servers; one response is dropped, the flow stalls, and the response is resent only after a 200ms response delay.]

  28. Applications Sensitive to 200ms TCP Timeouts • “Drive-bys” affecting single-flow request/response • Barrier-Sync workloads • Parallel cluster filesystems (Incast workloads) • Massive multi-server queries (e.g., previous talk) • Latency-sensitive, customer-facing

  29. Main Takeaways • Problem: 200ms TCP timeouts can cripple datacenter apps • Solution: Enable microsecond retransmissions • Can improve datacenter app throughput/latency • Safe in the wide-area

  30. The Datacenter Environment [Figure: client, switch, servers] • Commodity Ethernet switches, 1–10 Gbps links • 10–100us latency • Under heavy load, packet losses are frequent.

  31. TCP: Loss Recovery Comparison • Timeout-driven recovery is painfully slow (ms); data-driven recovery is super fast (us) [Figure: with duplicate ACKs the sender retransmits the lost segment after about one RTT; with no feedback it must wait out the retransmission timeout (RTO)] • minRTO = 200.0ms vs. DC latency = 0.1ms: a TCP timeout lasts 1000 RTTs!
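A minimal sketch (my illustration, not the paper's code) contrasting the two recovery paths: data-driven recovery fires after three duplicate ACKs, roughly one RTT later, while timeout-driven recovery has to wait out the RTO floor.

```python
RTT_S = 0.0001          # ~100 us datacenter round trip (from the slide)
MIN_RTO_S = 0.200       # default minimum RTO (from the slide)
DUP_ACK_THRESHOLD = 3   # standard fast-retransmit trigger

def recovery_delay(dup_acks_received, srtt=RTT_S, rttvar=RTT_S / 2):
    """Rough time until a lost segment is retransmitted.

    With enough duplicate ACKs, fast retransmit recovers in about one RTT.
    Otherwise the sender waits for the RTO, clamped to the 200 ms floor --
    on the order of a thousand datacenter RTTs.
    """
    if dup_acks_received >= DUP_ACK_THRESHOLD:
        return srtt                             # data-driven: ~1 RTT
    return max(MIN_RTO_S, srtt + 4 * rttvar)    # timeout-driven: RTO-bound

print(recovery_delay(3))   # ~0.0001 s
print(recovery_delay(0))   # 0.2 s with these numbers
```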

  32. RTO Estimation and Minimum Bound • Jacobson’s TCP RTO Estimator • RTO = SRTT + 4*RTTVAR • Minimum RTO bound = 200ms • Actual RTO Timer = max(200ms, RTO)
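A minimal sketch of the estimator the slide describes; the 1/8 and 1/4 smoothing gains and the update order are the usual Jacobson/RFC 6298 choices and are my assumption, not stated on the slide.

```python
MIN_RTO_S = 0.200            # 200 ms minimum bound, as on the slide
ALPHA, BETA = 1 / 8, 1 / 4   # standard smoothing gains (assumed)

class RtoEstimator:
    """Tracks SRTT and RTTVAR from RTT samples and derives the retransmission timer."""

    def __init__(self, first_rtt):
        self.srtt = first_rtt
        self.rttvar = first_rtt / 2

    def update(self, rtt_sample):
        self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - rtt_sample)
        self.srtt = (1 - ALPHA) * self.srtt + ALPHA * rtt_sample
        return self.rto()

    def rto(self):
        # RTO = SRTT + 4*RTTVAR, then clamped: actual timer = max(200ms, RTO)
        return max(MIN_RTO_S, self.srtt + 4 * self.rttvar)

est = RtoEstimator(first_rtt=0.0001)   # 100 us datacenter RTT
print(est.rto())                       # 0.2 -- the 200 ms floor dominates
```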

  33. The Incast Workload [Figure: a synchronized read — the client requests a data block striped across four storage servers, one Server Request Unit (SRU) per server; the client sends the next batch of requests only after all responses arrive.]
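A sketch of this barrier-synchronized request pattern (my illustration; the server names and the `fetch_sru` stub are hypothetical):

```python
import concurrent.futures

def fetch_sru(server, block_id):
    """Hypothetical RPC: fetch this server's Server Request Unit (SRU) of one block."""
    return b"..."   # placeholder payload

def synchronized_read(servers, block_ids):
    """Read blocks one at a time; each block is striped across all servers.

    The requests for the next block are only issued after *every* server has
    answered for the current one, so a single timed-out response idles the
    client's link for the whole RTO.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(servers)) as pool:
        for block in block_ids:
            futures = [pool.submit(fetch_sru, s, block) for s in servers]
            srus = [f.result() for f in futures]   # barrier: wait for all SRUs
            yield b"".join(srus)

for block in synchronized_read([f"srv{i}" for i in range(4)], range(3)):
    pass   # consume the reassembled blocks
```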

  34. Incast Workload Overfills Buffers [Figure: the synchronized responses overfill the switch's port buffer — responses 1–3 complete, response 4 is dropped, the client's link goes idle, and response 4 is resent only after a timeout.]

  35. Client Link Utilization [Figure: utilization trace of the client's link — the link sits idle for 200ms after the drop.]

  36. 200ms Timeouts Cause Throughput Collapse [Figure: goodput (Mbps) vs. number of servers in the cluster environment — 1Gbps Ethernet, 100us delay, 200ms RTO, S50 switch, 1MB block size; goodput collapses as more servers are added.] • [Nagle04] called this Incast and provided an app-level workaround • Cause of throughput collapse: 200ms TCP timeouts • Prior work: other TCP variants did not prevent TCP timeouts [Phanishayee:FAST2008]

  37. Latency-sensitive Apps • Request for 4MB of data sharded across 16 servers (256KB each) • How long does it take for all of the 4MB of data to return?
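A rough baseline for the answer (my arithmetic, assuming a 1Gbps client link and ~100us RTT, consistent with the earlier environment slide):

```python
BLOCK_BYTES = 4 * 2**20   # 4 MB total: 256 KB from each of 16 servers
LINK_BPS = 1e9            # 1 Gbps client link (assumed)
RTT_S = 0.0001            # ~100 us
MIN_RTO_S = 0.200

ideal_s = BLOCK_BYTES * 8 / LINK_BPS + RTT_S    # serialization time + one round trip
print(f"ideal transfer time    : {ideal_s * 1000:.1f} ms")                 # ~33.7 ms
print(f"with one RTO-bound loss: {(ideal_s + MIN_RTO_S) * 1000:.1f} ms")   # ~233.7 ms
```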

  38. Timeouts Increase Latency (256KB from 16 servers) [Figure: “4MB Block Transfer Time Distribution with No RTO bound” — number of occurrences vs. block transfer time (ms); one cluster at the ideal response time, plus responses delayed by 200ms TCP timeout(s), out to 450 ms.]

  39. Outline • Problem Description, Examples • Solution: Microsecond TCP Retransmissions • Is it safe?

  40. First attempt: reducing RTOmin [Figure 2: Reducing RTOmin in simulation from the current default of 200ms to microseconds improves goodput (block size = 1MB, buffer = 32KB; 4–128 servers). Figure 3: Experiments on a real cluster (block size = 1MB, buffer ≈ 32KB; 4–16 servers) validate the simulation result that reducing RTOmin to microseconds improves goodput.]
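A toy model (entirely my own, with assumed parameters) of why the goodput curves recover as RTOmin shrinks: each synchronized 1MB block transfer pays its serialization time plus, with some probability, one stall of at least RTOmin.

```python
RTT_S = 0.0001        # ~100 us
BLOCK_S = 0.008       # ~1 MB at 1 Gbps (assumed)
STALL_PROB = 0.5      # chance that some flow in the batch times out (assumed)

def expected_goodput_mbps(rto_min_s):
    """Expected goodput of one synchronized 1 MB block transfer under this toy model."""
    stall = max(rto_min_s, 3 * RTT_S)            # a stalled flow waits at least RTOmin
    expected_time = BLOCK_S + RTT_S + STALL_PROB * stall
    return 8e6 / expected_time / 1e6             # bits per block / time, in Mbps

for rto_min in (200e-6, 1e-3, 10e-3, 200e-3):
    print(f"RTOmin = {rto_min * 1e3:6.1f} ms -> ~{expected_goodput_mbps(rto_min):6.1f} Mbps")
```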
