

  1. CompSci 514: Computer Networks Lecture 13 TCP incast and Solutions Xiaowei Yang

  2. Roadmap • Midterm I summary • Midterm II date change: 12/15 – Make up lecture for hurricane – Many can’t make it on that day • Today – Other Data Center network topologies – What’s TCP incast – Solutions

  3. Other Datacenter network topologies

  4. http://www.infotechlead.com/2013/03/28/gartner-data-center-spending-to-grow-3-7-to-146-bi Charles E. Leiserson's 60th Birthday Symposium

  5. What to build? This question has spawned a cottage industry in the computer networking research community. • “Fat-tree” [SIGCOMM 2008] • VL2 [SIGCOMM 2009, CoNEXT 2013] • DCell [SIGCOMM 2008] • BCube [SIGCOMM 2009] • Jellyfish [NSDI 2012]

  6. “Fat-tree” (SIGCOMM 2008) Isomorphic to a butterfly network except at the top level. Bisection width n/2, oversubscription ratio 1
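The fat-tree numbers can be sanity-checked with a little arithmetic. Below is a minimal sketch (mine, not from the deck) that computes the standard sizes of a 3-tier k-ary fat-tree built from k-port switches; the function name and the choice of k = 48 are illustrative.

```python
def fat_tree_parameters(k):
    """Sizes of a canonical 3-tier fat-tree built entirely from k-port switches."""
    assert k % 2 == 0, "k must be even"
    hosts = k ** 3 // 4            # k pods x (k/2 edge switches) x (k/2 hosts each)
    edge = agg = k * (k // 2)      # k pods x k/2 switches per tier
    core = (k // 2) ** 2
    # Each tier carries the same aggregate capacity, so the design is not
    # oversubscribed (ratio 1) and the bisection width is half the host count.
    return {"hosts": hosts, "edge": edge, "agg": agg, "core": core,
            "bisection_width": hosts // 2, "oversubscription": 1.0}

print(fat_tree_parameters(48))   # 48-port switches -> 27,648 hosts
```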

  7. VL2 (SIGCOMM 2009, CoNEXT 2013) Called a Clos network. Oversubscription ratio 1 (but 1 Gbps links at the leaves, 10 Gbps elsewhere)

  8. DCell (SIGCOMM 2008) A “clique of cliques”; servers forward packets. n servers in DCell_0, (n+1)n servers in DCell_1, ((n+1)n + 1)(n+1)n in DCell_2. Oversubscription ratio 1. [Figure: a DCell_1 with n = 4]
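The recursive server count is easy to compute. A small sketch (my illustration) using the recurrence t_k = t_{k-1}(t_{k-1} + 1), where t_0 = n is the size of a DCell_0:

```python
def dcell_servers(n, k):
    """Number of servers in a DCell_k whose DCell_0 building block holds n servers.

    A DCell_k is assembled from (t_{k-1} + 1) copies of DCell_{k-1}, with one link
    between every pair of copies, so t_k = t_{k-1} * (t_{k-1} + 1).
    """
    t = n
    for _ in range(k):
        t = t * (t + 1)
    return t

# With n = 4, as in the slide's DCell_1 figure:
print(dcell_servers(4, 1))   # 20  = (n+1)n
print(dcell_servers(4, 2))   # 420 = ((n+1)n + 1)(n+1)n
```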

  9. BCube (SIGCOMM 2009) A mesh of stars (analogous to a mesh of trees). Bisection width between 0.25n and 0.35n

  10. Jellyfish (NSDI 2012) Random connections, contrasted with the “fat-tree”/butterfly. Bisection width Θ(n)

  11. http://www.itdisasters.com/category/networking-rack/

  12. Datacenter Transport Protocols

  13. Data Center Packet Transport • Large purpose-built DCs – Huge investment: R&D, business • Transport inside the DC – TCP rules (99.9% of traffic) • How’s TCP doing?

  14. TCP does not meet demands of DC applications • Goal: – Low latency – High throughput • TCP does not meet these demands. – Incast: suffers from bursty packet drops – Large queues: add significant latency; waste precious buffers, esp. bad with shallow-buffered switches

  15. Case Study: Microsoft Bing • Measurements from a 6,000-server production cluster • Instrumentation passively collects logs – Application-level – Socket-level – Selected packet-level • More than 150TB of compressed data over a month

  16. Partition/Aggregate Application Structure [Figure: a query fans out from a top-level aggregator (TLA) to mid-level aggregators (MLAs) and then to worker nodes, with deadlines of 250ms, 50ms, and 10ms at successive levels; the workers return Picasso quotes as example results.] • Time is money – Strict deadlines (SLAs) – Missed deadline → lower quality result

  17. Generality of Partition/Aggregate • The foundation for many large-scale web applications – Web search, social network composition, ad selection, etc. • Example: Facebook – Partition/Aggregate ≈ Multiget – Aggregators: web servers – Workers: memcached servers (memcached protocol)
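To make the pattern concrete, here is a minimal sketch (my illustration, not code from the deck) of an aggregator fanning a query out to workers and keeping only the answers that beat the deadline; the worker names, the `query_worker` stub, and the 10ms deadline are assumptions.

```python
import concurrent.futures

DEADLINE_S = 0.010   # hypothetical per-level deadline (10 ms, as at the leaf level)

def query_worker(worker, query):
    """Hypothetical stand-in for an RPC to one worker (e.g. a memcached multiget shard)."""
    return f"{worker}: partial result for {query!r}"

def aggregate(workers, query):
    """Fan the query out, keep whatever arrives before the deadline, drop the rest.

    A missed deadline does not fail the request; the aggregator simply returns a
    lower-quality (partial) result, which is the trade-off the slides describe.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(query_worker, w, query) for w in workers]
        done, late = concurrent.futures.wait(futures, timeout=DEADLINE_S)
        for f in late:                     # late workers: ignore their answers
            f.cancel()
        return [f.result() for f in done]

print(aggregate([f"worker-{i}" for i in range(4)], "picasso quotes"))
```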

  18. Workloads • Partition/Aggregate (Query): delay-sensitive • Short messages [50KB–1MB] (coordination, control state): delay-sensitive • Large flows [1MB–50MB] (data update): throughput-sensitive

  19. Workload characterization [Figure 3: CDFs of the time between arrival of new work for the aggregator (queries) and between background flows between servers (update and short message).]

  20. Workload characterization [Figure 4: PDF of the flow size distribution for background traffic. The PDF of total bytes shows the probability that a randomly selected byte comes from a flow of a given size.]

  21. Workload characterization [Figure 5: Distribution of the number of concurrent connections, for all flows and for flows > 1MB.]

  22. Impairments • Incast • Queue Buildup • Buffer Pressure

  23. Incast • Synchronized mice collide – Caused by Partition/Aggregate [Figure: workers 1–4 respond to the aggregator at once; a dropped response leaves its flow stuck in a TCP timeout, with RTOmin = 300 ms]
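A back-of-the-envelope sketch (mine, with assumed numbers) of why the synchronized responses overflow a shallow switch port buffer and end up waiting out a retransmission timeout:

```python
# Assumed, illustrative numbers -- not measurements from the deck.
workers = 40                       # fan-in of one aggregation step
window_pkts = 2                    # packets each worker has in flight when responses collide
pkt_bytes = 1500
port_buffer_bytes = 100 * 1024     # shallow-buffered ToR port (~100 KB, assumed)

arriving = workers * window_pkts * pkt_bytes
print(f"arriving burst: {arriving} B vs port buffer: {port_buffer_bytes} B")
if arriving > port_buffer_bytes:
    # The tail of the burst is dropped; with only a couple of packets per flow there are
    # often too few duplicate ACKs for fast retransmit, so the sender waits out RTOmin.
    print("burst overflows the buffer -> drops -> some flows stall for a full RTO")
```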

  24. Incast Really Happens [Figure: MLA query completion time (ms) over a day] • Requests are jittered over a 10ms window; the 99.9th percentile is being tracked • Jittering trades off the median against high percentiles • Jittering switched off around 8:30 am
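The jittering workaround amounts to adding a small random delay in front of each request. A minimal sketch (my illustration; the window size comes from the slide, the `send_fn` callback is hypothetical):

```python
import random
import time

JITTER_WINDOW_S = 0.010   # 10 ms window, as described on the slide

def send_with_jitter(send_fn, request):
    """Delay each request by a uniform offset in [0, 10 ms) so the workers' responses
    do not all collide at the aggregator's switch port. This reduces incast timeouts
    (better 99.9th percentile) at the cost of a worse median, as the slide notes."""
    time.sleep(random.uniform(0, JITTER_WINDOW_S))
    return send_fn(request)
```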

  25. Queue Buildup [Figure: two senders share a bottleneck link to one receiver] • Big flows build up queues – Increased latency for short flows • Measurements in the Bing cluster – For 90% of packets: RTT < 1ms – For 10% of packets: 1ms < RTT < 15ms

  26. Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication Vijay Vasudevan, Amar Phanishayee, Hiral Shah, Elie Krevat, David Andersen, Greg Ganger, Garth Gibson, Brian Mueller* Carnegie Mellon University, *Panasas Inc.

  27. Datacenter TCP Request-Response [Figure: a client issues a request through a switch to several servers; one response is dropped, the flow stalls, and the response is resent only after a 200ms response delay.]

  28. Applications Sensitive to 200ms TCP Timeouts • “Drive-bys” affecting single-flow request/response • Barrier-Sync workloads • Parallel cluster filesystems (Incast workloads) • Massive multi-server queries (e.g., previous talk) • Latency-sensitive, customer-facing

  29. Main Takeaways • Problem: 200ms TCP timeouts can cripple datacenter apps • Solution: Enable microsecond retransmissions • Can improve datacenter app throughput/latency • Safe in the wide-area

  30. The Datacenter Environment [Figure: client, switch, servers] • Commodity Ethernet switches, 1–10 Gbps links • 10–100us latency • Under heavy load, packet losses are frequent.

  31. TCP: Loss Recovery Comparison • Timeout-driven recovery is painfully slow (ms); data-driven recovery is super fast (us) [Figure: with duplicate ACKs the sender retransmits the lost segment after about one RTT; with no feedback it must wait out the retransmission timeout (RTO)] • minRTO = 200.0ms vs. DC latency = 0.1ms: a TCP timeout lasts 1000 RTTs!
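A minimal sketch (my illustration, not the paper's code) contrasting the two recovery paths: data-driven recovery fires after three duplicate ACKs, roughly one RTT later, while timeout-driven recovery has to wait out the RTO floor.

```python
RTT_S = 0.0001          # ~100 us datacenter round trip (from the slide)
MIN_RTO_S = 0.200       # default minimum RTO (from the slide)
DUP_ACK_THRESHOLD = 3   # standard fast-retransmit trigger

def recovery_delay(dup_acks_received, srtt=RTT_S, rttvar=RTT_S / 2):
    """Rough time until a lost segment is retransmitted.

    With enough duplicate ACKs, fast retransmit recovers in about one RTT.
    Otherwise the sender waits for the RTO, clamped to the 200 ms floor --
    on the order of a thousand datacenter RTTs.
    """
    if dup_acks_received >= DUP_ACK_THRESHOLD:
        return srtt                             # data-driven: ~1 RTT
    return max(MIN_RTO_S, srtt + 4 * rttvar)    # timeout-driven: RTO-bound

print(recovery_delay(3))   # ~0.0001 s
print(recovery_delay(0))   # 0.2 s with these numbers
```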

  32. RTO Estimation and Minimum Bound • Jacobson’s TCP RTO Estimator • RTO = SRTT + 4*RTTVAR • Minimum RTO bound = 200ms • Actual RTO Timer = max(200ms, RTO)
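A minimal sketch of the estimator the slide describes; the 1/8 and 1/4 smoothing gains and the update order are the usual Jacobson/RFC 6298 choices and are my assumption, not stated on the slide.

```python
MIN_RTO_S = 0.200            # 200 ms minimum bound, as on the slide
ALPHA, BETA = 1 / 8, 1 / 4   # standard smoothing gains (assumed)

class RtoEstimator:
    """Tracks SRTT and RTTVAR from RTT samples and derives the retransmission timer."""

    def __init__(self, first_rtt):
        self.srtt = first_rtt
        self.rttvar = first_rtt / 2

    def update(self, rtt_sample):
        self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - rtt_sample)
        self.srtt = (1 - ALPHA) * self.srtt + ALPHA * rtt_sample
        return self.rto()

    def rto(self):
        # RTO = SRTT + 4*RTTVAR, then clamped: actual timer = max(200ms, RTO)
        return max(MIN_RTO_S, self.srtt + 4 * self.rttvar)

est = RtoEstimator(first_rtt=0.0001)   # 100 us datacenter RTT
print(est.rto())                       # 0.2 -- the 200 ms floor dominates
```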

  33. The Incast Workload [Figure: a synchronized read — the client requests a data block striped across four storage servers, one Server Request Unit (SRU) per server; the client sends the next batch of requests only after all responses arrive.]
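A sketch of this barrier-synchronized request pattern (my illustration; the server names and the `fetch_sru` stub are hypothetical):

```python
import concurrent.futures

def fetch_sru(server, block_id):
    """Hypothetical RPC: fetch this server's Server Request Unit (SRU) of one block."""
    return b"..."   # placeholder payload

def synchronized_read(servers, block_ids):
    """Read blocks one at a time; each block is striped across all servers.

    The requests for the next block are only issued after *every* server has
    answered for the current one, so a single timed-out response idles the
    client's link for the whole RTO.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(servers)) as pool:
        for block in block_ids:
            futures = [pool.submit(fetch_sru, s, block) for s in servers]
            srus = [f.result() for f in futures]   # barrier: wait for all SRUs
            yield b"".join(srus)

for block in synchronized_read([f"srv{i}" for i in range(4)], range(3)):
    pass   # consume the reassembled blocks
```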

  34. Incast Workload Overfills Buffers [Figure: the synchronized responses overfill the switch's port buffer — responses 1–3 complete, response 4 is dropped, the client's link goes idle, and response 4 is resent only after a timeout.]

  35. Client Link Utilization [Figure: utilization trace of the client's link — the link sits idle for 200ms after the drop.]

  36. 200ms Timeouts Cause Throughput Collapse [Figure: goodput (Mbps) vs. number of servers in the cluster environment — 1Gbps Ethernet, 100us delay, 200ms RTO, S50 switch, 1MB block size; goodput collapses as more servers are added.] • [Nagle04] called this Incast and provided an app-level workaround • Cause of throughput collapse: 200ms TCP timeouts • Prior work: other TCP variants did not prevent TCP timeouts [Phanishayee:FAST2008]

  37. Latency-sensitive Apps • Request for 4MB of data sharded across 16 servers (256KB each) • How long does it take for all of the 4MB of data to return?
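A rough baseline for the answer (my arithmetic, assuming a 1Gbps client link and ~100us RTT, consistent with the earlier environment slide):

```python
BLOCK_BYTES = 4 * 2**20   # 4 MB total: 256 KB from each of 16 servers
LINK_BPS = 1e9            # 1 Gbps client link (assumed)
RTT_S = 0.0001            # ~100 us
MIN_RTO_S = 0.200

ideal_s = BLOCK_BYTES * 8 / LINK_BPS + RTT_S    # serialization time + one round trip
print(f"ideal transfer time    : {ideal_s * 1000:.1f} ms")                 # ~33.7 ms
print(f"with one RTO-bound loss: {(ideal_s + MIN_RTO_S) * 1000:.1f} ms")   # ~233.7 ms
```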

  38. Timeouts Increase Latency (256KB from 16 servers) [Figure: “4MB Block Transfer Time Distribution with No RTO bound” — number of occurrences vs. block transfer time (ms); one cluster at the ideal response time, plus responses delayed by 200ms TCP timeout(s), out to 450 ms.]

  39. Outline • Problem Description, Examples • Solution: Microsecond TCP Retransmissions • Is it safe?

  40. First attempt: reducing RTOmin [Figure 2: Reducing RTOmin in simulation from the current default of 200ms to microseconds improves goodput (block size = 1MB, buffer = 32KB; 4–128 servers). Figure 3: Experiments on a real cluster (block size = 1MB, buffer ≈ 32KB; 4–16 servers) validate the simulation result that reducing RTOmin to microseconds improves goodput.]
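A toy model (entirely my own, with assumed parameters) of why the goodput curves recover as RTOmin shrinks: each synchronized 1MB block transfer pays its serialization time plus, with some probability, one stall of at least RTOmin.

```python
RTT_S = 0.0001        # ~100 us
BLOCK_S = 0.008       # ~1 MB at 1 Gbps (assumed)
STALL_PROB = 0.5      # chance that some flow in the batch times out (assumed)

def expected_goodput_mbps(rto_min_s):
    """Expected goodput of one synchronized 1 MB block transfer under this toy model."""
    stall = max(rto_min_s, 3 * RTT_S)            # a stalled flow waits at least RTOmin
    expected_time = BLOCK_S + RTT_S + STALL_PROB * stall
    return 8e6 / expected_time / 1e6             # bits per block / time, in Mbps

for rto_min in (200e-6, 1e-3, 10e-3, 200e-3):
    print(f"RTOmin = {rto_min * 1e3:6.1f} ms -> ~{expected_goodput_mbps(rto_min):6.1f} Mbps")
```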
