6.888 Lecture 5: Flow Scheduling
Mohammad Alizadeh, Spring 2016
Datacenter Transport
Goal: complete flows quickly / meet deadlines
- Short flows (e.g., query, coordination): low latency
- Large flows (e.g., data update, backup): high throughput
Low Latency Congestion Control (DCTCP, RCP, XCP, ...)
- Keep network queues small (at high throughput)
- Implicitly prioritize mice
Can we do better?
The Opportunity
Many DC apps/platforms know flow size or deadlines in advance:
- Key/value stores
- Data processing
- Web search
[Figure: partition/aggregate structure with a front-end server, aggregators, and workers]
What You Said
Amy: "Many papers that propose new network protocols for datacenter networks (such as PDQ and pFabric) argue that these will improve 'user experience for web services'. However, none seem to evaluate the impact of their proposed scheme on user experience... I remain skeptical that small protocol changes really have drastic effects on end-to-end metrics such as page load times, which are typically measured in seconds rather than in microseconds."
[Figure: the datacenter fabric abstracted as one giant switch, with sending hosts H1-H9 on the TX side and receiving hosts H1-H9 on the RX side]
DC transport = flow scheduling on a giant switch
Objective?
- Minimize average FCT
- Minimize missed deadlines
Constraints: ingress & egress port capacities
Example: Minimize Average FCT
Flow A: size 1, Flow B: size 2, Flow C: size 3
All three flows arrive at the same time and share the same bottleneck link.
(Adapted from a slide by Chi-Yao Hong, UIUC)
Example: Minimize Average FCT
- Shortest flow first: FCTs 1, 3, 6; mean 3.33
- Fair sharing: FCTs 3, 5, 6; mean 4.67
[Figure: throughput vs. time under each schedule]
(Adapted from a slide by Chi-Yao Hong, UIUC)
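A minimal sketch (my own addition, not from the slides) that reproduces these numbers for flows of size 1, 2, and 3 sharing a bottleneck link of rate 1:

```python
def fct_sjf(sizes):
    """Shortest flow first: run flows back to back in order of size."""
    t, fcts = 0.0, []
    for s in sorted(sizes):
        t += s                     # link dedicated to one flow at a time
        fcts.append(t)
    return fcts

def fct_fair_sharing(sizes):
    """Processor sharing: all active flows split the link equally."""
    remaining = sorted(sizes)
    t, fcts = 0.0, []
    while remaining:
        n = len(remaining)
        dt = remaining[0] * n      # time until the smallest flow finishes at rate 1/n
        t += dt
        served = remaining[0]
        remaining = [r - served for r in remaining[1:]]
        fcts.append(t)
    return fcts

sizes = [1, 2, 3]
print(fct_sjf(sizes), sum(fct_sjf(sizes)) / 3)                    # [1, 3, 6], mean 3.33
print(fct_fair_sharing(sizes), sum(fct_fair_sharing(sizes)) / 3)  # [3, 5, 6], mean 4.67
```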
Optimal Flow Scheduling for Average FCT
NP-hard for a multi-link network [Bar-Noy et al.]
- Shortest Flow First: 2-approximation
How can we schedule flows based on flow criticality in a distributed way?
[Figure: some transmission order across the fabric]
PDQ
(Several slides based on a presentation by Chi-Yao Hong, UIUC)
PDQ: Distributed Explicit Rate Control
Packet headers carry the flow's criticality and a rate field (e.g., rate = 10); each switch preferentially allocates bandwidth to critical flows.
Contrast: traditional explicit rate control (e.g., XCP, RCP) targets fair sharing.
[Figure: sender, switches, receiver; the packet header is updated at each switch]
Contrast with Traditional Explicit Rate Control
Traditional schemes (e.g., RCP, XCP) target fair sharing.
- Each switch determines a "fair share" rate based on local congestion: R ← R − k · congestion-measure
- Sources use the smallest rate advertised on their path.
[Figure: sender, switches, receiver; packet header carries the advertised rate]
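A rough sketch of the fair-share update rule on this slide (my own illustration, not code from RCP or XCP; the congestion measure and constants are placeholders):

```python
def update_fair_share_rate(R, input_rate, queue_bytes, capacity,
                           rtt, alpha=0.5, beta=0.25):
    """One switch's periodic update: R <- R - k * congestion_measure."""
    # Congestion is positive when the link is overloaded or a standing
    # queue has built up, negative when there is spare capacity.
    congestion = (input_rate - capacity) + beta * (queue_bytes / rtt)
    R = R - alpha * congestion
    return max(R, 0.0)

def source_rate(advertised_rates):
    """Each source uses the smallest rate advertised on its path."""
    return min(advertised_rates)
```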
Challenges
- PDQ switches need to agree on rate decisions
- Low utilization during flow switching
- Congestion and queue buildup
- Paused flows need to know when to start
Challenge: Switches need to agree on rate decisions
Packet header fields: criticality, rate = 10, pauseby = X
- What can go wrong without consensus?
- How do PDQ switches reach consensus?
- Why is "pauseby" needed?
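A hypothetical sketch of the allocation idea these header fields support; this is my simplification, not the paper's exact algorithm. The switch gives bandwidth to the most critical flows first and pauses the rest, recording its own ID in pauseby so the sender knows which switch paused it. The switch ID, capacity, and greedy loop below are illustrative.

```python
SWITCH_ID = "S1"        # placeholder identifier for this switch
LINK_CAPACITY = 10.0    # Gbps, illustrative

def allocate(flows):
    """flows: dicts with 'id', 'criticality' (lower = more critical), 'demand'."""
    remaining = LINK_CAPACITY
    decisions = {}
    for f in sorted(flows, key=lambda f: f["criticality"]):
        if remaining > 0:
            rate = min(f["demand"], remaining)   # critical flows get the link first
            remaining -= rate
            decisions[f["id"]] = {"rate": rate, "pauseby": None}
        else:
            # Paused flow: rate 0, and the switch stamps itself as the pauser.
            decisions[f["id"]] = {"rate": 0.0, "pauseby": SWITCH_ID}
    return decisions
```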
What You Said
Austin: "It is an interesting departure from AQM in that, with the concept of paused queues, PDQ seems to leverage senders as queue memory."
Challenge: Low utilization during flow switching
Goal: flows A, B, C run back to back with no gaps.
In practice: 1-2 RTTs of idle time between consecutive flows.
How does PDQ avoid this?
Early Start: Seamless Flow Switching
Start the next set of flows early (about 2 RTTs before the current flows finish) so the link stays busy.
[Figure: throughput vs. time]
Early Start: Seamless Flow Switching
Early-started flows can cause a temporarily increased queue; solution: the rate controller at the switches handles it [as in XCP/TeXCP/D3].
[Figure: throughput vs. time]
Discussion
Mean FCT
[Figure: mean flow completion time, normalized to a lower bound, for TCP, RCP, PDQ w/o Early Start, PDQ, and an omniscient scheduler that controls with zero feedback delay]
What if the flow size is not known?
Why does flow size estimation (criticality = bytes sent) work better for Pareto-distributed flow sizes?
Other Questions
- Fairness: can long flows starve? 99% of jobs complete faster under SJF than under fair sharing [Bansal, Harchol-Balter; SIGMETRICS '01]. Assumption: heavy-tailed flow size distribution.
- Resilience to error: what if a packet gets lost or the flow information is inaccurate?
- Multipath: does PDQ benefit from multipath?
pFabric
pFabric in 1 Slide
Packets carry a single priority number
• e.g., prio = remaining flow size
pFabric switches
• Send highest-priority packets / drop lowest-priority packets
• Very small buffers (20-30 KB for a 10 Gbps fabric)
pFabric hosts
• Send/retransmit aggressively
• Minimal rate control: just prevent congestion collapse
Main idea: decouple scheduling from rate control
pFabric Switch
Boils down to a sort:
- Essentially unlimited priorities, thought to be difficult in hardware (existing switches only support 4-16 priorities)
But pFabric queues are very small:
- 51.2 ns to find the min/max of ~600 numbers
- Binary comparator tree: 10 clock cycles
- Current ASICs: clock ~ 1 ns
[Figure: pFabric switch port architecture]
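A toy sketch of the per-port behavior described on these two slides (not the ASIC design, and omitting the paper's refinements): enqueue evicts the lowest-priority packet when the buffer is full, and dequeue always sends the highest-priority packet. The buffer size in packets is an assumption for illustration.

```python
BUFFER_PKTS = 24   # illustrative; the slide says 20-30 KB per port at 10 Gbps

class PFabricPort:
    """Lower prio value = higher priority (e.g., prio = remaining flow size)."""
    def __init__(self):
        self.buf = []                      # list of (prio, packet)

    def enqueue(self, prio, pkt):
        self.buf.append((prio, pkt))
        if len(self.buf) > BUFFER_PKTS:
            # Buffer full: drop the lowest-priority packet (largest prio value).
            worst = max(range(len(self.buf)), key=lambda i: self.buf[i][0])
            self.buf.pop(worst)

    def dequeue(self):
        if not self.buf:
            return None
        # Send the highest-priority packet (smallest prio value).
        best = min(range(len(self.buf)), key=lambda i: self.buf[i][0])
        return self.buf.pop(best)[1]
```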
pFabric Rate Control
A minimal version of the TCP algorithm:
1. Start at line rate
   - Initial window larger than BDP
2. No retransmission timeout estimation
   - Fixed RTO at a small multiple of the round-trip time
3. Reduce window size upon packet drops
   - Window increase same as TCP (slow start, congestion avoidance, ...)
4. After multiple consecutive timeouts, enter "probe mode"
   - Probe mode sends minimum-size packets until the first ACK
Side questions: What about queue buildup? Why window control?
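A condensed sketch of this minimal rate control; the constants (RTO multiple, timeout threshold) and the exact structure are my placeholders, not the paper's state machine.

```python
class PFabricHost:
    def __init__(self, bdp_pkts, base_rtt):
        self.cwnd = bdp_pkts + 1      # start at line rate: window larger than BDP
        self.rto = 3 * base_rtt       # fixed RTO, small multiple of RTT (no RTT estimation)
        self.timeouts = 0
        self.probe_mode = False

    def on_timeout(self):
        self.timeouts += 1
        self.cwnd = max(1, self.cwnd // 2)   # reduce window on a loss signal
        if self.timeouts >= 5:               # threshold is a placeholder
            self.probe_mode = True           # send min-size probe packets until an ACK

    def on_ack(self):
        self.timeouts = 0
        self.probe_mode = False
        self.cwnd += 1                       # TCP-like window increase (simplified)
```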
Why does pFabric work?
Key invariant: at any instant, the highest-priority packet (according to the ideal algorithm) is available at the switch.
Priority scheduling:
- High-priority packets traverse the fabric as quickly as possible
What about dropped packets?
- Lowest priority → not needed until all other packets depart
- Buffer > BDP → enough time (> 1 RTT) to retransmit
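A back-of-the-envelope check of the "buffer > BDP" argument, using assumed intra-datacenter numbers (10 Gbps links, roughly 12 us round-trip time), which are not stated on the slide:

```python
link_rate_bps = 10e9
rtt_s = 12e-6
bdp_bytes = link_rate_bps * rtt_s / 8   # = 15,000 bytes, i.e. ~15 KB
print(bdp_bytes)
# A per-port buffer of 20-30 KB (as on the pFabric slide) therefore exceeds one BDP,
# so a dropped lowest-priority packet has more than one RTT's worth of higher-priority
# traffic ahead of it: enough time to retransmit it before it would have been sent anyway.
```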
Discussion
Overall Mean FCT
[Figure: FCT normalized to optimal in an idle fabric (0-10) vs. load (0.1-0.8) for Ideal, pFabric, PDQ, DCTCP, and TCP-DropTail]
Mice FCT (<100 KB)
[Figure: normalized FCT vs. load (0.1-0.8), average and 99th percentile, for Ideal, pFabric, PDQ, DCTCP, and TCP-DropTail]
Elephant FCT (>10 MB)
[Figure: average normalized FCT (0-25) vs. load (0.2-0.8) for TCP-DropTail, DCTCP, PDQ, pFabric, and Ideal]
Why the gap?
Loss Rate vs. Packet Priority (at 80% load)
[Figure: loss rate by packet priority; loss rate at other hops is negligible]
Almost all packet loss is for large (latency-insensitive) flows.
Next Time: Multi-Tenant Performance Isolation