Best-Path vs. Multi-Path Overlay Routing
David G. Andersen (MIT), Alex C. Snoeren (UCSD), Hari Balakrishnan (MIT)
October 2003
http://nms.lcs.mit.edu/ron/
Overview
Best-path vs. redundant overlay routing
• What tactics work best to
  – Reduce loss?
  – Reduce latency?
  – Avoid outages?
• In what circumstances do they perform best?
• Implications for new strategies
Context: Reliability via Path Diversity
• Backup links provide alternatives
➔ Mechanisms for obtaining diversity (existing diversity)
➔ Mechanisms for using diversity (overlay techniques)
Obtaining Diversity
Engineered diversity: [figure]
Exploiting existing diversity: [figure]
Existing AS-level Redundancy
• Traceroute between 12 hosts, showing Autonomous Systems (ASes)
[Figure: AS-level graph of the traceroute paths. Labeled nodes include MIT, Sightpath, Aros, CCI, MA-Cable, CMU, Utah, NYU, UTREP, AMNAP, VU-NL, CA-T1, Abilene, vBNS, NYSERNet, and Cornell, plus numbered ASes; known private peering is marked.]
Exploiting Diversity via Overlays
• Send packets through cooperating peers
• End-hosts only, no network support
Exploiting Diversity via Overlays
Reactive Routing (probes and routing updates)
• Probe paths
• Route via best
• RON (SOSP’01), Detour
Exploiting Diversity via Overlays
Reactive Routing (probes and routing updates)
• Probe paths
• Route via best
Redundant Routing
• Parallel paths
• No probing
• Mesh routing (SOSP’01)
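A minimal sketch of the two tactics, assuming a table of probed loss estimates and at most one intermediate overlay node per path. The one-hop restriction, the function names, and the sample loss numbers are illustrative assumptions, not taken from the slides. Reactive routing picks the single best path from probe data; redundant routing sends one copy directly and one via a randomly chosen peer, with no probing.

```python
import random

def path_loss(loss, path):
    """Estimated loss of a path, assuming its hops fail independently."""
    delivered = 1.0
    for a, b in zip(path, path[1:]):
        delivered *= 1.0 - loss[(a, b)]
    return 1.0 - delivered

def reactive_route(loss, nodes, src, dst):
    """Reactive routing: return the lowest-loss direct or one-hop path."""
    candidates = [[src, dst]]
    candidates += [[src, mid, dst] for mid in nodes if mid not in (src, dst)]
    return min(candidates, key=lambda p: path_loss(loss, p))

def redundant_routes(nodes, src, dst, rng=random):
    """Redundant routing: direct path plus one random intermediate, no probing."""
    mids = [n for n in nodes if n not in (src, dst)]
    return [[src, dst], [src, rng.choice(mids), dst]]

# Hypothetical probe results: loss probability per directed overlay link.
nodes = ["A", "B", "C"]
loss = {
    ("A", "B"): 0.20, ("A", "C"): 0.01, ("C", "B"): 0.02,
    ("B", "A"): 0.20, ("C", "A"): 0.01, ("B", "C"): 0.02,
}
best = reactive_route(loss, nodes, "A", "B")   # the detour via C beats the lossy direct link
copies = redundant_routes(nodes, "A", "B", random.Random(0))
```

The reactive chooser needs the `loss` table to be kept fresh by probing, while `redundant_routes` needs no measurements at all but doubles the data traffic.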
Reactive vs. Redundant Routing
[Figure: % capacity used by data (0 to 100%) vs. desired loss rate improvement (0% to 100%); the capacity limit is shared between data traffic and probe/redundant traffic.]
• Capacity limits probing and redundancy
Reactive vs. Redundant Routing
[Figure: % capacity used by data (0 to 100%) vs. desired loss rate improvement (0% to 100%), with Reactive and Redundant curves under the capacity limit; vertical markers show the best expected path limit and the independence limit.]
• Reactive limit: best path performance
• Redundant limit: path independence
• Overhead scaling: throughput vs. nodes
8 Routing Methods
Direct          Single packet, direct path
Direct Direct   2 packets, direct, no spacing
DD 10ms         2 packets, direct, 10 ms spacing
DD 20ms         2 packets, direct, 20 ms spacing
Lat             Reactive routing, min latency
Loss            Reactive routing, min loss
Direct Rand     2 packets, redundant routing, simplest
Lat Loss        2 packets, reactive + redundant (falls back to random)
Probing on Internet Testbed
Each node repeats:
1. Pick random node j.
2. Pick one of the 8 routing types (direct, loss, lat, etc.) in round-robin order. Send to j.
3. Delay for a random interval in [0.6 s, 1.2 s].
Probes are one-way, recorded at sender and receiver.
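The three steps above can be sketched as a schedule generator. This is an illustration only: it builds the list of (time, destination, routing type) tuples rather than sending packets or sleeping, and the node names, the seeded generator, and the function name are assumptions.

```python
import itertools
import random

ROUTING_TYPES = [
    "direct", "direct direct", "dd 10ms", "dd 20ms",
    "lat", "loss", "direct rand", "lat loss",
]

def probe_schedule(me, nodes, count, rng):
    """Yield (send_time_s, destination, routing_type) for `count` probes."""
    peers = [n for n in nodes if n != me]
    types = itertools.cycle(ROUTING_TYPES)   # step 2: round-robin over the 8 types
    now = 0.0
    for _ in range(count):
        dst = rng.choice(peers)              # step 1: pick random node j
        yield now, dst, next(types)
        now += rng.uniform(0.6, 1.2)         # step 3: random delay in seconds

# Hypothetical 4-node testbed, 16 probes from node "n0".
schedule = list(probe_schedule("n0", ["n0", "n1", "n2", "n3"], 16, random.Random(1)))
```

Cycling through the types in a fixed order while randomizing the destination and the delay means each method is sampled equally often over time across the same mix of paths.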
Datasets From Internet Deployment
Dataset      Nodes  Time     Measurements
RON wide     17     5 days   4.7M
RON narrow   17     3 days   2.8M
RON 2003     30     14 days  32.6M
✔ Variety of network types and bandwidths: 5 int’l, 3 Cable/DSL, 7 universities...
✔ N^2 path scaling: ∼900 paths
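As a quick check on the path count, the sketch below assumes that a "path" is an ordered (source, destination) pair of distinct nodes, which fits one-way probing; that reading and the function name are assumptions. For the 30-node dataset this gives 870 paths, in line with the ∼900 (about 30^2) quoted above.

```python
def one_way_paths(n):
    """Number of ordered (source, destination) pairs among n distinct nodes."""
    return n * (n - 1)

for name, n in [("RON wide", 17), ("RON narrow", 17), ("RON 2003", 30)]:
    print(name, one_way_paths(n))
```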
One-way Loss Rates Are Low
[Figure: CDF of average path-wide loss rate (%), 0 to 7, vs. fraction of paths, for the 2002 and 2003 datasets; 90% of paths are under 1% loss rate.]
• Overall loss 0.42% in 2003
• Includes quiescent periods
• Outages still (painfully) apparent
Duplication Reduces Overall Loss
Type            Loss %
direct          0.42
direct direct   0.30
dd 10ms         0.27
dd 20ms         0.27
Lat             0.43
Loss            0.33
Direct Rand     0.26
Lat Loss        0.23
Loss Probabilities Sanity Check
• 0.42% loss << [Paxson 94,95] (2.8%, 5%)
• Unloaded paths vs. loaded by TCP transfer
• Conditional loss probabilities are similar
Study                   P(lose P2 | lost P1)
Paxson TCP              ∼50%
Bolot 8ms spacing       60%
RON 2003 no spacing     72%
RON 2003 20ms           65%
RON 2003 direct rand    62%
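The conditional probabilities tie directly to the loss rates of the two-packet methods. A small worked check, assuming the first copy is lost at the overall direct rate of 0.42% and that a pair is lost only when both copies are lost (the function and variable names are illustrative):

```python
def both_lost_pct(single_loss_pct, cond_loss):
    """P(both copies lost) = P(lost P1) * P(lose P2 | lost P1), in percent."""
    return single_loss_pct * cond_loss

DIRECT_LOSS_PCT = 0.42  # overall one-way loss in the 2003 dataset

predictions = {
    "direct direct (no spacing)": both_lost_pct(DIRECT_LOSS_PCT, 0.72),
    "dd 20ms": both_lost_pct(DIRECT_LOSS_PCT, 0.65),
    "direct rand": both_lost_pct(DIRECT_LOSS_PCT, 0.62),
    # If the two copies failed independently, the conditional loss would
    # equal the unconditional loss probability (0.0042).
    "independent paths": both_lost_pct(DIRECT_LOSS_PCT, DIRECT_LOSS_PCT / 100),
}
for name, pct in predictions.items():
    print(f"{name}: {pct:.4f}%")
```

The first three products come to about 0.30%, 0.27%, and 0.26%, which agree with the measured loss percentages for direct direct, dd 20ms, and Direct Rand. Fully independent paths would instead give roughly 0.0018%, which shows how far the measured paths are from independence.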
Latency Improvements
[Figure: CDF of latency (ms), 0 to 300, vs. fraction of paths (0.7 to 1) for lat loss, lat, direct rand, and direct; 5% of connections exhibit large latency improvement.]
Mean latency
lat loss      46.8 ms
lat           48.0 ms
direct rand   51.7 ms
direct        54.1 ms
Unlike loss, most latency from specific bad paths
# High Loss Periods (1 hr, normalized)
Type            > 0%
direct          1 (8817)
direct direct   0.59
dd 20ms         0.43
Lat             1.2    ← Worse than naive duplication for low loss situations
Loss            0.80
Direct Rand     0.44
Lat Loss        0.38
# High Loss Periods (1 hr, normalized)
Type            > 0%       > 30%
direct          1 (8817)   1 (630)
direct direct   0.59       0.93
dd 20ms         0.43       0.91
Lat             1.2        0.96    ← on par
Loss            0.80       0.91
Direct Rand     0.44       0.92
Lat Loss        0.38       0.89
# High Loss Periods (1 hr, normalized)
Type            > 0%       > 30%      > 60%
direct          1 (8817)   1 (630)    1 (255)
direct direct   0.59       0.93       0.98
dd 20ms         0.43       0.91       0.98
Lat             1.2        0.96       0.91
Loss            0.80       0.91       0.86 ★
Direct Rand     0.44       0.92       0.92 ★
Lat Loss        0.38       0.89       0.84 ★
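The normalization used in this table can be sketched as follows, assuming a "period" is one hour of samples and that each method's count is divided by the count for direct. The helper names and the toy loss rates are made up for illustration; they are not measured data.

```python
def high_loss_periods(hourly_loss_pct, threshold_pct):
    """Count 1-hour periods whose loss rate exceeds threshold_pct."""
    return sum(1 for loss in hourly_loss_pct if loss > threshold_pct)

def normalized_counts(per_method, threshold_pct, baseline="direct"):
    """Per-method period counts divided by the baseline method's count."""
    base = high_loss_periods(per_method[baseline], threshold_pct)
    return {m: high_loss_periods(v, threshold_pct) / base
            for m, v in per_method.items()}

# Toy hourly loss rates (%); NOT measured data.
toy = {
    "direct":      [0.0, 1.0, 40.0, 70.0, 0.5],
    "direct rand": [0.0, 0.0, 35.0, 65.0, 0.0],
}
print(normalized_counts(toy, 0))    # {'direct': 1.0, 'direct rand': 0.5}
print(normalized_counts(toy, 30))   # {'direct': 1.0, 'direct rand': 1.0}
```

In the toy data, duplication removes the mild-loss hours but not the severe ones, which is the same qualitative pattern the table reports as the threshold rises.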
Measurement Summary
✔ Redundant beats reactive for low loss
  – “Meshing” beats controls during outages
✔ Reactive finds specific good paths
  – Latency improvements
  – Low loss paths
✘ No overlay technique near independent paths
  – Hypothesis: Access link failures
  – More severe outages harder to correct
Why Not FEC?
Redundant assumption: fast recovery, low rate
0.42% loss rate → need little redundancy
[Figure: 1st packet lost (X); recovery arrives only after ...100 packets...]
Failure losses bursty (≥ 0.5 conditional loss)
✘ Spread FEC over even more packets
➔ Latency-critical traffic: 2-redundant mesh
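The latency argument can be made concrete with a rough model. The sketch assumes a single parity packet sent after each block of data packets and a constant packet interval; the block code, the 20 ms interval, and the function names are illustrative assumptions rather than anything the slides specify.

```python
def fec_recovery_delay(block_size, packet_interval_s, lost_index):
    """Seconds until a lost packet can be rebuilt from one trailing parity packet.

    The parity packet follows the block's data packets, so the packet at
    lost_index (0-based) waits for the rest of the block plus the parity.
    """
    remaining = block_size - lost_index   # later data packets + 1 parity
    return remaining * packet_interval_s

def duplicate_recovery_delay(spacing_s):
    """A duplicated packet is recovered as soon as its copy arrives."""
    return spacing_s

interval = 0.02  # hypothetical 50 packets/s flow
fec_delay = fec_recovery_delay(100, interval, 0)   # first packet of a 100-packet block
dup_delay = duplicate_recovery_delay(0.02)         # copy sent 20 ms later
print(fec_delay, dup_delay)
```

Under these assumptions the low-overhead code makes the first packet wait about two seconds, against 20 ms for a duplicate. A single parity packet also cannot repair two losses in one block, so bursty losses push toward even larger blocks and longer waits.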
Conclusions
• Loss rate for low-rate traffic low (0.42%)
• Conditional loss probability high (0.72) even for random mesh (0.62)
• 40-60% of loss avoidable
✔ Redundant: Avoiding low loss rates
✔ Reactive: Avoiding high loss, latency
➔ Low loss suggests selective approach ...
Future Work
Strategies for avoiding losses and outages:
• Selective redundancy: Protecting SYNs, etc. (shameless plug: Currently implementing)
• Selective probing: Activate on first loss
Measurements:
• Engineered network redundancy impact? (testing now, looking for multihomed sites)
http://nms.lcs.mit.edu/ron/
Scaling
• Reactive: Scales with # nodes
• Redundant: Scales with traffic volume
Best Path Scaling
Routing and probing add packets: responsiveness vs. overhead vs. size
[Figure: overhead (bits/second), 0 to 35000, vs. number of nodes, 0 to 50; marked points: 10 nodes 2.2 Kbps, 30 nodes 13.3 Kbps, 50 nodes 33 Kbps.]
• 50 nodes near limit, enough for many apps.
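One way to read the three marked points is as a quadratic-plus-linear curve in the number of nodes. The sketch below fits overhead = a*n^2 + b*n through the 10-node and 50-node points and checks it against the 30-node point. The model form is an assumption made here for illustration, not a formula given on the slide.

```python
def fit_quadratic(p1, p2):
    """Solve overhead = a*n**2 + b*n through two (nodes, kbps) points."""
    (n1, y1), (n2, y2) = p1, p2
    det = n1 * n1 * n2 - n2 * n2 * n1          # determinant of the 2x2 system
    a = (y1 * n2 - y2 * n1) / det
    b = (n1 * n1 * y2 - n2 * n2 * y1) / det
    return a, b

def overhead_kbps(a, b, n):
    """Predicted per-node overhead in Kbps for an n-node overlay."""
    return a * n * n + b * n

a, b = fit_quadratic((10, 2.2), (50, 33.0))
predicted_30 = overhead_kbps(a, b, 30)
print(a, b, predicted_30)
```

The fit gives a of about 0.011 and b of about 0.11, and predicts roughly 13.2 Kbps at 30 nodes, close to the 13.3 Kbps marked on the plot. The growing quadratic term is consistent with the slide's remark that 50 nodes is near the limit.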
Best Path Routing
Probes and Routing
• Frequently measure all inter-node paths
• Exchange routing information
• Route along app-specific best path consistent with routing policy
Probing and Outage Detection
[Figure: Node A sends Initial Ping (ID 5: time 10); Node B receives it (ID 5: time 33) and sends Response 1. Node A receives Response 1 (ID 5: time 15), records "success" with RTT 5, and sends Response 2. Node B receives Response 2 (ID 5: time 39) and records "success" with RTT 6.]
• Probe every random(14) seconds
• 3 packets, both sides get RTT and reachability
• If “lost probe,” send next immediately
  Timeout based on RTT and RTT variance
• If N lost probes, notify outage
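A minimal sketch of the per-peer bookkeeping this exchange implies. Each side computes RTT from its own clock only, so no clock synchronization is needed. The TCP-style smoothing constants, the `srtt + 4 * rttvar` timeout, the reading of "N lost probes" as N consecutive losses, and the threshold of 4 are all assumptions chosen for illustration.

```python
class ProbeState:
    """Per-peer probe bookkeeping: RTT estimate, timeout, and loss streak."""

    def __init__(self, outage_threshold, initial_timeout=1.0):
        self.outage_threshold = outage_threshold  # the slide's "N" (value assumed)
        self.initial_timeout = initial_timeout
        self.srtt = None        # smoothed RTT
        self.rttvar = 0.0       # smoothed RTT deviation
        self.lost_in_a_row = 0

    def timeout(self):
        """Probe timeout from RTT and RTT variance (TCP-style, assumed)."""
        if self.srtt is None:
            return self.initial_timeout
        return self.srtt + 4 * self.rttvar

    def on_response(self, sent_at, received_at):
        """Record a success; both times are read from the local clock."""
        rtt = received_at - sent_at
        if self.srtt is None:
            self.srtt, self.rttvar = rtt, rtt / 2
        else:
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt
        self.lost_in_a_row = 0
        return rtt

    def on_lost_probe(self):
        """Return True once enough consecutive probes are lost to notify an outage."""
        self.lost_in_a_row += 1
        return self.lost_in_a_row >= self.outage_threshold

node_a = ProbeState(outage_threshold=4)
node_b = ProbeState(outage_threshold=4)
rtt_a = node_a.on_response(sent_at=10, received_at=15)  # initial ping -> response 1
rtt_b = node_b.on_response(sent_at=33, received_at=39)  # response 1 -> response 2
```

With the timestamps from the figure, node A records an RTT of 5 and node B an RTT of 6, matching the two "success" records.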
Architecture: Probing
➔ Probe between nodes, determine path qualities
  – O(N^2) probe traffic with active probes
  – Passive measurements