Striking the Right Utilization-Availability Balance in the WAN
Manya Ghobadi (MIT)
Joint work with: Jeremy Bogle, Nikhil Bhatia (MIT); Ishai Menache, Nikolaj Bjorner (MSR); Asaf Valadarsky, Michael Schapira (Hebrew U)
How to invest smartly in the stock market?
[Figure: $10,000 invested across three stocks; x_1, x_2, x_3 are the control, y_1, y_2, y_3 the uncertainty; outcome: gain/loss]
• Loss will be ≤ $100 with probability 99%
• Solution: financial risk theory
Traffic Engineering (TE) problem
[Figure: three 10 Gbps paths between BOS and NYC carrying allocations x_1, x_2, x_3; traffic demand: 10 Gbps; outcome: throughput/packet loss]
• How to configure the allocation of traffic on network paths?
• Goal: efficiently utilize the network to match the current traffic demand (a periodic process)
Extensive research on TE in a broad variety of environments
• Wide-area networks: Kumar et al. [NSDI'18], Liu et al. [SIGCOMM'14], Kumar et al. [SIGCOMM'15], Jain et al. [SIGCOMM'13], Hong et al. [SIGCOMM'13]
• ISP networks: Jiang et al. [SIGMETRICS'09], Kandula et al. [SIGCOMM'05], Fortz et al. [INFOCOM'2000]
• Data center networks: Alizadeh et al. [SIGCOMM'14], Akyildiz et al. [Journal of Comp. Nets.'14], Benson et al. [CoNEXT'11]
TE problem in Wide-Area Networks
[Figure: Microsoft WAN]
TE problem in Wide-Area Networks
Challenging:
• Billion-dollar infrastructure
• High efficiency and availability
Solution:
• Model the network as a graph
• Solve a Linear Program (objective, constraints)
[Figure: Microsoft WAN]
[B. Fortz, Internet Traffic Engineering by Optimizing OSPF Weights, INFOCOM'2000]
Competing goals: high utilization and availability
[Figure labels: Availability, Utilization, Failures]
Traffic engineering under failures
• Today: optimize for the worst conceivable (potentially unlikely) failure scenarios
• Robust against k simultaneous link failures
• Problem: under-utilizing the network
[Figure: three parallel BOS-NYC links: 10 Gbps with p(fail) = 10^-2, 5 Gbps with p(fail) = 10^-1, 10 Gbps with p(fail) = 10^-2]
• Admissible traffic: 5 Gbps, nominally "all the time" (99.999% of the time given these failure probabilities)
Traffic engineering under failures
• Robust against k simultaneous link failures
[Figure: three parallel BOS-NYC links: 10 Gbps with p(fail) = 10^-2, 5 Gbps with p(fail) = 10^-1, 10 Gbps with p(fail) = 10^-2]
• Admissible traffic: 5 Gbps, 99.999% of the time
[Figure labels: Utilization, Availability, Failures]
Our approach to traffic engineering under failures
• Use the failure probabilities to reason about the likelihood of failure scenarios
• Provide a mathematical probabilistic guarantee for availability
[Figure: three parallel BOS-NYC links: 10 Gbps with p(fail) = 10^-2, 5 Gbps with p(fail) = 10^-1, 10 Gbps with p(fail) = 10^-2]
Admissible traffic | Availability
5 Gbps             | 99.999%
10 Gbps            | 99.99%
15 Gbps            | 99.8%
20 Gbps            | 98%
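The availability figures for the three-link BOS-NYC example can be reproduced by enumerating every up/down combination of the links. The sketch below assumes the links fail independently, which the slide does not state explicitly; the function and variable names are illustrative.

```python
from itertools import product

# Links from the BOS-NYC example: (capacity in Gbps, failure probability).
LINKS = [(10, 1e-2), (5, 1e-1), (10, 1e-2)]

def availability(target_gbps, links=LINKS):
    """Probability that the surviving capacity is at least target_gbps.

    Assumes independent link failures (an assumption of this sketch).
    """
    total = 0.0
    # Enumerate all 2^n scenarios; True means the link is up.
    for state in product([True, False], repeat=len(links)):
        prob = 1.0
        capacity = 0
        for up, (cap, p_fail) in zip(state, links):
            prob *= (1 - p_fail) if up else p_fail
            capacity += cap if up else 0
        if capacity >= target_gbps:
            total += prob
    return total

for target in (5, 10, 15, 20):
    print(f"{target} Gbps: {availability(target):.5f}")
```

Under the independence assumption, the computed probabilities are 0.99999, 0.9999, 0.99792, and 0.9801, which round to the availabilities listed on the slide.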
Our approach to traffic engineering under failures
• Use the failure probabilities to reason about the likelihood of failure scenarios
• Provide a mathematical probabilistic guarantee for availability
[Figure: three BOS-NYC links (10, 5, and 10 Gbps) carrying the flow allocation vector x = (x_1, x_2, x_3) under the uncertainty vector y = (y_1, y_2, y_3), serving demand d_i]
• For all flows, 90% of the demand is satisfied 99.9% of the time
• For all flows, loss will be ≤ 10% of the demand 99.9% of the time
Main idea
• Finance (allocations x_1, x_2, x_3; uncertainty y_1, y_2, y_3): the loss will be ≤ $100 with probability 99%. Find x that minimizes the loss with probability β.
• Networking (allocations x_1, x_2, x_3; uncertainty y_1, y_2, y_3; demand d_i): the loss will be ≤ 10% of the demand with probability 99%. Find x that minimizes the loss with probability β.
Key technique: scenario-based formulation
• A failure scenario: one link failure, or correlated link failures
• Loss (%): unsatisfied demand, packet loss
• Example: loss ≤ 10% of demand with probability 0.95
[Figure: probability distribution of loss (%), x-axis 0 to 10; target probability β = 0.95 marks VaR_β = 10%]
Key technique: scenario-based formulation
$\psi(x, \mathrm{VaR}) = P(q \mid L(x, y(q)) \le \mathrm{VaR})$
$\min \{ \mathrm{VaR} \mid \psi(x, \mathrm{VaR}) \ge \beta \}$
[Figure: probability distribution of loss (%), x-axis 0 to 10; target probability β = 0.95 marks VaR_β = 10%]
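For a finite set of scenarios, the VaR definition above reduces to walking the scenarios in order of increasing loss until the accumulated probability reaches β. The following is a minimal sketch; the scenario probabilities and losses are made-up values chosen to mirror the slide's picture (VaR_β = 10% at β = 0.95), not data from the talk.

```python
def value_at_risk(scenarios, beta):
    """Smallest loss level v such that P(loss <= v) >= beta.

    scenarios: list of (probability, loss) pairs whose probabilities sum to 1.
    """
    cumulative = 0.0
    for prob, loss in sorted(scenarios, key=lambda s: s[1]):
        cumulative += prob
        if cumulative >= beta - 1e-12:  # small tolerance for float rounding
            return loss
    raise ValueError("scenario probabilities sum to less than beta")

# Hypothetical scenarios: (probability, loss in percent of demand).
SCENARIOS = [(0.90, 0.0), (0.05, 10.0), (0.03, 40.0), (0.02, 100.0)]

print(value_at_risk(SCENARIOS, 0.95))
```

With these hypothetical numbers, 95% of the probability mass has loss at most 10%, so the function returns 10.0.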
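TeaVaR: Traffic Engineering Applying Value-at-Risk
$\psi(x, \mathrm{VaR}) = P(q \mid L(x, y(q)) \le \mathrm{VaR})$
$\min \{ \mathrm{VaR} \mid \psi(x, \mathrm{VaR}) \ge \beta \}$
• What about the worst 5% of scenarios?
$\min \{ E[\mathrm{Loss} \mid \mathrm{Loss} \ge \mathrm{VaR}] \}$
[Figure: probability distribution of loss (%), x-axis 0 to 10, with β = 0.95]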
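The tail expectation E[Loss | Loss ≥ VaR] can be evaluated directly on a finite scenario set. This sketch follows the slide's expression literally and takes the VaR level as an input; the scenario values are hypothetical, and the optimization over x that TeaVaR performs is not shown.

```python
def expected_tail_loss(scenarios, var):
    """E[loss | loss >= var] over a list of (probability, loss) pairs."""
    tail = [(prob, loss) for prob, loss in scenarios if loss >= var]
    tail_mass = sum(prob for prob, _ in tail)
    if tail_mass == 0:
        raise ValueError("no scenario has loss >= var")
    # Probability-weighted average of the losses in the tail.
    return sum(prob * loss for prob, loss in tail) / tail_mass

# Hypothetical scenarios: (probability, loss in percent of demand).
SCENARIOS = [(0.90, 0.0), (0.05, 10.0), (0.03, 40.0), (0.02, 100.0)]

print(expected_tail_loss(SCENARIOS, var=10.0))
```

Because the distribution is discrete, the conditioning event includes the whole probability atom at the VaR level, so the tail here carries 10% of the mass rather than exactly 5%. With these numbers the tail mean is about 37% loss, well above the 10% VaR, which is the information VaR alone hides.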
Challenges unique to networking
• Achieving fairness across network users.
• Enabling computational tractability as the network scales.
• Capturing fast rerouting of traffic in the data plane.
• Accounting for correlated failures.
Achieving fairness
[Figure: allocations x_1, x_2, x_3 and uncertainty y_1, y_2, y_3 serving demand d_i]
• Objective: find x that minimizes the loss with probability β
• R_i: routes for flow i
• x_r: flow allocation on route r ∈ R_i
• y_r: binary variable indicating if route r is up
• Satisfied demand for flow i: $\sum_{r \in R_i} x_r y_r$
• Starvation-aware loss function (worst-case normalized unmet demand): $L(x, y) = \max_i \left[ 1 - \frac{\sum_{r \in R_i} x_r y_r}{d_i} \right]^+$
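The starvation-aware loss can be evaluated for a single failure scenario as follows. This is a sketch of the formula only: the dictionary-based data layout, flow names, and numbers are assumptions made for illustration.

```python
def loss(x, y, routes, demand):
    """Worst-case normalized unmet demand across flows.

    x: route -> allocated bandwidth
    y: route -> 1 if the route is up in this scenario, else 0
    routes: flow -> list of routes (R_i)
    demand: flow -> demand (d_i)
    """
    worst = 0.0
    for flow, flow_routes in routes.items():
        satisfied = sum(x[r] * y[r] for r in flow_routes)
        # [.]^+ : unmet demand cannot be negative.
        unmet = max(0.0, 1.0 - satisfied / demand[flow])
        worst = max(worst, unmet)
    return worst

# Hypothetical two-flow example.
routes = {"A": ["a1", "a2"], "B": ["b1"]}
demand = {"A": 10.0, "B": 4.0}
x = {"a1": 5.0, "a2": 5.0, "b1": 4.0}
y = {"a1": 1, "a2": 0, "b1": 1}  # route a2 is down

print(loss(x, y, routes, demand))
```

In this example flow A receives 5 of its 10 units and flow B is fully served, so the loss is 0.5: the maximum over flows means one starved flow determines the loss regardless of how well the others do.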
Handling scale
[Figure: coverage (%), y-axis 98 to 100, for the B4 (Google's B4 topology), IBM, MSFT, and ATT topologies; legend: All scenarios, Our approach]
[Figure: run time (s), y-axis 0 to 125, for B4, IBM, MSFT, and ATT]
Topology | # Edges | # Scenarios
B4       | 38      | O(1E11)
IBM      | 48      | O(1E14)
MSFT     | 100     | O(1E30)
ATT      | 112     | O(1E33)
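The scenario counts in the table are consistent with every edge being independently up or down, which gives 2^E scenarios for E edges. That reading is an assumption of this sketch; the slide reports only the orders of magnitude.

```python
import math

# Edge counts from the table above.
EDGES = {"B4": 38, "IBM": 48, "MSFT": 100, "ATT": 112}

def scenario_count(num_edges):
    """Number of up/down combinations when each edge is either up or down."""
    return 2 ** num_edges

for name, edges in EDGES.items():
    count = scenario_count(edges)
    magnitude = math.floor(math.log10(count))
    print(f"{name}: 2^{edges} is on the order of 1E{magnitude}")
```

The exponents come out as 11, 14, 30, and 33, matching the table and showing why exhaustively enumerating all scenarios is impractical at these sizes.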
System architecture
[Figure: inputs (topology, flow demands, failure probability of scenarios, target availability (0.99, 0.999, ...)) feed TeaVaR's linear optimization, which outputs flow allocations x_1, x_2, x_3 for demand d_i]
Evaluations
• Topologies: B4, IBM, ATT, and MSFT
• Traffic matrix: four months of MSFT traffic matrices (one sample/hour); for the other topologies, 24 TMs from YATES [SOSR'18]
• Tunnel selection: our optimization framework is orthogonal to tunnel selection; evaluated with oblivious paths, link-disjoint paths, and k-shortest paths
• Baselines: SMORE [NSDI'18], FFC [SIGCOMM'14], B4 [SIGCOMM'13], ECMP
Availability vs. demand scale
• Availability is measured as the probability mass of scenarios in which demand is fully satisfied ("all-or-nothing" requirement)
• If a TE scheme's bandwidth allocation is unable to fully satisfy demand in 0.1% of scenarios, it has an availability of 99.9%
[Figure: availability (%), y-axis 90 to 100, vs. demand scale, x-axis 1.0 to 3.1, for SMORE, B4, FFC-1, FFC-2, ECMP, and TeaVaR]
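The all-or-nothing availability metric amounts to summing the probabilities of the scenarios with zero loss. A minimal sketch, with a hypothetical two-scenario input mirroring the 0.1% example above:

```python
def all_or_nothing_availability(scenarios):
    """Probability mass of scenarios in which demand is fully satisfied.

    scenarios: list of (probability, loss) pairs; loss == 0 means the
    allocation fully satisfies demand in that scenario.
    """
    return sum(prob for prob, loss in scenarios if loss == 0)

# Hypothetical: demand is not fully satisfied in 0.1% of scenarios.
EXAMPLE = [(0.999, 0.0), (0.001, 0.3)]

print(f"{all_or_nothing_availability(EXAMPLE):.1%}")
```

Note that the size of the loss in the failing scenarios does not affect this metric; only whether the loss is zero matters.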
Robustness to probability estimates
Noise in probability estimations | % error in throughput
1%                               | 1.43%
5%                               | 2.95%
10%                              | 3.07%
15%                              | 3.95%
20%                              | 6.73%
Summary
• TeaVaR uses financial risk theory for solving Traffic Engineering under failures.
• TeaVaR's approach is applicable to networking resource allocation problems such as capacity planning.
• Code and demo available at: http://teavar.csail.mit.edu/