Striking the Right Utilization-Availability Balance in the WAN



  1. Striking the Right Utilization-Availability Balance in the WAN
     Manya Ghobadi (MIT)
     Joint work with: Jeremy Bogle, Nikhil Bhatia (MIT), Ishai Menache, Nikolaj Bjorner (MSR), Asaf Valadarsky, Michael Schapira (Hebrew U)

  2. How to invest smartly in the stock market?
     [Figure: $10,000 to invest across three stocks; control variables \(x_1, x_2, x_3\), uncertainty variables \(y_1, y_2, y_3\); outcome: gain/loss]
     • Goal: the loss will be ≤ $100 with probability 99%
     • Solution: financial risk theory

  3. Traffic Engineering (TE) problem
     [Figure: three 10 Gbps paths between BOS and NYC with allocations \(x_1, x_2, x_3\); traffic demand: 10 Gbps; outcome: throughput/packet loss]
     • How to configure the allocation of traffic on network paths?
     • Goal: efficiently utilize the network to match the current traffic demand (a periodic process)

  4. Extensive research on TE in a broad variety of environments
     • Wide-area networks: Kumar et al. [NSDI’18], Liu et al. [SIGCOMM’14], Kumar et al. [SIGCOMM’15], Jain et al. [SIGCOMM’13], Hong et al. [SIGCOMM’13]
     • ISP networks: Jiang et al. [SIGMETRICS’09], Kandula et al. [SIGCOMM’05], Fortz et al. [INFOCOM’2000]
     • Data center networks: Alizadeh et al. [SIGCOMM’14], Akyildiz et al. [Journal of Comp. Nets.’14], Benson et al. [CoNEXT’11]

  5. TE problem in Wide-Area Networks
     [Figure: Microsoft WAN]

  6. TE problem in Wide-Area Networks
     [Figure: Microsoft WAN]
     • Challenging: billion-dollar infrastructure; high efficiency and availability
     • Solution: model the network as a graph; solve a Linear Program (objective, constraints)
     [B. Fortz, Internet Traffic Engineering by Optimizing OSPF Weights, INFOCOM’2000]

  7. Competing goals: high utilization and availability
     [Figure labels: Availability, Utilization, Failures]

  8. Competing goals: high utilization and availability
     [Figure labels: Availability, Utilization, Failures]

  9. Traffic engineering under failures
     • Today: optimize for the worst conceivable (potentially unlikely) failure scenarios
     • Problem: under-utilizing the network
     • Robust against k simultaneous link failures
     [Figure: three links between BOS and NYC: 10 Gbps with p(fail) = \(10^{-2}\), 5 Gbps with p(fail) = \(10^{-1}\), 10 Gbps with p(fail) = \(10^{-2}\)]
     • Admissible traffic: 5 Gbps all the time (99.999% of the time)

  10. Traffic engineering under failures
      • Robust against k simultaneous link failures
      [Figure: three links between BOS and NYC: 10 Gbps with p(fail) = \(10^{-2}\), 5 Gbps with p(fail) = \(10^{-1}\), 10 Gbps with p(fail) = \(10^{-2}\)]
      • Admissible traffic: 5 Gbps, 99.999% of the time
      [Figure labels: Utilization, Availability, Failures]
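
As a rough illustration of the k-failure guarantee on slides 9 and 10, the sketch below computes the capacity that survives in the worst scenario with at most k failed links. It assumes traffic can be shifted freely onto any surviving link, and the function name `worst_case_capacity` is made up for this example. The slides do not state which k yields the 5 Gbps figure; under these assumptions it matches k = 2.

```python
from itertools import combinations

# Link capacities (Gbps) of the three BOS-NYC links shown on the slide.
CAPACITIES = [10, 5, 10]


def worst_case_capacity(capacities, k):
    """Smallest surviving capacity over all scenarios with at most k failed links.

    Assumption: any surviving link can carry any part of the demand.
    """
    worst = sum(capacities)
    for n_failed in range(k + 1):
        for failed in combinations(range(len(capacities)), n_failed):
            surviving = sum(c for i, c in enumerate(capacities) if i not in failed)
            worst = min(worst, surviving)
    return worst


for k in range(len(CAPACITIES) + 1):
    print(f"k = {k}: admissible traffic {worst_case_capacity(CAPACITIES, k)} Gbps")
```

Guarding against two simultaneous failures caps admissible traffic at 5 Gbps even though both 10 Gbps links fail together only rarely, which is the under-utilization the slide points at.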

  11. Our approach to traffic engineering under failures
      • Use the failure probabilities to reason about the likelihood of failure scenarios
      • Provide a mathematical probabilistic guarantee for availability
      [Figure: three links between BOS and NYC: 10 Gbps with p(fail) = \(10^{-2}\), 5 Gbps with p(fail) = \(10^{-1}\), 10 Gbps with p(fail) = \(10^{-2}\)]
      Admissible traffic   Availability
      5 Gbps               99.999%
      10 Gbps              99.99%
      15 Gbps              99.8%
      20 Gbps              98%
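
The availability column on slide 11 can be reproduced by enumerating all up/down combinations of the three links. The sketch below is a minimal illustration, assuming the links fail independently and that a demand is admissible whenever the total surviving capacity covers it; the names `LINKS` and `availability` are chosen for this example only.

```python
from itertools import product

# (capacity in Gbps, failure probability) for the three BOS-NYC links on the slide.
LINKS = [(10, 1e-2), (5, 1e-1), (10, 1e-2)]


def availability(links, demand):
    """Probability mass of the scenarios whose surviving capacity covers the demand.

    Assumptions: independent link failures, freely splittable traffic.
    """
    total = 0.0
    for state in product([True, False], repeat=len(links)):
        prob = 1.0
        capacity = 0
        for up, (cap, p_fail) in zip(state, links):
            prob *= (1 - p_fail) if up else p_fail
            if up:
                capacity += cap
        if capacity >= demand:
            total += prob
    return total


for demand in (5, 10, 15, 20):
    print(f"{demand} Gbps: {availability(LINKS, demand):.5%}")
```

Under these assumptions the exact values are 0.99999, 0.9999, 0.99792, and 0.9801, which round to the figures in the table.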

  12. Our approach to traffic engineering under failures
      • Use the failure probabilities to reason about the likelihood of failure scenarios
      • Provide a mathematical probabilistic guarantee for availability
      [Figure: BOS-NYC links of 10, 5, and 10 Gbps; flow allocation vector \(x = (x_1, x_2, x_3)\), uncertainty vector \(y = (y_1, y_2, y_3)\), demand \(d_i\)]
      • For all flows, 90% of the demand is satisfied 99.9% of the time
      • For all flows, loss will be ≤ 10% of the demand 99.9% of the time

  13. Main idea
      • Finance (control \(x_1, x_2, x_3\), uncertainty \(y_1, y_2, y_3\)): the loss will be ≤ $100 with probability 99%. Find x that minimizes the loss with probability β.
      • Networking (flow allocation \(x_1, x_2, x_3\), uncertainty \(y_1, y_2, y_3\), demand \(d_i\)): the loss will be ≤ 10% of the demand with probability 99%. Find x that minimizes the loss with probability β.

  14. Key technique: scenario-based formulation
      • A failure scenario: one link failure, or correlated link failures
      • Loss: unsatisfied demand, or packet loss
      • Loss ≤ 10% of demand with probability 0.95: target probability \(\beta = 0.95\), \(\mathrm{VaR}_\beta = 10\%\)
      [Figure: probability distribution over loss (%), x-axis 0 to 10, with \(\beta = 0.95\) marked]

  15. Key technique: scenario-based formulation
      \(\psi(x, \mathrm{VaR}) = P\big(q \mid L(x, y(q)) \le \mathrm{VaR}\big)\)
      \(\min\{\mathrm{VaR} \mid \psi(x, \mathrm{VaR}) \ge \beta\}\)
      [Figure: probability distribution over loss (%), x-axis 0 to 10; target probability \(\beta = 0.95\), \(\mathrm{VaR}_\beta = 10\%\)]
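
For a fixed allocation x, the two formulas on slide 15 reduce to a quantile over a finite list of scenarios. The sketch below is one way to compute it, assuming each scenario is given as a (probability, loss) pair; the scenario values are invented for illustration and were chosen so that the result matches the slide's example of a 10% VaR at β = 0.95.

```python
def value_at_risk(scenarios, beta):
    """Smallest loss v such that P(loss <= v) >= beta.

    scenarios: iterable of (probability, loss) pairs for a fixed allocation x.
    """
    cumulative = 0.0
    for prob, loss in sorted(scenarios, key=lambda s: s[1]):
        cumulative += prob
        # Small tolerance so floating-point rounding does not skip a scenario.
        if cumulative >= beta - 1e-12:
            return loss
    raise ValueError("scenario probabilities sum to less than beta")


# Hypothetical scenarios: (probability, loss as a fraction of demand).
SCENARIOS = [(0.90, 0.0), (0.06, 0.10), (0.03, 0.40), (0.01, 1.0)]

print(value_at_risk(SCENARIOS, 0.95))
```

Sorting by loss and accumulating probability is the discrete counterpart of reading the β mark off the loss distribution in the slide's figure.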

  16. TeaVaR: Traffic Engineering Applying Value-at-Risk
      \(\psi(x, \mathrm{VaR}) = P\big(q \mid L(x, y(q)) \le \mathrm{VaR}\big)\)
      \(\min\{\mathrm{VaR} \mid \psi(x, \mathrm{VaR}) \ge \beta\}\)
      • What about the worst 5% of scenarios?
      \(\min\{E[\mathrm{Loss} \mid \mathrm{Loss} \ge \mathrm{VaR}]\}\)
      [Figure: probability distribution over loss (%), x-axis 0 to 10, with \(\beta = 0.95\) marked]
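
The tail expectation on slide 16 can be evaluated directly once VaR is known. The sketch below follows the conditional expectation exactly as written on the slide, \(E[\mathrm{Loss} \mid \mathrm{Loss} \ge \mathrm{VaR}]\), over an invented list of (probability, loss) scenarios. Optimization-friendly formulations of this quantity can treat the probability mass sitting exactly at VaR differently, so this is an illustration of the definition, not of the solver.

```python
def var_and_tail_expectation(scenarios, beta):
    """Return (VaR, E[loss | loss >= VaR]) for a list of (probability, loss) pairs."""
    cumulative = 0.0
    var = None
    for prob, loss in sorted(scenarios, key=lambda s: s[1]):
        cumulative += prob
        if cumulative >= beta - 1e-12:  # tolerance for floating-point rounding
            var = loss
            break
    if var is None:
        raise ValueError("scenario probabilities sum to less than beta")

    # Conditional expectation over the scenarios at or beyond VaR.
    tail = [(p, l) for p, l in scenarios if l >= var]
    tail_mass = sum(p for p, _ in tail)
    tail_expectation = sum(p * l for p, l in tail) / tail_mass
    return var, tail_expectation


# Hypothetical scenarios: (probability, loss as a fraction of demand).
SCENARIOS = [(0.90, 0.0), (0.06, 0.10), (0.03, 0.40), (0.01, 1.0)]

var, tail = var_and_tail_expectation(SCENARIOS, 0.95)
print(f"VaR = {var:.2f}, tail expectation = {tail:.2f}")
```

With these made-up numbers VaR alone reports a 10% loss, while the tail expectation shows that the scenarios at or beyond VaR lose 28% on average, which is the information the slide's question about the worst 5% is after.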

  17. Challenges unique to networking
      • Achieving fairness across network users.
      • Enabling computational tractability as the network scales.
      • Capturing fast rerouting of traffic in the data plane.
      • Accounting for correlated failures.

  18. Achieving fairness
      • Objective: find x that minimizes the loss with probability β
      • \(R_i\): routes for flow i
      • \(x_r\): flow allocation on route \(r \in R_i\)
      • \(y_r\): binary variable indicating if route r is up
      • Satisfied demand for flow i: \(\sum_{r \in R_i} x_r y_r\)
      • Starvation-aware loss function (worst-case normalized unmet demand): \(L(x, y) = \max_i \left[ 1 - \frac{\sum_{r \in R_i} x_r y_r}{d_i} \right]^+\)
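
The loss function on slide 18 is straightforward to evaluate for one scenario. The sketch below assumes allocations, route states, and demands are stored in dictionaries keyed by flow and route; that layout, the function name `starvation_aware_loss`, and the example numbers are assumptions made for illustration.

```python
def starvation_aware_loss(allocations, route_up, demands):
    """L(x, y) = max_i [1 - (sum_{r in R_i} x_r * y_r) / d_i]^+.

    allocations[i][r]: x_r, flow allocation on route r of flow i
    route_up[i][r]:    y_r, 1 if route r is up in this scenario, else 0
    demands[i]:        d_i, demand of flow i
    """
    worst = 0.0
    for flow, demand in demands.items():
        satisfied = sum(x * route_up[flow][r] for r, x in allocations[flow].items())
        unmet_fraction = max(0.0, 1.0 - satisfied / demand)  # the [.]^+ operator
        worst = max(worst, unmet_fraction)
    return worst


# Hypothetical example: two flows, flow "A" split over two routes.
allocations = {"A": {"r1": 6.0, "r2": 6.0}, "B": {"r3": 5.0}}
demands = {"A": 10.0, "B": 5.0}

all_up = {"A": {"r1": 1, "r2": 1}, "B": {"r3": 1}}
r2_down = {"A": {"r1": 1, "r2": 0}, "B": {"r3": 1}}

print(starvation_aware_loss(allocations, all_up, demands))
print(starvation_aware_loss(allocations, r2_down, demands))
```

Taking the maximum over flows means the loss is driven by the worst-off flow, so an allocation cannot score well by starving one flow to serve the others.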

  19. Handling scale
      [Figure: coverage (%), y-axis 98 to 100, "All Scenarios" vs. "Our approach", for B4, IBM, MSFT, and ATT]
      [Figure: run time (s), y-axis 0 to 125, for B4, IBM, MSFT, and ATT]
      [Figure: Google’s B4 topology]
      Topology   # Edges   # Scenarios
      B4         38        O(1E11)
      IBM        48        O(1E14)
      MSFT       100       O(1E30)
      ATT        112       O(1E33)
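
The scenario counts in slide 19's table are consistent with every edge being independently up or down, which gives \(2^{\#\text{edges}}\) scenarios. The slide does not spell out how scenarios are pruned, so the second function below is only one plausible illustration: assuming independent failures with a single hypothetical failure probability `p_fail`, it measures how much probability mass is covered by the scenarios with at most `max_failures` failed edges.

```python
from math import comb, floor, log10

# Edge counts from the slide's table.
EDGES = {"B4": 38, "IBM": 48, "MSFT": 100, "ATT": 112}


def scenario_count(edges):
    """Number of up/down combinations when each edge can fail on its own."""
    return 2 ** edges


def coverage(edges, p_fail, max_failures):
    """Probability mass of scenarios with at most max_failures failed edges.

    Assumption: independent failures, same p_fail on every edge.
    """
    return sum(
        comb(edges, k) * p_fail**k * (1 - p_fail) ** (edges - k)
        for k in range(max_failures + 1)
    )


for name, edges in EDGES.items():
    magnitude = floor(log10(scenario_count(edges)))
    print(f"{name}: about 1E{magnitude} scenarios, "
          f"coverage with <= 2 failures: {coverage(edges, 0.001, 2):.6f}")
```

The point of the sketch is that a tiny fraction of the scenario space can carry almost all of the probability mass when failures are rare, which is what makes restricting the optimization to likely scenarios attractive.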

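  20. System architecture
      • Inputs: topology; flow demands; failure probability of scenarios; target availability (0.99, 0.999, ...)
      • TeaVaR: linear optimization
      • Output: flow allocations
      [Figure: example network with flow allocations \(x_1, x_2, x_3\), uncertainty variables \(y_1, y_2, y_3\), and demand \(d_i\)]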

  21. Evaluations
      • Topologies: B4, IBM, ATT, and MSFT
      • Traffic matrix: four months of MSFT traffic matrices (one sample/hour); for the rest of the topologies, 24 TMs from YATES [SOSR’18]
      • Tunnel selection: our optimization framework is orthogonal to tunnel selection; oblivious paths, link-disjoint paths, and k-shortest paths
      • Baselines: SMORE [NSDI’18], FFC [SIGCOMM’14], B4 [SIGCOMM’13], ECMP

  22. Availability vs. demand scale
      • Availability is measured as the probability mass of scenarios in which demand is fully satisfied (“all-or-nothing” requirement)
      • If a TE scheme’s bandwidth allocation is unable to fully satisfy demand in 0.1% of scenarios, it has an availability of 99.9%
      [Figure: availability (%), y-axis 90 to 100, vs. demand scale, x-axis 1.0 to 3.1, for SMORE, B4, FFC-1, FFC-2, ECMP, and TeaVaR]
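
The availability metric defined on slide 22 can be written in a few lines. The sketch below assumes each scenario is summarized as a (probability, loss) pair, where a loss of zero means demand is fully satisfied; the function name and the scenario values are hypothetical and mirror the slide's 0.1% example.

```python
def all_or_nothing_availability(scenarios):
    """Probability mass of the scenarios in which demand is fully satisfied.

    scenarios: iterable of (probability, loss) pairs; loss == 0 means no unmet demand.
    """
    return sum(prob for prob, loss in scenarios if loss == 0)


# Hypothetical outcome of a TE scheme: demand is not fully met in 0.1% of scenarios.
SCENARIOS = [(0.999, 0.0), (0.0006, 0.2), (0.0004, 1.0)]

print(f"{all_or_nothing_availability(SCENARIOS):.1%}")
```

A scenario with 20% unmet demand counts against availability exactly as much as a total outage does, which is what the "all-or-nothing" label refers to.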

  23. Robustness to probability estimates
      Noise in probability estimations   % error in throughput
      1%                                 1.43%
      5%                                 2.95%
      10%                                3.07%
      15%                                3.95%
      20%                                6.73%

  24. Summary
      • TeaVaR uses financial risk theory for solving Traffic Engineering under failures.
      • TeaVaR’s approach is applicable to networking resource allocation problems such as capacity planning.
      • Code and demo available at: http://teavar.csail.mit.edu/
