Traffic Engineering with Forward Fault Correction
Harry Liu, Microsoft Research, 06/02/2016
Joint work with Ratul Mahajan, Srikanth Kandula, Ming Zhang and David Gelernter
Cloud services require large network capacity
• Cloud applications: growing traffic
• Cloud networks: expensive (e.g. cost of WAN: $100M/year)
TE is critical to effectively utilizing networks
Traffic Engineering (centralized & SDN-based)
• WAN network: Microsoft SWAN (SIGCOMM’13), Google B4 (SIGCOMM’13), ……
• Datacenter network: Devoflow (SIGCOMM’11), MicroTE (CoNEXT’11), ……
Centralized TE is the key to network efficiency
[Figure: four-node example with link capacity 10 and two flows, each with demand 10. Requirement: path length ≤ 2 hops.]
• The TE controller decides 1) how much traffic to admit and 2) how to route.
• Sub-optimal resource allocation based on local view & control. Total throughput: 15
• Optimal resource allocation based on global view & control. Total throughput: 20
But, centralized TE is also vulnerable to faults
[Figure: the TE controller and the network exchange a network view (e.g. topology, capacity, traffic) and a network configuration (e.g. routes, rate limits). The loop is exposed to control-plane faults and data-plane faults.]
• Frequent updates for high utilization (e.g. every 5 minutes)
Data plane faults
• Link and switch failures
• Rescaling: sending traffic proportionally to residual paths
[Figure: topology s1 to s4 with link capacity 10 and flows s2→s4 (10) and s3→s4 (10). After link failures, rescaling shifts traffic onto the surviving paths and causes congestion.]
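The rescaling behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the switches' actual mechanism: the function name `rescale`, the path names, and the traffic numbers are all hypothetical and are not taken from the figure.

```python
def rescale(path_traffic, failed_paths):
    """Redistribute a flow's traffic over its surviving paths, in
    proportion to what each surviving path carried before the failure."""
    total = sum(path_traffic.values())
    residual = {p: t for p, t in path_traffic.items() if p not in failed_paths}
    residual_total = sum(residual.values())
    if residual_total == 0:
        raise ValueError("no surviving path with traffic to rescale onto")
    return {p: total * t / residual_total for p, t in residual.items()}


# Hypothetical flow: 10 units split 7/3 over two paths.
before = {"via_s2": 7.0, "via_s3": 3.0}
after = rescale(before, failed_paths={"via_s2"})
print(after)  # all 10 units now ride the surviving path

# Hypothetical shared link: another flow already places 7 units on it.
LINK_CAPACITY = 10.0
other_traffic_on_link = 7.0
load = after["via_s3"] + other_traffic_on_link
print(load > LINK_CAPACITY)  # True: rescaling congested the shared link
```

The point of the sketch is that rescaling is purely local: it preserves the flow's total rate without checking whether the surviving links have room for it.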
Control plane faults
• Failures or long delays to configure a network device
[Figure: the TE controller sends TE configurations to a switch. Fault sources: RPC failure, firmware bugs, overloaded CPU, memory shortage.]
• Control plane faults can also result in congestion.
The TE controllability is undermined by faults
[Figure: the TE controller, network view, and network configuration loop, annotated with incompleteness and inaccuracy caused by control-plane faults and data-plane faults.]
Control and data plane faults in practice
In a production WAN (200+ routers, 6000+ links):
• Faults are common.
• Faults cause severe congestion.
• Data plane: fault rate = 25% per 5 minutes.
• Control plane: fault rate = 0.1% to 1% per TE update.
State of the art for handling faults
• Heavy over-provisioning: big loss in throughput
• Reactive handling of faults:
  • Control plane faults: retry
  • Data plane faults: re-compute TE and update networks
• Limits of reactive handling: cannot prevent congestion; slow (seconds to minutes); blocked by control plane faults
How about handling faults proactively?
• Network: not robust enough
• TE algorithm: making it robust
Forward fault correction (FFC) in TE
• [Bad News] Individual faults are unpredictable.
• [Good News] The number of simultaneous faults is small.
• Packet loss: with careful data encoding, FEC guarantees no information loss under up to k arbitrary packet drops.
• Network faults: with careful traffic distribution, FFC guarantees no congestion under up to k arbitrary faults.
Example: FFC for link failures
[Figure: topology s1 to s4 with link capacity 10 and flows s2→s4 (9) and s3→s4 (9). With k=1 (FFC), the traffic distribution is shown next to each single-link failure case.]
Trade-off: network efficiency vs. robustness
[Figure: three traffic distributions on the s1 to s4 topology: Non-FFC (throughput: 30) under a link failure, FFC (k=1) (throughput: 18), and Non-FFC (throughput: 18) under a link failure.]
• There exists a trade-off between throughput and robustness.
• Achieving the optimal throughput with FFC guarantee
• FFC does not always sacrifice efficiency for robustness.
Systematically realizing FFC in TE
• Formulation: how to merge FFC into the existing TE framework?
• Computation: how to find FFC-TE efficiently?
Basic TE linear programming formulations
TE decisions:
• $b_f$: sizes of flows
• $l_{f,t}$: traffic on paths
LP formulation:
• TE objective (maximizing throughput): $\max \sum_{f} b_f$
• Basic TE constraints:
  • Deliver all granted flows: $\forall f: \sum_{t} l_{f,t} \ge b_f$
  • No overloaded link: $\forall e: \sum_{f} \sum_{t \ni e} l_{f,t} \le c_e$
• FFC constraints: no overloaded link under up to $k_c$ control plane faults, $k_e$ link failures, and $k_v$ switch failures
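To make the two basic TE constraints concrete, here is a minimal Python sketch that checks a candidate allocation against them. It only verifies feasibility and does not solve the LP. The function `check_basic_te`, the dictionary layout, and the tiny topology are assumptions made for illustration.

```python
def check_basic_te(granted, tunnel_traffic, tunnel_links, capacity):
    """Return the list of violated constraints (empty means feasible).

    granted:        {flow: granted bandwidth}
    tunnel_traffic: {(flow, tunnel): traffic placed on that tunnel}
    tunnel_links:   {tunnel: set of links the tunnel traverses}
    capacity:       {link: capacity}
    """
    violations = []
    # Constraint 1: a flow's tunnels together carry its granted bandwidth.
    for flow, bw in granted.items():
        carried = sum(t for (f, _), t in tunnel_traffic.items() if f == flow)
        if carried < bw:
            violations.append(("flow", flow))
    # Constraint 2: no link carries more than its capacity.
    for link, cap in capacity.items():
        load = sum(t for (_, tun), t in tunnel_traffic.items()
                   if link in tunnel_links[tun])
        if load > cap:
            violations.append(("link", link))
    return violations


# Hypothetical topology: two links, two tunnels sharing link e1.
tunnel_links = {"t1": {"e1"}, "t2": {"e1", "e2"}}
capacity = {"e1": 10, "e2": 10}

too_much = check_basic_te({"A": 8, "B": 4},
                          {("A", "t1"): 8, ("B", "t2"): 4},
                          tunnel_links, capacity)
print(too_much)  # [('link', 'e1')]: e1 would carry 12 > 10

feasible = check_basic_te({"A": 8, "B": 2},
                          {("A", "t1"): 8, ("B", "t2"): 2},
                          tunnel_links, capacity)
print(feasible)  # []
```

The throughput objective is then simply `sum(granted.values())` over the feasible allocations; an LP solver searches that space instead of checking one point at a time.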
Formulating data-plane FFC
[Figure: a flow of size $s_f$ from S to D over three paths, path-1, path-2, and path-3, with bandwidth allocations $a_1$, $a_2$, and $a_3$. Paths are link-disjoint.]
FFC k=1:
• Fault on path-1: $s_f \le a_2 + a_3$
• Fault on path-2: $s_f \le a_1 + a_3$
• Fault on path-3: $s_f \le a_1 + a_2$
Lemma: FFC is achieved when path-i’s weight is $a_i / (a_1 + a_2 + a_3)$.
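The lemma can be checked numerically on small cases. The sketch below assumes link-disjoint paths and proportional rescaling, enumerates every set of up to k failed paths, and verifies that no surviving path receives more than its own allocation. The function name `survives_faults` and the sample allocations are hypothetical.

```python
from itertools import combinations


def survives_faults(alloc, flow_size, k):
    """True if, for every set of up to k failed paths, proportional
    rescaling keeps each surviving path's load within its allocation."""
    paths = range(len(alloc))
    for n_failed in range(k + 1):
        for failed in combinations(paths, n_failed):
            alive = [i for i in paths if i not in failed]
            residual = sum(alloc[i] for i in alive)
            if residual <= 0:
                return False  # nothing left to carry the flow
            for i in alive:
                # Surviving path i gets a share proportional to alloc[i].
                load = flow_size * alloc[i] / residual
                if load > alloc[i] + 1e-9:
                    return False
    return True


# Hypothetical allocations on three link-disjoint paths.
print(survives_faults([4, 3, 3], flow_size=6, k=1))  # True
print(survives_faults([4, 3, 3], flow_size=7, k=1))  # False
```

With allocations 4, 3, 3, losing the largest path leaves 6 units of allocation, so a flow of size 6 is safe under any single path fault while a flow of size 7 is not. This matches the three inequalities for k=1.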
An efficient and precise solution to FFC
k-sum linear constraint group (k-sum group):
• Given n paths and $A = \{a_1, a_2, \ldots, a_n\}$, FFC requires that the sum of arbitrary n-k elements in $A$ is ≥ flow size.
• A k-sum group has $O\binom{n}{k}$ constraints.
FFC-TE LP formulation:
• TE objective
• Basic TE constraints
• FFC constraints: k-sum group-1 to k-sum group-N (too many)
Lossless compression of a k-sum group, from $O\binom{n}{k}$ constraints to:
• $O(kn)$ with a bubble sorting network (SIGCOMM 2014)
• $O(n)$ with strong duality (MSR TR 2016)
http://www.hongqiangliu.com/publications.html
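The reason a k-sum group can be compressed at all is that only its tightest constraint matters: the sum of every choice of n-k elements is at least the flow size exactly when the sum of the n-k smallest elements is. The Python sketch below checks that equivalence for fixed numbers. It is not the sorting-network or duality encoding from the talk, which must keep the constraints linear in LP variables; the function names are hypothetical.

```python
from itertools import combinations


def _check_k(n, k):
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")


def ksum_brute_force(alloc, flow_size, k):
    """Check all C(n, k) constraints of the k-sum group one by one."""
    n = len(alloc)
    _check_k(n, k)
    return all(sum(subset) >= flow_size
               for subset in combinations(alloc, n - k))


def ksum_sorted(alloc, flow_size, k):
    """Check only the tightest constraint: the n-k smallest elements."""
    n = len(alloc)
    _check_k(n, k)
    return sum(sorted(alloc)[: n - k]) >= flow_size


# Hypothetical allocations over five paths.
alloc = [5, 1, 4, 2, 3]
for k in range(len(alloc) + 1):
    for flow_size in range(0, 16):
        assert ksum_brute_force(alloc, flow_size, k) == ksum_sorted(alloc, flow_size, k)
print(ksum_sorted(alloc, flow_size=6, k=2))  # True: 1 + 2 + 3 >= 6
```

Sorting works here because the values are known. Inside an LP the allocations are variables, which is why the talk needs a linear encoding of the same idea.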
FFC extensions
• Differential protection for different traffic priorities
• Minimizing congestion risks without rate limiters
• Control plane faults on rate limiters
• Uncertainty in current TE
• Different TE objectives (e.g. max-min fairness)
• …
Implementation & evaluation highlights
• Testbed experiment (8 switches & 30 servers)
  • FFC can be implemented in commodity switches
  • FFC has no data loss due to congestion under faults
• Large-scale simulation
  • A WAN network with O(100) switches and O(1000) links
  • One-week traffic trace
  • Fault injection according to real failure trace
• Results: with negligible throughput loss, FFC can reduce
  • data loss by a factor of 7-130 in well-provisioned networks
  • data loss of high priority traffic to almost zero in well-utilized networks
Conclusion and future work
[Figure: the SDN controller and the network, linked by network view and network configuration. FFC is marked next to the first two entries of each list.]
• Network faults: data-plane, control-plane, misconfigurations, attacks, traffic spikes, ……
• Network properties: high throughput, no congestion, security, availability, connectivity, ……
Q&A