Traffic Engineering with Forward Fault Correction (FFC)
Hongqiang “Harry” Liu, Srikanth Kandula, Ratul Mahajan, Ming Zhang, David Gelernter (Yale University)
Cloud services require large network capacity
• Cloud services: growing traffic
• Cloud networks: expensive (e.g. cost of WAN: $100M/year)
TE is critical to effectively utilizing networks
• Traffic Engineering (TE) for WAN networks: Microsoft SWAN, Google B4, ……
• Traffic Engineering (TE) for datacenter networks: DevoFlow, MicroTE, ……
But TE is also vulnerable to faults
[Figure: the TE controller takes the network view and pushes network configuration to the network, with frequent updates for high utilization; control-plane faults and data-plane faults are marked on this loop]
Control plane faults
• Failures or long delays to configure a network device
• Causes: RPC failure, firmware bugs, overloaded CPU, memory shortage
[Figure: the TE controller sends TE configurations to a switch]
Congestion due to control plane faults
[Figure: four-switch topology s1, s2, s3, s4 with link capacity 10. Flows (traffic demands): s1→s2 (10), s1→s3 (10), s1→s4 (10), s3→s4 (10); new: s2→s4 (10). After a configuration failure during the TE update, one link is congested.]
Data plane faults
• Link and switch failures
• Rescaling: sending traffic proportionally to residual paths
[Figure: four-switch topology s1, s2, s3, s4 with link capacity 10 and flows s2→s4 (10), s3→s4 (10); after a link failure, rescaled traffic causes congestion]
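The rescaling rule on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `rescale`, the path names, and the traffic amounts are all hypothetical, and the sketch assumes a flow's traffic is simply redistributed in proportion to what each surviving path already carries.

```python
def rescale(path_traffic, failed_paths):
    """Move a flow's traffic off failed paths, splitting it across the
    surviving (residual) paths in proportion to their current traffic."""
    total = sum(path_traffic.values())
    residual = {p: t for p, t in path_traffic.items() if p not in failed_paths}
    residual_total = sum(residual.values())
    if residual_total == 0:
        raise ValueError("no residual path carries traffic; cannot rescale")
    return {p: total * t / residual_total for p, t in residual.items()}


# Hypothetical flow of 10 units split over three paths.
flow = {"direct": 5.0, "via_s2": 3.0, "via_s3": 2.0}
after_failure = rescale(flow, failed_paths={"direct"})
print(after_failure)
```

Note that rescaling looks only at the affected flow. It does not check whether the residual paths have room for the extra traffic, which is how congestion can follow a link failure.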
Control and data plane faults in practice
In production networks:
• Faults are common.
  • Control plane: fault rate = 0.1% -- 1% per TE update.
  • Data plane: fault rate = 25% per 5 minutes.
• Faults cause severe congestion.
State of the art for handling faults
• Heavy over-provisioning: big loss in throughput
• Reactive handling of faults
  • Control plane faults: retry
  • Data plane faults: re-compute TE and update networks
• Drawbacks of reactive handling: cannot prevent congestion; slow (seconds -- minutes); blocked by control plane faults
How about handling congestion proactively?
Forward fault correction (FFC) in TE
• [Bad News] Individual faults are unpredictable.
• [Good News] The number of simultaneous faults is small.
• Packet loss: FEC guarantees no information loss under up to k arbitrary packet drops, with careful data encoding.
• Network faults: FFC guarantees no congestion under up to k arbitrary faults, with careful traffic distribution.
Example: FFC for control plane faults
[Figure: non-FFC TE on the four-switch topology s1, s2, s3, s4 with link capacity 10; after a configuration failure, one link is congested]
Example: FFC for control plane faults
[Figure: control plane FFC (k=1) on the four-switch topology s1, s2, s3, s4 with link capacity 10; two configuration-failure cases are shown]
Trade-off: network utilization vs. robustness
[Figure: three traffic distributions on the four-switch topology s1, s2, s3, s4]
• Non-FFC: throughput 50
• K=1 (control plane FFC): throughput 47
• K=2 (control plane FFC): throughput 44
Systematically realizing FFC in TE
• Formulation: how to merge FFC into the existing TE framework?
• Computation: how to find FFC-TE efficiently?
Basic TE linear programming formulations
• TE decisions: $b_f$ (sizes of flows) and $l_{f,t}$ (traffic on paths)
• TE objective (maximizing throughput): $\max \sum_{\forall f} b_f$
• Basic TE constraints:
  • Deliver all granted flows: $\forall f: \sum_{\forall t} l_{f,t} \ge b_f$
  • No overloaded link: $\forall e: \sum_{\forall f} \sum_{\forall t \ni e} l_{f,t} \le c_e$
  • … …
• FFC constraints: no overloaded link with up to $k_c$ control plane faults, $k_e$ link failures, and $k_v$ switch failures
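The two basic TE constraints above (deliver all granted flows, no overloaded link) can be checked directly for a candidate allocation. The sketch below is only a feasibility checker, not an LP solver, and everything in it is an assumption made for illustration: the function name `check_te`, the dictionary layout, the tunnel names, and the numbers.

```python
from collections import defaultdict


def check_te(flow_sizes, path_traffic, tunnel_links, capacity):
    """Check a candidate TE allocation against the two basic constraints.

    flow_sizes:   {flow: granted size}
    path_traffic: {(flow, tunnel): traffic placed on that tunnel}
    tunnel_links: {tunnel: list of links the tunnel traverses}
    capacity:     {link: capacity}
    Returns (throughput, list of violated constraints).
    """
    problems = []

    # Deliver all granted flows: per-flow traffic over all tunnels >= size.
    allocated = defaultdict(float)
    for (flow, _tunnel), amount in path_traffic.items():
        allocated[flow] += amount
    for flow, size in flow_sizes.items():
        if allocated[flow] < size:
            problems.append(f"flow {flow}: allocated {allocated[flow]} < {size}")

    # No overloaded link: traffic of all tunnels crossing a link <= capacity.
    load = defaultdict(float)
    for (_flow, tunnel), amount in path_traffic.items():
        for link in tunnel_links[tunnel]:
            load[link] += amount
    for link, used in load.items():
        if used > capacity[link]:
            problems.append(f"link {link}: load {used} > {capacity[link]}")

    return sum(flow_sizes.values()), problems


# Hypothetical instance: two flows, three tunnels, all link capacities 10.
tunnels = {
    "t1": [("s1", "s4")],
    "t2": [("s1", "s2"), ("s2", "s4")],
    "t3": [("s2", "s4")],
}
caps = {("s1", "s4"): 10, ("s1", "s2"): 10, ("s2", "s4"): 10}
throughput, problems = check_te(
    {"f1": 10, "f2": 5},
    {("f1", "t1"): 7, ("f1", "t2"): 3, ("f2", "t3"): 5},
    tunnels,
    caps,
)
print(throughput, problems)
```

A solver would search over the flow sizes and per-path traffic to maximize throughput. The checker only confirms that one given choice satisfies the constraints.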
Formulating control plane FFC
[Figure: flows $f_1$, $f_2$, $f_3$ share the link between s1 and s2]
• Total load under faults? ($l_1^{old}$: $f_1$'s load in old TE; $l_2^{new}$: $f_2$'s load in new TE)
  • Fault on $f_1$: $l_1^{old} + l_2^{new} + l_3^{new} \le$ link cap
  • Fault on $f_2$: $l_1^{new} + l_2^{old} + l_3^{new} \le$ link cap
  • Fault on $f_3$: $l_1^{new} + l_2^{new} + l_3^{old} \le$ link cap
• Challenge: too many constraints. With $n$ flows and FFC protection $k$: #constraints = $\binom{n}{1} + \dots + \binom{n}{k}$ for each link.
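To make the blow-up concrete, the sketch below counts the per-link constraints and brute-forces every fault set of size up to k for one link. It follows the slide's simplified model, where a faulty flow keeps its old load and every other flow carries its new load. The function names and the example loads are hypothetical.

```python
from itertools import combinations
from math import comb


def num_fault_constraints(n, k):
    """Per-link constraint count: C(n,1) + ... + C(n,k)."""
    return sum(comb(n, j) for j in range(1, k + 1))


def violating_fault_sets(old, new, k, capacity):
    """Enumerate every set of 1..k faulty flows on one link and return the
    sets whose resulting load exceeds the link capacity.

    old, new: {flow: load on this link in the old / new TE}, same keys.
    """
    flows = sorted(new)
    bad = []
    for size in range(1, k + 1):
        for faulty in combinations(flows, size):
            load = sum(old[f] if f in faulty else new[f] for f in flows)
            if load > capacity:
                bad.append(faulty)
    return bad


# Hypothetical loads of three flows on one link with capacity 10.
old = {"f1": 6, "f2": 2, "f3": 1}
new = {"f1": 3, "f2": 3, "f3": 3}
print(num_fault_constraints(3, 1))
print(num_fault_constraints(1000, 3))
print(violating_fault_sets(old, new, k=1, capacity=10))
```

In this example the new TE fits the link when every update succeeds (3 + 3 + 3 = 9), but a fault on f1 alone pushes the load to 12, so this allocation would not satisfy FFC with k = 1.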
An efficient and precise solution to FFC
• Our approach: a lossless compression from $O(n^k)$ constraints to $O(kn)$ constraints.
• Total load under faults ≤ link capacity ⇔ total additional load due to faults ≤ link spare capacity
• $x_i$: additional load due to fault-$i$
• Given $X = \{x_1, x_2, \dots, x_n\}$, FFC requires that the sum of arbitrary $k$ elements in $X$ is ≤ link spare capacity: $O(n^k)$ constraints
• Define $y_m$ as the $m$-th largest element in $X$: $\sum_{m=1}^{k} y_m \le$ link spare capacity, $O(1)$ constraints
• Expressing $y_m$ with $X$?
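The compression rests on a simple fact: every k-element subset has a sum within the spare capacity exactly when the k largest elements do. The sketch below checks that equivalence numerically on random integer inputs. The function names and the test ranges are illustrative assumptions, and the code evaluates the condition on concrete numbers rather than encoding it in an LP.

```python
import heapq
import random
from itertools import combinations


def ffc_ok_enumerated(x, k, spare):
    """One check per k-subset of additional loads."""
    return all(sum(subset) <= spare for subset in combinations(x, k))


def ffc_ok_top_k(x, k, spare):
    """Single check on the sum of the k largest additional loads."""
    return sum(heapq.nlargest(k, x)) <= spare


# Compare the two formulations on random non-negative integer loads.
rng = random.Random(0)
for _ in range(200):
    x = [rng.randint(0, 10) for _ in range(6)]
    k = rng.randint(1, 3)
    spare = rng.randint(0, 30)
    assert ffc_ok_enumerated(x, k, spare) == ffc_ok_top_k(x, k, spare)
print("both formulations agree on all sampled cases")
```

When the additional loads are non-negative, as in the sampled cases, the single top-k check also covers fault sets smaller than k, because dropping an element can only lower the sum.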
Sorting network
[Figure: a two-round sorting network over inputs $x_1, \dots, x_4$ with intermediate values $z_1, \dots, z_8$; the 1st round outputs $y_1$ (1st largest) and the 2nd round outputs $y_2$ (2nd largest)]
• A comparison takes $x_1$, $x_2$ and outputs $z_1 = \max\{x_1, x_2\}$ and $z_2 = \min\{x_1, x_2\}$
• $y_1 + y_2 \le$ link spare capacity
• Complexity: $O(kn)$ additional variables and constraints.
• Throughput: optimal in control plane and data plane if paths are disjoint.
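One way to see where the O(kn) bound comes from is to extract the k largest values using only max/min comparators, one round per output. The sketch below uses a bubble-style chain of comparators. That wiring is an assumption chosen for simplicity and need not match the network drawn on the slide. It also evaluates the network on numbers and does not show how the comparisons are encoded as LP constraints. The function name is hypothetical.

```python
def top_k_by_comparators(x, k):
    """Return the k largest values of x (in descending order) and the number
    of comparators used. Each round chains comparators through the remaining
    values: the max output is carried forward, the min output is kept for
    the next round."""
    values = list(x)
    largest = []
    comparators = 0
    for _ in range(min(k, len(values))):
        carry = values[0]
        rest = []
        for v in values[1:]:
            hi, lo = max(carry, v), min(carry, v)  # one comparator
            comparators += 1
            carry = hi
            rest.append(lo)
        largest.append(carry)  # the m-th largest after round m
        values = rest
    return largest, comparators


top, used = top_k_by_comparators([3, 9, 1, 7], k=2)
print(top, used)
```

Round m uses n - m comparators, so k rounds use fewer than kn in total, each contributing a constant number of variables.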
FFC extensions
• Differential protection for different traffic priorities
• Minimizing congestion risks without rate limiters
• Control plane faults on rate limiters
• Uncertainty in current TE
• Different TE objectives (e.g. max-min fairness)
• …
Evaluation overview
• Testbed experiment
  • FFC can be implemented in commodity switches
  • FFC has no data loss due to congestion under faults
• Large-scale simulation
  • A WAN network with O(100) switches and O(1000) links
  • Injecting faults based on real failure reports
  • Two settings: single priority traffic in a well-provisioned network; multiple priority traffic in a well-utilized network
FFC prevents congestion with negligible throughput loss
[Figure: bar chart, y-axis "Ratio (%)", showing FFC data-loss / non-FFC data-loss and FFC throughput / optimal throughput for single priority traffic and for high priority (high FFC protection), medium priority (low FFC protection), and low priority (no FFC protection) traffic; annotated values include <0.01% and 160%]
Conclusions
• Centralized TE is critical to high network utilization but is vulnerable to control and data plane faults.
• FFC proactively handles these faults.
  • Guarantee: no congestion when #faults ≤ k.
  • Efficiently computable with low throughput overhead in practice.
[Figure: FFC positioned between heavy network over-provisioning and high risk of congestion]