Towards Highly Available Clos-Based WAN Routers



  1. Towards Highly Available Clos-Based WAN Routers. Sucha Supittayapornpong, Barath Raghavan, Ramesh Govindan (University of Southern California). SIGCOMM 2019.

  2. Google’s Wide Area Network: This network connects datacenters, so it has to be highly available. (Figure from “B4 and After”, SIGCOMM ’18.)

  3. Google’s Wide Area Network: Each datacenter has one or more routers, and the routers are connected by trunks. (Figure from “B4 and After”, SIGCOMM ’18.)

  4. A Trunk Contains Many Optical Links. (Image: https://www.sd-wan-experts.com/blog/undersea-cables/)

  5. WAN Router: A trunk’s links are wired to the router. Real routers have 128 or 512 ports. (Figure labels: router, wiring, trunk. Figure from “B4 and After”, SIGCOMM ’18.)

  6. WAN Router: Let’s use a toy router to develop intuition. (Figure labels: router, wiring, trunk.)

  7. Clos-Based WAN Router: A router is built as a Clos topology. (Figure labels: upper stage, internal link, lower stage.)

  8. Clos is Non-Blocking: It can handle any traffic matrix without loss. (Example: all-to-all, 1 unit.)

  9. Clos is Non-Blocking: Equal-cost multipath (ECMP) routing can achieve the non-blocking property.

  10. Achieving Non-Blocking Property via ECMP: ECMP splits traffic equally across next hops.

  11. Achieving Non-Blocking Property via ECMP: ECMP splits traffic equally across next hops.

  12. Achieving Non-Blocking Property via ECMP: ECMP splits traffic equally across next hops.

  13. Implication of Non-Blocking Property: There is sufficient internal capacity to route traffic between lower and upper stages.

  14. What Happens If There Are Failures?

  15. What Happens If There Are Failures? A single failure reduces internal capacity.

  16. What Happens If There Are Failures? A single failure reduces internal capacity. With ECMP, overall capacity can drop by half.
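The capacity loss described on slide 16 can be illustrated with a toy model. This is a minimal sketch under assumptions that are not from the talk: a lower switch has one uplink of fixed capacity per upper switch, ECMP splits a demand equally over the upper switches that can still reach the destination, and `ecmp_scale` is a hypothetical helper name.

```python
def ecmp_scale(demand, uplink_capacity, usable_uppers):
    """Fraction of `demand` a lower switch can send without loss under ECMP.

    Toy model (assumption): one uplink of `uplink_capacity` per upper switch,
    and ECMP splits the demand equally over `usable_uppers`.
    """
    if not usable_uppers:
        return 0.0
    per_uplink = demand / len(usable_uppers)
    return min(1.0, uplink_capacity / per_uplink)


# Two upper switches, demand of 2 units, uplinks of 1 unit each.
healthy = ecmp_scale(2.0, 1.0, ["U1", "U2"])
# One upper switch can no longer reach the destination (hypothetical failure).
degraded = ecmp_scale(2.0, 1.0, ["U2"])
print(healthy, degraded)
```

In this toy case, losing one of the two usable next hops halves the demand that can be carried, because the equal split has nowhere else to place the traffic.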

  17. Key Question: Can we completely mask internal link and switch failures?

  18. Key Question: Can we completely mask internal link and switch failures? If not, can we degrade gracefully? Existing approaches do neither.

  19. Key Insight: Wiring trunks to maximize early forwarding.

  20. Key Insight: Wiring trunks to maximize early forwarding. Careful wiring enables early forwarding. (Panels: previous, early forwarding.)

  21. Key Insight: Wiring trunks to maximize early forwarding. Early forwarding can reduce upflow.

  22. Key Insight: Wiring trunks to maximize early forwarding. Early forwarding can reduce upflow.

  23. Key Insight: Wiring trunks to maximize early forwarding. The router can recover full capacity in this example, so the failure is completely masked.

  24. Early forwarding needs a weighted version of ECMP (WCMP). (Figure labels: weight = 0, weight = 1.)

  25. WCMP can increase table sizes: Weights of 2 and 1 use 2 + 1 = 3 weight entries. Weights of 21 and 11 use 21 + 11 = 32 weight entries.
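The table-size arithmetic on slide 25 can be sketched as follows. This is a minimal sketch assuming a replication-style table in which a next hop with weight w occupies w entries. The helper name `wcmp_table_entries` is hypothetical, and dividing out a common factor is an added assumption that the slide does not mention.

```python
from functools import reduce
from math import gcd


def wcmp_table_entries(weights):
    """Number of table entries if each next hop is replicated `weight` times.

    Assumption: weights are non-negative integers, and a common factor can be
    divided out without changing the traffic split.
    """
    positive = [w for w in weights if w > 0]
    if not positive:
        return 0
    common = reduce(gcd, positive)
    return sum(w // common for w in positive)


print(wcmp_table_entries([2, 1]))    # slide example: 2 + 1
print(wcmp_table_entries([21, 11]))  # slide example: 21 + 11
```

The two ratios are close (2:1 versus 21:11), yet the second needs about ten times as many entries, which is why the choice of weights matters for table space.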

  26. WCMP weights can depend on the failure pattern.

  27. Challenges: What wiring minimizes upflow? What is the effective capacity for a failure pattern? What is the most space-efficient set of WCMP weights?

  28. Contributions

  29. The Entire Pipeline is Offline: Computing a routing table is expensive and cannot be done after a failure happens, so we must precompute tables for every possible failure pattern. Challenge: all of these steps must scale to very large routers.

  30. Finding Optimal Wiring

  31. Upflow depends on both trunk wiring and traffic. (Example: same traffic, different wiring.)

  32. Upflow depends on both trunk wiring and traffic. (Example: different traffic, same wiring.)

  33. Upflow depends on both trunk wiring and traffic: Upflow is a function of the wiring and the traffic matrix.

  34. Each wiring has its worst-case traffic. (Example: upflow = 2 under traffic matrix 1, upflow = 4 under traffic matrix 2.)

  35. Each wiring has its worst-case traffic. (Example: upflow = 8 under traffic matrix 1, upflow = 6 under traffic matrix 2.)

  36. Optimal wiring minimizes the worst-case upflow. (Example: worst-case upflow = 8 for one wiring and 4 for the other; choose the wiring with worst-case upflow 4.)
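The selection rule on slides 34 to 36 is a min-max choice, sketched below with the upflow values shown on those slides. The labels `wiring_1` and `wiring_2` and the function names are hypothetical, and exhaustive enumeration only stands in for the optimization the talk actually uses.

```python
# Upflow per wiring and traffic matrix, taken from the slide examples.
upflow = {
    "wiring_1": {"traffic_matrix_1": 2, "traffic_matrix_2": 4},
    "wiring_2": {"traffic_matrix_1": 8, "traffic_matrix_2": 6},
}


def worst_case_upflow(per_matrix):
    """Largest upflow a wiring suffers over the traffic matrices considered."""
    return max(per_matrix.values())


def choose_wiring(upflow_table):
    """Pick the wiring whose worst-case upflow is smallest (min-max)."""
    return min(upflow_table, key=lambda w: worst_case_upflow(upflow_table[w]))


best = choose_wiring(upflow)
print(best, worst_case_upflow(upflow[best]))
```

Here the first wiring is chosen because its worst case (4) is below the second wiring's worst case (8), even though the comparison looks only at each wiring's single worst traffic matrix.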

  37. Challenge: There are infinitely many traffic matrices!

  38. Solution: Extreme traffic matrices are sufficient.

  39. Finding the optimal wiring becomes a mixed-integer linear program (MILP).

  40. Calculating Effective Capacity

  41. A Non-Blocking Router Allows Topology Abstraction: Abstraction simplifies traffic engineering. (Figure labels: non-blocking router, topology abstraction, 2, 2, A, C.)

  42. A Blocking Router Breaks the Topology Abstraction: The router can no longer be abstracted as a simple node with flow conservation. (Figure labels: blocking router, 1, 2, A, C.)

  43. Upon Failure, Scale Demand to Ensure Non-Blocking. (Figure labels: blocking; non-blocking, = 2 × 0.5.)

  44. Effective Capacity for Non-Blocking Design: Effective capacity is the largest scaling factor for which a router is non-blocking under a given failure pattern. (Figure labels: blocking; non-blocking, 𝜄 = 0.5, = 2 × 0.5.)

  45. Computing Effective Capacity: Under a failure pattern, finding the effective capacity is a linear program per traffic matrix. (Example: 𝜄 = 0.75 for traffic matrix 1, 𝜄 = 0.5 for traffic matrix 2.)

  46. Effective Capacity under Failure and Traffic: The effective capacity is the minimum value over the traffic matrices. (Example: 𝜄 = 0.75 for traffic matrix 1, 𝜄 = 0.5 for traffic matrix 2.)
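The last step on slides 45 and 46 can be sketched as taking a minimum over per-matrix results. This is a minimal sketch that uses the two scaling factors from the slide example as given inputs. It does not reproduce the linear program that produces them, and the names are hypothetical.

```python
def effective_capacity(per_matrix_scaling):
    """Effective capacity for one failure pattern.

    `per_matrix_scaling` maps each traffic matrix to the largest scaling
    factor at which the router stays non-blocking (assumed to be computed
    elsewhere, e.g. by one linear program per traffic matrix).
    """
    return min(per_matrix_scaling.values())


# Scaling factors from the slide example.
scaling = {"traffic_matrix_1": 0.75, "traffic_matrix_2": 0.5}
capacity = effective_capacity(scaling)

# A demand of 2 units is scaled down to stay non-blocking (2 x 0.5).
admissible_demand = 2 * capacity
print(capacity, admissible_demand)
```

Taking the minimum makes the result hold for every traffic matrix considered, not only for the most favorable one.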

  47. Challenge: Exponential Number of Failure Patterns

  48. Challenge: Exponential Number of Failure Patterns. (Figure: examples of similar and not similar failure patterns.)

  49. Challenge: Exponential Number of Failure Patterns. Solution: group similar failure patterns using a graph canonicalization algorithm, then calculate the effective capacity for each canonical pattern.
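The grouping idea on slide 49 can be sketched on a toy scale. This is a minimal sketch under strong assumptions that are not from the talk: a failure pattern is a 0/1 matrix of failed internal links (rows are lower switches, columns are upper switches), switches within a stage are treated as interchangeable, and a brute-force search over permutations stands in for a real graph canonicalization algorithm. Trunk wiring, which can make lower switches distinguishable, is ignored here.

```python
from itertools import permutations


def canonical_form(failed):
    """Smallest relabeling of a failure matrix under row/column permutations.

    failed[i][j] == 1 means the link between lower switch i and upper
    switch j is down. Brute force: only suitable for tiny toy routers.
    """
    rows = range(len(failed))
    cols = range(len(failed[0]))
    best = None
    for row_order in permutations(rows):
        for col_order in permutations(cols):
            candidate = tuple(
                tuple(failed[r][c] for c in col_order) for r in row_order
            )
            if best is None or candidate < best:
                best = candidate
    return best


def group_patterns(patterns):
    """Group failure patterns that share the same canonical form."""
    groups = {}
    for pattern in patterns:
        groups.setdefault(canonical_form(pattern), []).append(pattern)
    return groups


one_link_a = [[1, 0], [0, 0]]
one_link_b = [[0, 0], [0, 1]]
same_lower = [[1, 1], [0, 0]]
diagonal = [[1, 0], [0, 1]]
groups = group_patterns([one_link_a, one_link_b, same_lower, diagonal])
print(len(groups))
```

With this grouping, an expensive computation such as effective capacity would only need to run once per group rather than once per pattern.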

  50. Compacting Routing Table: Please see this part in the paper.

  51. Evaluation: Resilience of 128-port router; comparison to alternative strategies; scalability to 512-port router; routing table sizes; impact of optimizations.

  52. Evaluation: Resilience of 128-port router; comparison to alternative strategies; scalability to 512-port router; routing table sizes; impact of optimizations.

  53. Methodology: 128-port router with 8 upper switches, 16 lower switches, 8 links per lower switch, and 4 trunks. Failure types: upper switch failure, link failure, lower switch failure. We enumerate all multiple-of-8 trunk sizes (34 combinations). We compute the effective capacity under all possible failure conditions.
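The count of 34 trunk-size combinations on slide 53 can be reproduced with a short enumeration. This is a minimal sketch assuming that all 128 ports are used, that each of the 4 trunks has at least 8 links, and that trunks are unordered. These assumptions are not stated on the slide but are consistent with the reported count. The function name is hypothetical.

```python
def trunk_size_combinations(ports=128, trunks=4, step=8):
    """Unordered trunk sizes (multiples of `step`) that use all `ports`."""

    def parts(remaining, count, largest):
        # Emit sizes in non-increasing order so each combination appears once.
        if count == 0:
            if remaining == 0:
                yield ()
            return
        for size in range(min(largest, remaining), 0, -step):
            for rest in parts(remaining - size, count - 1, size):
                yield (size,) + rest

    return list(parts(ports, trunks, ports))


combos = trunk_size_combinations()
print(len(combos))
print(combos[0], combos[-1])
```

Under these assumptions the enumeration yields 34 combinations, ranging from one large trunk with three minimal ones, (104, 8, 8, 8), to four equal trunks, (32, 32, 32, 32).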

  54. Effective Capacity under Link Failure, 128-Port Router: Our approach can mask up to 6 concurrent link failures. (Plot legend: 0, (0,16], (16,32), 32.)

  55. Lower Switch Failure, 128-Port Router: Capacity degrades gracefully.

  56. Comparison to Alternative Wiring Strategies: baseline wiring and random wiring.

  57. Minimal-Upflow Wiring Yields Superior Resilience. (Panels: minimal-upflow wiring, baseline wiring, random wiring.) No other approach can mask even a single link failure.

  58. Scalability, 512-Port Router: The pipeline can scale to the 512-port router. (Panels: lower switch failure, upper switch failure.)

  59. Conclusion: Min-upflow wiring and early forwarding can mask a significant number of failures. This improves the availability of WAN routers and can be used to reduce their cost.

  60. https://github.com/USC-NSL/Highly-Available-WAN-Router (Available Oct. 2019)
