A Network-State Management Service



  1. A Network-State Management Service. Peng Sun, Ratul Mahajan, Jennifer Rexford, Lihua Yuan, Ming Zhang, Ahsan Arefin (Princeton & Microsoft)

  2. Complex Infrastructure • Microsoft Azure, 2010 vs. 2014 • Number of data centers: a few (2010), 10s (2014) • Network devices: 1,000s (2010), 10s of 1,000s (2014) • Network capacity: 10s of Tbps (2010), Pbps (2014) • Variety of vendors/models/time

  3. Management Applications • Traffic Engineering • Load Balancing • Link Corruption Mitigation • Device Firmware Upgrade • ……

  4. Our Question • How to safely run multiple management applications on shared infrastructure?

  5. Naïve Solution • Run independently (diagram: Traffic Engineering, Link Corruption Mitigation, and Firmware Upgrade above the Network Devices)

  6. Naïve Solution • It does not work due to 2 problems (diagram: Traffic Engineering, Link Corruption Mitigation, and Firmware Upgrade above the Network Devices)

  7. Problem #2: Safety Violation • Link-corruption-mitigation shuts down faulty Agg A • Firmware-upgrade schedules Agg B to upgrade (diagram: Core 1, Core 2; Agg A, Agg B; ToRs)

  8. Potential Solution #1 • One monolithic application • Central control of all actions (diagram: Traffic Engineering, Link Corruption Mitigation, and Firmware Upgrade combined)

  9. Too Complex to Build • Difficult to develop: it must combine applications that are already individually complicated • High maintenance cost for such huge software in practice

  10. Potential Solution #2 • Explicit coordination among applications • Consensus over network changes (diagram: Traffic Engineering, Link Corruption Mitigation, Firmware Upgrade)

  11. Still Too Complex • Hard to understand each other • Diverse network interactions (diagram labels: Application, Device, Routing, Config, Traffic Engineering, Firmware upgrade)

  12. Main Enemy: Complexity • Application development • Application coordination (diagram: Independent, Explicitly coordinate, and Monolithic placed on a scale from Simple to Complex)

  13. What We Advocate • Loose coupling of applications • Design principle: simplicity with safety guarantees • Forgo joint optimization: a worthwhile tradeoff for simplicity; applications could do it out-of-band

  14. Overview of Statesman • Network operating system for safe multi-application operation • Uses network state abstraction • Three views of network state • Dependency model of states

  15. The “State” in Statesman • Complexity of dealing with devices: heterogeneity, device-specific commands (diagram: Network State above Network Devices)

  16. State Variable Examples (state variable: value) • Device Power Status: up, down • Device Firmware: version number • Device SDN Agent Boot: up, down • Device Routing State: routing rules • Link Admin Status: up, down • Link Control Plane: BGP, OpenFlow, …

  17. Simplify Device Interaction • Past: the application collects device statistics over SNMP, OF, vendor API, … and issues device-specific cmds to the Network Devices • Now: the application reads and writes Network State, which sits above the Network Devices

  18. Views of Network State • Observed State: actual state of the whole network • Target State: desired state to be updated on the whole network (diagram: Applications; Observed State, Target State; Network Devices)

  19. Two Views Are Not Enough (diagram: Applications; Observed State, Target State; Network Devices)

  20. Two Views Are Not Enough: One More View • Proposed State: a group of entity-variable-values desired by an application (diagram: Applications; Observed State, Target State, Proposed State; Network Devices)

  21. How Merging Works • Combine multiple proposed states into a safe target state • Conflict resolution: last-writer-wins, priority-based locking (sufficient for current deployment) • Safety invariant checking: partial rejection & skip update
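The merging step on slide 21 can be sketched roughly as follows. This is a minimal illustration, not Statesman's actual code: the `Proposal` record, its `timestamp` field, and the `is_safe` callback are all hypothetical names introduced here, and the sketch covers only last-writer-wins plus partial rejection.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# (entity, variable) identifies one state variable, e.g. ("AggA", "PowerState").
Key = Tuple[str, str]
State = Dict[Key, str]


@dataclass
class Proposal:
    app: str         # hypothetical application name
    timestamp: int   # hypothetical logical write time
    key: Key
    value: str


def merge(target: State, proposals: List[Proposal],
          is_safe: Callable[[State], bool]) -> Tuple[State, List[Proposal]]:
    """Fold proposals into a copy of the target state.

    Last-writer-wins: proposals are applied in timestamp order, so a later
    write to the same variable overrides an earlier one.
    Partial rejection: a proposal that would break the invariant is skipped,
    and the remaining proposals are still considered.
    """
    merged = dict(target)
    rejected: List[Proposal] = []
    for p in sorted(proposals, key=lambda p: p.timestamp):
        candidate = dict(merged)
        candidate[p.key] = p.value
        if is_safe(candidate):
            merged = candidate
        else:
            rejected.append(p)
    return merged, rejected
```

With an invariant such as "at least one of Agg A and Agg B stays up", a proposal to power down Agg A would be accepted and a later proposal to power down Agg B would be returned in the rejected list, which mirrors the safety violation on slide 7.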

  22. Choose Safety Invariants • Too loose: cannot protect network operation • Too tight: hinder application too frequently • Our current choice • Connectivity: every pair of ToRs in one DC is connected • Capacity: 99% of ToR pairs have at least 50% capacity
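The two invariants on slide 22 could be checked along these lines. This is only a sketch under stated assumptions: the topology is given as a list of links that are currently up, and the per-pair capacity fractions in `pair_capacity` are assumed to be computed elsewhere; the function names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Link = Tuple[str, str]


def reachable(start: str, up_links: Iterable[Link]) -> Set[str]:
    """Breadth-first search over the links that are currently up."""
    adj: Dict[str, List[str]] = {}
    for a, b in up_links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def connectivity_ok(tors: List[str], up_links: Iterable[Link]) -> bool:
    """Connectivity: every pair of ToRs in the DC is connected."""
    if not tors:
        return True
    return set(tors) <= reachable(tors[0], list(up_links))


def capacity_ok(pair_capacity: Dict[Tuple[str, str], float],
                min_fraction: float = 0.50, min_share: float = 0.99) -> bool:
    """Capacity: at least min_share of ToR pairs keep min_fraction of capacity."""
    if not pair_capacity:
        return True
    good = sum(1 for c in pair_capacity.values() if c >= min_fraction)
    return good / len(pair_capacity) >= min_share
```

The default thresholds are the 99% and 50% values from the slide; tightening or loosening them is the tradeoff the slide describes.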

  23. Recap of Three-View Model • Simplify network management • Observed State: what we see from the network • Proposed State: what we want the network to be • Target State: what can be actually done on the network (diagram: Applications and Statesman around the three states)

  24. Yet Another Problem • What’s in Proposed State: a small number of state variables that the application cares about • Implicit conflicts arise • Caused by state dependency

  25. Implicit Conflict • TE writes new value of routing state of B for tunneling traffic • Firmware-upgrade writes new value of firmware state of B (diagram: devices A, B, C, D)

  26. Dependency Relations (diagram: PathState at the top; Device: RoutingState, ConfigurationState, FirmwareVersion, PowerState; Link: ConfigurationState, AdminState)

  27. Build in Dependency Model • Statesman calculates it internally • Only exposes the result for each state variable: whether the variable is controllable
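One way to picture the "controllable" flag from slide 27 is the sketch below. The dependency order in `DEPENDS_ON` is an assumption read off the state names on slide 26, and `BLOCKING_VALUES` is a hypothetical set of values that make a variable unusable as a foundation; neither is confirmed by the slides.

```python
from typing import Dict, List

# Assumed dependency order for a device: each variable depends on the ones listed.
DEPENDS_ON: Dict[str, List[str]] = {
    "PowerState": [],
    "FirmwareVersion": ["PowerState"],
    "ConfigurationState": ["FirmwareVersion"],
    "RoutingState": ["ConfigurationState"],
}

# Hypothetical observed values that block everything depending on the variable.
BLOCKING_VALUES = {"down", "upgrading", "unknown"}


def is_controllable(variable: str, observed: Dict[str, str]) -> bool:
    """True when every (transitive) dependency is observed in a usable value."""
    for dep in DEPENDS_ON[variable]:
        if observed.get(dep, "unknown") in BLOCKING_VALUES:
            return False
        if not is_controllable(dep, observed):
            return False
    return True
```

Under these assumptions, while device B's firmware is observed as upgrading, its RoutingState is reported as not controllable, so the TE write from slide 25 would be recognized as conflicting even though TE never touched the firmware variable.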

  28. Statesman System (diagram: Monitor, Checker, and Updater around a Storage Service that holds Observed State, Proposed State, and Target State)

  29. Deployment Overview • Operational in Microsoft Azure for 10 months • Covers 10 DCs with 20K devices

  30. Production Applications • 3 diverse applications built: device firmware upgrade, link corruption mitigation, traffic engineering • Finished within months • Only thousands of lines of code

  31. Case #1: Resolve Conflict • Inter-DC TE & Firmware-upgrade (diagram: DC 1 to DC 4 connected through BR 1 to BR 8; DC = Data Center, BR = Border Router)

  32. (figure for Case #1; annotations are added on the following slides)

  33. Firmware-upgrade acquires lock of BR1

  34. TE fails to acquire lock, and moves traffic away

  36. BR1 firmware upgrade starts

  37. BR1 firmware upgrade ends. Lock released.

  38. TE re-acquires lock, and moves traffic back

  40. Case #1 Summary • Each application: simple logic, unaware of the other • Statesman enables: conflict resolution, necessary coordination
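The lock sequence in Case #1 can be replayed with a small sketch of priority-based locking. The `LockTable` class, the numeric priorities, and the rule that a higher-priority application may take over a lock are assumptions made for illustration; the slides do not spell out the exact locking rules.

```python
from typing import Dict


class LockTable:
    """Per-entity locks; a higher-priority application may take over a lock."""

    def __init__(self, priority: Dict[str, int]) -> None:
        self.priority = priority           # app name -> priority (higher wins)
        self.holder: Dict[str, str] = {}   # entity -> app currently holding the lock

    def acquire(self, app: str, entity: str) -> bool:
        current = self.holder.get(entity)
        if (current is None or current == app
                or self.priority[app] > self.priority[current]):
            self.holder[entity] = app
            return True
        return False

    def release(self, app: str, entity: str) -> None:
        # Only the current holder can release the lock.
        if self.holder.get(entity) == app:
            del self.holder[entity]


# Replay of the Case #1 sequence, with hypothetical priorities.
locks = LockTable({"firmware-upgrade": 2, "te": 1})
locks.acquire("te", "BR1")                 # TE initially controls BR1
locks.acquire("firmware-upgrade", "BR1")   # firmware-upgrade acquires lock of BR1
te_blocked = not locks.acquire("te", "BR1")  # TE fails, so it moves traffic away
locks.release("firmware-upgrade", "BR1")   # upgrade ends, lock released
te_back = locks.acquire("te", "BR1")       # TE re-acquires lock, moves traffic back
```

Neither application refers to the other here; each one only asks for the lock and reacts to the answer, which is the loose coupling the summary slide describes.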

  41. Case #2: Maintain Capacity Invariant • Firmware-upgrade & Link-corruption-mitigation (diagram: Core 1 to 4; Pod 1 to Pod 10, each with Agg 1 to 4 and ToR 1 to n; legend: link corrupting packets)

  42. Upgrade proceeds at normal speed in Pod 3 and 5

  45. Upgrade in Pod 4 is slowed down by checker due to lost capacity • Upgrade proceeds at normal speed in Pod 3 and 5

  47. Case #2 Summary • Statesman: • Automatically adjusts application progress • Keeps the network within safety requirements

  48. Conclusion • Need a network operating system for multiple management applications • Statesman: loose coupling of applications, network state abstraction, deployed and operational in Azure

  49. Thanks! Questions? Check the paper for related work
