Robotron: Top-down Network Management at Scale Yu-Wei Eric Sung, Xiaozheng Tie, Starsky H.Y. Wong, Hongyi Zeng ACM SIGCOMM 2016 August 25, 2016
Scale of Facebook Community • 1.7 Billion on Facebook monthly • 1 Billion on WhatsApp monthly • 500 Million on Instagram monthly • 1 Billion on Messenger monthly
Network Management at Facebook What’s involved? • Goals: Build and evolve FB network • Example tasks: circuit/device turnup, network monitoring • Human interactions -> outages
Network Management at Facebook Why is it hard? • Distributed Configurations • Multiple Domains • Versioning • Dependency • Vendor Differences
Network Management at Facebook Early days… 2004-2007 2008 2009 2010 2011 2012 2013 2014 2015 Manual Configuration and Monitoring with ad-hoc scripts
Contribution Timeline 2004-2007 2008 2009 2010 2011 2012 2013 2014 2015: Manual Configuration and Monitoring with ad-hoc scripts, then Robotron started, then Our Paper • Shed light on • Network management tasks • Robotron’s usage • Evolution of Robotron • Our experiences using Robotron
Overview of Facebook’s Network Lifecycle of user requests: Users -> Internet -> POPs -> Backbone -> Data Centers
Point of Presence (POP) • Standardized topology • Services: LB, Cache • Common tasks • Build/upgrade a cluster • Provision new peering circuits
Backbone • Irregular, demand-driven topology • Common tasks: • Add/migrate circuits • Add/remove routers
Datacenter • Standardized topology • Services: Web, Cache, Database • Common tasks • Build/decomm a cluster • Cluster capacity upgrade
Overview of Facebook’s Network Multiple versions of FB cluster architectures co-exist [Figure: # of clusters (normalized) over time. POP panel: Gen1, Gen2. DC panel, 8 generations: Gen1, Gen2-A, Gen2-B, Gen2-C, Gen2-D, Gen2V6, Gen3, Gen3V6]
Robotron: “Top-Down” Network Management System@FB Overview: Network Design -> Config Generation -> Deployment -> Monitoring, all built around the FBNet DB
FBNet: Modeling the Network Example: 4-post POP cluster [Figure labels: Internet; BB 1, BB 2; PR 1, PR 2; 20G; PSW a, PSW b, PSW c, PSW d (4-post POP cluster); to Top-of-Rack switches & servers]
FBNet: Modeling the Network Object [Figure: PSW a and PR 1 joined by two 10G circuits (et1/1 to et2/1, et1/2 to et3/1), aggregated as ae0 (2001::1) and ae1 (2001::2), with an eBGP session between them. Object types: Circuit, PhysicalInterface, AggregatedInterface, Linecard, BgpV6Session, NetworkSwitch, V6Prefix]
FBNet: Modeling the Network Value [Figure: the same PSW a / PR 1 example with attribute values attached to the objects: name=et1/1, name=et1/2, name=ae0, name=PSW a, slot=1, model=X, speed=10G, prefix=2001::1]
FBNet: Modeling the Network Relationship: It’s complicated [Figure: the same PSW a / PR 1 example with relationship fields linking the objects: a_endpoint=, z_endpoint=, linecard=, agg_interface=, device=, interface=, a_prefix=, z_prefix=]
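The object / value / relationship split on the three slides above can be sketched in plain Python. This is only an illustration under stated assumptions, not FBNet's actual schema: the generic `Device` class, the helper `circuits_on`, the linecard slots for PR 1 and the choice of which model owns each relationship field are all hypothetical; only the interface names, `ae0`/`ae1`, and the 10G speed come from the slide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Device:  # hypothetical stand-in for switch/router models
    name: str  # value, e.g. "PSW a" or "PR 1"


@dataclass
class Linecard:
    slot: int
    device: Device  # relationship: the device holding this linecard


@dataclass
class AggregatedInterface:
    name: str  # e.g. "ae0"
    device: Device


@dataclass
class PhysicalInterface:
    name: str  # e.g. "et1/1"
    linecard: Linecard
    agg_interface: Optional[AggregatedInterface] = None


@dataclass
class Circuit:
    speed: str  # e.g. "10G"
    a_endpoint: PhysicalInterface
    z_endpoint: PhysicalInterface


def circuits_on(device: Device, circuits: List[Circuit]) -> List[Circuit]:
    """Follow endpoint -> linecard -> device to find a device's circuits."""
    return [
        c for c in circuits
        if device is c.a_endpoint.linecard.device
        or device is c.z_endpoint.linecard.device
    ]


psw_a, pr_1 = Device("PSW a"), Device("PR 1")
ae0, ae1 = AggregatedInterface("ae0", psw_a), AggregatedInterface("ae1", pr_1)
psw_lc1 = Linecard(1, psw_a)
# Slots 2 and 3 are guessed from the interface names et2/1 and et3/1.
pr_lc2, pr_lc3 = Linecard(2, pr_1), Linecard(3, pr_1)

circuits = [
    Circuit("10G", PhysicalInterface("et1/1", psw_lc1, ae0),
            PhysicalInterface("et2/1", pr_lc2, ae1)),
    Circuit("10G", PhysicalInterface("et1/2", psw_lc1, ae0),
            PhysicalInterface("et3/1", pr_lc3, ae1)),
]
```

Even for one pair of devices, answering "which circuits touch PSW a?" means walking three relationships, which is the point of the "It's complicated" caption.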
FBNet Model Snippet
    class PhysicalInterface(Interface):
        linecard = models.ForeignKey(Linecard)
        agg_interface = models.ForeignKey(AggregatedInterface)
FBNet Model Snippet Related models: the linecard and agg_interface fields are ForeignKey references to the Linecard and AggregatedInterface models.
FBNet Model Snippet Model inheritance: PhysicalInterface is declared as a subclass of the Interface model.
FBNet: Architecture API Layer (Write Service, Read Service, multiple Read API replicas) • RPC services • Read: fine-grained per-model query • Write: task-based • High Availability: Multiple replicas per DC
FBNet: Architecture (API Layer on top of the FBNet DBs) • 1 primary, multiple secondary DBs • Scalability: 1 slave (secondary) per DC • Primary -> Secondary via replication stream
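The read/write split on the two architecture slides can be sketched as follows. This is a toy, in-memory illustration under stated assumptions: the class `FBNetStoreSketch`, its method names, the "every object needs a name" validation rule and the synchronous copy to secondaries are all hypothetical simplifications, not the real RPC services or replication stream.

```python
import copy


class FBNetStoreSketch:
    """Hypothetical model: one primary store, one secondary per DC."""

    def __init__(self, datacenters):
        self.primary = {}  # model name -> list of object dicts
        self.secondaries = {dc: {} for dc in datacenters}

    def write_task(self, changes):
        """Task-based write: apply every (model, object) pair or none."""
        staged = copy.deepcopy(self.primary)
        for model, obj in changes:
            if "name" not in obj:  # illustrative validation rule only
                raise ValueError(f"{model} object needs a name")
            staged.setdefault(model, []).append(dict(obj))
        self.primary = staged  # commit only after the whole task validated
        self._replicate()

    def _replicate(self):
        # Stand-in for the replication stream; real replication is async.
        for dc in self.secondaries:
            self.secondaries[dc] = copy.deepcopy(self.primary)

    def read(self, dc, model, **filters):
        """Per-model query served from the secondary in the caller's DC."""
        rows = self.secondaries[dc].get(model, [])
        return [r for r in rows
                if all(r.get(k) == v for k, v in filters.items())]
```

Grouping changes into one task means a partially valid change set never reaches the primary, and reads never leave the local data center.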
Robotron’s management life cycle: Network Design -> Config Generation -> Deployment -> Monitoring, all built around the FBNet DB
Network Design Design intent -> FBNet objects
Template for a POP cluster:
    Cluster(
      devices={
        PR: DeviceSpec(
          hardware="Router_Vendor1",
          num_devices=2),
        PSW: DeviceSpec(
          hardware="Switch_Vendor2",
          num_devices=4)
      },
      link_groups=[
        LinkGroup(
          a_device=PR,
          z_device=PSW,
          pifs_per_agg=2,
          ip=V6)
      ]
    )
Resulting FBNet objects (PR 1, PR 2, PSW a, PSW b, PSW c, PSW d): BackboneRouters: 2 • NetworkSwitches: 4 • Circuits: 16 • PhysicalInterfaces: 32 • AggregatedInterfaces: 16 • V6Prefixes: 16 • BgpV6Sessions: 8 • 94 objects across 7 models
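The object counts on this slide follow from the template parameters by simple arithmetic. The sketch below reproduces them; the function name `expand_counts` and the counting rules (full mesh between the two device groups, one aggregated interface and one V6 prefix per end of each device pair, one BGP session per pair) are assumptions inferred from the numbers, not Robotron's actual expansion logic.

```python
def expand_counts(num_a, num_z, pifs_per_agg):
    """Count the objects implied by one fully meshed link group."""
    pairs = num_a * num_z                 # every PR connects to every PSW
    circuits = pairs * pifs_per_agg       # parallel circuits per device pair
    return {
        "BackboneRouters": num_a,
        "NetworkSwitches": num_z,
        "Circuits": circuits,
        "PhysicalInterfaces": circuits * 2,  # one interface at each end
        "AggregatedInterfaces": pairs * 2,   # one bundle at each end
        "V6Prefixes": pairs * 2,             # one prefix per bundle
        "BgpV6Sessions": pairs,              # one session per device pair
    }


counts = expand_counts(num_a=2, num_z=4, pifs_per_agg=2)
total = sum(counts.values())  # 2 + 4 + 16 + 32 + 16 + 16 + 8
```

With the slide's values (2 PRs, 4 PSWs, `pifs_per_agg=2`) the total comes to the 94 objects across 7 models quoted above, from a template of about a dozen lines.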
Config Generation FBNet objects -> Device configs
Step 1, vendor agnostic: the FBNet objects (PR 1, PR 2, PSW a, PSW b, PSW c, PSW d) are turned into per-device objects that follow the Config Schema:
    struct Device {
      1: list<AggregatedInterface> aggs,
    }
    struct AggregatedInterface {
      1: string name,
      2: i32 number,
      3: string v4_prefix,
      4: string v6_prefix,
      5: list<PhysicalInterface> pifs,
    }
    struct PhysicalInterface {
      1: string name,
    }
Config Generation FBNet objects -> Device configs
Step 2, vendor specific: the per-device objects are rendered through per-vendor templates (Vendor 1 and Vendor 2 each have an interface template, BGP template, MPLS template, …) into vendor-specific device configs (PR 1 config, PR 2 config, PSW a config, PSW b config, PSW c config, PSW d config). Example interface template:
    {% for agg in device.aggs %}
    interface {{agg.name}}
      mtu 9192
      no switchport
      load-interval 30
      {% if agg.v4_prefix %}
      ip addr {{agg.v4_prefix}}
      {% endif %}
      {% if agg.v6_prefix %}
      ipv6 addr {{agg.v6_prefix}}
      {% endif %}
      no shutdown
    !
    {% endfor %}
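To make the template step concrete, here is a minimal plain-Python rendering of the same interface stanza. It is a sketch, not the template engine shown on the slide: the function `render_interfaces`, the dict-based per-device object and the two-space indentation are assumptions made for illustration.

```python
def render_interfaces(aggs):
    """Render one config stanza per aggregated interface."""
    lines = []
    for agg in aggs:
        lines.append(f"interface {agg['name']}")
        lines.append("  mtu 9192")
        lines.append("  no switchport")
        lines.append("  load-interval 30")
        # Address lines are emitted only when the prefix is set,
        # mirroring the {% if %} guards in the template.
        if agg.get("v4_prefix"):
            lines.append(f"  ip addr {agg['v4_prefix']}")
        if agg.get("v6_prefix"):
            lines.append(f"  ipv6 addr {agg['v6_prefix']}")
        lines.append("  no shutdown")
        lines.append("!")
    return "\n".join(lines)


# Per-device object in the shape of the Config Schema (IPv6 only here).
device = {"aggs": [{"name": "ae0", "v6_prefix": "2001::1"}]}
config = render_interfaces(device["aggs"])
print(config)
```

Because the input is vendor agnostic, supporting another vendor would only mean swapping the rendering function (or template), not the data fed into it.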
Usage Statistics • How often does the FBNet model change? • How many FBNet objects change per design change? • How frequent and how large are config changes?
FBNet Model Changes How much does the FBNet model change over time? • Still many changes over time • Reasons: new models, values, relationships
Design Changes How many FBNet objects are changed per design change? [Figure: CDF across design changes vs. # of FBNet objects (log scale, 1 to 10,000). Two panels, POP/DC and Backbone, each with curves for All, Interface, Circuit, v6 Prefix, v4 Prefix, Device]