
mPlane: An Architecture for Scalable Fault Localization



  1. mPlane: An Architecture for Scalable Fault Localization
     Ramana Rao Kompella, Alex C. Snoeren, George Varghese
     Purdue University; University of California, San Diego
     ReARCH 2009

  2. Disruptions are costly
     Disruptions in the network are significant in their impact
     - Stringent latency and loss requirements
       - VoIP, IPTV, gaming (100-200 ms, small loss)
       - High-performance computing (tens of µs, very small loss)
     - Very tight SLAs with little room
       - Small amounts of extra delay (1 ms) can cause SLA violations

  3. Debugging ISP networks
     - ISPs use active probes to detect delay spikes or loss episodes
     - Problem: Active probes do not scale well
       - O(n^2) probes in today's networks
       - Paths share links, so many probes are redundant
     - Solutions:
       - Probe less frequently (one every few mins/secs)
       - Reduce the value of n by aggregating end-points
       - Measure between a smaller number of points and extrapolate (iPlane [Madhyastha06])
     - Drawback: these workarounds cannot detect many performance problems

  4. Localizing the root cause
     - Active probes indicate that a problem occurred, not where the problem occurred
       - Problems are delay spikes or loss episodes
     - Tomography approaches help answer this to some extent [Chen04, Duffield03]
       - Problem: the inference is under-constrained
       - Hence, manual debugging and troubleshooting
     Main question: How to perform scalable fault localization?

  5. mPlane: Basic idea
     [Figure: network diagram labeling a router and a link.]

  6. mPlane: Basic idea
     [Figure: network diagram labeling a segment.]
     - Key idea
       - Break end-to-end paths into “segments” (e.g., router forwarding paths, links)
       - High-fidelity measurements local to a segment
     - For a network with n routers and m links
       - Total number of segments = O(nd^2 + m), where d is the average degree
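To make the scaling argument concrete, here is a minimal sketch comparing the number of end-to-end probe pairs with the number of segments. It assumes that a router of degree d contributes d(d-1) forwarding-path segments (one per ordered interface pair), which is one reading of the O(nd^2) term; the function names and the uniform degree of 4 are hypothetical, and only the router count of 315 comes from the Sprint topology used later in the talk.

```python
def full_mesh_probe_pairs(n: int) -> int:
    """Ordered router pairs that a full end-to-end probe mesh must cover."""
    return n * (n - 1)


def segment_count(degrees: list[int], num_links: int) -> int:
    """Segments to monitor: forwarding paths inside routers plus links.

    Assumption: each router with degree d has d * (d - 1) forwarding paths,
    one per ordered (ingress, egress) interface pair.
    """
    forwarding_paths = sum(d * (d - 1) for d in degrees)
    return forwarding_paths + num_links


# Hypothetical network: 315 routers, every router with degree 4.
n, d = 315, 4
m = n * d // 2  # each link is shared by two routers
print(full_mesh_probe_pairs(n))    # 98910 end-to-end pairs
print(segment_count([d] * n, m))   # 4410 segments
```

Under these assumptions the segment count grows linearly in n for a fixed degree, while the probe mesh grows quadratically.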

  7. Advantages of segment approach
     - Advantage 1: Probes can be injected within a local segment at high frequency
       - Measurements are not amplified by path lengths
     - Advantage 2: Direct fault localization of end-to-end paths
       - No need for indirect approaches such as tomography
     - End-to-end active probes may still need to be issued, but at a lower frequency

  8. Architecture of mPlane
     - External component: measures properties of links across routers
     - Internal component: measures properties of all forwarding paths within routers
     - Measurements are reported periodically to the Network Operations Center (NOC)

  9. Internal component
     - Use data structures such as the Lossy Difference Aggregator (LDA) [Kompella09]
       - Reports aggregate latency measurements using a few counters at both interfaces
       - State is transmitted periodically between sender and receiver (very little overhead)
       - Can measure loss and latency in a scalable fashion
       - Typically required for each measurement equivalence class if QoS is enabled
     - mPlane itself is oblivious to LDA
       - Any data structure that can report router latency measurements works fine
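The counter-based idea can be illustrated with a heavily simplified sketch in the spirit of LDA. It is not the published data structure: it keeps a single row of buckets and omits packet sampling, and the class name, bucket count, and use of CRC32 as the hash are all assumptions made for illustration. Each side keeps a timestamp sum and a packet count per bucket; buckets whose counts match on both sides saw no loss, so their timestamp-sum difference yields an average delay.

```python
import zlib


class SimpleAggregator:
    """Per-interface state: a timestamp sum and packet count per bucket."""

    def __init__(self, num_buckets: int = 8):
        self.sums = [0.0] * num_buckets
        self.counts = [0] * num_buckets

    def _bucket(self, packet_id: int) -> int:
        # Both sides must hash a packet to the same bucket (CRC32 is an assumption).
        return zlib.crc32(packet_id.to_bytes(8, "big")) % len(self.sums)

    def record(self, packet_id: int, timestamp: float) -> None:
        b = self._bucket(packet_id)
        self.sums[b] += timestamp
        self.counts[b] += 1


def estimate(sender: SimpleAggregator, receiver: SimpleAggregator):
    """Return (average delay over loss-free buckets, number of lost packets)."""
    delay_total, usable = 0.0, 0
    for s_sum, s_cnt, r_sum, r_cnt in zip(
        sender.sums, sender.counts, receiver.sums, receiver.counts
    ):
        if s_cnt == r_cnt and s_cnt > 0:  # no loss in this bucket
            delay_total += r_sum - s_sum
            usable += s_cnt
    lost = sum(sender.counts) - sum(receiver.counts)
    average = delay_total / usable if usable else None
    return average, lost


# Hypothetical traffic: 100 packets, constant 2-unit delay, packet 7 dropped.
tx, rx = SimpleAggregator(), SimpleAggregator()
for pid in range(100):
    tx.record(pid, float(pid))
    if pid != 7:
        rx.record(pid, float(pid) + 2.0)
avg_delay, lost_packets = estimate(tx, rx)
```

Only the bucket containing the dropped packet is discarded; the remaining buckets still contribute to the delay estimate, which is why a few counters suffice.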

  10. External component
      - Measure properties of links
        - Link properties typically vary less
        - Optical re-configuration may change the delay to some extent
      - Routers periodically inject active probes to their neighbors
        - Can also piggyback on control packets exchanged between two neighboring routers
        - Example: OSPF Hello packets, time synchronization packets, etc.

  11. mPlane deployment
      - Clean-slate deployment is fairly straightforward
        - Each router reports measurements of all forwarding paths and links
        - Any end-to-end path problem can now be correlated directly with individual segment measurements to isolate the root cause
      - Fork-lift upgrade is difficult
        - How to deploy mPlane incrementally?
        - How can a subset of routers cover the entire network?
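The direct correlation step in the clean-slate case can be sketched as a lookup over the segments of a path. This is a minimal illustration, not the authors' procedure: the segment naming scheme, the baseline table, and the 1 ms threshold are all assumptions.

```python
def localize(path_segments, measured_ms, baseline_ms, threshold_ms=1.0):
    """Return (segment, excess delay) pairs on the path, worst first.

    A segment is a suspect when its reported latency exceeds its baseline
    by more than threshold_ms (the threshold value is an assumption).
    """
    suspects = []
    for segment in path_segments:
        excess = measured_ms[segment] - baseline_ms[segment]
        if excess > threshold_ms:
            suspects.append((segment, excess))
    return sorted(suspects, key=lambda item: item[1], reverse=True)


# Hypothetical path A -> B -> C, split into forwarding paths and links.
path = ["fwd:A", "link:A-B", "fwd:B", "link:B-C", "fwd:C"]
baseline = {segment: 0.5 for segment in path}
measured = dict(baseline)
measured["fwd:B"] = 5.5  # delay spike inside router B

print(localize(path, measured, baseline))  # [('fwd:B', 5.0)]
```

Because every segment reports its own measurement, no inference step is needed: the faulty segment is read off directly.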

  12. Partial deployment example
      [Figure: example topology with routers A-F, OSPF link weights, the upgraded routers marked, and a measurement server (m-server).]

  13. Self-sourced OSPF shortest paths
      [Figure: OSPF shortest path tree rooted at A over nodes F, B, C, E, D.]
      - Cut in the OSPF shortest path tree whenever an upgraded router or a leaf is encountered
      - M-set for A consists of {F, B}
      - Measuring the nodes within the sub-tree is handled by F
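The cut rule can be sketched as a shortest-path-tree traversal that stops at the first upgraded router or leaf on each branch. The ring topology and weights below are hypothetical (the example topology from the talk is not recoverable from the transcript), and the function names are illustrative.

```python
import heapq


def shortest_path_tree(graph, source):
    """Dijkstra from source; returns a map from each node to its tree children."""
    dist = {source: 0}
    parent = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, weight in graph[u].items():
            candidate = d + weight
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                parent[v] = u
                heapq.heappush(heap, (candidate, v))
    children = {node: [] for node in dist}
    for child, par in parent.items():
        children[par].append(child)
    return children


def m_set(graph, upgraded, source):
    """Cut each branch at the first upgraded router or leaf below source."""
    children = shortest_path_tree(graph, source)
    result = set()
    stack = list(children[source])
    while stack:
        node = stack.pop()
        if node in upgraded or not children[node]:
            result.add(node)  # cut here: this node is a probe target
        else:
            stack.extend(children[node])  # keep descending
    return result


# Hypothetical ring topology with OSPF-style weights.
graph = {
    "A": {"B": 1, "F": 2},
    "B": {"A": 1, "C": 3},
    "C": {"B": 3, "D": 2},
    "D": {"C": 2, "E": 1},
    "E": {"D": 1, "F": 2},
    "F": {"A": 2, "E": 2},
}
print(sorted(m_set(graph, {"B", "F"}, "A")))  # ['B', 'F']
```

In this hypothetical graph, A probes only its upgraded neighbors B and F, and each of those is responsible for the nodes further down its own branch.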

  14. Evaluating the benefits
      - Two main metrics
        - Probe Hop Count: Sum of all hops taken by every active probe in the network
        - Localization Granularity: Average segment size in number of hops
      - Upgrade strategy involves picking the right routers for upgrade
        - Naïve strategy: pick routers at random
        - Intelligent strategy: pick routers that decrease the probe hop count and localization granularity the most

  15. Intelligent upgrade strategy
      - Greedily pick routers that lie on the largest number of shortest paths
        - Benefits localization granularity by reducing the length of most segments
        - Benefits probe hop count since the largest number of paths are shortened
      - Not necessarily optimal, but greedy works much better than random, as our evaluation shows
        - An LP formulation should be possible
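One possible reading of the greedy strategy is sketched below: repeatedly upgrade the router that is interior to the most current segments, then split those segments at it. The talk does not spell out the exact selection rule, so the re-counting after each pick, the alphabetical tie-break, and the sample paths are assumptions; duplicate segments are kept for simplicity.

```python
from collections import Counter


def split_at(segment, router):
    """Split a hop sequence at an interior router; otherwise leave it whole."""
    if router in segment[1:-1]:
        i = segment.index(router)
        return [segment[: i + 1], segment[i:]]
    return [segment]


def greedy_upgrade(paths, k):
    """Pick up to k routers, each time the one interior to the most segments."""
    segments = [list(p) for p in paths]
    chosen = []
    for _ in range(k):
        counts = Counter(r for seg in segments for r in seg[1:-1])
        if not counts:
            break  # every segment is already a single hop
        best = max(sorted(counts), key=counts.__getitem__)  # ties: alphabetical
        chosen.append(best)
        segments = [piece for seg in segments for piece in split_at(seg, best)]
    return chosen, segments


def granularity(segments):
    """Localization granularity: average segment length in hops."""
    return sum(len(seg) - 1 for seg in segments) / len(segments)


# Hypothetical shortest paths, each given as a router sequence.
paths = [["A", "B", "C", "D"], ["E", "B", "C", "F"], ["A", "B", "G"]]
chosen, segments = greedy_upgrade(paths, 2)
print(chosen)                 # ['B', 'C']
print(granularity(segments))  # 1.0
```

In this toy input, upgrading B first splits all three paths, and upgrading C next reduces every segment to a single hop.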

  16. Benefit in bandwidth reduction (Sprint Rocketfuel topology, 315 routers)
      [Figure: plot of probe hop count for the upgrade strategies.]
      - Random falls off relatively slowly
      - Upgrading just 40 routers reduces probe hop count by 2 orders of magnitude

  17. Localization granularity (segment size in number of hops)
      [Figure: localization granularity (y-axis, 1 to 5) versus number of upgraded routers (x-axis, 0 to 350) for the Intelligent (avg) and Random (avg) strategies.]
      - Upgrading just 50 routers reduces localization granularity to about 1.5

  18. Summary
      - Proposed mPlane for direct and scalable fault localization
      - Key idea is to break end-to-end paths into segments and monitor them with high fidelity
      - Partial deployment uses OSPF shortest path trees to determine which upgraded routers to probe
      - Benefits of an intelligent upgrade strategy
        - 100x reduction in bandwidth
        - Localization granularity of 1.5
        - With only 15% of routers upgraded

  19. Thanks! Questions…

  20. Other details
      - Routers advertise the measurement capability using one reserved bit within the options field of HELLO messages
      - M-sets may change with OSPF LSAs as shortest paths change during link failures
      - During periods of churn, measurements need to be conducted to both old and new m-sets
      - ECMP is handled by routers issuing separate probes through separate paths
