
mPlane: An Architecture for Scalable Fault Localization
Ramana Rao Kompella, Alex C. Snoeren, George Varghese
Purdue University; University of California, San Diego
ReARCH 2009

Disruptions are costly
- Network disruptions have a significant impact
- Stringent latency and loss requirements
  - VoIP, IPTV, gaming (100-200 ms, small loss)
  - High-performance computing (tens of µs, very small loss)
- Very tight SLAs with little room
  - Small amounts of extra delay (1 ms) can cause SLA violations

Debugging ISP networks
- ISPs use active probes to detect delay spikes or loss episodes
- Problem: active probes do not scale well
  - O(n^2) probes in today's networks
  - Paths share links, so many probes are redundant
- Solutions:
  - Probe less frequently (one every few minutes or seconds)
  - Reduce the value of n by aggregating end-points
  - Measure between a smaller number of points and extrapolate (iPlane [Madhyastha06])
- These workarounds cannot detect many performance problems

Localizing the root cause
- Active probes indicate that a problem occurred, not where it occurred
  - Problems are delay spikes or loss episodes
- Tomography approaches help answer this to some extent [Chen04, Duffield03]
  - Problem: the inference is under-constrained
- Hence, manual debugging and troubleshooting
Main question: How to perform scalable fault localization?

mPlane: Basic idea
[Figure: network of routers connected by links]

mPlane: Basic idea
[Figure: network diagram with a segment marked]
- Key idea
  - Break end-to-end paths into "segments" (e.g., router forwarding paths, links)
  - Take high-fidelity measurements local to each segment
- For a network with n routers and m links, the total number of segments is O(nd^2 + m), where d is the average degree
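The scaling argument can be made concrete with a small counting sketch. It assumes a router-internal segment is an ordered pair of distinct interfaces and that each link counts once; the slides give only the O(nd^2 + m) bound, so these conventions, the function names, and the ring topology are illustrative assumptions.

```python
def count_segments(adjacency):
    """Router-internal forwarding paths plus links.

    Assumption: one internal segment per ordered pair of distinct
    interfaces on a router, and one segment per (undirected) link.
    """
    internal = sum(len(nbrs) * (len(nbrs) - 1) for nbrs in adjacency.values())
    links = sum(len(nbrs) for nbrs in adjacency.values()) // 2
    return internal + links


def count_end_to_end_pairs(adjacency):
    """Ordered (source, destination) pairs a full mesh of probes would cover."""
    n = len(adjacency)
    return n * (n - 1)


def ring(n):
    """Hypothetical ring topology: every router has degree 2."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


for n in (6, 100):
    topo = ring(n)
    print(n, count_segments(topo), count_end_to_end_pairs(topo))
# prints:
# 6 18 30
# 100 300 9900
```

With the degree held fixed, the segment count grows linearly in the number of routers while the number of end-to-end pairs grows quadratically.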

Advantages of segment approach
- Advantage 1: Probes can be injected within a local segment at high frequency
  - Measurements are not amplified by path lengths
- Advantage 2: Direct fault localization of end-to-end paths
  - No need for indirect approaches such as tomography
- End-to-end active probes may still need to be issued, but at a lower frequency
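Direct localization can be sketched as a lookup over per-segment measurements along the affected path. The segment names, the latency values, and the 1 ms threshold below are hypothetical; the slides do not specify how a segment is judged faulty.

```python
def localize_fault(path_segments, measured_ms, baseline_ms, threshold_ms=1.0):
    """Return the segments on a path whose latency exceeds its baseline.

    threshold_ms is an assumed tolerance, not a value from the talk.
    """
    return [
        seg
        for seg in path_segments
        if measured_ms[seg] - baseline_ms[seg] > threshold_ms
    ]


# Hypothetical path: alternating router-internal segments and links.
path = ["A:if1->if2", "A-B", "B:if1->if3", "B-C"]
baseline = {"A:if1->if2": 0.05, "A-B": 2.0, "B:if1->if3": 0.05, "B-C": 3.0}
measured = {"A:if1->if2": 0.06, "A-B": 2.1, "B:if1->if3": 4.50, "B-C": 3.0}

print(localize_fault(path, measured, baseline))
# prints: ['B:if1->if3']
```

No inference step is needed: the end-to-end delay spike maps straight to the one segment whose local measurement moved.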

Architecture of mPlane
- External component: measures properties of links across routers
- Internal component: measures properties of all forwarding paths within routers
- Measurements are reported periodically to the Network Operations Center (NOC)

Internal component
- Use data structures such as the Lossy Difference Aggregator (LDA) [Kompella09]
  - Reports aggregate latency measurements using a few counters at both interfaces
  - State is periodically transmitted between sender and receiver (very little overhead)
  - Can measure loss and latency in a scalable fashion
  - Typically required for each measurement equivalence class if QoS is enabled
- mPlane itself is oblivious to LDA
  - Any data structure that can report router latency measurements works fine
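A heavily simplified sketch of the counter idea behind an LDA-style structure follows. It assumes a single bank of buckets, a CRC32 hash shared by both interfaces, and integer timestamps; the published LDA also samples packets across multiple banks, which is omitted here. The class and function names are illustrative, not an actual router API.

```python
import zlib


class LdaSide:
    """Per-interface state: a timestamp sum and a packet count per bucket."""

    def __init__(self, num_buckets=8):
        self.ts_sum = [0] * num_buckets
        self.count = [0] * num_buckets

    def record(self, packet_id: bytes, timestamp: int):
        # Both sides hash the same packet bytes, so they pick the same bucket.
        bucket = zlib.crc32(packet_id) % len(self.count)
        self.ts_sum[bucket] += timestamp
        self.count[bucket] += 1


def estimate(sender, receiver):
    """Return (average delay over loss-free buckets, packets lost)."""
    usable_packets = 0
    delay_sum = 0
    for b in range(len(sender.count)):
        # A bucket is usable only if no packet hashed to it was lost.
        if sender.count[b] == receiver.count[b] and sender.count[b] > 0:
            usable_packets += sender.count[b]
            delay_sum += receiver.ts_sum[b] - sender.ts_sum[b]
    lost = sum(sender.count) - sum(receiver.count)
    average = delay_sum / usable_packets if usable_packets else None
    return average, lost


# Hypothetical traffic: 100 packets, 5 time units of delay each, one lost.
tx, rx = LdaSide(), LdaSide()
for i in range(100):
    pkt = f"pkt-{i}".encode()
    tx.record(pkt, i * 10)
    if i != 42:  # packet 42 is dropped inside the router
        rx.record(pkt, i * 10 + 5)

average_delay, lost_packets = estimate(tx, rx)
print(average_delay, lost_packets)
```

Only the bucket containing the dropped packet is discarded; every other bucket still contributes to the average, which is why a few counters suffice even under loss.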

External component
- Measure properties of links
  - Link properties typically vary less
  - Optical re-configuration may change the delay to some extent
- Routers periodically inject active probes to the neighbor
  - Can also piggyback on control packets exchanged between two neighboring routers
  - Example: OSPF Hello packets, time synchronization packets, etc.

mPlane deployment
- Clean-slate deployment is fairly straightforward
  - Each router reports measurements of all forwarding paths and links
  - Any end-to-end path problem can now be correlated directly with individual segment measurements to isolate the root cause
- Fork-lift upgrade is difficult
  - How to deploy mPlane incrementally?
  - How can a subset of routers cover the entire network?

Partial deployment example
[Figure: example topology of routers A to F annotated with OSPF weights; some routers are marked as upgraded, and a measurement server (m-server) is attached]

Self-sourced OSPF shortest paths
[Figure: OSPF shortest path tree rooted at A, covering nodes B, C, D, E, and F]
- The M-set for A consists of {F, B}
- Cut in the OSPF shortest path tree whenever an upgraded router or a leaf is encountered
- Measuring the nodes within the sub-tree is handled by F
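The cut rule can be sketched as Dijkstra followed by a tree walk that stops at the first upgraded router or leaf on each branch. The topology and weights below are hypothetical (the figure's actual link weights are not recoverable from the slide text); they were chosen so that the M-set for A comes out as {F, B}. Equal-cost ties are broken arbitrarily, so ECMP is not modeled.

```python
import heapq


def shortest_path_tree(graph, source):
    """Dijkstra from source; returns each node's children in the tree."""
    dist = {source: 0}
    parent = {}
    heap = [(0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, weight in graph[u].items():
            nd = d + weight
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    children = {u: [] for u in graph}
    for v, u in parent.items():
        children[u].append(v)
    return children


def m_set(graph, source, upgraded):
    """Cut each branch at the first upgraded router or leaf encountered."""
    children = shortest_path_tree(graph, source)
    result = set()
    stack = list(children[source])
    while stack:
        node = stack.pop()
        if node in upgraded or not children[node]:
            result.add(node)  # cut here; do not descend further
        else:
            stack.extend(children[node])
    return result


# Hypothetical topology and OSPF weights (not the ones in the figure).
topology = {
    "A": {"B": 1, "F": 1},
    "B": {"A": 1, "C": 3},
    "C": {"B": 3, "D": 2, "F": 2},
    "D": {"C": 2},
    "E": {"F": 2},
    "F": {"A": 1, "C": 2, "E": 2},
}
upgraded = {"A", "F"}

print(sorted(m_set(topology, "A", upgraded)))  # prints: ['B', 'F']
print(sorted(m_set(topology, "F", upgraded)))  # prints: ['A', 'D', 'E']
```

In this sketch A probes only F and B, and the routers below F in A's tree are left to F's own M-set.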

Evaluating the benefits
- Two main metrics
  - Probe hop count: sum of all hops taken by every active probe in the network
  - Localization granularity: average segment size in number of hops
- Upgrade strategy involves picking the right routers for upgrade
  - Naïve strategy: pick routers at random
  - Intelligent strategy: pick routers that decrease the probe hop count and localization granularity the most
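Both metrics reduce to simple sums once probe paths and segments are written as router sequences. The sketch below assumes each probe or segment is a list of routers, so its hop count is one less than its length; the sample paths are hypothetical.

```python
def probe_hop_count(probe_paths):
    """Sum of hops taken by every active probe."""
    return sum(len(path) - 1 for path in probe_paths)


def localization_granularity(segments):
    """Average segment size in number of hops."""
    return sum(len(seg) - 1 for seg in segments) / len(segments)


# Hypothetical probes issued by upgraded routers A and F.
probes = [["A", "F"], ["A", "B"], ["F", "C", "D"], ["F", "E"]]

print(probe_hop_count(probes))           # prints: 5
print(localization_granularity(probes))  # prints: 1.25
```

A granularity of 1 would mean every fault is pinned to a single hop; larger values mean a reported problem still spans several hops.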

Intelligent upgrade strategy
- Greedily pick routers that are present on the largest number of shortest paths
  - Benefits localization granularity by reducing the length of most segments
  - Benefits probe hop count since the largest number of paths are shortened
- Not necessarily optimal, but greedy works much better than random, as our evaluation shows
  - An LP formulation should be possible
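One way to read the selection rule is as a ranking of routers by how many shortest paths pass through them. The sketch below assumes unit link weights (so BFS yields shortest paths), keeps one shortest path per ordered pair, and ranks once rather than re-evaluating after each pick; the slides do not say which variant the authors used, and the topology is hypothetical.

```python
from collections import Counter, deque


def bfs_parents(graph, source):
    """Parent pointers of one shortest (fewest-hop) path tree from source."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in sorted(graph[u]):  # sorted for deterministic tie-breaking
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent


def shortest_path_counts(graph):
    """For each router, count the shortest paths on which it is an interior node."""
    counts = Counter({u: 0 for u in graph})
    for src in graph:
        parent = bfs_parents(graph, src)
        for dst in parent:
            node = parent[dst]
            while node is not None and node != src:
                counts[node] += 1
                node = parent[node]
    return counts


def pick_upgrades(graph, k):
    """Pick the k routers that lie on the most shortest paths."""
    counts = shortest_path_counts(graph)
    return sorted(graph, key=lambda u: (-counts[u], u))[:k]


# Hypothetical topology: hub H with spokes A, B, C, and D hanging off C.
graph = {
    "H": {"A", "B", "C"},
    "A": {"H"},
    "B": {"H"},
    "C": {"H", "D"},
    "D": {"C"},
}

print(pick_upgrades(graph, 2))  # prints: ['H', 'C']
```

H is interior to 10 ordered shortest paths and C to 6, so both rank ahead of the leaf routers, which are interior to none.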

Benefit in bandwidth reduction (Sprint Rocketfuel topology, 315 routers)
- Random falls off relatively slowly
- Upgrading just 40 routers reduces probe hop count by 2 orders of magnitude

Localization granularity (segment size in number of hops)
[Figure: localization granularity (1 to 5) versus number of upgraded routers (0 to 350), for the Intelligent (avg) and Random (avg) strategies]
- Upgrading just 50 routers reduces localization granularity to about 1.5

Summary
- Proposed mPlane for direct and scalable fault localization
  - Key idea is to break end-to-end paths into segments and monitor them with high fidelity
  - Partial deployment uses OSPF shortest path trees to determine which upgraded routers to probe
- Benefits of an intelligent upgrade strategy
  - 100x reduction in bandwidth
  - Localization granularity of 1.5
  - With only 15% of routers upgraded

Thanks! Questions…

Other details
- Routers advertise the measurement capability using one reserved bit within the options field of HELLO messages
- M-sets may change with OSPF LSAs as shortest paths change during link failures
  - During periods of churn, measurements must be conducted to both old and new M-sets
- ECMP is handled by routers issuing separate probes through separate paths
