
Analysis of link failures in an IP backbone network



  1. Analysis of link failures in an IP backbone network
     Gianluca Iannaccone, Sprint ATL
     Joint work with: Chen-Nee Chuah, UC Davis; Richard Mortier, Microsoft; Supratik Bhattacharyya, Sprint ATL; Christophe Diot, Sprint ATL

  2. Motivation
     • Today’s Service Level Agreements:
       – Performance in terms of delay and packet loss
       – Availability in terms of “port availability”
     • Need to introduce a “service availability” metric:
       – Would allow VoIP/VPN services to be compared with standard telephone networks
     Question: “How often does a router have no forwarding information for any given destination prefix?”
     Internet Measurement Workshop, November 7th, 2002

  3. Methodology
     • Frequency and duration of link failures
       – Recorded IS-IS routing updates
       – Python Rout(e)ing Toolkit to listen to failures
       – 4 months of data (Dec 2001 – Mar 2002)
       – U.S. inter-PoP links
       – Failures less than 24hrs long
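
The methodology above can be sketched as pairing each link-down event with the next link-up event on the same link, then keeping only failures shorter than 24 hours. This is a minimal sketch, not the authors' tooling: the `(timestamp, link, state)` tuple format, the `Failure` record, and the `extract_failures` helper are all hypothetical, and real IS-IS updates would first have to be decoded into such events.

```python
from typing import Iterable, List, NamedTuple, Tuple

DAY_SECONDS = 24 * 3600  # the slide keeps only failures shorter than 24 hours


class Failure(NamedTuple):
    link: str        # hypothetical link identifier, e.g. "PoP-A:PoP-B"
    start: float     # time the link went down (seconds)
    duration: float  # seconds until the link came back up


def extract_failures(
    events: Iterable[Tuple[float, str, str]],
    max_duration: float = DAY_SECONDS,
) -> List[Failure]:
    """Pair 'down' and 'up' events per link (assumed event format)."""
    down_since = {}
    failures = []
    for ts, link, state in sorted(events):
        if state == "down":
            # Keep the earliest 'down' if several arrive before an 'up'.
            down_since.setdefault(link, ts)
        elif state == "up" and link in down_since:
            start = down_since.pop(link)
            if ts - start < max_duration:
                failures.append(Failure(link, start, ts - start))
    return failures


events = [
    (0.0, "A-B", "down"),
    (30.0, "A-B", "up"),
    (100.0, "C-D", "down"),
    (100.0 + 90000.0, "C-D", "up"),  # longer than 24 hours, so it is dropped
]
failures = extract_failures(events)
print(failures)
```

Links that are still down at the end of the trace are ignored here; whether the original study treated them the same way is not stated on the slide.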

  4. Network-wide Time Between Failures
     Average: ~ 34min
     50%: ~ 3min
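
To illustrate how an average can sit far above the 50% point, the sketch below computes the gaps between consecutive failure start times and then their mean and median. The timestamps are made-up illustration values, not the measured data, and `inter_failure_gaps` is a hypothetical helper name.

```python
from statistics import mean, median


def inter_failure_gaps(failure_starts):
    """Return the time between consecutive failures, network-wide."""
    ordered = sorted(failure_starts)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Illustrative start times in seconds: three failures close together,
# then a long quiet period before the fourth.
starts = [0, 60, 180, 3780]
gaps = inter_failure_gaps(starts)

print("gaps:", gaps)
print("mean:", mean(gaps))
print("median:", median(gaps))
```

A few long quiet periods pull the mean up while the median stays small, which is the same shape as the ~34 min average and ~3 min 50% point reported on the slide.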

  5. Breakdown by time of the day (EDT)
     Higher incidence of failures at night. Likely due to maintenance.
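
A breakdown like this one can be sketched by bucketing failure start times into 24 hourly bins. The sketch assumes timestamps are Unix epoch seconds and models EDT as a fixed UTC-4 offset; both are assumptions for illustration, and `failures_per_hour` is a hypothetical helper.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: EDT treated as a fixed UTC-4 offset, with no daylight-saving logic.
EDT = timezone(timedelta(hours=-4), "EDT")


def failures_per_hour(epoch_seconds):
    """Count failures per local hour of day; returns a list of 24 counts."""
    hours = (datetime.fromtimestamp(ts, tz=EDT).hour for ts in epoch_seconds)
    counts = Counter(hours)
    return [counts.get(hour, 0) for hour in range(24)]


# Illustrative timestamps only: 00:00, 01:00 and 04:00 UTC on 1970-01-01.
histogram = failures_per_hour([0, 3600, 4 * 3600])
for hour, count in enumerate(histogram):
    if count:
        print(f"{hour:02d}:00 EDT -> {count} failure(s)")
```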

  6. Causes of failures
     • Duration may give a hint
     • Some speculations:
       – Long (>1hour): fiber cuts, severe failures
       – Medium (>10min): router/line card failures
       – Short (>1min): line card resets
       – Very Short (<1min): optical equipment
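
The duration classes above map naturally onto a small threshold function. The sketch below is one possible reading: the class names follow the slide, but the handling of a failure lasting exactly one minute (placed in "very short" here) is an assumption, since the slide leaves that boundary open. The causes remain the slide's speculations, not established facts.

```python
# Speculated causes, copied from the slide.
SPECULATED_CAUSE = {
    "long": "fiber cuts, severe failures",
    "medium": "router/line card failures",
    "short": "line card resets",
    "very short": "optical equipment",
}


def classify_duration(seconds):
    """Map a failure duration in seconds to the slide's duration class."""
    if seconds > 3600:   # longer than 1 hour
        return "long"
    if seconds > 600:    # longer than 10 minutes
        return "medium"
    if seconds > 60:     # longer than 1 minute
        return "short"
    return "very short"  # assumption: exactly 60 s also lands here


for duration in (20, 90, 1200, 7200):
    label = classify_duration(duration)
    print(f"{duration:>5} s -> {label}: {SPECULATED_CAUSE[label]}")
```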

  7. Does the duration give any hint?
     ~ 94% < 1hr
     ~ 80% < 10min
     ~ 50% < 1min
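
Percentages of this kind are points on an empirical distribution of failure durations. The sketch below shows how they could be computed; the sample durations are invented for illustration and do not reproduce the measured figures, and `fraction_shorter_than` is a hypothetical helper.

```python
def fraction_shorter_than(durations, threshold):
    """Fraction of failures whose duration is strictly below the threshold."""
    if not durations:
        raise ValueError("need at least one duration")
    return sum(d < threshold for d in durations) / len(durations)


# Invented durations in seconds, for illustration only.
sample = [5, 30, 45, 50, 120, 300, 400, 500, 1800, 7200]

for label, threshold in (("1min", 60), ("10min", 600), ("1hr", 3600)):
    share = fraction_shorter_than(sample, threshold)
    print(f"{share:.0%} < {label}")
```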

  8. Controlled failure experiment

  9. Impact of a failure: 7 steps to re-route traffic
     1. Detect link down: <100ms
     2. Wait to filter out transient flaps: 2s
     3. Wait before sending update out: 50ms
     4. Processing & flooding the update: ~10ms/hop
     5. Wait before computing SPF: 5.5s
     6. Compute shortest paths: 100-400 ms
        → exp. protocol convergence: 5.1s / 5.9s
     7. Update the routing tables: ~20 pfx/ms
        → exp. service convergence: 1.5s / 2.1s
     → exp. total disruption: 6.6s / 8.0s
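
Two pieces of arithmetic on this slide can be checked directly: the total disruption equals the protocol convergence plus the service convergence, and the ~20 pfx/ms update rate converts between a prefix count and a table update time. The sketch below works in integer milliseconds. Treating the whole service convergence time as table updating is an assumption made here for illustration, and the function names are hypothetical.

```python
PREFIXES_PER_MS = 20  # routing table update rate from step 7


def total_disruption_ms(protocol_ms, service_ms):
    """Total disruption as protocol convergence plus service convergence."""
    return protocol_ms + service_ms


def table_update_ms(prefix_count, rate=PREFIXES_PER_MS):
    """Time to update the routing tables for a given number of prefixes."""
    return prefix_count / rate


def prefixes_updated(duration_ms, rate=PREFIXES_PER_MS):
    """Prefixes that can be updated in a given time at the stated rate."""
    return duration_ms * rate


# The two value pairs reported on the slide, in milliseconds.
for protocol_ms, service_ms in ((5100, 1500), (5900, 2100)):
    total = total_disruption_ms(protocol_ms, service_ms)
    implied = prefixes_updated(service_ms)
    print(f"total {total} ms; {service_ms} ms of updates ~ {implied} prefixes")
```

Under that assumption, 1.5 s and 2.1 s of updating would correspond to roughly 30,000 and 42,000 prefixes; the slide itself does not state prefix counts.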

  10. Conclusion
      • Link failures are part of everyday operations
      • Majority of failures are short-lived
      • Disruption in packet forwarding depends on
        – routing protocol dynamics and implementation
        – router architecture
        – too many timers and interactions among different components
      • Need to develop link failure model:
        – define IP service availability
        – need more points (4 months are not enough)
