California Fault Lines: Understanding the Causes and Impact of Network Failures


  1. Internet Measurement. Huaiyu Zhu, Rim Kaddah. CS538, Fall 2011.

  2. OUTLINE • California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. • A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.

  3. OUTLINE • California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. • A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.

  4. Why Study Failure • Failure is a reality for large networks • Achieving high availability requires engineering the network to be robust to failure • Designing mechanisms that effectively mitigate failures requires a deep understanding of real failures

  5. CENIC Network • Serving California educational institutions • Over 200 routers • 5 years of data • Three Types of Components: ◦ The Digital California (DC) network ◦ The High-Performance Research (HPR) network ◦ Customer-premises equipment (CPE)

  6. Contribution • A methodology to reconstruct historical failure events in the CENIC network • Uses only commonly available data; no additional instrumentation is needed • An analysis of the network based on the failure measurements

  7. Reconstruction • What data are available to reconstruct a failure 4 years later? ◦ Syslog: describes interface state changes ◦ Router configuration files: map interfaces to links ◦ Operational announcements on the mailing list • These data were not intended for failure reconstruction!
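As a rough illustration of the reconstruction idea on this slide, the sketch below pairs interface "down" and "up" state changes into per-link failure intervals. The record layout, router names, interface names, and link names are all hypothetical; real syslog parsing and the configuration-based interface-to-link mapping are considerably messier than this.

```python
from datetime import datetime

# Hypothetical, already-parsed syslog records: (timestamp, router, interface, new_state).
events = [
    (datetime(2008, 1, 1, 10, 0, 0), "rtr-a", "ge-0/0/1", "down"),
    (datetime(2008, 1, 1, 10, 0, 42), "rtr-a", "ge-0/0/1", "up"),
    (datetime(2008, 1, 2, 3, 15, 0), "rtr-b", "ge-1/0/0", "down"),
    (datetime(2008, 1, 2, 3, 20, 0), "rtr-b", "ge-1/0/0", "up"),
]

# Hypothetical mapping derived from router configuration files: interface -> link.
interface_to_link = {
    ("rtr-a", "ge-0/0/1"): "link-1",
    ("rtr-b", "ge-1/0/0"): "link-2",
}


def reconstruct_failures(events, interface_to_link):
    """Pair each 'down' with the next 'up' on the same link; return (link, start, end)."""
    down_since = {}
    failures = []
    for ts, router, iface, state in sorted(events):
        link = interface_to_link.get((router, iface))
        if link is None:
            continue  # interface is not mapped to a known link
        if state == "down":
            down_since.setdefault(link, ts)  # keep the earliest unmatched 'down'
        elif state == "up" and link in down_since:
            failures.append((link, down_since.pop(link), ts))
    return failures


failures = reconstruct_failures(events, interface_to_link)
for link, start, end in failures:
    print(link, start.isoformat(), (end - start).total_seconds())
```

A "down" with no later "up" is simply left unmatched here; how to treat such cases (lost messages, still-failed links) is a judgment call the sketch does not attempt.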

  8. Validation • Internal consistency ◦ Using the administrator announcements to validate the reconstructed event history. • External consistency ◦ CAIDA Skitter project (now Ark), validating UP. ◦ Route Views project, validating DOWN.

  9. Overview of Link Failures

  10. Overview of Link Failures

  11. Overview of Link Failures • Vertical banding ◦ V1: a network-wide IS-IS configuration change requiring a router restart ◦ V2: a network-wide software upgrade ◦ V3: a network-wide configuration change in preparation for IPv6 • Horizontal banding ◦ H1: a series of failures on a link between a core router and a County of Education office (hardware) ◦ H2: this link experienced over 33,000 short-duration failures (fiber cut)

  12. CDFs of Individual Failure Events

  13. Various Link Hardware Types

  14. Cause of Failure

  15. Failure Events

  16. Summary • Engineering for failure requires real data ◦ Data has historically been difficult to obtain • Methodology to perform historical failure analysis with low-quality data sources • Shared our findings in the CENIC network ◦ Reliability of individual components ◦ Causes of failures ◦ Impact of failure

  17. OUTLINE • California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. • A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.

  18. Key Questions • How can routing events degrade end-to-end path performance? • How do topological properties and routing policies affect performance degradation?

  19. Approach • Study end-to-end performance under realistic topologies. • Investigate several metrics to characterize the end-to-end loss, delay, and out-of-order packets. • Characterize the kinds of routing changes that impact end-to-end path performance. • Analyze the impact of topology, routing policies, MRAI timer and iBGP configurations on end-to-end path performance.

  20. Experiment Methodology • A multi-homed prefix ◦ BGP Beacon prefix: 192.83.230.0/24 • Controlled routing changes ◦ Failover events: the Beacon changes from being connected to both providers to being connected to a single provider. ◦ Recovery events: the Beacon changes from being connected to a single provider to being connected to both providers. [Figure: Beacon connectivity to ISP 1 and ISP 2 across a failover event and a recovery event]

  21. Controlled Routing Changes • 12 routing events every day ◦ 8 Beacon events (failover events and recovery events) ◦ 4 for resetting the Beacon connectivity [Table: time schedule (GMT) for BGP Beacon routing transitions]

  22. Active Probing • Goal: capture the impact of routing changes on end-to-end performance. • From 37 PlanetLab hosts to the Beacon host (a host within the Beacon prefix). • Three probing methods: ◦ Back-to-back traceroutes ◦ Back-to-back pings ◦ UDP probing (50 msec interval) [Figure: hosts A, B, and C probing the Beacon host across the Internet via ISP 1 and ISP 2] [Table: data plane performance metrics (packet loss, delay, out-of-order) by active probing method (traceroute, ping, UDP probing)]

  23. Packet Loss • Loss burst: consecutive UDP probing packets lost during a routing change event. [Figure panels: Failover, Recovery]
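To make the loss-burst definition above concrete, here is a minimal sketch that extracts burst lengths from the sequence numbers of the UDP probes that did arrive. The function name and the sample data are hypothetical and only follow the slide's wording, not the authors' tooling.

```python
def loss_bursts(sent_count, received_seqs):
    """Return the lengths of runs of consecutive lost probes, in sequence order."""
    received = set(received_seqs)
    bursts = []
    run = 0
    for seq in range(sent_count):
        if seq in received:
            if run:
                bursts.append(run)  # a run of losses just ended
            run = 0
        else:
            run += 1
    if run:
        bursts.append(run)  # losses that continue to the end of the trace
    return bursts


# Hypothetical trace: 10 probes sent (seq 0..9); probes 3, 4, 5 and 8 never arrived.
bursts = loss_bursts(10, [0, 1, 2, 6, 7, 9])
print(bursts)
```

With the 50 msec probing interval mentioned earlier in the deck, a burst of n lost probes corresponds to roughly n × 50 msec of unreachability.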

  24. Packet Delay • Round-trip delays from the probe host to the Beacon host (one-way delays suffer from clock skew). [Figure panels: Failover, Recovery]

  25. Out-of-order Packets • Number of reorderings (number of packets out of order) • Reordering offset [Figure panels: Recovery, Failover]
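The two reordering metrics above can be sketched as follows. This assumes a packet counts as out of order when it arrives after a packet with a higher sequence number, and takes the offset to be the gap between the highest sequence number seen so far and the late packet's own sequence number. Both definitions are assumptions made for illustration; the paper's exact definitions may differ.

```python
def reordering_stats(arrival_seqs):
    """Return (number of out-of-order packets, list of their offsets)."""
    highest = -1
    offsets = []
    for seq in arrival_seqs:
        if seq < highest:
            # Late packet: a higher sequence number has already arrived.
            offsets.append(highest - seq)
        else:
            highest = seq
    return len(offsets), offsets


# Hypothetical arrival order of probe sequence numbers.
count, offsets = reordering_stats([0, 1, 3, 2, 4, 6, 7, 5])
print(count, offsets)
```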

  26. How Routing Failures Occur (Failover)? • Prefer-customer routing policy: routes received from a provider's customers are always preferred over those received from its peers. [Figure: Provider 1 and Provider 2 joined by a peer link, routers R1 to R6 with their path labels, and customer links to Beacon AS 0]
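The prefer-customer policy above amounts to ranking routes by the business relationship of the neighbor they were learned from before looking at path length. The sketch below is a simplified illustration: the ranking of provider routes below peer routes is a common convention assumed here rather than something the slide states, and the AS numbers are hypothetical.

```python
# Lower rank means more preferred. Customer over peer is from the slide;
# placing provider last is an assumption based on common practice.
PREFERENCE = {"customer": 0, "peer": 1, "provider": 2}


def best_route(routes):
    """Pick the best (relationship, as_path) pair: relationship first, then shortest path."""
    return min(routes, key=lambda r: (PREFERENCE[r[0]], len(r[1])))


# Hypothetical candidates for the Beacon prefix (AS 0 is the Beacon AS).
routes = [
    ("peer", (1, 0)),         # shorter path, learned from a peer
    ("customer", (3, 2, 0)),  # longer path, learned from a customer
]
best = best_route(routes)
print(best)
```

Note that the customer route wins even though its AS path is longer, which is why a router may keep pointing at a customer path that is about to fail.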

  27. How Routing Failures Occur (Failover)? (contd.) • No-valley routing policy: peers do not transit traffic from one peer to another. [Figure: Provider 1, Provider 2, and Provider 3 connected by peer links, routers R1 to R9 with their path labels, and Beacon AS 0]
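The no-valley policy above can be expressed as an export filter. The slide only states the peer-to-peer case; the sketch below assumes the usual generalization (routes learned from peers or providers are announced only to customers), and the function name is hypothetical.

```python
def may_export(learned_from, export_to):
    """True if a route learned from `learned_from` may be announced to `export_to`.

    Both arguments are one of "customer", "peer", or "provider".
    Assumed rule: customer routes go to everyone; routes learned from
    peers or providers go only to customers.
    """
    return learned_from == "customer" or export_to == "customer"


print(may_export("peer", "peer"))      # the case named on the slide
print(may_export("customer", "peer"))
```

Under this rule, when a router's only remaining route comes from a peer, its other peers hear nothing from it, which is one way a router can be left without a route during failover.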

  28. How Routing Failures Occur? (Recovery) • iBGP constraint: a route received from an iBGP router cannot be passed on to another iBGP router. 1. Path 0 ⇒ R3 recovery. 2. R3 sends the path to R2. 3. R2 sends a withdrawal to R1. 4. R3 sends the recovery path to R1. 5. R1 regains its connection to the Beacon. [Figure: Provider 1 and Provider 2 with routers R1 to R4, the messages Withdraw (2 0) and Path (0), and Beacon AS 0]

  29. Summary • During failover and recovery events: ◦ Routing events significantly affect packet loss. ◦ Routing failures contribute significantly to end-to-end packet loss. ◦ Routing events can lead to long packet round-trip delays and reordering. • Routing policies and iBGP configuration play a major role in causing packet loss during routing events.

  30. Discussion • How could we prevent packet loss during path exploration? Would storing an alternative path in each router be a good idea? What are the downsides? • How could we exploit the previous results to improve end-to-end performance? • How realistic is the topology used in the second paper?

  31. References • Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. SIGCOMM 2006. • Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. Presentation at SIGCOMM 2006. • Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. California Fault Lines. SIGCOMM 2010. • Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. Presentation at SIGCOMM 2010.
