Internet Measurement Huaiyu Zhu, Rim Kaddah CS538 Fall 2011
OUTLINE
• California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage.
• A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.
Why Study Failure
• Failure is a reality for large networks
• Achieving high availability requires engineering the network to be robust to failure
• Designing mechanisms that effectively mitigate failures requires a deep understanding of real failures
CENIC Network
• Serves California educational institutions
• Over 200 routers
• 5 years of data
• Three types of components:
  ◦ The Digital California (DC) network
  ◦ The High-Performance Research (HPR) network
  ◦ Customer-premises equipment (CPE)
Contribution
• Methodology to reconstruct historical failure events in the CENIC network
• Uses only commonly available data; no additional instrumentation is needed
• Analysis of the network based on the failure measurements
Reconstruction
What data are available to reconstruct a failure 4 years later?
◦ Syslog
  • Describes interface state changes
◦ Router configuration files
  • Map interfaces to links
◦ Operational announcements on the mailing list
These data were not intended for failure reconstruction!
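The pairing of interface state changes into failure events can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout `(timestamp, router, interface, state)`, the router and interface names, and the `INTERFACE_TO_LINK` map are all hypothetical stand-ins for parsed syslog messages and router configuration files.

```python
from datetime import datetime

# Hypothetical, already-parsed syslog records: (timestamp, router, interface, new_state).
# Real syslog lines would need parsing first; this layout is an assumption.
SYSLOG = [
    (datetime(2008, 1, 1, 10, 0, 0), "rtr-a", "ge-0/0/1", "down"),
    (datetime(2008, 1, 1, 10, 0, 40), "rtr-a", "ge-0/0/1", "up"),
    (datetime(2008, 1, 2, 3, 15, 0), "rtr-b", "ge-1/0/0", "down"),
    (datetime(2008, 1, 2, 3, 20, 0), "rtr-b", "ge-1/0/0", "up"),
]

# Hypothetical interface-to-link map, as would be derived from router configuration files.
INTERFACE_TO_LINK = {
    ("rtr-a", "ge-0/0/1"): "link-1",
    ("rtr-b", "ge-1/0/0"): "link-2",
}


def reconstruct_failures(records, interface_to_link):
    """Pair each 'down' with the next 'up' on the same link; return (link, start, end)."""
    down_since = {}  # link -> timestamp of the still-unmatched 'down'
    failures = []
    for timestamp, router, interface, state in sorted(records):
        link = interface_to_link.get((router, interface))
        if link is None:
            continue  # interface is not mapped to a known link
        if state == "down":
            # Keep the earliest 'down' if several arrive before an 'up'.
            down_since.setdefault(link, timestamp)
        elif state == "up" and link in down_since:
            failures.append((link, down_since.pop(link), timestamp))
    return failures


failures = reconstruct_failures(SYSLOG, INTERFACE_TO_LINK)
for link, start, end in failures:
    print(link, (end - start).total_seconds())
```

A 'down' with no later 'up' is left unmatched here; how to treat such open-ended events is a design choice this sketch does not settle.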
Validation
• Internal consistency: administrator announcements are used to validate the reconstructed event history.
• External consistency: the CAIDA Skitter project (now Ark) validates UP states; the Route Views project validates DOWN states.
Overview of Link Failures
• Vertical banding
  ◦ V1: a network-wide IS-IS configuration change requiring a router restart
  ◦ V2: a network-wide software upgrade
  ◦ V3: a network-wide configuration change in preparation for IPv6
• Horizontal banding
  ◦ H1: a series of failures on a link between a core router and a County of Education office (hardware)
  ◦ H2: this link experienced over 33,000 short-duration failures (fiber cut)
CDFs of Individual Failure Events
Various Link Hardware Types
Cause of Failure
Failure Events
Summary
• Engineering for failure requires real data
  - Data has historically been difficult to obtain
• Methodology to perform historical failure analysis with low-quality data sources
• Findings from the CENIC network
  - Reliability of individual components
  - Causes of failures
  - Impact of failures
OUTLINE
• California Fault Lines: Understanding the Causes and Impact of Network Failures. Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage.
• A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush.
Key Questions
• How do routing events degrade end-to-end path performance?
• How do topological properties and routing policies affect the performance degradation?
Approach
• Study end-to-end performance under realistic topologies.
• Investigate several metrics to characterize end-to-end loss, delay, and out-of-order packets.
• Characterize the kinds of routing changes that impact end-to-end path performance.
• Analyze the impact of topology, routing policies, the MRAI timer, and iBGP configurations on end-to-end path performance.
Experiment Methodology
• A multi-homed prefix
  ◦ BGP Beacon prefix: 192.83.230.0/24
• Controlled routing changes
  ◦ Failover events: the Beacon changes from being connected to both providers to being connected to a single provider.
  ◦ Recovery events: the Beacon changes from being connected to a single provider to being connected to both providers.
[Figure: the Beacon and its providers ISP 1 and ISP 2 before and after a failover event and a recovery event]
Controlled Routing Changes
• 12 routing events every day
  ◦ 8 Beacon events: failover events and recovery events
  ◦ 4 events for resetting the Beacon connectivity
[Table: time schedule (GMT) for BGP Beacon routing transitions]
Active Probing
• Goal: capture the impact of routing changes on end-to-end performance.
• Probes are sent from 37 PlanetLab hosts to the Beacon host (a host within the Beacon prefix).
• Three probing methods:
  - Back-to-back traceroutes
  - Back-to-back pings
  - UDP probing (50 msec interval)
[Figure: hosts A, B, and C probing across the Internet through ISP 1 and ISP 2 to the Beacon host]
[Table: data-plane performance metrics (packet loss, delay, out-of-order) against active probing methods (traceroute, ping, UDP probing)]
Packet Loss
• Loss burst: consecutive UDP probing packets lost during a routing change event.
[Figure panels: Failover, Recovery]
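The loss-burst definition above can be sketched as a count of runs of consecutive missing sequence numbers. This is an illustrative sketch only: the function name `loss_bursts`, the zero-based sequence numbering, and the sample data are assumptions, not taken from the paper.

```python
PROBE_INTERVAL_MS = 50  # UDP probing interval stated on the Active Probing slide


def loss_bursts(sent_count, received_seqs):
    """Return the lengths of runs of consecutive lost probes, in send order."""
    received = set(received_seqs)
    bursts = []
    run = 0  # length of the current run of lost probes
    for seq in range(sent_count):
        if seq in received:
            if run:
                bursts.append(run)
            run = 0
        else:
            run += 1
    if run:
        bursts.append(run)  # a burst still open when the probing ends
    return bursts


# Made-up example: 10 probes sent, probes 3, 4, 5 and 8 never arrive.
bursts = loss_bursts(10, [0, 1, 2, 6, 7, 9])
longest_ms = max(bursts) * PROBE_INTERVAL_MS
print(bursts, longest_ms)
```

Multiplying a burst length by the probing interval gives a rough estimate of how long the path was unusable, to within one probing interval.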
Packet Delay
• Round-trip delays are measured from the probe host to the Beacon host (one-way delays would suffer from clock skew).
[Figure panels: Failover, Recovery]
Out-of-order Packets
• Number of reorderings (number of packets out of order)
• Reordering offset
[Figure panels: Recovery, Failover]
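One way to compute the two metrics above is sketched below. The definitions used are assumptions loosely modeled on RFC 4737 and may differ from the paper's: a packet counts as reordered if it arrives after a packet with a larger sequence number, and its offset is the distance in arrival positions back to the earliest such packet.

```python
def reordering_stats(arrival_seqs):
    """Return (count, offsets) for packets that arrive after a higher-numbered packet."""
    offsets = []
    max_seen = -1  # largest sequence number seen so far
    for i, seq in enumerate(arrival_seqs):
        if seq < max_seen:
            # Earliest arrival position that holds a larger sequence number.
            j = next(k for k in range(i) if arrival_seqs[k] > seq)
            offsets.append(i - j)
        else:
            max_seen = seq
    return len(offsets), offsets


# Made-up arrival order: packet 4 overtakes packets 2 and 3.
count, offsets = reordering_stats([0, 1, 4, 2, 3, 5])
print(count, offsets)
```

In this example packets 2 and 3 are reordered, with offsets 1 and 2 respectively.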
How Routing Failures Occur (Failover)
• Prefer-customer routing policy: routes received from a provider's customers are always preferred over those received from its peers.
[Figure: Provider 1 and Provider 2 joined by a peer link, routers R1 to R6 annotated with AS paths, and a customer link to Beacon AS 0]
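The prefer-customer rule can be sketched as a route-selection key that ranks the neighbor relationship ahead of AS-path length. This is a minimal sketch: the `Route` class and `best_route` function are hypothetical, and ranking peers ahead of providers is a common-practice assumption, since the slide only states that customer routes beat peer routes.

```python
from dataclasses import dataclass

# Assumed ranking, lower wins. Only "customer before peer" comes from the slide;
# placing "provider" last is an assumption.
RELATIONSHIP_RANK = {"customer": 0, "peer": 1, "provider": 2}


@dataclass(frozen=True)
class Route:
    as_path: tuple      # e.g. (2, 0) means via AS 2 to origin AS 0
    learned_from: str   # "customer", "peer", or "provider"


def best_route(routes):
    """Pick by neighbor relationship first, then by shortest AS path."""
    return min(routes, key=lambda r: (RELATIONSHIP_RANK[r.learned_from], len(r.as_path)))


short_peer_route = Route(as_path=(0,), learned_from="peer")
long_customer_route = Route(as_path=(2, 0), learned_from="customer")
chosen = best_route([short_peer_route, long_customer_route])
print(chosen)
```

The customer route wins even though its AS path is longer, which is the property the policy name refers to.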
How Routing Failures Occur (Failover), contd.
• No-valley routing policy: an AS does not carry transit traffic from one of its peers to another.
[Figure: Provider 1, Provider 2, and Provider 3 connected by peer links, routers R1 to R9 annotated with AS paths, and Beacon AS 0]
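The no-valley policy is usually expressed as an export rule, sketched below. The function name `may_export` and the relationship labels are illustrative assumptions; the rule shown is the standard one in which only customer-learned routes are announced to peers and providers.

```python
def may_export(learned_from, send_to):
    """No-valley export rule.

    Routes learned from a customer may be announced to any neighbor.
    Routes learned from a peer or a provider may be announced only to customers.
    """
    if learned_from == "customer":
        return True
    return send_to == "customer"


# Evaluate every combination of where a route came from and where it would go.
relationships = ("customer", "peer", "provider")
export_table = {
    (src, dst): may_export(src, dst) for src in relationships for dst in relationships
}
for (src, dst), allowed in export_table.items():
    print(f"learned from {src:8s} -> announce to {dst:8s}: {allowed}")
```

Under this rule a peer-learned route is never announced to another peer, so an AS that loses its customer route cannot fall back on a path that crosses two peer links.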
How Routing Failures Occur (Recovery)
• iBGP constraint: a route learned from one iBGP router is not re-advertised to another iBGP router.
1. Path (0) recovers at R3.
2. R3 sends the path to R2.
3. R2 sends a withdrawal to R1.
4. R3 sends the recovery path to R1.
5. R1 regains its connection to the Beacon.
[Figure: Provider 1 and Provider 2, routers R1 to R4, messages Withdraw (2 0) and Path (0), and Beacon AS 0]
Summary
• During failover and recovery events:
  ◦ Routing events have a significant impact on packet loss.
  ◦ Routing failures contribute significantly to end-to-end packet loss.
  ◦ Routing events can lead to long round-trip delays and packet reordering.
• Routing policies and iBGP configuration play a major role in causing packet loss during routing events.
Discussion
• How could we prevent packet loss during path exploration? Would storing an alternative path in each router be a good idea? What are the downsides?
• How could we exploit the previous results to improve end-to-end performance?
• How realistic is the topology in the second paper?
References
• Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. A Measurement Study on the Impact of Routing Events on End-to-End Internet Path Performance. SIGCOMM 2006.
• Feng Wang, Zhuoqing Morley Mao, Jia Wang, Lixin Gao, and Randy Bush. Presentation at SIGCOMM 2006.
• Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. California Fault Lines: Understanding the Causes and Impact of Network Failures. SIGCOMM 2010.
• Daniel Turner, Kirill Levchenko, Alex C. Snoeren, and Stefan Savage. Presentation at SIGCOMM 2010.