Understanding Network Failures in Data Centers: Measurement, Analysis and Implications
SIGCOMM 2011, Toronto, ON, Aug. 18, 2011. Phillipa Gill, Navendu Jain & Nachiappan Nagappan. Microsoft Research, University of Toronto.

  1. SIGCOMM 2011, Toronto, ON, Aug. 18, 2011. Understanding Network Failures in Data Centers: Measurement, Analysis and Implications. Phillipa Gill, Navendu Jain & Nachiappan Nagappan. Microsoft Research, University of Toronto.

  2. Motivation

  3. Motivation: $5,600 per minute. We need to understand failures to prevent and mitigate them!

  4. Overview. Our goal: improve reliability by understanding network failures.
  1. Failure characterization: most failure-prone components; understanding root causes.
  2. What is the impact of failure?
  3. Is redundancy effective?
  Our contribution: the first large-scale empirical study of network failures across multiple DCs.
  • Methodology to extract failures from noisy data sources.
  • Correlate events with network traffic to estimate impact.
  • Analyze implications for future data center networks.

  5. Road Map: Motivation; Background & Methodology; Results (1. Characterizing failures, 2. Do current network redundancy strategies help?); Conclusions.

  6. Data center networks overview. Diagram: the Internet connects through access routers and the network "core" fabric to aggregation ("Agg") switches, with load balancers attached, down to top-of-rack (ToR) switches and servers.

  7. Data center networks overview: key questions. Which components are most failure prone? What causes failures? What is the impact of failure? How effective is redundancy?

  8. Failure event information flow. A failure (e.g., LINK DOWN) is logged in numerous data sources: network event logs built from Syslog and SNMP traps/polling; network traffic logs with 5-minute traffic averages on links; and troubleshooting tickets (e.g., Ticket ID: 34) containing diary entries and root cause.

  9. Data summary.
  • One year of event logs from Oct. 2009 to Sept. 2010: network event logs and troubleshooting tickets.
  • Network event logs are a combination of Syslog, SNMP traps and polling. Caveat: may miss some events, e.g., UDP, correlated faults.
  • Filtered by operators to actionable events; still many warnings from various software daemons running.
  Key challenge: how to extract failures of interest?

  10. Extracting failures from event logs.
  • Defining failures (from the network event logs):
    – Device failure: the device is no longer forwarding traffic.
    – Link failure: the connection between two interfaces is down, detected by monitoring interface state.
  • Dealing with inconsistent data:
    – Devices: correlate with link failures.
    – Links: reconstruct state from logged messages; correlate with network traffic to determine impact.

  11. Reconstructing device state.
  • Devices may send spurious DOWN messages.
  • Verify that at least one link on the device fails within five minutes; this is conservative, to account for message loss (correlated failures).
  • Example: an aggregation switch reports DEVICE DOWN, and LINK DOWN messages are seen on its links to a top-of-rack switch and to a second aggregation switch.
  • This sanity check reduces device failures by 10x.
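
  A minimal Python sketch of this sanity check, assuming failure events have already been reduced to (device_id, timestamp) pairs; the record shape and function names are illustrative, not the authors' actual tooling:

    from datetime import timedelta

    VERIFY_WINDOW = timedelta(minutes=5)  # verification window from the slide

    def confirmed_device_failures(device_down_events, link_down_events):
        # Keep only DEVICE DOWN events corroborated by at least one LINK DOWN
        # on the same device within the verification window.
        # Both inputs are lists of (device_id, timestamp) tuples.
        confirmed = []
        for device_id, t_device in device_down_events:
            corroborated = any(
                dev == device_id and abs(t_link - t_device) <= VERIFY_WINDOW
                for dev, t_link in link_down_events
            )
            if corroborated:
                confirmed.append((device_id, t_device))
        return confirmed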

  12. Reconstructing link state.
  • Inconsistencies in link failure events. Note: our logs bind each link down to the time it is resolved.
  • What we expect: a LINK DOWN followed by a matching LINK UP, so the link state goes DOWN and then returns to UP.

  13. Reconstructing link state (continued).
  • What we sometimes see: overlapping events (LINK DOWN 1, LINK DOWN 2, LINK UP 1, LINK UP 2) that leave the link state ambiguous.
  • How to deal with discrepancies: 1. take the earliest of the down times; 2. take the earliest of the up times.
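
  A short sketch of the reconciliation rule above, under the assumption that each logged failure has already been bound to a (down_time, up_time) pair:

    def resolve_link_failure(bound_events):
        # Collapse overlapping (down_time, up_time) pairs logged for one link
        # into a single failure interval: earliest down time, earliest up time.
        down = min(d for d, _ in bound_events)
        up = min(u for _, u in bound_events)
        return down, up

  For the example on the slide (DOWN 1, DOWN 2, UP 1, UP 2), this resolves the failure to the interval from DOWN 1 to UP 1.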

  14. Identifying failures with impact.
  • Correlate link failures with network traffic (from the network traffic logs). Only consider events where traffic on the link decreases during the failure:
    (traffic during failure) / (traffic before failure) < 1
  • Summary of impact:
    – 28.6% of failures impact network traffic.
    – 41.2% of failures were on links carrying no traffic, e.g., scheduled maintenance activities.
  • Caveat: impact is measured on network traffic, not necessarily on applications! Redundancy in the network, compute, and storage can mask outages.
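
  A hedged sketch of the impact test, assuming traffic is available as lists of 5-minute averages before and during the failure; taking the median of each window is an assumption for illustration, not necessarily the paper's exact aggregation:

    import statistics

    def has_traffic_impact(traffic_before, traffic_during):
        # Flag a link failure as impactful when traffic drops during the
        # failure, i.e. traffic_during / traffic_before < 1.
        before = statistics.median(traffic_before)
        during = statistics.median(traffic_during)
        if before == 0:
            return False  # link carried no traffic (e.g., scheduled maintenance)
        return during / before < 1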

  15. Road Map: Motivation; Background & Methodology; Results.
  1. Characterizing failures: the distribution of failures over the measurement period; which components fail most; how long failures take to mitigate.
  2. Do current network redundancy strategies help?
  Conclusions.

  16. Visualization of failure panorama, Sep. '09 to Sep. '10 (all failures: 46K). Chart: links, sorted by data center, vs. time binned by day; a point at (X, Y) means link Y had a failure on day X. Annotations highlight widespread failures spanning a data center and long-lived failures.

  17. Visualization of failure panorama, Sep. '09 to Sep. '10: failures with impact (28%) overlaid on all failures (46K). Annotations highlight a component failure (link failures on multiple ports) and a load balancer update spanning multiple data centers.

  18. Which devices cause most failures?

  19. Which devices cause most failures? Chart: percentage of failures and percentage of downtime by device type (load balancers LB-1, LB-2, LB-3; top-of-rack switches ToR-1, ToR-2; aggregation switch AggS-1). Top-of-rack switches have few failures (annual probability of failure <5%) but account for a lot of downtime. Load balancer 1: very little downtime relative to its number of failures.

  20. How long do failures take to resolve?

  21. How long do failures take to resolve? Chart: time to repair by device type (Load Balancer 1, 2, 3; Top of Rack 1, 2; Aggregation Switch; Overall).
  • Load balancer 1: short-lived transient faults; median time to repair 4 minutes.
  • Top-of-rack switches: median time to repair 3.6 hours for ToR-1 and 22 minutes for ToR-2; correlated failures occur on ToRs connected to the same aggregation switches.
  • Overall: median time to repair 5 minutes; mean 2.7 hours.

  22. Summary.
  • Data center networks are highly reliable: the majority of components have four 9's of reliability.
  • Low-cost top-of-rack switches have the highest reliability (<5% probability of failure) but the most downtime, because they are a lower-priority component.
  • Load balancers experience many short-lived faults; root causes: software bugs, configuration errors and hardware faults.
  • Software and hardware faults dominate failures, but hardware faults contribute most downtime.

  23. Road Map: Motivation; Background & Methodology; Results (1. Characterizing failures, 2. Do current network redundancy strategies help?); Conclusions.

  24. Is redundancy effective in reducing impact? Redundant devices and links are deployed to mask failures, with the goal of rerouting traffic along available paths. But this is expensive (management overhead + $$$). How effective is it in practice?

  25. Measuring the effectiveness of redundancy.
  Idea: compare traffic before and during a failure, on the failed link (e.g., between a primary aggregation switch and a primary access router) and across its redundancy group (including the backup switch and router).
  1. Measure traffic on links before the failure.
  2. Measure traffic during the failure.
  3. Compute the "normalized traffic" ratio: (traffic during) / (traffic before), which is close to 1 if the failure is masked.
  Compare normalized traffic over the redundancy group to normalized traffic on the link that failed.
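
  A rough sketch of the comparison, assuming per-link traffic totals before and during the failure; the (before, during) tuple shape and the summation over the redundancy group are illustrative assumptions:

    def normalized_traffic(traffic_before, traffic_during):
        # Normalized traffic ratio from the slide: traffic during the failure
        # divided by traffic before it; a value near 1 means traffic was unaffected.
        return traffic_during / traffic_before if traffic_before else float("nan")

    def redundancy_comparison(failed_link, redundancy_group):
        # Compare normalized traffic on the failed link with normalized traffic
        # aggregated over its redundancy group (primary + backup links).
        # failed_link is a (before, during) pair; redundancy_group is a list of them.
        link_ratio = normalized_traffic(*failed_link)
        group_before = sum(before for before, _ in redundancy_group)
        group_during = sum(during for _, during in redundancy_group)
        group_ratio = normalized_traffic(group_before, group_during)
        return link_ratio, group_ratio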

  26. Is redundancy effective in reducing impact? Chart: normalized traffic during failure (median), per link and per redundancy group, overall and by layer (top of rack to aggregation switch, aggregation switch to access router, core).
  • Core link failures have the most impact, but redundancy masks it.
  • There is less impact lower in the topology.
  • Redundancy is least effective for aggregation switches and access routers.
  • Overall, redundancy yields about a 40% increase in normalized traffic carried during failures.

  27. Road Map: Motivation; Background & Methodology; Results (1. Characterizing failures, 2. Do current network redundancy strategies help?); Conclusions.

  28. Conclusions.
  • Goal: understand failures in data center networks, via an empirical study of data center failures.
  • Key observations: data center networks have high reliability; low-cost switches exhibit high reliability; load balancers are subject to transient faults; failures may lead to loss of small packets.
  • Future directions: study application-level failures and their causes; further study of redundancy effectiveness.

  29. Thanks! Contact: phillipa@cs.toronto.edu. Project page: http://research.microsoft.com/~navendu/netwiser
