What Can We Learn from Four Years of Data Center Hardware Failures? - PowerPoint PPT Presentation

  1. What Can We Learn from Four Years of Data Center Hardware Failures? Guosai Wang, Lifei Zhang, Wei Xu

  2. Motivation: Evolving Failure Model
     • Failures in data centers are common and costly
       - They violate service level agreements (SLAs) and cause loss of revenue
     • Understanding failures reduces total cost of ownership (TCO)
     • Today's data centers are different
       - Better failure detection systems and experienced operators
       - Adoption of less reliable, commodity or custom-ordered hardware, and more heterogeneous hardware and workloads
       - Result: a more complex failure model
     • Goal: a comprehensive analysis of hardware failures in modern large-scale IDCs

  3. We Re-study Hardware Failures in IDCs
     Our work:
     - Large scale: hundreds of thousands of servers with 290,000 failure operation tickets
     - Long-term: 2012-2016
     - Multi-dimensional: components, time, space, product lines, operators' response, etc.
     - Reconfirm or extend previous findings, and observe new patterns

  4. Interesting Findings Overview
     • Common belief: failures are uniformly randomly distributed over time/space
       Our finding: HW failures are not uniformly random
       - at different time scales
       - sometimes at different locations
     • Common belief: failures happen independently
       Our finding: correlated HW failures are common in IDCs
     • Common belief: HW unreliability shapes the software fault tolerance design
       Our finding: it is also the other way around: the software fault tolerance design leads operators to care less about HW dependability

  5-8. Failure Management Architecture
     • HMS agents detect failures on servers
     • HMS collects failure records and stores them in a failure pool
     • Operators/programs generate a FOT for each failure record
     [Figure: failure management architecture, from HMS agents through the failure pool to FOTs]
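
A minimal sketch of this data flow in Python, assuming an in-memory failure pool; the class names, fields, and detection interface are illustrative assumptions, not the paper's actual HMS implementation:

    from dataclasses import dataclass
    from datetime import datetime
    from typing import List, Optional

    @dataclass
    class FailureRecord:                     # produced by an HMS agent on a server
        hostname: str
        error_device: str                    # e.g. "hdd", "memory", "power"
        error_type: str                      # e.g. "SMARTFail", "PredictErr"
        error_time: datetime

    @dataclass
    class FOT:                               # failure operation ticket for one record
        record: FailureRecord
        op_time: Optional[datetime] = None   # when an operator handles the failure
        op_action: str = "pending"           # e.g. "repair", "replace"

    failure_pool: List[FailureRecord] = []   # HMS stores detected records here

    def agent_report(record: FailureRecord) -> None:
        """An HMS agent pushes a detected failure into the central pool."""
        failure_pool.append(record)

    def generate_fots() -> List[FOT]:
        """Operators/programs open one FOT per failure record in the pool."""
        return [FOT(record=r) for r in failure_pool]

    agent_report(FailureRecord("host-001", "hdd", "SMARTFail", datetime.now()))
    print(generate_fots())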

  9. Dataset: 290,000+ FOTs
     • The failure operation tickets (FOTs) contain many fields:
       id, hostname, host idc, error device, error type, error time, error position, op time, error detail, etc.

  10-11. Multi-dimensional Analysis on the Dataset
     • We study the failures along different dimensions, based on different fields of the FOTs:
       - Time: error time
       - Product lines: hostname
       - Operators' response: error time, op time
       - Components: error device
       - Space: hostname, host idc
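
As a rough illustration, the dimensions can be derived from the FOT fields along these lines; the toy records, the hostname naming scheme, and the product-line parsing rule below are assumptions made for the example, not the paper's actual conventions:

    import pandas as pd

    fots = pd.DataFrame([
        {"hostname": "web-a-001", "host_idc": "idc1", "error_device": "hdd",
         "error_time": "2014-03-03 14:20:00", "op_time": "2014-03-03 16:05:00"},
        {"hostname": "db-b-042", "host_idc": "idc2", "error_device": "memory",
         "error_time": "2014-03-04 02:10:00", "op_time": "2014-03-05 09:30:00"},
    ])
    fots["error_time"] = pd.to_datetime(fots["error_time"])
    fots["op_time"] = pd.to_datetime(fots["op_time"])

    # Time: day of week and hour of day from error_time
    fots["weekday"] = fots["error_time"].dt.day_name()
    fots["hour"] = fots["error_time"].dt.hour

    # Product lines: assume the hostname prefix encodes the product line
    fots["product_line"] = fots["hostname"].str.split("-").str[0]

    # Operators' response: latency between detection and operation
    fots["response_hours"] = (fots["op_time"] - fots["error_time"]).dt.total_seconds() / 3600

    # Components and space come directly from error_device and hostname/host_idc
    print(fots[["weekday", "hour", "product_line", "error_device", "host_idc", "response_hours"]])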

  12. Failure Percentage Breakdown by Component
     Device            Proportion
     Hard Disk Drive     81.84%
     Miscellaneous*      10.20%
     Memory               3.06%
     Power                1.74%
     RAID card            1.23%
     Flash card           0.67%
     Motherboard          0.57%
     SSD                  0.31%
     Fan                  0.19%
     HDD backboard        0.14%
     CPU                  0.04%
     * "Miscellaneous" failures are manually submitted or uncategorized failures
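
A table like this can be reproduced by normalizing counts of the error device field over all FOTs; a minimal sketch with made-up records (the real percentages above come from the 290,000+ FOTs, not from this toy data):

    import pandas as pd

    # assumed toy distribution of error_device values, for illustration only
    devices = pd.Series(["hdd"] * 8 + ["memory", "power"])
    breakdown = devices.value_counts(normalize=True).mul(100).round(2)
    print(breakdown)   # on the real FOTs this yields the table above (HDD 81.84%, ...)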

  13-14. Failure Types for Hard Disk Drive
     • About half of HDD failures are related to SMART values or the prediction error count
     • Failure type breakdown of HDD:
       - SMARTFail: some HDD SMART value exceeds the threshold
       - PredictErr: the prediction error count exceeds the threshold
       - Other types: RaidPdPreErr, RaidPdFailed, Missing, NotReady, MediumErr, RaidPdMediaErr, BadSector, PendingLBA, TooMany, DStatus, Others
     • SMART = Self-Monitoring, Analysis and Reporting Technology
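
A minimal sketch of how the two dominant categories could be distinguished for a single drive; the attribute names and thresholds are illustrative assumptions, since the slide does not spell out the detection rules:

    def classify_hdd_failure(smart_values: dict, smart_thresholds: dict,
                             prediction_error_count: int,
                             prediction_error_threshold: int = 10) -> str:
        # SMARTFail: some monitored SMART attribute exceeds its threshold
        for attr, value in smart_values.items():
            if value > smart_thresholds.get(attr, float("inf")):
                return "SMARTFail"
        # PredictErr: the drive's prediction error count exceeds the threshold
        if prediction_error_count > prediction_error_threshold:
            return "PredictErr"
        return "Other"

    print(classify_hdd_failure(
        smart_values={"Reallocated_Sector_Ct": 120, "Current_Pending_Sector": 3},
        smart_thresholds={"Reallocated_Sector_Ct": 100},
        prediction_error_count=2,
    ))   # -> SMARTFail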

  15. Outline
     • Dataset overview
     ▶ Temporal distribution of the failures
     • Spatial distribution of the failures
     • Correlated failures
     • Operators' response to failures
     • Lessons Learned

  16. FR is NOT Uniformly Random over Days of the Week
     • Hypothesis 1: The average number of component failures is uniformly random over different days of the week.
     • A chi-square test can reject the hypothesis at the 0.01 significance level for all component classes.
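
A minimal sketch of such a test with scipy.stats.chisquare; the weekday counts are made up, not the paper's data. Under Hypothesis 1 the expected count is the same for every day of the week:

    from scipy.stats import chisquare

    # observed failure counts Mon..Sun for one component class (assumed numbers)
    observed = [310, 295, 330, 340, 280, 210, 205]
    stat, p_value = chisquare(observed)      # expected defaults to a uniform split
    print(stat, p_value)
    # p_value < 0.01 rejects uniformity over days of the week;
    # the same test applies to hours of the day (Hypothesis 2 below).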

  17. FR is NOT Uniformly Random over Hours of the Day
     • Hypothesis 2: The average number of component failures is uniformly random during each hour of the day.

  18-21. FR is NOT Uniformly Random over Hours of the Day
     • Possible reasons:
       - High workload results in more failures
       - Human factors
       - Components fail in large batches

  22. FR of Each Component Changes During its Life Cycle
     • Different component classes exhibit different FR patterns.

  23. FR of Each Component Changes During its Life Cycle
     • Infant mortalities

  24. FR of Each Component Changes During its Life Cycle
     • Wear-out
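
A minimal sketch of how an FR-over-lifetime curve can be computed by bucketing failures by component age and dividing by the surviving population; the counts are illustrative assumptions. High FR in the earliest buckets corresponds to infant mortality, and a rising tail corresponds to wear-out:

    population = 10_000                       # components deployed (assumed)
    failures_by_age = {                       # failures per 6-month age bucket (assumed)
        "0-6m": 220, "6-12m": 90, "12-18m": 80,
        "18-24m": 85, "24-30m": 140, "30-36m": 210,
    }

    in_service = population
    for bucket, failed in failures_by_age.items():
        fr = failed / in_service              # failure rate within this bucket
        print(f"{bucket}: FR = {fr:.2%}")
        in_service -= failed                  # survivors carry over to the next bucket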

  25. Outline
     • Dataset overview
     • Temporal distribution of the failures
     ▶ Spatial distribution of the failures
     • Correlated failures
     • Operators' response to failures
     • Lessons Learned

  26. Physical Locations Might Affect the FR Distribution
     • Hypothesis 3: The failure rate at each rack position is independent of the rack position.
     • In general, at the 0.05 significance level:
       - we cannot reject the hypothesis in 40% of the data centers
       - we can reject it in the other 60%
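
A minimal sketch of testing Hypothesis 3 for one data center: compare observed failure counts per rack position with the counts expected if FR did not depend on position (expected counts proportional to the number of servers at each position). The numbers are made up for illustration:

    from scipy.stats import chisquare

    servers_per_position = [120, 118, 121, 119, 122]    # servers at positions 1..5 (assumed)
    failures_per_position = [14, 11, 13, 12, 25]        # observed failures (assumed)

    total_failures = sum(failures_per_position)
    total_servers = sum(servers_per_position)
    expected = [total_failures * s / total_servers for s in servers_per_position]

    stat, p_value = chisquare(failures_per_position, f_exp=expected)
    print(p_value)   # p < 0.05 rejects independence of FR and rack position here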

  27. FR Can be Affected by the Cooling Design
     • FRs are higher at rack positions 22 and 35 (at the top of the rack and just above the PSU, along the path of the cooling air)
     • Possible reasons: the design of IDC cooling and the physical structure of the racks
     [Figure: a typical Scorpion rack]

  28. Outline
     • Dataset overview
     • Temporal distribution of the failures
     • Spatial distribution of the failures
     ▶ Correlated failures
     • Operators' response to failures
     • Lessons Learned
