SSD Failures in Datacenters

SSD Failures in Datacenters: What? When? And Why?


  1. SSD Failures in Datacenters: What? When? And Why? Iyswarya Narayanan, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, Kushagra Vaid. The 9th ACM Systems and Storage Conference (SYSTOR 2016)

  2. Why SSD Reliability? Data reliability; SSDs' popularity (46.5% annual growth*); datacenter decision support; limited field data. *Source: IDC, Dec 2015

  3. Why SSD Reliability? Data reliability; SSDs' popularity (46.5% annual growth*); datacenter decision support; limited field data; large-scale field data. *Source: IDC, Dec 2015

  4. SSD Failures. Flash failures: media wear-out, data retention, program disturb, erase disturb. FTL mechanisms: wear levelling, error detection, error correction, flash correct and refresh, etc.

  9. SSD Failures. Flash failures: media wear-out, data retention, program disturb, erase disturb. FTL mechanisms: wear levelling, error detection, error correction, flash correct and refresh, etc. Annotation: fail-stop failures.

  10. SSD Reliability. [Chart: annualized failure rate (AFR, %) per SSD model (1-A, 1-B, 1-C, 1-D, 2-A), enterprise and consumer; annotations AFR = 0.61 and AFR = 0.73.]

  12. SSD Reliability. [Same AFR chart.] Annotation: 5 large datacenters.

  13. SSD Reliability. [Same AFR chart.] Annotation: 4 major workloads.

  14. SSD Reliability. [Same AFR chart.] Annotation: 6 different rack SKUs.

  15. SSD Reliability. Various factors in the production environment could affect SSD failure trends very differently from lab test conditions. Can we understand SSD failures in the presence of various factors? [Chart: annualized failure rate (AFR, %) per SSD model (1-A, 1-B, 1-C, 1-D, 2-A); annotations AFR = 0.61 and AFR = 0.73.]
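
The slides report AFR values without spelling out the formula. A minimal sketch, assuming the common definition of AFR as failures divided by accumulated device-years of operation; the function name and the fleet numbers are hypothetical and are not taken from the study:

```python
def annualized_failure_rate(failures, device_days):
    """Return AFR in percent, assuming AFR = failures / device-years * 100."""
    if device_days <= 0:
        raise ValueError("device_days must be positive")
    device_years = device_days / 365.0
    return 100.0 * failures / device_years


# Hypothetical fleet: 10,000 drives observed for 365 days each, 50 failures.
afr = annualized_failure_rate(failures=50, device_days=10_000 * 365)
print(f"AFR = {afr:.2f}%")
```

Normalizing by device-years rather than by device count keeps drives that were deployed for only part of the observation window from distorting the rate.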

  16. Understanding SSD Failures: An Analogy. SSD: reactive vs. proactive.

  17. What are the symptoms? Human analogy: fever, unexpected weight loss, low blood pressure. SSD: reallocated sectors, data errors, program and erase failure, SATA downshift.

  18. SSD Failure Symptoms. [Chart: AFR (%) with vs. without symptom for Reallocated Sector Count, Program and Erase Fail Count, CRC and Uncorrectable Error Count, and SATA Downshift Count; annotated ratios: 18X, 3.91X, 3.95X, 2.76X.]
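
The "X" annotations on the symptom chart compare AFR for devices that show a symptom against devices that do not. A minimal sketch of that comparison, assuming AFR = failures / device-years * 100; the record layout (`years`, `failed`, `realloc_sectors`) and the fleet numbers are hypothetical:

```python
def afr_percent(failures, device_years):
    """AFR in percent, assuming AFR = failures / device-years * 100."""
    if device_years <= 0:
        raise ValueError("need a positive number of device-years")
    return 100.0 * failures / device_years


def afr_with_and_without(devices, symptom):
    """Split devices by whether `symptom` was ever observed; return both AFRs.

    Each device is a dict with hypothetical keys: 'years' (observation time),
    'failed' (bool), and one integer counter per symptom.
    """
    failures = {True: 0, False: 0}
    years = {True: 0.0, False: 0.0}
    for d in devices:
        has_symptom = d.get(symptom, 0) > 0
        failures[has_symptom] += int(d["failed"])
        years[has_symptom] += d["years"]
    return (afr_percent(failures[True], years[True]),
            afr_percent(failures[False], years[False]))


# Hypothetical fleet: 100 drives with reallocated sectors (4 failed),
# 1000 drives without (10 failed), each observed for one year.
fleet = (
    [{"years": 1.0, "failed": i < 4, "realloc_sectors": 3} for i in range(100)]
    + [{"years": 1.0, "failed": i < 10, "realloc_sectors": 0} for i in range(1000)]
)
with_s, without_s = afr_with_and_without(fleet, "realloc_sectors")
print(f"with: {with_s:.2f}%  without: {without_s:.2f}%  ratio: {with_s / without_s:.1f}X")
```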

  19. Insufficiency of symptom-only diagnosis. Symptoms seen only in 62% of failed devices. [Chart: % of devices, failed vs. healthy, showing Reallocations, Program and Erase Fail, Data Errors, SATA Downshift, or Any symptom.]

  20. What are the factors? SSD factors, with their human analogy in parentheses: workload (lifestyle), production environment (environmental agents), design decisions (genetics).

  21. Device-level correlating factors: average write rate of a device, average read rate of a device, total read and/or write usage, write amplification, read-write ratio. Increasing failure trend at higher write rates. [Chart: AFR (%) vs. avg. host writes per day, 10 to >50.] More results in the paper.

  22. Server-level correlating factors: SSD space utilization, disk space utilization, memory utilization, processor utilization. Decreasing failure trend at high disk space usage. [Chart: AFR (%) vs. avg. disk space utilization, 10 to 70.] More results in the paper.

  23. Datacenter factors: rack SKU, datacenter facility. Same model, different behavior. [Chart: AFR (%) for SSD models 1-D and 2-A in SKUs S1-3a and S1-3b.] More results in the paper.

  24. Understanding SSD Failures: An Analogy. Multi-feature analysis of SSD symptoms and factors.

  25. Understanding SSD Failures: An Analogy. SSD symptoms and factors. Random-forest-based binary classification; permutation feature ranking.
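
Permutation feature ranking scores a feature by how much a trained classifier's accuracy drops when that feature's column is shuffled. The sketch below illustrates only that ranking step and runs on the standard library alone, so it substitutes a hand-written threshold rule for the random forest named on the slide; the data, the feature layout, and the stand-in model are all hypothetical. With scikit-learn, `RandomForestClassifier` together with `sklearn.inspection.permutation_importance` would be the usual starting point.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features, n_repeats=10, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild every row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical data: feature 0 = data error count, feature 1 = unrelated noise.
rows = [(i % 6, (i * 7) % 5) for i in range(120)]
labels = [r[0] > 1 for r in rows]  # True = failed


def stand_in_model(row):
    """Stand-in for a trained classifier (NOT a random forest)."""
    return row[0] > 1


scores = permutation_importance(stand_in_model, rows, labels, n_features=2)
print(scores)
```

Shuffling the feature the model relies on lowers accuracy, while shuffling the noise feature leaves it unchanged, which is what produces the ranking.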

  26. Understanding What? What are the important factors? What is their order of importance? What are the important combinations?

  27. Understanding What? [Chart: feature importance (0 to 1), listed top to bottom: DataErrors, ReallocSectors, TotalNANDWrites, HostWrites, TotalReads+Writes, AvgMemory, AvgSSDSpace, UsagePerDay, TotalReads, ReadsPerDay.] Highlighted group: symptoms (DataErrors, ReallocSectors).

  28. Understanding What? [Same feature importance chart.] Highlighted group: device workload (TotalNANDWrites, HostWrites, TotalReads+Writes).

  29. Understanding What? [Same feature importance chart.] Highlighted group: server workload (AvgMemory, AvgSSDSpace).

  30. Understanding What? Frequent combinations of the top 8 important features (condition -> class; H = healthy, F = failed): Data Errors <= 1 & Reallocated Sectors <= 5 -> H; Data Errors <= 1 & WAF <= 1 -> H; Media Wear-out = 100 & WAF <= 1 -> H; Avg. SSD space >= 10 -> F. Highlighted category: symptoms.

  31. Understanding What? [Same table of frequent combinations.] Highlighted category: symptoms + workload.

  32. Understanding What? [Same table of frequent combinations.] Highlighted category: workload.

  33. Understanding When? What is the duration between detection and failure? What signatures characterize SSD survivability?

  34. Understanding When? [Chart: CDF of time to fail, 0 to 12 months.] Annotations: 50% of failures, > 4 months; sufficient time to intervene.

  35. Understanding When? [Same CDF chart.] Late failures: rules contain only workload factors. Early failures (< 1 month): rules include symptoms and their thresholds.
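
The time-to-fail chart is an empirical CDF. A minimal sketch of how the early (< 1 month) and late (> 4 months) fractions could be read off such a distribution; the sample values are hypothetical and do not reproduce the study's data:

```python
def empirical_cdf(samples, x):
    """Fraction of samples that are less than or equal to x."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(1 for s in samples if s <= x) / len(samples)


# Hypothetical months between first detection and failure for ten drives.
time_to_fail_months = [0.5, 0.8, 2, 3, 4.5, 5, 6, 7, 9, 11]

early = empirical_cdf(time_to_fail_months, 1)      # failed within 1 month
late = 1 - empirical_cdf(time_to_fail_months, 4)   # survived beyond 4 months
print(f"early: {early:.0%}  late: {late:.0%}")
```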

  36. Understanding SSD Failures: An Analogy. SSD symptoms and factors. Observation-based causal estimate: probabilistic causal models and Pearl's do-calculus.
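
The slide names Pearl's do-calculus without giving a formula. As one standard result from that framework, and not necessarily the one the authors applied, the back-door adjustment expresses an interventional probability in terms of observational ones when a covariate set Z satisfies the back-door criterion relative to (X, Y). The symbols are illustrative assumptions: X could be a workload factor, Y the failure outcome, and Z confounders such as the SSD model.

```latex
P\bigl(Y = y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{z} P\bigl(Y = y \mid X = x,\, Z = z\bigr)\, P(Z = z)
```

The right-hand side uses only quantities that can be estimated from field observations, which is what makes an observation-based causal estimate possible without controlled experiments.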

  37. Understanding Why? What factors impact SSD reliability? What is their magnitude of impact?

  38. Understanding Why? SSD model and symptoms have a direct impact. Workload impacts failures through media wear-out.

  39. Concluding Remarks
      • SSD failures in the field
      • Factors -> Symptoms -> Failures
      • Important symptoms: Data Errors and Reallocated Sectors
      • High intensity and rapid progression of symptoms: early failure
      • Important factors: NAND Writes, Total Reads and Writes, etc.
      • Direct impact: SSD model and symptoms
      • Indirect impact: workload, through wear-out
      • Future direction: prediction and control
