SSD Failures in Datacenters: What? When? And Why?
Iyswarya Narayanan, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, Kushagra Vaid
The 9th ACM Systems And Storage Conference (SYSTOR 2016)
Why SSD Reliability?
- SSDs' popularity: 46.5% annual growth*
- Data reliability
- Large-scale field data
- Datacenter decision
- Limited field data support
*Source: IDC, Dec 2015
SSD Failures
Flash failures:
- Media wear-out
- Data retention
- Program disturb
- Erase disturb
FTL mechanisms:
- Wear levelling
- Error detection
- Error correction
- Flash correct and refresh, etc.
Fail-stop failures
SSD Reliability
[Figure: Annualized Failure Rate (%) by SSD model (1-A, 1-B, 1-C, 1-D, 2-A); legend: Enterprise, Consumer; annotations: AFR=0.61, AFR=0.73]
- 5 large datacenters
- 4 major workloads
- 6 different rack SKUs
Various factors in a production environment could make SSD failure trends differ greatly from those seen under lab test conditions.
Can we understand SSD failures in the presence of various factors?
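The slides do not spell out how AFR is computed. A minimal sketch, assuming the common definition (failures divided by accumulated device-years of operation, expressed as a percentage); the function name and the sample fleet numbers are illustrative and are not taken from the study:

```python
def annualized_failure_rate(failures: int, device_days: float) -> float:
    """Return AFR in percent, assuming AFR = failures / device-years * 100."""
    if device_days <= 0:
        raise ValueError("device_days must be positive")
    device_years = device_days / 365.0
    return 100.0 * failures / device_years


# Hypothetical fleet: 5,000 drives observed for 180 days each, 12 failures.
print(round(annualized_failure_rate(12, 5_000 * 180), 2))
```

Counting exposure in device-days rather than device counts keeps drives that were deployed for only part of the observation window from skewing the rate.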
Understanding SSD Failures – An analogy
SSD: Reactive, Proactive
What are the symptoms?
Medical analogy: fever, unexpected weight loss, low blood pressure
SSD symptoms:
- Reallocated sectors
- Data errors
- Program and erase failure
- SATA downshift
SSD Failure Symptoms
[Figure: AFR (%) with vs. without each symptom: Reallocated Sector Count, Program and Erase Fail Count, CRC and Uncorrectable Error Count, SATA Downshift Count; bar annotations: 18X, 3.91X, 3.95X, 2.76X]
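The "X" annotations on this chart read as ratios of the AFR of devices with a symptom to the AFR of devices without it. A small sketch of that comparison, assuming AFR is failures per device-year in percent; the record layout and the toy fleet are made up for illustration:

```python
def afr_percent(failures, device_years):
    """AFR in percent, assuming failures per device-year * 100."""
    return 100.0 * failures / device_years


def symptom_afr_ratio(devices):
    """devices: iterable of (has_symptom, failed, years_observed) tuples.

    Returns (AFR with symptom, AFR without symptom, ratio of the two).
    """
    totals = {True: [0, 0.0], False: [0, 0.0]}  # has_symptom -> [failures, device-years]
    for has_symptom, failed, years in devices:
        totals[bool(has_symptom)][0] += int(failed)
        totals[bool(has_symptom)][1] += years
    if totals[True][1] == 0 or totals[False][1] == 0:
        raise ValueError("both groups need some observed device-years")
    with_afr = afr_percent(*totals[True])
    without_afr = afr_percent(*totals[False])
    if without_afr == 0:
        raise ValueError("no failures without the symptom; ratio undefined")
    return with_afr, without_afr, with_afr / without_afr


# Toy fleet (synthetic): 2 devices with the symptom, 10 without, one year each.
devices = [(True, True, 1.0), (True, False, 1.0), (False, True, 1.0)]
devices += [(False, False, 1.0)] * 9
print(symptom_afr_ratio(devices))
```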
Insufficiency of symptom-only diagnosis
Symptoms are seen in only 62% of failed devices.
[Figure: % of devices (Failed vs. Healthy) showing each symptom: Reallocations, Program and Erase Fail, Data Errors, SATA Downshift, Any]
What are the factors?
SSD factors, with their analogy:
- Workload (lifestyle)
- Production environment (environmental agents)
- Design decisions (genetics)
Device level correlating factors
- Average write rate of a device
- Average read rate of a device
- Total read and/or write usage
- Write amplification
- Read/write ratio
Increasing failure trend at higher write rates.
[Figure: AFR (%) vs. avg. host writes per day]
More results in the paper.
Server level correlating factors
- SSD space utilization
- Disk space utilization
- Memory utilization
- Processor utilization
Decreasing failure trend at high disk space usage.
[Figure: AFR (%) vs. avg. disk space utilization]
More results in the paper.
Datacenter factors
- Rack SKU
- Datacenter facility
Same model, different behavior.
[Figure: AFR (%) by SKU and SSD model: models 1-D and 2-A in SKUs S1-3a and S1-3b]
More results in the paper.
Understanding SSD Failures – An analogy
Multi-feature analysis of symptoms and factors:
- Random forest based binary classification
- Permutation feature ranking
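Permutation feature ranking scores a feature by how much a trained classifier's accuracy drops when that feature's column is shuffled. The sketch below shows the idea with the standard library only. It swaps the random forest for a fixed, hand-written rule (`toy_model`, hypothetical) and uses synthetic data, so it illustrates the ranking step rather than the authors' pipeline:

```python
import random


def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Rank features by the mean drop in accuracy when their column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = {}
    for name in rows[0]:
        drops = []
        for _ in range(n_repeats):
            column = [r[name] for r in rows]
            rng.shuffle(column)
            shuffled = [dict(r, **{name: v}) for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances[name] = sum(drops) / n_repeats
    return sorted(importances.items(), key=lambda kv: kv[1], reverse=True)


# Stand-in for the trained classifier (hypothetical rule, not from the slides).
def toy_model(row):
    return "F" if row["data_errors"] > 1 else "H"


# Synthetic devices: labels depend only on data_errors, never on reads_per_day.
data_rng = random.Random(1)
rows = [{"data_errors": data_rng.randint(0, 5),
         "reads_per_day": data_rng.randint(0, 100)} for _ in range(200)]
labels = [toy_model(r) for r in rows]

ranking = permutation_importance(toy_model, rows, labels)
print(ranking)
```

Because the toy labels ignore `reads_per_day`, shuffling that column leaves accuracy unchanged and it ranks last with zero importance.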
Understanding What?
- What are the important factors?
- What is their order of importance?
- What are the important combinations?
Understanding What?
[Figure: Feature importance (scale 0 to 1), features in chart order]
- Symptoms: DataErrors, ReallocSectors
- Device workload: TotalNANDWrites, HostWrites, TotalReads+Writes
- Server workload: AvgMemory, AvgSSDSpace
- UsagePerDay, TotalReads, ReadsPerDay
Understanding What?
Combinations of the top 8 important features (frequent combinations; H = healthy, F = failed):
Condition | Class
Data Errors <= 1 & Reallocated Sectors <= 5 | H
Data Errors <= 1 & WAF <= 1 | H
Media Wear-out = 100 & WAF <= 1 | H
Avg. SSD space >= 10 | F
Highlighted in turn: SYMPTOMS, SYMPTOMS + WORKLOAD, WORKLOAD
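The conditions on this slide can be read as predicates over per-device features. A sketch of that reading, assuming hypothetical dictionary keys for the features and assuming a device is simply reported with every rule it matches; the slides do not say how rules are combined or what unit "Avg. SSD space" uses:

```python
# Each rule: (condition text from the slide, class, predicate over a device dict).
# The dictionary keys are illustrative names, not identifiers from the study.
RULES = [
    ("Data Errors <= 1 & Reallocated Sectors <= 5", "H",
     lambda d: d["data_errors"] <= 1 and d["realloc_sectors"] <= 5),
    ("Data Errors <= 1 & WAF <= 1", "H",
     lambda d: d["data_errors"] <= 1 and d["waf"] <= 1),
    ("Media Wear-out = 100 & WAF <= 1", "H",
     lambda d: d["media_wearout"] == 100 and d["waf"] <= 1),
    ("Avg. SSD space >= 10", "F",
     lambda d: d["avg_ssd_space"] >= 10),
]


def matching_rules(device):
    """Return (condition, class) for every rule the device satisfies."""
    return [(text, cls) for text, cls, pred in RULES if pred(device)]


# Made-up device record for illustration.
device = {"data_errors": 0, "realloc_sectors": 2, "waf": 3.0,
          "media_wearout": 100, "avg_ssd_space": 4}
for text, cls in matching_rules(device):
    print(cls, text)
```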
Understanding When?
- What is the duration between detection and failure?
- What signatures characterize SSD survivability?
Understanding When?
[Figure: CDF of time to fail (months), 0 to 12 months]
- 50% of failures: > 4 months to fail, sufficient time to intervene
- Late failures: rules contain only workload factors
- Early failures (< 1 month): rules include symptoms and their thresholds
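The curve on this slide is a cumulative distribution of time to fail. A minimal sketch of how such an empirical CDF can be evaluated at a point; the sample values are made up for illustration and are not the study's data:

```python
def empirical_cdf(samples, x):
    """Fraction of samples that are less than or equal to x."""
    if not samples:
        raise ValueError("samples must be non-empty")
    return sum(s <= x for s in samples) / len(samples)


# Made-up times to fail, in months.
time_to_fail = [0.5, 1, 2, 3, 5, 6, 8, 9, 11, 12]

# Fraction of these failures that occur within 4 months.
print(empirical_cdf(time_to_fail, 4))
```

One minus this value at 4 months is the share of failures that leave more than 4 months to intervene.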
Understanding SSD Failures – An analogy
- Observation-based causal estimate
- Probabilistic causal models and Pearl's do-calculus
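One standard identity that follows from do-calculus is the back-door adjustment, P(Y | do(X = x)) = sum over z of P(Y | x, z) P(z), valid when Z blocks all back-door paths from X to Y. The sketch below estimates it from observational counts. It is not necessarily the estimator the authors used, and the variable roles (write rate as X, SSD model as Z) and the records are toy assumptions:

```python
from collections import Counter


def backdoor_adjusted(records, x_value):
    """Estimate P(failed | do(X = x_value)) = sum_z P(failed | x, z) * P(z).

    records: list of (x, z, failed) tuples with failed in {0, 1};
    z is assumed to block every back-door path from x to failure.
    """
    n = len(records)
    z_counts = Counter(z for _, z, _ in records)
    total = 0.0
    for z, z_n in z_counts.items():
        stratum = [failed for x, zz, failed in records if x == x_value and zz == z]
        if not stratum:
            raise ValueError(f"no observations with x={x_value!r}, z={z!r}")
        total += (sum(stratum) / len(stratum)) * (z_n / n)
    return total


# Toy records (synthetic): (write rate, SSD model, failed).
records = (
    [("high", "A", 1), ("high", "A", 0)]
    + [("low", "A", 0)] * 2
    + [("high", "B", 1)] + [("high", "B", 0)] * 3
    + [("low", "B", 0)] * 4
)
print(backdoor_adjusted(records, "high"))
print(backdoor_adjusted(records, "low"))
```

Weighting each stratum by P(z) rather than by P(z | x) is what separates this interventional estimate from the plain conditional failure rate.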
Understanding Why?
- What factors impact SSD reliability?
- What is their magnitude of impact?
Understanding Why?
- SSD model and symptoms have a direct impact
- Workload impacts failures through media wear-out
Concluding Remarks
• SSD failures in the field
• Factors -> Symptoms -> Failures
• Important symptoms: Data Errors and Reallocated Sectors
• High intensity and rapid progression lead to early failure
• Important factors: NAND Writes, Total Reads and Writes, etc.
• Direct impact: SSD model and symptoms
• Indirect impact: workload, through wear-out
• Future direction: prediction and control