An Analysis of Data Corruption in the Storage Stack

Department of Computer Science, Institute for System Architecture, Operating Systems Group
An Analysis of Data Corruption in the Storage Stack
Lakshmi N. Bairavasundaram, Garth Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Presented by Carsten Weinhold
Paper Reading Group, 2008-06-24

About the Study
• Large-scale study:
  – Tens of thousands of production systems
  – 41 months
  – 1.53 million disks
  – 400,000+ checksum mismatches
• Both “nearline” and enterprise class disks
• Focus on silent data corruption (not on, e.g., latent sector errors)

Background: NetApp Storage Systems
• All storage systems by Network Appliance™
• Dedicated network filers:
  – WAFL file system
  – RAID with parity
  – SCSI layer
  – Fibre Channel (FC) loops
  – Fibre Channel disks / SATA disks with adapter
• Data collected using “Autosupport”
• Sent to central database
• Note: not all disks were in use for the full duration of 41 months

Background: Data Integrity Segments

Corruption & Detection
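
The detection idea behind these two slides can be sketched as follows. This is a minimal illustration, not the NetApp implementation: the CRC32 checksum, the dictionary layout of the integrity record, and the names `make_integrity_segment` and `verify_block` are all assumptions made for the example. The slides do not state which checksum algorithm or on-disk layout the studied systems use.

```python
import zlib

BLOCK_SIZE = 4096  # assumed file-system block size in bytes


def make_integrity_segment(data: bytes, block_id: int) -> dict:
    """Build a hypothetical integrity record: a checksum plus the block identity."""
    return {"checksum": zlib.crc32(data), "block_id": block_id}


def verify_block(data: bytes, segment: dict, expected_id: int) -> str:
    """Classify a read as 'ok', 'checksum mismatch', or 'identity discrepancy'."""
    if zlib.crc32(data) != segment["checksum"]:
        return "checksum mismatch"      # stored data no longer matches its checksum
    if segment["block_id"] != expected_id:
        return "identity discrepancy"   # intact data, but not the block we asked for
    return "ok"


block = bytes(BLOCK_SIZE)
segment = make_integrity_segment(block, block_id=7)
corrupted = b"\x01" + block[1:]  # flip one bit, as silent corruption might

print(verify_block(block, segment, expected_id=7))
print(verify_block(corrupted, segment, expected_id=7))
print(verify_block(block, segment, expected_id=8))
```

The point of keeping the checksum outside the data it protects is that the disk can return a block without reporting any error, and the mismatch is still caught on read.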

Summary Statistics
• Total of 1.53 million disks
• Total of 400,000+ checksum mismatches
• Percentage of corrupt disks varies:
  – 0.86% of 358,000 nearline disks
  – 0.065% of 1,170,000 enterprise class disks
Observation 1: the probability of developing checksum mismatches is an order of magnitude higher for nearline disks (+SATA/FC adapter) than for enterprise class disks
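
The "order of magnitude" in Observation 1 follows directly from the two percentages on this slide. A short check of the arithmetic, using only the slide's figures (the variable names are illustrative, and the disk count is approximate because the percentages are rounded):

```python
# Figures taken from the Summary Statistics slide.
nearline_disks = 358_000
enterprise_disks = 1_170_000
nearline_rate = 0.0086      # 0.86% of nearline disks developed mismatches
enterprise_rate = 0.00065   # 0.065% of enterprise class disks did

# Relative likelihood of a nearline disk developing a checksum mismatch.
ratio = nearline_rate / enterprise_rate

# Approximate number of affected nearline disks implied by the rounded percentage.
approx_corrupt_nearline = round(nearline_disks * nearline_rate)

print(f"total disks: {nearline_disks + enterprise_disks:,}")
print(f"nearline vs. enterprise: {ratio:.1f}x")
print(f"approx. corrupt nearline disks: {approx_corrupt_nearline:,}")
```

The ratio comes out at roughly 13, which is what the slide summarizes as an order of magnitude.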

Factor Disk Age: Nearline Disks

Factor Disk Age: Enterprise Class Disks

Observations
Observation 2: probability of developing checksum mismatches varies significantly across disk models in the same class of disks
Observation 3: age affects disk models differently with respect to the probability of developing checksum mismatches

Factor Disk Size ??

(Non-)Factors ??
Observation 4: there is no clear indication that disk size affects the probability of developing checksum mismatches
Observation 5: there is no clear indication that workload affects the probability of developing checksum mismatches
... but: the collected data on access patterns was very coarse and likely to be insufficient

Characteristics: Models, Classes
Observation 6: the number of checksum mismatches varies greatly across disks
Observation 7: on average, corrupt enterprise class disks develop many more checksum mismatches than corrupt nearline disks

Characteristics: Disks and Disk Shelves
Observation 8: checksum mismatches within the same disk are not independent
Observation 9: the probability of developing a checksum mismatch is not independent of that of other disks in the same storage system
  – Example:
    • One system had 92 disks develop errors
    • Caused by faulty storage controller

Characteristics: Locality
Observation 10: checksum mismatches have high spatial locality
Observations 11 & 12: there is temporal locality
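
One way to make "spatial locality" concrete is to ask, for each corrupt block, whether another corrupt block lies within a given distance on the same disk. The sketch below assumes that definition; the function name `spatial_locality`, the radius values, and the sample block numbers are invented for illustration and are not data from the study.

```python
from bisect import bisect_left, bisect_right


def spatial_locality(corrupt_blocks, radius):
    """Fraction of corrupt blocks with another corrupt block within `radius` blocks."""
    blocks = sorted(set(corrupt_blocks))
    if not blocks:
        return 0.0
    with_neighbor = 0
    for b in blocks:
        lo = bisect_left(blocks, b - radius)
        hi = bisect_right(blocks, b + radius)
        if hi - lo > 1:  # the window holds more than the block itself
            with_neighbor += 1
    return with_neighbor / len(blocks)


# Hypothetical block numbers with checksum mismatches on one disk.
sample = [100, 101, 105, 9000, 50000, 50003]

print(spatial_locality(sample, radius=1))
print(spatial_locality(sample, radius=10))
```

The same windowing idea applies to temporal locality if block numbers are replaced by detection timestamps.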

Characteristics: Error Type Correlation
Observation 12: checksum mismatches correlate with system resets
Observation 13: weak positive correlation between checksum mismatches and latent sector errors
  – If latent sector errors are detected, the probability of developing checksum mismatches increases:
    • Nearline disks: 1.4 times
    • Enterprise class disks: 2.2 times

Request Type Analysis

Comparison to Latent Sector Errors

Lessons Learned
• Silent corruption does happen: up to 4% of drives developed errors in 17 months
• On average, 8% of checksum mismatches are detected during RAID reconstruction
  ➔ Protection against double disk failure required
• An enterprise class disk is likely to quickly develop more corruption after the first occurrence
  ➔ The faulty disk should be replaced soon
• Some block numbers are more likely to be affected, possibly due to hardware/firmware bugs
  ➔ Staggered striping for RAID should be used
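
Why a mismatch found during RAID reconstruction calls for double-failure protection can be seen with a toy single-parity stripe. This is a simplified sketch with made-up 4-byte blocks and a hypothetical helper `xor_blocks`. It is not the RAID layout used by the studied systems.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks (single-parity RAID arithmetic)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A toy stripe: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Disk 0 fails: its block is rebuilt from the surviving blocks and the parity.
rebuilt = xor_blocks([data[1], data[2], parity])
print(rebuilt == data[0])

# Disk 0 fails while disk 1 holds a silently corrupted block.
silently_corrupt = b"BxBB"
bad_rebuild = xor_blocks([silently_corrupt, data[2], parity])
print(bad_rebuild == data[0])
```

With single parity, a checksum can reveal that the surviving block is bad, but there is no second source of redundancy left to repair it, so the stripe is lost. A second parity block would allow recovery in this situation.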

Lessons Learned (II)
• Corruptions have strong spatial locality
  ➔ Redundant data structures should be stored distant from each other
• Corruptions also have strong temporal locality
  ➔ Same write request? Use multiple write requests for important / redundant data?
  ➔ To be leveraged for smarter scrubbing?
• Correlation of silent corruption and other errors (e.g., latent sector errors) could be used to improve failure prediction

Discussion Points
• RAID does not (always) help and most file systems don't do checksumming! Is everything lost?
• Laptops have only one disk. ZFS supports redundancy on same disk. Any experiences?
• Can checksumming in the disk itself be improved? What would that mean with respect to firmware bugs?
• Why are enterprise class disks so much more reliable? Is there any hope that consumer disks catch up in the future?
• What about flash disks?

References
• Lakshmi N. Bairavasundaram, Garth Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, “An Analysis of Data Corruption in the Storage Stack”, FAST '08, San Jose
