End-to-end Data Integrity for File Systems: A ZFS Case Study
Yupu Zhang, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
University of Wisconsin - Madison
2/26/2010
End-to-end Argument
• Ideally, applications should take care of data integrity
• In reality, file systems are in charge
  – Data is organized by metadata
  – Most applications rely on file systems
  – Applications share data
Data Integrity in Reality
• Preserving data integrity is a challenge
• Imperfect components
  – disk media, firmware, controllers, etc.
• Techniques to maintain data integrity
  – Checksums [Stein01, Bartlett04], RAID [Patterson88]
• Enough about disks. What about memory?
Memory Corruption
• Memory corruptions do exist
  – Older studies: 200 – 5,000 FIT per Mb [O'Gorman92, Ziegler96, Normand96, Tezzaron04]
    • 14 – 359 errors per year per GB
  – A recent study: 25,000 – 70,000 FIT per Mb [Schroeder09]
    • 1,794 – 5,023 errors per year per GB
  – Reports from various software bug and vulnerability databases
• Isn't ECC enough?
  – Usually corrects only single-bit errors
  – Many commodity systems don't have ECC (for cost reasons)
  – Can't handle software-induced memory corruptions
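The per-GB rates above follow directly from the FIT figures (1 FIT = 1 failure per 10^9 device-hours). A minimal sketch of the conversion, assuming 1 GB = 8192 Mb and roughly 8,760 hours per year:

    #include <stdio.h>

    /* Convert a FIT rate quoted per megabit (1 FIT = 1 failure per 10^9 hours)
     * into expected errors per year per gigabyte.
     * Assumes 1 GB = 8 * 1024 = 8192 Mb and 1 year ~= 8760 hours. */
    static double fit_per_mb_to_errors_per_year_per_gb(double fit_per_mb)
    {
        return fit_per_mb * 8192.0 * 8760.0 / 1e9;
    }

    int main(void)
    {
        printf("%.0f\n", fit_per_mb_to_errors_per_year_per_gb(200.0));    /* ~14   */
        printf("%.0f\n", fit_per_mb_to_errors_per_year_per_gb(5000.0));   /* ~359  */
        printf("%.0f\n", fit_per_mb_to_errors_per_year_per_gb(25000.0));  /* ~1794 */
        printf("%.0f\n", fit_per_mb_to_errors_per_year_per_gb(70000.0));  /* ~5023 */
        return 0;
    }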
The Problem
• File systems cache a large amount of data in memory for performance
  – Memory capacity is growing
• File systems may cache data for a long time
  – Susceptible to memory corruptions
• How robust are modern file systems to memory corruptions?
A ZFS Case Study
• Fault injection experiments on ZFS
  – What happens when disk corruption occurs?
  – What happens when memory corruption occurs?
  – How likely is a bit flip to cause problems?
• Why ZFS?
  – Many reliability mechanisms
  – "provable end-to-end data integrity" [Bonwick07]
Results
• ZFS is robust to a wide range of disk corruptions
• ZFS fails to maintain data integrity in the presence of memory corruptions
  – reading/writing corrupt data, system crashes
  – a single bit flip has a non-negligible chance of causing failures
• Data integrity at the memory level is not preserved
Outline
• Introduction
• ZFS Background
• Data Integrity Analysis
  – On-disk Analysis
  – In-memory Analysis
• Conclusion
ZFS Reliability Features
• Checksums
  – Detect silent data corruption
  – Stored in a generic block pointer, separate from the block they cover
• Replication
  – Up to three copies of each block (ditto blocks)
  – Recover from checksum mismatches
• Copy-on-write transactions
  – Keep the disk image always consistent
• Storage pool
  – Mirror, RAID-Z
[Figure: a block pointer holding up to three block addresses (Address 1/2/3) plus the checksum of the block it points to]
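A minimal sketch of the structure behind that figure; the field names are hypothetical and the real ZFS blkptr_t is considerably more elaborate, but the key point is that the checksum lives in the parent block pointer rather than with the block it covers, and that the pointer can address up to three on-disk copies:

    #include <stdint.h>

    /* Illustrative sketch only; field names are hypothetical, not the real blkptr_t. */
    #define DITTO_COPIES 3

    typedef struct simple_blkptr {
        uint64_t addr[DITTO_COPIES];   /* up to three on-disk copies of the block (ditto blocks) */
        uint64_t checksum[4];          /* 256-bit checksum of the block this pointer refers to   */
    } simple_blkptr_t;

    /* On read, the fetched copy is verified against the checksum stored here, in the
     * parent pointer; on a mismatch, another addressed copy can be used for recovery. */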
Outline
• Introduction
• ZFS Background
• Data Integrity Analysis
  – On-disk Analysis
  – In-memory Analysis
• Conclusion
Summary of On-disk Analysis
• ZFS detects all corruptions by using checksums
• Redundant on-disk copies and in-memory caching help ZFS recover from disk corruptions
• Data integrity at this level is well preserved
(See our paper for more details)
Outline
• Introduction
• ZFS Background
• Data Integrity Analysis
  – On-disk Analysis
  – In-memory Analysis
    • Random Test
    • Controlled Test
• Conclusion
Random Test
• Goal
  – What happens when random bits get flipped?
  – How often do such failures happen?
• Fault injection
  – A trial: one run of a workload
    • Run a workload -> inject bit flips -> observe failures
• Probability calculation (a minimal sketch follows)
  – For each type of failure:
    • P(failure) = # of trials with that failure / total # of trials
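A minimal sketch of the probability estimate, with purely hypothetical trial counts:

    #include <stdio.h>

    /* Hypothetical counts, for illustration only. */
    static double failure_probability(int trials_with_failure, int total_trials)
    {
        return (double)trials_with_failure / (double)total_trials;
    }

    int main(void)
    {
        /* e.g., 6 crashes observed across 1000 fault-injection trials -> P(crash) = 0.006 */
        printf("P(failure) = %.3f\n", failure_probability(6, 1000));
        return 0;
    }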
Result of Random Test

Workload     Reading Corrupt Data   Writing Corrupt Data   Crash   Page Cache
varmail      0.6%                   0.0%                   0.3%    31 MB
oltp         1.9%                   0.1%                   1.1%    129 MB
webserver    0.7%                   1.4%                   1.3%    441 MB
fileserver   7.1%                   3.6%                   1.6%    915 MB

• The probability of failures is non-negligible
• The more page cache is consumed, the more likely a failure becomes
Outline
• Introduction
• ZFS Background
• Data Integrity Analysis
  – On-disk Analysis
  – In-memory Analysis
    • Random Test
    • Controlled Test
• Conclusion
Controlled Test
• Goal
  – Why do those failures happen in ZFS?
  – How does ZFS react to memory corruptions?
• Fault injection
  – Metadata: field by field
  – Data: a random bit in a data block
• Workload
  – For global metadata: the "zfs" command
  – For file-system-level metadata and data: the POSIX API
Result Overview
• General observations
  – Life cycle of a block
    • Why does bad data get read from or written to disk?
• Specific cases
  – Bad data is returned
  – System crashes
  – Operation fails
Lifecycle of a Block: READ
[Diagram: a block is read from DISK into the PAGE CACHE (checksum verified on the way in), may be corrupted while cached, is read by applications, and is eventually evicted; it can sit in memory for an unbounded time]
• Blocks on the disk are protected: the checksum is verified when a block is read from disk
• Blocks in memory are not protected: cached copies are returned without any verification
• The window of vulnerability is unbounded
(A minimal sketch of this read path follows.)
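A toy model of the read path, assuming a trivial additive checksum and an 8-byte block; this is illustrative only and not ZFS code, but it shows why the in-memory window is unprotected: the checksum is verified only when the block crosses the disk boundary, so a bit flipped in the page cache is returned to the application silently.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLKSZ 8

    /* Toy additive checksum, standing in for ZFS's real checksums. */
    static uint32_t checksum(const uint8_t *buf, size_t len)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }

    static uint8_t  disk_block[BLKSZ] = "zfsdata";   /* on-disk copy, covered by a checksum    */
    static uint32_t disk_checksum;                   /* stored in the parent block pointer     */
    static uint8_t  page_cache[BLKSZ];               /* in-memory copy, not covered by anything */
    static int      cached = 0;

    static void read_block(uint8_t *out)
    {
        if (!cached) {                               /* miss: fetch from disk and verify       */
            memcpy(page_cache, disk_block, BLKSZ);
            if (checksum(page_cache, BLKSZ) != disk_checksum)
                printf("corruption detected at the disk boundary\n");
            cached = 1;
        }
        memcpy(out, page_cache, BLKSZ);              /* hit: returned with NO verification     */
    }

    int main(void)
    {
        uint8_t buf[BLKSZ];
        disk_checksum = checksum(disk_block, BLKSZ);

        read_block(buf);                             /* first read: verified on the way into cache  */
        page_cache[0] ^= 0x04;                       /* a bit flip while the block sits in memory   */
        read_block(buf);                             /* second read: corrupt data returned silently */
        printf("application sees: %s\n", (char *)buf);
        return 0;
    }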
Lifecycle of a Block: WRITE
[Diagram: a block may be corrupted while sitting in the PAGE CACHE, is flushed to DISK within about 30 seconds of being dirtied (checksum generated at flush time), and may remain cached for an unbounded time before eviction]
• Corrupt blocks are written to disk permanently
• Corrupt blocks are "protected" by the new checksum, which is computed over the already-corrupt data
(A minimal sketch of this write path follows.)
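A toy model of the write path under the same assumptions (illustrative only, not ZFS code): the checksum is generated at flush time over the already-corrupt buffer, so the corruption becomes durable and every later read of that block from disk will pass verification.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLKSZ 8

    static uint32_t checksum(const uint8_t *buf, size_t len)   /* toy checksum */
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum += buf[i];
        return sum;
    }

    static uint8_t  page_cache[BLKSZ];   /* dirty block waiting (up to ~30s) to be flushed   */
    static uint8_t  disk_block[BLKSZ];
    static uint32_t disk_checksum;       /* stored in the parent block pointer at flush time */

    int main(void)
    {
        memcpy(page_cache, "zfsdata", BLKSZ);   /* application writes good data                */
        page_cache[0] ^= 0x04;                  /* bit flip while the block waits in the cache */

        /* Flush: the checksum is generated over the ALREADY-corrupt buffer, so the
         * corruption becomes durable and will pass every future on-disk verification. */
        memcpy(disk_block, page_cache, BLKSZ);
        disk_checksum = checksum(disk_block, BLKSZ);

        printf("on disk: %s, checksum verifies: %s\n", (char *)disk_block,
               checksum(disk_block, BLKSZ) == disk_checksum ? "yes" : "no");
        return 0;
    }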
Result Overview
• General observations
  – Life cycle of a block
    • Why does bad data get read from or written to disk?
• Specific cases
  – Bad data is returned
  – System crashes
  – Operation fails
Case 1: Bad Data
• Read(block 0)
  – dn_nlevels == 3 (binary 011): ZFS walks the indirect blocks down to the leaf level and returns data block 0
  – dn_nlevels flipped to 1 (binary 001): ZFS treats an indirect block as data block 0 and returns the indirect block
  – BAD DATA is returned to the application
[Diagram: a dnode at the top of a three-level tree of indirect blocks over data blocks 0, 1, 2, ...]
(A minimal sketch of the traversal follows.)
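A minimal sketch of why the flipped level count changes what is returned, using a toy three-level block tree (illustrative only, not the real ZFS dnode/dbuf code):

    #include <stdio.h>

    /* Toy block tree; level 0 holds data, higher levels are indirect blocks with pointers. */
    struct block {
        struct block *child;      /* first pointer in an indirect block (NULL for a data block) */
        const char   *payload;    /* what an application would see if this block were returned  */
    };

    static struct block data0     = { NULL,       "real contents of data block 0" };
    static struct block indirect1 = { &data0,     "(level-1 indirect block)" };
    static struct block indirect2 = { &indirect1, "(level-2 indirect block)" };

    static const char *read_block_zero(int dn_nlevels)
    {
        struct block *b = &indirect2;                       /* start at the top of the tree    */
        for (int level = dn_nlevels - 1; level > 0; level--)
            b = b->child;                                   /* descend one level per iteration */
        return b->payload;                                  /* whatever we stop at is "data"   */
    }

    int main(void)
    {
        printf("dn_nlevels = 3: %s\n", read_block_zero(3)); /* reaches the leaf: correct data  */
        printf("dn_nlevels = 1: %s\n", read_block_zero(1)); /* 011 -> 001: an indirect block   */
        return 0;                                           /* is handed back as bad data      */
    }

With dn_nlevels flipped from 3 to 7 instead (next slide), the same loop walks past the leaf, interprets data-block bytes as a block pointer, and eventually dereferences garbage.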
Case 2: System Crash
• Read(block 0)
  – dn_nlevels == 3 (binary 011): ZFS walks down to the leaf level and returns data block 0
  – dn_nlevels flipped to 7 (binary 111): ZFS keeps descending past the leaf level, treats data block 0 as an indirect block, and tries to follow an invalid block pointer
  – Later a NULL pointer is dereferenced and the system crashes
[Diagram: the same dnode and block tree as in Case 1]
Case 2: System Crash (cont.)

    uint64_t size = BP_GET_LSIZE(bp);   /* bp is a block pointer, now invalid:
                                         * size could be an arbitrarily large value */
    ...
    buf->b_data = zio_buf_alloc(size);

    void *
    zio_buf_alloc(size_t size)
    {
        size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;

        ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);   /* i.e., ASSERT(c < 256),
                                                              * but assertions are disabled */

        return (kmem_cache_alloc(zio_buf_cache[c], KM_PUSHPAGE));   /* now c > 256, so
                                                                     * zio_buf_cache[c] is NULL */
    }

    void *
    kmem_cache_alloc(kmem_cache_t *cp, int kmflag)
    {
        kmem_cpu_cache_t *ccp = KMEM_CPU_CACHE(cp);   /* cp is NULL, so ccp is also NULL */
        ...
        mutex_enter(&ccp->cc_lock);                   /* NULL-pointer dereference: CRASH!!! */
        ...
    }
Case 3: Operation Fail
• open("file")
  – zp_flags is correct: open() succeeds
  – the 41st bit of zp_flags is flipped from 0 to 1: open() fails with EACCES (permission denied)
Case 3: Operation Fail (cont.)

    /* The flipped 41st bit of zp_flags (.... 0 0 1 0 ....) is exactly the quarantine flag: */
    #define ZFS_AV_QUARANTINED 0x0000020000000000
    ...
    if (((v4_mode & (ACE_READ_DATA|ACE_EXECUTE)) &&
        (zp->z_phys->zp_flags & ZFS_AV_QUARANTINED))) {
            *check_privs = B_FALSE;
            return (EACCES);
    }
    ...
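A quick check (a minimal sketch, not ZFS code) that a single flip of bit 41 lands exactly on ZFS_AV_QUARANTINED, which is why open() starts returning EACCES:

    #include <stdint.h>
    #include <stdio.h>

    #define ZFS_AV_QUARANTINED 0x0000020000000000ULL

    int main(void)
    {
        uint64_t zp_flags = 0;        /* originally: the file is not quarantined           */
        zp_flags ^= (1ULL << 41);     /* the injected bit flip (bit 41, counting from 0)   */
        printf("quarantined now? %s\n",
               (zp_flags & ZFS_AV_QUARANTINED) ? "yes -> open() returns EACCES" : "no");
        return 0;
    }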
Summary of Results
• Blocks in memory are not protected
  – Checksums are only used at the disk boundary
• Metadata is critical
  – Bad data is returned, the system crashes, or operations fail
• Data integrity at this level is not preserved
Outline
• Introduction
• ZFS Background
• Data Integrity Analysis
  – On-disk Analysis
  – In-memory Analysis
• Conclusion
Conclusion
• A lot of effort has been put into dealing with disk failures, but little into handling memory corruptions
• Memory corruptions do cause problems
  – reading/writing bad data, system crashes, failed operations
• Shouldn't we protect data and metadata from memory corruptions?
  – to achieve end-to-end data integrity
Thank you! Questions?
The ADvanced Systems Laboratory (ADSL)
http://www.cs.wisc.edu/adsl/