Tolerating File-System Mistakes with EnvyFS
Swaminathan Sundararaman, Lakshmi N. Bairavasundaram, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
University of Wisconsin Madison; NetApp, Inc.
File Systems in Today's World
• Modern file systems are complex
  – Tens of thousands of lines of code (e.g., XFS: 45K LOC)
• Storage stack is also getting deeper
  – Hypervisor, network, logical volume manager
• Need to handle a gamut of failures
  – Memory allocation, disk faults, bit flips, system crashes
• Preserve integrity of their metadata and user data
6/18/09 Tolerating File-System Mistakes with EnvyFS
File System Bugs
• Bug reports for the Linux 2.6 series from Bugzilla
  – ext3: 64, JFS: 17, ReiserFS: 38
  – Some are FS corruption bugs causing permanent data loss
• FS bugs broadly classified into two categories
  – "Fail-stop": system immediately crashes
    • Solutions: Nooks [Swift04], CuriOS [David08]
  – "Fail-silent": accidentally corrupt on-disk state
    • Many such bugs uncovered [Prabhakaran05, Gunawi08, Yang04, Yang06b]
Bugs are inevitable in file systems
Challenge: how to cope with them?
N-Version File Systems
• Based on N-version programming [Avizienis77]
  – NFS servers [Rodrigues01], databases [Vandiver07], security [Cox06]
• EnvyFS: relies on a simple software layer
  – Store data in N child file systems
  – Operations performed on all children
• Challenge: reducing overheads while retaining reliability
  – SubSIST: novel single-instance store
[Figure: layered stack with Application, EnvyFS layer, child file systems (Child 1, Child 2, …, Child N), SIS layer, disk driver, and disk.]
Results
• Robustness
  – Traditional file systems handle few corruptions (< 4%)
  – EnvyFS 3 tolerates 98.9% of single file-system mistakes
• Performance
  – Desktop workloads: EnvyFS 3 has comparable performance
  – I/O-intensive workloads:
    • Normal mode: EnvyFS 3 + SubSIST has acceptable performance
    • Under memory pressure: EnvyFS 3 + SubSIST has large overheads
• Potential as a debugging tool for FS developers
  – Pinpoint the source of a "fail-silent" bug in ext3
Outline
• Introduction
• Building reliable file systems
• Reducing overheads with SubSIST
• Evaluation
• Conclusion
N-Version Systems
Development process:
1. Producing the specification of the software
2. Implementing N versions of the software
3. Creating the N-version layer
   – Executes the different versions
   – Determines the consensus result
1. Producing Specification
• Our own specification?
  – Impractical: requires wide-scale changes to file systems
  – Specifications take years to get accepted
• Can we leverage an existing specification?
  – Yes, we can leverage VFS, but there are some issues
• VFS is not precise enough for N-versioning purposes
  – Need to handle cases where the specification is not precise
  – e.g., ordering of directory entries, inode number allocation
Imprecise VFS Specification: Ordering Directory Entries
• Issue:
  – No specified return order
  – Can't blindly compare entries
• Solution:
  – Read all entries from a directory (dir: test in our case) from all FSes
  – Match entries from FSes
  – Return majority results
[Figure: a Readdir on dir "test" passes through the EnvyFS layer to FS X, FS Y, …, FS Z; each child returns File 1, File 2, and File 3 in a different order, and EnvyFS returns File 1, File 2, File 3.]
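The match-and-vote step for directory entries can be illustrated with a minimal Python sketch. It is not the EnvyFS implementation: the function name `majority_readdir` and the sample listings are hypothetical, and entries are matched by name only.

```python
from collections import Counter

def majority_readdir(listings):
    """Return the names that appear in a majority of the children's listings.

    listings: one list of entry names per child file system, in any order.
    """
    needed = len(listings) // 2 + 1
    # Count each name once per child, so ordering and duplicates do not matter.
    votes = Counter(name for entries in listings for name in set(entries))
    return sorted(name for name, count in votes.items() if count >= needed)

children = [
    ["file2", "file1", "file3"],   # hypothetical child FS X
    ["file3", "file2", "file1"],   # hypothetical child FS Y
    ["file3", "file1"],            # hypothetical child FS Z, missing an entry
]
result = majority_readdir(children)
```

Because the vote is taken per name over the complete listings, the order in which each child returns its entries has no effect on the result.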
Imprecise VFS Specification (cont.)
• Inode number allocation
  – Inode numbers are returned through system calls
  – Each child file system issues different inode numbers
  – Possible solution: force file systems to use the same algorithm?
  – Our solution: issue inode numbers at the EnvyFS layer
• Inode mapping table is not persistently stored
[Figure: a Stat on File 1 passes through the EnvyFS layer to FS X, FS Y, and FS Z, which store File 1 under inode numbers 10, 36, and 65. The inode mapping table row reads: Virt # 15 | FS 1: 10 | FS 2: 36 | FS 3: 65, so EnvyFS returns inode number 15 for File 1.]
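A small Python sketch of an in-memory inode mapping table follows, using the numbers from the slide (virtual inode 15 for child inodes 10, 36, and 65). The class and method names are hypothetical, and allocating virtual numbers from a simple counter is an assumption, not a detail taken from the slides.

```python
import itertools

class InodeMappingTable:
    """In-memory map from a virtual inode number issued by the N-version
    layer to the inode numbers each child file system uses for that file."""

    def __init__(self, first_virtual=1):
        self._next = itertools.count(first_virtual)
        self._by_virtual = {}      # virtual ino -> tuple of child inos
        self._by_children = {}     # tuple of child inos -> virtual ino

    def lookup_or_assign(self, child_inos):
        """Return the virtual inode number for a file, assigning one if new."""
        key = tuple(child_inos)
        if key not in self._by_children:
            virt = next(self._next)
            self._by_children[key] = virt
            self._by_virtual[virt] = key
        return self._by_children[key]

    def children_of(self, virt):
        """Return the per-child inode numbers behind a virtual inode number."""
        return self._by_virtual[virt]

table = InodeMappingTable(first_virtual=15)
virt = table.lookup_or_assign([10, 36, 65])   # File 1 in the three children
```

Since the table lives only in memory, it would have to be rebuilt after a remount, which matches the slide's note that it is not persistently stored.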
2. Implementing N Versions of FS
• Painful process
  – High cost of development, long time delays
• Lucky! The hard work has already been done for us
  – 30 different disk-based file systems in Linux 2.6
• Which file systems to use?
  – ext3, JFS, ReiserFS in a three-version FS
  – Others should work without modifications
3. Creating N-Version Layer
• N-version layer (EnvyFS)
  – Inserted beneath VFS
  – Simple design to avoid bugs
• Example: reading a file
  – Allocate N data buffers
  – Read data block from the disk
  – Compare: data, return code, file position
  – Return: data, return code
• Issues:
  – Allocate memory for each read operation
  – Extra copy from allocated buffer to application
  – Comparison overheads
[Figure: the application issues Read(file, 1 block) through the VFS layer to the EnvyFS layer (wrappers, comparators, inode mapping table); EnvyFS issues Read(…) to ext3, JFS, ReiserFS, …; each child returns err, data D, and file position x from the disk; EnvyFS returns err and D.]
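The straightforward read path listed above (N buffers, full comparison, majority result) might look like the following Python sketch. It only illustrates the voting logic: the children are modeled as hypothetical callables returning `(err, data, pos)`, not as real file systems.

```python
from collections import Counter

def nversion_read(children, size):
    """Naive N-version read: one buffer per child, then a full comparison.

    children: callables taking `size` and returning (err, data, pos).
    Returns the (err, data, pos) tuple agreed on by a majority of children.
    """
    results = [child(size) for child in children]   # N separate data buffers
    votes = Counter(results)                        # compares data, err, and pos
    winner, count = votes.most_common(1)[0]
    if count < len(children) // 2 + 1:
        raise OSError("no majority among child file systems")
    return winner

# Hypothetical children: two agree, one returns corrupted data.
good = lambda size: (0, b"A" * size, size)
bad = lambda size: (0, b"X" * size, size)
outcome = nversion_read([good, bad, good], 4)
```

The sketch makes the listed costs visible: every call holds N full copies of the data, and the comparison touches every byte of each copy.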
Reading a File in EnvyFS
• Solution:
  – Same application buffer for all FSes
  – TCP-like checksums for data comparison
  – Compare: checksums, return code, file position
  – Read data until majority
[Figure: the same read path as before (Application, VFS layer, EnvyFS layer with wrappers, comparators, and inode mapping table, then ext3, JFS, ReiserFS, …, and the disk), with a per-FS checksum table (FS 1 #, FS 2 #, …, FS N #) showing the example values 435, 435, 436.]
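The optimized read can be sketched in Python as below. The checksum follows the 16-bit ones'-complement scheme of RFC 1071 as one reading of "TCP-like"; the exact checksum EnvyFS uses is not given on the slide. The helper names (`internet_checksum`, `read_until_majority`, `make_child`) are hypothetical.

```python
from collections import Counter

def internet_checksum(data):
    """16-bit ones'-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data = bytes(data) + b"\x00"          # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                        # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def read_until_majority(children, buf):
    """Read through children into one shared buffer until a majority agree.

    children: callables that fill `buf` in place and return (err, pos).
    Returns (err, pos, reads_issued); `buf` then holds the agreed data.
    """
    needed = len(children) // 2 + 1
    votes = Counter()
    for reads, child in enumerate(children, start=1):
        err, pos = child(buf)
        key = (internet_checksum(buf), err, pos)
        votes[key] += 1
        if votes[key] >= needed:
            # The read that completed the majority was the last one issued,
            # so the shared buffer already contains the agreed-on data.
            return err, pos, reads
    raise OSError("no majority among child file systems")

def make_child(payload):
    """Hypothetical child that copies a fixed payload into the buffer."""
    def read(buf):
        buf[:] = payload
        return 0, len(payload)
    return read

buf = bytearray(4)
status = read_until_majority(
    [make_child(b"DATA"), make_child(b"DATA"), make_child(b"DATX")], buf)
```

In this example the third child is never read, because the first two already form a majority. A 16-bit checksum can collide, so this trades a small chance of a missed mismatch for lower memory and comparison cost.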
Outline
• Introduction
• Building reliable file systems
• Reducing overheads with SubSIST
• Evaluation
• Conclusion
Case for Single Instance Storage (SIS)
• Ideal: one disk per FS
• Practical: one disk for all FSes
• Overheads
  – Effective storage space: 1/N
  – N times more I/O (read/write)
• Challenge: maintain diversity while minimizing overheads
[Figure: Application, VFS layer, and EnvyFS layer above FS 1, FS 2, …, FS N. Ideal case: each FS has its own disk (Disk 1, Disk 2, …, Disk N). Practical case: requests 1, 2, …, N from all FSes share one disk request queue and one disk split into Part 1, Part 2, …, Part N.]
SubSIST: Single Instance Store
• Variant of a single-instance store
  – Selectively merges data blocks
• Block-addressable SIS
  – Exports virtual disks to FSes
  – Manages mapping, free-space info
  – Not persistently stored on disk
• EnvyFS writes through N file systems
  – N data blocks merged to 1 data block
  – Content hashes not stored persistently
  – Metadata blocks not merged
  – Merges blocks across FSes (inter-FS), not within an FS (intra-FS)
[Figure: Application, VFS layer, and EnvyFS layer above FS 1, FS 2, …, FS N; each FS writes metadata (M) and data (D) blocks to its own virtual disk (Vdisk 1, Vdisk 2, …, Vdisk N) exported by SubSIST, which contains a CHash layer, a read cache, and free-space management above the disk.]
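The merging behavior described on this slide can be illustrated with a toy block store in Python. This is a sketch under stated assumptions, not SubSIST itself: the class name `SingleInstanceStore`, the use of SHA-256 as the content hash, and the dictionary-based layout are all hypothetical. Blocks with identical content are shared across virtual disks but never within one, and all mapping state lives in memory.

```python
import hashlib

class SingleInstanceStore:
    """Toy block-level store that merges identical blocks across virtual disks."""

    def __init__(self):
        self.blocks = {}     # physical id -> block contents
        self.owners = {}     # physical id -> set of (vdisk, vblock) references
        self.by_hash = {}    # content hash -> physical id (in memory only)
        self.mapping = {}    # (vdisk, vblock) -> physical id
        self._next_id = 0

    def write(self, vdisk, vblock, data):
        self._release(vdisk, vblock)
        digest = hashlib.sha256(data).digest()
        phys = self.by_hash.get(digest)
        # Merge only across virtual disks (inter-FS), never within one.
        used_by_same_disk = phys is not None and any(
            d == vdisk for d, _ in self.owners[phys])
        if phys is None or used_by_same_disk:
            phys = self._next_id
            self._next_id += 1
            self.blocks[phys] = bytes(data)
            self.owners[phys] = set()
            self.by_hash.setdefault(digest, phys)
        self.owners[phys].add((vdisk, vblock))
        self.mapping[(vdisk, vblock)] = phys

    def read(self, vdisk, vblock):
        return self.blocks[self.mapping[(vdisk, vblock)]]

    def _release(self, vdisk, vblock):
        phys = self.mapping.pop((vdisk, vblock), None)
        if phys is None:
            return
        self.owners[phys].discard((vdisk, vblock))
        if not self.owners[phys]:            # last reference gone: free block
            digest = hashlib.sha256(self.blocks[phys]).digest()
            if self.by_hash.get(digest) == phys:
                del self.by_hash[digest]
            del self.blocks[phys], self.owners[phys]

store = SingleInstanceStore()
store.write("ext3", 7, b"payload")
store.write("jfs", 12, b"payload")         # same content: merged with ext3's block
store.write("reiserfs", 3, b"pAyload")     # differing content: kept separate
physical_blocks = len(store.blocks)
```

Three logical writes end up in two physical blocks here. A block whose content differs from the others, for example because one child corrupted it, is simply not merged, so the agreeing copies are unaffected.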
Handling Data Block Corruptions?
• Corruption to data in a single FS
  – Due to bugs, bit flips, storage stack
  – Corrupt data blocks not merged
  – All other N-1 data blocks merged
  – Corrupt data block fixed at next read
× Corruption to data block inside disk
• Single copy of data
  – Different code paths
  – Different on-disk structures
[Figure: Application, VFS layer, and EnvyFS layer above FS 1, FS 2, …, FS N; data blocks (D) flow to Vdisk 1, Vdisk 2, …, Vdisk N in SubSIST (CHash layer, read cache, free-space management) and then to the disk.]
Outline
• Introduction
• Building reliable file systems
• Reducing overheads with SubSIST
• Evaluation
  – Reliability
  – Performance
• Conclusion
Reliability Evaluation: Fault Injection
Robustness of EnvyFS in recovering from a child file system's mistake?
• Corruption: bugs in FS / storage stack
• Types of disk blocks
  – superblock, inode, block bitmap, file data, …
• Perform different file ops
  – mount, stat, creat, unlink, read, …
• Type-aware fault injection [Prabhakaran05]
• Report user-visible results
• All results are applicable with SubSIST except corruption to data blocks
[Figure: VFS and EnvyFS layer above ext3, JFS, and ReiserFS; blocks (B) pass through a pseudo device driver and the block driver to the disk.]
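The idea behind type-aware fault injection can be sketched as a pseudo device that corrupts only blocks of one chosen type. The Python below is an illustration under assumptions: the class name `FaultInjectingDevice`, the block-type table, and the bit-flipping corruption are hypothetical and do not reproduce the fault injector of [Prabhakaran05].

```python
class FaultInjectingDevice:
    """Pseudo block device that corrupts reads of one chosen block type."""

    def __init__(self, blocks, block_types, target_type):
        self.blocks = blocks              # block number -> bytes
        self.block_types = block_types    # block number -> type name
        self.target_type = target_type
        self.injected = []                # block numbers corrupted so far

    def read(self, number):
        data = self.blocks[number]
        if self.block_types.get(number) == self.target_type:
            self.injected.append(number)
            # Flip every bit so the corruption is easy to detect in a test.
            return bytes(b ^ 0xFF for b in data)
        return data

device = FaultInjectingDevice(
    blocks={0: b"\x01\x02", 5: b"\x10\x20"},
    block_types={0: "superblock", 5: "inode"},
    target_type="inode",
)
clean = device.read(0)      # superblock is passed through untouched
corrupt = device.read(5)    # inode block comes back corrupted
```

Placing such a device under a single child file system and then running file operations through the N-version layer shows whether the corruption becomes visible to the user.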
[Figure: Result matrix for ext3. Columns (file operations): path traversal, SET-1 (stat, …), SET-2 (chmod), SET-3 (fsync), getdirentries, truncate, readlink, umount, rename, symlink, mount, unlink, mkdir, rmdir, write, creat, read, link. Rows (block types): INODE, DIR, BMAP, IMAP, INDIRECT, DATA, SUPER, JSUPER, GDESC. Legend: Normal, Data loss, Cannot mount, Ops fail, Data corrupt, Crash, Read-only, Depends, N/A.]