
EIO: Error-handling is Occasionally Correct
Haryadi S. Gunawi, Cindy Rubio-González, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Ben Liblit
University of Wisconsin–Madison
FAST '08, February 28, 2008

Robustness of File Systems
• Today's file systems have robustness issues
• Buggy implementation [FiSC-OSDI'04, EXPLODE-OSDI'06]
  - Unexpected behaviors in corner-case situations
• Deficient fault-handling [IRONFS-SOSP'05]
  - Inconsistent policies: propagate, retry, stop, ignore
• Prevalent ignorance
  - Ext3: Ignore write failures during checkpoint and journal replay
  - NFS: Sync-failure at the server is not propagated to client
• What is the root cause?

Incorrect Error Code Propagation
[Figure: an NFS client issues sync() to an NFS server; on the server, dosync calls fdatawrite, sync_file, and fdatawait.]
    void dosync() {
        fdatawrite();
        ...
        sync_file();
        fdatawait();
        ...
    }

Incorrect Error Code Propagation
[Figure: the same call graph, with the calls from dosync to fdatawrite, sync_file, and fdatawait marked as unsaved error-codes. Each callee ends in "return EIO;", but dosync drops the value.]
    void dosync() {
        fdatawrite();    X
        ...
        sync_file();     X
        fdatawait();     X
        ...
    }
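
The pattern on this slide can be sketched in userspace C. The callee names come from the slide; the stub bodies, the fail_write switch, and the dosync_fixed variant are assumptions made for illustration and are not the NFS server code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stubs standing in for the callees named on the slide. */
static int fail_write = 0;      /* set to 1 to simulate a failed disk write */

static int fdatawrite(void) { return fail_write ? -EIO : 0; }
static int sync_file(void)  { return 0; }
static int fdatawait(void)  { return 0; }

/* Pattern from the slide: every return value is dropped, so the caller
 * always sees success. */
static int dosync_broken(void)
{
    fdatawrite();
    sync_file();
    fdatawait();
    return 0;
}

/* One possible fix: save each error code and report the first failure,
 * while still attempting the remaining steps. */
static int dosync_fixed(void)
{
    int ret = fdatawrite();
    int err = sync_file();
    if (ret == 0)
        ret = err;
    err = fdatawait();
    if (ret == 0)
        ret = err;
    assert(ret <= 0);   /* convention: 0 on success, negative errno on failure */
    return ret;
}
```

With fail_write set to 1, dosync_broken still returns 0 while dosync_fixed returns -EIO, which is the difference the NFS client would observe.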

Implications
• Misleading error-codes in distributed systems
  - NFS client receives SUCCEED instead of ERROR
• Useless policies
  - Retry in NFS client is not invoked
• Silent failures
  - Much harder debugging process

EDP: Error Detection and Propagation Analysis
• Static analysis
  - Useful to show how error codes flow
  - Currently: 34 basic error codes (e.g. EIO, ENOMEM)
• Target systems
  - 51 file systems (all directories in linux/fs/*)
  - 3 storage drivers (SCSI, IDE, Software-RAID)

Results
• Number of violations
  - Error-codes flow through 9022 function calls
  - 1153 (13%) calls do not save the returned error-codes
• Analysis, a closer look
  - More complex file systems, more violations
  - Location distance affects error propagation correctness
  - Write errors are neglected more than read errors
  - Many violations are not corner-case bugs
    − Error-codes are consistently ignored

Outline
• Introduction
• Methodology
  - Challenges
  - EDP tool
• Results
• Analysis
• Discussion and Conclusion

Challenges in Static Analysis
• File systems use many error codes
  - buffer → state[Uptodate] = 0
  - journal → flags = ABORT
  - int err = -EIO; ... return err;
• Error codes transform
  - Block I/O error becomes journal error
  - Journal error becomes generic error code
• Error codes propagate through:
  - Function call path
  - Asynchronous path (e.g. interrupt, network messages)
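
A minimal sketch of the two transformations listed above, which shows why an analysis that only tracks integer error codes misses part of the flow. The structures, field names, and the JOURNAL_ABORT flag are hypothetical simplifications, not kernel definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified structures; field names are illustrative only. */
struct buffer  { int uptodate; };       /* 0 means the block I/O failed */
struct journal { unsigned flags; };
#define JOURNAL_ABORT 0x1u

/* Transformation 1: a block I/O error (cleared uptodate bit) becomes a
 * journal error (abort flag). No integer error code is involved here. */
static void journal_check_buffer(struct journal *j, const struct buffer *b)
{
    assert(j != NULL && b != NULL);
    if (!b->uptodate)
        j->flags |= JOURNAL_ABORT;
}

/* Transformation 2: the journal error becomes a generic integer error code. */
static int journal_status(const struct journal *j)
{
    int err = 0;
    if (j->flags & JOURNAL_ABORT)
        err = -EIO;
    return err;
}
```

Only journal_status produces something an integer-based analysis can follow; the step through the flag is invisible to it.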

EDP
• State
  - Current state: integer error-codes, function call path
  - Future: error transformation, asynchronous path
• Implementation
  - Utilize CIL: Infrastructure for C program analysis [Necula-CC'02]
  - EDP: ~4000 LOC in OCaml
• 3 components of EDP architecture
  - Specifying error-code information (e.g. EIO, ENOMEM)
  - Constructing error channels
  - Identifying violation points

Constructing Error Channels
• Propagate function
  - Dataflow analysis
  - Connect function pointers
• Generation endpoint
  - Generates error code
  - Example: return -EIO
[Figure: call chain sys_fsync, do_fsync, filemap_fdatawrite, filemap_fdatawrt_rn, do_writepages, generic_writepages, mpage_writepages (VFS), then ext3_writepage (ext3), with edges labeled EIO. ext3_writepage generates the error code either with "if (...) return -EIO;" or, as ext3_writepage(int *err), with "*err = -EIO;".]
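
The two generation forms and the function-pointer hop can be sketched as follows. All names prefixed demo_, the page structure, and the operations table are hypothetical stand-ins, not the VFS or ext3 definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical page type and operations table; not the kernel's definitions. */
struct page { int io_failed; };

struct address_space_ops {
    int (*writepage)(struct page *pg);
};

/* Generation endpoint, form 1: the error code is created by a return statement. */
static int demo_writepage(struct page *pg)
{
    if (pg->io_failed)
        return -EIO;
    return 0;
}

/* Generation endpoint, form 2: the error code is created through an out-parameter. */
static void demo_writepage_arg(struct page *pg, int *err)
{
    assert(err != NULL);
    *err = pg->io_failed ? -EIO : 0;
}

/* Propagation through a function pointer: an analysis has to connect
 * ops->writepage to demo_writepage to follow the channel. */
static int demo_writepages(const struct address_space_ops *ops, struct page *pg)
{
    return ops->writepage(pg);
}

/* Top of the channel: hands the callee's error code to its own caller. */
static int demo_fsync(struct page *pg)
{
    static const struct address_space_ops ops = { .writepage = demo_writepage };
    return demo_writepages(&ops, pg);
}
```

Every function on the path returns what it received, so the channel from the generation endpoint to demo_fsync is unbroken.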

Detecting Violations
• Termination endpoint
  - Error code is no longer propagated
  - Two termination endpoints:
    − error-complete (minimally checks)
    − error-broken (unchecked, unsaved, overwritten)
• Goal:
  - Find error-broken endpoints

Error-complete endpoint:
    func() {
        err = func_call();
        if (err) ...
    }
Unchecked:
    func() {
        err = func_call();
    }
Unsaved / Bad Call:
    func() {
        func_call();
    }
Overwritten:
    func() {
        err = func_call();
        err = func_call_2();
    }
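
The four endpoint shapes above, written as compilable C so their effect on the caller can be compared. The callee bodies are assumptions: func_call is made to fail and func_call_2 to succeed.

```c
#include <errno.h>

/* Hypothetical callees: the first fails, the second succeeds. */
static int func_call(void)   { return -EIO; }
static int func_call_2(void) { return 0; }

/* Error-complete: the error code is saved and at least minimally checked. */
static int error_complete(void)
{
    int err = func_call();
    if (err)
        return err;
    return 0;
}

/* Error-broken, unchecked: saved but never examined. */
static int unchecked(void)
{
    int err = func_call();
    (void)err;          /* silences the compiler; the error is still lost */
    return 0;
}

/* Error-broken, unsaved (a "bad call"): the return value is discarded. */
static int unsaved(void)
{
    func_call();
    return 0;
}

/* Error-broken, overwritten: the second assignment destroys the first code. */
static int overwritten(void)
{
    int err = func_call();
    err = func_call_2();
    return err;
}
```

Only error_complete reports the failure; the three error-broken variants all return 0 even though func_call failed.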

Outline
• Introduction
• Methodology
• Results (unsaved error-codes / bad calls)
  - Graphical outputs
  - Complete results
• Analysis of Results
• Discussion and Conclusion

HFS
[Figure: error-propagation call graph for HFS, with three regions marked 1, 2, and 3. Legend: functions that generate/propagate error-codes; functions that make bad calls (do not save error-codes); good calls (calls that propagate error-codes); bad calls (calls that do not save error-codes).]

HFS (Example 1)
    int find_init(find_data *fd) {
        ...
        fd->search_key = kmalloc(...);
        if (!fd->search_key)
            return -ENOMEM;
        ...
    }

    int file_lookup() {
        ...
        find_init(fd);                  /* Bad call! */
        fd->search_key->cat = ...;      /* Null pointer dereference */
        ...
    }

Inconsistencies
    Callee        Good Calls   Bad Calls
    find_init     3            11
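
A userspace sketch of what a good call to find_init looks like. The structure layouts, the use of malloc in place of kmalloc, the simulate_oom switch, and the name file_lookup_checked are assumptions for illustration; this is not the HFS source.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins for the HFS structures on the slide. */
struct search_key { int cat; };
struct find_data  { struct search_key *search_key; };

static int simulate_oom = 0;    /* set to 1 to make the allocation fail */

static int find_init(struct find_data *fd)
{
    fd->search_key = simulate_oom ? NULL : malloc(sizeof(*fd->search_key));
    if (!fd->search_key)
        return -ENOMEM;
    return 0;
}

/* Good call: the error code is saved, checked, and propagated, so the
 * dereference below is never reached with a NULL search_key. */
static int file_lookup_checked(struct find_data *fd)
{
    int err = find_init(fd);
    if (err)
        return err;
    fd->search_key->cat = 1;
    return 0;
}
```

When the allocation fails, the caller receives -ENOMEM instead of crashing on the dereference.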


HFS (Example 2)
[Figure: region 2 of the HFS call graph.]
    int __brec_find(key) {
        Finds a record in an HFS node that best matches the given key.
        Returns ENOENT if it fails.
    }

    int brec_find(key) {
        ...
        result = __brec_find(key);
        ...
        return result;
    }

Inconsistencies
    Callee        Good Calls   Bad Calls
    find_init     3            11
    __brec_find   1            4
    brec_find     18           0


HFS (Example 3)
[Figure: region 3 of the HFS call graph.]
    int free_exts(...) {
        Traverses a list of extents and locates the extents to be freed.
        If not found, returns EIO.
        "panic?" is written before the return EIO statement.
    }

Inconsistencies
    Callee        Good Calls   Bad Calls
    find_init     3            11
    __brec_find   1            4
    brec_find     18           0
    free_exts     1            3

HFS (Summary)

Inconsistencies
    Callee        Good Calls   Bad Calls
    find_init     3            11
    __brec_find   1            4
    brec_find     18           0
    free_exts     1            3

• Not only in HFS
• Almost all file systems and storage systems have major inconsistencies

ext3
[Figure: call graph] 37 bad / 188 calls = 20%

ReiserFS
[Figure: call graph] 35 bad / 218 calls = 16%

IBM JFS
[Figure: call graph] 61 bad / 340 calls = 18%

NFS Client
[Figure: call graph] 54 bad / 446 calls = 12%

Coda
[Figure: call graph] 0 bad / 54 calls = 0% (internal); 0 bad / 95 calls = 0% (external)

Summary
• Incorrect error propagation plagues almost all file systems and storage systems

                      Bad Calls   EC Calls   Fraction
    File systems      914         7400       12%
    Storage drivers   177         904        20%

Outline
• Introduction
• Methodology
• Results
• Analysis of Results
• Discussion and Conclusion

Analysis of Results
• Correlate robustness and complexity
  - Correlate file system size with number of violations
    − More complex file systems, more violations (Corr = 0.82)
  - Correlate file system size with frequency of violations
    − Small file systems make frequent violations (Corr = -0.20)
• Location distance of calls affects correct error propagation
  - Inter-module > inter-file > intra-file bad calls
• Read vs. Write failure-handling
• Corner-case or consistent mistakes

Read vs. Write Failure-Handling
• Filter read/write operations (string comparison)
  - Callee contains "write", or "sync", or "wait" → Write ops
  - Callee contains "read" → Read ops

    Callee Type        Bad Calls   EC Calls   Fraction
    Read               26*         603        4%
    Sync+Wait+Write    177         904        20%

* mm/readahead.c: read prefetching in Memory Management
• Lots of write failures are ignored!
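
The string-comparison filter can be sketched with strstr. The enum, the function name classify_callee, and the choice to test the write group first are assumptions; the slide does not say how a name matching both groups is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum op_kind { OP_OTHER, OP_READ, OP_WRITE };

/* Classify a callee by substring match, following the slide's filter:
 * "write", "sync", or "wait" means a write op; "read" means a read op.
 * Checking the write group first is an assumption of this sketch. */
static enum op_kind classify_callee(const char *name)
{
    assert(name != NULL);
    if (strstr(name, "write") || strstr(name, "sync") || strstr(name, "wait"))
        return OP_WRITE;
    if (strstr(name, "read"))
        return OP_READ;
    return OP_OTHER;
}
```

Under this filter sync_blockdev and filemap_fdatawrite count as write ops, while find_init falls into neither group.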

Corner-Case or Consistent Mistakes?
• Define bad call frequency = (# bad calls to f()) / (# all calls to f())
• Example: sync_blockdev, 15/21
  - Bad call frequency: 71%
• Corner-case bugs
  - Bad call frequency < 20%
• Consistent bugs
  - Bad call frequency > 50%
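
The definition and thresholds above as a small C sketch. The function and enum names are hypothetical, and treating the 20% to 50% band as unclassified is an assumption, since the slide does not name it.

```c
#include <assert.h>

enum bug_class { BUG_CORNER_CASE, BUG_CONSISTENT, BUG_UNCLASSIFIED };

/* Bad call frequency as a percentage: bad calls to f() / all calls to f(). */
static double bad_call_frequency(int bad_calls, int all_calls)
{
    assert(all_calls > 0 && bad_calls >= 0 && bad_calls <= all_calls);
    return 100.0 * bad_calls / all_calls;
}

/* Thresholds from the slide: below 20% is a corner-case bug, above 50% is
 * a consistent bug. The band in between is left unclassified here. */
static enum bug_class classify_frequency(double freq_percent)
{
    if (freq_percent < 20.0)
        return BUG_CORNER_CASE;
    if (freq_percent > 50.0)
        return BUG_CONSISTENT;
    return BUG_UNCLASSIFIED;
}
```

For sync_blockdev, bad_call_frequency(15, 21) is about 71.4, which classify_frequency places in the consistent group.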

CDF of Bad Call Frequency
[Figure: CDF of bad call frequency. x-axis: bad call frequency; y-axes: cumulative #bad calls and cumulative fraction.]
• 850 bad calls fall above the 50% mark
• Fewer than 100 violations are corner-case bugs
• Example: sync_blockdev, 15 bad calls / 21 EC calls
  - Bad call frequency: 71%
  - At x = 71, y += 15

What's going on?
• Not just bugs
• But more fundamental design issues
  - Checkpoint failures are ignored
    − Why? Maybe because of journaling flaw [IOShepherd-SOSP'07]
    − Cannot recover from checkpoint failures
    − Ex: A simple block remap could not result in a consistent state
  - Many write failures are ignored
    − Lack of recovery policies? Hard to recover?
  - Many failures are ignored in the middle of operations
    − Hard to roll back?

Conclusion (developer comments)
• ext3: "there's no way of reporting error to userspace. So ignore it"
• XFS: "Just ignore errors at this point. There is nothing we can do except to try to keep going"
• ReiserFS: "we can't do anything about an error here"
• IBM JFS: "note: todo: log error handler"
• CIFS: "should we pass any errors back?"
• SCSI: "Todo: handle failure"

Thank you! Questions?
ADvanced Systems Laboratory
www.cs.wisc.edu/adsl

Extra Slides
