Verifying a high-performance crash-safe file system using a tree specification


  1. Verifying a high-performance crash-safe file system using a tree specification • Haogang Chen, Tej Chajed, Stephanie Wang, Alex Konradi, Atalay İleri, Adam Chlipala, M. Frans Kaashoek, Nickolai Zeldovich

  2. File systems are difficult to make correct • Complicated implementations • on-disk layout • in-memory data structures • Computer can crash at any time

  3. Despite much effort, file systems have bugs • File systems still have subtle bugs • Well documented [Lu, TOS ’14] [Min, SOSP ’15] • Example from ext4: a combination of two optimizations allows data to leak from one file to another on crash • Discovered after 6 years [Kara 2014]

  4. Approach: formal verification • Write a specification • Prove implementation meets the specification • Ensures implementation handles all corner cases • Proof assistant (Coq) ensures proof is correct • Avoids a large class of bugs

  5. Existing verified file systems [Figure: correctness vs. performance plot; verified file systems: FSCQ [SOSP ’15], BilbyFS [ASPLOS ’16], Yggdrasil [OSDI ’16]; also shown: ext4, btrfs, ZFS]

  6. Goal: verified high-performance file system [Figure: the same correctness vs. performance plot with a "?" marking the goal; verified file systems: FSCQ [SOSP ’15], BilbyFS [ASPLOS ’16], Yggdrasil [OSDI ’16]; also shown: ext4, btrfs, ZFS]

  7. Strawman: optimize FSCQ [Figure: correctness vs. performance plot; label: FSCQ code]

  8. Strawman: optimize FSCQ [Figure: correctness vs. performance plot; labels: spec, proof, FSCQ code]

  9. Strawman: optimize FSCQ [Figure: correctness vs. performance plot; labels: spec, proof, FSCQ code, fast FSCQ, "proof?"]

  10. Problem: specification incompatible with high performance • Achieving high performance requires optimizations • Some optimizations change file-system behavior • Requires changes to specification

  11. Example optimization: deferred commit • Deferred commit: buffer system calls until fsync • FSCQ’s specification: “if create(f) has returned and computer crashes, f exists” • Deferred commit requires a new specification

  12. Optimizations that change crash behavior • Deferred commit: buffer system calls until fsync • Log-bypass writes: skip log for data writes • Buffer cache: cache data until fdatasync • Existing specifications do not support these optimizations

  13. Contribution: DFSCQ file system • Precise specification for a subset of POSIX • supports deferred commit and log-bypass writes • Verified, crash-safe file system • Traditional journalling file-system design • Implements most of ext4’s optimizations • Machine-checked proof that implementation meets specification • Performance on par with ext4 (but DFSCQ has fewer features)

  14. Specifying a file system • Design abstract state

  15. Specifying a file system • Design abstract state • Describe how system calls execute

  16. Specifying a file system • Design abstract state • Describe how system calls execute • Describe effect of crashes

  17. Starting point: tree as abstract state • Trees are a simplified abstraction of a file system [Figure: tree with nodes g and f]

  18. Specification abstracts implementation details [Figure: abstract state (tree with nodes g and f) and the implementation’s state]

  19. Specify how system calls affect abstract state • Specification describes transition [Figure: state transition for unlink(g) on a tree with nodes g and f]

  20. Challenges in specifying crash behavior • Optimizations mean crashes can be complex • Problem 1: deferred commit • Problem 2: log-bypass writes • Problem 3: caching

  21. Problem 1: deferred commit leads to many crash states [Figure: unlink(g) on a tree with nodes g and f]

  22. Problem 1: deferred commit leads to many crash states [Figure: unlink(g) on a tree with nodes g and f; crash: reset memory]

  23. Problem 1: deferred commit leads to many crash states [Figure: unlink(g) on a tree with nodes g and f; crash: reset memory; several possible resulting trees]

  24. How do we specify crash outcomes with deferred commit? [Figure: trees with nodes g and f]

  25. How do we specify crash outcomes with deferred commit? [Figure: trees with nodes g and f; crash]

  26. Specify deferred commit using tree sequences [Figure: tree sequence]

  27. Specify deferred commit using tree sequences • Abstract state is a sequence of trees [Figure: tree sequence]

  28. Specify deferred commit using tree sequences • Abstract state is a sequence of trees • Always read from the latest tree [Figure: tree sequence]

  29. Specify deferred commit using tree sequences • Metadata updates add new trees in the specification • Always read from the latest tree [Figure: unlink(g) appends a tree to the sequence]

  30. Specify deferred commit using tree sequences • Metadata updates add new trees in the specification • Always read from the latest tree [Figure: resulting tree sequence]

  31. Specify deferred commit using tree sequences • Metadata updates add new trees in the specification • Always read from the latest tree [Figure: truncate(f,2) appends a tree to the sequence]

  32. Specify deferred commit using tree sequences • Metadata updates add new trees in the specification • Always read from the latest tree [Figure: resulting tree sequence]

  33. Specify deferred commit using tree sequences • Metadata updates add new trees in the specification • Always read from the latest tree [Figure: rename(f,/) appends a tree to the sequence]
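
The tree-sequence rules on these slides (metadata updates append a tree; reads use the latest tree) can be sketched as a small Python model. This is an illustrative sketch only, not DFSCQ's Coq specification: the `TreeSeq` class, its method names, and the flat path-to-contents dictionary standing in for a directory tree are all assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical flat stand-in for a directory tree: path -> file contents.
Tree = Dict[str, bytes]


class TreeSeq:
    """Abstract state modeled as a sequence of trees (oldest first)."""

    def __init__(self, initial: Tree) -> None:
        self.trees: List[Tree] = [dict(initial)]

    def latest(self) -> Tree:
        return self.trees[-1]

    def read(self, path: str) -> bytes:
        # Reads always observe the latest tree.
        return self.latest()[path]

    def unlink(self, path: str) -> None:
        # A metadata update appends a new tree; older trees stay unchanged.
        new_tree = dict(self.latest())
        del new_tree[path]
        self.trees.append(new_tree)

    def truncate(self, path: str, size: int) -> None:
        new_tree = dict(self.latest())
        # Shrink the file, or zero-pad it when growing.
        new_tree[path] = new_tree[path][:size].ljust(size, b"\x00")
        self.trees.append(new_tree)


seq = TreeSeq({"/g": b"gg", "/f": b"hello"})
seq.unlink("/g")
seq.truncate("/f", 2)
```

After the two calls the sequence holds three trees: the original, the one without `/g`, and the one with `/f` truncated. Only the last is visible to reads.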

  34. Behavior of tree sequences on crash • What about crash behavior? [Figure: tree sequence]

  35. Behavior of tree sequences on crash • What about crash behavior? [Figure: a tree sequence, a crash, and the post-crash tree sequence]

  36. Crash specification allows background commits [Figure: a tree sequence, a crash, and the possible post-crash states]

  37. Specification for fsync [Figure: fsync("/") applied to a tree sequence]
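
The crash and fsync rules for tree sequences can be sketched as two small functions. This is a hedged illustration under the same toy model (a tree as a flat path-to-contents dictionary). The function names are hypothetical. The sketch assumes that, after a crash, any single tree from the sequence may survive, which is how the "background commits" slide reads.

```python
from typing import Dict, List

# Hypothetical flat stand-in for a directory tree: path -> file contents.
Tree = Dict[str, bytes]


def crash_outcomes(trees: List[Tree]) -> List[List[Tree]]:
    """Assumed crash rule: each tree may be the sole survivor after a crash."""
    return [[dict(t)] for t in trees]


def fsync(trees: List[Tree]) -> List[Tree]:
    """Collapse the sequence to its latest tree."""
    return [dict(trees[-1])]


t0 = {"/g": b"gg", "/f": b"hello"}
t1 = {"/f": b"hello"}  # after unlink(g)
t2 = {"/f": b"he"}     # after truncate(f, 2)

before = crash_outcomes([t0, t1, t2])        # three possible post-crash states
after = crash_outcomes(fsync([t0, t1, t2]))  # only the latest tree remains
```

Before the fsync, a crash may expose any of the three trees; afterwards only the latest one is possible.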

  38. Problem 2: log-bypass writes may reorder updates • Log-bypass writes: update file data blocks in place, skipping log [Figure: a rename followed by a write to f]

  39. Problem 2: log-bypass writes may reorder updates • Log-bypass writes: update file data blocks in place, skipping log • Effect: data writes and metadata updates can be reordered on crash [Figure: a rename followed by a write to f, then a crash]

  40. Log-bypass writes • At minimum, writes to latest tree [Figure: write(f,…) applied to a tree sequence]

  41. Log-bypass writes • Affects the same file in earlier trees [Figure: write(f,…) applied to a tree sequence]

  42. Specify that other files are unaffected • Puts an obligation on the implementation to avoid block re-use within a tree sequence [Figure: write(f,…) applied to a tree sequence; block b21 marked with "?"]

  43. Specify that other files are unaffected • Puts an obligation on the implementation to avoid block re-use within a tree sequence [Figure: write(f,…) applied to a tree sequence; block b21]

  44. Specify that other files are unaffected • Puts an obligation on the implementation to avoid block re-use within a tree sequence [Figure: write(f,…) applied to a tree sequence; blocks b21 and b51]
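
One way to picture the log-bypass rule (the write reaches the same file in earlier trees, and other files are unaffected) is the sketch below. It is not taken from DFSCQ: identifying "the same file" by an inode number, the `(inode, contents)` tuples, and the function name are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical model: each tree maps path -> (inode number, contents).
Tree = Dict[str, Tuple[int, bytes]]


def log_bypass_write(trees: List[Tree], path: str, data: bytes) -> List[Tree]:
    """Apply a data write to the written file in every tree of the sequence."""
    inode, _ = trees[-1][path]  # the file being written, found in the latest tree
    updated = []
    for tree in trees:
        new_tree = {}
        for p, (ino, contents) in tree.items():
            # Same inode: the write shows up here too. Other files are untouched.
            new_tree[p] = (ino, data) if ino == inode else (ino, contents)
        updated.append(new_tree)
    return updated


# f (inode 1) was renamed between the two trees; g (inode 2) is unrelated.
trees = [
    {"/d/f": (1, b"old"), "/g": (2, b"gg")},
    {"/f": (1, b"old"), "/g": (2, b"gg")},
]
result = log_bypass_write(trees, "/f", b"new")
```

In this toy model, re-using an identifier for a different file within the sequence would make the write leak into that file, which is the kind of hazard the re-use obligation on the slides rules out for blocks.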

  45. Problem 3: data writes are cached • Write-back buffer cache [Figure: a write to f followed by a crash]

  46. Problem 3: data writes are cached • Write-back buffer cache • Data can be persisted in any order [Figure: a write to f followed by a crash; several possible outcomes]

  47. Specifying data caching: block sets [Figure: tree sequence; annotations: "uncached", "two possible values: old and new"]

  48. Behavior of block sets on crash [Figure: a tree sequence with block sets, before and after a crash]

  49. Behavior of block sets on crash • Two degrees of non-determinism in crash states [Figure: a tree sequence with block sets, before and after a crash]

  50. Behavior of block sets on crash • Two degrees of non-determinism in crash states • Specification allows metadata and data updates to be reordered [Figure: a tree sequence with block sets, before and after a crash]

  51. Specification for fdatasync [Figure: fdatasync(f) applied to a tree sequence]

  52. Specification for fdatasync • fdatasync specification says block sets collapse in every tree [Figure: fdatasync(f) applied to a tree sequence]
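
The block-set idea can be sketched as follows: a block that has been written but not yet synced may hold either its old or its new value after a crash, and fdatasync collapses it to the latest value. The `BlockSet` class and `crash_contents` helper are hypothetical names. The sketch models the blocks of a single file only, not the "in every tree" part of the specification.

```python
from itertools import product
from typing import List, Set, Tuple


class BlockSet:
    """A block's possible on-disk values: the latest write plus unsynced older ones."""

    def __init__(self, value: str) -> None:
        self.latest = value
        self.older: List[str] = []

    def write(self, value: str) -> None:
        # The previous value may still be what is on disk.
        self.older.append(self.latest)
        self.latest = value

    def sync(self) -> None:
        # fdatasync: from now on only the latest value can survive a crash.
        self.older.clear()

    def possible(self) -> List[str]:
        return [self.latest] + self.older


def crash_contents(blocks: List[BlockSet]) -> Set[Tuple[str, ...]]:
    """Each block independently keeps any one of its possible values."""
    return set(product(*(b.possible() for b in blocks)))


b0, b1 = BlockSet("old0"), BlockSet("old1")
b0.write("new0")
b1.write("new1")
unsynced = crash_contents([b0, b1])  # all four old/new combinations

b0.sync()
b1.sync()
synced = crash_contents([b0, b1])    # only the new values remain
```

Because each block chooses independently, mixed outcomes such as new data in the first block and old data in the second are allowed until the sync, matching "data can be persisted in any order".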

  53. Summary: DFSCQ’s tree-based specification • metadata operations add a new tree • fsync collapses to latest tree • writes update block sets in every tree • fdatasync collapses block sets in every tree

  54. Prove implementation meets specification [Figure: stat(g) on both the abstract state and the implementation’s state returns "length: 2, type: file, …"]

  55. Prove implementation meets specification • Return values match [Figure: stat(g) on both the abstract state and the implementation’s state returns "length: 2, type: file, …"]

  56. Prove implementation meets specification • Return values match [Figure: stat(g) followed by unlink(g) on both the abstract state and the implementation’s state]

  57. Prove implementation meets specification • Return values match • Disk continues to relate to abstract state [Figure: stat(g) followed by unlink(g) on both the abstract state and the implementation’s state]
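
The two proof obligations named on these slides (return values match, and the disk continues to relate to the abstract state) can be illustrated on a toy example. The sketch below checks them for a single execution only, whereas the Coq proof covers all executions. The tombstone-based `disk` layout, the `abstraction` function, and every function name here are assumptions invented for illustration, not DFSCQ's design.

```python
from typing import Dict, Tuple

# Hypothetical implementation state: name -> (live flag, contents).
Disk = Dict[str, Tuple[bool, bytes]]
# Abstract state: name -> contents.
Tree = Dict[str, bytes]


def abstraction(disk: Disk) -> Tree:
    """The abstract tree that a given implementation state represents."""
    return {name: data for name, (live, data) in disk.items() if live}


def impl_stat(disk: Disk, name: str) -> dict:
    _, data = disk[name]
    return {"length": len(data), "type": "file"}


def spec_stat(tree: Tree, name: str) -> dict:
    return {"length": len(tree[name]), "type": "file"}


def impl_unlink(disk: Disk, name: str) -> None:
    disk[name] = (False, b"")  # leave a tombstone instead of deleting the entry


def spec_unlink(tree: Tree, name: str) -> Tree:
    return {n: d for n, d in tree.items() if n != name}


disk: Disk = {"g": (True, b"gg"), "f": (True, b"hello")}
tree = abstraction(disk)

# Obligation 1: the implementation returns what the specification allows.
returns_match = impl_stat(disk, "g") == spec_stat(tree, "g")

# Obligation 2: after the call, the disk still relates to the new abstract state.
impl_unlink(disk, "g")
still_related = abstraction(disk) == spec_unlink(tree, "g")
```

The implementation keeps a tombstone that the specification never mentions; the abstraction function hides it, which is the sense in which the specification abstracts implementation details.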

  58. DFSCQ Design [Figure: layer diagram; labels: directory, name cache, inode, k-indirect blocks, dirty blocks, block allocator, free-bit cache, avoid re-use, logging, checksums, deferred commit, log-bypass API, buffer cache]

  59. Many single-layer optimizations • Affect only proof of single layer [Figure: layer diagram; labels: directory, name cache, inode, k-indirect blocks, dirty blocks, block allocator, free-bit cache, avoid re-use, logging, checksums, deferred commit, log-bypass API, buffer cache]

  60. Many single-layer optimizations • Affect only proof of single layer [Figure: layer diagram; labels: directory, name cache, inode, k-indirect blocks, dirty blocks, block allocator, cache free blocks, free-bit cache, avoid re-use, logging, checksums, deferred commit, log-bypass API, buffer cache]
