Verifying a high-performance crash-safe file system using a tree specification
Haogang Chen, Tej Chajed, Stephanie Wang, Alex Konradi, Atalay İleri, Adam Chlipala, M. Frans Kaashoek, Nickolai Zeldovich
File systems are difficult to make correct
• Complicated implementations
  • on-disk layout
  • in-memory data structures
• Computer can crash at any time
Despite much effort, file systems have bugs
• File systems still have subtle bugs
• Well documented [Lu, TOS ’14] [Min, SOSP ’15]
• Example from ext4: a combination of two optimizations allows data to leak from one file to another on crash
• Discovered after 6 years [Kara 2014]
Approach: formal verification
• Write a specification
• Prove implementation meets the specification
• Ensures implementation handles all corner cases
• Proof assistant (Coq) ensures proof is correct
• Avoid large class of bugs
Existing verified file systems
[Figure: plot with axes "correctness" and "performance", showing the verified file systems FSCQ [SOSP ’15], BilbyFS [ASPLOS ’16], and Yggdrasil [OSDI ’16] alongside ext4, btrfs, and ZFS.]
Goal: verified high-performance file system
[Figure: correctness/performance plot of the verified file systems FSCQ [SOSP ’15], BilbyFS [ASPLOS ’16], and Yggdrasil [OSDI ’16] and of ext4, btrfs, and ZFS, with a "?" marking the goal.]
Strawman: optimize FSCQ
[Figure: correctness/performance plot. FSCQ is shown with its spec, proof, and code; the optimized "fast FSCQ" is shown with the label "proof?".]
Problem: specification incompatible with high performance
• Achieving high performance requires optimizations
• Some optimizations change file-system behavior
• Requires changes to specification
Example optimization: deferred commit
• Deferred commit: buffer system calls until fsync
• FSCQ’s specification: “if create(f) has returned and computer crashes, f exists”
• Deferred commit requires a new specification
Optimizations that change crash behavior
• Deferred commit: buffer system calls until fsync
• Log-bypass writes: skip log for data writes
• Buffer cache: cache data until fdatasync
• Existing specifications do not support these optimizations
Contribution: DFSCQ file system
• Precise specification for a subset of POSIX
  • supports deferred commit and log-bypass writes
• Verified, crash-safe file system
• Traditional journalling file-system design
• Implements most of ext4’s optimizations
• Machine-checked proof that implementation meets specification
• Performance on par with ext4 (but DFSCQ has fewer features)
Specifying a file system
• Design abstract state
• Describe how system calls execute
• Describe effect of crashes
Starting point: tree as abstract state
Trees are a simplified abstraction of a file system
[Figure: example tree with nodes g and f.]
Specification abstracts implementation details
[Figure: the abstract state (a tree with g and f) shown next to the implementation’s state.]
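Specify how system calls affect abstract state
• Specification describes transition
[Figure: unlink(g) takes the tree containing g and f to a tree containing only f; the same call is shown at the implementation level.]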
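The idea of a tree as abstract state, with system calls as transitions between trees, can be sketched in a few lines. This is only an illustration, not DFSCQ's Coq specification: the `Tree` type, the `unlink` function, and the flat example directory are assumptions made for the sketch.

```python
from typing import Dict, Union

# Hypothetical abstract state: a tree maps names to file contents (bytes)
# or to nested directories (dicts). Not the DFSCQ definition.
Tree = Dict[str, Union[bytes, "Tree"]]


def unlink(tree: Tree, name: str) -> Tree:
    """Specification-style transition: return a new tree without `name`.

    The input tree is left unchanged, so the call is a pure function
    from one abstract state to the next.
    """
    if name not in tree:
        raise FileNotFoundError(name)
    return {k: v for k, v in tree.items() if k != name}


before: Tree = {"f": b"ab", "g": b"cd"}
after = unlink(before, "g")
print(sorted(before))  # ['f', 'g']
print(sorted(after))   # ['f']
```

Writing the transition as a pure function keeps the old tree available, which matters once crashes can expose earlier states.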
Challenges in specifying crash behavior
• Optimizations mean crashes can be complex
• Problem 1: deferred commit
• Problem 2: log-bypass writes
• Problem 3: caching
Problem 1: deferred commit leads to many crash states
[Figure: unlink(g) takes the tree with g and f to a tree with only f; a crash resets memory, and more than one post-crash tree is shown.]
How do we specify crash outcomes with deferred commit?
[Figure: a crash following the trees shown for unlink(g).]
Specify deferred commit using tree sequences
• Abstract state is a sequence of trees
• Always read from the latest tree
[Figure: a tree sequence containing a single tree with g and f.]
Specify deferred commit using tree sequences
• Metadata updates add new trees in the specification
• Always read from the latest tree
[Figure: unlink(g), truncate(f,2), and rename(f,/) are applied in turn; each call appends a new tree to the sequence.]
Behavior of tree sequences on crash
• What about crash behavior?
[Figure: a crash turns a multi-tree sequence into a post-crash tree sequence containing one tree.]
Crash specification allows background commits
[Figure: a crash on a tree sequence; the post-crash states shown are the individual trees of the sequence.]
Specification for fsync
[Figure: fsync("/") turns a multi-tree sequence into a sequence holding a single tree.]
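The tree-sequence rules from these slides (metadata updates append a tree, reads use the latest tree, a crash may land on any tree in the sequence, fsync collapses to the latest tree) can be sketched as a small model. The `TreeSeq` class and its method names are hypothetical, and the tree is simplified to a flat name-to-contents map.

```python
from typing import Dict, List

# Simplified, hypothetical tree: a flat map from file name to contents.
Tree = Dict[str, bytes]


class TreeSeq:
    """Toy model of a tree sequence; not the DFSCQ specification itself."""

    def __init__(self, initial: Tree) -> None:
        self.trees: List[Tree] = [dict(initial)]

    def latest(self) -> Tree:
        # Reads always observe the latest tree.
        return self.trees[-1]

    def unlink(self, name: str) -> None:
        # A metadata update appends a new tree instead of mutating the old one.
        new_tree = dict(self.latest())
        del new_tree[name]  # raises KeyError if the name does not exist
        self.trees.append(new_tree)

    def fsync(self) -> None:
        # fsync collapses the sequence to the latest tree.
        self.trees = [self.latest()]

    def crash_states(self) -> List[Tree]:
        # After a crash, any tree in the sequence may be the one on disk.
        return [dict(t) for t in self.trees]


seq = TreeSeq({"f": b"ab", "g": b"cd"})
seq.unlink("g")
print(len(seq.crash_states()))  # 2
seq.fsync()
print(seq.crash_states())       # [{'f': b'ab'}]
```

Before fsync, the crash may or may not have lost the unlink; afterwards only the latest tree remains possible.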
Problem 2: log-bypass writes may reorder updates
• Log-bypass writes: update file data blocks in place, skipping log
• Effect: data writes and metadata updates can be reordered on crash
[Figure: a write followed by a rename of f, then a crash.]
Log-bypass writes
• At minimum, writes to latest tree
• Affects the same file in earlier trees
[Figure: write(f,…) applied to a tree sequence; f changes in the latest tree and in the earlier trees.]
Specify that other files are unaffected
• Puts an obligation on the implementation to avoid block re-use within a tree sequence
[Figure: write(f,…) applied to a tree sequence whose files are annotated with block numbers b21 and b51.]
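A log-bypass write, as specified on these slides, updates the written file in every tree of the sequence and leaves other files alone. The sketch below models that rule under two assumptions that are not taken from the slides: a file is identified across trees by a hypothetical inode number `inum`, and a tree in which the file is too short to contain the written offset is left unchanged.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class File:
    inum: int                  # hypothetical file identity (assumption)
    blocks: Tuple[bytes, ...]  # file contents, one entry per block


Tree = Dict[str, File]


def bypass_write(trees: List[Tree], inum: int, off: int, value: bytes) -> List[Tree]:
    """Apply a data write to the file `inum` in every tree of the sequence."""
    out: List[Tree] = []
    for tree in trees:
        new_tree: Tree = {}
        for name, f in tree.items():
            # Only the written file changes; every other file is copied as-is.
            if f.inum == inum and off < len(f.blocks):
                blocks = f.blocks[:off] + (value,) + f.blocks[off + 1:]
                f = replace(f, blocks=blocks)
            new_tree[name] = f
        out.append(new_tree)
    return out


f0 = File(inum=1, blocks=(b"a", b"b", b"c"))
g0 = File(inum=2, blocks=(b"x",))
trees = [
    {"f": f0, "g": g0},                      # initial tree
    {"f": f0},                               # after unlink(g)
    {"f": File(inum=1, blocks=(b"a", b"b"))},  # after truncate(f,2)
]
written = bypass_write(trees, inum=1, off=0, value=b"Z")
print([t["f"].blocks[0] for t in written])  # [b'Z', b'Z', b'Z']
print(written[0]["g"].blocks)               # (b'x',)
```

In this model g is untouched only because the write is keyed by file identity. On a real disk the write targets a block, which is why the specification obliges the implementation not to re-use a block for a different file within one tree sequence.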
Problem 3: data writes are cached
• Write-back buffer cache
• Data can be persisted in any order
[Figure: a write to f followed by a crash, with several possible post-crash versions of f.]
Specifying data caching: block sets
• An uncached block has two possible values: old and new
[Figure: a tree sequence in which one block of f is drawn as a block set.]
Behavior of block sets on crash
• Two degrees of non-determinism in crash states
• Specification allows metadata and data updates to be reordered
[Figure: a crash on a tree sequence with block sets, with several possible post-crash trees.]
Specification for fdatasync
• fdatasync specification says block sets collapse in every tree
[Figure: fdatasync(f) applied to a tree sequence with block sets.]
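Block sets can be sketched for a single file as follows: each block keeps every value that might be on disk, reads see the newest value, a crash may pick any value for each block independently, and fdatasync collapses every set to its newest value. The list-of-lists representation and the function names are assumptions for illustration. The second degree of non-determinism, which tree of the sequence survives the crash, is left out here.

```python
from itertools import product
from typing import List, Set, Tuple

# Hypothetical representation: one block set per file block, holding the
# values that may be on disk, oldest first; the last entry is the newest.
BlockSets = List[List[bytes]]


def write(blocksets: BlockSets, off: int, value: bytes) -> BlockSets:
    """A cached write adds a new possible value to one block set."""
    new = [list(bs) for bs in blocksets]
    new[off].append(value)
    return new


def read(blocksets: BlockSets) -> List[bytes]:
    """Reads always observe the newest value of every block."""
    return [bs[-1] for bs in blocksets]


def fdatasync(blocksets: BlockSets) -> BlockSets:
    """fdatasync collapses every block set to its newest value."""
    return [[bs[-1]] for bs in blocksets]


def crash_contents(blocksets: BlockSets) -> Set[Tuple[bytes, ...]]:
    """Each block may independently hold any value in its set after a crash."""
    return set(product(*blocksets))


f = [[b"a"], [b"b"]]
f = write(f, 0, b"A")
f = write(f, 1, b"B")
print(read(f))                          # [b'A', b'B']
print(len(crash_contents(f)))           # 4
print(crash_contents(fdatasync(f)))     # {(b'A', b'B')}
```

The four crash outcomes cover old or new data in each block, which is how the model expresses that cached data can be persisted in any order.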
Summary: DFSCQ’s tree-based specification
• metadata operations add a new tree
• fsync collapses to latest tree
• writes update block sets in every tree
• fdatasync collapses block sets in every tree
Prove implementation meets specification
• Return values match
• Disk continues to relate to abstract state
[Figure: stat(g) returns "length: 2, type: file, …" in both the specification and the implementation; unlink(g) then takes each to its next state.]
DFSCQ Design
[Figure: DFSCQ design diagram. Labels: directory, name cache, inode, k-indirect blocks, dirty blocks, block allocator, free-bit cache, avoid re-use, logging, checksums, deferred commit, log-bypass, API, buffer cache.]
Many single-layer optimizations
• Affect only proof of single layer
[Figure: DFSCQ design diagram. Labels: directory, name cache, inode, k-indirect blocks, dirty blocks, block allocator, free-bit cache, avoid re-use, logging, checksums, deferred commit, log-bypass, API, buffer cache. The block allocator is annotated "cache free blocks".]