
Finding Crash-Consistency Bugs with Bounded Black-Box Testing

Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju, Vijay Chidambaram


  1. Finding Crash-Consistency Bugs with Bounded Black-Box Testing. Jayashree Mohan, Ashlie Martinez, Soujanya Ponnapalli, Pandian Raju, Vijay Chidambaram

  2. Crashes. "This is very important…" "File saved!" "I crashed ☹" "File missing ☹" (Image source: https://www.fotolia.com)

  3. I wish filesystems were crash-consistent!

  4. Rename atomicity bug in btrfs. Memory: empty. Storage: empty.

  5. Rename atomicity bug in btrfs. Workload so far: mkdir(A). Memory: A. Storage: empty.

  6. Rename atomicity bug in btrfs. Workload so far: mkdir(A); touch(A/bar). Memory: A/bar. Storage: empty.

  7. Rename atomicity bug in btrfs. Workload so far: mkdir(A); touch(A/bar); fsync(A/bar). Memory: A/bar. Storage: A/bar.

  8. Rename atomicity bug in btrfs. Workload so far: mkdir(A); touch(A/bar); fsync(A/bar); mkdir(B). Memory: A/bar, B. Storage: A/bar.

  9. Rename atomicity bug in btrfs. Workload so far: mkdir(A); touch(A/bar); fsync(A/bar); mkdir(B); touch(B/bar). Memory: A/bar, B/bar. Storage: A/bar.

  10. Rename atomicity bug in btrfs. Workload so far: mkdir(A); touch(A/bar); fsync(A/bar); mkdir(B); touch(B/bar); rename(B/bar, A/bar). Memory: A/bar, B. Storage: A/bar.

  11. Rename atomicity bug in btrfs. Workload so far: mkdir(A); touch(A/bar); fsync(A/bar); mkdir(B); touch(B/bar); rename(B/bar, A/bar); touch(A/foo). Memory: A/bar, A/foo, B. Storage: A/bar.

  12. Rename atomicity bug in btrfs. Workload: mkdir(A); touch(A/bar); fsync(A/bar); mkdir(B); touch(B/bar); rename(B/bar, A/bar); touch(A/foo); fsync(A/foo). Memory: A/bar, A/foo, B. Storage: A/bar, A/foo.

  13. Rename atomicity bug in btrfs. Workload: mkdir(A); touch(A/bar); fsync(A/bar); mkdir(B); touch(B/bar); rename(B/bar, A/bar); touch(A/foo); fsync(A/foo). CRASH! Expected storage state: A/bar, A/foo.

  14. Rename atomicity bug in btrfs. Workload: mkdir(A); touch(A/bar); fsync(A/bar); mkdir(B); touch(B/bar); rename(B/bar, A/bar); touch(A/foo); fsync(A/foo). CRASH! Expected storage state: A/bar, A/foo. Actual storage state: A/foo. Persisted file A/bar missing.

  15. Rename atomicity bug in btrfs. Workload: mkdir(A); touch(A/bar); fsync(A/bar); mkdir(B); touch(B/bar); rename(B/bar, A/bar); touch(A/foo); fsync(A/foo). CRASH! Expected storage state: A/bar, A/foo. Actual storage state: A/foo. Persisted file A/bar missing. Exists in the kernel since 2014! Found by ACE and CrashMonkey.
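
The workload on slides 4 to 15 can be written down as a short script. The following is a minimal sketch that only issues the same sequence of system calls in a scratch directory; it does not simulate a crash, so it cannot reproduce the bug by itself (that requires crash simulation such as CrashMonkey's on a btrfs volume). The helper names `touch`, `fsync_path`, and `run_rename_workload` are illustrative and do not come from the slides.

```python
import os
import tempfile


def touch(path):
    """Create an empty file if it does not exist yet."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    os.close(fd)


def fsync_path(path):
    """Open an existing file and ask the kernel to persist it."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def run_rename_workload(root):
    """Issue the slide workload under `root`; return the listings of A and B."""
    a = os.path.join(root, "A")
    b = os.path.join(root, "B")
    os.mkdir(a)                                   # mkdir(A)
    touch(os.path.join(a, "bar"))                 # touch(A/bar)
    fsync_path(os.path.join(a, "bar"))            # fsync(A/bar)
    os.mkdir(b)                                   # mkdir(B)
    touch(os.path.join(b, "bar"))                 # touch(B/bar)
    os.rename(os.path.join(b, "bar"),
              os.path.join(a, "bar"))             # rename(B/bar, A/bar)
    touch(os.path.join(a, "foo"))                 # touch(A/foo)
    fsync_path(os.path.join(a, "foo"))            # fsync(A/foo)
    return sorted(os.listdir(a)), sorted(os.listdir(b))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(run_rename_workload(scratch))
```

Without a crash, directory A ends up holding both bar and foo, which matches the expected state on slide 13; the bug is that after a crash at the final fsync, A/bar is missing.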

  16. Testing Crash Consistency Today
     Model checking and verified filesystems:
     • Annotate filesystems
     • Build FS from scratch
     • Hard to do for existing FS
     Filesystem testing:
     • State of the art: xfstests suite
     • Collection of 482 regression tests
     Only 5% of tests in xfstests check for file system crash consistency.

  17. Challenges with systematic testing
     • Infinite workload space (need to systematically generate workloads)
     • Lack of automated infrastructure
     Our work addresses both of these issues to provide a systematic testing framework.

  18. Bounded Black-Box Crash Testing (B³): a new approach to testing file-system crash consistency
     Inputs: bounds (length, operations, args) and the target filesystem.
     Pipeline: Automatic Crash Explorer (ACE) → Workload 1 … Workload n → CrashMonkey.
     Output: bug report with workload, expected state, actual state.
     ➡ Focus on reproducible bugs resulting in metadata corruption, data loss.
     ➡ Found 10 new bugs across btrfs and F2FS.
     ➡ Found 1 bug in FSCQ (verified file system).
     ➡ Filesystem agnostic: works with any POSIX file system.
     www.github.com/utsaslab/crashmonkey

  19. Outline
     • CrashMonkey
     • Bounded Black Box Crash Testing
     • Automatic Crash Explorer (ACE)
     • Demo

  20. Challenges with systematic testing
     • Infinite workload space
     • Lack of automated infrastructure

  21. CrashMonkey
     • Efficient infrastructure to record and replay block-level IO requests
     • Simulate a crash at different points in the workload
     • Automatically test for consistency after a crash
     • Copy-on-write RAM block device

  22. CrashMonkey in Action. Initial FS state → workload (IO due to workload, with a persistence point) → final FS state.

  23. CrashMonkey in Action. Initial FS state → workload (IO due to workload, with a persistence point).

  24. Phase 1: Record IO. Starting from the initial FS state, run the workload and record IO up to the persistence point. Then safely unmount (IO forced by unmount) to obtain the oracle.

  25. Phase 2: Replay IO. Starting from the initial FS state, replay the recorded IO up to the persistence point to obtain the crash state.

  26. Phase 3: Test for consistency. After recovery, the auto checker checks the crash state against the oracle.

  27. Phase 3: Test for consistency. The auto checker checks the crash state against the oracle and produces a bug report.
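
The three phases on slides 24 to 27 can be illustrated with a toy model. This sketch is not CrashMonkey's implementation: it assumes a block device is just a dictionary from block number to contents, skips file-system recovery entirely, and compares raw blocks rather than files. `RecordedIO`, `replay`, `build_states`, and `check` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RecordedIO:
    """Phase 1 output: block writes in issue order plus the persistence point."""
    writes: list = field(default_factory=list)  # (block_number, data) tuples
    persistence_point: int = 0                  # writes issued before fsync returned


def replay(initial, writes):
    """Apply writes on top of a copy of the initial device image."""
    state = dict(initial)
    for block, data in writes:
        state[block] = data
    return state


def build_states(initial, recorded):
    """Return (oracle, crash_state).

    Oracle: every recorded write applied, as after a safe unmount.
    Crash state (phase 2): only writes up to the persistence point applied.
    """
    oracle = replay(initial, recorded.writes)
    crash = replay(initial, recorded.writes[:recorded.persistence_point])
    return oracle, crash


def check(oracle, crash, persisted_blocks):
    """Phase 3: list persisted blocks whose crash-state content differs from the oracle."""
    return [b for b in sorted(persisted_blocks) if crash.get(b) != oracle.get(b)]


if __name__ == "__main__":
    writes = [(0, b"inode"), (1, b"dirent"), (2, b"unmount-time metadata")]
    good = RecordedIO(writes=writes, persistence_point=2)
    buggy = RecordedIO(writes=writes, persistence_point=1)  # block 1 flushed too late
    for name, rec in (("good", good), ("buggy", buggy)):
        oracle, crash = build_states({}, rec)
        print(name, check(oracle, crash, persisted_blocks={0, 1}))
```

In this toy, a non-empty result plays the role of the bug report: data that was explicitly persisted before the crash point is missing from the crash state.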

  28. Challenges with Systematic Testing
     • Infinite workload space
     • Lack of automated infrastructure
     CrashMonkey so far:
     • Given a workload that complies with the POSIX API, we saw how CrashMonkey generates crash states and automatically tests for consistency

  29. Challenges with Systematic Testing
     • Infinite workload space
     • Lack of automated infrastructure
     CrashMonkey so far:
     • Given a workload that complies with the POSIX API, we saw how CrashMonkey generates crash states and automatically tests for consistency
     • Next question: how to automatically generate workloads in the infinite workload space?

  30. Exploring the infinite workload space. Challenges:
     • Infinite length of workloads
     • Large set of filesystem operations
     • Infinite parameter options (file/directory names, depth)
     • Infinite options for initial filesystem state
     • When in the workload to simulate a crash?

  31. Outline
     • CrashMonkey
     • Bounded Black Box Crash Testing
     • Automatic Crash Explorer (ACE)
     • Demo

  32. B³: Bounded Black Box Crash Testing. Bounds: length of workloads; arguments to system calls; initial FS state.

  33. B³: Bounded Black Box Crash Testing. Bounds: length of workloads; arguments to system calls; initial FS state.

  34. B³: Bounded Black Box Crash Testing. Bounds: length of workloads; arguments to system calls; initial FS state. (Image source: https://en.wikipedia.org/wiki/Cube)

  35. B³: Bounded Black Box Crash Testing. Bounds: length of workloads; arguments to system calls; initial FS state.

  36. B³: Bounded Black Box Crash Testing. Bounds: length of workloads; arguments to system calls; initial FS state.

  37. B³: Bounded Black Box Crash Testing. Bounds: length of workloads; arguments to system calls; initial FS state.

  38. B³: Bounded Black Box Crash Testing. Choice of crash point:
     • Only after fsync(), fdatasync() or sync()
     • Not in the middle of a system call
     • Developers are motivated to patch bugs that break the semantics of persistence operations
     • Crashing in the middle of system calls leads to an exponentially large number of crash states
     Example workload: mkdir(A); touch(A/bar); fsync(A/bar) [Crash Point 1]; mkdir(B); touch(B/bar); rename(B/bar, A/bar); touch(A/foo); fsync(A/foo) [Crash Point 2].

  39. Limitations of B³
     • No guarantee of finding all crash-consistency bugs in a filesystem
     • Assumes the crash-consistency mechanism, such as journaling or CoW, works correctly
     • Does not crash in the middle of system calls
     • Can only reveal that a bug has occurred, not the reason for or origin of the bug
     • Needs more compute to test longer sequences
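
The crash-point rule on slide 38 is easy to state in code. A minimal sketch, assuming a workload is represented as a list of `(operation, *arguments)` tuples (a representation chosen here for illustration, not one taken from the slides): a crash is simulated only right after a persistence operation completes.

```python
PERSISTENCE_OPS = {"fsync", "fdatasync", "sync"}


def crash_points(workload):
    """Return the indices of operations after which a crash is simulated."""
    return [i for i, (op, *_args) in enumerate(workload) if op in PERSISTENCE_OPS]


# The example workload from the slide.
WORKLOAD = [
    ("mkdir", "A"),
    ("touch", "A/bar"),
    ("fsync", "A/bar"),            # Crash Point 1
    ("mkdir", "B"),
    ("touch", "B/bar"),
    ("rename", "B/bar", "A/bar"),
    ("touch", "A/foo"),
    ("fsync", "A/foo"),            # Crash Point 2
]

if __name__ == "__main__":
    print(crash_points(WORKLOAD))  # [2, 7]
```

A workload with n persistence operations therefore yields n crash points, rather than one per block write inside every system call.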

  40. Outline
     • CrashMonkey
     • Bounded Black Box Crash Testing
     • Automatic Crash Explorer (ACE)
     • Demo

  41. Bounds chosen by ACE. Bounds: length of workloads; arguments to system calls; initial FS state. Bounds picked based on insights from the study of crash-consistency bugs reported on Linux file systems over the last 5 years.

  42. Bounds chosen by ACE
     • Length of workloads: maximum # core ops is 3
     • Arguments to system calls
     • Initial FS state

  43. Bounds chosen by ACE
     • Length of workloads: maximum # core ops is 3
     • Arguments to system calls: Root → A (foo, bar), B (foo, bar); overwrites to start, middle, end of a file and append
     • Initial FS state

  44. Bounds chosen by ACE
     • Length of workloads: maximum # core ops is 3
     • Arguments to system calls: Root → A (foo, bar), B (foo, bar); overwrites to start, middle, end of a file and append
     • Initial FS state: new, 100MB FS
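
To make the bounds on slides 42 to 44 concrete, here is a small sketch that enumerates the bounded argument space. It assumes the diagram "A (foo, bar) Root B (foo, bar)" means two directories A and B under the root, each holding files foo and bar; the constant and function names are hypothetical and not ACE's own.

```python
from itertools import product

MAX_CORE_OPS = 3                      # bound on workload length
FS_SIZE_MB = 100                      # initial state: a new 100MB file system
DIRECTORIES = ("A", "B")              # assumed layout under the root
FILE_NAMES = ("foo", "bar")
WRITE_KINDS = ("overwrite_start", "overwrite_middle", "overwrite_end", "append")


def file_arguments():
    """All file paths a system call may name under the assumed layout."""
    return [f"{d}/{name}" for d, name in product(DIRECTORIES, FILE_NAMES)]


def write_arguments():
    """All (file, write kind) combinations a write may use."""
    return list(product(file_arguments(), WRITE_KINDS))


if __name__ == "__main__":
    print(file_arguments())
    print(len(write_arguments()))
```

Under these assumptions there are only four candidate files and sixteen distinct write arguments, which is what keeps exhaustive enumeration tractable.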

  45. Phases of ACE. Operation set: creat(), link(), rename(), write(). Generating skeletons of sequence-2: 4*4 = 16
     • creat(), creat() • creat(), link() • creat(), rename() • creat(), write()
     • link(), creat() • link(), link() • link(), rename() • link(), write()
     • rename(), creat() • rename(), link() • rename(), rename() • rename(), write()
     • write(), creat() • write(), link() • write(), rename() • write(), write()
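
The skeleton count on slide 45 is a Cartesian product of the operation set with itself. A minimal sketch of that first phase, with `skeletons` as an illustrative function name rather than ACE's actual interface:

```python
from itertools import product

OPERATION_SET = ("creat", "link", "rename", "write")


def skeletons(length, operations=OPERATION_SET):
    """Every ordered sequence of `length` operations, repetitions allowed."""
    return list(product(operations, repeat=length))


if __name__ == "__main__":
    seq2 = skeletons(2)
    print(len(seq2))   # 16
    print(seq2[:4])
```

With the same four operations, sequences of length 3 give 4*4*4 = 64 skeletons, which shows how quickly the space grows with the length bound.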
