
Verifying concurrent, crash-safe systems with Perennial (Tej Chajed)


  1. Verifying concurrent, crash-safe systems with Perennial. Tej Chajed, Joseph Tassarotti*, Frans Kaashoek, Nickolai Zeldovich. MIT and *Boston College.

  2. Many systems need concurrency and crash safety. Examples: file systems, databases, and key-value stores.
          • Make strong guarantees about keeping your data safe
          • Achieve high performance with concurrency

  3. Simple example: replicated disk. [Diagram labels: replicated disk library, disk 1, disk 2]

  4. Simple example: replicated disk. [Diagram labels: read/write, replicated disk library, disk 1, disk 2]

  6. Replicated disk is subtle

          func write(a: addr, v: block) {
            lock_address(a)
            d1.write(a, v)
            d2.write(a, v)
            unlock_address(a)
          }

  7. Replicated disk is subtle

          func write(a: addr, v: block) {
            lock_address(a)
            d1.write(a, v)
            d2.write(a, v)
            unlock_address(a)
          }

          Callouts on the code: "what if system crashes here?" and "what if disk 1 fails?"

  8. Replicated disk is subtle

          func write(a: addr, v: block) {
            lock_address(a)
            d1.write(a, v)
            d2.write(a, v)
            unlock_address(a)
          }

          // runs on reboot
          func recover() {
            for a in … {
              // copy from d1 to d2
            }
          }

          Callouts on the code: "what if system crashes here?" and "what if disk 1 fails?"

  9. Replicated disk is subtle

          func write(a: addr, v: block) {
            lock_address(a)
            d1.write(a, v)
            d2.write(a, v)
            unlock_address(a)
          }

          func read(a: addr): block {
            lock_address(a)
            v, ok := d1.read(a)
            if !ok {
              v, _ = d2.read(a)
            }
            unlock_address(a)
            return v
          }

          // runs on reboot
          func recover() {
            for a in … {
              // copy from d1 to d2
            }
          }

          Callouts on the code: "what if system crashes here?" and "what if disk 1 fails?"
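The write, read, and recover pseudocode on the slides above can be turned into a small runnable Go sketch. This is not the verified code from the talk: `Disk`, `ReplicatedDisk`, and the in-memory failure flag are hypothetical stand-ins, and the behavior of `Recover` when disk 1 has failed is an assumption, since the slides elide that branch.

```go
package main

import (
	"fmt"
	"sync"
)

// Block is the contents of one disk address; a string keeps the sketch short.
type Block string

// Disk is a hypothetical in-memory disk that can fail permanently.
type Disk struct {
	blocks []Block
	failed bool
}

// Read reports ok=false once the disk has failed.
func (d *Disk) Read(a int) (Block, bool) {
	if d.failed {
		return "", false
	}
	return d.blocks[a], true
}

// Write is dropped on a failed disk.
func (d *Disk) Write(a int, v Block) {
	if !d.failed {
		d.blocks[a] = v
	}
}

// ReplicatedDisk mirrors each address on two disks, with one lock per address.
type ReplicatedDisk struct {
	d1, d2 *Disk
	locks  []sync.Mutex
}

func NewReplicatedDisk(n int) *ReplicatedDisk {
	return &ReplicatedDisk{
		d1:    &Disk{blocks: make([]Block, n)},
		d2:    &Disk{blocks: make([]Block, n)},
		locks: make([]sync.Mutex, n),
	}
}

// Write updates both disks while holding the lock for address a.
func (rd *ReplicatedDisk) Write(a int, v Block) {
	rd.locks[a].Lock()
	defer rd.locks[a].Unlock()
	rd.d1.Write(a, v)
	rd.d2.Write(a, v)
}

// Read prefers disk 1 and falls back to disk 2 if disk 1 has failed.
func (rd *ReplicatedDisk) Read(a int) Block {
	rd.locks[a].Lock()
	defer rd.locks[a].Unlock()
	v, ok := rd.d1.Read(a)
	if !ok {
		v, _ = rd.d2.Read(a)
	}
	return v
}

// Recover runs on reboot, before any other operation, and copies d1 to d2.
func (rd *ReplicatedDisk) Recover() {
	for a := range rd.locks {
		v, ok := rd.d1.Read(a)
		if !ok {
			return // assumption: with disk 1 gone, disk 2 is the only copy
		}
		rd.d2.Write(a, v)
	}
}

func main() {
	rd := NewReplicatedDisk(4)
	var wg sync.WaitGroup
	for a := 0; a < 4; a++ {
		wg.Add(1)
		go func(a int) {
			defer wg.Done()
			rd.Write(a, Block(fmt.Sprintf("block-%d", a)))
		}(a)
	}
	wg.Wait()
	rd.d1.failed = true // simulate disk 1 failing
	fmt.Println(rd.Read(2))
}
```

Running the program writes four addresses concurrently, fails disk 1, and still reads the block back from disk 2. A crash between the two disk writes can be imitated by calling `rd.d1.Write` directly and then `rd.Recover()`.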

  10. Goal: systematically reason about all executions with formal verification

  11. Existing verification frameworks do not support concurrency and crash safety
          verified crash safety: FSCQ [SOSP ’15], Yggdrasil [OSDI ’16], DFSCQ [SOSP ’17], …
          verified concurrency: CertiKOS [OSDI ’16], CSPEC [OSDI ’18], AtomFS [SOSP ’19], …
          no system can do both

  12. Combining verified crash safety and concurrency is challenging
          • Crash and recovery can interrupt a critical section ➡ leases
          • Crash wipes in-memory state ➡ memory versioning
          • Recovery logically completes crashed threads’ operations ➡ recovery helping

  13. Perennial’s techniques address challenges integrating crash safety into concurrency reasoning
          • Crash and recovery can interrupt a critical section ➡ leases
          • Crash wipes in-memory state ➡ memory versioning
          • Recovery logically completes crashed threads’ operations ➡ recovery helping

  14. Perennial’s techniques address challenges integrating crash safety into concurrency reasoning
          • Crash and recovery can interrupt a critical section ➡ leases (see paper)
          • Crash wipes in-memory state ➡ memory versioning (see paper)
          • Recovery logically completes crashed threads’ operations ➡ recovery helping (this talk)

  15. Contributions
          • Perennial: framework for reasoning about crashes and concurrency
          • Goose: reasoning about Go implementations (see paper)
          • Evaluation: verified mail server written in Go with Perennial

  16. Specifying correctness: concurrent recovery refinement
          • All operations are correct and atomic with respect to concurrency and crashes
          • Recovery repairs system after reboot

  17. Proving the replicated disk correct

  18. Background. Proving refinement with forward simulation: relate code and spec states. [Diagram labels: spec state σ; code state d1, d2]

  19. Background. Proving refinement with forward simulation: prove every operation has a commit point
          1. Write down abstraction relation between code and spec states
          [Diagram: spec state S1 with tid: write(a, v); code states C1 → C2 → C3 → C4 → C5 via lock, d1.write, d2.write, unlock]

  20. Background. Proving refinement with forward simulation: prove every operation has a commit point
          1. Write down abstraction relation between code and spec states
          2. Prove every operation commits
          [Diagram: spec states S1 → S2 via tid: write(a, v); code states C1 → C2 → C3 → C4 → C5 via lock, d1.write, d2.write, unlock]

  21. Background. Proving refinement with forward simulation: prove every operation has a commit point
          1. Write down abstraction relation between code and spec states
          2. Prove every operation commits
          3. Prove abstraction relation is preserved
          [Diagram: spec states S1 → S2 via tid: write(a, v); code states C1 → C2 → C3 → C4 → C5 via lock, d1.write, d2.write, unlock]

  22. Abstraction relation for the replicated disk
          abstraction relation: !locked(a) ⟹ σ[a] = d1[a] ∧ σ[a] = d2[a] (if the disk has not failed)
          [Diagram labels: σ, d1, d2]
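The abstraction relation on the slide above can be read as an executable predicate. The sketch below is only an illustration: the `State` struct and its field names are hypothetical, and the relation is checked on a concrete snapshot rather than proved for all executions as in the Coq development.

```go
package main

import "fmt"

// Block is the contents of one disk address.
type Block string

// State is a hypothetical snapshot: the spec disk σ next to the code state.
type State struct {
	Sigma              []Block // spec: a single logical disk
	D1, D2             []Block // code: the two physical disks
	D1Failed, D2Failed bool
	Locked             []bool // per-address lock state
}

// AbsR checks: !locked(a) ⟹ σ[a] = d1[a] ∧ σ[a] = d2[a],
// where each equality is required only if that disk has not failed.
func AbsR(s State) bool {
	for a := range s.Sigma {
		if s.Locked[a] {
			continue // no constraint while a writer holds the lock
		}
		if !s.D1Failed && s.Sigma[a] != s.D1[a] {
			return false
		}
		if !s.D2Failed && s.Sigma[a] != s.D2[a] {
			return false
		}
	}
	return true
}

func main() {
	// Mid-write: address 0 is locked and only disk 1 has the new value.
	s := State{
		Sigma:  []Block{"old"},
		D1:     []Block{"new"},
		D2:     []Block{"old"},
		Locked: []bool{true},
	}
	fmt.Println("mid-write, lock held:", AbsR(s))

	// A crash wipes in-memory state, so the lock reverts to being free.
	s.Locked[0] = false
	fmt.Println("after crash, lock free:", AbsR(s))
}
```

The relation holds while the lock is held, because a locked address is unconstrained, and stops holding once a crash frees the lock with the disks out of sync.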

  23. Crashing breaks the abstraction relation

          func write(a: addr, v: block) {
            lock_address(a)
            d1.write(a, v)
            (the slide shows the function only up to this point)

          abstraction relation: !locked(a) ⟹ σ[a] = d1[a] ∧ σ[a] = d2[a]
          Callout: lock reverts to being free, but disks are not in-sync

  24. So far: abstraction relation always holds. [Diagram labels: spec, code, abstraction relation R at each step, crash, then "?"]

  25. Separate a crash invariant from the abstraction relation. [Diagram labels: spec, code, abstraction relation R, crash invariant C, crash]

  26. Recovery proof uses the crash invariant to restore the abstraction relation. [Diagram labels: spec, code, abstraction relation R, crash invariant C, crash, recover()]

  27. Proving recovery correct: makes writes atomic

          func write(a: addr, v: block) {
            lock_address(a)
            d1.write(a, v)
            (the slide shows the function only up to this point)

          func recover() {
            for a in … {
              v, ok := d1.read(a)
              if !ok { … }
              d2.write(a, v)
            }
          }

  28. User sees an atomic write due to recovery. [Diagram labels: user’s view (spec): pending spec operation tid: write(a, v), crash; code execution]

  29. User sees an atomic write due to recovery. [Diagram labels: user’s view (spec): pending spec operation tid: write(a, v), crash; code execution: tid: w1(a,v), crash]

  30. User sees an atomic write due to recovery. [Diagram labels: user’s view (spec): pending spec operation tid: write(a, v), crash; code execution: tid: w1(a,v), crash, recover(): r1(a), w2(a,v), return]

  31. User sees an atomic write due to recovery. [Diagram labels as in slide 30, plus: recovery helping, and a second tid: write(a, v) in the user’s view]

  32. Recovery helping: recovery can commit writes from before the crash

          func write(a: addr, v: block) {      ⟵ tid: write(a, v)
            lock_address(a)
            d1.write(a, v)
            (the slide shows the function only up to this point)

          func recover() {
            for a in … {
              v, ok := d1.read(a)
              if !ok { … }
              d2.write(a, v)                   ⟵ tid: write(a, v)
            }
          }

  33. Crash invariant says “if disks disagree, some thread was writing the value on the first disk”

          func write(a: addr, v: block) {      ⟵ tid: write(a, v)
            lock_address(a)
            d1.write(a, v)
            (the slide shows the function only up to this point)

          func recover() {
            for a in … {
              v, ok := d1.read(a)
              if !ok { … }
              d2.write(a, v)                   ⟵ tid: write(a, v)
            }
          }

          crash invariant: d1[a] ≠ d2[a] ⟹ ∃ tid. tid: write(a, d1[a])

  35. Key idea: crash invariant can refer to interrupted spec operations

          func write(a: addr, v: block) {      ⟵ tid: write(a, v)
            lock_address(a)
            d1.write(a, v)
            (the slide shows the function only up to this point)

          func recover() {
            for a in … {
              v, ok := d1.read(a)
              if !ok { … }
              d2.write(a, v)                   ⟵ tid: write(a, v)
            }
          }

          crash invariant: d1[a] ≠ d2[a] ⟹ ∃ tid. tid: write(a, d1[a])

  36. Recovery proof shows code restores the abstraction relation by completing all interrupted writes

          func write(a: addr, v: block) {      ⟵ tid: write(a, v)
            lock_address(a)
            d1.write(a, v)
            (crash)

          func recover() {
            for a in … {
              v, ok := d1.read(a)
              if !ok { … }
              d2.write(a, v)                   ⟵ tid: write(a, v)
            }
          }

          abstraction relation: !locked(a) ⟹ σ[a] = d1[a] ∧ σ[a] = d2[a]
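The crash invariant and recovery helping can be exercised on a tiny model that crashes a single write at each possible point. This is a sketch under stated assumptions, not Perennial's proof: `World`, `PendingWrite`, and `CrashedWrite` are hypothetical names, disk failure is ignored, and the model assumes an uninterrupted write commits in the spec at `d2.write`.

```go
package main

import "fmt"

// Block is the contents of one disk address.
type Block string

// PendingWrite is a spec operation write(Addr, Val) that thread Tid started
// but that has not committed yet.
type PendingWrite struct {
	Tid, Addr int
	Val       Block
}

// World is a hypothetical model: the spec disk σ, both physical disks, and
// ghost state listing pending spec operations.
type World struct {
	Sigma, D1, D2 []Block
	Pending       []PendingWrite
}

// hasPending reports whether some thread has a pending write(a, v).
func (w *World) hasPending(a int, v Block) bool {
	for _, p := range w.Pending {
		if p.Addr == a && p.Val == v {
			return true
		}
	}
	return false
}

// CrashInv checks: d1[a] ≠ d2[a] ⟹ ∃ tid. pending tid: write(a, d1[a]).
func (w *World) CrashInv() bool {
	for a := range w.D1 {
		if w.D1[a] != w.D2[a] && !w.hasPending(a, w.D1[a]) {
			return false
		}
	}
	return true
}

// AbsR checks the abstraction relation with every lock free, as after reboot.
func (w *World) AbsR() bool {
	for a := range w.Sigma {
		if w.Sigma[a] != w.D1[a] || w.Sigma[a] != w.D2[a] {
			return false
		}
	}
	return true
}

// Recover copies d1 to d2. Where the disks disagreed, it also commits the
// interrupted write in the spec on behalf of the crashed thread.
func (w *World) Recover() {
	for a := range w.D1 {
		if w.D1[a] != w.D2[a] && w.hasPending(a, w.D1[a]) {
			w.Sigma[a] = w.D1[a] // recovery helping
		}
		w.D2[a] = w.D1[a]
	}
	w.Pending = nil // threads from before the crash are gone
}

// CrashedWrite models thread 1 running write(0, "new") and crashing after
// the given number of disk writes (0, 1, or 2).
func CrashedWrite(steps int) *World {
	w := &World{
		Sigma:   []Block{"old"},
		D1:      []Block{"old"},
		D2:      []Block{"old"},
		Pending: []PendingWrite{{Tid: 1, Addr: 0, Val: "new"}},
	}
	if steps >= 1 {
		w.D1[0] = "new"
	}
	if steps >= 2 {
		w.D2[0] = "new"
		w.Sigma[0] = "new" // assumed commit point of an uninterrupted write
		w.Pending = nil
	}
	return w
}

func main() {
	for steps := 0; steps <= 2; steps++ {
		w := CrashedWrite(steps)
		c, before := w.CrashInv(), w.AbsR()
		w.Recover()
		fmt.Printf("crash after %d disk writes: C=%v, R before=%v, R after=%v, σ[0]=%q\n",
			steps, c, before, w.AbsR(), w.Sigma[0])
	}
}
```

In the middle case the abstraction relation is broken at the crash while the crash invariant still holds, and `Recover` restores the relation by completing the write, so the spec sees either the old value or the new one and never a mix.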

  37. Proving concurrent recovery refinement
          • Recovery proof uses crash invariant to restore abstraction relation
          • Proof can refer to interrupted operations, enabling recovery helping reasoning
          • Users get correct behavior and atomicity

  38. Implementation. [Diagram labels: Perennial (9k lines of Coq): leases, memory versioning, recovery helping; developer-written; Iris concurrency framework; Coq. Legend: this paper, prior work]

  39. Implementation. [Diagram labels as in slide 38, plus: Go source, go build, exe]

  40. Implementation. [Diagram labels as in slide 39, plus: Goose translator (2k lines of Go), Proof, see paper]

  41. Implementation. [Diagram labels as in slide 40, plus: machine checked by Coq]

  42. Evaluation
          This talk:
          • proof-effort comparison
          See paper:
          • verified examples
          • TCB
          • bug discussion

  43. Methodology: verify the same mail server as previous work, CSPEC [OSDI ’18]
          • Users can read, deliver, and delete mail
          • Implemented on top of a file system
          • Operations are atomic (and crash safe in Perennial)
