We Crashed, Now What?

  1. We Crashed, Now What? Cristiano Giuffrida, Lorenzo Cavallaro, Andrew S. Tanenbaum. Vrije Universiteit Amsterdam. 6th USENIX Workshop on Hot Topics in System Dependability, October 3, 2010, Vancouver, BC, Canada.

  4. OS Dependability Threats

  9. Are Core Components Safe? “We’re getting bloated and huge. Yes, it’s a problem. [...] I’d like to say we have a plan.” (Linus Torvalds on the Linux kernel, 2009)

  13. High-coverage Crash Recovery. Rapid evolution and huge size cause more bugs. A crash recovery solution with a smaller TCB is needed: whole-OS crash recovery. How? (1) Extend existing work on isolated subsystems to the entire OS. (2) Design a new high-coverage crash recovery infrastructure.

  15. Isolated Subsystems to Entire OS? Existing work covers extensions and drivers (e.g., SafeDrive, Nooks, MINIX 3) and filesystems (e.g., Membrane). It assumes isolated untrusted parties with well-defined interfaces. Scaling to the entire OS takes several recoverer-recoveree pairs... it is like a dog chasing its tail! The result is a complex and hard-to-maintain recovery infrastructure, with high exposure of the recovery code to the programmer.

  18. Emerging High-coverage Solutions. Shadow kernel (e.g., Otherworld) vs. pure instrumentation (e.g., Recovery Domains). Shadow kernel: best-effort (weak failure model). Pure instrumentation: heavyweight (high complexity, poor performance, poor scalability).

  24. WWW: What We Want. High coverage. Low complexity. Reasonable performance and scalability. Good maintainability. Address the many challenges of the crash recovery problem.

  27. The Crash Recovery Problem (I). Crash detection: detect crashes proactively or reactively; isolate crashes so they do not disrupt the recovery process. State transfer: create a new execution context to restart execution; transfer the state from the old execution context. State consistency: restore a stable and consistent state in the new context; allow for deterministic execution upon restart.

  30. The Crash Recovery Problem (II). State dependency tracking: preserve state dependencies among different contexts; allow for a globally coherent state upon restart. State corruption: detect arbitrary data corruption; attempt to recover from arbitrary data corruption. Restart: determine a safe execution point to resume operation; attempt to avoid further crashes.
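The steps listed on the two "Crash Recovery Problem" slides can be pictured with a toy example. The sketch below is not the authors' mechanism; it is a minimal illustration that assumes a crash surfaces as an exception (reactive detection), that the state is a plain dictionary checkpointed at the top of the request loop (the assumed safe restart point), and that a CRC32 stands in for corruption detection. The names `Component`, `checksum`, and `run_with_recovery` are hypothetical.

```python
import copy
import zlib


def checksum(state: dict) -> int:
    """CRC32 over a canonical rendering of the state (illustrative only)."""
    return zlib.crc32(repr(sorted(state.items())).encode())


class Component:
    """Hypothetical component whose whole state is one dictionary."""

    def __init__(self, state=None):
        self.state = dict(state or {})

    def handle(self, request: str) -> None:
        if request == "bug":
            raise RuntimeError("simulated crash")
        self.state[request] = self.state.get(request, 0) + 1


def run_with_recovery(requests):
    comp = Component()
    recoveries = 0
    for req in requests:
        # Checkpoint at an assumed safe point: the top of the loop.
        saved = copy.deepcopy(comp.state)
        saved_sum = checksum(saved)
        try:
            comp.handle(req)
        except RuntimeError:
            # Reactive crash detection: the failure surfaced as an exception.
            if checksum(saved) != saved_sum:
                raise  # the checkpoint itself looks corrupt; give up
            # State transfer into a fresh execution context, then resume
            # with the next request (the failing request is dropped).
            comp = Component(saved)
            recoveries += 1
    return comp.state, recoveries
```

Calling `run_with_recovery(["a", "bug", "a", "b"])` restarts the component once and carries the counter for `"a"` across the restart. A real OS would also have to track state shared with other components, which this sketch ignores.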

  33. Our Approach. Combine OS design and lightweight instrumentation. OS design: reduce complexity at recovery time; good performance and scalability. Lightweight compiler-based instrumentation: high coverage and component-agnostic recovery; good maintainability and evolvability.

  34. OS Architecture. [Figure: layered diagram. R3: App ... App; VFS, SCH, NET, VM, PM, ...; PRN, HDD, NDD, SND, RS. R0: Microkernel.] We break down the OS into several userspace components: a multiserver microkernel architecture based on message-passing.
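As a rough illustration of the message-passing structure on this slide, the sketch below models a kernel that only routes messages between named userspace components. It is an assumption-laden toy, not the system's actual IPC interface: the `Microkernel` class, its `register`/`send`/`receive` methods, and the tuple message format are all hypothetical. The component names `VFS` and `HDD` are taken from the figure, and the request flow between them is assumed for the example.

```python
from collections import deque


class Microkernel:
    """Toy kernel whose only job here is to route messages (hypothetical API)."""

    def __init__(self):
        self.mailboxes = {}

    def register(self, name: str) -> None:
        self.mailboxes[name] = deque()

    def send(self, dst: str, payload) -> None:
        if dst not in self.mailboxes:
            raise KeyError(f"unknown component: {dst}")
        self.mailboxes[dst].append(payload)

    def receive(self, name: str):
        box = self.mailboxes[name]
        return box.popleft() if box else None


def demo():
    kernel = Microkernel()
    for name in ("APP", "VFS", "HDD"):
        kernel.register(name)
    # An application sends a request to VFS; VFS forwards it to HDD.
    kernel.send("VFS", ("read", "APP"))
    op, _sender = kernel.receive("VFS")
    kernel.send("HDD", (op, "VFS"))
    return kernel.receive("HDD")
```

Because components interact only through `send` and `receive`, a crashed component can in principle be replaced without the others holding direct references into its memory.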

  39. The Programming Model (OS component). We rely on an event-driven model. Events trigger execution of the task loop. Idempotent messages are possible within the task loop.
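One way to picture the event-driven model is a component that drains an inbox inside a task loop. The sketch below is illustrative only and reads "idempotent" as meaning that a message can be processed more than once without changing the outcome. The `Message` format, the `Component` class, and `replay_is_harmless` are hypothetical names, not part of the presented system.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    """A request delivered to a component (hypothetical format)."""
    msg_id: int
    key: str
    value: int


@dataclass
class Component:
    """Toy OS component driven by a task loop."""
    inbox: deque = field(default_factory=deque)
    state: dict = field(default_factory=dict)

    def handle(self, msg: Message) -> None:
        # Idempotent handler: assigning a value to a key yields the same
        # state however many times the same message is processed.
        self.state[msg.key] = msg.value

    def task_loop(self) -> int:
        """Process every pending event; return how many were handled."""
        handled = 0
        while self.inbox:
            self.handle(self.inbox.popleft())
            handled += 1
        return handled


def replay_is_harmless() -> bool:
    comp = Component()
    msg = Message(msg_id=1, key="prio", value=5)
    comp.inbox.append(msg)
    comp.task_loop()
    before = dict(comp.state)
    # Simulate a restart that re-delivers the same message.
    comp.inbox.append(msg)
    comp.task_loop()
    return comp.state == before
```

If the component is restarted at the top of its task loop, replaying the in-flight message leaves the state unchanged, which is what makes that point a plausible place to resume.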
