


  1. Improving the Reliability of Commodity Operating Systems Mike Swift, Brian Bershad, Hank Levy University of Washington Slides courtesy of Michael Swift University of Wisconsin-Madison

  2. Outline • Introduction • Vision • Design • Evaluation • Summary

  3. The Problem • Operating system crashes are a huge problem today – 5% of Windows systems crash every day • Device drivers are the biggest cause of crashes – Drivers cause 85% of Windows XP crashes – Drivers are 7 times buggier than the kernel in Linux • We built Nooks, a system that prevents drivers from crashing the OS – We can prevent 99% of faults in our tests that crash native Linux

  4-6. Crashes Today [Diagram, repeated on three slides: User Program, User Program, Driver, Kernel]

  7. Outline • Introduction • Vision • Design • Evaluation • Summary

  8-9. Vision [Diagram, repeated on two slides: User Program, User Program, Driver, Kernel]

  10. Reality • Windows XP – 113 million copies sold in 2002 – 40 million lines of code – $1 billion development cost – 35,000 drivers available • Linux: – 18 million users – 30 million lines of code – Equivalent $1 billion development cost

  11. Vision Requirements 1. Isolation 2. Recovery 3. Compatibility • No code changes • No new languages • No new OS • No new hardware • No new perspective

  12. Outline • Introduction • Vision • Design • Evaluation • Summary

  13. Assumptions and Principles • Assumptions: – Drivers are generally well behaved – Don’t need to prevent every crash to be useful • Principles: – Design for fault resistance (not fault tolerance) – Design for mistakes (not abuse)

  14. Goal We want a practical, “best-effort” solution • Prevents many crashes • Good performance • Works with today’s operating systems and drivers

  15. Design of Nooks • Standard Linux kernel and drivers • Plus: – Isolation – Recovery • Compatible with existing code

  16. Existing Kernels [Diagram: User Program, User Program, Driver, Kernel]

  17. Isolation - Memory [Diagram: User Program, User Program, Driver, Kernel, Stack, Heap; caption: Lightweight Kernel Protection Domains]

  18. Isolation - Control Transfer [Diagram: User Program, User Program, Driver, Kernel]

  19. Isolation - Control Transfer [Diagram: User Program, User Program, Driver, Kernel, XPC; caption: eXtension Procedure Call (XPC)]

  20. Isolation - Data Access [Diagram: User Program, User Program, Driver, Kernel]

  21. Isolation - Data Access [Diagram: User Program, User Program, Driver, Kernel; caption: Copy-in / Copy-out]

  22. Isolation - Interposition [Diagram: User Program, User Program, Driver, Kernel]

  23. Isolation - Interposition [Diagram: User Program, User Program, Driver, Kernel, XPC, XPC; caption: Wrappers]

  24. Design Summary • Isolation – Lightweight Kernel Protection Domains – eXtension Procedure Call (XPC) – Copy-in/Copy-out – Wrappers
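
The four isolation mechanisms on this slide can be pictured together in a few lines. What follows is a toy Python model, not Nooks code: `ProtectionDomain`, `xpc_call`, and `kernel_wrapper` are hypothetical names, and a deep copy stands in for copy-in/copy-out between a kernel object and the driver's private copy.

```python
import copy

class ProtectionDomain:
    """Toy stand-in for a lightweight kernel protection domain (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.xpc_count = 0  # number of domain crossings, kept for accounting

def xpc_call(domain, func, *args):
    """Hypothetical eXtension Procedure Call: transfer control into the domain."""
    domain.xpc_count += 1
    return func(*args)

def kernel_wrapper(domain, driver_func, kernel_obj, writable_fields):
    """Wrapper interposed on a kernel-to-driver call.

    Copy-in: the driver receives a private copy of the kernel object.
    Copy-out: only the fields the driver may modify are propagated back.
    """
    private = copy.deepcopy(kernel_obj)              # copy-in
    result = xpc_call(domain, driver_func, private)  # control transfer
    for field in writable_fields:                    # copy-out
        kernel_obj[field] = private[field]
    return result

def buggy_driver_send(pkt):
    """Driver routine that updates a counter and also makes a stray write."""
    pkt["bytes_sent"] = len(pkt["payload"])
    pkt["owner"] = None  # stray write; it only damages the private copy
    return 0

domain = ProtectionDomain("e1000")
packet = {"payload": b"hello", "bytes_sent": 0, "owner": "kernel"}
status = kernel_wrapper(domain, buggy_driver_send, packet, ["bytes_sent"])
```

In this model the stray write to `owner` never reaches the kernel's object, while the legitimate update to `bytes_sent` does. The price is one counted XPC per call.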

  25. Recovery - Fault Detection [Diagram: User Program, User Program, Driver, Kernel, Recovery, Processor]

  26. Recovery - Fault Detection [Diagram: User Program, User Program, Driver, Kernel, Recovery]

  27. Recovery - Fault Detection [Diagram: User Program, User Program, Driver, Kernel, Recovery, Detector]

  28. Recovery [Diagram: User Program, User Program, Driver, Kernel, Recovery, STOP; step: Stop]

  29. Recovery [Diagram: User Program, User Program, Kernel, Recovery; steps: Stop / Unload]

  30. Recovery [Diagram: User Program, User Program, Driver, Kernel, Recovery, GO; steps: Stop / Unload / Reload]

  31. Design Summary • Isolation – Lightweight Kernel Protection Domains – eXtension Procedure Call (XPC) – Copy-in/Copy-out – Wrappers • Recovery – Hardware and software checks – Stop / Unload and GC / Reload
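
The recovery sequence named here (detect a fault, stop, unload and garbage-collect, reload) can be sketched as a small Python model. All names (`DriverFault`, `Driver`, `RecoveryManager`) are hypothetical, and a Python exception stands in for the hardware and software checks; this is an illustration under those assumptions, not the Nooks implementation.

```python
class DriverFault(Exception):
    """Stand-in for a fault caught by a hardware or software check."""

class Driver:
    def __init__(self, name, fail_on=None):
        self.name = name
        self.fail_on = fail_on   # request that triggers the injected fault
        self.resources = []      # kernel objects the driver currently holds

    def handle(self, request):
        self.resources.append(f"buf:{request}")
        if request == self.fail_on:
            raise DriverFault(f"{self.name} faulted on {request!r}")
        return f"ok:{request}"

class RecoveryManager:
    def __init__(self, load_driver):
        self.load_driver = load_driver  # factory used for load and reload
        self.driver = load_driver()
        self.log = []                   # recovery steps, in order

    def call(self, request):
        try:
            return self.driver.handle(request)
        except DriverFault:
            self.recover()
            return None  # the failed request is reported as an error

    def recover(self):
        self.log.append("stop")
        failed, self.driver = self.driver, None
        self.log.append("unload")
        failed.resources.clear()        # release what the driver held
        self.log.append("gc")
        self.driver = self.load_driver()
        self.log.append("reload")

mgr = RecoveryManager(lambda: Driver("sb", fail_on="bad"))
first = mgr.call("play")
second = mgr.call("bad")   # triggers the fault and a recovery
third = mgr.call("play")   # served by the reloaded driver
```

Note that in this sketch the caller of the failed request still sees an error; only later requests succeed.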

  32. Some Limitations • Blame the processor • Blame the operating system • Blame us

  33. Outline • Vision • Design • Evaluation – Reliability – Performance – Implementation Cost • Summary

  34. Tested Drivers • Sound card drivers – SoundBlaster 16 (sb) – Ensoniq 1371 • Network drivers – Intel Pro/1000 Gigabit Ethernet (e1000) – AMD PCnet32 10/100 Mb Ethernet (pcnet32) – 3Com 3c90x 10/100 Mb Ethernet – 3Com 3c59x 10/100 Mb Ethernet • Filesystems – VFAT Windows-compatible filesystem (vfat) • Other – kHTTPd in-kernel web server (khttpd)

  35. Reliability Test Methodology [Flow diagram: Load driver, Inject bugs, Test, Nothing, Failure, Reboot]

  36. Reliability Test Methodology [Flow diagram: Load driver, Inject bugs, Test, Nothing, Failure, Recovery, Reboot]
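
One way to read the test loop on these two slides is as a classification of each trial into "nothing", "recovery", or "reboot". The Python sketch below is hypothetical: the fault descriptions (`manifests`, `detectable`) and the four sample faults are invented for illustration and are not the injected bugs or results from the study.

```python
from collections import Counter

def run_trial(fault, nooks_enabled):
    """Load driver, inject one fault, run the test, classify the outcome.

    `fault` is a hypothetical dict: 'manifests' says whether the injected bug
    causes a failure at all, 'detectable' whether the fault can be caught.
    """
    if not fault["manifests"]:
        return "nothing"
    if nooks_enabled and fault["detectable"]:
        return "recovery"   # driver is recovered, system keeps running
    return "reboot"         # system crashed and must be restarted

def run_experiment(faults, nooks_enabled):
    """Tally outcomes over a list of injected faults."""
    return Counter(run_trial(f, nooks_enabled) for f in faults)

# Invented sample faults, for illustration only.
faults = [
    {"manifests": False, "detectable": False},
    {"manifests": True, "detectable": True},
    {"manifests": True, "detectable": True},
    {"manifests": True, "detectable": False},
]
native = run_experiment(faults, nooks_enabled=False)
nooks = run_experiment(faults, nooks_enabled=True)
```

Running the same fault list with and without isolation is what makes the two crash counts comparable.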

  37-42. Nooks Stops Crashes [Bar chart, built up over six slides: number of crashes (axis 0 to 200) per extension, No Nooks vs. Nooks. As bars are added: pcnet32 119 vs. 0; e1000 52 vs. 0; sb 10 vs. 1; the last slide adds kHTTPd and VFAT with bar labels 175, 10, 2, 2]

  43. Performance • Dominant cost is XPC – Performance depends on the frequency of interaction with the kernel
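
The claim that XPC frequency dominates can be turned into back-of-the-envelope arithmetic. The model below is an assumption, not something stated in the slides: it treats a workload as CPU-bound and charges a fixed cost per XPC. `ASSUMED_COST` is a hypothetical placeholder, not a measured Nooks figure. The two rates are the lowest and highest XPC/sec values listed for the workloads in the performance charts.

```python
def xpc_overhead(xpc_per_sec, cost_per_xpc_sec):
    """CPU seconds spent on XPC for each second of native execution."""
    return xpc_per_sec * cost_per_xpc_sec

def relative_performance(xpc_per_sec, cost_per_xpc_sec):
    """Throughput relative to native for a CPU-bound workload: 1 / (1 + overhead)."""
    return 1.0 / (1.0 + xpc_overhead(xpc_per_sec, cost_per_xpc_sec))

ASSUMED_COST = 5e-6  # hypothetical per-XPC cost in seconds, NOT a measured value

low_rate = relative_performance(150, ASSUMED_COST)      # Play MP3 rate
high_rate = relative_performance(61_183, ASSUMED_COST)  # Simple Web rate
```

Under this model a workload making 150 XPCs per second is essentially unaffected whatever the per-call cost, while one making tens of thousands per second is sensitive to it.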

  44-47. Perf. Relative to Native Linux [Bar chart, built up over four slides: relative performance (axis 0 to 1) per workload, each annotated with XPC/sec] • Play MP3 (sb): 150 • Receive Stream (e1000): 8,923 • Send Stream (e1000): 60,352 • Apache SpecWeb (e1000): 1,960 • Compile Local (VFAT): 22,653 • Simple Web (kHTTPd): 61,183

  48. Implementation Cost • Changes to old code – Kernel: 924 out of 1.1 million lines – Device drivers+VFAT: 0 out of 33,000 lines – kHTTPd: 13 out of 2,000 lines • New code – Nooks reliability layer: 22,266 lines

  49. Summary • Nooks provides a new reliability layer between drivers and the OS • Nooks prevents 99% of tested faults that cause Linux to crash • Nooks imposes a modest performance cost

  50. Why didn’t we use a microkernel? • Doesn’t address our limitations – Isolation not much better – Fault detection not much better – Recovery not much better – Doesn’t improve performance • Requires more changes to the kernel • Makes compatibility more difficult

  51. Recovery • Goals: – Restore driver state so it can process requests as if it had never failed – Conceal failure from applications • Observation: – Driver interface specifies how driver responds to requests • Approach: Model drivers as state machines

  52. Drivers as State Machines [State diagram; transition labels: send, complete]

  53. Drivers as State Machines • Recovery: – Advance driver from initial state to state at time of crash – Reply to requests with valid responses according to driver state [State diagram; transition labels: open, close, config]

  54. Shadow Drivers • Generic code that: – Normally: • Records state-changing inputs – On failure: • Restarts driver • Replays inputs to recover • Emulates driver to applications/OS • One shadow driver handles recovery for an entire class of drivers
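
The record-and-replay idea can be sketched in Python as follows. The classes `Driver`, `ShadowDriver`, and `Tap`, the method names, and the choice of `configure` as the only state-changing call are all assumptions made for illustration; the sketch omits the emulation step and is not the authors' implementation.

```python
class Driver:
    """Toy driver whose recoverable state is a dict of settings."""
    def __init__(self):
        self.config = {}
        self.written = []

    def configure(self, key, value):
        self.config[key] = value

    def write(self, data):
        self.written.append(data)
        return len(data)

class ShadowDriver:
    """Records state-changing inputs and replays them into a restarted driver."""
    def __init__(self):
        self.log = []

    def record(self, method, args):
        self.log.append((method, args))

    def recover(self, driver_factory):
        driver = driver_factory()          # restart the driver
        for method, args in self.log:      # replay inputs to rebuild its state
            getattr(driver, method)(*args)
        return driver

class Tap:
    """Sits between kernel and driver; copies state-changing calls to the shadow."""
    STATE_CHANGING = {"configure"}  # assumed set, for illustration

    def __init__(self, driver_factory):
        self.driver_factory = driver_factory
        self.driver = driver_factory()
        self.shadow = ShadowDriver()

    def call(self, method, *args):
        if method in self.STATE_CHANGING:
            self.shadow.record(method, args)
        return getattr(self.driver, method)(*args)

    def on_failure(self):
        self.driver = self.shadow.recover(self.driver_factory)

tap = Tap(Driver)
tap.call("configure", "mtu", 1500)
tap.call("write", b"abc")
tap.call("configure", "promisc", False)
tap.on_failure()  # simulate a driver failure followed by recovery
```

After `on_failure`, the new driver instance has the same configuration as the failed one, but the data write is not replayed because it was not logged as state-changing.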

  55. Shadow Driver Overview [Diagram: Device, Driver, Kernel, Tap, Shadow Driver; calls shown: write(…)]

  56. Preparing for Recovery [Diagram: Device, Driver, Kernel, Tap, Shadow Driver with config log; calls shown: config(…)]

  57. Recovering a Failed Driver [Diagram: Device, Driver, Kernel, Tap, Shadow Driver with config log; calls shown: register(…), config(…)]

  58. Recovering a Failed Driver • Summary: – Reset driver – Reinitialize driver – Replay logged requests

  59. Spoofing a Failed Driver [Diagram: Device, Driver, Kernel, Tap, Shadow Driver; calls shown: write(…), return]

  60. Spoofing a Failed Driver • Shadow acts as driver: replies to requests with valid possible responses – Applications and OS unaware that driver failed – No device control • General Strategies: 1. Answer request from log 2. Act busy 3. Block caller 4. Queue request 5. Drop request
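
The five general strategies can be read as a dispatch table keyed by request type. In the Python sketch below, the request names, the `POLICY` mapping from request to strategy, and the returned status strings are all hypothetical; the slides list the strategies but do not say which requests use which.

```python
from collections import deque
from enum import Enum

class Strategy(Enum):
    ANSWER_FROM_LOG = 1
    ACT_BUSY = 2
    BLOCK_CALLER = 3
    QUEUE_REQUEST = 4
    DROP_REQUEST = 5

# Hypothetical assignment of strategies to request types, for illustration only.
POLICY = {
    "get_config": Strategy.ANSWER_FROM_LOG,
    "ioctl": Strategy.ACT_BUSY,
    "open": Strategy.BLOCK_CALLER,
    "write": Strategy.QUEUE_REQUEST,
    "send_packet": Strategy.DROP_REQUEST,
}

class RecoveringShadow:
    """Answers requests on the driver's behalf while the driver is recovering."""
    def __init__(self, logged_config):
        self.logged_config = dict(logged_config)
        self.queued = deque()   # requests to hand to the driver after recovery
        self.blocked = []       # callers to wake after recovery

    def handle(self, request, arg=None, caller=None):
        strategy = POLICY[request]
        if strategy is Strategy.ANSWER_FROM_LOG:
            return self.logged_config.get(arg)
        if strategy is Strategy.ACT_BUSY:
            return "EBUSY"
        if strategy is Strategy.BLOCK_CALLER:
            self.blocked.append(caller)
            return "BLOCKED"
        if strategy is Strategy.QUEUE_REQUEST:
            self.queued.append((request, arg))
            return "QUEUED"
        return "DROPPED"

shadow = RecoveringShadow({"mtu": 1500})
answers = [
    shadow.handle("get_config", "mtu"),
    shadow.handle("ioctl"),
    shadow.handle("open", caller="app1"),
    shadow.handle("write", b"data"),
    shadow.handle("send_packet", b"pkt"),
]
```

None of these paths touches the device, which matches the "No device control" constraint on the slide.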

  61. Completing Recovery [Diagram: Device, Driver, Kernel, Tap, Shadow Driver]

  62. Design Summary • Isolation – Lightweight Kernel Protection Domains – eXtension Procedure Call (XPC) – Object Table – Wrappers • Recovery – Shadow Drivers

  63. Outline • Introduction • Problem • Design • Evaluation – Implementation – Benefit – Cost • Summary and Future Work
