Surviving Sensor Network Software Faults Presented by Jacek Migdal
Software crashes Software bugs are common: • tests may not reveal rare problems • bugs are hard to identify and fix ... but a sensor network should be able to work for years. Ariane 5 Flight 501
Common approach Have you tried rebooting?
Rebooting on failure • works in most cases (memory faults) • recent data is lost • time consuming => reduces availability • causes additional cost for routing protocols
Proposed solution: Neutron Divide software into recovery units and reboot only the faulty unit.
Hardware • 1-8 MHz microcontroller • 4-10 KB SRAM • 40-128 KB flash memory • no hardware memory isolation A low-overhead solution is needed.
Architecture • Compiler: Neutron extensions, Deputy compiler • TinyOS: Safe TinyOS, TOSThreads, Neutron recovery code
Recovery unit Recovery unit: • application recovery unit • kernel recovery unit Definition: • may not call directly into a different recovery unit • instantiates at least one thread (the kernel has exactly one) • every component belongs to at most one application recovery unit or to the kernel recovery unit
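The rules on this slide can be read as checks over a component table. The following is a minimal host-side sketch in C, not Neutron's actual implementation: the types `component_t` and `call_edge_t`, the function names, and the idea that a cross-unit call is legal only when it goes through a system call are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: unit 0 is the kernel, units >= 1 are application units. */
enum { KERNEL_UNIT = 0 };

typedef struct {
    const char *name;
    int unit;          /* one field, so a component belongs to exactly one unit */
} component_t;

typedef struct {
    size_t caller;     /* index into the component table */
    size_t callee;     /* index into the component table */
    bool via_syscall;  /* assumed: the call crosses the boundary as a system call */
} call_edge_t;

/* A direct call is legal only inside one unit (assumption: otherwise a syscall). */
bool call_is_legal(const component_t *comps, call_edge_t e) {
    assert(comps != NULL);
    return comps[e.caller].unit == comps[e.callee].unit || e.via_syscall;
}

/* Check every call edge of the program. */
bool all_calls_legal(const component_t *comps, const call_edge_t *edges, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (!call_is_legal(comps, edges[i])) {
            return false;
        }
    }
    return true;
}

/* The kernel has exactly one thread; every application unit has at least one. */
bool thread_counts_legal(const int *threads_per_unit, size_t n_units) {
    assert(threads_per_unit != NULL);
    if (n_units == 0 || threads_per_unit[KERNEL_UNIT] != 1) {
        return false;
    }
    for (size_t u = 1; u < n_units; u++) {
        if (threads_per_unit[u] < 1) {
            return false;
        }
    }
    return true;
}
```

Storing the owning unit as a single field makes the "at most one unit per component" rule hold by construction, so only the call and thread rules need explicit checks.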
How to divide a program into recovery units • Use annotations to define kernel boundaries (@syscall_base, @syscall_ext) • Use the Deputy compiler to divide the program into recovery units and isolate them
How to recover an application unit 1. Cancel system calls and halt threads 2. Reclaim allocated memory 3. Re-initialize application unit RAM (pending flag) 4. Restart the application unit thread
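The four steps above can be sketched as a small host-side C model. This is an illustration under stated assumptions, not Neutron's code: `app_unit_t`, its fields, the fixed array sizes, and `recover_app_unit` are hypothetical names, and the "initial RAM image" stands in for whatever the real system uses to re-initialize a unit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define UNIT_RAM_BYTES 16   /* hypothetical size of the unit's RAM */
#define MAX_THREADS 2       /* hypothetical thread limit */
#define MAX_ALLOCS 4        /* hypothetical number of heap blocks */

typedef enum { THREAD_HALTED, THREAD_RUNNING } thread_state_t;

typedef struct {
    thread_state_t threads[MAX_THREADS];
    bool syscall_pending[MAX_THREADS];        /* one outstanding call per thread */
    bool alloc_in_use[MAX_ALLOCS];            /* heap blocks owned by this unit */
    unsigned char ram[UNIT_RAM_BYTES];        /* the unit's live variables */
    unsigned char ram_init[UNIT_RAM_BYTES];   /* assumed initial image of that RAM */
    unsigned restarts;
} app_unit_t;

void recover_app_unit(app_unit_t *u) {
    assert(u != NULL);

    /* 1. Cancel system calls and halt the unit's threads. */
    for (size_t i = 0; i < MAX_THREADS; i++) {
        u->syscall_pending[i] = false;
        u->threads[i] = THREAD_HALTED;
    }

    /* 2. Reclaim memory the unit had allocated. */
    for (size_t i = 0; i < MAX_ALLOCS; i++) {
        u->alloc_in_use[i] = false;
    }

    /* 3. Re-initialize the unit's RAM from its initial image. */
    memcpy(u->ram, u->ram_init, UNIT_RAM_BYTES);

    /* 4. Restart the unit's threads. */
    for (size_t i = 0; i < MAX_THREADS; i++) {
        u->threads[i] = THREAD_RUNNING;
    }
    u->restarts++;
}
```

Nothing outside `app_unit_t` is touched, which mirrors the point of the slide: other units and the kernel keep running while the faulty unit is reset.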
How to recover the kernel unit 1. Cancel outstanding system calls 2. Save application-dependent state 3. Reboot TinyOS (skip thread state initialization) 4. Restore the saved state
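A kernel reboot differs from an application restart in that application threads must survive it. The sketch below models that in C under several assumptions: `node_t`, `app_thread_t`, `boot_tinyos`, and the `must_retry` flag (standing in for "the cancelled call has to be reissued") are hypothetical and only illustrate the four steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define N_APP_THREADS 2   /* hypothetical number of application threads */

typedef struct {
    int pc;                    /* stands in for the thread's saved context */
    bool syscall_outstanding;  /* thread is blocked in a system call */
    bool must_retry;           /* assumed: cancelled call must be reissued */
} app_thread_t;

typedef struct {
    int kernel_state;                     /* stands in for all kernel RAM */
    app_thread_t threads[N_APP_THREADS];
} node_t;

/* Hypothetical boot routine: resets kernel state, optionally thread state too. */
static void boot_tinyos(node_t *n, bool skip_thread_init) {
    n->kernel_state = 0;
    if (!skip_thread_init) {
        memset(n->threads, 0, sizeof n->threads);
    }
}

void recover_kernel_unit(node_t *n) {
    assert(n != NULL);
    app_thread_t saved[N_APP_THREADS];

    /* 1. Cancel outstanding system calls. */
    for (size_t i = 0; i < N_APP_THREADS; i++) {
        if (n->threads[i].syscall_outstanding) {
            n->threads[i].syscall_outstanding = false;
            n->threads[i].must_retry = true;
        }
    }

    /* 2. Save application-dependent state. */
    memcpy(saved, n->threads, sizeof saved);

    /* 3. Reboot TinyOS, skipping thread state initialization. */
    boot_tinyos(n, true);

    /* 4. Restore the saved state. */
    memcpy(n->threads, saved, sizeof saved);
}
```

In this model steps 2 and 4 look redundant because the boot routine already skips the thread table; they are kept to match the slide and to show where state would be protected if the boot code did overwrite it.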
Precious state • Losing application state is too costly. • Preserve variable values across application unit restarts (mark the variables with the @precious annotation).
Precious state Recovery: 1. Check for corruption 2. Push to stack 3. Re-initialize recovery unit 4. Pop from stack and copy Features: 1. Groups 2. Atomic operations 3. (Optional) Check integrity on application level
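The recovery sequence above can be sketched in C as follows. This is a simplified model with assumed names: `route_group_t` is a made-up precious group, `route_tainted` stands in for the result of the corruption check (how corruption is actually detected is outside this sketch), and a local variable plays the role of the stack slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical precious group: routing state that is costly to rebuild. */
typedef struct {
    unsigned short parent;
    unsigned short hop_count;
} route_group_t;

typedef struct {
    route_group_t route;   /* treated as precious in this sketch */
    int scratch;           /* ordinary state, lost on restart */
    bool route_tainted;    /* assumed outcome of the corruption check */
} unit_ram_t;

/* Assumed initial image of the unit's RAM (no parent, zero hops). */
static const unit_ram_t UNIT_INIT = { {0xFFFF, 0}, 0, false };

void restart_with_precious(unit_ram_t *ram) {
    assert(ram != NULL);

    /* 1. Check for corruption: a tainted group is not preserved. */
    bool keep = !ram->route_tainted;

    /* 2. Push to stack: a local copy survives the re-initialization. */
    route_group_t saved = ram->route;

    /* 3. Re-initialize the recovery unit. */
    *ram = UNIT_INIT;

    /* 4. Pop from stack and copy back, only if the group was intact. */
    if (keep) {
        ram->route = saved;
    }
}
```

Copying the group as a whole reflects the "groups" feature on the slide: related precious variables are kept or dropped together, so a restart never leaves a half-preserved group.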
Evaluation - availability
Evaluation - routing protocol cost
Evaluation - overhead Low programmer overhead (mostly cost of adding annotations)
Related work • kernel level safety (most OS, using virtual address space) • language-level safety • micro reboots (Java Enterprise Edition)
Conclusion Neutron: • recovers from memory safety bugs • divides the program into recovery units • re-initializes the faulty unit on error • is implemented as part of the compiler and TinyOS • is designed for resource-limited hardware • reduces time to synchronization by 94% and the cost of the routing protocol by 99.5%
References • Y. Chen, O. Gnawali, M. Kazandjieva, P. Levis, J. Regehr: "Surviving Sensor Network Software Faults," in Proceedings of ACM SOSP 2009, Big Sky, MT, USA, October 2009. • Image sources: • http://top10latest.com/top-10-costliest-software-bugs • http://www.personal.kent.edu/~rmuhamma • http://store.fungizmos.com • http://omrumfuneraltransport.com • http://www.moddergamer.com