Operating System Support for Redundant Multithreading

  1. OPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING
     Björn Döbel (TU Dresden), Hermann Härtig (TU Dresden), Michael Engel (TU Dortmund)
     Tampere, 08.10.2012

  7. Fault Tolerance: State of the Union
     Chart axes: software errors vs. hardware errors; non-COTS vs. COTS
     Approaches shown: seL4, Minix3, IBM z/OS, HP NonStop, Carburizer, SWIFT, Encoded Processing, RAD-hard CPUs, Redundant Multithreading, Romain

  9. Process-Level Redundancy [Shye 2007]
     Binary recompilation
     • Complex, unprotected compiler
     • Architecture-dependent
     → Reuse OS mechanisms
     System calls for replica synchronization
     → Additional synchronization events
     Virtual memory fault isolation
     • Restricted to Linux user-level programs
     → Microkernel-based

  14. Transparent Replication as OS Service
      Diagram layers, top to bottom:
      • Unreplicated Application | Replicated Application | Replicated Driver
      • L4 Runtime Environment | Romain
      • L4/Fiasco.OC microkernel
      Label: Reliable Computing Base

  18. Romain: Structure
      • Replica, Replica, Replica
      • Master: "=" (replica comparison), Resource Manager, System Call Proxy
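
The "=" element in the structure diagram suggests that the master compares the states of its replicas. A minimal Python sketch of how such a comparison could work for DMR (two replicas) and TMR (three replicas) follows. It assumes each replica's state has been reduced to a hashable tuple of register values; the function name `compare_states` and the state layout are hypothetical and not taken from Romain.

```python
from collections import Counter


def compare_states(states):
    """Compare replica states and vote on the correct one.

    states: list of hashable replica states (here: tuples of register values).
    Returns (agreed_state, faulty_indices). agreed_state is None when no
    strict majority exists (e.g. a DMR mismatch): the error is detected,
    but voting alone cannot tell which replica is wrong.
    """
    counts = Counter(states)
    state, votes = counts.most_common(1)[0]
    if votes * 2 <= len(states):
        # No strict majority: every replica is suspect.
        return None, list(range(len(states)))
    faulty = [i for i, s in enumerate(states) if s != state]
    return state, faulty
```

With three replicas, a single deviating state is outvoted and its index is reported; with two replicas, a mismatch can only be detected, not corrected.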

  21. Resource Management: Capabilities
      • Replica 1: capability slots 1 2 3 4 5 6
      • Replica 2: capability slots 1 2 3 4 5 6
      • Master: capability slots 1 2 3 4 5 6

  22. Partitioned Capability Tables
      • Replica 1: capability slots 1 2 3 4 5 6
      • Replica 2: capability slots 1 2 3 4 5 6
      • Master: capability slots 1 2 3 4 5 6
      Diagram labels: "Marked used", "Master private"
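
One possible reading of the "Marked used" and "Master private" labels is that some slots are reserved for the master and marked as used, so that replica-visible allocation never hands them out and all replicas see the same slot numbers. The Python sketch below illustrates that reading only. The class `PartitionedCapTable`, its method, and the choice of which slots are private are assumptions for illustration, not Romain's actual data structures.

```python
class PartitionedCapTable:
    """Toy capability-slot allocator with a master-private partition (hypothetical)."""

    def __init__(self, size, master_private):
        self.size = size
        self.master_private = frozenset(master_private)
        # Master-private slots are marked used up front, so the
        # replica-visible allocator below can never return them.
        self.used = set(self.master_private)

    def alloc_replica_slot(self):
        """Return the lowest free slot outside the master-private partition."""
        for slot in range(1, self.size + 1):
            if slot not in self.used:
                self.used.add(slot)
                return slot
        raise RuntimeError("capability table full")
```

Two tables created with the same size and the same private set return identical slot numbers for the same sequence of allocations, which is the property a replicated application would need.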

  23. Replica Memory Management
      • Replica 1: regions rw, ro, ro
      • Replica 2: regions rw, ro, ro
      • Master

  26. Shared Memory
      • Not in complete control of master
      • Standard technique: trap&emulate
        – Execution overhead (100x to 1000x)
        – Adds complexity to RCB
          Disassembler: 6,000 LoC
          Tiny emulator: 500 LoC
      • Our implementation: copy & execute

  32. Copy&Execute
      Replica: mov eax, [ebx] (marked "X" in the animation, then copied to the master)
      Master:
        load replica state
        NOP; NOP; ...; mov eax, [ebx]; NOP
        restore master state
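
The sequence on this slide can be sketched as a small Python simulation. This is a toy model under stated assumptions: registers are a dict, shared memory is a dict keyed by address, and the `execute` function merely stands in for the real CPU. In the technique itself the copied instruction would run natively in the master, which is what avoids a per-instruction emulator. All names here (`execute`, `copy_and_execute`, the tuple encoding of instructions) are hypothetical.

```python
NOP = ("nop",)


def execute(instr, regs, memory):
    """Stand-in for the CPU: run one toy instruction on regs and memory."""
    if instr[0] == "nop":
        return
    if instr[0] == "load":  # models: mov dst, [addr_reg]
        _, dst, addr_reg = instr
        regs[dst] = memory[regs[addr_reg]]
        return
    raise ValueError(f"unsupported instruction: {instr!r}")


def copy_and_execute(instr, replica_regs, master_regs, shared_memory, slots=4):
    """Run a replica's instruction inside the master, on the replica's state."""
    if slots < 2:
        raise ValueError("need at least two buffer slots")
    # Buffer of NOPs with the copied instruction patched in: NOP; ...; instr; NOP
    buffer = [NOP] * slots
    buffer[-2] = instr

    saved = dict(master_regs)          # save master state
    master_regs.clear()
    master_regs.update(replica_regs)   # load replica state

    for op in buffer:
        execute(op, master_regs, shared_memory)

    replica_regs.clear()
    replica_regs.update(master_regs)   # hand the result back to the replica
    master_regs.clear()
    master_regs.update(saved)          # restore master state
    return replica_regs
```

For example, with `shared_memory = {0x1000: 42}` and a replica whose `ebx` holds `0x1000`, running `("load", "eax", "ebx")` leaves 42 in the replica's `eax` while the master's own registers are unchanged afterwards.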

  35. Benchmarks
      • MiBench suite
      • Fault injection to confirm fault distribution ratios
      • Overhead for DMR and TMR
      • Microbenchmarks for shared memory

  36. Overhead vs. Unreplicated Execution [figure]

  37. Romain Lines of Code
      Base code (main, logging, locking)    325
      Application loader                    375
      Replica manager                       628
      Redundancy                            153
      Memory manager                        445
      System call proxy                     311
      Shared memory                         281
      Total                               2,518
      Fault injector                        668
      GDB server stub                     1,304

  38. Conclusion
      • Redundant Multithreading as an OS service
      • Support for binary-only applications
      • Overheads < 30%, often < 5%
      • Shared memory handling is slow
      • Work in progress:
        – Multithreading
        – Device drivers

  40. Hardening the RCB
      • We need: Dedicated mechanisms to protect the RCB (HW or SW)
      • We have: Full control over software
      • Use FT-encoding compiler?
        – Has not been done for kernel code yet
        – Only protects SW components
      • RAD-hardened hardware?
        – Too expensive
      • Our proposal: Split HW into ResCores and NonResCores
      Figure: grid of cores with one ResCore among NonResCores

  41. Signaling Performance
      • Overhead compared to single, unreplicated run
      • Benchmarks with highest overhead in EMSOFT paper
      • Test machine:
        – 12x Intel Core2 2.6 GHz
        – Replicas pinned to dedicated physical cores
        – Hyperthreading off
      Figure: Overhead by notification method (Local Faults, Migration, Sync IPC, Shared Mem); y-axis: overhead in %; benchmarks susan and CRC32 under DMR and TMR

  42. What about signalling failures?
      Missed CPU exceptions → detected by watchdog
      Spurious CPU exceptions → detected by watchdog / state comparison
      Transmission of corrupt state → detected during state comparison
      Overwriting remote state during transmission
      • NonResCore memory
      • Accessible by ResCores, but not by other NonResCores
      • Prevents overwriting other states
      • Already available in HW: IBM/Cell
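
The watchdog case above can be illustrated with a short Python sketch. It assumes the watchdog sees, for each replica, either the time at which that replica's exception notification arrived or nothing at all, and flags any replica that has not reported by a deadline. The function `watchdog_check`, its arguments, and the time representation are hypothetical; the slides do not describe the watchdog's interface.

```python
def watchdog_check(arrivals, deadline):
    """Return indices of replicas whose notification is missing or late.

    arrivals: one entry per replica; the arrival time of its exception
    notification, or None if nothing has arrived.
    deadline: latest acceptable arrival time.
    """
    return [i for i, t in enumerate(arrivals) if t is None or t > deadline]
```

A missed CPU exception shows up here as a `None` entry (or a late one) and is reported by index; spurious exceptions and corrupt state would instead have to be caught by comparing replica states.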
