
ASTEROID – AN ANALYZABLE, RESILIENT, EMBEDDED REAL-TIME OPERATING SYSTEM DESIGN

Björn Döbel, Hermann Härtig (TU Dresden); Philip Axer, Rolf Ernst (TU Braunschweig). Böblingen, 07.02.2013


  1. ASTEROID – AN ANALYZABLE, RESILIENT, EMBEDDED REAL-TIME OPERATING SYSTEM DESIGN
     Björn Döbel, Hermann Härtig (TU Dresden)
     Philip Axer, Rolf Ernst (TU Braunschweig)
     Böblingen, 07.02.2013


  4. The Many Faces of Hardware Faults
     • Radiation-induced soft errors
       – Mainly an issue in avionics and space [1]
     • DRAM errors in large data centers
       – Google study: > 2% of DRAM DIMMs fail per year [2]
       – ECC will not even detect a significant share of these errors [3]
       – Disk failure rate about 5% [4]
     • Furthermore: decreasing transistor sizes, higher rate of transient errors in CPU functional units [5]
     [1] Shirvani, McCluskey: Fault-Tolerant Systems in A Space Environment: The CRC ARGOS Project, 1998
     [2] Schroeder, Pinheiro, Weber: DRAM Errors in the Wild: A Large-Scale Field Study, SIGMETRICS 2009
     [3] Hwang, Stefanovici, Schroeder: Cosmic Rays Don't Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design, ASPLOS 2012
     [4] Pinheiro, Weber, Barroso: Failure Trends in a Large Disk Drive Population, FAST 2007
     [5] Shivakumar, Kistler, Keckler: Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic, DSN 2002


  9. Transparent Replication as OS Service
     [Figure: software stack. Top: Unreplicated Application, Replicated Application, Replicated Driver. Below: L4 Runtime Environment and Romain. Bottom: L4/Fiasco.OC microkernel. Label: Reliable Computing Base [6]]
     [6] Döbel, Härtig, Engel: Operating System Support for Redundant Multithreading, EMSOFT 2012


  11. Hardening the RCB
      • Use FT-encoding compiler?
        – Has not been done for kernel code yet
        – Only protects SW components
      • RAD-hardened hardware?
        – Too expensive
        – Rather provide small, separate building blocks
      • Our proposal: split HW into ResCores and NonRes-Cores [7]
      [Figure: many-core chip with a single ResCore among several NonRes cores]
      [7] Döbel, Härtig: Who watches the watchmen? – Protecting Operating System Reliability Mechanisms, HotDep 2012

  12. Fast State Comparison in Hardware
      Fingerprinting: signature generation and checking in HW [8]
      • XOR, CRC8, CRC16, CRC32
      • Little hardware overhead: 3-9%
      • Used for basic-block signature checking
      [Figure: processor pipeline (IF, ID, RA, EXE, MEM, X, WB) with an instruction fingerprint unit (inst + Instruction FP), a data fingerprint unit (result + Data FP), a chunk counter (Chunk CNT), and retire and exception signals]
      [8] Axer, Ernst, Döbel, Härtig: Designing an Analyzable and Resilient Embedded Operating System, SOBRES 2012
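The trade-off between the fingerprint functions listed on this slide can be illustrated in software. The sketch below is not the hardware unit from the talk; it assumes 32-bit words, uses Python's `zlib.crc32` as a stand-in for a CRC32 fingerprint, and the helper names are hypothetical. It shows that an XOR fingerprint misses two flips of the same bit position, while CRC32 catches them.

```python
import zlib
from functools import reduce

def xor_fingerprint(words):
    """XOR all 32-bit words together (cheap, but weak)."""
    return reduce(lambda acc, w: acc ^ (w & 0xFFFFFFFF), words, 0)

def crc32_fingerprint(words):
    """Fold the words into a running CRC32, one word at a time."""
    crc = 0
    for w in words:
        crc = zlib.crc32((w & 0xFFFFFFFF).to_bytes(4, "big"), crc)
    return crc

# Illustrative word stream (made-up values, not real instructions).
words = [0x01000000, 0x02000000, 0x03000000]
# Flip bit 0 in the first two words: the flips cancel out under XOR.
faulty = [words[0] ^ 1, words[1] ^ 1, words[2]]

xor_missed = xor_fingerprint(words) == xor_fingerprint(faulty)
crc_caught = crc32_fingerprint(words) != crc32_fingerprint(faulty)
```

Stronger functions cost more logic, which is one plausible reading of the 3-9% overhead range quoted above.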

  13. Basic-Block Signature (SparcV8 - LLVM)
      1. Custom LLVM pass annotates basic blocks
      2. Signature setup loads a precomputed instruction signature
      3. Fingerprint is computed during basic-block execution
      4. Signature is automatically checked on control-flow changes (e.g. jmp)
      5. Trap is raised if the signature does not match
      [Figure: Entry, Signature setup, Basic block (signature generation), Signature check, Exit]
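The five steps above can be mimicked in a few lines of software. This is a minimal sketch under stated assumptions: the real scheme runs in hardware on SPARC V8 with an LLVM pass, whereas here `precompute_signature`, `execute_block`, and `SignatureMismatch` are hypothetical names, CRC32 is assumed as the fingerprint function, and instructions are plain 32-bit integers.

```python
import zlib

class SignatureMismatch(Exception):
    """Stands in for the hardware trap raised on a failed check."""

def precompute_signature(block):
    """Compile-time step: signature over the block's instruction words."""
    return zlib.crc32(b"".join(i.to_bytes(4, "big") for i in block))

def execute_block(fetched, expected):
    """Run-time step: fingerprint the fetched instructions, then check
    the result at the end of the block (the control-flow change)."""
    fingerprint = 0
    for instr in fetched:
        fingerprint = zlib.crc32(instr.to_bytes(4, "big"), fingerprint)
    if fingerprint != expected:
        raise SignatureMismatch(hex(fingerprint))
    return fingerprint

# Illustrative 32-bit instruction words (arbitrary values).
block = [0x9DE3BF98, 0x82102001, 0x81C7E008]
signature = precompute_signature(block)

execute_block(block, signature)          # fault-free run passes the check

corrupted = [block[0], block[1] ^ (1 << 13), block[2]]   # single bit flip
try:
    execute_block(corrupted, signature)
    detected = False
except SignatureMismatch:
    detected = True
```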

  14. Preliminary Results - LLVM Assisted Signature Checking
      1. Signature checking detects invalid control-flow errors
      2. Code-size overhead on the order of 30%, since most basic blocks in our benchmarks were 6.6 instructions long
      3. How many control-flow errors are expected?
      [Figure: error-injection experiment. Stacked bars from 0% to 100% per pipeline stage (F, D, RA, E). Outcome classes: data correct and no control-flow change; data correct and control-flow change; data incorrect (dc); exception (exc); not terminating (nt)]

  15. Three Paths for ASTEROID v2
      1. Increased protection against hardware errors
         – Not only the processor might fail!
         – How do we perform real-time analysis?
         – Can we analyze software's vulnerability?
      2. Extended support for replicating software
         – Shared memory
         – Multithreading
         – Device I/O
      3. Integrated HW/SW platform

  16. Protecting Against Errors in the NoC
      • First step: harden NoC routers, detect packet abnormalities
      • Problem: data corruption in the header (route, type, ...)
      • Protect packets with checksums (e.g. CRC)
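A checksum over header and payload lets the receiver reject a packet whose route or type field was corrupted in transit. The sketch below only illustrates that idea: the packet layout (`route`, `type`, payload length) and the function names are assumptions made for this example, not the NoC format used in ASTEROID, and `zlib.crc32` stands in for whatever CRC the hardware would implement.

```python
import struct
import zlib

# Hypothetical header layout: route (1 byte), type (1 byte), length (2 bytes).
HEADER = struct.Struct(">BBH")
TRAILER = struct.Struct(">I")   # 32-bit CRC appended to the packet

def build_packet(route, ptype, payload):
    """Serialize header and payload, then append a CRC over both."""
    body = HEADER.pack(route, ptype, len(payload)) + payload
    return body + TRAILER.pack(zlib.crc32(body))

def check_packet(packet):
    """Return True if the trailing CRC matches header and payload."""
    body, trailer = packet[:-TRAILER.size], packet[-TRAILER.size:]
    (crc,) = TRAILER.unpack(trailer)
    return zlib.crc32(body) == crc

packet = build_packet(route=3, ptype=1, payload=b"data")

# Flip one bit in the route field, as a faulty router might.
corrupted = bytes([packet[0] ^ 0x04]) + packet[1:]
```

Here `check_packet(packet)` accepts the original packet and `check_packet(corrupted)` rejects the one with the damaged route field.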

  17. Real-Time Analysis
      Timing effects in the hardware architecture (e.g. retransmission on buses, NoC) [9]
      [Figure: timeline of tasks τ1 and τ2 with execution times C1 and C2, higher-priority interference, an error E1, re-execution, and signaling overhead over time t]
      [Figure: curves F1b, F1c, F1m, F9b, F9c, F9m; vertical axis X+(t), logarithmic from 10^0 down to 10^-15; horizontal axis t [ms] from 0 to 180]
      [9] Axer, Ernst: Stochastic Response-Time Guarantee for Non-Preemptive, Fixed-Priority Scheduling Under Error, 2013, to appear
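How re-execution after an error stretches a response time can be shown with the textbook fixed-point response-time iteration. Note the assumptions: this is a deterministic sketch for preemptive fixed-priority tasks with a fixed number `k` of re-executions of the task under analysis, not the stochastic, non-preemptive analysis cited on this slide. The function name and the task parameters are made up for illustration.

```python
import math

def wcrt_with_reexecution(c, higher_prio, k, limit=10_000):
    """Worst-case response time of a task with execution time c that is
    re-executed k times, under interference from higher-priority tasks
    given as (C_j, T_j) pairs. Returns None if no fixed point is found
    below limit."""
    demand = c * (1 + k)            # original run plus k re-executions
    r = demand
    while r <= limit:
        nxt = demand + sum(math.ceil(r / t) * cj for cj, t in higher_prio)
        if nxt == r:                # fixed point reached
            return r
        r = nxt
    return None

# One higher-priority task with C = 1 and period T = 4 (illustrative).
no_error = wcrt_with_reexecution(c=2, higher_prio=[(1, 4)], k=0)
one_error = wcrt_with_reexecution(c=2, higher_prio=[(1, 4)], k=1)
```

In this toy setup a single re-execution doubles the response time from 3 to 6 time units, because the longer busy period also picks up an extra activation of the higher-priority task.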

  18. Real-Time Analysis
      1. Timing effects in software (e.g. replication in the Romain framework)
      2. Problem: redundant copies must wait and synchronize on state externalization
      3. Solution: response-time analysis for the Romain framework
      4. Results: the amount of externalization and the priority have a major impact on response times [10]
      [Figure: two 3D plots, Bitcount and Rijndael, showing WCRT [ms] over load on core 1 and load on core 2 (0.0 to 1.0 each) for parallel and sequential execution]
      [10] Real-Time Analysis with pyCPA: http://code.google.com/p/pycpa
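Why the amount of externalization matters can be seen in a toy timing model. The sketch assumes that replicas run in segments separated by externalization points, that all replicas must reach a point before execution continues, and that each synchronization costs a fixed `sync_cost`. It ignores interference from other tasks and is not the response-time analysis from the talk. All names and numbers are illustrative.

```python
def replicated_completion(segments, sync_cost, parallel=True):
    """Completion time of a replicated task.

    segments: one tuple per segment, holding each replica's execution
              time between two externalization points.
    parallel: replicas run on separate cores (wait for the slowest)
              or sequentially on one core (times add up).
    """
    total = 0
    for replica_times in segments:
        work = max(replica_times) if parallel else sum(replica_times)
        total += work + sync_cost   # every externalization forces a sync
    return total

# Two replicas, two externalization points (made-up numbers).
segments = [(3, 4), (2, 2)]
t_parallel = replicated_completion(segments, sync_cost=1, parallel=True)
t_sequential = replicated_completion(segments, sync_cost=1, parallel=False)
```

Every additional externalization point adds one synchronization and one wait for the slowest replica, so the completion time grows with the number of segments in either mode.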

  19. Analyzing Program Vulnerability
      • Varying application fault tolerance requirements
      • Optimize resource usage for replication
      • Analyze RCB components to know what to protect
      • ASTEROID + DanceOS + FEHLER: [11]
        – Evaluate usefulness of PVF vs. fault injection
        – Outline challenges and possible solutions
      [Figure: register EDX. Three plots over time [10k instruction blocks] from 0 to 350: PVF (0 to 1), FI failure ratio (0 to 1), and absolute difference |PVF - FI|]
      [11] Döbel, Schirmeier, Engel: Investigating the Limitations of PVF for Program Vulnerability Analysis, DFR 2013
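The comparison plotted on this slide, PVF against the failure ratio observed by fault injection (FI), reduces to a per-window absolute difference. A minimal sketch follows; the function name is hypothetical and the numbers are invented for illustration, not taken from the cited study.

```python
def pvf_fi_abs_diff(pvf, fi_failures, fi_runs):
    """Per time window: |PVF - FI failure ratio|.

    pvf:         predicted vulnerability per window, each in [0, 1]
    fi_failures: number of injected faults that led to a failure
    fi_runs:     number of fault-injection runs in that window
    """
    return [abs(p - failures / runs)
            for p, failures, runs in zip(pvf, fi_failures, fi_runs)]

# Three windows with made-up values.
diffs = pvf_fi_abs_diff(pvf=[0.5, 1.0, 0.25],
                        fi_failures=[1, 4, 1],
                        fi_runs=[4, 4, 4])
```

A large difference in a window means that PVF and fault injection disagree about how vulnerable the register is there, which is the kind of limitation the cited work sets out to investigate.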

  20. Shared Memory
      • Not in complete control of master
      • Standard technique: trap & emulate
        – Execution overhead (x100 - x1000)
        – Adds complexity to RCB
          Disassembler: 6,000 LoC
          Tiny emulator: 500 LoC
      • Our implementation: copy & execute


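  25. Copy&Execute
      [Figure: Replica and Master side by side. The replica's shared-memory access "mov eax, [ebx]" is intercepted. The master holds a code stub: load repl. state; NOP; NOP; ...; NOP; restore master state. The instruction "mov eax, [ebx]" is copied into the NOP area and executed there.]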
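The copy & execute sequence can be mimicked with a toy model. This is a sketch only: registers are a Python dict, the copied instruction is a Python callable instead of machine code in a NOP-padded stub, and all names are hypothetical. It follows the three steps visible in the figure: load the replica state, execute the copied instruction, restore the master state.

```python
def copy_and_execute(instruction, replica_regs, master_regs, shared_mem):
    """Run one replica instruction in the master's context.

    Returns the replica's register state after the instruction and
    leaves the master's own registers unchanged.
    """
    saved = dict(master_regs)           # remember the master state
    master_regs.clear()
    master_regs.update(replica_regs)    # load repl. state
    instruction(master_regs, shared_mem)   # execute the copied instruction
    result = dict(master_regs)
    master_regs.clear()
    master_regs.update(saved)           # restore master state
    return result

def mov_eax_from_ebx(regs, mem):
    """Toy equivalent of: mov eax, [ebx]"""
    regs["eax"] = mem[regs["ebx"]]

shared_mem = {0x10: 42}
master_regs = {"eax": 7, "ebx": 99}
replica_regs = {"eax": 0, "ebx": 0x10}

new_replica_regs = copy_and_execute(
    mov_eax_from_ebx, replica_regs, master_regs, shared_mem)
```

Compared with trap & emulate, the master needs no emulator for the instruction's semantics: the hardware executes the instruction itself, and only the state switch around it has to be implemented.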
