
INTERMITTENT HARDWARE ERRORS RECOVERY: MODELING AND EVALUATION - PowerPoint PPT Presentation



  1. INTERMITTENT HARDWARE ERRORS RECOVERY: MODELING AND EVALUATION Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan

  2. INTERMITTENT FAULTS-DEFINITION • Hardware errors that appear non-deterministically at the same microarchitectural location. • 40% of the real-world failures in processors are caused by intermittent faults [1]. [Figure: timelines from the error start time for a transient fault, a permanent fault, and an intermittent fault.]

  3. CONTRIBUTIONS • Build a model of a chip multiprocessor running a parallel application using Stochastic Activity Networks. • Propose intermittent fault models that abstract real intermittent faults at the system level. • Evaluate the performance of a processor after applying different recovery options.

  4. RECOVERY-MOTIVATION [Figure: program execution timeline with periodic checkpoints (CHKPT).]

  5. RECOVERY-MOTIVATION [Figure: a hardware error strikes during program execution after a checkpoint (CHKPT). Problem!]

  6. RECOVERY-MOTIVATION [Figure: a transient hardware error strikes during program execution after a checkpoint (CHKPT). Problem! Recovery: restore to checkpoint.]

  7. RECOVERY-MOTIVATION [Figure: a permanent hardware error strikes during program execution after a checkpoint (CHKPT). Problem! Recovery: restore to checkpoint, core reconfiguration.]

  8. RECOVERY-MOTIVATION [Figure: an intermittent hardware error strikes during program execution after a checkpoint (CHKPT). Problem! Recovery: ? (restore to checkpoint, core reconfiguration).]

  9. MODEL OVERVIEW System Model: • Processor Model: Rollback-Only, Permanent Reconfiguration, Temporary Reconfiguration. • Fault Model: Base, Exponential, Weibull.

  10. KEY FINDINGS • Error rate and the relative importance of the error location are the main factors in finding the best recovery for high intermittent failure rates. • Permanent shutdown of the defective unit results in a slight improvement of the performance compared to the temporary shutdown.

  11. PROCESSOR MODEL [Flowchart, permanent reconfiguration. Nodes: Program Execution; Error; Error Detection; Rollback to Checkpoint; Reconfigure? (Yes/No); Fine-Grained Diagnosis; Unit Shutdown-Permanent; Perf. Degradation.]

  12. PROCESSOR MODEL [Flowchart, temporary reconfiguration. Nodes: Program Execution (Full Throughput); Error; Error Detection; Rollback to Checkpoint; Reconfigure? (Yes/No); Fine-Grained Diagnosis; Unit Shutdown-Temporary; Program Execution (Perf. Degradation); Enable Unit? (No).]

  13. FAULT MODEL-BASE FAULT MODEL • Abstract physical fault models. • Prune down the space of system configurations. [Diagram labels: λ; p and 1-p; Pulse; Error.]

  14. FAULT MODEL-EXPONENTIAL FAULT MODEL • Abstract physical fault models. • Prune down the space of system configurations. [Diagram labels: Inactive; Active; λ1; λ2; p and 1-p; Pulse; d; Error.]

  15. FAULT MODEL-WEIBULL FAULT MODEL • Abstract physical fault models. • Prune down the space of system configurations. [Diagram labels: Inactive; Active; λ1, σ; λ2; p and 1-p; Pulse; d; Error.]
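
The three fault-model slides survive only as diagram labels, so the following is a rough sketch of one possible reading, not the authors' model. It assumes the fault alternates between an Inactive and an Active phase, that the inactive dwell time is exponential (or Weibull when a shape parameter is given), that the active dwell time is exponential, and that each pulse of length `pulse_len` during an active phase produces an error with probability `p_error`. The function name `simulate_fault`, all parameter names, and the example values are hypothetical.

```python
import random


def simulate_fault(horizon, inactive_rate, active_rate, p_error,
                   pulse_len, weibull_shape=None, seed=0):
    """Return the times at which errors occur within [0, horizon).

    Assumed semantics (not taken from the slides):
      * inactive dwell: Exp(inactive_rate), or Weibull with
        scale 1/inactive_rate and shape weibull_shape if given;
      * active dwell: Exp(active_rate);
      * while active, one pulse every pulse_len time units, each
        causing an error with probability p_error.
    """
    if pulse_len <= 0:
        raise ValueError("pulse_len must be positive")
    rng = random.Random(seed)
    t = 0.0
    error_times = []
    while t < horizon:
        # Inactive phase: the fault is dormant.
        if weibull_shape is None:
            t += rng.expovariate(inactive_rate)
        else:
            t += rng.weibullvariate(1.0 / inactive_rate, weibull_shape)
        # Active phase: a burst of pulses, each possibly causing an error.
        burst_end = t + rng.expovariate(active_rate)
        while t < min(burst_end, horizon):
            if rng.random() < p_error:
                error_times.append(t)
            t += pulse_len
    return error_times


# Hypothetical parameters, in arbitrary time units.
exp_errors = simulate_fault(1000.0, 0.1, 1.0, 0.5, 0.5, seed=1)
wbl_errors = simulate_fault(1000.0, 0.1, 1.0, 0.5, 0.5,
                            weibull_shape=1.5, seed=1)
print(len(exp_errors), len(wbl_errors))
```

A base-model variant under the same assumptions would drop the Inactive/Active alternation and emit pulses at a single rate.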

  16. EXPERIMENT SETUP • Used Mobius [2] to simulate the system for 48 hours with a 95% confidence interval. • Used the useful work measure [3] to model processor throughput in a certain amount of time. • Analyzed a model of a multiprocessor running coordinated checkpointing.

  17. SYSTEM PARAMETERS [Processor-model flowchart (Program Execution; Error; Error Detection; Rollback to Checkpoint; Reconfigure? (Yes/No); Fine-Grained Diagnosis; Unit Shutdown-Permanent; Perf. Degradation) annotated with parameter values: Checkpoint 30sec/5-60min; 70% Accuracy; 0-35%; 0-60sec; 2sec.]

  18. RESEARCH QUESTIONS • When should we recover from an intermittent fault by shutting down the defective component? • For errors that are tolerated by shutting down the defective component, should the shutdown be permanent or temporary?

  19. RESULTS-DIFFERENT FAULT MODELS • Permanent/temporary reconfiguration leads to 27% more useful work than rollback-only for exponential and Weibull fault models.

  20. RESEARCH QUESTION What is the granularity of the disabled component that maximizes the processor’s performance?

  21. COMPONENT RANK • The maximum percentage of useful work that is lost when the component is disabled. • Example: a 4-core processor in which each core has two LSUs, running a program that uses all 8 LSUs for 60% of the time. • Using Amdahl’s law, the LSU rank is 19%, or 1/(0.4 + (0.6/0.125)). [Figure: LSU blocks.]
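
The arithmetic on this slide can be checked with a short sketch. The function name `amdahl_ratio` and its parameter names are illustrative and not from the presentation; the code only evaluates the general Amdahl form 1/((1 - f) + f/s) with f = 0.6 and s = 0.125, which is the expression quoted on the slide.

```python
def amdahl_ratio(affected_fraction: float, factor: float) -> float:
    """Evaluate Amdahl's law, 1 / ((1 - f) + f / s).

    affected_fraction (f): share of execution time that uses the component.
    factor (s): factor applied to that share (0.125 on the slide).
    """
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - affected_fraction) + affected_fraction / factor)


# Values from the slide: all 8 LSUs are used for 60% of the time.
lsu_rank = amdahl_ratio(0.6, 0.125)
print(f"LSU rank: {lsu_rank:.1%}")  # LSU rank: 19.2%
```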

  22. RESULT-EFFECT OF COMPONENT RANK • For this experiment, components with a rank of 35% or more should be disabled if diagnosed with intermittent errors.

  23. SENSITIVITY TO FAULT RATE

  24. RESULTS-SENSITIVITY TO FAULT RATE • If lost useful work outweighs the rank of the defective component, then the defective component should be disabled.
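
The rule on this slide amounts to a single comparison. A minimal sketch, assuming both quantities are expressed as fractions of total useful work; the function name `should_disable` and the lost-work values are hypothetical, while the 0.19 rank is the LSU example from slide 21.

```python
def should_disable(lost_useful_work: float, component_rank: float) -> bool:
    """Disable the defective component when the useful work lost to its
    intermittent errors exceeds the work lost by switching it off."""
    return lost_useful_work > component_rank


LSU_RANK = 0.19  # LSU rank from slide 21

# Hypothetical lost-work fractions for a high and a low fault rate.
print(should_disable(0.30, LSU_RANK))  # True: errors cost more than the unit
print(should_disable(0.10, LSU_RANK))  # False: keep the unit enabled
```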

  25. KEY FINDINGS • Error rate and the relative importance of the error location are the main factors in finding the best recovery for high intermittent failure rates. • Permanent shutdown of the defective unit results in a slight improvement of the performance compared to the temporary shutdown. References: [1] Eurosys, 2011. [2] Tools of MME, 2003. [3] DSN, 2005.
