INTERMITTENT HARDWARE ERRORS RECOVERY: MODELING AND EVALUATION Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan
INTERMITTENT FAULTS-DEFINITION • Hardware errors that appear non-deterministically at the same microarchitectural location. • 40% of the real-world failures in processors are caused by intermittent faults [1]. [Figure: transient, permanent and intermittent faults, with the error start time marked.]
CONTRIBUTIONS • Build a model of a chip multiprocessor running a parallel application using Stochastic Activity Networks. • Propose intermittent fault models that abstract real intermittent faults at the system level. • Evaluate the performance of a processor after applying different recovery options.
RECOVERY-MOTIVATION [Figure: program execution timeline with checkpoints (CHKPT).]
RECOVERY-MOTIVATION [Figure: a hardware error occurs during program execution after a checkpoint. Problem!]
RECOVERY-MOTIVATION [Figure: transient hardware error during program execution. Recovery: restore to checkpoint.]
RECOVERY-MOTIVATION [Figure: permanent hardware error during program execution. Recovery: restore to checkpoint and core reconfiguration.]
RECOVERY-MOTIVATION [Figure: intermittent hardware error during program execution. Recovery: restore to checkpoint or core reconfiguration?]
MODEL OVERVIEW System Model • Processor Model: Rollback-Only, Permanent Reconfiguration, Temporary Reconfiguration • Fault Model: Base, Exponential, Weibull
KEY FINDINGS • Error rate and the relative importance of the error location are the main factors in finding the best recovery for high intermittent failure rates. • Permanent shutdown of the defective unit results in a slight improvement of the performance compared to the temporary shutdown.
PROCESSOR MODEL [Flowchart, permanent reconfiguration: Program Execution, Error, Error Detection, Rollback to Checkpoint, Reconfigure? (Yes/No), Fine-Grained Diagnosis, Unit Shutdown-Permanent, Perf. Degradation.]
PROCESSOR MODEL [Flowchart, temporary reconfiguration: Program Execution, Error, Error Detection, Rollback to Checkpoint, Reconfigure? (Yes/No), Fine-Grained Diagnosis, Unit Shutdown-Temporary, Perf. Degradation, Enable Unit? (No), Full Throughput.]
FAULT MODEL-BASE FAULT MODEL • Abstract physical fault models. • Prune down the space of system configurations. [Diagram: states Pulse and Error; parameters λ, p, 1-p.]
FAULT MODEL-EXPONENTIAL FAULT MODEL • Abstract physical fault models. • Prune down the space of system configurations. [Diagram: states Inactive, Active, Pulse and Error; parameters λ1, λ2, p, 1-p, d.]
FAULT MODEL-WEIBULL FAULT MODEL • Abstract physical fault models. • Prune down the space of system configurations. [Diagram: states Inactive, Active, Pulse and Error; parameters (λ1, σ), λ2, p, 1-p, d.]
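A minimal Python sketch of an intermittent fault that alternates between inactive and active phases. The slides show only the state names and symbols, so the semantics here are assumptions: inactive periods are drawn from an exponential or a Weibull distribution, active periods from an exponential distribution, each pulse in an active period lasts pulse_len time units, and each pulse causes an error with probability p_error. The function and parameter names are hypothetical, and no mapping to λ1, λ2, σ or d is claimed.

```python
import random


def simulate_intermittent_fault(horizon, rate_inactive, rate_active,
                                pulse_len, p_error,
                                weibull_shape=None, seed=0):
    """Return (pulses, errors) observed over `horizon` time units.

    Assumed semantics (not taken from the slides):
    - inactive periods: Exponential(rate_inactive), or Weibull with scale
      1 / rate_inactive and shape `weibull_shape` when the latter is given;
    - active periods: Exponential(rate_active);
    - an active period is a train of pulses, each `pulse_len` long;
    - each pulse produces an error with probability `p_error`.
    """
    if pulse_len <= 0:
        raise ValueError("pulse_len must be positive")
    rng = random.Random(seed)  # private generator, reproducible per seed
    t, pulses, errors = 0.0, 0, 0
    while t < horizon:
        # Inactive phase: the fault is dormant and no pulses occur.
        if weibull_shape is None:
            t += rng.expovariate(rate_inactive)
        else:
            t += rng.weibullvariate(1.0 / rate_inactive, weibull_shape)
        if t >= horizon:
            break
        # Active phase: a burst of pulses until the burst (or horizon) ends.
        burst_end = min(t + rng.expovariate(rate_active), horizon)
        while t < burst_end:
            pulses += 1
            if rng.random() < p_error:
                errors += 1
            t += pulse_len
    return pulses, errors


# Hypothetical parameter values, chosen only to exercise the sketch.
pulses, errors = simulate_intermittent_fault(
    horizon=3600.0, rate_inactive=0.01, rate_active=0.1,
    pulse_len=0.5, p_error=0.3)
print(pulses, errors)
```

Passing a value for weibull_shape switches only the inactive-period distribution, which is one plausible reading of how the Weibull variant differs from the exponential one.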
EXPERIMENT SETUP • Used Mobius [2] to simulate the system for 48 hours with a confidence interval of 95%. • Used the useful work measure [3] to model processor throughput over a given amount of time. • Analyzed a model of a multiprocessor running coordinated checkpointing.
SYSTEM PARAMETERS [Flowchart of the permanent-reconfiguration processor model (Program Execution, Error, Error Detection, Rollback to Checkpoint, Reconfigure?, Fine-Grained Diagnosis, Unit Shutdown-Permanent, Perf. Degradation) annotated with parameter values: Checkpoint 30sec/5-60min; 70% Accuracy; 0-35%; 0-60sec; 2sec.]
RESEARCH QUESTIONS • When should we recover from an intermittent fault by shutting down the defective component? • For errors that are tolerated by shutting down the defective component, should the shutdown be permanent or temporary?
RESULTS-DIFFERENT FAULT MODELS • Permanent/temporary reconfiguration leads to 27% more useful work than rollback-only for exponential and Weibull fault models.
RESEARCH QUESTION What is the granularity of the disabled component that maximizes the processor’s performance?
COMPONENT RANK • The maximum percentage of useful work that is lost when the component is disabled. • Example: a 4-core processor in which each core has two load-store units (LSUs), running a program that uses all 8 LSUs for 60% of the time. • Using Amdahl’s law, the LSU rank is 1/(0.4 + (0.6/0.125)) ≈ 19%.
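The rank arithmetic on this slide can be checked with a short Python sketch. It only reproduces the slide's expression 1/(0.4 + (0.6/0.125)); reading 0.6 as the fraction of time the LSUs are in use and 0.125 as 1/8 (one of eight LSUs) is an assumption, and the function name is hypothetical.

```python
def amdahl_ratio(fraction, factor):
    """Evaluate the Amdahl's-law expression 1 / ((1 - fraction) + fraction / factor)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Values from the slide: 60% LSU usage, factor 0.125 (assumed to mean 1/8).
lsu_rank = amdahl_ratio(fraction=0.6, factor=0.125)
print(f"LSU rank: {lsu_rank:.1%}")  # prints "LSU rank: 19.2%"
```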
RESULT-EFFECT OF COMPONENT RANK • For this experiment, components with a rank of 35% or more should be disabled if diagnosed with intermittent errors.
Sensitivity to Fault Rate
RESULTS-SENSITIVITY TO FAULT RATE • If lost useful work outweighs the rank of the defective component, then the defective component should be disabled.
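The rule on this slide can be written as a simple comparison. This Python sketch assumes that "lost useful work" is the fraction of useful work lost when the defective component stays enabled and errors are handled by rollback only, and that both quantities are expressed as fractions between 0 and 1. The function name and the example values are hypothetical.

```python
def choose_recovery(lost_useful_work, component_rank):
    """Return "disable" when the lost useful work exceeds the component's rank.

    Both arguments are fractions in [0, 1]. `lost_useful_work` is assumed to be
    measured with the defective component still enabled (rollback-only).
    """
    for name, value in (("lost_useful_work", lost_useful_work),
                        ("component_rank", component_rank)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return "disable" if lost_useful_work > component_rank else "rollback-only"


# Hypothetical values: a component of rank 0.19 under two different loss levels.
print(choose_recovery(lost_useful_work=0.27, component_rank=0.19))
print(choose_recovery(lost_useful_work=0.05, component_rank=0.19))
```

Because the lost useful work grows with the fault rate, the same component can fall on either side of this comparison as the fault rate changes.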
KEY FINDINGS • Error rate and the relative importance of the error location are the main factors in finding the best recovery for high intermittent failure rates. • Permanent shutdown of the defective unit results in a slight improvement of the performance compared to the temporary shutdown. [1] Eurosys, 2011. [2] Tools of MME, 2003. [3] DSN, 2005.