Trading Off Lifetime, Fault-tolerance, and Power Consumption in Real-time MPSoC
Jacopo Panerati∗, Samar Abdi†, and Giovanni Beltrame∗
∗École Polytechnique de Montréal, †Concordia University
MPSoC 2015 - Ventura Beach, CA, USA

POLYTECHNIQUE MONTRÉAL

Table of contents
1 Introduction
2 System Model
3 Methodology
4 Case Study
5 Conclusions

J. Panerati et al. – Lifetime, Fault-tolerance, Power – mistlab.ca

Motivation
• Aerospace: high frequency of Single Event Upsets
• Usually critical systems, requiring high availability
• Classical countermeasures:
  • Modular redundancy
  • Shielding
• Issues:
  • Cost
  • Extra hardware ⇒ more power ⇒ higher temperature ⇒ shorter lifetime
• What is a good trade-off?

Research Goal
• Reliability and fault-tolerance are essential for critical, autonomous systems
• We propose a methodology to quantify, and maximize, reliability in the presence of transient errors for MPSoC
• Fault-tolerance is traded off against power consumption
• We target homogeneous multi-processor systems
• Goal: keep a certain level of reliability/lifetime with varying fault rates

System Model
• Multiprocessor System-on-Chip (we're in the right place!)
• Identical processing elements (PEs) w/ private caches
• Voltage scaling: a set of operating points for each PE
[Figure: grid of processing elements PE 1,1, PE 1,2, PE 2,1, PE 2,2, ...]
Fault models
• Transient faults (SEUs) w/ data scrubbing
• Permanent faults
• Total Ionizing Dose (TID) effects

Real-Time Application Model
• A set of tasks $\tau_1, \tau_2, \ldots, \tau_m$ is executed
• Each task has a WCET associated with the slowest operating point of a PE
• The speedup is proportional to the frequency increase:
  $WCET_{OP(f_i,-)} = WCET_{OP(f_0,-)} \cdot \frac{f_0}{f_i}$
• Precedences via a Directed Acyclic Graph (DAG)
[Figure: example DAG with tasks A, B, C, D and precedences A ≺ B, B ≺ D, C ≺ D; $WCET_{OP_k}(A) = 2$, $WCET_{OP_k}(B) = 4$, $WCET_{OP_k}(C) = 7$, $WCET_{OP_k}(D) = 5$]
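The frequency-scaling rule on the Real-Time Application Model slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `scale_wcet` and the two frequencies are assumptions chosen for the example, while the base WCETs are those of the example DAG (A = 2, B = 4, C = 7, D = 5).

```python
def scale_wcet(wcet_base, f_base, f_new):
    """Scale a WCET measured at f_base to frequency f_new.

    Assumes, as on the slide, that the speedup is proportional to the
    frequency increase: WCET(f_new) = WCET(f_base) * f_base / f_new.
    """
    if f_base <= 0 or f_new <= 0:
        raise ValueError("frequencies must be positive")
    return wcet_base * f_base / f_new


# WCETs of the example DAG at the slowest operating point.
BASE_WCET = {"A": 2.0, "B": 4.0, "C": 7.0, "D": 5.0}

# Hypothetical frequencies in MHz, used for illustration only.
F_SLOW, F_FAST = 600.0, 1200.0

# Doubling the frequency halves every WCET under this model.
scaled = {task: scale_wcet(w, F_SLOW, F_FAST) for task, w in BASE_WCET.items()}
```

Under this model only the ratio of the two frequencies matters, so the unit (MHz or GHz) is irrelevant as long as it is used consistently.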

Single Event Upsets
We use probability theory to model the occurrence of faults. SEUs are caused by high-energy particles:
• Whose impacts are independent.
• Which happen at a constant average rate.
• The rate is mission phase-dependent.
The number of impacts in a scrubbing period of length T is a Poisson random variable.
[Figure: $P_{SEU}$ (0 to 1) versus average SEUs/day (0 to 100), for scrubbing periods T = 1h, T = 30', T = 10']

Permanent Faults
• We consider the most common wear-out phenomena: hot carriers, negative bias temperature instability (NBTI), time-dependent dielectric breakdown (TDDB), electromigration, and self-heating
• We hypothesize that Mean Time To Failure (MTTF) has an exponential relationship with PE load (utilization U):
  $MTTF_U \propto (MTTF_{100\%})^{U^{-1}}$
[Figure: pmf (0 to 0.3) and CDF (0 to 1) of the time to failure over 0 to 50 years, for MTTF = 1, 5, and 10 years]
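The Poisson model from the Single Event Upsets slide can be made concrete with a short Python sketch. It assumes that $P_{SEU}$ in the plot is the probability of at least one impact during a scrubbing period, which the slide does not state explicitly; the function names are hypothetical.

```python
import math


def poisson_pmf(k, rate_per_day, period_days):
    """Probability of exactly k impacts in one scrubbing period.

    rate_per_day is the average number of SEUs per day and period_days
    is the scrubbing period T expressed in days.
    """
    mu = rate_per_day * period_days  # expected impacts per period
    return math.exp(-mu) * mu**k / math.factorial(k)


def prob_at_least_one_seu(rate_per_day, period_days):
    """Probability of one or more impacts in one scrubbing period.

    Equals 1 - exp(-rate * T); expm1 keeps precision for small rate * T.
    """
    return -math.expm1(-rate_per_day * period_days)


# Example: an average of 24 SEUs/day with a scrubbing period of 1 hour
# gives an expected 1 impact per period.
p_hourly = prob_at_least_one_seu(24.0, 1.0 / 24.0)
```

Shortening the scrubbing period lowers the expected number of impacts per period, and with it the probability that an upset is present before the next scrub.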

Power Model
• Total power = sum of the power of each PE
• Standard model with capacitance, frequency, activation factor:
  $P = \alpha \cdot C \cdot V^2 \cdot f$
[Figure: dynamic power (W, 5 to 30) and voltage (V, 0.8 to 1.8) versus frequency (600 to 1,600 MHz)]
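As a minimal sketch of the power model, the following Python snippet evaluates $P = \alpha \cdot C \cdot V^2 \cdot f$ per PE and sums over the PEs. The numeric values of the activation factor, capacitance, voltage, and frequency are hypothetical placeholders, not figures taken from the slides.

```python
def dynamic_power(alpha, capacitance, voltage, frequency):
    """Dynamic power of one PE: P = alpha * C * V^2 * f (SI units)."""
    return alpha * capacitance * voltage**2 * frequency


def total_power(operating_points, alpha, capacitance):
    """Total power as the sum over PEs.

    operating_points is a list of (voltage, frequency) pairs, one per PE.
    """
    return sum(dynamic_power(alpha, capacitance, v, f)
               for v, f in operating_points)


# Hypothetical parameters, for illustration only.
ALPHA = 0.5          # activation factor
CAPACITANCE = 1e-8   # farads

# Two PEs at different hypothetical operating points (volts, hertz).
pes = [(1.0, 600e6), (1.2, 1200e6)]
p_total = total_power(pes, ALPHA, CAPACITANCE)
```

Because voltage enters quadratically, lowering the operating point of a lightly loaded PE saves more power than the frequency reduction alone would suggest.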

Methodology
Task Mapping
• Enumerate all possible mappings
• Prune the design space according to WCET and slowest operating point
• Compute the utilization for each mapping
Power, Fault-tolerance, and Lifetime Optimization
• Compute the total energy according to utilization and operating points
• The probability of a system-wide error depends exponentially on the utilizations
• Slack provides fault-tolerance
• We consider the effect of utilization on lifetime and the failure of multiple resources for lifetime optimization
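The task-mapping steps above can be sketched as a brute-force enumeration in Python. This is an illustration under stated assumptions rather than the authors' implementation: the period `PERIOD` is hypothetical, the pruning rule (drop a mapping if any PE is loaded above 100% at the slowest operating point) is one plausible reading of the slide, and DAG precedences are ignored for brevity.

```python
from itertools import product


def enumerate_mappings(tasks, n_pes):
    """Yield every assignment of tasks to PEs as a {task: pe} dict."""
    for assignment in product(range(n_pes), repeat=len(tasks)):
        yield dict(zip(tasks, assignment))


def pe_utilizations(mapping, wcet, n_pes, period):
    """Utilization of each PE: summed WCET of its tasks over the period."""
    load = [0.0] * n_pes
    for task, pe in mapping.items():
        load[pe] += wcet[task]
    return [l / period for l in load]


def feasible_mappings(wcet, n_pes, period):
    """Keep the mappings in which no PE exceeds 100% utilization."""
    tasks = sorted(wcet)
    kept = []
    for mapping in enumerate_mappings(tasks, n_pes):
        util = pe_utilizations(mapping, wcet, n_pes, period)
        if max(util) <= 1.0:
            kept.append((mapping, util))
    return kept


# WCETs at the slowest operating point (values of the example DAG).
WCET_SLOWEST = {"A": 2.0, "B": 4.0, "C": 7.0, "D": 5.0}
PERIOD = 12.0  # hypothetical period, same time unit as the WCETs

candidates = feasible_mappings(WCET_SLOWEST, n_pes=2, period=PERIOD)
```

With m tasks and n PEs the enumeration visits n^m mappings, which is affordable for a toy example but motivates the pruning step for larger systems.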

Case Study (actually a toy example)
• Dual core, four tasks, each PE has four operating points
• Implementation on a Virtex 4 board
• 16.5 faults/day in Low Earth Orbit (LEO)
• 62 faults/day in Highly Elliptical Orbit (HEO)

WCET of each task at each operating point:
Task    OP 1 (f_1 = 600 MHz)    OP 2 (f_2 = 1.2 GHz)    OP 3 (f_3 = 1.6 GHz)
A       8.0                     4.0                     3.0
B       4.0                     2.0                     1.5
C       8.0                     4.0                     3.0
D       12.0                    6.0                     4.5

Results
• Overall 29 acceptable points, 15 different points shown here
• Trade-offs for utilization (lifetime), power efficiency, or fault-tolerance

Average Utilization    Best Power Consumption    System Errors (LEO)    System Errors (HEO)
0.600                  30.00 W                   12                     42
0.650                  27.70 W                   13                     45
0.675                  26.55 W                   14                     47
0.700                  25.40 W                   15                     49
0.725                  24.25 W                   15                     50
0.800                  20.80 W                   16                     56
0.850                  27.30 W                   17                     59
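One way to read the trade-offs in the results table is to keep only the non-dominated rows. The sketch below applies a generic Pareto filter to the table values; the slides do not say that the authors select points this way, and treating all four columns as quantities to minimize (with lower utilization standing in for longer lifetime) is an assumption made for the example.

```python
# Rows of the results table:
# (average utilization, power in W, errors in LEO, errors in HEO)
ROWS = [
    (0.600, 30.00, 12, 42),
    (0.650, 27.70, 13, 45),
    (0.675, 26.55, 14, 47),
    (0.700, 25.40, 15, 49),
    (0.725, 24.25, 15, 50),
    (0.800, 20.80, 16, 56),
    (0.850, 27.30, 17, 59),
]


def dominates(a, b):
    """True if a is no worse than b in every objective and better in one.

    All objectives are assumed to be minimized.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(rows):
    """Return the rows that no other row dominates, in input order."""
    return [r for r in rows if not any(dominates(o, r) for o in rows)]


front = pareto_front(ROWS)
```

Under these assumptions the 0.850 row drops out, because the 0.675 row has lower utilization, lower power, and fewer errors in both orbits; the remaining rows each win on at least one objective.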

Results
• Design space as an n-dimensional space of utilization levels, with reliability and power consumption design points
[Figure: design points plotted by $U_{PE_1}$ (horizontal axis) and $U_{PE_2}$ (vertical axis), highlighting the points with best reliability and best power efficiency]

Conclusions
• Methodology for scheduling real-time tasks in homogeneous MPSoCs
• Energy, fault-tolerance, and lifetime-aware
Future Work
• Use a detailed temperature model instead of the utilization proxy
• Extend to the effects of interconnects
• More detailed modelling of permanent faults

The End
Questions?
http://mistlab.ca