
Trading Off Lifetime, Fault-tolerance, and Power Consumption in Real-time MPSoC
Jacopo Panerati ∗, Samar Abdi †, and Giovanni Beltrame ∗
∗ École Polytechnique de Montréal, † Concordia University
MPSoC 2015 - Ventura Beach, CA, USA

Table of contents
1 Introduction
2 System Model
3 Methodology
4 Case Study
5 Conclusions

J. Panerati et al. – Lifetime, Fault-tolerance, Power – mistlab.ca

Motivation
• Aerospace: high frequency of Single Event Upsets (SEUs)
• Usually critical systems, requiring high availability
• Classical countermeasures:
  • Modular redundancy
  • Shielding
• Issues:
  • Cost
  • Extra hardware ⇒ more power ⇒ higher temperature ⇒ shorter lifetime
• What is a good trade-off?

Research Goal
• Reliability and fault-tolerance are essential for critical, autonomous systems
• We propose a methodology to quantify, and maximize, reliability in the presence of transient errors for MPSoC
• Fault-tolerance is traded off against power consumption
• We target homogeneous multi-processor systems
• Goal: keep a certain level of reliability/lifetime with varying fault rates

System Model
• Multiprocessor System-on-Chip (we're in the right place!)
• Identical processing elements (PEs) with private caches
• Voltage scaling: a set of operating points for each PE
[Figure: grid of processing elements PE 1,1, PE 1,2, PE 2,1, PE 2,2, ...]

Fault models
• Transient faults (SEUs) with data scrubbing
• Permanent faults
• Total Ionizing Dose (TID) effects

Real-Time Application Model
• A set of tasks τ_1, τ_2, ..., τ_m is executed
• Each task has a worst-case execution time (WCET) associated with the slowest operating point of a PE
• The speedup is proportional to the frequency increase:
  $\mathrm{WCET}_{OP}(f_i, -) = \mathrm{WCET}_{OP}(f_0, -) \cdot \frac{f_0}{f_i}$
• Precedences via a Directed Acyclic Graph (DAG)
[Figure: example DAG with precedences A ≺ B, B ≺ D, C ≺ D and WCET_OPk(A) = 2, WCET_OPk(B) = 4, WCET_OPk(C) = 7, WCET_OPk(D) = 5]
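The frequency-scaling rule and the example DAG from the Real-Time Application Model slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the longest-path computation is an added observation (the shortest possible completion time if PEs were unlimited), not something the slides state.

```python
def scale_wcet(wcet_f0, f0_mhz, fi_mhz):
    """WCET at frequency f_i, given the WCET at the slowest frequency f_0.

    Implements WCET(f_i) = WCET(f_0) * f_0 / f_i from the slide.
    """
    return wcet_f0 * f0_mhz / fi_mhz


# Example DAG from the slide: A before B, B before D, C before D.
WCET = {"A": 2.0, "B": 4.0, "C": 7.0, "D": 5.0}
PREDECESSORS = {"A": [], "B": ["A"], "C": [], "D": ["B", "C"]}


def earliest_finish(task, wcet=WCET, preds=PREDECESSORS):
    """Earliest finish time of a task when every predecessor must finish first.

    Assumes unlimited PEs, so only the precedence constraints matter.
    """
    start = max((earliest_finish(p, wcet, preds) for p in preds[task]), default=0.0)
    return start + wcet[task]


# Length of the longest precedence chain (here C -> D).
critical_path = max(earliest_finish(t) for t in WCET)

if __name__ == "__main__":
    print(scale_wcet(8.0, 600, 1600))
    print(critical_path)
```

With only two PEs, as in the case study, the achievable schedule length also depends on the mapping, so the longest chain is just a lower bound.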

Single Event Upsets
We use probability theory to model the occurrence of faults. SEUs are caused by high-energy particles:
• Whose impacts are independent.
• Which happen at a constant average rate.
• The rate is mission phase-dependent.
The number of impacts in a scrubbing period of length T is a Poisson random variable.
[Figure: P_SEU versus average SEUs/day, for scrubbing periods T = 1 h, T = 30 min, and T = 10 min]

Permanent Faults
• We consider the most common wear-out phenomena: hot carriers, negative bias temperature instability (NBTI), time-dependent dielectric breakdown (TDDB), electromigration, and self-heating
• We hypothesize that the Mean Time To Failure (MTTF) has an exponential relationship with PE load (utilization U):
  $\mathrm{MTTF}_U \propto (\mathrm{MTTF}_{100\%})^{U^{-1}}$
[Figure: pmf and CDF over 0 to 50 years, for MTTF = 1, 5, and 10 years]
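The Poisson model from the Single Event Upsets slide can be sketched as follows. The sketch assumes that P_SEU denotes the probability of at least one upset within a scrubbing period, which the slides do not spell out; the function names are hypothetical.

```python
import math


def expected_seus(rate_per_day, scrub_period_hours):
    """Mean number of SEUs in one scrubbing period (the Poisson parameter)."""
    return rate_per_day * scrub_period_hours / 24.0


def poisson_pmf(k, rate_per_day, scrub_period_hours):
    """Probability of exactly k SEUs in one scrubbing period."""
    mean = expected_seus(rate_per_day, scrub_period_hours)
    return math.exp(-mean) * mean ** k / math.factorial(k)


def p_at_least_one_seu(rate_per_day, scrub_period_hours):
    """Probability of one or more SEUs in one scrubbing period.

    Assumption: this is what the slide's P_SEU curve plots.
    """
    return 1.0 - math.exp(-expected_seus(rate_per_day, scrub_period_hours))


if __name__ == "__main__":
    # Scrubbing periods from the figure legend: 1 h, 30 min, 10 min.
    for hours in (1.0, 0.5, 1.0 / 6.0):
        print(hours, p_at_least_one_seu(20.0, hours))
```

Under this reading, shortening the scrubbing period lowers the chance that an upset is present before the next scrub, at the cost of scrubbing more often.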

Power Model
• Total power = sum of the power of each PE
• Standard dynamic power model with activation factor, capacitance, voltage, and frequency:
  $P = \alpha \cdot C \cdot V^2 \cdot f$
[Figure: dynamic power (W) and voltage (V) versus frequency (MHz), from 600 to 1,600 MHz]
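A minimal sketch of the power model, summing P = α · C · V² · f over the PEs. The class and function names are hypothetical, and the numeric values for α, C, and V below are placeholders chosen for easy arithmetic, not figures from the slides.

```python
from dataclasses import dataclass


@dataclass
class OperatingPoint:
    """One voltage/frequency pair a PE can run at."""
    freq_hz: float
    voltage_v: float


def dynamic_power(op, alpha, capacitance_f):
    """Dynamic power of one PE in watts: P = alpha * C * V^2 * f."""
    return alpha * capacitance_f * op.voltage_v ** 2 * op.freq_hz


def total_power(ops, alpha, capacitance_f):
    """Total power as the sum over the operating points chosen for each PE."""
    return sum(dynamic_power(op, alpha, capacitance_f) for op in ops)


if __name__ == "__main__":
    # Placeholder values (assumptions): alpha = 0.5, C = 1 nF.
    slow = OperatingPoint(freq_hz=600e6, voltage_v=1.0)
    fast = OperatingPoint(freq_hz=1.6e9, voltage_v=1.5)
    print(total_power([slow, fast], alpha=0.5, capacitance_f=1e-9))
```

Because voltage enters quadratically and usually has to rise with frequency, the faster operating points cost disproportionately more power, which is what makes the choice of operating point a trade-off.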

Methodology
Task Mapping
• Enumerate all possible mappings
• Prune the design space according to WCET and slowest operating point
• Compute the utilization for each mapping
Power, Fault-tolerance, and Lifetime Optimization
• Compute the total energy according to utilization and operating points
• Utilization affects the probability of a system-wide error exponentially
• Slack provides fault-tolerance
• We consider the effect of utilization on lifetime and the failure of multiple resources for lifetime optimization
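The task-mapping steps above can be sketched as a brute-force enumeration. This is a simplified illustration, not the authors' implementation: it ignores DAG precedences, treats a mapping as feasible when no PE exceeds 100% utilization at the slowest operating point, and uses a hypothetical period of 20 time units. The WCETs are the 600 MHz values from the case study.

```python
from itertools import product

# WCETs at the slowest operating point (600 MHz column of the case study).
WCET_SLOWEST = {"A": 8.0, "B": 4.0, "C": 8.0, "D": 12.0}
N_PES = 2
PERIOD = 20.0  # hypothetical period/deadline, not given in the slides


def utilizations(mapping, wcet, n_pes, period):
    """Per-PE utilization: total WCET mapped to the PE divided by the period."""
    load = [0.0] * n_pes
    for task, pe in mapping.items():
        load[pe] += wcet[task]
    return [l / period for l in load]


def feasible_mappings(wcet, n_pes, period):
    """Enumerate every task-to-PE mapping and keep those that fit the period."""
    tasks = sorted(wcet)
    kept = []
    for assignment in product(range(n_pes), repeat=len(tasks)):
        mapping = dict(zip(tasks, assignment))
        util = utilizations(mapping, wcet, n_pes, period)
        if max(util) <= 1.0:  # prune overloaded mappings
            kept.append((mapping, util))
    return kept


if __name__ == "__main__":
    for mapping, util in feasible_mappings(WCET_SLOWEST, N_PES, PERIOD):
        print(mapping, util)
```

Exhaustive enumeration grows as (number of PEs) to the power of (number of tasks), so it is only practical for small instances such as the dual-core, four-task case study.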

Case Study (actually a toy example)
• Dual core, four tasks, each PE has four operating points
• Implementation on a Virtex 4 board
• 16.5 faults/day in Low Earth Orbit (LEO)
• 62 faults/day in Highly Elliptical Orbit (HEO)

Task WCET per operating point:
Task   OP 1 (f1 = 600 MHz)   OP 2 (f2 = 1.2 GHz)   OP 3 (f3 = 1.6 GHz)
A      8.0                   4.0                   3.0
B      4.0                   2.0                   1.5
C      8.0                   4.0                   3.0
D      12.0                  6.0                   4.5

Results
• Overall 29 acceptable points, 15 different points shown here
• Trade-offs for utilization (lifetime), power efficiency, or fault-tolerance

Average Utilization   Best Power Consumption   System Errors (LEO)   System Errors (HEO)
0.600                 30.00 W                  12                    42
0.650                 27.70 W                  13                    45
0.675                 26.55 W                  14                    47
0.700                 25.40 W                  15                    49
0.725                 24.25 W                  15                    50
0.800                 20.80 W                  16                    56
0.850                 27.30 W                  17                    59
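One way to read the trade-off in the results table is to filter out rows that are worse than another row in both power and errors. The sketch below does this for power and LEO errors only; it is an illustrative post-processing step, not part of the presented methodology, and it deliberately leaves utilization (the lifetime proxy) out of the comparison.

```python
# Rows from the results table:
# (average utilization, power in W, LEO errors, HEO errors)
RESULTS = [
    (0.600, 30.00, 12, 42),
    (0.650, 27.70, 13, 45),
    (0.675, 26.55, 14, 47),
    (0.700, 25.40, 15, 49),
    (0.725, 24.25, 15, 50),
    (0.800, 20.80, 16, 56),
    (0.850, 27.30, 17, 59),
]


def dominates(a, b):
    """True if a is no worse than b in power and LEO errors, and better in one."""
    no_worse = a[1] <= b[1] and a[2] <= b[2]
    better = a[1] < b[1] or a[2] < b[2]
    return no_worse and better


def pareto_front(rows):
    """Rows that no other row dominates (lower power and fewer errors are better)."""
    return [r for r in rows if not any(dominates(o, r) for o in rows)]


front = pareto_front(RESULTS)
lowest_power = min(RESULTS, key=lambda r: r[1])
fewest_errors = min(RESULTS, key=lambda r: r[2])

if __name__ == "__main__":
    for row in front:
        print(row)
```

If utilization is added as a third objective, the 0.700 row is no longer dominated, since it trades slightly higher power for lower utilization than the 0.725 row.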

Results
• Design space as an n-dimensional space of utilization levels, with reliability and power consumption design points
[Figure: design points plotted by U_PE1 (x-axis) and U_PE2 (y-axis), with the best-reliability and best-power-efficiency points marked]

Conclusions
• Methodology for scheduling real-time tasks in homogeneous MPSoCs
• Energy, fault-tolerance, and lifetime-aware
Future Work
• Use a detailed temperature model instead of the utilization proxy
• Extend to the effects of interconnects
• More detailed modelling of permanent faults

The End
Questions? http://mistlab.ca
