EMT 368 Reliability and Testability in Integrated Circuit Design – School of Microelectronic Engineering, UniMAP – A. Harun
Course content • Reliability and availability concept • Robust design principle • Time and failure dependent reliability • Estimation methods of the parameters of failure time distribution • Parametric reliability model • Overview of testing • Ad-hoc techniques • Scan-path design • Boundary scan testing • Built-in self test (BIST)
CHAPTER 2 Robust Design Principle
Chapter 2 – Robust Design Principle • Unit of design • Failure recovery groups • Redundancy • Robust design principles • Robust protocols • Robust concurrency controls • Overload control • Process, resource and throughput monitoring • Data auditing • Fault correlation • Failed error detection, isolation or recovery • Geographic redundancy • Security, availability and system robustness • Error detection
Robust design principle • 2.1 Unit of design – HW and SW are organized into small components or modules. – The system architecture or design defines how these components come together to form the system. – Each module can be thought of as a logical container: • It accepts logical input; on success it returns correct output, and on failure or inconsistency it provides an error or exception. • If a major fault occurs, it may hang or become unresponsive. – Modules are organized into a hierarchical design.
Robust design principle
Robust design principle • 2.1 Unit of design – Logical containers from largest to smallest (networked application): • Application • User session • Message request – a protocol message or request; any error found here is contained within this container • Transaction • Robust exception handling • Subroutine – a natural fault container
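The "natural fault container" idea can be shown with a minimal sketch (not from the slides): a subroutine that catches its own errors at its boundary and reports them to the caller, instead of letting them escape into the containing session or application. The request fields and error types here are illustrative assumptions.

```python
# Minimal sketch (not from the slides): a subroutine as a natural fault container.
# Errors raised inside handle_request() are caught at its boundary and reported
# to the caller as an error response rather than escaping upward.
def handle_request(request: dict) -> dict:
    try:
        result = 100 / request["value"]          # work that may fail
        return {"status": "ok", "result": result}
    except (KeyError, ZeroDivisionError) as exc:
        # Fault contained: provide an error/exception result to the caller.
        return {"status": "error", "reason": str(exc)}
```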
Robust design principle • 2.1 Unit of design – Logical containers from largest to smallest (HW and platform SW): • System – recovery may require restarting the entire system • Field-replaceable unit (FRU) – modular, e.g. a blade in a blade server • Processor • Process • Thread
Robust design principle • 2.2 Failure recovery groups – A unit reports a failure to its containing unit, or the containing unit implicitly detects the failure from errant behavior. – What to do? • Restart the errant application • Restart the entire operating system • Highly available SW supports smaller recovery groups, e.g. terminating a session, restarting a process, etc. – Failure recovery groups are suites of logical entities that are designed and tested to be recoverable while the remainder of the system remains operational. – The most common failure recovery group is the SW process, e.g. a browser or word processor.
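A minimal sketch (not from the slides) of treating a software process as a failure recovery group: a supervisor implicitly detects that the process has died and restarts only that process, while the rest of the system keeps running. The worker() task is a hypothetical placeholder.

```python
# Minimal sketch (not from the slides): restart only the failed process,
# leaving the remainder of the system operational.
import multiprocessing as mp
import time

def worker():
    # Hypothetical long-running task; a crash here affects only this process.
    while True:
        time.sleep(1)

def supervise():
    proc = mp.Process(target=worker)
    proc.start()
    while True:
        proc.join(timeout=5)            # implicitly detect failure via process exit
        if not proc.is_alive():
            # Recovery is limited to this group: restart the failed process only.
            proc = mp.Process(target=worker)
            proc.start()

if __name__ == "__main__":
    supervise()
```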
Robust design principle • 2.3 Redundancy – Systems deploy redundancy to increase throughput or capacity, e.g. multiple processor cores or memory modules on a processor board. – Redundancy also increases service availability, e.g. multiple engines on an airplane. – Redundancy in computer-based systems is implemented at three levels: • Process – multiple processes prepared in advance • FRU – e.g. compute-blade FRUs in a blade server • Network element – e.g. additional DNS servers
Robust design principle • 2.3 Redundancy – Redundant units are typically organized into one of two common arrangements: • Active-standby – one unit serving and one on standby – Hot, warm and cold are terms used to characterize standby readiness » Cold standby – application SW or OS must be restarted » Warm standby – application SW running, volatile data periodically synchronized; time is needed to rebuild system state » Hot standby – application SW running, volatile data current • Load shared – all operational units actively serve users – N = number of units required, K = number of redundant units configured – "N + K" load sharing – e.g. commercial airplane engines are N + 1
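A minimal sketch (not from the slides) of why N + K load sharing helps: assuming independent unit failures, the pool is up when at least N of the N + K units are operational, which follows a binomial model. The per-unit availability and the example numbers below are illustrative assumptions only.

```python
# Minimal sketch (not from the slides): availability of an N + K load-shared
# pool, assuming independent unit failures and per-unit availability p.
from math import comb

def pool_availability(n_required: int, k_redundant: int, p_unit: float) -> float:
    """Probability that at least n_required of the n_required + k_redundant
    units are operational (binomial model)."""
    total = n_required + k_redundant
    return sum(
        comb(total, up) * p_unit**up * (1 - p_unit)**(total - up)
        for up in range(n_required, total + 1)
    )

# Illustrative numbers only: 4 units required, 1 redundant (N + 1), each 99% available.
print(f"{pool_availability(4, 1, 0.99):.6f}")  # ~0.999020, vs 0.99**4 ~ 0.9606 with no spare
```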
Robust design principle • 2.3 Redundancy – high-availability middleware – Recovering service onto a redundant unit; failure recovery should be fast, with little or no impact on users. – Practical systems may use some of these high-availability mechanisms: • IP networking mechanisms – balance network load across a cluster of servers • Clustering – two or more computers arranged into a pool • High-availability middleware – infrastructure to support synchronization, data sharing, monitoring and management of applications • Application checkpoint mechanisms – save state so service can be restored after restart • Virtual machines • Redundant array of inexpensive disks (RAID) – arrange multiple HDDs, e.g. as mirrors • Database redundancy and replication • File system replication
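A minimal sketch (not from the slides) of an application checkpoint mechanism: volatile state is periodically written to disk so a restarted (or standby) instance can restore it. The state dictionary, file name and use of pickle are illustrative assumptions, not a specific middleware API.

```python
# Minimal sketch (not from the slides): periodic checkpointing of application
# state so service can be restored after a restart or failover.
import os
import pickle

CHECKPOINT_FILE = "app_state.ckpt"   # illustrative path

def save_checkpoint(state: dict) -> None:
    # Write to a temporary file, then rename atomically, so a crash mid-write
    # cannot corrupt the last good checkpoint.
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

def restore_checkpoint() -> dict:
    # On restart, resume from the last checkpoint (or start with empty state).
    if not os.path.exists(CHECKPOINT_FILE):
        return {}
    with open(CHECKPOINT_FILE, "rb") as f:
        return pickle.load(f)
```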
Robust design principle • 2.4 Robust design principles – Robust design principles to consider: • Redundant, fault-tolerant design • No single point of failure • No single point of repair • Hot-swappable FRUs – Systems with no downtime for planned activities should also consider the following principles: • No service impact for SW patches, updates and upgrades • No service impact for HW growth or degrowth • Minimal impact for system reconfiguration
Robust design principle • 2.5 Robust protocols – Application protocols can be made robust by: • Using reliable transport protocols • Using confirmations or acknowledgements • Supporting atomic requests or transactions • Supporting timeouts and message retries • Using heartbeat or keep-alive mechanisms • Using stateless or minimal shared-state protocols • Supporting automatic reconnection
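A minimal sketch (not from the slides) combining several of these techniques: send a request, wait for a positive acknowledgement, and retry with back-off on timeout. The send_request callable, the "ACK" reply and ProtocolError are hypothetical placeholders for a real protocol stack.

```python
# Minimal sketch (not from the slides): timeout + retry + acknowledgement.
import time

class ProtocolError(Exception):
    pass

def send_with_retries(send_request, payload, retries=3, timeout_s=2.0, backoff_s=1.0):
    """Send a request, wait for a positive acknowledgement, and retry on
    timeout or failure; raise after the retry budget is exhausted."""
    for attempt in range(1, retries + 1):
        try:
            reply = send_request(payload, timeout=timeout_s)  # hypothetical transport call
            if reply == "ACK":                                # confirmation from the peer
                return reply
        except TimeoutError:
            pass                                              # fall through and retry
        time.sleep(backoff_s * attempt)                       # simple linear back-off
    raise ProtocolError(f"no acknowledgement after {retries} attempts")
```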
Robust design principle • 2.6 Robust concurrency controls – Concurrency controls enable applications to share resources efficiently across many simultaneous users. – Systems may share processor time, buffers, etc. – Access to critical sections that control shared resources must be serialized. – Two applications cannot be allowed to access the same portion of shared memory or the same resource pool at the same time. – Platform mechanisms such as semaphores and mutual-exclusion locks are needed for this control. – Applications should also ensure that a failed process can be restarted without restarting the entire system. – Concurrency controls held by a failed process must be reclaimed to avoid standing deadlocks.
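A minimal sketch (not from the slides): a mutual-exclusion lock serializes access to a shared resource, and an acquire timeout keeps a worker from deadlocking forever if the lock holder has failed. The resource, timeout value and error handling are illustrative assumptions.

```python
# Minimal sketch (not from the slides): serialized access to a shared resource
# with a mutual-exclusion lock, plus an acquire timeout to avoid standing deadlock.
import threading

shared_pool = []                 # shared resource two workers must not touch at once
pool_lock = threading.Lock()     # mutual-exclusion lock guarding the critical section

def add_to_pool(item, timeout_s=5.0):
    if not pool_lock.acquire(timeout=timeout_s):
        # Could not enter the critical section in time: report the error so a
        # recovery action can be taken instead of blocking indefinitely.
        raise TimeoutError("lock not acquired; possible stuck or failed holder")
    try:
        shared_pool.append(item)  # critical section: serialized access
    finally:
        pool_lock.release()       # always release so other workers are not blocked
```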
Robust design principle • 2.7 Overload control – Any implemented system has physical HW constraints, • e.g. processing power, storage, I/O bandwidth – These constraints translate into capacity limits under acceptable QoS. – When demand for service exceeds that capacity, the system cannot serve all requests. – Overload control is needed to gracefully manage traffic that exceeds the engineered capacity.
Robust design principle • 2.7 Overload control – Causes of system overload: • Unexpected popularity • Under-engineered system • Incorrectly configured system • External events – promotions, New Year's Eve, etc. • Power outage and restoration – a spike of reconnections if reconnection is automated • Network equipment failure and restoration • System failure – when service is distributed across multiple systems, one failure shifts workload onto the others • Denial-of-service attack – cyber vandalism, ransom
Robust design principle • 2.7 Overload control – Two elements of overload control: • Control mechanisms – shed load or traffic • Control triggers – activate the control mechanism when congestion occurs – deactivate it after congestion has ended
Robust design principle • 2.7 Overload control – Congestion detection techniques: • Slower system response times • Longer work queues • Higher CPU utilization – indicates high system stress, but not necessarily overload – Congestion control mechanisms: • Rejecting new sessions – return a "too busy" error • Rejecting new message requests – reject all traffic from certain users, certain message types, etc. • Disconnecting live sessions – lower-priority users first • Disabling servers or services – close some or all IP ports on the overloaded server
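A minimal sketch (not from the slides) of a congestion trigger plus control mechanism: queue depth is the detection signal, new requests are rejected with a "too busy" response while congested, and hysteresis (a lower deactivation threshold) turns the control off cleanly. The thresholds and response strings are illustrative assumptions, not engineered values.

```python
# Minimal sketch (not from the slides): queue-length trigger with hysteresis,
# shedding new requests while the system is congested.
from collections import deque

class OverloadController:
    def __init__(self, activate_at=100, deactivate_at=60):
        self.queue = deque()
        self.activate_at = activate_at      # trigger: congestion detected
        self.deactivate_at = deactivate_at  # lower threshold so control turns off cleanly
        self.congested = False

    def offer(self, request):
        depth = len(self.queue)
        if not self.congested and depth >= self.activate_at:
            self.congested = True           # activate the control mechanism
        elif self.congested and depth <= self.deactivate_at:
            self.congested = False          # deactivate after congestion ends
        if self.congested:
            return "503 too busy"           # shed load: reject the new request
        self.queue.append(request)
        return "accepted"
```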
Robust design principle • 2.7 Overload control – Architectural considerations: • System work should fall into three broad priority classes: – Low priority » Tasks that do not directly impact users: maintenance and background tasks (e.g. backups, audits) – Medium priority » Tasks that directly or indirectly interact with end users – High priority » Management visibility and control tasks, e.g. overload control itself • As the system saturates, low-priority work is deferred in favor of higher-priority work.
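A minimal sketch (not from the slides) of deferring low-priority work under saturation: all work goes into one priority queue, and each cycle only a limited budget of the highest-priority items is served, so background tasks naturally wait when the system is busy. The priority values, budget and task names are illustrative assumptions.

```python
# Minimal sketch (not from the slides): three priority classes in one work queue;
# when the per-cycle budget is small (saturation), low-priority work is deferred.
import heapq

HIGH, MEDIUM, LOW = 0, 1, 2   # smaller number = served first

def drain(queue, budget):
    """Serve at most `budget` items this cycle, highest priority first;
    remaining (typically low-priority) work waits for the next cycle."""
    served = []
    for _ in range(min(budget, len(queue))):
        _, task = heapq.heappop(queue)
        served.append(task)
    return served

work = []
heapq.heappush(work, (LOW, "nightly backup"))
heapq.heappush(work, (HIGH, "overload-control command"))
heapq.heappush(work, (MEDIUM, "user request"))
print(drain(work, budget=2))   # ['overload-control command', 'user request']
```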
Robust design principle • 2.8 Process, resource and throughput monitoring – Some errors may not be immediately visible during normal system operation, so they must be detected before they become critical failures. – Mechanisms to proactively monitor system health: • Heartbeat checks of critical processes – ensure they are sane enough to respond within a reasonable time • Resource usage checks – process size, free space, CPU usage • Data audits • Monitoring of system throughput, performance and alarm behavior • Health checks of critical supporting systems – hello, keep-alive, status queries – These checks normally run as low-priority processes, but the master control process should run at a higher priority.
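A minimal sketch (not from the slides) of a low-rate monitoring loop combining a heartbeat check of a critical process with a resource usage check. The process.is_alive()/restart() hooks, the alarm callback and the thresholds are hypothetical placeholders for a real platform's monitoring interfaces.

```python
# Minimal sketch (not from the slides): proactive health monitoring loop.
import shutil
import time

def monitor(process, alarm, period_s=30, min_free_bytes=1 << 30):
    """Periodically heartbeat a critical process and check free disk space,
    raising an alarm (and triggering recovery) before a critical failure."""
    while True:
        if not process.is_alive():                # heartbeat / sanity check (hypothetical hook)
            alarm("process unresponsive")
            process.restart()                     # proactive recovery action (hypothetical hook)
        free = shutil.disk_usage("/").free        # resource usage check
        if free < min_free_bytes:
            alarm(f"low disk space: {free} bytes free")
        time.sleep(period_s)                      # run at low rate / low priority
```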