EMT 368 Reliability and Testability in Integrated Circuit Design – School of Microelectronic Engineering, UniMAP – A. Harun
Course content • Reliability and availability concept • Robust design principle • Time and failure dependent reliability • Estimation methods of the parameters of failure time distribution • Parametric reliability model • Overview of testing • Ad-hoc techniques • Scan-path design • Boundary scan testing • Built-in self test (BIST)
CHAPTER 2 Robust Design Principle
Chapter 2 – Robust Design Principle • Unit of design • Failure recovery groups • Redundancy • Robust design principles • Robust protocols • Robust concurrency controls • Overload control • Process, resource and throughput monitoring • Data auditing • Fault correlation • Failed error detection, isolation or recovery • Geographic redundancy • Security, availability and system robustness • Error detection
Robust design principle • 2.1 Unit of design – HW and SW are organized into small components or modules. – The system architecture or design defines how these components come together to form the system. – Each module can be thought of as a logical container: • It accepts logical input; on success it returns correct output, and on failure or inconsistency it provides an error or exception. • If a major fault occurs, it may hang or become unresponsive. – Modules are organized into a hierarchical design.
Robust design principle
Robust design principle • 2.1 Unit of design – Logical containers from largest to smallest (networked application): • Application • User session • Message request – a protocol message or request; any error found here is contained within this container • Transaction • Robust exception handling • Subroutine – a natural fault container
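The "natural fault container" idea can be shown with a minimal sketch (not from the slides): a subroutine that catches its own errors at its boundary and reports them to the caller, instead of letting them escape into the containing session or application. The request fields and error types here are illustrative assumptions.

```python
# Minimal sketch (not from the slides): a subroutine as a natural fault container.
# Errors raised inside handle_request() are caught at its boundary and reported
# to the caller as an error response rather than escaping upward.
def handle_request(request: dict) -> dict:
    try:
        result = 100 / request["value"]          # work that may fail
        return {"status": "ok", "result": result}
    except (KeyError, ZeroDivisionError) as exc:
        # Fault contained: provide an error/exception result to the caller.
        return {"status": "error", "reason": str(exc)}
```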
Robust design principle • 2.1 Unit of design – Logical containers from largest to smallest (HW and platform SW): • System – recovery may require restarting the entire system • Field-replaceable unit (FRU) – modular, e.g. a blade in a blade server • Processor • Process • Thread
Robust design principle • 2.2 Failure recovery groups – A unit reports a failure to its containing unit, or the containing unit implicitly detects the failure from errant behavior. – What to do? • Restart the errant application • Restart the entire operating system • Highly available SW supports smaller recovery groups, e.g. terminating a session, restarting a process, etc. – Failure recovery groups are suites of logical entities that are designed and tested to be recoverable while the remainder of the system remains operational. – The most common failure recovery group is the SW process, e.g. a browser or word processor.
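A minimal sketch (not from the slides) of treating a software process as a failure recovery group: a supervisor implicitly detects that the process has died and restarts only that process, while the rest of the system keeps running. The worker() task is a hypothetical placeholder.

```python
# Minimal sketch (not from the slides): restart only the failed process,
# leaving the remainder of the system operational.
import multiprocessing as mp
import time

def worker():
    # Hypothetical long-running task; a crash here affects only this process.
    while True:
        time.sleep(1)

def supervise():
    proc = mp.Process(target=worker)
    proc.start()
    while True:
        proc.join(timeout=5)            # implicitly detect failure via process exit
        if not proc.is_alive():
            # Recovery is limited to this group: restart the failed process only.
            proc = mp.Process(target=worker)
            proc.start()

if __name__ == "__main__":
    supervise()
```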
Robust design principle • 2.3 Redundancy – Systems deploy redundancy to increase throughput or capacity, e.g. multiple processor cores or memory modules on a processor board. – Redundancy also increases service availability, e.g. multiple engines on an airplane. – Redundancy in computer-based systems is implemented at three levels: • Process – multiple processes prepared in advance • FRU – e.g. compute-blade FRUs in a blade server • Network element – e.g. additional DNS servers
Robust design principle • 2.3 Redundancy – Redundant units are typically organized into one of two common arrangements: • Active-standby – one unit serving and one on standby – Hot, warm and cold are terms used to characterize standby readiness » Cold standby – application SW or OS must be restarted » Warm standby – application SW running, volatile data periodically synchronized; time is needed to rebuild system state » Hot standby – application SW running, volatile data current • Load shared – all operational units actively serve users – N = number of units required, K = number of redundant units configured – "N + K" load sharing – e.g. commercial airplane engines are N + 1
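A minimal sketch (not from the slides) of why N + K load sharing helps: assuming independent unit failures, the pool is up when at least N of the N + K units are operational, which follows a binomial model. The per-unit availability and the example numbers below are illustrative assumptions only.

```python
# Minimal sketch (not from the slides): availability of an N + K load-shared
# pool, assuming independent unit failures and per-unit availability p.
from math import comb

def pool_availability(n_required: int, k_redundant: int, p_unit: float) -> float:
    """Probability that at least n_required of the n_required + k_redundant
    units are operational (binomial model)."""
    total = n_required + k_redundant
    return sum(
        comb(total, up) * p_unit**up * (1 - p_unit)**(total - up)
        for up in range(n_required, total + 1)
    )

# Illustrative numbers only: 4 units required, 1 redundant (N + 1), each 99% available.
print(f"{pool_availability(4, 1, 0.99):.6f}")  # ~0.999020, vs 0.99**4 ~ 0.9606 with no spare
```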
Robust design principle • 2.3 Redundancy – high-availability middleware – Recovering service onto a redundant unit; failure recovery should be fast, with little or no impact on users. – Practical systems may use some of these high-availability mechanisms: • IP networking mechanisms – balance network load across a cluster of servers • Clustering – two or more computers arranged into a pool • High-availability middleware – infrastructure to support synchronization, data sharing, monitoring and management of applications • Application checkpoint mechanisms – save state so service can be restored after restart • Virtual machines • Redundant array of inexpensive disks (RAID) – arrange multiple HDDs, e.g. as mirrors • Database redundancy and replication • File system replication
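A minimal sketch (not from the slides) of an application checkpoint mechanism: volatile state is periodically written to disk so a restarted (or standby) instance can restore it. The state dictionary, file name and use of pickle are illustrative assumptions, not a specific middleware API.

```python
# Minimal sketch (not from the slides): periodic checkpointing of application
# state so service can be restored after a restart or failover.
import os
import pickle

CHECKPOINT_FILE = "app_state.ckpt"   # illustrative path

def save_checkpoint(state: dict) -> None:
    # Write to a temporary file, then rename atomically, so a crash mid-write
    # cannot corrupt the last good checkpoint.
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)

def restore_checkpoint() -> dict:
    # On restart, resume from the last checkpoint (or start with empty state).
    if not os.path.exists(CHECKPOINT_FILE):
        return {}
    with open(CHECKPOINT_FILE, "rb") as f:
        return pickle.load(f)
```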
Robust design principle • 2.4 Robust design principles – Robust design principles to consider: • Redundant, fault-tolerant design • No single point of failure • No single point of repair • Hot-swappable FRUs – Systems with no downtime for planned activities should also consider the following principles: • No service impact for SW patches, updates and upgrades • No service impact for HW growth or degrowth • Minimal impact for system reconfiguration
Robust design principle • 2.5 Robust protocols – Application protocols can be made robust by: • Using reliable transport protocols • Using confirmations or acknowledgements • Supporting atomic requests or transactions • Supporting timeouts and message retries • Using heartbeat or keep-alive mechanisms • Using stateless or minimal shared-state protocols • Supporting automatic reconnection
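A minimal sketch (not from the slides) combining several of these techniques: send a request, wait for a positive acknowledgement, and retry with back-off on timeout. The send_request callable, the "ACK" reply and ProtocolError are hypothetical placeholders for a real protocol stack.

```python
# Minimal sketch (not from the slides): timeout + retry + acknowledgement.
import time

class ProtocolError(Exception):
    pass

def send_with_retries(send_request, payload, retries=3, timeout_s=2.0, backoff_s=1.0):
    """Send a request, wait for a positive acknowledgement, and retry on
    timeout or failure; raise after the retry budget is exhausted."""
    for attempt in range(1, retries + 1):
        try:
            reply = send_request(payload, timeout=timeout_s)  # hypothetical transport call
            if reply == "ACK":                                # confirmation from the peer
                return reply
        except TimeoutError:
            pass                                              # fall through and retry
        time.sleep(backoff_s * attempt)                       # simple linear back-off
    raise ProtocolError(f"no acknowledgement after {retries} attempts")
```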
Robust design principle • 2.6 Robust concurrency controls – Concurrency controls enable applications to share resources efficiently across many simultaneous users. – Systems may share processor time, buffers, etc. – Access to critical sections that control shared resources must be serialized. – Two applications cannot be allowed to access the same portion of shared memory or the same resource pool at the same time. – Platform mechanisms such as semaphores and mutual-exclusion locks are needed for this control. – Applications should also ensure that a failed process can be restarted without restarting the entire system. – Concurrency controls held by a failed process must be reclaimed to avoid standing deadlocks.
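A minimal sketch (not from the slides): a mutual-exclusion lock serializes access to a shared resource, and an acquire timeout keeps a worker from deadlocking forever if the lock holder has failed. The resource, timeout value and error handling are illustrative assumptions.

```python
# Minimal sketch (not from the slides): serialized access to a shared resource
# with a mutual-exclusion lock, plus an acquire timeout to avoid standing deadlock.
import threading

shared_pool = []                 # shared resource two workers must not touch at once
pool_lock = threading.Lock()     # mutual-exclusion lock guarding the critical section

def add_to_pool(item, timeout_s=5.0):
    if not pool_lock.acquire(timeout=timeout_s):
        # Could not enter the critical section in time: report the error so a
        # recovery action can be taken instead of blocking indefinitely.
        raise TimeoutError("lock not acquired; possible stuck or failed holder")
    try:
        shared_pool.append(item)  # critical section: serialized access
    finally:
        pool_lock.release()       # always release so other workers are not blocked
```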
Robust design principle • 2.7 Overload control – Any implemented system has physical HW constraints, • e.g. processing power, storage, I/O bandwidth – These constraints translate into capacity limits under acceptable QoS. – When demand for service exceeds that capacity, the system cannot serve all requests. – Overload control is needed to gracefully manage traffic that exceeds the engineered capacity.
Robust design principle • 2.7 Overload control – Causes of system overload: • Unexpected popularity • Under-engineered system • Incorrectly configured system • External events – promotions, New Year's Eve, etc. • Power outage and restoration – a spike of reconnections if reconnection is automated • Network equipment failure and restoration • System failure – when service is distributed across multiple systems, one failure shifts workload onto the others • Denial-of-service attack – cyber vandalism, ransom
Robust design principle • 2.7 Overload control – Two elements of overload control: • Control mechanisms – shed load or traffic • Control triggers – activate the control mechanism when congestion occurs – deactivate it after congestion has ended
Robust design principle • 2.7 Overload control – Congestion detection techniques: • Slower system response times • Longer work queues • Higher CPU utilization – indicates high system stress, but not necessarily overload – Congestion control mechanisms: • Rejecting new sessions – return a "too busy" error • Rejecting new message requests – reject all traffic from certain users, certain message types, etc. • Disconnecting live sessions – lower-priority users first • Disabling servers or services – close some or all IP ports on the overloaded server
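A minimal sketch (not from the slides) of a congestion trigger plus control mechanism: queue depth is the detection signal, new requests are rejected with a "too busy" response while congested, and hysteresis (a lower deactivation threshold) turns the control off cleanly. The thresholds and response strings are illustrative assumptions, not engineered values.

```python
# Minimal sketch (not from the slides): queue-length trigger with hysteresis,
# shedding new requests while the system is congested.
from collections import deque

class OverloadController:
    def __init__(self, activate_at=100, deactivate_at=60):
        self.queue = deque()
        self.activate_at = activate_at      # trigger: congestion detected
        self.deactivate_at = deactivate_at  # lower threshold so control turns off cleanly
        self.congested = False

    def offer(self, request):
        depth = len(self.queue)
        if not self.congested and depth >= self.activate_at:
            self.congested = True           # activate the control mechanism
        elif self.congested and depth <= self.deactivate_at:
            self.congested = False          # deactivate after congestion ends
        if self.congested:
            return "503 too busy"           # shed load: reject the new request
        self.queue.append(request)
        return "accepted"
```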
Robust design principle • 2.7 Overload control – Architectural considerations: • System work should fall into three broad priority classes: – Low priority » Tasks that do not directly impact users: maintenance and background tasks (e.g. backups, audits) – Medium priority » Tasks that directly or indirectly interact with end users – High priority » Management visibility and control tasks, e.g. overload control itself • As the system saturates, low-priority work is deferred in favor of higher-priority work.
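A minimal sketch (not from the slides) of deferring low-priority work under saturation: all work goes into one priority queue, and each cycle only a limited budget of the highest-priority items is served, so background tasks naturally wait when the system is busy. The priority values, budget and task names are illustrative assumptions.

```python
# Minimal sketch (not from the slides): three priority classes in one work queue;
# when the per-cycle budget is small (saturation), low-priority work is deferred.
import heapq

HIGH, MEDIUM, LOW = 0, 1, 2   # smaller number = served first

def drain(queue, budget):
    """Serve at most `budget` items this cycle, highest priority first;
    remaining (typically low-priority) work waits for the next cycle."""
    served = []
    for _ in range(min(budget, len(queue))):
        _, task = heapq.heappop(queue)
        served.append(task)
    return served

work = []
heapq.heappush(work, (LOW, "nightly backup"))
heapq.heappush(work, (HIGH, "overload-control command"))
heapq.heappush(work, (MEDIUM, "user request"))
print(drain(work, budget=2))   # ['overload-control command', 'user request']
```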
Robust design principle • 2.8 Process, resource and throughput monitoring – Some errors may not be immediately visible during normal system operation, so they must be detected before they become critical failures. – Mechanisms to proactively monitor system health: • Heartbeat checks of critical processes – ensure they are sane enough to respond within a reasonable time • Resource usage checks – process size, free space, CPU usage • Data audits • Monitoring of system throughput, performance and alarm behavior • Health checks of critical supporting systems – hello, keep-alive, status queries – These checks normally run as low-priority processes, but the master control process should run at a higher priority.
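A minimal sketch (not from the slides) of a low-rate monitoring loop combining a heartbeat check of a critical process with a resource usage check. The process.is_alive()/restart() hooks, the alarm callback and the thresholds are hypothetical placeholders for a real platform's monitoring interfaces.

```python
# Minimal sketch (not from the slides): proactive health monitoring loop.
import shutil
import time

def monitor(process, alarm, period_s=30, min_free_bytes=1 << 30):
    """Periodically heartbeat a critical process and check free disk space,
    raising an alarm (and triggering recovery) before a critical failure."""
    while True:
        if not process.is_alive():                # heartbeat / sanity check (hypothetical hook)
            alarm("process unresponsive")
            process.restart()                     # proactive recovery action (hypothetical hook)
        free = shutil.disk_usage("/").free        # resource usage check
        if free < min_free_bytes:
            alarm(f"low disk space: {free} bytes free")
        time.sleep(period_s)                      # run at low rate / low priority
```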