2/25/2014

ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja
Department of Electrical and Computer Engineering
System Diagnosis

Overview
• Introduction
• System Model
• Diagnosis Problem - PMC model
• Other Models and Comments
• Sequential Diagnosability
• Other Formulations, Algorithms, and Problems
• Summary

Introduction
• Reference
  – [prad:96] Chapter 8; original paper in IEEETC (Dec 1967)
• Diagnosis: an important part of recovery, maintenance and reconfiguration
• What is system-level diagnosis: diagnose failed components in a large, possibly multiprocessor, system
• Underlying needs: failures are inevitable and units are smart/intelligent enough to test other units, hence the need for a different model and corresponding theory

System Model
• Model and Assumptions
  – Graph model
    • Processors/processes expressed as nodes
    • Interconnects as links between nodes
  – Each processor is sufficiently powerful to test other processors comprehensively
  – An example model with four nodes
  – Test model: if node V_i tests V_j, then draw a directed link from V_i to V_j

Diagnosis - PMC model (contd.)
• Assumptions
  – System with n units
  – Tests are comprehensive
  – Test results are binary: good (0) / faulty (1)
  – Faulty units cannot be trusted for their test outcomes (denoted x, meaning the outcome can be 0 or 1)
  – Total number of faulty units in the system is upper-bounded by t
  – Example: system with four nodes and one fault

Diagnosis - PMC model (contd.)
• Example – Test Model
  [Figure: directed test graph on four nodes V1, V2, V3, V4]
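The PMC assumptions above can be written as a small outcome rule. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and the link list is assumed to be that of the four-node example used in these slides (links a12, a23, a24, a31, a41, a43).

```python
# Sketch of the PMC test-outcome rule (names are illustrative assumptions).

def outcome(tester_faulty, tested_faulty):
    """Outcome symbol of one test under the PMC assumptions.

    A fault-free tester reports the true state of the tested unit
    ('0' = good, '1' = faulty); a faulty tester is unreliable ('x').
    """
    if tester_faulty:
        return "x"
    return "1" if tested_faulty else "0"


def syndrome_pattern(tests, faulty):
    """Outcome symbols for every test link, given the set of faulty units."""
    return [outcome(i in faulty, j in faulty) for i, j in tests]


# Assumed link list: (i, j) means V_i tests V_j.
TESTS = [(1, 2), (2, 3), (2, 4), (3, 1), (4, 1), (4, 3)]

for faulty in (set(), {1}, {2}, {3}, {4}):
    label = "none" if not faulty else "V%d faulty" % next(iter(faulty))
    print(label.ljust(10), " ".join(syndrome_pattern(TESTS, faulty)))
```

Each printed line lists the outcomes in the link order given in TESTS; an "x" marks a test performed by a faulty unit.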
Diagnosis - PMC model (contd.)
• One-step diagnosis
  – Analysis problem – given a system with n units, all the interconnects, and the test outcomes, identify the faulty units subject to the constraint that no more than t units in the system are faulty.
  – Design problem – design a system using the fewest possible test links such that all the faulty units can be correctly identified in one step knowing the outcomes of the tests.

Diagnosis - PMC model (contd.)
• Example – Test outcomes
  • Assume V2 is faulty
  [Figure: the four-node test graph with link outcomes a12 = 1, a23 = x, a24 = x, a31 = 0, a41 = 0, a43 = 0]

Diagnosis - PMC model (contd.)
• One-step diagnosis - Example
  – Consider all possible outcomes (a_ij is the outcome of V_i testing V_j)

    fault        a12  a23  a24  a31  a41  a43
    none          0    0    0    0    0    0
    V1 faulty     x    0    0    1    1    0
    V2 faulty     1    x    x    0    0    0
    V3 faulty     0    1    0    x    0    1
    V4 faulty     0    0    1    0    x    x

    Each row is called the syndrome of the fault.

Diagnosis - PMC model (contd.)
• Observations
  1. Two possible syndromes are associated with the fault V1: 0 0 0 1 1 0 and 1 0 0 1 1 0
  2. No two faults have overlapping syndromes
  Hence: we can correctly identify (diagnose) the faulty unit

Diagnosis - PMC model (contd.)
• Consider two faulty units – say V1 and V2

    syndrome              x x x 1 1 0
    a possible outcome    0 0 0 1 1 0

  – This outcome is also a possible syndrome of V1 alone. Therefore we cannot determine whether V1 alone or both V1 and V2 are faulty. Thus two faults in this system cannot be diagnosed in one step.

Diagnosis - PMC model (contd.)
• Result: A system is one-step t-fault diagnosable provided the syndromes for each fault (0-fault, 1-fault, 2-faults, …, t-faults) are all distinct (non-overlapping/non-intersecting)
• More results follow, but first one more assumption – no two units test each other
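The distinct-syndrome condition in the Result can be checked by brute force on a small system. The sketch below is an illustration under stated assumptions, not the lecture's own algorithm: the function names are hypothetical, and the link list is taken from the a_ij labels of the four-node example. It relies on the observation that two fault sets can produce a common outcome vector unless some test, performed by a unit that is fault-free in both sets, targets a unit whose state differs between them.

```python
from itertools import combinations

# Assumed link list of the four-node example: (i, j) means V_i tests V_j.
TESTS = [(1, 2), (2, 3), (2, 4), (3, 1), (4, 1), (4, 3)]
UNITS = [1, 2, 3, 4]


def fault_sets(units, t):
    """All fault sets with at most t faulty units (including the empty set)."""
    return [frozenset(c) for k in range(t + 1) for c in combinations(units, k)]


def syndromes_overlap(tests, f1, f2):
    """True if some outcome vector is possible under both fault sets."""
    for i, j in tests:
        tester_good_in_both = i not in f1 and i not in f2
        if tester_good_in_both and ((j in f1) != (j in f2)):
            return False  # this test always separates f1 from f2
    return True


def one_step_diagnosable(units, tests, t):
    """True if no two fault sets of size <= t share a possible syndrome."""
    sets = fault_sets(units, t)
    return not any(syndromes_overlap(tests, a, b)
                   for a, b in combinations(sets, 2))


print(one_step_diagnosable(UNITS, TESTS, 1))  # single faults
print(one_step_diagnosable(UNITS, TESTS, 2))  # up to two faults
```

For t = 2 the check fails on the pair {V1} and {V1, V2}, which is the ambiguity worked out on the slide.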
Diagnosis - PMC model (contd.)
• Result 1: For a system to be one-step t-fault diagnosable, n ≧ 2t + 1
• Result 2: For a system to be one-step t-fault diagnosable, each unit must be tested by at least t other units
• Theorem: A system of n units in which no two units test each other is one-step t-fault diagnosable if and only if n ≧ 2t + 1 and each unit is tested by at least t other units.

Diagnosis - PMC model (contd.)
• Design Problem – one-step t-fault diagnosable system
• Example – n = 7, t = 3
  [Figure: test graph on seven nodes numbered 0 to 6]

Diagnosis - PMC model (contd.)
• Systems in which some units test each other
• One-step t-fault diagnosability conditions are somewhat complex – see [prad:96]
• How does one check if a given system is one-step t-fault diagnosable
  – Simple if no two units test each other
  – Somewhat complex if units test each other
  – There is a body of literature dealing with diagnosis algorithms

Diagnosis - PMC model (contd.)
• Design Problem: Algorithm for a simple one-step t-fault diagnosable system with n ≧ 2t + 1
  1. Number the nodes from 0 to n-1
  2. Draw a link from node i to i+1 (mod n), i+2 (mod n), … , i+t (mod n)
  3. The system so designed is one-step t-fault diagnosable

Other Models and Comments
• Consider possible test outcomes when a unit V_i tests unit V_j – see the listing below

    V_i  V_j   outcomes
    G    G     0 0 0 0 0 0 0 0 1 1  1  1  1  1  1  1
    G    F     0 0 0 0 1 1 1 1 0 0  0  0  1  1  1  1
    F    G     0 0 1 1 0 0 1 1 0 0  1  1  0  0  1  1
    F    F     0 1 0 1 0 1 0 1 0 1  0  1  0  1  0  1
    case       0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Other Models/Comments (contd.)
  – 4,5,6,7: PMC model
  – 8,9,10,11: PMC with complement encoding
  – 0,15: of little value
  – etc.
  – Some subsets of PMC are more interesting – for example 5,7 – this implies that a unit being tested is always correctly identified, if faulty, independent of the status of the testing unit. Many such variations have been studied.
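The sixteen cases in the listing are the binary expansions of the case number, read top to bottom over the states GG, GF, FG, FF. The sketch below regenerates the listing and picks out the groups named on the slide; the helper names are hypothetical and the grouping tests are one reading of the slide (PMC: a good tester reports 0 for a good unit and 1 for a faulty one).

```python
# Tester/tested states in the row order of the listing.
STATES = ["GG", "GF", "FG", "FF"]


def column(case):
    """Outcomes of one case for GG, GF, FG, FF (GG is the most significant bit)."""
    return tuple((case >> (3 - row)) & 1 for row in range(4))


def is_pmc(case):
    """Good tester is always right: GG -> 0 and GF -> 1."""
    gg, gf, _, _ = column(case)
    return gg == 0 and gf == 1


def is_pmc_complement(case):
    """Same as PMC with the outcome encoding inverted: GG -> 1 and GF -> 0."""
    gg, gf, _, _ = column(case)
    return gg == 1 and gf == 0


# Print the listing row by row.
for row, state in enumerate(STATES):
    print(state, " ".join(str(column(c)[row]) for c in range(16)))

pmc_cases = [c for c in range(16) if is_pmc(c)]
complement_cases = [c for c in range(16) if is_pmc_complement(c)]
# Cases within PMC where a faulty tested unit always yields outcome 1.
always_flag_faulty = [c for c in pmc_cases if column(c)[3] == 1]
print(pmc_cases, complement_cases, always_flag_faulty)
```

In cases 5 and 7 the FF entry is 1, so a faulty unit under test is reported faulty whether the tester is good or faulty.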
Other Models/Comments (contd.)
  – Comparison-based testing and diagnosis
    • A paper is in the IEEE Transactions on Computers, February 2009 issue
  – Basically the model is built on the PMC model

Sequential Diagnosability
• Consider the following repair strategy
  – identify one or more faulty units
  – repair them
  – test the system again and continue till we know that there are no more faulty units
  – This is called sequential diagnosis

Sequential Diagnosability (contd.)
• Assumptions
  – Same as before:
    • System with n units
    • Tests are comprehensive
    • Test results are binary: good (0) / faulty (1)
    • Faulty units cannot be trusted for their test outcomes (denoted x, meaning the outcome can be 0 or 1)
    • Total number of faulty units in the system is upper-bounded by t

Sequential Diagnosability (contd.)
• Result 1: For a system to be sequentially t-fault diagnosable, n ≧ 2t + 1
  It is not necessary for every unit to be tested by t units

Sequential Diagnosability (contd.)
• Example – n = 7, t = 3
  [Figure: test graph on seven nodes numbered 0 to 6]

Sequential Diagnosability (contd.)
• It is easy to show that the example system is sequentially 3-fault diagnosable
• The above construction will require n+2t–1 links
• A better solution: a system with n+2t-2 links can be designed that is sequentially t-fault diagnosable
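The remark that not every unit needs t testers can be illustrated by brute force on a small system. The sketch below is not the construction from the slides (whose figure is not reproduced here); it uses a hypothetical example, a single loop of five units with t = 2, and hypothetical function names. A system is treated as sequentially diagnosable if, for every possible outcome vector, all fault sets of size at most t that are consistent with it share at least one unit (or the only consistent set is the empty one), so that one faulty unit can always be named and repaired.

```python
from itertools import combinations, product


def fault_sets(n, t):
    """All fault sets with at most t faulty units, empty set first."""
    return [frozenset(c) for k in range(t + 1)
            for c in combinations(range(n), k)]


def consistent(tests, syndrome, faulty):
    """True if `faulty` could have produced `syndrome` under PMC."""
    return all(i in faulty or out == int(j in faulty)
               for (i, j), out in zip(tests, syndrome))


def candidates_per_syndrome(n, tests, t):
    """Yield the consistent fault sets for every realizable outcome vector."""
    sets = fault_sets(n, t)
    for syndrome in product((0, 1), repeat=len(tests)):
        cands = [f for f in sets if consistent(tests, syndrome, f)]
        if cands:
            yield cands


def one_step(n, tests, t):
    """Every realizable syndrome points to exactly one fault set."""
    return all(len(c) == 1 for c in candidates_per_syndrome(n, tests, t))


def sequential(n, tests, t):
    """Every realizable syndrome lets us name at least one faulty unit."""
    for cands in candidates_per_syndrome(n, tests, t):
        if cands == [frozenset()]:
            continue  # fault-free system, nothing to repair
        if not frozenset.intersection(*cands):
            return False
    return True


# Hypothetical example: unit i tests unit i+1 (mod 5), so each unit
# is tested by only one other unit.
loop5 = [(i, (i + 1) % 5) for i in range(5)]
print("one-step 2-fault diagnosable:", one_step(5, loop5, 2))
print("sequentially 2-fault diagnosable:", sequential(5, loop5, 2))
```

The enumeration is exponential in the number of links, so this check is only practical for very small systems.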