1/20/2014

ECE 753: FAULT-TOLERANT COMPUTING
Kewal K. Saluja
Department of Electrical and Computer Engineering
Motivation and Introduction
Lecture Set 1

Overview
• Motivation
• About the Course and the Instructor
  – Conduct, Outline, Coursepack
• Introduction
• Terminology and definitions
  – Sources, Overview and Comments
  – System defined
• Dependability/Security and their attributes
• Threats to dependability and modeling: the FEF (Fault-Error-Failure) chain
• Means to attain dependability
• Fundamental Principles

Motivation
• Informal Definition
• Key Attributes
• Who, What and Why Study
• Examples

Motivation
• What is Fault-Tolerance?
  A “fault-tolerant system” is one that continues to perform at the desired level of service in spite of failures in some of the components that constitute the system.

Motivation (contd.)
• Who is concerned about fault-tolerance?
  – System users, irrespective of the application, but some are a lot more concerned than others
• Who is concerned at design stages?
  – Universities
    • R, d, and a (Research, development, applications)
  – Industry
    • r, D, and A (research, Development, Applications)
• Issues
  – Design, Analysis/Validation, Implementation, Testing/Validation, Evaluation

Motivation (contd.)
• Key attributes
  Fault - Error - Failure
  Performance - Availability - Reliability
  More recently, the concept of “survivability”
  Inclusion of these constraints at the design stage is likely to be more cost effective.

Motivation (contd.)
Examples
• General Purpose Systems
  – PCs: RAMs with parity checks and possibly ECC (consideration of re-execution on failure detection is being investigated)
  – Workstations/Servers: error detection (HW), occasional corrective action (SW), even ECC (HW), keeping log (SW)

Motivation (contd.)
Examples
• Reliable Systems
  – Telephone systems
  – Banking systems, e.g. ATM
  – Stock market
  – CAE - exams/projects
  – Football games - display/ticketing

Motivation (contd.)
Examples
• Critical and Life Critical Systems
  – Manned and unmanned space borne systems
  – Aircraft control systems
  – Nuclear reactor control systems
  – Life support systems

Motivation (contd.)
Examples
• Reliable -> Critical Systems
  – 911 telephone switching system
  – Traffic light control system
  – Automotive control systems (ABS, fuel injection system)

About the Course and the Instructor
• Conduct
  – homeworks, exam, project, grading
• Outline
• Coursepack
  – references and reading list

Introduction
  – Historical perspective and major push
  – New initiatives
  – Goals of fault-tolerance
  – Applications of fault-tolerance

Introduction (contd.)
• Historical Perspective
  – not a new concept
  – first use by J. von Neumann, 1956
    • “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” Annals of Mathematics Studies, Princeton University Press
• Major push
  – Space program
  – HW fault tolerance, then
  – SW fault tolerance later
  – Merge the two

Introduction (contd.)
• New initiatives
  Density of devices -> more failures likely
  Power issue: scheduler, on-chip sensors
  Failures due to soft errors, lifetime degradation
    - hardening, re-execution
    - on-chip ECC
    - reconfiguration
    - microarchitectural solutions
    - architectural solutions

Introduction (contd.)
• New initiatives (contd.)
  Deep submicron technology and time-to-market pressure -> designs not fully verified
  Implementation of numerous functionalities on chip/board/system -> possibility of system hang-up
  Speculative execution -> results may need to be re-checked
  Low cost of HW and SW -> affordable/economical
• Hot issues: Soft errors, Life-time failures, Power and Thermal Management

Introduction (contd.)
• Goals - different goals for different applications
  The key word is “reliability”, which has a different meaning for different users and applications
• Intuitive explanations
  – Dependability
  – Service
  – Specification

Introduction (contd.)
• Intuitive concepts
  – Reliability: continues to work
  – Availability: works when I need it
  – Safety: does not put me in jeopardy
  – Performability
  – Maintainability
  – Testability
  – Survivability: will the system survive catastrophic events?
  – Security

Introduction (contd.)
• Applications
  – Space borne system
    • long life system
  – Airplane control system
    • critical system
  – Transaction processing system
    • high availability system
  – Switching system
    • high availability over certain level of performance

Terminology and definitions
• Reliability and concept of probability
  – R(t): conditional probability that a system provides continuous proper service in the interval [0,t], given that it provided the desired service at time 0.
• Availability
• Performability
  – An Example
• Dependability
• Security

Sources, Overview and Comments (1/4)
Key reference:
• Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr, Basic Concepts and Taxonomy of Dependable and Secure Computing, IEEE Transactions on Dependable and Secure Computing, Vol. 1, No. 1, Jan-Mar 2004.
Other references:
• Israel Koren and C. Mani Krishna, Fault-Tolerant Systems, Elsevier, 2007.
• D. K. Pradhan, editor, Fault-Tolerant Computer System Design, Prentice-Hall, 1996.
• B. W. Johnson, Design and Analysis of Fault-Tolerant Digital Systems, Addison-Wesley, First edition, 1989.
• My course (Fault-Tolerant Computing) URL: http://homepages.cae.wisc.edu/~ece753/INFO.html

Sources, Overview and Comments (2/4)
• What does the paper cover?
  – Very basic definitions of the terminology used in dependable computing
  – It categorizes definitions in three groups
    • System, attributes of dependability, threats to dependability
  – Covers very briefly methods to attain dependability

Sources, Overview and Comments (3/4)
• How to read the paper?
  – It is easy to read: scan it first and then read it
  – I have organized the material differently; you may find it helpful
• What is not covered?
  – One attribute almost missing: survivability
  – Basic methods of Fault Tolerance and their characterization

Sources, Overview and Comments (4/4)
• Chronology of Developments
  – Need for fault-tolerance: inception of the space program (recall “Voyager” launched in 1977 is still sending signals)
  – First standard glossary in 1985
  – Integration of performance etc. into fault tolerance, and hence the term “Dependability”: book published in 1992
  – Recognition of “Security” as a basic attribute of dependability: this paper in 2004

System Defined (1/4)
• “. . . an entity that interacts with other entities”
  – First entity (system): limited to be “electronic (mostly digital)” or “computer based”
  – Second entity
    • Hardware, software, human, other systems, .. (can also be called “environment”)
• Characterization and fundamental properties
  – Functionality
  – Performance
  – Dependability and security
  – Cost (usability, manageability, adaptability: not directly included in the paper)
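
Worked example for R(t)
The definition of R(t) on the Terminology and definitions slide can be made concrete with a small numerical sketch. The slides do not specify a failure model, so this example assumes a constant failure rate (the common exponential model), under which R(t) = exp(-lambda * t) and the mean time to failure is 1/lambda. The function names `reliability` and `mttf` and the example failure rate are hypothetical, chosen only for illustration.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """R(t) under an ASSUMED constant failure rate (exponential model).

    failure_rate: failures per hour (lambda), must be >= 0
    t: mission time in hours, must be >= 0
    Returns the probability of continuous proper service over [0, t],
    given that the system was working at time 0.
    """
    if failure_rate < 0 or t < 0:
        raise ValueError("failure_rate and t must be non-negative")
    return math.exp(-failure_rate * t)


def mttf(failure_rate: float) -> float:
    """Mean time to failure for the same assumed exponential model."""
    if failure_rate <= 0:
        raise ValueError("failure_rate must be positive")
    return 1.0 / failure_rate


if __name__ == "__main__":
    lam = 1e-3  # hypothetical failure rate: one failure per 1000 hours
    for hours in (0.0, 100.0, 1000.0, 5000.0):
        print(f"R({hours:g} h) = {reliability(lam, hours):.4f}")
    print(f"MTTF = {mttf(lam):g} h")
```

Under this assumption, R(t) starts at 1 and decays monotonically. At t equal to the MTTF, the reliability has already dropped to about 0.368, which illustrates why a long-life system (such as the space borne example) needs more than a low component failure rate.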