
Fault Tolerance and Robustness in Concurrent Systems - PowerPoint PPT Presentation



  1. Fault Tolerance and Robustness in Concurrent Systems

  2. Faults, errors, failures, and fault tolerance have many different definitions. What working definition should we use for a fault? What does it mean to be fault-tolerant?

  3. Faults, errors, failures, and fault tolerance have many different definitions.
     • The definitions of these three terms are not standardized.
       – One view is that failures are random events and errors are designed in. Both cause faults.
       – Another is that a fault is the underlying defect, which may or may not manifest itself and lead to a failure. This view leads more naturally to fault tolerance, where the system can tolerate faults without failing.
     • What it means to be fault-tolerant is meant as an open-ended question.

  4. If not handled, faults can exhibit themselves in a system in a number of different ways.
     • Actions – the wrong actions are performed
     • Timing – the right actions are performed but at the wrong time
     • Sequence – the right actions are performed but in the wrong sequence
     • Amount – the wrong number of actions is performed

  5. Fault-tolerance is a system level attribute that needs to be designed in rather than tacked on. In a broad sense, what are the two major categories of activities that have to go on to achieve fault-tolerance?

  6. Fault-tolerance is a system level attribute that needs to be designed in rather than tacked on.
     • The two major categories of activities are detection and recovery (taking action).
     • We need to have mechanisms in place to detect that something is going wrong, and what the underlying fault is.
     • Then we need to recover without leading to a failure, or at worst, fail safely.

  7. A simple software watchdog is a first detection mechanism. Software components are required to report a heartbeat to their supervisor or to a central monitor. The assumption is that as long as the heartbeat is received the component is working. How much does this tell us about the operation of the component? What could be an extension to the simple watchdog concept that could tell us more?

  8. How much does this tell us about the operation of the component?
     • Hardware watchdogs are regularly built into the hardware of safety-critical systems. Unless the watchdog is reset within its timeout period, a hardware reset will be issued to restart the system.
     • The heartbeat only tells us that the component is regularly getting to the point in its execution where the heartbeat is sent. It tells us little else about the operation of the component.
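
The software heartbeat described above can be sketched in Java as follows. This is a minimal illustration, not part of the original slides: the class `HeartbeatMonitor` and its methods `beat` and `isAlive` are hypothetical names, and the current time is passed in explicitly so the behaviour is easy to reason about.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical central monitor: components call beat(), a supervisor polls isAlive().
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastBeat = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called by a component each time it reaches its heartbeat point.
    void beat(String component, long nowMillis) {
        lastBeat.put(component, nowMillis);
    }

    // True only if a heartbeat from this component arrived within the timeout window.
    boolean isAlive(String component, long nowMillis) {
        Long last = lastBeat.get(component);
        return last != null && nowMillis - last <= timeoutMillis;
    }
}
```

Note that `isAlive` returning true says only that the component recently reached its `beat` call, which is exactly the limitation the slide points out.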

  9. What could be an extension to the simple watchdog concept that could tell us more?
     • If we have the component send information that is more than a heartbeat at regular intervals, a watchdog monitor that knows how the component is supposed to operate could check the component for incorrect operation.
     • This would require that the watchdog understands all of the possible correct paths of execution of the component under observation, that some indication is sent whenever the component gets to a significant point, and that the watchdog takes action when the information does not match correct operation.
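
One way to picture such an extended watchdog is as a table of legal checkpoint transitions. The Java sketch below is only an assumption about how this could look: `SequenceWatchdog`, `report`, and the checkpoint names used with it are hypothetical, and a real monitor would trigger recovery instead of just latching a flag.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical watchdog that knows the legal checkpoint transitions of one component.
final class SequenceWatchdog {
    private final Map<String, Set<String>> allowedNext;
    private String current;
    private boolean faultDetected = false;

    SequenceWatchdog(String start, Map<String, Set<String>> allowedNext) {
        this.current = start;
        this.allowedNext = allowedNext;
    }

    // Called when the component reports that it reached a significant point.
    // Returns false (and latches a fault) if the transition is not a known-correct one.
    synchronized boolean report(String checkpoint) {
        Set<String> legal = allowedNext.getOrDefault(current, Set.of());
        if (!legal.contains(checkpoint)) {
            faultDetected = true; // a real monitor would start recovery here
            return false;
        }
        current = checkpoint;
        return true;
    }

    synchronized boolean faultDetected() {
        return faultDetected;
    }
}
```

For a component that should cycle INIT, READ, PROCESS, READ, and so on, reporting PROCESS twice in a row would be flagged, which addresses the Sequence class of fault that a plain heartbeat cannot see.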

  10. There are a number of responses that can be taken once you find out that something is wrong. What are some approaches that can be used to deal with a broken component, and an operation that may not have been done correctly? What concerns do you have to consider?

  11. First, we will establish some terminology.
      • Cancellation
        – Task level termination
        – May or may not result in stopping threads
      • Interruption
        – Thread level termination
        – Get a thread to terminate with or without completion of the current operation
      • Shutdown
        – Application or service level termination
        – Stop all tasks, and associated threads, with or without completion
      These definitions are not necessarily universally accepted.

  12. If you are not using a framework with fault handling, you will have to deal with it all yourself.
      • A framework without fault handling may not give you many options
      • Define cancellation and interruption policies
        – How to do it, when it is checked, what is done
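
As an illustration of what a hand-written cancellation policy might state, the Java sketch below answers the three questions (how, when, what) in code. The class `CancellableCounter` and its methods are hypothetical and chosen only for this example; the "work" is just counting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical task with an explicit cancellation policy:
//   how:  cancel() sets a flag
//   when: the flag is checked once per unit of work
//   what: the task stops early and keeps the results produced so far
final class CancellableCounter implements Runnable {
    private volatile boolean cancelled = false;
    private final List<Integer> results = new ArrayList<>();
    private final int limit;

    CancellableCounter(int limit) {
        this.limit = limit;
    }

    void cancel() {
        cancelled = true;
    }

    @Override
    public void run() {
        for (int i = 0; i < limit && !cancelled; i++) {
            synchronized (results) {
                results.add(i); // one unit of work
            }
        }
    }

    // Snapshot of the work completed before completion or cancellation.
    List<Integer> resultsSoFar() {
        synchronized (results) {
            return new ArrayList<>(results);
        }
    }
}
```

A flag like this is only noticed between units of work. A task that can block for a long time would also need an interruption policy.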

  13. You have some design decisions to make regarding how to handle being interrupted.
      • At the task level
        – Finish current work or stop immediately
        – Does it own the thread?
            Yes: end the thread?
            No, i.e. it’s running from a thread pool: let the thread manager handle it for the thread, by either preserving the interrupted status or throwing InterruptedException
      • At the thread level
        – Propagate the interrupt if the code that detects it does not implement the interruption policy
        – Otherwise, implement the interruption policy
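
The two options for a task that does not own its thread can be sketched in Java as below. The class and method names (`InterruptHandling`, `takeOrPropagate`, `consumer`) are hypothetical; `BlockingQueue.take()`, `InterruptedException`, and `Thread.currentThread().interrupt()` are the standard library calls involved.

```java
import java.util.concurrent.BlockingQueue;

final class InterruptHandling {
    // Option 1: propagate. This method does not implement the interruption policy,
    // so it lets InterruptedException travel up to code that does.
    static String takeOrPropagate(BlockingQueue<String> queue) throws InterruptedException {
        return queue.take();
    }

    // Option 2: the code cannot throw (Runnable.run declares no checked exceptions),
    // so it restores the interrupted status for the thread's owner to see.
    static Runnable consumer(BlockingQueue<String> queue) {
        return () -> {
            try {
                queue.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupted status
            }
        };
    }
}
```

Catching `InterruptedException` and doing nothing would hide the interrupt from the thread pool, so one of these two options is normally chosen.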

  14. There are other things that you need to consider if you want to build a fault-tolerant system.
      • What is the most common indication that your program had a problem?
          Exception in thread "main" java.lang.SomeException
              at com.example.myproject.Class1.method1(Class2.java:16)
              at com.example.myproject.Class2.method2(Class3.java:25)
              at com.example.myproject.TopClass.main(TopClass.java:14)
      If it is operationally critical that the system keeps running, tries to recover from errors, or at a minimum does a graceful, failsafe shutdown, what do you do?

  15. What do you do?
      • The most common response is “handle all exceptions”, but this cannot always be done.
      • If a class you use throws an unchecked exception or an error, you have no indication that it might come at you.
      Interface Thread.UncaughtExceptionHandler
      • This provides a mechanism for you to catch any Throwable, which includes all Exceptions and Errors, that would otherwise go uncaught and terminate a thread.
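
A minimal Java sketch of installing such a handler on one thread follows. `LastResortHandler` and `runAndCapture` are hypothetical names for this example; `Thread.UncaughtExceptionHandler` and `Thread.setUncaughtExceptionHandler` are the standard API. What the handler should do (log, restart, fail safe) is a design decision; recording and printing are only placeholders.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical last-resort handler: records whatever killed the thread.
final class LastResortHandler implements Thread.UncaughtExceptionHandler {
    final AtomicReference<Throwable> lastFailure = new AtomicReference<>();

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        lastFailure.set(e);
        System.err.println("Thread " + t.getName() + " died: " + e);
    }

    // Runs the task on a fresh thread with a handler installed, waits for the
    // thread to end, and returns the uncaught Throwable (or null if none).
    static Throwable runAndCapture(Runnable task) {
        LastResortHandler handler = new LastResortHandler();
        Thread worker = new Thread(task, "worker");
        worker.setUncaughtExceptionHandler(handler);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve interrupted status
        }
        return handler.lastFailure.get();
    }
}
```

`Thread.setDefaultUncaughtExceptionHandler` can be used instead to cover every thread that has no handler of its own.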

  16. Shutdown of a service should take down all tasks and threads that it owns.
      • At the task level
        – Let a running task complete?
        – Let scheduled but not started tasks complete?
        – Provide information about what work was not finished.
      • Once tasks are handled, interrupt threads in pool
      • ExecutorServices provide some support
        – shutdown()
        – shutdownNow()
        – awaitTermination()
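
The three `ExecutorService` methods listed above are commonly combined into a two-phase shutdown. The Java sketch below shows one such combination; the helper class `GracefulShutdown`, its `stop` method, and the choice of timeout are assumptions made for this example.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

final class GracefulShutdown {
    // Returns the tasks that were queued but never started
    // (empty if the orderly shutdown finished within the timeout).
    static List<Runnable> stop(ExecutorService pool, long timeout, TimeUnit unit) {
        pool.shutdown(); // stop accepting new tasks; already-submitted tasks still run
        try {
            if (pool.awaitTermination(timeout, unit)) {
                return List.of();
            }
            // Orderly shutdown took too long: interrupt workers and drain the queue.
            List<Runnable> neverStarted = pool.shutdownNow();
            pool.awaitTermination(timeout, unit);
            return neverStarted;
        } catch (InterruptedException e) {
            List<Runnable> neverStarted = pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve interrupted status
            return neverStarted;
        }
    }
}
```

The returned list is one way to "provide information about what work was not finished". Tasks that were running when `shutdownNow()` interrupted them are not in that list and would need their own bookkeeping.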
