Software Fault Tolerance of Concurrent Programs Using Controlled Re-execution

Ashis Tarafdar (ashis@cs.utexas.edu)
Vijay K. Garg (garg@ece.utexas.edu)

Parallel and Distributed Systems Laboratory
Department of Electrical and Computer Engineering
University of Texas at Austin
Austin, USA 78712
http://maple.ece.utexas.edu

Introduction
Software fault tolerance: to ensure that the system continues normal operation despite the presence of software faults (bugs).
Software faults cause software failures.

Goals
A new approach to software fault tolerance
The predicate control problem: introduction and results

Background: Software Fault Tolerance
The Progressive Retry Approach [Wang et al., 1997]:
Software failures are often transient.
Roll back and re-execute.
No guarantee that the failure will not recur on re-execution.

Background: Races in Concurrent Programs
What is a race? A race occurs when two processes can concurrently access the same shared resource.
[Figure: two computations on processes P1 and P2 with critical sections cs1 and cs2 and events a and b. Left: a race in a concurrent computation. Right: a race-free computation. Labels: critical section, synchronization.]
Races are an important class of software faults [Iyer & Lee, 95].

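The race definition above can be made concrete with a small sketch. It assumes, beyond what the slide states, that every critical-section entry and exit carries a vector-clock timestamp; the names happened_before and can_race are hypothetical. Two critical sections can be occupied at the same time in some consistent global state exactly when neither one's exit happened before the other's entry.

```python
from typing import Tuple

Clock = Tuple[int, ...]  # assumed representation: one counter per process


def happened_before(u: Clock, v: Clock) -> bool:
    """Standard vector-clock test: u -> v iff u <= v componentwise and u != v."""
    return all(a <= b for a, b in zip(u, v)) and u != v


def can_race(entry1: Clock, exit1: Clock, entry2: Clock, exit2: Clock) -> bool:
    """True if the two critical sections may overlap in some consistent global state."""
    return not happened_before(exit1, entry2) and not happened_before(exit2, entry1)


# No synchronization between P1 and P2: the critical sections may overlap.
unsynchronized = can_race((1, 0), (2, 0), (0, 1), (0, 2))

# P2 enters only after hearing from P1, which has already left cs1: no race.
synchronized = can_race((1, 0), (2, 0), (3, 2), (3, 3))
```

Here unsynchronized evaluates to True and synchronized to False, which mirrors the two computations in the figure.
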
The Controlled Re-execution Approach
1. Tracing an execution
2. Detecting a race failure
3. Determining a control strategy
4. Re-executing under control
[Figure: a traced computation and a controlling computation on processes P1 to P3 with critical sections cs1 to cs4 and events a to d. Label on the controlling computation: added synchronization.]

Model
computation (happened before)
global state
consistent global state
global predicate (e.g. mutual exclusion)
[Figure: a computation on processes P1 to P3 with critical sections cs1 to cs4 and events a to f; global states G and H illustrate consistent and inconsistent states.]

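As an illustration of the consistent global state notion, here is a minimal sketch. The representation is an assumption made for the example: a global state is a dictionary giving how many events each process has executed, and cross-process happened-before edges are given as (send, receive) event pairs. The name is_consistent is hypothetical. A global state is consistent when it never includes a receive without the matching send.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (process name, 1-based position of the event on that process)


def is_consistent(cut: Dict[str, int], messages: List[Tuple[Event, Event]]) -> bool:
    """True if every included receive event has its send event included as well."""

    def included(event: Event) -> bool:
        proc, index = event
        return index <= cut[proc]

    return all(included(send) for send, recv in messages if included(recv))


# One message: the 2nd event of P1 is received as the 1st event of P2.
messages = [(("P1", 2), ("P2", 1))]

good = is_consistent({"P1": 2, "P2": 1}, messages)  # send and receive both included
bad = is_consistent({"P1": 1, "P2": 1}, messages)   # receive included, send missing
```

With this input, good is True and bad is False.
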
The Off-line Predicate Control Problem
Problem statement: Given a computation C and a global predicate B, find a controlling computation of B in C.
Note: A controlling computation must have no cycles!
[Figure: computation C (left) and a controlling computation C' of B in C (right), with B = mutual exclusion, on processes P1 to P3 with critical sections cs1 to cs4, events a to d, and a global state G.]

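The note that a controlling computation must have no cycles can be checked mechanically. The sketch below is not the authors' procedure: it only tests acyclicity, not whether the predicate B holds, and it assumes the computation and the added synchronization are both given as (before, after) event pairs. The function name has_no_cycles and the event labels are hypothetical. It relies on Python's standard graphlib module (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter
from typing import Iterable, Tuple

Edge = Tuple[str, str]  # (earlier event, later event)


def has_no_cycles(order_edges: Iterable[Edge], added_edges: Iterable[Edge]) -> bool:
    """True if the original order plus the added synchronization is acyclic."""
    sorter = TopologicalSorter()
    for before, after in list(order_edges) + list(added_edges):
        sorter.add(after, before)  # `after` depends on `before`
    try:
        sorter.prepare()  # raises CycleError if the combined graph has a cycle
    except CycleError:
        return False
    return True


# Each critical section is entered before it is exited.
order = [("cs1.enter", "cs1.exit"), ("cs2.enter", "cs2.exit")]

# Forcing cs2 to wait for cs1 is fine.
ok = has_no_cycles(order, [("cs1.exit", "cs2.enter")])

# Also forcing cs1 to wait for cs2 creates a cycle, so it cannot be executed.
deadlocked = has_no_cycles(order, [("cs1.exit", "cs2.enter"), ("cs2.exit", "cs1.enter")])
```

Here ok is True and deadlocked is False.
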
Off-line Mutual Exclusion
Theorem: The off-line predicate control problem is NP-hard [Tarafdar & Garg, 98].
Variants of off-line mutual exclusion:
Off-line Mutual Exclusion
Off-line Readers Writers
Off-line Independent Mutual Exclusion
Off-line Independent Read-Write Mutual Exclusion

A Relation on Critical Sections
cs1 is related to cs2 iff cs1 starts before cs2 finishes.
[Figure: example computations on processes P1 to P3 with critical sections cs1 and cs2 and events a to f.]

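The relation can be computed directly once "starts before" is given a precise meaning. This sketch assumes it means that the start of cs1 happened before the finish of cs2, and that start and finish events carry vector-clock timestamps; neither detail is spelled out on the slide. The names happened_before and cs_relation are hypothetical, and only pairs of distinct critical sections are considered.

```python
from itertools import permutations
from typing import Dict, Set, Tuple

Clock = Tuple[int, ...]


def happened_before(u: Clock, v: Clock) -> bool:
    """Standard vector-clock test: u -> v iff u <= v componentwise and u != v."""
    return all(a <= b for a, b in zip(u, v)) and u != v


def cs_relation(sections: Dict[str, Tuple[Clock, Clock]]) -> Set[Tuple[str, str]]:
    """Return all pairs (a, b), a != b, such that a starts before b finishes.

    `sections` maps a critical-section name to its (start, finish) timestamps.
    """
    return {
        (a, b)
        for a, b in permutations(sections, 2)
        if happened_before(sections[a][0], sections[b][1])
    }


# cs1 and cs2 each send a message after starting and receive the other's
# message before finishing, so each starts before the other finishes.
overlapping = cs_relation({
    "cs1": ((1, 0), (4, 2)),
    "cs2": ((0, 1), (2, 4)),
})

# cs2 starts only after cs1 has finished, so the relation holds one way only.
ordered = cs_relation({
    "cs1": ((1, 0), (2, 0)),
    "cs2": ((3, 2), (3, 3)),
})
```

In the first case the relation contains both (cs1, cs2) and (cs2, cs1), which is a cycle; in the second it contains only (cs1, cs2).
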
Off-line Readers Writers: Result
Theorem: For a computation C and a global predicate B_rw, a controlling computation of B_rw in C exists iff all cycles in the relation on critical sections contain only read critical sections.
Proof, key ideas: necessary; sufficient.
[Figure: processes P1 to P3 with critical sections cs1 to cs3 marked R (read) or W (write). Labels: write critical section, strongly connected components.]

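The condition in the theorem can be tested with a simple graph search. The sketch below is a straightforward check, not one of the algorithms from the talk: it assumes the relation on critical sections is given as a set of (a, b) pairs over distinct critical sections, and that the write critical sections are known by name. The function name controlling_computation_exists is hypothetical. "All cycles contain only read critical sections" is the same as "no write critical section lies on a cycle", which is what the code tests.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def controlling_computation_exists(
    relation: Iterable[Tuple[str, str]], writes: Set[str]
) -> bool:
    """True iff no write critical section lies on a cycle of the relation."""
    successors = defaultdict(set)
    for a, b in relation:
        successors[a].add(b)

    def on_cycle(start: str) -> bool:
        # Depth-first search from the successors of `start`, looking for `start`.
        stack, seen = list(successors[start]), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(successors[node])
        return False

    return not any(on_cycle(w) for w in writes)


# The only cycle is between two read critical sections: control is possible.
feasible = controlling_computation_exists(
    {("r1", "r2"), ("r2", "r1"), ("r2", "w3")}, writes={"w3"}
)

# A write critical section sits on a cycle: no controlling computation exists.
infeasible = controlling_computation_exists(
    {("r1", "w2"), ("w2", "r1")}, writes={"w2"}
)
```

Here feasible is True and infeasible is False. Each write critical section triggers its own search, so this check makes no attempt to match the running times quoted for the talk's algorithms.
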
Off-line Readers Writers: Algorithm
n: number of processes
p: number of critical sections in the computation
Algorithm 1: O(p^2)
Key idea: An SCC contains at most one CS per process.
Algorithm 2: O(n^2 p)
Key idea: Only "new" CSs need be considered.
Algorithm 3: O(np)
[Figure: a computation on processes P1 to P4 with critical sections cs1 to cs8, divided into parts A and B.]

Summary
A new approach to software fault tolerance:
  introduced the controlled re-execution approach for race faults
  focused on the problem of determining a control strategy
The off-line predicate control problem: introduction and results:
  defined the off-line predicate control problem
  necessary and sufficient conditions for the off-line readers writers problem
  O(np) algorithm for the off-line readers writers problem
  also: other variants of off-line mutual exclusion

On-line Mutual Exclusion is Impossible
[Figure: three computations on processes P1 and P2 with critical sections cs1 and cs2, events a to d, and global states G and H.]