CRASH RECOVERY WITH LITTLE OVERHEAD
(Preliminary Version)

Tony T-Y. Juang and S. Venkatesan
Computer Science Program
University of Texas at Dallas
Richardson, TX 75083-0688
{juang,venky}@utdallas.edu

CH2996-7/91/0000/0454$01.00 © 1991 IEEE

ABSTRACT

Recovering from processor failures in distributed systems is an important problem in the design and development of reliable systems. Several solutions to this problem have been presented in the literature. Most of them recover from failures by storing sufficient extra information in stable storage and using this information when there are failures. In this paper, we present two solutions to this problem which involve very little overhead. Without appending any information to the messages of the application program, we show that it is possible to recover from failures using O(|V||E|) messages, where |V| is the number of processors and |E| is the number of communication links in the system. The second algorithm can be used to recover from processor failures without forcing non-faulty processors to roll back under certain conditions. With a small modification, the second algorithm can also be used to recover from processor failures even if no stable storage is available.

1. INTRODUCTION

Distributed systems are becoming popular because of several advantages they have over centralized ones. The advantages include efficient utilization of resources, ability to enhance the system gradually, greater degree of fault-tolerance, etc. An important and desirable property of a distributed system is its ability to tolerate failures. As the size of distributed systems grows, so does the probability that some component may fail. Thus, it is important to deal with failures of the components of the system. Fault tolerance is provided at two levels of the system: at the hardware level and at the protocol level. At the hardware level, components are designed and built with high reliability. Faults that occur in spite of the high reliability of the components are dealt with at the protocol level. Thus, specific steps must be taken at the protocol level to increase the reliability of distributed systems.

Coping with processor failures is hard in solving simple problems (and is impossible in instances such as distributed consensus [3]) even if the processor failure mode is restricted to fail-stop failures, while communication failures are comparatively easier to deal with.

In distributed transaction processing systems, there is a need to recover from processor failures quickly to increase the availability of the system. Checkpointing and rollback recovery is a scheme that is widely used. Each processor locally saves its current state and its history in a stable storage from time to time so that if the processor fails, it can restart from the most recently saved state. This process of saving processor states is called checkpointing. For the underlying computation to restart from a consistent global state, it may be necessary for some or all of the processors in the system to restart from a processor state that occurred before the latest saved state. This is called rolling back. To prevent the domino effect and to roll back the processor states to the maximum consistent state, certain additional information is appended to each message of the application program. The reader is referred to [2] for a discussion on consistent states of a distributed computation, [16] for a discussion on repeated global state determination, [17] for domino effect, [5] for maximum consistent states in crash recovery, and [6,12] for a discussion on appending additional information to application messages to aid in rolling back.

Checkpointing has been widely used and studied by many researchers [2,5-9,12-14,17]. There are two approaches towards checkpointing and crash recovery: the synchronous approach and the asynchronous approach. The synchronous approach is to ensure that all processors keep local checkpoints in stable storage and coordinate their local checkpointing actions such that the global checkpoint (the set of local checkpoints) in the system is guaranteed to be consistent [2,7,9,15,17]. When a failure occurs, processors roll back and restart from their most recent checkpoints that are part of the recent global checkpoints. While crash recovery is easy and simple in this case, additional messages are generated for each checkpoint, and synchronization delays are introduced