
CRASH RECOVERY WITH LITTLE OVERHEAD
(Preliminary Version)

Tony T-Y. Juang and S. Venkatesan
Computer Science Program, N P 3 1
University of Texas at Dallas
Richardson, TX 75083-0688
{juang,venky}@utdallas.edu

ABSTRACT

Recovering from processor failures in distributed systems is an important problem in the design and development of reliable systems. Several solutions to this problem have been presented in the literature. Most of them recover from failures by storing sufficient extra information in stable storage and using this information when there are failures. In this paper, we present two solutions to this problem which involve very little overhead. Without appending any information to the messages of the application program, we show that it is possible to recover from failures using O(|V||E|) messages, where |V| is the number of processors and |E| is the number of communication links in the system. The second algorithm can be used to recover from processor failures without forcing non-faulty processors to roll back under certain conditions. With a small modification, the second algorithm can also be used to recover from processor failures even if no stable storage is available.

1. INTRODUCTION

Distributed systems are becoming popular because of several advantages they have over centralized ones. The advantages include efficient utilization of resources, the ability to enhance the system gradually, a greater degree of fault tolerance, etc. An important and desirable property of a distributed system is its ability to tolerate failures. As the size of distributed systems grows, so does the probability that some component may fail. Thus, it is important to deal with failures of the components of the system. Fault tolerance is provided at two levels of the system: at the hardware level and at the protocol level. At the hardware level, components are designed and built with high reliability. Faults that occur in spite of the high reliability of the components are dealt with at the protocol level. Thus, specific steps must be taken at the protocol level to increase the reliability of distributed systems.

Coping with processor failures is hard even when solving simple problems (and is impossible in instances such as distributed consensus [3]), even if the processor failure mode is restricted to fail-stop failures, while communication failures are comparatively easier to deal with.

In distributed transaction processing systems, there is a need to recover from processor failures quickly to increase the availability of the system. Checkpointing and rollback recovery is a scheme that is widely used. Each processor locally saves its current state and its history in stable storage from time to time so that if the processor fails, it can restart from the most recently saved state. This process of saving processor states is called checkpointing. For the underlying computation to restart from a consistent global state, it may be necessary for some or all of the processors in the system to restart from a processor state that occurred before the latest saved state. This is called rolling back. To prevent the domino effect and to roll back the processor states to the maximum consistent state, certain additional information is appended to each message of the application program. The reader is referred to [2] for a discussion on consistent states of a distributed computation, [16] for a discussion on repeated global state determination, [17] for the domino effect, [5] for maximum consistent states in crash recovery, and [6,12] for a discussion on appending additional information to application messages to aid in rolling back.

Checkpointing has been widely used and studied by many researchers [2,5-9,12-14,17]. There are two approaches towards checkpointing and crash recovery: the synchronous approach and the asynchronous approach. The synchronous approach is to ensure that all processors keep local checkpoints in stable storage and coordinate their local checkpointing actions such that the global checkpoint (the set of local checkpoints) in the system is guaranteed to be consistent [2,7,9,15,17]. When a failure occurs, processors roll back and restart from their most recent checkpoints, which are part of the most recent global checkpoint. While crash recovery is easy and simple in this case, additional messages are generated for each checkpoint, and synchronization delays are introduced

CH2996-7/91/0000/0454$01.00 © 1991 IEEE
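The notions of rolling back and of a consistent global state described in the introduction can be illustrated with a small sketch. It is not either of the two algorithms of this paper; it is a generic toy model under stated assumptions: every saved state records how many messages the processor has sent to and received from each peer, every history begins with an empty initial state, and the last entry of a live processor's history stands for its current state. The names `Checkpoint`, `rollback_to_consistent`, `history`, and `failed` are hypothetical and introduced only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """A saved local state: message counts per peer (assumed bookkeeping)."""
    sent: dict = field(default_factory=dict)      # peer -> messages sent so far
    received: dict = field(default_factory=dict)  # peer -> messages received so far


def rollback_to_consistent(history, failed):
    """Choose one saved state per processor so that no orphan message remains.

    history: processor name -> list of Checkpoint, oldest first. Index 0 is
             assumed to be the empty initial state.
    failed:  set of processors that crashed and lost their latest entry.
    Returns: processor name -> index of the state to restart from.
    """
    # A failed processor restarts from the entry before the one it lost;
    # a live processor tentatively keeps its latest state.
    chosen = {
        p: max(0, len(states) - (2 if p in failed else 1))
        for p, states in history.items()
    }
    changed = True
    while changed:  # repeat until no processor has to move back any further
        changed = False
        for p in history:
            for q in history:
                if p == q:
                    continue
                sent = history[p][chosen[p]].sent.get(q, 0)
                # If q has received more from p than p remembers sending,
                # q holds orphan messages and must fall back to an earlier state.
                while history[q][chosen[q]].received.get(p, 0) > sent:
                    chosen[q] -= 1
                    changed = True
    return chosen


# Hypothetical two-processor run: A sent two messages to B, B sent one to A.
history = {
    "A": [Checkpoint(),
          Checkpoint(sent={"B": 1}),
          Checkpoint(sent={"B": 2}, received={"B": 1})],
    "B": [Checkpoint(),
          Checkpoint(received={"A": 1}),
          Checkpoint(received={"A": 2}, sent={"A": 1})],
}
restart_points = rollback_to_consistent(history, failed={"A"})
print(restart_points)
```

In this run the failure of A forces the non-faulty processor B back by one saved state, because B's latest state reflects a message that A, after restarting, no longer remembers sending. With longer histories the same loop can cascade across many processors, which is the domino effect mentioned above.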
