Check Pointing and Rollback Recovery
Course: Distributed Computing
Faculty: Dr. Rajendra Prasath
Spring 2019
About this topic
This topic covers various concepts in Check Pointing and Rollback Recovery. We will also focus on the essential aspects of check pointing and rollback recovery in distributed contexts.
Rajendra, IIIT Sri City
RECAP: What did you learn so far?
- Challenges in Message Passing systems
- Distributed Sorting
- Space-Time Diagram
- Partial Ordering / Causal Ordering
- Concurrent Events
- Local Clocks and Vector Clocks
- Distributed Snapshots
- Termination Detection
- Topology Abstraction and Overlays
- Leader Election Problem in Rings
- Message Ordering / Group Communications
- Distributed Mutual Exclusion Algorithms
Topics to focus on … For End Semester
- Distributed Mutual Exclusion
- Deadlock Detection
- Check Pointing and Rollback Recovery
- Self-Stabilization
- Distributed Consensus
- Reasoning with Knowledge
- Peer-to-peer computing and Overlays
- Authentication in Distributed Systems
Distributed Mutual Exclusion (Recap)
- No deadlocks: no process should be permanently blocked, waiting for messages (resources) from other sites
- No starvation: no site should have to wait indefinitely to enter its critical section (CS) while other sites execute the CS more than once
- Fairness: requests are honored in the order they are made. This means processes have to be able to agree on the order of events. (Fairness prevents starvation)
- Fault tolerance: the algorithm is able to survive a failure at one or more sites
Deadlock – Illustrated (Recap)
- Vehicular Traffic – A real-time scenario
Dining Philosophers (Recap)
- Each philosopher must alternately think and eat
- A philosopher can only eat when they have both left and right forks
- Problem: How to design a discipline of behavior (a concurrent algorithm) such that no philosopher will starve?
- Suggest a simple solution?
Check Pointing and Rollback Recovery
Let us explore Check Pointing and Rollback Recovery algorithms in distributed systems.
Handling Failures / Recovery?
- Failure of a site/node in a distributed system causes inconsistencies in the state of the system.
- Recovery: bringing the failed node back in step with the other nodes in the system.
- Failures:
  - Process failure: deadlocks, protection violations, erroneous user input, etc.
  - System failure: failure of the processor/system. A system failure can involve full or partial amnesia. It can be a pause failure (the system restarts in the same state it was in before the crash) or a complete halt.
  - Secondary storage failure: data inaccessible.
  - Communication failure: network inaccessible.
Recovery in Concurrent Systems
- In a distributed system (DS), state involves message exchanges
- In distributed systems, rolling back one process can cause the rollback of other processes
- Orphan messages & domino effect: assume Y fails after sending m
  - X has a record of m at x3 but Y has no record of sending it. m → orphan message.
  - Y rolls back to y2 → X should roll back to x2
  - If Z rolls back, X and Y have to go to x1 and y1 → domino effect: the rollback of one process causes one or more other processes to roll back
[Figure: space-time diagram of processes X (checkpoints x1, x2, x3), Y (checkpoints y1, y2), and Z (checkpoints z1, z2), with message m]
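The rollback propagation described above can be sketched in Python. This is a minimal illustration, not an algorithm from the lecture: the event indices, the `CHECKPOINTS` and `MESSAGES` tables, and the function name `recovery_line` are all assumptions, chosen so that the result matches the two cases on the slide (Y fails, then Z fails). A checkpoint at index c is assumed to record every local event numbered up to c, and every process is assumed to have an initial checkpoint at index 0.

```python
import math

# Hypothetical timeline: checkpoint positions given as local event indices.
CHECKPOINTS = {
    "X": [0, 3, 7],  # x1, x2, x3
    "Y": [0, 4],     # y1, y2
    "Z": [0, 5],     # z1, z2
}

# (sender, send_index, receiver, receive_index).
# Only m comes from the slide; the other two messages are invented
# so that a rollback of Z triggers the domino effect.
MESSAGES = [
    ("Y", 5, "X", 6),  # m: sent after y2, received between x2 and x3
    ("Y", 2, "X", 2),
    ("Z", 6, "Y", 3),
]


def recovery_line(checkpoints, messages, failed):
    """Return, per process, the checkpoint index it must restart from.

    math.inf means the process keeps its current state (no rollback).
    """
    # The failed process restarts from its latest checkpoint.
    cut = {p: (max(cps) if p == failed else math.inf)
           for p, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, s_idx, receiver, r_idx in messages:
            # Orphan message: the receive is remembered, the send is not.
            if s_idx > cut[sender] and r_idx <= cut[receiver]:
                # Roll the receiver back to its latest checkpoint
                # taken before the receive event.
                cut[receiver] = max(c for c in checkpoints[receiver] if c < r_idx)
                changed = True
    return cut


print(recovery_line(CHECKPOINTS, MESSAGES, "Y"))  # X falls back to x2
print(recovery_line(CHECKPOINTS, MESSAGES, "Z"))  # X and Y fall back to x1 and y1
```

Each pass removes one orphan message by rolling its receiver back, which may create new orphans; the loop stops when none remain. That repeated propagation is the domino effect.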
Messages Lost
- If Y fails after receiving m, it will roll back to y1
- X will roll back to x1
- m will be a lost message, as X has recorded it as sent and Y has no record of receiving it
[Figure: space-time diagram of X (checkpoint x1) and Y (checkpoint y1) with message m; Y's failure is marked after it receives m]
Livelocks
[Figure: two space-time diagrams of X (checkpoint x1) and Y (checkpoint y1). First panel: messages n1 and m1, with the failure marked. Second panel: messages n1, n2, and m2, with the 2nd rollback marked.]
- Y crashes before receiving n1. Y rolls back to y1 → X rolls back to x1
- Y recovers, receives n1, and sends m2
- X recovers and sends n2, but has no record of sending n1
- Hence, Y is forced to roll back a second time. X also rolls back, as it has received m2 but Y has no record of m2
- The above sequence can repeat indefinitely, causing a livelock
Consistent Checkpoints
[Figure: space-time diagram of processes X (checkpoints x1, x2, x3), Y (checkpoints y1, y2), and Z (checkpoints z1, z2), with message m]
- Overcoming the domino effect and livelocks: checkpoints should not have messages in transit.
- Strongly consistent checkpoints: no message exchange between any pair of processes in the set, or between a process in the set and one outside it, during the interval spanned by the checkpoints.
- {x1, y1, z1} is a strongly consistent checkpoint
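A set of checkpoints can be checked against the messages exchanged, as in the hedged sketch below. It assumes the usual three-way distinction: a set is strongly consistent if no message crosses it, consistent if the only crossing messages are recorded as sent but not yet received, and inconsistent if any message is recorded as received but not sent. The function name `classify_cut`, the event indices, and the example messages are hypothetical; a checkpoint at index c is assumed to record all local events numbered up to c.

```python
def classify_cut(cut, messages):
    """Classify a set of checkpoints, one per process.

    cut      -- dict: process name -> checkpoint position (local event index)
    messages -- list of (sender, send_index, receiver, receive_index)
    """
    in_transit = False
    for sender, s_idx, receiver, r_idx in messages:
        sent_recorded = s_idx <= cut[sender]
        recv_recorded = r_idx <= cut[receiver]
        if recv_recorded and not sent_recorded:
            # Orphan message: received according to the checkpoints, never sent.
            return "inconsistent"
        if sent_recorded and not recv_recorded:
            # Message in transit across the checkpoints.
            in_transit = True
    return "consistent" if in_transit else "strongly consistent"


# Hypothetical timeline for three processes X, Y, Z.
messages = [
    ("Y", 5, "X", 6),  # m
    ("X", 2, "Z", 6),  # an extra, invented message
]

print(classify_cut({"X": 0, "Y": 0, "Z": 0}, messages))  # initial checkpoints
print(classify_cut({"X": 3, "Y": 4, "Z": 5}, messages))  # one message in transit
print(classify_cut({"X": 7, "Y": 4, "Z": 5}, messages))  # m is an orphan
```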
Types of CRR Algorithms
- Synchronous Algorithm
  - Two-phase algorithm proposed by Koo and Toueg
- Asynchronous Algorithm
  - A simple algorithm proposed by Juang & Venkatesan
Consistent Set of Checkpoints
Assumptions:
- Checkpoint and send/recv are atomic
- Take a checkpoint after sending every message
- The set of the most recent checkpoints is always consistent
- Why? Is it strongly consistent?
- What is the main problem with this approach?
- Take a checkpoint after every K messages sent?
- Is it still consistent?
Synchronous Checkpointing Algo
- Proposed by Koo and Toueg [1] (1987)
- Assumptions:
  - Processes communicate by exchanging messages through channels
  - Channels are FIFO; end-to-end protocols cope with message loss due to rollback recovery
  - Communication failures do not partition the network
- Uses two kinds of checkpoints:
  - Tentative
  - Permanent
[1] R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," in IEEE Transactions on Software Engineering, vol. SE-13, no. 1, pp. 23-31, Jan. 1987. doi: 10.1109/TSE.1987.232562
Phase - 1
- Initiator: takes a tentative checkpoint
- Initiator requests all other processes to take tentative checkpoints
- All other processes:
  - can respond 'yes' or 'no'
- Initiator: decides to make the checkpoints permanent if everyone has responded 'yes'
- A process can fail to take a checkpoint due to the nature of the application, e.g., lack of log space or unrecoverable transactions
Phase - 2
- If all processes took tentative checkpoints, the initiator P_i decides to make the checkpoints permanent.
- Otherwise, the checkpoints are to be discarded.
- P_i conveys this decision to all the processes: whether the checkpoints are to be made permanent or discarded
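The two phases can be sketched as a single-machine simulation. This is an illustrative sketch only, not the Koo and Toueg protocol as published: it has no real message passing, failures, or timeouts, and the names `Process`, `can_checkpoint`, and `coordinated_checkpoint` are hypothetical.

```python
class Process:
    """A toy process whose whole state is a single integer."""

    def __init__(self, name, can_checkpoint=True):
        self.name = name
        # Hypothetical flag modelling a refusal, e.g. lack of log space.
        self.can_checkpoint = can_checkpoint
        self.state = 0
        self.tentative = None   # tentative checkpoint, if any
        self.permanent = 0      # last permanent checkpoint

    def take_tentative(self):
        """Phase 1: try to take a tentative checkpoint; reply yes (True) or no (False)."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        return True

    def finish(self, commit):
        """Phase 2: make the tentative checkpoint permanent, or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None


def coordinated_checkpoint(initiator, others):
    # Phase 1: the initiator takes a tentative checkpoint and asks the others.
    votes = [initiator.take_tentative()]
    votes += [p.take_tentative() for p in others]
    commit = all(votes)
    # Phase 2: the initiator conveys the decision to every process.
    for p in [initiator, *others]:
        p.finish(commit)
    return commit


a, b, c = Process("A"), Process("B"), Process("C", can_checkpoint=False)
a.state, b.state, c.state = 10, 20, 30
print(coordinated_checkpoint(a, [b, c]))   # C answers no, so nothing is committed
print(a.permanent, b.permanent, c.permanent)
```

Because every process either commits or discards in phase 2, either all or none of the processes end up with a new permanent checkpoint.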
Potential Issues
- Between taking a tentative checkpoint and the commit/abort of that checkpoint, a process must hold back messages.
- Does this guarantee we have a strongly consistent state?
- Can you construct an example that shows we can still have lost messages?
Synchronous Checkpointing: Properties
- All or none of the processes take permanent checkpoints
- There is no record of a message being received but not sent
- Checkpoints may be taken unnecessarily (Give an example!)
- Can these unnecessary checkpoints be avoided?
Optimizing Checkpoints
Main idea:
- Record all messages sent and received after the last checkpoint (last_recv(x, y), first_sent(x, y))
- When X requests Y to take a tentative checkpoint:
  - X includes last_recv(x, y), the last message it received from Y, with the request
  - Y takes a tentative checkpoint only if the last message received by X from Y was sent no earlier than the first message Y sent to X after Y's last checkpoint (happened-before!): last_recv(x, y) ≥ first_sent(y, x)
- When a process takes a checkpoint, it will ask all other processes that sent messages to it to take checkpoints.
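The test that Y applies can be written as a small predicate. The sketch below assumes that each message carries a sequence number that increases with every message a process sends, and that `None` stands for "no such message since the last checkpoint"; both conventions, and the name `must_checkpoint`, are assumptions made for illustration.

```python
from typing import Optional


def must_checkpoint(last_recv_x_from_y: Optional[int],
                    first_sent_y_to_x: Optional[int]) -> bool:
    """Decide whether Y must take a tentative checkpoint on X's request.

    last_recv_x_from_y -- sequence number of the last message X received
                          from Y since X's last checkpoint (None if none)
    first_sent_y_to_x  -- sequence number of the first message Y sent to X
                          since Y's last checkpoint (None if none)
    """
    if last_recv_x_from_y is None or first_sent_y_to_x is None:
        # X's tentative checkpoint records no message that Y sent after
        # Y's last checkpoint, so Y's last checkpoint is still usable.
        return False
    # last_recv(x, y) >= first_sent(y, x)
    return last_recv_x_from_y >= first_sent_y_to_x


print(must_checkpoint(5, 3))     # X has seen a message Y sent after its checkpoint
print(must_checkpoint(2, 3))     # X has only seen older messages from Y
print(must_checkpoint(None, 3))  # X received nothing from Y
```

When the predicate is false, Y can skip the checkpoint without risking an orphan message, which is how unnecessary checkpoints are avoided.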
Rollback Recovery: Properties
- There are two phases: Phase 1 and Phase 2
- Assume that between the request to roll back and the decision, no process sends other messages
- All or none of the processes restart from their checkpoints
- After rollback, all processes resume in a consistent state
- Unnecessary rollbacks can occur: a technique similar to the one used when taking checkpoints can eliminate them
Rollback Recovery
- Phase 1
  - Initiator: checks whether all processes are willing to restart from their last checkpoints
  - Others: may reply 'yes' or 'no'
- Phase 2
  - Initiator: propagates the go/no-go decision to all processes
  - Others: carry out the decision of the initiator
Unnecessary Rollbacks
- Avoid rollback in unnecessary situations?
- An example
- (z2 does not need to roll back. Why?)
Disadvantages
- The checkpointing algorithm generates message traffic
- Synchronization delays are introduced
- These costs may seem high if failures between checkpoints are unlikely