Check Pointing and Rollback Recovery
Course: Distributed Computing
Faculty: Dr. Rajendra Prasath
Spring 2019

About this topic
This course covers various concepts in checkpointing and rollback recovery, with a focus on the essential aspects of both in distributed contexts.
Rajendra, IIIT Sri City

RECAP: What did you learn so far?
- Challenges in Message Passing systems
- Distributed Sorting
- Space-Time Diagram
- Partial Ordering / Causal Ordering
- Concurrent Events
- Local Clocks and Vector Clocks
- Distributed Snapshots
- Termination Detection
- Topology Abstraction and Overlays
- Leader Election Problem in Rings
- Message Ordering / Group Communications
- Distributed Mutual Exclusion Algorithms

Topics to focus on for End Semester
- Distributed Mutual Exclusion
- Deadlock Detection
- Check Pointing and Rollback Recovery
- Self-Stabilization
- Distributed Consensus
- Reasoning with Knowledge
- Peer-to-peer computing and Overlays
- Authentication in Distributed Systems

Distributed Mutual Exclusion (Recap)
- No deadlocks: no process should be permanently blocked, waiting for messages (resources) from other sites.
- No starvation: no site should have to wait indefinitely to enter its critical section (CS) while other sites execute the CS more than once.
- Fairness: requests are honored in the order they are made. This means processes have to be able to agree on the order of events. (Fairness prevents starvation.)
- Fault tolerance: the algorithm is able to survive a failure at one or more sites.

Deadlock Illustrated (Recap)
- Vehicular Traffic: a real-time scenario

Dining Philosophers (Recap)
- Each philosopher must alternately think and eat.
- A philosopher can only eat when they have both left and right forks.
- Problem: how to design a discipline of behavior (a concurrent algorithm) such that no philosopher will starve?
- Suggest a simple solution?

Check Pointing and Rollback Recovery
Let us explore checkpointing and rollback recovery algorithms in distributed systems.

Handling Failures / Recovery?
- Failure of a site/node in a distributed system causes inconsistencies in the state of the system.
- Recovery: bringing back the failed node in step with other nodes in the system.
- Failures:
  - Process failure: deadlocks, protection violation, erroneous user input, etc.
  - System failure: failure of processor/system. System failure can have full/partial amnesia. It can be a pause failure (system restarts at the same state it was in before the crash) or a complete halt.
  - Secondary storage failure: data inaccessible.
  - Communication failure: network inaccessible.

Recovery in Concurrent Systems
- In a distributed system, process state involves message exchanges.
- Rolling back one process can therefore force other processes to roll back.
- Orphan messages and the domino effect: assume Y fails after sending m.
  - X has a record of receiving m at x3, but Y has no record of sending it. m becomes an orphan message.
  - Y rolls back to y2 -> X should roll back to x2.
  - If Z rolls back, X and Y have to go to x1 and y1 -> domino effect: the rollback of one process causes one or more other processes to roll back.
[Figure: space-time diagram of processes X, Y, Z with checkpoints x1, x2, x3, y1, y2, z1, z2 and message m]
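
The rollback propagation described above can be sketched in a few lines. This is only an illustration under assumptions that are not in the slides: each process numbers its local events 1, 2, 3, ..., a checkpoint with value c records events 1..c, every process has an initial checkpoint 0, and the example checkpoints and messages are hypothetical (they are not the ones in the original figure). The function name `recovery_line` is likewise made up for this sketch.

```python
import math

def recovery_line(checkpoints, messages, failed):
    """Compute where each process must restart after `failed` crashes.

    checkpoints: dict process -> sorted list of checkpoint values (0 = initial state)
    messages:    list of (sender, send_event, receiver, recv_event)
    Returns a dict process -> restart point; math.inf means "keeps its current state".
    """
    line = {p: math.inf for p in checkpoints}
    line[failed] = checkpoints[failed][-1]   # failed process restarts from its latest checkpoint
    changed = True
    while changed:
        changed = False
        for sender, s_idx, receiver, r_idx in messages:
            # Orphan message: the receive is still recorded, but the send was undone.
            if r_idx <= line[receiver] and s_idx > line[sender]:
                # Roll the receiver back to its latest checkpoint taken before the receive.
                line[receiver] = max(c for c in checkpoints[receiver] if c < r_idx)
                changed = True
    return line

# Hypothetical scenario: Z sends to Y after its last checkpoint, then Y sends to X.
checkpoints = {"X": [0, 1, 3], "Y": [0, 1, 4], "Z": [0, 2]}
messages = [("Z", 3, "Y", 2), ("Y", 3, "X", 2)]

print(recovery_line(checkpoints, messages, "Z"))  # {'X': 1, 'Y': 1, 'Z': 2}
print(recovery_line(checkpoints, messages, "Y"))  # {'X': inf, 'Y': 4, 'Z': inf}
```

In this made-up scenario a failure of Z drags both Y and X back to their first checkpoints (the domino effect), while a failure of Y creates no orphan message and forces nobody else to roll back.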

Messages Lost
- If Y fails after receiving m, it will roll back to y1.
- X will roll back to x1.
- m will be a lost message, as X has recorded it as sent and Y has no record of receiving it.
[Figure: space-time diagram of processes X and Y with checkpoints x1, y1, message m and the failure point]

Livelocks
[Figure: two space-time diagrams of processes X and Y with checkpoints x1, y1; the first shows messages n1, m1 and the failure, the second shows messages n1, n2, m2 and the second rollback]
- Y crashes before receiving n1. Y rolls back to y1 -> X rolls back to x1.
- Y recovers, receives n1 and sends m2.
- X recovers, sends n2 but has no record of sending n1.
- Hence, Y is forced to roll back a second time. X also rolls back, as it has received m2 but Y has no record of m2.
- The above sequence can repeat indefinitely, causing a livelock.

Consistent Checkpoints
[Figure: space-time diagram of processes X, Y, Z with checkpoints x1, x2, x3, y1, y2, z1, z2 and message m]
- Overcoming the domino effect and livelocks: checkpoints should not have messages in transit.
- Consistent checkpoints: no message exchange between any pair of processes in the set, or with processes outside the set, during the interval spanned by the checkpoints.
- {x1, y1, z1} is a strongly consistent checkpoint.
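
A small check can make these conditions concrete. The sketch below assumes the commonly used definitions: a set of checkpoints is consistent if it contains no orphan message (received but not sent), and strongly consistent if it additionally has no message in transit (sent but not received). Local event numbering, the function names and the example values are assumptions made for illustration only.

```python
def classify(cut, messages):
    """Split messages into orphans and in-transit messages for a set of checkpoints.

    cut:      dict process -> number of local events recorded by its checkpoint
    messages: list of (name, sender, send_event, receiver, recv_event)
    """
    orphans, in_transit = [], []
    for name, sender, s_idx, receiver, r_idx in messages:
        sent = s_idx <= cut[sender]          # send is recorded in the sender's checkpoint
        received = r_idx <= cut[receiver]    # receive is recorded in the receiver's checkpoint
        if received and not sent:
            orphans.append(name)
        elif sent and not received:
            in_transit.append(name)
    return orphans, in_transit

def is_consistent(cut, messages):
    orphans, _ = classify(cut, messages)
    return not orphans

def is_strongly_consistent(cut, messages):
    orphans, in_transit = classify(cut, messages)
    return not orphans and not in_transit

# Hypothetical example: Y sends m as its 3rd event, X receives it as its 3rd event.
messages = [("m", "Y", 3, "X", 3)]

print(is_consistent({"X": 4, "Y": 2, "Z": 2}, messages))           # False: m is an orphan
print(is_consistent({"X": 2, "Y": 4, "Z": 2}, messages))           # True: m is only in transit
print(is_strongly_consistent({"X": 2, "Y": 4, "Z": 2}, messages))  # False
print(is_strongly_consistent({"X": 2, "Y": 2, "Z": 2}, messages))  # True
```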

Types of CRR Algorithms
- Synchronous algorithm
  - Two-phase algorithm proposed by Koo and Toueg
- Asynchronous algorithm
  - A simple algorithm proposed by Juang and Venkatesan

Consistent Set of Checkpoints
- Assumptions: checkpoint and send/recv operations are atomic.
- Take a checkpoint after sending every message.
- The set of the most recent checkpoints is always consistent.
  - Why? Is it strongly consistent?
- What is the main problem with this approach?
- Take a checkpoint after every K messages sent?
  - Is it still consistent?

Synchronous Checkpointing Algorithm
- Proposed by Koo and Toueg [1] (1987)
- Assumptions:
  - Processes communicate by exchanging messages through channels.
  - Channels are FIFO; end-to-end protocols cope with message loss due to rollback recovery.
  - Communication failures do not partition the network.
- Uses two kinds of checkpoints:
  - Tentative
  - Permanent
[1] R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," IEEE Transactions on Software Engineering, vol. SE-13, no. 1, pp. 23-31, Jan. 1987. doi: 10.1109/TSE.1987.232562

Phase 1
- Initiator: takes a tentative checkpoint.
- Initiator requests all other processes to take tentative checkpoints.
- All other processes can respond 'yes' or 'no'.
- Initiator: decides to make the checkpoints permanent if everyone has responded 'yes'.
- A process can fail to take a checkpoint due to the nature of the application (e.g., lack of log space, unrecoverable transactions).

Phase 2
- If all processes took tentative checkpoints, the initiator P_i decides to make the checkpoints permanent.
- Otherwise, the checkpoints are to be discarded.
- P_i conveys this decision (make permanent or discard) to all the processes.
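
The two phases can be sketched as a single-machine simulation. This is a simplified illustration, not the algorithm as published by Koo and Toueg: it omits real message passing, the holding back of application messages, timeouts and failures during the protocol. The `Process` class, the `can_checkpoint` flag and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Process:
    name: str
    state: int = 0
    can_checkpoint: bool = True          # hypothetical flag: False models e.g. lack of log space
    tentative: Optional[int] = None
    permanent: Optional[int] = None

    def take_tentative(self) -> bool:
        """Phase 1: try to save the current state; the return value is the 'yes'/'no' reply."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        return True

    def finish(self, commit: bool) -> None:
        """Phase 2: make the tentative checkpoint permanent, or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None

def coordinated_checkpoint(initiator: Process, others: list) -> bool:
    # Phase 1: the initiator checkpoints tentatively and collects replies from everyone else.
    replies = [initiator.take_tentative()] + [p.take_tentative() for p in others]
    decision = all(replies)
    # Phase 2: the initiator conveys the commit/discard decision to all processes.
    for p in [initiator, *others]:
        p.finish(decision)
    return decision

p, q, r = Process("P", state=5), Process("Q", state=7), Process("R", state=9)
print(coordinated_checkpoint(p, [q, r]))   # True: all three checkpoints become permanent

p.state, q.state, r.state = 6, 8, 10
r.can_checkpoint = False
print(coordinated_checkpoint(p, [q, r]))   # False: every tentative checkpoint is discarded
print(p.permanent, q.permanent, r.permanent)  # 5 7 9
```

The second run shows the all-or-none behavior: one 'no' reply leaves the earlier permanent checkpoints untouched.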

Potential Issues
- Between taking a tentative checkpoint and the commit/abort of that checkpoint, a process must hold back messages.
- Does this guarantee we have a strongly consistent state?
- Can you construct an example that shows we can still have lost messages?

Synchronous Checkpointing: Properties
- All or none of the processes take permanent checkpoints.
- There is no record of a message being received but not sent.
- Checkpoints may be taken unnecessarily. (Give an example!)
- Can these unnecessary checkpoints be avoided?

Optimizing Checkpoints
- Main idea: record all messages sent and received after the last checkpoint (last_recv(x, y), first_sent(x, y)).
- When X requests Y to take a tentative checkpoint:
  - X sends the last message received from Y with the request.
  - Y takes a tentative checkpoint only if the last message received by X from Y was sent after Y sent its first message after the last checkpoint (happened before!): last_recv(x, y) ≥ first_sent(y, x)
- When a process takes a checkpoint, it will ask all other processes that sent messages to it to take checkpoints.
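
The test that Y applies can be written as a one-line predicate. The sketch assumes messages on each channel carry increasing sequence numbers and uses `None` when no message was received or sent since the last checkpoint; both conventions, and the function name, are assumptions made for illustration.

```python
from typing import Optional

def must_take_checkpoint(last_recv_x_y: Optional[int],
                         first_sent_y_x: Optional[int]) -> bool:
    """Decide whether Y needs a tentative checkpoint when X requests one.

    last_recv_x_y:  sequence number of the last message X received from Y
                    since X's previous checkpoint (None if there was none)
    first_sent_y_x: sequence number of the first message Y sent to X
                    since Y's previous checkpoint (None if there was none)
    """
    if last_recv_x_y is None or first_sent_y_x is None:
        return False   # X's checkpoint records no message that Y's old checkpoint lacks
    return last_recv_x_y >= first_sent_y_x

print(must_take_checkpoint(7, 5))     # True: X recorded a message Y sent after its checkpoint
print(must_take_checkpoint(4, 5))     # False: everything X received was sent before Y's checkpoint
print(must_take_checkpoint(None, 5))  # False: X received nothing from Y
```

When the predicate is false, Y's previous permanent checkpoint is already consistent with X's new one, so the extra checkpoint can be skipped.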

Rollback Recovery: Properties
- There are two phases: Phase 1 and Phase 2.
- Assume that between the rollback request and the decision, no process sends other messages.
- All or none of the processes restart from checkpoints.
- After rollback, all processes resume in a consistent state.
- Rollbacks can be unnecessary: a technique similar to the one used when taking checkpoints can eliminate unnecessary rollbacks.

Rollback Recovery
- Phase 1
  - Initiator: checks whether all processes are willing to restart from their last checkpoints.
  - Others: may reply 'yes' or 'no'.
- Phase 2
  - Initiator: propagates the go/no-go decision to all processes.
  - Others: carry out the decision of the initiator.
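
A compact sketch of the two rollback phases, under heavy simplification: process state and checkpoints are plain dictionaries, the replies are given up front in a hypothetical `willing` map, and no messages, timeouts or failures are modeled.

```python
def coordinated_rollback(states, checkpoints, willing):
    """Restart every process from its last checkpoint, or none at all.

    states:      dict process -> current state (updated in place on 'go')
    checkpoints: dict process -> last permanent checkpoint
    willing:     dict process -> True ('yes') or False ('no') reply to the initiator
    """
    # Phase 1: the initiator collects the replies of all processes.
    go = all(willing[p] for p in states)
    # Phase 2: the decision is propagated; on 'go' everyone restores its checkpoint.
    if go:
        for p in states:
            states[p] = checkpoints[p]
    return go

states = {"X": 9, "Y": 8, "Z": 4}
checkpoints = {"X": 5, "Y": 7, "Z": 2}

print(coordinated_rollback(states, checkpoints, {"X": True, "Y": False, "Z": True}))  # False
print(states)  # {'X': 9, 'Y': 8, 'Z': 4}
print(coordinated_rollback(states, checkpoints, {"X": True, "Y": True, "Z": True}))   # True
print(states)  # {'X': 5, 'Y': 7, 'Z': 2}
```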

Unnecessary Rollbacks
- Can rollbacks be avoided in unnecessary situations?
- An example: z2 does not need to roll back. Why?

Disadvantages
- The checkpointing algorithm generates message traffic.
- Synchronization delays are introduced.
- These costs may seem high if failures between checkpoints are unlikely.
