Fault Tolerance in Charm++/AMPI Sayantan Chakravorty PPL, UIUC April 19, 2007 Sayantan Chakravorty (PPL, UIUC) Fault Tolerance in Charm++/AMPI April 19, 2007 1 / 25

Outline
1. Motivation
2. Background
3. Checkpoint-based
  ◮ Co-ordinated disk-based
  ◮ In-memory double checkpoint
4. Message Logging
5. Pro-active fault tolerance
6. Summary

Motivation
Larger machines are available, both clusters and proprietary systems
MTBF decreases as the size of machines increases
Long-running applications have to tolerate faults
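The MTBF point can be made concrete with a back-of-the-envelope model. The sketch below assumes independent node failures with a constant failure rate, which is an assumption of this example and not something stated on the slide. Under that assumption failure rates add, so system MTBF is the node MTBF divided by the node count. The function name and the 5-year node MTBF are illustrative.

```python
def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """System MTBF assuming independent, exponentially distributed
    node failures: failure rates add, so MTBF divides by the node count."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return node_mtbf_hours / num_nodes


if __name__ == "__main__":
    node_mtbf = 5 * 365 * 24  # hypothetical: one failure per node every 5 years
    for n in (1, 1_000, 100_000):
        print(n, "nodes ->", round(system_mtbf_hours(node_mtbf, n), 2), "hours")
```

With these illustrative numbers, a machine with 100,000 nodes would see a failure in well under an hour, which is why a long-running job cannot simply hope to finish between faults.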

Background
Checkpoint
  ◮ Coordinated: Cocheck, Starfish, Clip
  ◮ Uncoordinated: suffers from cascading rollbacks
  ◮ Communication-induced: does not scale well
Message Logging
  ◮ Pessimistic: MPICH-V1, MPICH-V2, etc.
  ◮ Optimistic: cascading rollback, complicated recovery
  ◮ Causal Logging: causality tracking, Manetho, MPICH-V3
Hybrid: Schultz et al., Bronevetsky et al.

Solutions in Charm++
Reactive: react to a fault
  ◮ Disk based
  ◮ In-memory
  ◮ Message logging with fast recovery
Pro-active: act before a fault
  ◮ Fault prediction
  ◮ Evacuate processors after a fault is predicted

Disk-based Checkpoint
Blocking Coordinated Checkpoint
  ◮ State of the chares is checkpointed to the parallel file system
  ◮ Collective call: MPI_Checkpoint(DIRNAME)
Restart
  ◮ Whole job is restarted
  ◮ Same job can be restarted on a different number of processors
  ◮ Runtime flag: +restart DIRNAME
Simple yet effective for common cases
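The reason a job can restart on a different number of processors is that state is saved per object, not per processor. The toy model below illustrates that idea only; it is not the Charm++ or AMPI implementation. The functions `checkpoint` and `restart`, the `obj_<id>.json` file layout, and the round-robin placement are all assumptions of this sketch, and a single function call stands in for the blocking, coordinated step.

```python
import json
import os
import tempfile


def checkpoint(processors, dirname):
    """Write every object's state to its own file in dirname.

    processors: list of dicts mapping object id -> JSON-serializable state.
    One call covers all processors, standing in for the coordinated step.
    """
    os.makedirs(dirname, exist_ok=True)
    for objects in processors:
        for obj_id, state in objects.items():
            with open(os.path.join(dirname, f"obj_{obj_id}.json"), "w") as f:
                json.dump(state, f)


def restart(dirname, num_processors):
    """Rebuild the whole job from dirname on num_processors, which may
    differ from the processor count at checkpoint time."""
    processors = [dict() for _ in range(num_processors)]
    names = sorted(n for n in os.listdir(dirname) if n.startswith("obj_"))
    for i, name in enumerate(names):
        obj_id = int(name[len("obj_"):-len(".json")])
        with open(os.path.join(dirname, name)) as f:
            # Hypothetical placement policy: round-robin over processors.
            processors[i % num_processors][obj_id] = json.load(f)
    return processors


if __name__ == "__main__":
    # Four objects on two processors, restarted on three.
    job = [{0: {"step": 10}, 1: {"step": 10}}, {2: {"step": 10}, 3: {"step": 10}}]
    with tempfile.TemporaryDirectory() as d:
        checkpoint(job, d)
        restarted = restart(d, 3)
    print([sorted(p) for p in restarted])
```

Because the files are keyed by object and not by processor rank, the restart step is free to choose any placement for the new processor count.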

Drawbacks of disk-based checkpoint
Checkpoints to the parallel file system are slow
High recovery time:
  ◮ Time between the last checkpoint and the crash
  ◮ Time to resubmit the job and have it run

In-memory Double Checkpoint: Checkpoint
Coordinated checkpoint
Each object maintains 2 checkpoints:
  ◮ On the local processor
  ◮ On a remote buddy processor
Checkpoints are stored in memory

In-memory Double Checkpoint: Restart
A dummy process is created to replace the crashed processor
The new process starts recovery on the other processors
Other processors
  ◮ Remove all objects
  ◮ Use the buddy's checkpoint to recreate the objects from the crashed processor
  ◮ Recreate their own objects from the local copy of the checkpoint
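The checkpoint and restart steps of the double in-memory scheme can be sketched as a small simulation. This is a toy model, not Charm++ code: the `Processor` class, the `buddy_of` mapping (next rank, wrapping around), and the step where the predecessor re-sends its copy to the replacement are assumptions made for illustration. It assumes at least two processors.

```python
import copy


class Processor:
    def __init__(self, rank):
        self.rank = rank
        self.objects = {}     # live object state: id -> state
        self.local_ckpt = {}  # checkpoint of this processor's own objects
        self.buddy_ckpt = {}  # checkpoint held on behalf of another processor


def buddy_of(rank, num_procs):
    """Hypothetical buddy mapping: the next rank, wrapping around."""
    return (rank + 1) % num_procs


def checkpoint(procs):
    """Coordinated checkpoint: each processor keeps a local copy and
    places a second copy on its buddy. Both copies stay in memory."""
    n = len(procs)
    for p in procs:
        p.local_ckpt = copy.deepcopy(p.objects)
        procs[buddy_of(p.rank, n)].buddy_ckpt = copy.deepcopy(p.objects)


def recover(procs, crashed_rank):
    """Replace the crashed processor and roll every processor back."""
    n = len(procs)
    # A fresh, empty process stands in for the crashed processor.
    new = Processor(crashed_rank)
    procs[crashed_rank] = new
    # Only the replacement fetches a checkpoint across the network:
    # the copy of its objects that its buddy has been holding.
    holder = procs[buddy_of(crashed_rank, n)]
    new.objects = copy.deepcopy(holder.buddy_ckpt)
    new.local_ckpt = copy.deepcopy(holder.buddy_ckpt)
    # Survivors remove their objects and recreate them from local copies.
    for p in procs:
        if p.rank != crashed_rank:
            p.objects = copy.deepcopy(p.local_ckpt)
    # Assumption of this sketch: the predecessor re-sends the copy that
    # the crashed processor had been holding for it.
    pred = procs[(crashed_rank - 1) % n]
    new.buddy_ckpt = copy.deepcopy(pred.local_ckpt)


if __name__ == "__main__":
    procs = [Processor(r) for r in range(3)]
    for p in procs:
        p.objects = {f"obj{p.rank}": {"iter": 0}}
    checkpoint(procs)
    for p in procs:
        p.objects[f"obj{p.rank}"]["iter"] = 7  # work done after the checkpoint
    recover(procs, crashed_rank=1)
    print([p.objects for p in procs])
```

Note that every processor returns to the checkpointed state, not just the crashed one, so the work done after the checkpoint is lost everywhere.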

In-memory Double Checkpoint: Pros and Cons
Advantages:
  ◮ Faster checkpoints than disk based
  ◮ Reading checkpoints during recovery is also faster
  ◮ Only one processor fetches its checkpoint across the network
Drawbacks:
  ◮ High memory overhead
  ◮ All processors are rolled back even if only one crashes
  ◮ All the work since the last checkpoint is redone on all processors
  ◮ Recovery time: time between the crash and the previous checkpoint

Message logging
Only processed messages affect the state of a processor
After a crash, reprocess old messages to regain lost state
Messages are stored during execution
After a crash, only crashed processors are rolled back
Other processors resend their messages
Caveat: the state of a processor is affected by the sequence of messages as well
  ◮ Message processing sequence needs to be stored
  ◮ Processors need to ignore messages they have already processed
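The caveat about ordering and duplicates can be shown with a small simulation. This is a generic sender-based logging sketch, not the protocol used in Charm++: the `Sender` and `Receiver` classes, the per-sender sequence numbers, and the separately stored `order_log` are assumptions made for illustration. The state update is deliberately order-dependent so that replaying in the wrong order gives a different result.

```python
class Sender:
    def __init__(self, name):
        self.name = name
        self.next_seq = 0
        self.log = []  # every message sent, kept for a possible resend

    def send(self, payload):
        msg = (self.name, self.next_seq, payload)
        self.next_seq += 1
        self.log.append(msg)
        return msg


class Receiver:
    def __init__(self):
        self.state = 0
        self.processed = set()  # (sender, seq) pairs already applied

    def process(self, msg):
        sender, seq, payload = msg
        if (sender, seq) in self.processed:
            return False  # duplicate: already reflected in the state
        self.state = self.state * 31 + payload  # order-dependent update
        self.processed.add((sender, seq))
        return True


def recover(senders, order_log):
    """Rebuild a crashed receiver by replaying the senders' logged
    messages in the originally recorded processing order."""
    resent = {(s, q): (s, q, p) for snd in senders for (s, q, p) in snd.log}
    fresh = Receiver()  # stands in for state restored from a checkpoint
    for key in order_log:
        fresh.process(resent[key])
    return fresh


if __name__ == "__main__":
    a, b = Sender("a"), Sender("b")
    receiver = Receiver()
    order_log = []  # assumed to be stored away from the receiver
    for msg in (a.send(1), b.send(2), a.send(3)):
        if receiver.process(msg):
            order_log.append(msg[:2])
    receiver.process(a.log[0])  # a resent duplicate is ignored
    recovered = recover([a, b], order_log)
    print(recovered.state == receiver.state)
```

Only the receiver is rebuilt here; the senders keep running and merely resend from their logs, which is the property that distinguishes message logging from the checkpoint-only schemes.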

Message logging: Challenges
All the work of the crashed processor is redone by one processor
Recovery time: same as checkpoint/restart
Most parallel applications are tightly coupled
Other processors have to wait for the crashed processor to recover
Fault-free overhead is often high

Message logging: Objectives
Fast recovery: faster than the time between the crash and the previous checkpoint
Do not assume stable storage
Tolerate all single and most multiple processor faults
Low performance penalty for the fault-free case

Message logging: Our idea
During restart, distribute the work of the restarted processor among the waiting processors
How can the work on one processor be divided?
