Combining checkpointing and replication for reliable execution of - PowerPoint PPT Presentation

Combining checkpointing and replication for reliable execution of linear workflows
Anne Benoit (1, 2), Aurélien Cavelan (3), Florina M. Ciorba (3), Valentin Le Fèvre (1), Yves Robert (1, 4)
1. LIP, Ecole Normale Supérieure de Lyon, France
2. Georgia Institute of Technology, Atlanta, GA, USA
3. University of Basel, Switzerland
4. University of Tennessee, Knoxville, TN, USA
http://graal.ens-lyon.fr/~abenoit/
ICL Lunch Talk, UT Knoxville, June 1st, 2018
Contact: Anne.Benoit@ens-lyon.fr
Sections: Introduction, Model, DP Algo, Experiments, Conclusion

Linear workflows
High-performance computing (HPC) application: chain of tasks $T_1 \to T_2 \to \cdots \to T_n$
Parallel tasks executed on the whole platform
For instance: tightly-coupled computational kernels, image processing applications, ...
Goal: efficient execution, i.e., minimize total execution time

Reliable execution
Hierarchical
• $10^5$ or $10^6$ nodes
• Each node equipped with $10^4$ or $10^3$ cores
Failure-prone
MTBF, one node:                  1 year    10 years    120 years
MTBF, platform of 10^6 nodes:    30 sec    5 min       1 h
More nodes ⇒ Shorter MTBF (Mean Time Between Failures)
Need to ensure that the execution will be reliable, i.e., without failures
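The platform row of the MTBF table can be checked with a short calculation. The sketch below assumes that node failures are independent and exponentially distributed, so the platform MTBF is the node MTBF divided by the number of nodes; the function name and the 365-day year are illustrative choices, not taken from the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a 365-day year


def platform_mtbf_seconds(node_mtbf_years: float, num_nodes: int) -> float:
    """Platform MTBF in seconds, assuming independent exponential node failures.

    With N independent nodes, failure rates add up, so the platform MTBF
    is the individual node MTBF divided by N.
    """
    return node_mtbf_years * SECONDS_PER_YEAR / num_nodes


if __name__ == "__main__":
    for years in (1, 10, 120):
        seconds = platform_mtbf_seconds(years, 10**6)
        print(f"node MTBF {years:>3} years -> platform MTBF {seconds:8.1f} s")
```

For $10^6$ nodes this gives roughly half a minute, five minutes, and one hour, in line with the rounded values in the table.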

Coping with fail-stop errors with checkpoints
Checkpoint, rollback, and recovery:
[Figure: three execution timelines.
 (no error) T1 C1 T2 T3 C3 T4 C4
 (error) the same sequence, with a fail-stop error marked
 (error) T1 C1 T2 T3 R2 T2 T3 C3 ···, i.e., after the error, recovery R2 is followed by re-execution of T2 and T3]
Coordinated checkpointing (the platform is a giant macro-processor)
Assume instantaneous interruption and detection
Rollback to last checkpoint and re-execute

Coping with fail-stop errors with replication
[Figure: three execution timelines for a chain T1, ..., T5 with checkpoints C1, C3, C5. T1 and T4 are replicated, each copy running on p/2 processors; T2, T3 and T5 run on all p processors. Two of the timelines are marked with a fail-stop error.]
The whole platform is used at all times; some tasks are replicated
If a failure hits a replicated task, no need to rollback
Otherwise, rollback to last checkpoint and re-execute

Contributions
Both checkpointing and replication have been extensively studied
Combination of both techniques not yet investigated
• Detailed model
• Optimal dynamic programming algorithm
• Experiments to evaluate impact of using both replication and checkpointing during execution
• Guidelines about when to checkpoint only, replicate only, or combine both techniques

Outline
1. Model and objective
2. Optimal dynamic programming algorithm
3. Experiments
4. Conclusion

Application and platform model
Application:
• Chain $T_1 \to T_2 \to \cdots \to T_n$
• Parallel tasks: (failure-free) execution time of $T_i$ using $q_i$ processors is $w_i \left( \alpha_i + \frac{1 - \alpha_i}{q_i} \right)$ (Amdahl's law)
Platform:
• Homogeneous platform with $p$ processors $P_i$, $1 \le i \le p$
• Fail-stop errors, Exponential distribution, error rate $\lambda_{ind}$
• $P(X \le T) = 1 - e^{-q \lambda_{ind} T}$ on $q$ processors
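As a minimal sketch of the two formulas on this slide, the functions below evaluate the Amdahl's-law execution time of a task and the probability of at least one fail-stop error on $q$ processors before time $T$. The function and parameter names are illustrative, not from the talk.

```python
import math


def exec_time(w: float, alpha: float, q: int) -> float:
    """Failure-free time of a task with sequential time w on q processors.

    alpha is the sequential fraction in Amdahl's law:
    w * (alpha + (1 - alpha) / q).
    """
    return w * (alpha + (1.0 - alpha) / q)


def failure_probability(q: int, lam_ind: float, t: float) -> float:
    """P(X <= t) = 1 - exp(-q * lam_ind * t) on q processors.

    lam_ind is the error rate of one processor. expm1 keeps precision
    when q * lam_ind * t is very small.
    """
    return -math.expm1(-q * lam_ind * t)


if __name__ == "__main__":
    t = exec_time(w=100.0, alpha=0.1, q=10)
    print("execution time:", t)
    print("failure probability:", failure_probability(10, 1e-6, t))
```

Using more processors shortens the task but raises the aggregate error rate $q \lambda_{ind}$, which is the trade-off that replication on $p/2$ processors plays with.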

Checkpointing
Checkpointing time: $C_i(q_i) = a_i + \frac{b_i}{q_i} + c_i q_i$
• $a_i + \frac{b_i}{q_i}$: communication time with latency $a_i$
• $c_i q_i$: message passing overhead
Downtime $D$
Recovery cost $R_{j+1}$ (where $T_j$ is the last checkpointed task)
$R_{i+1}(q_i) = C_i(q_i)$ for $1 \le i \le n - 1$: recovering for $T_{i+1}$ ≈ reading $C_i$
$T_0$ with $w_0 = 0$ checkpointed (input time $R_1(q_1)$)
$T_n$ always checkpointed (output time $C_n(q_n)$)
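The checkpoint cost model can be sketched as below. This assumes the middle term is $b_i / q_i$ (a data volume spread over $q_i$ processors), which is how the flattened slide text is read here; the function names are hypothetical.

```python
def checkpoint_time(a: float, b: float, c: float, q: int) -> float:
    """C(q) = a + b / q + c * q.

    a: latency, b / q: transfer time shared across q processors
    (assumed reading of the slide), c * q: message passing overhead.
    """
    return a + b / q + c * q


def recovery_time(a: float, b: float, c: float, q: int) -> float:
    """Recovery for the next task is modeled as reading the checkpoint,
    so it costs the same as writing it: R_{i+1}(q) = C_i(q)."""
    return checkpoint_time(a, b, c, q)


if __name__ == "__main__":
    for q in (10, 100, 1000):
        print(q, checkpoint_time(a=1.0, b=100.0, c=0.01, q=q))
```

Under this reading the cost is not monotonic in $q$: the $b/q$ term shrinks while the $c\,q$ term grows as processors are added.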

No replication
$T_i$ not replicated: costs $C_i^{norep}$ and $R_i^{norep}$
Failure-free execution time: $T_i^{norep} = w_i \left( \alpha_i + \frac{1 - \alpha_i}{p} \right)$
Expected execution time $E^{norep}(i)$:
$E^{norep}(i) = P(X_p \le T_i^{norep}) \left( T_{lost}^{norep}(T_i^{norep}) + D + R_i^{norep} + E^{norep}(i) \right) + \left( 1 - P(X_p \le T_i^{norep}) \right) T_i^{norep}$
$P(X_p \le t) = 1 - e^{-\lambda_{ind} p t}$: probability of failure on one of the $p$ processors before time $t$
$T_{lost}^{norep}(T_i^{norep}) = \frac{1}{\lambda_{ind} p} - \frac{T_i^{norep}}{e^{\lambda_{ind} p T_i^{norep}} - 1}$
$E^{norep}(i) = \left( e^{\lambda_{ind} p T_i^{norep}} - 1 \right) \left( \frac{1}{\lambda_{ind} p} + D + R_i^{norep} \right)$
If $T_i$ is checkpointed, add $C_i^{norep}$
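The closed form for $E^{norep}(i)$ follows from solving the recursive equation for $E^{norep}(i)$ and substituting the expected lost time $\frac{1}{\lambda_{ind} p} - \frac{T_i^{norep}}{e^{\lambda_{ind} p T_i^{norep}} - 1}$. The sketch below evaluates both sides numerically to check that they agree. Function names and the sample values are illustrative assumptions, and the checkpoint cost is left out.

```python
import math


def failure_probability(lam: float, t: float) -> float:
    """P(X_p <= t) = 1 - exp(-lam * t), with lam = lam_ind * p."""
    return -math.expm1(-lam * t)


def expected_lost_time(lam: float, t: float) -> float:
    """Expected time lost when a failure strikes within a segment of
    length t: 1 / lam - t / (exp(lam * t) - 1)."""
    return 1.0 / lam - t / math.expm1(lam * t)


def expected_time_norep(t: float, lam_ind: float, p: int,
                        downtime: float, recovery: float) -> float:
    """Closed form: (exp(lam * t) - 1) * (1 / lam + D + R), lam = lam_ind * p."""
    lam = lam_ind * p
    return math.expm1(lam * t) * (1.0 / lam + downtime + recovery)


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the formulas.
    t, lam_ind, p, D, R = 100.0, 1e-6, 1000, 10.0, 5.0
    lam = lam_ind * p
    e = expected_time_norep(t, lam_ind, p, D, R)
    prob = failure_probability(lam, t)
    # Right-hand side of the recursive definition, using e itself.
    rhs = prob * (expected_lost_time(lam, t) + D + R + e) + (1.0 - prob) * t
    print("closed form:", e, "recursive right-hand side:", rhs)
```

The two printed values should match up to floating-point error, and the expected time always exceeds the failure-free time $T_i^{norep}$.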
