Unified Model for Assessing Checkpointing Protocols at Extreme-Scale

George Bosilca (1), Aurélien Bouteiller (1), Elisabeth Brunet (2), Franck Cappello (3), Jack Dongarra (1), Amina Guermouche (4), Thomas Hérault (1), Yves Robert (1, 4), Frédéric Vivien (4), and Dounia Zaidouni (4)

1. University of Tennessee Knoxville, USA
2. Telecom SudParis, France
3. INRIA & University of Illinois at Urbana Champaign, USA
4. Ecole Normale Supérieure de Lyon & INRIA, France

Pittsburgh, June 28, 2012

Motivation

• Very large number of processing elements (e.g., $2^{20}$) ⇒ the probability of failures dramatically increases
• Large application to be executed on the whole platform ⇒ failure(s) will most likely occur before completion!
• Resilience provided through checkpointing
  1. Coordinated protocols
  2. Hierarchical protocols

Which checkpointing protocol to use?

Coordinated checkpointing
• No risk of cascading rollbacks
• No need to log messages
• All processors need to roll back
• Rumor: may not scale to very large platforms

Hierarchical checkpointing
• Need to log inter-group messages
  - Slows down failure-free execution
  - Increases checkpoint size/time
• Only processors from the failed group need to roll back
• Faster re-execution with logged messages
• Rumor: should scale to very large platforms

Outline

1. Protocol Overhead
   • Coordinated checkpointing
   • Hierarchical checkpointing
2. Accounting for message logging
3. Instantiating the model
   • Applications
   • Platforms
4. Experimental results
   • Plotting formulas
   • Simulations


Framework

• Periodic checkpointing policies (of period $T$)
• Independent and identically distributed failures
• Platform failure inter-arrival time: $\mu$
• Tightly-coupled application: progress ⇔ all processors available
• First-order approximation: at most one failure within a period

Waste: fraction of time not spent on useful computations
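To get a feel for when the first-order approximation is reasonable, the sketch below computes the probability of two or more failures within one period. It assumes the number of failures in a period follows a Poisson distribution with mean T/µ; the function name and the value of `mu` are hypothetical, chosen only for illustration.

```python
import math


def prob_at_least_two_failures(T, mu):
    """P(X >= 2) for X ~ Poisson(T / mu), the assumed failure count in one period."""
    lam = T / mu
    # P(X >= 2) = 1 - P(X = 0) - P(X = 1) = 1 - exp(-lam) * (1 + lam)
    return 1.0 - math.exp(-lam) * (1.0 + lam)


mu = 3600.0  # hypothetical platform failure inter-arrival time, in seconds
for ratio in (0.01, 0.05, 0.1, 0.5):
    p = prob_at_least_two_failures(ratio * mu, mu)
    print(f"T = {ratio:.2f} * mu -> P(at least 2 failures) = {p:.5f}")
```

Under this assumption the probability stays below 0.005 as long as the period is at most a tenth of µ, and grows quickly for longer periods.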

Checkpointing cost

[Figure: timeline of computing and checkpointing successive chunks. Legend: time spent working, time spent checkpointing, time spent working with slowdown.]

• Blocking model: while a checkpoint is taken, no computation can be performed.
• Non-blocking model: while a checkpoint is taken, computations are not impacted (e.g., first copy state to RAM, then copy RAM to disk).
• General model: while a checkpoint is taken, computations are slowed down: during a checkpoint of duration $C$, the same amount of computation is done as during a time $\alpha C$ without checkpointing ($0 \le \alpha \le 1$).


Waste in absence of failures

[Figure: timelines of processors P0 to P3 over one period of length $T$, made of a computation phase of length $T - C$ followed by a checkpointing phase of length $C$. Legend: time spent working, time spent checkpointing, time spent working with slowdown.]

• Time elapsed since last checkpoint: $T$
• Amount of computation saved: $(T - C) + \alpha C$

$$\mathrm{Waste}_{\text{coord-nofailure}} = \frac{T - ((T - C) + \alpha C)}{T} = \frac{(1 - \alpha)\,C}{T}$$
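The failure-free waste can be evaluated directly from the period T, the checkpoint duration C and the slowdown factor α. The sketch below is only an illustration: the function name and the numeric values are hypothetical. It relies on the observation that the blocking model corresponds to α = 0 (no work during a checkpoint) and the non-blocking model to α = 1 (work is not impacted).

```python
def waste_no_failure(T, C, alpha):
    """Fraction of a period of length T that is not useful work, without failures."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not 0.0 < C <= T:
        raise ValueError("need 0 < C <= T")
    useful = (T - C) + alpha * C  # computation saved by the checkpoint
    return (T - useful) / T       # equals (1 - alpha) * C / T


T, C = 3600.0, 600.0  # hypothetical period and checkpoint duration, in seconds
for name, alpha in (("blocking", 0.0), ("general", 0.5), ("non-blocking", 1.0)):
    print(f"{name:>12}: alpha = {alpha:.1f}, waste = {waste_no_failure(T, C, alpha):.4f}")
```

With these values the waste ranges from C/T in the blocking case down to zero in the non-blocking case.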

Waste due to failures

[Figure: timelines of processors P0 to P3 around a failure, with segments labeled $T_{\text{lost}}$, $D$, $R$, $\alpha C$, $T - C$, $C$ and spans labeled $T$ and $\Delta$. Legend: time spent working, time spent checkpointing, time spent working with slowdown, downtime, recovery time, re-executing slowed-down work.]

A failure can happen:
1. During the computation phase
2. During the checkpointing phase

Coordinated checkpointing protocol: when one processor is the victim of a failure, all processors lose their work ($T_{\text{lost}}$) and must roll back to the last checkpoint.

Waste due to failures in the computation phase

After the failure, the following phases take place in order:
1. Downtime, of duration $D$.
2. Recovery, of duration $R$. Coordinated checkpointing protocol: all processors must recover from the last checkpoint.
3. Redo, in time $\alpha C$, the work destroyed by the failure that was done in the checkpointing phase (of duration $C$) before the computation phase. No checkpoint is taken in parallel, hence this re-computation is faster than the original computation.
4. Re-execute the computation phase, of duration $T - C$.
5. Finally, the checkpointing phase, of duration $C$, is executed.

First-order approximation: we assume that no other failure occurs during the re-execution.

• Re-Exec: $\Delta - T = T_{\text{lost}} + \alpha C$
• Expectation: $T_{\text{lost}} = \frac{1}{2}(T - C)$

$$\text{Re-Exec}_{\text{coord-fail-in-work}} = \frac{T - C}{2} + \alpha C$$

Waste due to failures

• Failure in the computation phase (probability: $\frac{T - C}{T}$)

$$\text{Re-Exec}_{\text{coord-fail-in-work}} = \frac{T - C}{2} + \alpha C$$

• Failure in the checkpointing phase (probability: $\frac{C}{T}$)

$$\text{Re-Exec}_{\text{coord-fail-in-checkpoint}} = T - \frac{C}{2} + \alpha C$$

Weighting each case by its probability:

$$\text{Re-Exec}_{\text{coord}} = \frac{T - C}{T}\left(\frac{T - C}{2} + \alpha C\right) + \frac{C}{T}\left(T - \frac{C}{2} + \alpha C\right) = \alpha C + \frac{T}{2}$$
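The simplification to αC + T/2 is easy to check numerically. The sketch below weights the two re-execution costs by their probabilities and compares the result with the closed form, using exact rational arithmetic to avoid rounding noise. The function names and sample values are hypothetical.

```python
from fractions import Fraction


def re_exec_in_work(T, C, alpha):
    """Expected re-execution time when the failure hits the computation phase."""
    return (T - C) / 2 + alpha * C


def re_exec_in_checkpoint(T, C, alpha):
    """Expected re-execution time when the failure hits the checkpointing phase."""
    return T - C / 2 + alpha * C


def expected_re_exec(T, C, alpha):
    """Probability-weighted average of the two cases."""
    p_work = (T - C) / T  # failure falls in the computation phase
    p_ckpt = C / T        # failure falls in the checkpointing phase
    return (p_work * re_exec_in_work(T, C, alpha)
            + p_ckpt * re_exec_in_checkpoint(T, C, alpha))


# Hypothetical sample values, kept as exact fractions.
T, C, alpha = Fraction(100), Fraction(10), Fraction(3, 10)
assert expected_re_exec(T, C, alpha) == alpha * C + T / 2
print(expected_re_exec(T, C, alpha))  # prints 53
```

A check on one set of values is not a proof, but it catches sign and factor errors when transcribing the two cases.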

Overall waste

$$\mathrm{Waste}_{\text{coord}} = \mathrm{Waste}_{\text{coord-nofailure}} + \frac{1}{\mu}\left(D + R + \text{Re-Exec}_{\text{coord}}\right) = \frac{(1 - \alpha)\,C}{T} + \frac{1}{\mu}\left(D + R + \alpha C + \frac{T}{2}\right)$$

Minimize $\mathrm{Waste}_{\text{coord}}$ subject to:
• $C \le T$ (by construction)
• $T \le 0.1\mu$ (⇒ $\mathrm{Proba}(\mathrm{Poisson}(T/\mu) \ge 2) \le 0.005$)
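One way to carry out this minimization is sketched below. It is not taken from the slides: it relies on the observation that, for T > 0, the waste is the sum of a term proportional to 1/T and a term linear in T, hence convex, so the constrained minimum is the unconstrained stationary point T = sqrt(2(1 − α)Cµ) clamped to the interval [C, 0.1µ]. Function names and sample values are hypothetical.

```python
import math


def waste_coord(T, C, D, R, mu, alpha):
    """Overall waste of coordinated checkpointing for a period of length T."""
    return (1 - alpha) * C / T + (D + R + alpha * C + T / 2) / mu


def best_period(C, mu, alpha):
    """Period minimizing waste_coord subject to C <= T <= 0.1 * mu.

    D and R only add a constant to the waste, so they do not affect the result.
    """
    lo, hi = C, 0.1 * mu
    if lo > hi:
        raise ValueError("no feasible period: C exceeds 0.1 * mu")
    # Setting d(waste)/dT = -(1 - alpha) * C / T**2 + 1 / (2 * mu) to zero.
    t_star = math.sqrt(2 * (1 - alpha) * C * mu)
    # The waste is convex in T, so clamping gives the constrained minimum.
    return min(max(t_star, lo), hi)


# Hypothetical platform: 60 s checkpoints, one failure per day on average.
C, D, R, mu, alpha = 60.0, 30.0, 60.0, 86400.0, 0.3
T = best_period(C, mu, alpha)
print(f"period = {T:.1f} s, waste = {waste_coord(T, C, D, R, mu, alpha):.4f}")
```

When α = 1 the failure-free term vanishes and the sketch falls back to the shortest admissible period, T = C; when the stationary point exceeds 0.1µ, it returns the upper bound instead.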
