Fault Tolerance Support for Supercomputers with Multicore Nodes
Esteban Meneses, Xiang Ni
April 18, 2011
Exascale Supercomputers: ~100 million cores
• “an Exascale system could be expected to have a failure ... every 35–39 minutes” — Exascale Computing Study
• “insufficient resilience of the software infrastructure would likely render extreme scale systems effectively unusable” — The International Exascale Software Project
Contents
• Charm++ Fault Tolerance Infrastructure.
• Fault Tolerance in SMP.
• Preliminary Results.
• Multiple Concurrent Failure Model.
• Future Work.
Fault Tolerance in Charm++
• Object Migration
• Load Balancing
• Runtime Support
• SMP version
[Figure: objects migrating among Nodes X, Y, and Z]
Strategies
• Proactive
• Checkpoint Restart
• Message Logging
• SMP Checkpoint Restart
• SMP Message Logging
Proactive
[Figure: a fault predictor and Nodes W, X, Y, Z]
Checkpoint Restart
• Checkpoint in the buddy's memory (sketch below)
• Restart with or without spare nodes
[Figure: Node X fails and is replaced by Node X'; plots of time/step (s) vs. timestep, without and with load balancing]
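The following is a minimal sketch of the double in-memory checkpoint idea on this slide, assuming a ring buddy assignment; the data structures and function names are illustrative, not the Charm++ runtime's actual code. Restarting on a spare node only changes where recover() runs; the buddy copy is the same either way.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Per-node in-memory checkpoint storage: a node keeps its own serialized
// state plus a copy of its buddy's (double in-memory checkpointing).
struct NodeCkpt {
  std::vector<uint8_t> own;
  std::vector<uint8_t> from_buddy;
};

int buddy(int i, int n) { return (i + 1) % n; }  // ring buddy (assumption)

// Checkpoint phase: every node stores its state locally and ships a copy
// to its buddy. Both "sides" live in one address space here, purely for
// illustration; in the runtime the copy crosses the network.
void checkpoint(std::vector<NodeCkpt>& ckpt,
                const std::vector<std::vector<uint8_t>>& state) {
  int n = (int)state.size();
  for (int i = 0; i < n; ++i) {
    ckpt[i].own = state[i];
    ckpt[buddy(i, n)].from_buddy = state[i];
  }
}

// Recovery: the node that takes over for failed node f (a restarted node
// or a spare) pulls f's objects from f's buddy.
std::vector<uint8_t> recover(const std::vector<NodeCkpt>& ckpt, int f, int n) {
  return ckpt[buddy(f, n)].from_buddy;
}

int main() {
  const int n = 4;
  std::vector<std::vector<uint8_t>> state = {{1}, {2}, {3}, {4}};
  std::vector<NodeCkpt> ckpt(n);
  checkpoint(ckpt, state);
  assert(recover(ckpt, 2, n) == state[2]);  // node 2's state survives on node 3
  return 0;
}
```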
Message Logging
• Team-based Message Logging (sketched below)
• Parallel Restart
[Figure: messages m1 and m2 exchanged among Nodes W, X, Y, Z, grouped into Team Q and Team R]
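As a rough illustration of the team-based idea (not the actual runtime code), the sketch below assumes PEs are grouped into fixed-size teams and that only messages crossing a team boundary need to be logged; the helper names and team size are hypothetical. Larger teams log fewer messages (less memory) but roll back more work after a failure.

```cpp
#include <cassert>

int team_of(int pe, int team_size) { return pe / team_size; }

// A message needs to be logged only if sender and receiver belong to
// different teams; intra-team traffic is recovered by rolling back the
// whole team together, which is what lowers the memory overhead.
bool must_log(int src_pe, int dst_pe, int team_size) {
  return team_of(src_pe, team_size) != team_of(dst_pe, team_size);
}

int main() {
  const int team_size = 8;
  assert(!must_log(1, 5, team_size));   // same team: not logged
  assert(must_log(1, 9, team_size));    // crosses a team boundary: logged
  return 0;
}
```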
Team-based Load Balancer
[Figure: execution time (seconds) and memory overhead (MB) for NoLB(8), GreedyLB(8), TeamLB(1), and TeamLB(8)]
SMP Checkpoint Restart
• The minimum unit of failure is a node
• Single node failure support
[Figure: PEs A–D sharing memory (SM) on Node X]
SMP Message Logging
• Causal Message Logging ➙ determinants in shared memory
• Lock contention ➙ hybrid scheme (see the sketch below)
• Load balancing ➙ increase communication inside a node
[Figure: PEs A–D with shared memory on Nodes X and Y]
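Below is a hypothetical sketch of one way a hybrid scheme could reduce lock contention on a shared determinant store: each PE buffers determinants privately and flushes them in batches to the node-wide log. The Determinant fields, batch size, and class names are assumptions for illustration, not the actual Charm++ data structures.

```cpp
#include <mutex>
#include <thread>
#include <vector>

struct Determinant { int sender, receiver, ssn, rsn; };  // illustrative fields

class HybridDetLog {
  std::vector<Determinant> shared_;   // node-level log in shared memory
  std::mutex lock_;
 public:
  static constexpr size_t kBatch = 64;
  // Called by a PE with its private buffer; flushes only when the batch fills.
  void add(std::vector<Determinant>& local, const Determinant& d) {
    local.push_back(d);
    if (local.size() >= kBatch) flush(local);
  }
  // Only the batched flush takes the node-wide lock.
  void flush(std::vector<Determinant>& local) {
    std::lock_guard<std::mutex> g(lock_);
    shared_.insert(shared_.end(), local.begin(), local.end());
    local.clear();
  }
  size_t size() {
    std::lock_guard<std::mutex> g(lock_);
    return shared_.size();
  }
};

int main() {
  HybridDetLog log;
  std::vector<std::thread> pes;
  for (int pe = 0; pe < 4; ++pe)          // 4 PEs sharing one node's log
    pes.emplace_back([&log, pe] {
      std::vector<Determinant> local;
      for (int m = 0; m < 1000; ++m)      // pretend each PE logs 1000 messages
        log.add(local, {pe, (pe + 1) % 4, m, m});
      log.flush(local);                   // drain the remainder
    });
  for (auto& t : pes) t.join();
  return log.size() == 4000 ? 0 : 1;
}
```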
Experiments
• Hardware:
  • Abe @NCSA: 1,200 8-way SMP nodes.
  • Ranger @TACC: 3,936 16-way SMP nodes.
• Benchmarks:
  • Ring: Charm++ nearest-neighbor exchange.
  • Jacobi: 7-point stencil.
Checkpoint Time
[Figure: Checkpoint/Restart with Ring (Abe); time (seconds) vs. number of cores (64, 128, 256, 512, 1024) for checkpoint sizes of 8 MB, 80 MB, and 400 MB]
Restart Time
[Figure: Checkpoint/Restart with Jacobi (Abe); progress (iterations) vs. time (seconds) during restart]
Message Logging Overhead
[Figure: Jacobi (Ranger); time per iteration (seconds) vs. number of cores (128–1024), message logging vs. checkpoint/restart]
Single Node Failure
• All the protocols presented tolerate a single node failure.
• They may also recover from some multiple failures.
• Multiple concurrent failures are rare.
• The cost of tolerating them is high:
  • Checkpoint/restart: more checkpoint buddies.
  • Causal message logging: determinants must be stored in more locations.
Distribution of Multiple Failures
[Figure: frequency (0.1%–100%, log scale) of failures affecting 1, 2, 3, 4, and >4 nodes on Tsubame, MPP2, and Mercury]
Multiple Concurrent Failures
• Analytical model (see the sketch below):
  • Multiple failure distribution (heavy-tailed).
  • Checkpoint/Restart: probability of losing both a node and its buddy.
  • Message Logging: probability of losing a node and another node it contacts.
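The sketch below is a small Monte Carlo estimate of this model, under assumptions not stated on the slide: a pair buddy mapping (i XOR 1) for checkpoint/restart, a ring neighborhood of `degree` contacts for message logging, and the geometric failure-count distribution f(x) = (1 − p)^(x−1)·p with p = 0.7 from the backup slide at the end of the deck. The pair/ring choices only decide who counts as a fatal partner; swapping in the real buddy map or communication graph changes the numbers, not the structure of the estimate.

```cpp
#include <iostream>
#include <random>
#include <unordered_set>

int main() {
  const int n = 1024;          // number of nodes
  const double p = 0.7;        // geometric parameter from the backup slide
  const int trials = 200000;
  const int degree = 4;        // contacts per node for message logging (assumed)
  std::mt19937 gen(42);
  std::geometric_distribution<int> failures(p);   // P(X=x) = p(1-p)^x, x >= 0
  std::uniform_int_distribution<int> node(0, n - 1);

  int cr_ok = 0, ml_ok = 0;
  for (int t = 0; t < trials; ++t) {
    int k = failures(gen) + 1;                    // shift support to {1,2,...}
    if (k > n) k = n;                             // cap (practically never hit)
    std::unordered_set<int> failed;
    while ((int)failed.size() < k) failed.insert(node(gen));

    bool cr = true, ml = true;
    for (int i : failed) {
      if (failed.count(i ^ 1)) cr = false;        // node and its buddy lost
      for (int j = 1; j <= degree; ++j)           // node and a contact lost
        if (failed.count((i + j) % n)) ml = false;
    }
    cr_ok += cr;
    ml_ok += ml;
  }
  std::cout << "checkpoint/restart survivability ~ "
            << (double)cr_ok / trials << "\n"
            << "message logging (degree " << degree << ") survivability ~ "
            << (double)ml_ok / trials << "\n";
  return 0;
}
```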
Buddy Assignment
[Figure: nodes A–H under a ring mapping and under a pair mapping]
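A minimal sketch of the two mappings in the figure; the exact formulas (next-node ring, XOR pairing) are assumptions chosen to reproduce the picture, not necessarily the implementation used in the talk.

```cpp
#include <cstdio>

int ring_buddy(int i, int n) { return (i + 1) % n; } // A->B->C->...->H->A
int pair_buddy(int i)        { return i ^ 1; }       // (A,B), (C,D), (E,F), (G,H)

int main() {
  const int n = 8;  // nodes A..H as in the figure
  for (int i = 0; i < n; ++i)
    std::printf("node %c: ring buddy %c, pair buddy %c\n",
                'A' + i, 'A' + ring_buddy(i, n), 'A' + pair_buddy(i));
  return 0;
}
```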
Checkpoint/Restart: Multiple Failure Survivability (n=1024)
[Figure: survival probability vs. number of concurrent failures (0–120), with an inset for 0–16 failures]
Message Logging: Multiple Failure Survivability (n=1024)
[Figure: survival probability vs. number of concurrent failures (0–120), with an inset for 0–16 failures, for degree = 2, 4, 8, 16]
Conclusions
• Fault tolerance for SMP better matches the failure reality of supercomputers.
• Single node failure support is robust enough for the failure patterns observed in supercomputers.
• The load balancer is key to enhancing fault tolerance in SMP.
Future Work
• Optimize message logging in SMP.
• Add a load balancer to reduce communication overhead.
• Early stages of a supercomputer's life: correlated failures.
Acknowledgments
• Ana Gainaru (NCSA).
• Leonardo Bautista Gómez (Tokyo Tech).
• This research was supported in part by the US Department of Energy under grant DOE DE-SC0001845 and by a machine allocation on the TeraGrid under award ASC050039N.
Thanks! Q&A
Multiple Failures Model
f(x) = (1 − p)^(x−1) · p, with p = 0.7
[Figure: Multiple Failure Distribution (n=1024); probability vs. number of concurrent failures (2–16)]
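For concreteness, the first few values of the distribution, computed directly from the formula above:

```latex
% f(x) = (1-p)^{x-1} p with p = 0.7
\[
  f(1) = 0.7, \qquad
  f(2) = 0.3 \times 0.7 = 0.21, \qquad
  f(3) = 0.3^{2} \times 0.7 = 0.063, \qquad
  f(4) = 0.3^{3} \times 0.7 \approx 0.019
\]
```

So single-node failures dominate in this model.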
Survivability
Protocol                  S
Checkpoint/Restart        0.999402
Message Logging (2)       0.997624
Message Logging (4)       0.995285
Message Logging (8)       0.990716
Message Logging (16)      0.981973