Fault Tolerance Support for Supercomputers with Multicore Nodes
Esteban Meneses, Xiang Ni
April 18, 2011
Exascale Supercomputers: ~100 million cores
• “an Exascale system could be expected to have a failure ... every 35–39 minutes” — Exascale Computing Study
• “insufficient resilience of the software infrastructure would likely render extreme scale systems effectively unusable” — The International Exascale Software Project
Contents
• Charm++ Fault Tolerance Infrastructure.
• Fault Tolerance in SMP.
• Preliminary Results.
• Multiple Concurrent Failure Model.
• Future Work.
Fault Tolerance in Charm++
• Object Migration
• Load Balancing
• Runtime Support
• SMP version
[Figure: objects migrating among Nodes X, Y, and Z]
Strategies
• Proactive
• Checkpoint Restart
• Message Logging
• SMP Checkpoint Restart
• SMP Message Logging
Proactive
[Figure: a fault predictor and Nodes W, X, Y, Z]
Checkpoint Restart
• Checkpoint in the buddy's memory (sketch below)
• Restart with or without spare nodes
[Figure: Node X fails and is replaced by Node X'; plots of time/step (s) vs. timestep, without and with load balancing]
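The following is a minimal sketch of the double in-memory checkpoint idea on this slide, assuming a ring buddy assignment; the data structures and function names are illustrative, not the Charm++ runtime's actual code. Restarting on a spare node only changes where recover() runs; the buddy copy is the same either way.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Per-node in-memory checkpoint storage: a node keeps its own serialized
// state plus a copy of its buddy's (double in-memory checkpointing).
struct NodeCkpt {
  std::vector<uint8_t> own;
  std::vector<uint8_t> from_buddy;
};

int buddy(int i, int n) { return (i + 1) % n; }  // ring buddy (assumption)

// Checkpoint phase: every node stores its state locally and ships a copy
// to its buddy. Both "sides" live in one address space here, purely for
// illustration; in the runtime the copy crosses the network.
void checkpoint(std::vector<NodeCkpt>& ckpt,
                const std::vector<std::vector<uint8_t>>& state) {
  int n = (int)state.size();
  for (int i = 0; i < n; ++i) {
    ckpt[i].own = state[i];
    ckpt[buddy(i, n)].from_buddy = state[i];
  }
}

// Recovery: the node that takes over for failed node f (a restarted node
// or a spare) pulls f's objects from f's buddy.
std::vector<uint8_t> recover(const std::vector<NodeCkpt>& ckpt, int f, int n) {
  return ckpt[buddy(f, n)].from_buddy;
}

int main() {
  const int n = 4;
  std::vector<std::vector<uint8_t>> state = {{1}, {2}, {3}, {4}};
  std::vector<NodeCkpt> ckpt(n);
  checkpoint(ckpt, state);
  assert(recover(ckpt, 2, n) == state[2]);  // node 2's state survives on node 3
  return 0;
}
```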
Message Logging
• Team-based Message Logging (sketched below)
• Parallel Restart
[Figure: messages m1 and m2 exchanged among Nodes W, X, Y, Z, grouped into Team Q and Team R]
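As a rough illustration of the team-based idea (not the actual runtime code), the sketch below assumes PEs are grouped into fixed-size teams and that only messages crossing a team boundary need to be logged; the helper names and team size are hypothetical. Larger teams log fewer messages (less memory) but roll back more work after a failure.

```cpp
#include <cassert>

int team_of(int pe, int team_size) { return pe / team_size; }

// A message needs to be logged only if sender and receiver belong to
// different teams; intra-team traffic is recovered by rolling back the
// whole team together, which is what lowers the memory overhead.
bool must_log(int src_pe, int dst_pe, int team_size) {
  return team_of(src_pe, team_size) != team_of(dst_pe, team_size);
}

int main() {
  const int team_size = 8;
  assert(!must_log(1, 5, team_size));   // same team: not logged
  assert(must_log(1, 9, team_size));    // crosses a team boundary: logged
  return 0;
}
```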
Team-based Load Balancer
[Figure: execution time (seconds) and memory overhead (MB) for NoLB(8), GreedyLB(8), TeamLB(1), and TeamLB(8)]
SMP Checkpoint Restart
• The minimum unit of failure is a node
• Single node failure support
[Figure: PEs A–D sharing memory (SM) on Node X]
SMP Message Logging
• Causal Message Logging ➙ determinants in shared memory
• Lock contention ➙ hybrid scheme (see the sketch below)
• Load balancing ➙ increase communication inside a node
[Figure: PEs A–D with shared memory on Nodes X and Y]
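Below is a hypothetical sketch of one way a hybrid scheme could reduce lock contention on a shared determinant store: each PE buffers determinants privately and flushes them in batches to the node-wide log. The Determinant fields, batch size, and class names are assumptions for illustration, not the actual Charm++ data structures.

```cpp
#include <mutex>
#include <thread>
#include <vector>

struct Determinant { int sender, receiver, ssn, rsn; };  // illustrative fields

class HybridDetLog {
  std::vector<Determinant> shared_;   // node-level log in shared memory
  std::mutex lock_;
 public:
  static constexpr size_t kBatch = 64;
  // Called by a PE with its private buffer; flushes only when the batch fills.
  void add(std::vector<Determinant>& local, const Determinant& d) {
    local.push_back(d);
    if (local.size() >= kBatch) flush(local);
  }
  // Only the batched flush takes the node-wide lock.
  void flush(std::vector<Determinant>& local) {
    std::lock_guard<std::mutex> g(lock_);
    shared_.insert(shared_.end(), local.begin(), local.end());
    local.clear();
  }
  size_t size() {
    std::lock_guard<std::mutex> g(lock_);
    return shared_.size();
  }
};

int main() {
  HybridDetLog log;
  std::vector<std::thread> pes;
  for (int pe = 0; pe < 4; ++pe)          // 4 PEs sharing one node's log
    pes.emplace_back([&log, pe] {
      std::vector<Determinant> local;
      for (int m = 0; m < 1000; ++m)      // pretend each PE logs 1000 messages
        log.add(local, {pe, (pe + 1) % 4, m, m});
      log.flush(local);                   // drain the remainder
    });
  for (auto& t : pes) t.join();
  return log.size() == 4000 ? 0 : 1;
}
```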
Experiments
• Hardware:
  • Abe @NCSA: 1,200 8-way SMP nodes.
  • Ranger @TACC: 3,936 16-way SMP nodes.
• Benchmarks:
  • Ring: Charm++ nearest-neighbor exchange.
  • Jacobi: 7-point stencil.
Checkpoint Time
[Figure: Checkpoint/Restart with Ring (Abe); time (seconds) vs. number of cores (64, 128, 256, 512, 1024) for checkpoint sizes of 8 MB, 80 MB, and 400 MB]
Restart Time
[Figure: Checkpoint/Restart with Jacobi (Abe); progress (iterations) vs. time (seconds) during restart]
Message Logging Overhead
[Figure: Jacobi (Ranger); time per iteration (seconds) vs. number of cores (128–1024), message logging vs. checkpoint/restart]
Single Node Failure
• All the protocols presented tolerate a single node failure.
• They may also recover from some multiple failures.
• Multiple concurrent failures are rare.
• The cost of tolerating them is high:
  • Checkpoint/restart: more checkpoint buddies.
  • Causal message logging: determinants must be stored in more locations.
Distribution of Multiple Failures
[Figure: frequency (0.1%–100%, log scale) of failures affecting 1, 2, 3, 4, and >4 nodes on Tsubame, MPP2, and Mercury]
Multiple Concurrent Failures
• Analytical model (see the sketch below):
  • Multiple failure distribution (heavy-tailed).
  • Checkpoint/Restart: probability of losing both a node and its buddy.
  • Message Logging: probability of losing a node and another node it contacts.
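The sketch below is a small Monte Carlo estimate of this model, under assumptions not stated on the slide: a pair buddy mapping (i XOR 1) for checkpoint/restart, a ring neighborhood of `degree` contacts for message logging, and the geometric failure-count distribution f(x) = (1 − p)^(x−1)·p with p = 0.7 from the backup slide at the end of the deck. The pair/ring choices only decide who counts as a fatal partner; swapping in the real buddy map or communication graph changes the numbers, not the structure of the estimate.

```cpp
#include <iostream>
#include <random>
#include <unordered_set>

int main() {
  const int n = 1024;          // number of nodes
  const double p = 0.7;        // geometric parameter from the backup slide
  const int trials = 200000;
  const int degree = 4;        // contacts per node for message logging (assumed)
  std::mt19937 gen(42);
  std::geometric_distribution<int> failures(p);   // P(X=x) = p(1-p)^x, x >= 0
  std::uniform_int_distribution<int> node(0, n - 1);

  int cr_ok = 0, ml_ok = 0;
  for (int t = 0; t < trials; ++t) {
    int k = failures(gen) + 1;                    // shift support to {1,2,...}
    if (k > n) k = n;                             // cap (practically never hit)
    std::unordered_set<int> failed;
    while ((int)failed.size() < k) failed.insert(node(gen));

    bool cr = true, ml = true;
    for (int i : failed) {
      if (failed.count(i ^ 1)) cr = false;        // node and its buddy lost
      for (int j = 1; j <= degree; ++j)           // node and a contact lost
        if (failed.count((i + j) % n)) ml = false;
    }
    cr_ok += cr;
    ml_ok += ml;
  }
  std::cout << "checkpoint/restart survivability ~ "
            << (double)cr_ok / trials << "\n"
            << "message logging (degree " << degree << ") survivability ~ "
            << (double)ml_ok / trials << "\n";
  return 0;
}
```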
Buddy Assignment
[Figure: nodes A–H under a ring mapping and under a pair mapping]
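A minimal sketch of the two mappings in the figure; the exact formulas (next-node ring, XOR pairing) are assumptions chosen to reproduce the picture, not necessarily the implementation used in the talk.

```cpp
#include <cstdio>

int ring_buddy(int i, int n) { return (i + 1) % n; } // A->B->C->...->H->A
int pair_buddy(int i)        { return i ^ 1; }       // (A,B), (C,D), (E,F), (G,H)

int main() {
  const int n = 8;  // nodes A..H as in the figure
  for (int i = 0; i < n; ++i)
    std::printf("node %c: ring buddy %c, pair buddy %c\n",
                'A' + i, 'A' + ring_buddy(i, n), 'A' + pair_buddy(i));
  return 0;
}
```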
Checkpoint/Restart: Multiple Failure Survivability (n=1024)
[Figure: survival probability vs. number of concurrent failures (0–120), with an inset for 0–16 failures]
Message Logging: Multiple Failure Survivability (n=1024)
[Figure: survival probability vs. number of concurrent failures (0–120), with an inset for 0–16 failures, for degree = 2, 4, 8, 16]
Conclusions
• Fault tolerance for SMP better matches the failure reality of supercomputers.
• Single node failure support is robust enough for the failure patterns observed in supercomputers.
• The load balancer is key to enhancing fault tolerance in SMP.
Future Work
• Optimize message logging in SMP.
• Add a load balancer to reduce communication overhead.
• Early stages of a supercomputer's life: correlated failures.
Acknowledgments
• Ana Gainaru (NCSA).
• Leonardo Bautista Gómez (Tokyo Tech).
• This research was supported in part by the US Department of Energy under grant DOE DE-SC0001845 and by a machine allocation on the TeraGrid under award ASC050039N.
Thanks! Q&A
Multiple Failures Model
f(x) = (1 − p)^(x−1) · p, with p = 0.7
[Figure: Multiple Failure Distribution (n=1024); probability vs. number of concurrent failures (2–16)]
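For concreteness, the first few values of the distribution, computed directly from the formula above:

```latex
% f(x) = (1-p)^{x-1} p with p = 0.7
\[
  f(1) = 0.7, \qquad
  f(2) = 0.3 \times 0.7 = 0.21, \qquad
  f(3) = 0.3^{2} \times 0.7 = 0.063, \qquad
  f(4) = 0.3^{3} \times 0.7 \approx 0.019
\]
```

So single-node failures dominate in this model.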
Survivability
Protocol                  S
Checkpoint/Restart        0.999402
Message Logging (2)       0.997624
Message Logging (4)       0.995285
Message Logging (8)       0.990716
Message Logging (16)      0.981973