

  1. Fault Tolerance Support for Supercomputers with Multicore Nodes
     Esteban Meneses, Xiang Ni
     Monday, April 18, 2011

  2. Exascale Supercomputer: 100 million cores
     • “an Exascale system could be expected to have a failure ... every 35–39 minutes” (Exascale Computing Study)
     • “insufficient resilience of the software infrastructure would likely render extreme scale systems effectively unusable” (The International Exascale Software Project)

  3. Contents
     • Charm++ Fault Tolerance Infrastructure
     • Fault Tolerance in SMP
     • Preliminary Results
     • Multiple Concurrent Failure Model
     • Future Work

  4. Fault Tolerance in Charm++
     • Object Migration
     • Load Balancing
     • Runtime Support
     • SMP version
     [Diagram: objects migrating among Node X, Node Y, and Node Z]

  5. Strategies
     • Proactive
     • Checkpoint/Restart
     • Message Logging
     • SMP Checkpoint/Restart
     • SMP Message Logging

  6. Proactive
     [Diagram: Nodes W, X, Y, Z; a failure Predictor flags Node X before the fault occurs]

  7. Checkpoint/Restart
     • Checkpoint in buddy's memory
     • Restart with or without spare nodes
     [Diagram: Nodes W, X, Y, Z with replacement Node X']
     [Plots: time/step (s, 0–4.0) vs. timestep (1–601) across a restart, without LB and with LB]
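To make the buddy checkpoint idea on slide 7 concrete, here is a minimal C++ sketch of double in-memory checkpointing. It is illustrative only and not the Charm++ implementation: the types Checkpoint and Node and the functions checkpoint() and recover() are invented for this example, and the real runtime sends the buddy copy as a message rather than writing into another node's memory.

    #include <cstdint>
    #include <vector>

    // Sketch: each node keeps its checkpoint locally and places a copy
    // in a buddy node's memory, so a single node failure destroys at
    // most one of the two copies.
    struct Checkpoint {
        std::uint64_t timestep;
        std::vector<char> data;          // serialized object state
    };

    struct Node {
        int id;
        Checkpoint local;                // this node's own checkpoint
        Checkpoint buddyCopy;            // checkpoint held for the buddy
    };

    void checkpoint(Node& self, Node& buddy, std::uint64_t step,
                    const std::vector<char>& state) {
        self.local = {step, state};      // local copy
        buddy.buddyCopy = {step, state}; // remote copy in buddy's memory
    }

    // After a failure, the lost node's state is rebuilt from the copy in
    // its buddy's memory, either on a spare node or on the same node
    // once it restarts (the "with or without spare nodes" choice above).
    Checkpoint recover(const Node& buddy) {
        return buddy.buddyCopy;
    }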

  8. Message Logging
     • Team-based message logging
     • Parallel restart
     [Diagram: messages m1 and m2 among Nodes W, X, Y, Z, grouped into Team Q and Team R]
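A rough C++ sketch of the team idea on slide 8, under assumptions that are mine rather than the slide's: nodes are numbered contiguously and teams have a fixed size, and teamOf and mustLog are hypothetical names.

    // Illustrative only: nodes 0..n-1 split into fixed-size teams.
    int teamOf(int node, int teamSize) { return node / teamSize; }

    // Team-based message logging: a message like m1 that stays inside
    // Team Q is not logged; a message like m2 that crosses from Team Q
    // to Team R is logged.
    bool mustLog(int src, int dst, int teamSize) {
        return teamOf(src, teamSize) != teamOf(dst, teamSize);
    }

The trade-off is that fewer messages are logged, but a failure rolls back the failed node's whole team, which parallel restart then helps to recover quickly.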

  9. Team-based Load Balancer
     [Bar charts: execution time (seconds, 0–360) and memory overhead (MB, 0–108) for NoLB(8), GreedyLB(8), TeamLB(1), and TeamLB(8)]

  10. SMP Checkpoint/Restart
      • The minimum unit of failure is a node
      • Single node failure support
      [Diagram: PEs A, B, C, D over Shared Memory (SM) inside Node X]

  11. SMP Message Logging
      • Causal Message Logging ➙ determinants in shared memory
      • Lock contention ➙ hybrid scheme
      • Load balancing ➙ increase communication inside a node
      [Diagram: PEs A, B, C, D with shared memory (SM) on Node X and Node Y]
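The lock-contention bullet can be illustrated with a small C++ sketch. This is not the Charm++ data structure; SharedDeterminantLog and its members are hypothetical, and the point is only that every PE of a node appends determinants through one shared lock.

    #include <mutex>
    #include <vector>

    // A determinant records the delivery order of a message so that a
    // recovering node can replay receptions deterministically.
    struct Determinant {
        int sender, receiver;
        long seq;                    // reception sequence number
    };

    class SharedDeterminantLog {
        std::vector<Determinant> log_;
        std::mutex lock_;            // every PE on the node contends here
    public:
        void append(const Determinant& d) {
            std::lock_guard<std::mutex> g(lock_);
            log_.push_back(d);
        }
    };

A hybrid scheme in the slide's sense could buffer determinants per PE and merge them into the shared log in batches, paying a little extra latency for far fewer lock acquisitions.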

  12. Experiments
      • Hardware:
        • Abe @ NCSA: 1,200 8-way SMP nodes
        • Ranger @ TACC: 3,936 16-way SMP nodes
      • Benchmarks:
        • Ring: Charm++ nearest-neighbor exchange
        • Jacobi: 7-point stencil

  13. Checkpoint Time
      [Plot: Checkpoint/Restart, Ring (Abe); time (seconds, 0–30) vs. number of cores (64–1024), for checkpoint sizes of 8 MB, 80 MB, and 400 MB]

  14. Restart Time
      [Plot: Checkpoint/Restart, Jacobi (Abe, 32 cores); progress (iteration, 0–200) vs. time (seconds, 0–20)]

  15. Message Logging Overhead
      [Plot: Jacobi (Ranger); time per iteration (seconds, 0–0.25) vs. number of cores (128–1024), for Message Logging and Checkpoint/Restart]

  16. Single Node Failure
      • All the protocols presented tolerate a single node failure.
      • They may also recover from multiple failures.
      • Multiple concurrent failures are rare.
      • The cost to tolerate them is high:
        • Checkpoint/restart: more checkpoint buddies.
        • Causal message logging: determinants must be stored in more locations.

  17. Distribution of Multiple Failures
      [Bar chart: frequency (log scale, 0.1%–100%) of failures involving 1, 2, 3, 4, and more than 4 nodes on Tsubame, MPP2, and Mercury]

  18. Multiple Concurrent Failures
      • Analytical model:
        • Multiple failure distribution: f(x) = (1-p)^(x-1) p (heavy-tailed; see slide 26).
        • Checkpoint/Restart: probability of losing both a node and its buddy.
        • Message Logging: probability of losing a node and another node it contacts.

  19. Buddy Assignment
      [Diagrams: eight nodes A–H under Ring Mapping (each node's checkpoint buddy is the next node around the ring) and Pair Mapping (nodes grouped into mutual buddy pairs)]
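The two mappings in the diagram reduce to one-line buddy functions. This sketch assumes nodes are numbered 0..n-1 with n even; ringBuddy and pairBuddy are illustrative names.

    // Ring mapping: each node checkpoints to its successor on a ring.
    int ringBuddy(int node, int n) { return (node + 1) % n; }

    // Pair mapping: nodes form fixed mutual pairs (0,1), (2,3), ...;
    // flipping the low bit yields the partner, and the relation is
    // symmetric: pairBuddy(pairBuddy(i)) == i.
    int pairBuddy(int node) { return node ^ 1; }

Under the buddy checkpointing of slide 7, the choice matters for multiple failures: with pair mapping a checkpoint is lost only if both members of one of the n/2 pairs fail, while with ring mapping any of the n (node, successor) combinations is fatal.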

  20. Checkpoint/Restart
      [Plot: multiple failure survivability (n=1024); probability vs. number of concurrent failures (0–120), with an inset detailing 0–16]

  21. Message Logging
      [Plot: multiple failure survivability (n=1024); probability vs. number of concurrent failures (0–120) for degree = 2, 4, 8, and 16, with an inset detailing 0–16]

  22. Conclusions
      • Fault tolerance for SMP better matches the failure reality of supercomputers.
      • Single node failure support is robust enough for the failure patterns observed on supercomputers.
      • The load balancer is key to enhancing fault tolerance in SMP.

  23. Future Work
      • Optimize message logging in SMP.
      • Add a load balancer to reduce communication overhead.
      • Early stages of a supercomputer's lifetime: correlated failures.

  24. Acknowledgments
      • Ana Gainaru (NCSA).
      • Leonardo Bautista Gómez (Tokyo Tech).
      • This research was supported in part by the US Department of Energy under grant DOE DE-SC0001845 and by a machine allocation on the TeraGrid under award ASC050039N.

  25. Thanks! Q&A

  26. Multiple Failures Model
      [Plot: multiple failure distribution (n=1024); f(x) = (1-p)^(x-1) p with p = 0.7; probability vs. number of concurrent failures (2–16)]
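Combining this distribution with the pair-style buddy assignment of slide 19 gives a way to sanity-check the survivability numbers on slide 27. The C++ sketch below is a Monte Carlo estimate under my own assumptions (buddy = node ^ 1, uniformly random failed nodes); the slide's figures may come from a closed-form calculation, so the estimate is only expected to land near the 0.999402 reported for Checkpoint/Restart.

    #include <algorithm>
    #include <random>
    #include <unordered_set>

    // Estimate checkpoint/restart survivability: the number of
    // concurrently failed nodes x follows f(x) = (1-p)^(x-1) * p, and a
    // run survives unless some failed node's buddy also failed.
    double survivability(int n = 1024, double p = 0.7,
                         int trials = 1000000) {
        std::mt19937 rng(42);
        std::geometric_distribution<int> extra(p);  // support 0, 1, 2, ...
        std::uniform_int_distribution<int> pick(0, n - 1);
        int survived = 0;
        for (int t = 0; t < trials; ++t) {
            int x = std::min(extra(rng) + 1, n);    // x >= 1 failed nodes
            std::unordered_set<int> failed;
            while ((int)failed.size() < x) failed.insert(pick(rng));
            bool fatal = false;
            for (int node : failed)
                if (failed.count(node ^ 1)) { fatal = true; break; }
            if (!fatal) ++survived;
        }
        return double(survived) / trials;
    }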

  27. Survivability
      Scheme                  S
      Checkpoint/Restart      0.999402
      Message Logging (2)     0.997624
      Message Logging (4)     0.995285
      Message Logging (8)     0.990716
      Message Logging (16)    0.981973
