spare node substitution for failure nodes

Spare Node Substitution for Failure Nodes, Kazumi Yoshinaga, RIKEN



  1. Spare Node Substitution for Failure Nodes
     Kazumi Yoshinaga, RIKEN AICS

  2. Background
     • In the Exa-flops era, faults could happen more frequently than ever
       → System MTBF becomes shorter
     • Important issue: recovery from faults
     • Conventional method: system-level Checkpoint-Restart
       – Requires massive I/O
     • Many mechanisms to survive failures have been proposed and investigated
       – Less I/O size
       – One of the mechanisms is ULFM (User-Level Failure Mitigation)
         • The user program handles failures
         • The program can survive failures and continue its execution
     • However, how a job should survive node failures has not been discussed

  3. Purpose of this Research
     • What is the best way to survive node failures?
       – Assume a job can survive a node failure by using existing fault mitigation software
       – The goal is not to propose a new fault mitigation mechanism
       – Propose a recovery strategy

  4. Survival from Node Failure
     • Applications with dynamic load balancing
       – e.g. Distributed Master-Worker model
       – Method: avoid the failed nodes
       – Applications continue their execution with only the healthy nodes after a failure
     • How about applications without dynamic load balancing?
       – e.g. Stencil Computation

  5. Avoiding Failure Node(s) for Stencil Computation
     • Stencil computation characteristics
       – Communication pattern is fixed
       – Load can be balanced
     • When a recovery happens, the above stencil computation characteristics must be preserved
     • However,
       – Hard to balance loads
       – Impossible to preserve the communication pattern
       – Every time a new failure happens, the communication pattern can differ
     • Hard to program!
     • Using spare nodes to solve these problems
     [Figure labels: Failure; x1.5 computation; New comm. pattern]

  6. Using Spare Nodes
     • An application runs with spare nodes
     • If a node failure happens, migrate the task running on the failed node to a spare node
       – Loads are balanced (continues with the same # procs.)
       – Logical communication pattern is preserved
       – No change in the kernel part of the application
       – Some penalties

  7. Spare Node Penalty-1 -System Utilization Degradation-
     • Spare node allocation
     • System utilization is decreased
     • Notation nD(α,β)
       – n: dimensions of the network
       – α: # dimensions of spare nodes
       – β: spare nodes width
     [Figure: % spare nodes (0 to 14) vs. # nodes (1,000 to 1,000,000) for 3D(3,1), 3D(2,1), 3D(1,1), 2D(2,1), 2D(1,1)]
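The nD(α,β) notation can be turned into a small calculation of how many nodes are set aside. The sketch below assumes that the first α dimensions each give up β planes of nodes as spares; this reading is consistent with the "12x12x12 Nodes (calc. 11x11x12)" and "16x8x8 Nodes (calc. 15x7x8)" labels on the evaluation slide, but the slides do not spell the rule out. The function name `spare_node_stats` is hypothetical.

```python
from math import prod


def spare_node_stats(shape, alpha, beta):
    """Return (compute_nodes, spare_nodes, spare_percent) for an nD(alpha, beta) layout.

    Assumption (not stated in the slides): the first `alpha` dimensions of
    `shape` each lose `beta` planes of nodes, which become spare nodes.
    """
    if not 0 <= alpha <= len(shape):
        raise ValueError("alpha must be between 0 and the number of dimensions")
    compute_shape = [s - beta if i < alpha else s for i, s in enumerate(shape)]
    if min(compute_shape) <= 0:
        raise ValueError("beta is too large for this network shape")
    total = prod(shape)
    compute = prod(compute_shape)
    spare = total - compute
    return compute, spare, 100.0 * spare / total


# 3D(2,1) on the two node shapes named on the evaluation slide.
k_stats = spare_node_stats((12, 12, 12), alpha=2, beta=1)
bgq_stats = spare_node_stats((16, 8, 8), alpha=2, beta=1)
print(k_stats[:2], bgq_stats[:2])
```

Under this assumption the two layouts yield 276 and 184 spare nodes, which agrees with the largest failure counts shown on the collective-operation plots.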

  8. Spare Node Penalty-2 -Communication Performance Degradation-
     • The logical communication pattern can be preserved by creating a new MPI communicator that excludes the failed node and includes a spare node
     • However, the physical communication pattern is not the same, and communication performance (CP) can be degraded
       – Larger hop counts (latency)
       – Possible message collisions
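The idea of keeping logical ranks fixed while swapping one physical process can be sketched without any MPI calls. The helper below (hypothetical name `substituted_rank_order`) builds the ordered list of world ranks for the new communicator; a list like this could be handed to a group-inclusion call such as `MPI_Group_incl`, which orders the new group by the list it receives. The slides do not say which MPI calls are actually used, so treat this as an illustration only.

```python
def substituted_rank_order(app_ranks, failed_rank, spare_rank):
    """Return world ranks ordered by logical rank after one substitution.

    Position i in the result is the world rank that plays logical rank i,
    so every surviving process keeps its logical rank and neighbours.
    """
    if failed_rank not in app_ranks:
        raise ValueError("failed rank is not part of the application")
    if spare_rank in app_ranks:
        raise ValueError("spare rank is already in use")
    return [spare_rank if r == failed_rank else r for r in app_ranks]


# Hypothetical 8-process job: world ranks 0-5 compute, 6-7 wait as spares.
app_ranks = [0, 1, 2, 3, 4, 5]
new_order = substituted_rank_order(app_ranks, failed_rank=2, spare_rank=6)
print(new_order)  # [0, 1, 6, 3, 4, 5]
```

Logical rank 2 is now served by world rank 6, which is why the logical pattern survives while the physical route taken by its messages changes.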

  9. Ex. CP Degradation of Spare Node Substitution
     • Nodes on the topmost row work as spare nodes
     • Up to 5 possible collisions after 1 node failure
       – Independent of the # nodes
     • How should faulty nodes be replaced by spare nodes?
     [Figure: 2D Cartesian network topology (XY routing), 5-point stencil computation]

  10. Sliding Substitution (1)
     • We proposed “Sliding Substitution” methods
       – 0D Sliding (simple replace)
         • The failed rank is continued on an alternative node
       – 1D Sliding
         • Processes between the failure node and the spare node are shifted
       – 2D Sliding
         • All processes between the failure node's row (column) and the spare node's row (column) are shifted
       – 3D Sliding, 4D, 5D, …
     [Figure: rank grids (ranks 0 to 35) illustrating 0D Sliding, 1D Sliding, and 2D Sliding]
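The three sliding variants can be sketched as remappings from logical rank to physical node coordinates. The code below assumes a 2D mesh whose application ranks fill rows 0 to `height - 1`, with one spare row directly above, and it assumes the spare used is the one in the failed node's column; the slides only show this in a figure. All function names are hypothetical, and the hop count is a plain Manhattan distance that ignores any detour around the failed node.

```python
def initial_mapping(width, height):
    """Logical rank -> (x, y) physical node; the spare row sits at y == height."""
    return {y * width + x: (x, y) for y in range(height) for x in range(width)}


def slide_0d(mapping, failed_rank, spare_y):
    """0D: only the failed rank moves, to the spare in its own column (assumed)."""
    x, _ = mapping[failed_rank]
    new = dict(mapping)
    new[failed_rank] = (x, spare_y)
    return new


def slide_1d(mapping, failed_rank):
    """1D: ranks from the failed node up to the spare shift one step in its column."""
    fx, fy = mapping[failed_rank]
    return {r: (x, y + 1) if (x == fx and y >= fy) else (x, y)
            for r, (x, y) in mapping.items()}


def slide_2d(mapping, failed_rank):
    """2D: every row from the failed node's row upward shifts one step."""
    _, fy = mapping[failed_rank]
    return {r: (x, y + 1) if y >= fy else (x, y)
            for r, (x, y) in mapping.items()}


def max_neighbor_hops(mapping, width, height):
    """Largest Manhattan distance between logical 5-point-stencil neighbours."""
    worst = 0
    for rank, (x, y) in mapping.items():
        lx, ly = rank % width, rank // width
        for nlx, nly in ((lx + 1, ly), (lx, ly + 1)):
            if nlx < width and nly < height:
                nx, ny = mapping[nly * width + nlx]
                worst = max(worst, abs(x - nx) + abs(y - ny))
    return worst


# Illustrative case: 6x6 application grid, logical rank 20 fails.
W, H, FAILED = 6, 6, 20
base = initial_mapping(W, H)
hops = {
    "none": max_neighbor_hops(base, W, H),
    "0D": max_neighbor_hops(slide_0d(base, FAILED, spare_y=H), W, H),
    "1D": max_neighbor_hops(slide_1d(base, FAILED), W, H),
    "2D": max_neighbor_hops(slide_2d(base, FAILED), W, H),
}
print(hops)  # {'none': 1, '0D': 4, '1D': 2, '2D': 2}
```

In this toy case the worst neighbour distance grows to 4 hops with 0D sliding but only to 2 with 1D or 2D sliding. Hop distance alone does not capture message collisions, which the slides treat as a separate cost.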

  11. Preliminary Evaluation -5D stencil on 2D network-
     • Spare allocation: 2D(2,1) > 2D(1,1)
     • Max. failures
       – 0D: up to # spare nodes
       – 1D: 3 (or more)
       – 2D: up to 2 (2D Cart. Topo.)
     • Comm. perf.: 2D > 1D > 0D
     [Figure: four panels (0D : 2D(1,1); 0D : 2D(2,1); 1D : 2D(2,1); 2D : 2D(2,1)) plotting max. collisions vs. # failed nodes for Mesh and Torus]

  12. Sliding Substitution (2)
     • The higher the dimension
       – The better the performance
       – The smaller the number of failure nodes it can handle
     • 2D or higher dimension Sliding
       – Migrates tasks running on healthy nodes
       – Freed nodes work as new spare nodes
     • Hybrid Sliding
       – 3D → 2D → 1D → 0D (on 3D network)
     [Figure: 3D Sliding; freed nodes work as new spare nodes]
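The hybrid order can be read as a fallback chain: try the highest-dimension sliding first and drop to a lower one when it can no longer be applied. The sketch below models "can no longer be applied" as a simple per-method failure budget, which is a simplification of my own; in practice feasibility would depend on where the failures occur and which spares remain. The function name `choose_sliding` and the budget values are hypothetical and are not taken from the slides.

```python
def choose_sliding(budgets, order=("3D", "2D", "1D", "0D")):
    """Pick the highest-dimension method with budget left and consume one unit.

    `budgets` maps a method name to how many more failures it may absorb
    (an assumed simplification). Returns None when nothing is left.
    """
    for method in order:
        if budgets.get(method, 0) > 0:
            budgets[method] -= 1
            return method
    return None


# Illustrative budgets only, not measured values.
budgets = {"3D": 1, "2D": 1, "1D": 2, "0D": 3}
sequence = [choose_sliding(budgets) for _ in range(8)]
print(sequence)
# ['3D', '2D', '1D', '1D', '0D', '0D', '0D', None]
```

The trailing `None` marks the point where no substitution method has capacity left for another failure.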

  13. Evaluation: 7P-Stencil on the K and BG/Q (Hybrid, 3D(2,1), 4MiB)
     • K computer: up to 8 times slower
     • BG/Q: up to 12 times slower
     [Figure: relative latency (smaller is better) vs. # failed nodes, with series Sim. Avg., Sim. Worst, Sim. Best, Exp. Worst. Panels: The K Computer, 12x12x12 nodes (calc. 11x11x12); BG/Q, 16x8x8 nodes (calc. 15x7x8)]

  14. Evaluation: Collectives on the K and BG/Q (Hybrid, 3D(2,1))
     • On the K and BG/Q, collective operations are optimized for their network
     • Having spare nodes makes the optimization very difficult
     • BG/Q’s optimization works only with MPI_COMM_WORLD
     [Figure: worst-case relative latency (smaller is better) vs. # failed nodes. Panels: Allreduce(K), Barrier(K), Barrier(BG/Q), Allreduce(BG/Q); BG/Q panels based on 16x8x8]

  15. Summary
     • We proposed and compared “Sliding Substitution” methods
     • Communication performance degradation is observed
       – 7P-Stencil:
         • Simulation results: up to 40 collisions
         • Experimental results: up to 12 times larger latency
       – Collective communications:
         • Up to 12 times larger latency (BG/Q, Barrier)

  16. Future Work
     • Evaluations with real applications
     • Node-rank re-mapping algorithms, or better substitution methods
     • Discussion of other network topologies
       – Experiments using Tsubame 2.5 (Fat-tree) are scheduled
