Sliding Substitution of Failed Nodes
Atsushi Hori, Kazumi Yoshinaga, Yutaka Ishikawa (RIKEN AICS)
Thomas Herault, Aurélien Bouteiller, George Bosilca (University of Tennessee, ICL)
Friday, October 2, 2015
2 Motivation
• Having a set of spare nodes seems to be the last resort
• "In such a case, a spare node can be used."
• Having spare nodes is not the answer in itself, but a new research issue
EuroMPI 2015, Bordeaux
3 Fault Resilience
• Fault tolerance in the exaflops era
  • High failure rate
  • High I/O bandwidth requirement
• User-level fault resilience
  • Less I/O bandwidth required
  • e.g., ULFM (User-Level Failure Mitigation)
• We need a recovery strategy!
4 Survival from Node Failure
• Jobs with dynamic load balancing
  • e.g., task bag, PIC, ...
  • The job shrinks to exclude failed nodes
  • Tasks running on the failed node(s) are migrated to live nodes
• Jobs without dynamic load balancing
  • e.g., stencil computation, ...
  • Very difficult to balance the load
  • Having spare nodes seems to be the answer ...
5 Stencil Computation
• Survival from a node failure
  • Load balancing
  • Preserving the communication pattern
  • Less code modification
[Figure: shifting the load onto healthy nodes results in a new, complex communication pattern]
6 Spare Node
• In an error handler (of ULFM, for example):
  • create a new MPI communicator that
    • excludes the failed node, and
    • includes a spare node;
  • then migrate the task running on the failed node to the spare node
• No change in the kernel part of the application
• However, at the network level, the regular stencil communication pattern can be lost!
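The effect described on this slide can be sketched with a toy model in which a communicator is just a table from rank to node coordinates. The 4x4 compute mesh, the spare column at x == 4, the chosen failed node, and the Manhattan-distance hop count are all assumptions made for illustration; none of them come from the slides, and this is not ULFM code.

```python
# Toy model: a "communicator" is a list mapping rank -> (x, y) node coordinates.
# Assumptions (not from the slides): a 4x4 compute mesh, one spare column at
# x == 4, and hop count measured as Manhattan distance on the mesh.

WIDTH, HEIGHT = 4, 4


def initial_mapping():
    """Rank r runs on node (r % WIDTH, r // WIDTH)."""
    return [(r % WIDTH, r // WIDTH) for r in range(WIDTH * HEIGHT)]


def substitute(mapping, failed_node, spare_node):
    """Build a new mapping that excludes the failed node and includes the spare."""
    return [spare_node if node == failed_node else node for node in mapping]


def stencil_neighbors(rank):
    """Ranks of the 5-point-stencil neighbors; substitution never changes these."""
    x, y = rank % WIDTH, rank // WIDTH
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [ny * WIDTH + nx for nx, ny in candidates
            if 0 <= nx < WIDTH and 0 <= ny < HEIGHT]


def max_hops(mapping):
    """Longest physical distance between any pair of stencil neighbors."""
    def hops(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    return max(hops(mapping[r], mapping[n])
               for r in range(len(mapping)) for n in stencil_neighbors(r))


before = initial_mapping()
after = substitute(before, failed_node=(1, 2), spare_node=(4, 0))
print(max_hops(before), max_hops(after))
```

The neighbor ranks seen by the application kernel are identical before and after the substitution, which is why the kernel needs no change. Only the physical placement differs, and that is where the regular pattern is lost.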
7 Is a spare node really the answer?
• Our scope
  • Is there any penalty? If so, how large?
  • How should spare nodes be allocated?
  • How many spare nodes should be allocated?
  • How should failed nodes be substituted by spare nodes?
• Out of scope
  • How (soft/hard) errors are detected
  • How checkpoints are taken
  • How tasks are migrated
8 Spare Node Penalty (1)
• Spare node allocation and node utilization
[Figure: 6x6 node grids (nodes 0 to 35) illustrating the 2D(1,1), 2D(2,1), and 2D(2,2) spare node allocations]
[Plot: % Spare Nodes (0 to 14) versus # Nodes (10,000 to 1,000,000) for 3D(3,1), 3D(2,1), 3D(1,1), 2D(2,1), and 2D(1,1)]
9 How many spare nodes?
• MTBF of a node
  • 50,000 hr ≈ 5.7 years
• MTBF of an exascale system (10^6 nodes)
  • 0.05 hr = 3 min
• #Spare = 10,000 (1%, i.e., 10^4 out of 10^6)
  • 500 hr ≈ 21 days until the spares are used up
[Plot: System MTBF (50,000 hr/node) versus # Nodes (10,000 to 1,000,000) for 3D(3,1), 3D(2,1), 3D(1,1), 2D(2,1), and 2D(1,1)]
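The arithmetic behind these figures can be reproduced in a few lines. This is a back-of-the-envelope sketch that assumes independent node failures with a constant rate, so that per-node failure rates simply add up; the function names are ours.

```python
# Back-of-the-envelope MTBF arithmetic for the numbers on the slide.
# Assumption: node failures are independent with a constant rate, so the
# system failure rate is the sum of the per-node failure rates.

NODE_MTBF_HOURS = 50_000   # per-node MTBF from the slide
NUM_NODES = 10**6          # exascale-class system from the slide
NUM_SPARES = 10**4         # 1% spare nodes from the slide


def system_mtbf_hours(node_mtbf_hours, num_nodes):
    """Mean time between failures anywhere in the system."""
    return node_mtbf_hours / num_nodes


def hours_until_spares_exhausted(node_mtbf_hours, num_nodes, num_spares):
    """Expected time until as many failures have occurred as there are spares."""
    return num_spares * system_mtbf_hours(node_mtbf_hours, num_nodes)


mtbf = system_mtbf_hours(NODE_MTBF_HOURS, NUM_NODES)
lifetime = hours_until_spares_exhausted(NODE_MTBF_HOURS, NUM_NODES, NUM_SPARES)
print(f"system MTBF: {mtbf * 60:.0f} min")
print(f"spares last: {lifetime:.0f} hr ({lifetime / 24:.1f} days)")
```

The sketch ignores failures of the spare nodes themselves and any repair of failed nodes, so it is only a rough estimate.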
10 Spare Node Allocation
• Change the spare node allocation method according to the number of nodes
[Plots: % Spare Nodes versus # Nodes (10,000 to 1,000,000) for 3D(3,1), 3D(2,1), 3D(1,1), 2D(2,1), and 2D(1,1); one panel with a y-axis of 0 to 14, one with 0 to 5]
11 Spare Node Penalty (2)
• Possibility of communication performance degradation
• 5P stencil communication pattern
• 2D Cartesian network and XY routing
[Figure: panels "Normal" and "After substitution"; labels: Spare Nodes, S, F]
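To see why substitution can degrade communication under XY routing, the following sketch counts how many 5-point-stencil messages share a directed link. The 4x4 mesh, the spare column at x == 4, the failed corner node, and the "messages per link" metric are assumptions chosen for illustration; the metric used in the authors' simulation may be defined differently.

```python
from collections import Counter

WIDTH, HEIGHT = 4, 4   # hypothetical compute mesh; spare column at x == WIDTH


def xy_route(src, dst):
    """Directed links used by XY routing: travel along X first, then along Y."""
    x, y = src
    links = []
    while x != dst[0]:
        step = 1 if dst[0] > x else -1
        links.append(((x, y), (x + step, y)))
        x += step
    while y != dst[1]:
        step = 1 if dst[1] > y else -1
        links.append(((x, y), (x, y + step)))
        y += step
    return links


def max_link_load(mapping):
    """Largest number of stencil messages that share one directed link."""
    load = Counter()
    for rank, src in enumerate(mapping):
        x, y = rank % WIDTH, rank // WIDTH          # logical stencil position
        for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if 0 <= nx < WIDTH and 0 <= ny < HEIGHT:
                load.update(xy_route(src, mapping[ny * WIDTH + nx]))
    return max(load.values())


normal = [(r % WIDTH, r // WIDTH) for r in range(WIDTH * HEIGHT)]
# Rank 0 (corner node (0, 0)) fails and restarts on the spare at (4, 3).
# The corner is chosen so that no XY route has to cross the failed node.
substituted = [(4, 3) if node == (0, 0) else node for node in normal]

print(max_link_load(normal), max_link_load(substituted))
```

In the normal mapping every message travels one hop and no link is shared. After the substitution, the long routes to and from the spare overlap with each other and with regular neighbor traffic.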
12 Sliding Substitution
• 0D sliding
• 1D sliding
• 2D sliding
• 3D, 4D, ... sliding
[Figure: a 6x6 grid (nodes 0 to 35) with spare nodes in which node 21 fails; three panels show the resulting layout after 0D, 1D, and 2D sliding]
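The difference between 0D and 1D sliding can be sketched on a single row. This is our reading of the figure, not the authors' implementation: we assume one spare node at the right end of each row, that 0D sliding moves only the failed node's rank onto the spare, and that 1D sliding shifts every rank from the failed position onward by one node toward the spare.

```python
# Sketch of 0D and 1D sliding on one row of a 2D mesh.
# Assumptions (ours, for illustration): each row has exactly one spare node at
# its right end; a row is a list whose index is the physical position and
# whose value is the rank placed there (None = failed or unused node).

def zero_d_sliding(row, failed_pos):
    """0D: move only the rank of the failed node onto the spare."""
    new_row = list(row)
    new_row[-1] = new_row[failed_pos]   # spare takes over the failed node's rank
    new_row[failed_pos] = None          # failed node no longer hosts a rank
    return new_row


def one_d_sliding(row, failed_pos):
    """1D: shift every rank from the failed position onward one step right."""
    new_row = list(row)
    new_row[failed_pos + 1:] = row[failed_pos:-1]
    new_row[failed_pos] = None
    return new_row


def distance(row, rank_a, rank_b):
    """Physical distance along the row between two ranks."""
    return abs(row.index(rank_a) - row.index(rank_b))


# Row 18..23 of the 6x6 example plus one spare slot; node 21 fails.
row = [18, 19, 20, 21, 22, 23, None]
failed = row.index(21)
row_0d = zero_d_sliding(row, failed)
row_1d = one_d_sliding(row, failed)
print(row_0d, distance(row_0d, 20, 21))
print(row_1d, distance(row_1d, 20, 21))
```

In this one-row model, 1D sliding keeps former neighbors closer together than 0D sliding. It does not capture the vertical neighbors, which end up one column away from the shifted ranks.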
13 5P Stencil on 2D Network
• Simulated results
• Spare allocation
  • 2D(2,1) > 2D(1,1)
• Max. failures
  • 0D: up to #Spare
  • 1D: 3 (or more)
  • 2D: up to 2
• Comm. perf.
  • 2D > 1D > 0D
[Plots: Max. Collisions versus # Failures for Mesh and Torus; annotations: "combinatorial explosion", "up to 3 failures in worst case (2D Cart. Topo.)", "no message collision, up to 2 failures"]
14 5P Stencil Comm. Perf.
[Plots: Relative Latency on the K computer and BG/Q for message sizes of 256 KiB, 1 MiB, and 4 MiB; smaller is better]
15 Collective Performance
• On K and BG/Q, collective operations are optimized for their networks.
• Having spare nodes makes the optimization very difficult.
• BG/Q's optimization works only with MPI_COMM_WORLD
[Plots: Rel. Perf. (based on 23x23, 15x31, and 16x32) for 0D, 1D, 1D+, and 2D sliding; smaller is better]
16 Summary
• The study of spare node substitution has just begun
• Comm. perf. degradation is observed
  • 5P stencil:
    • Simulation: up to 100 times larger latency
    • Experiment: < 20 times larger latency
  • Collective: up to 12 times larger latency
17 Current and Future Work
• Evaluations with real applications
• Node-rank re-mapping algorithms, or better substitution methods
• Dragonfly and/or fat-tree networks?
  • Experiments using Tsubame 2.5 (fat-tree) are scheduled
• At this moment, it is still unclear whether having spare nodes is a promising technique
18 Acknowledgement
Thanks to Dr. Norbert Attig at Jülich Supercomputing Center for giving us the chance to use JUQUEEN.