APIs, Architecture and Modeling for Extreme Scale Resilience
Dagstuhl Seminar: Resilience in Exascale Computing, 9/30/2014
Kento Sato
LLNL-PRES-661421
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC.
Failures on HPC systems
• System resilience is critical for future extreme-scale computing.
• 191 failures were observed over 5 million node-hours:
  – A production application using the laser-plasma interaction code pF3D
  – On the Hera, Atlas and Coastal clusters at LLNL => MTBF: 1.2 days
  – Cf. TSUBAME2.0 => MTBF: about a day
• At extreme scale, the failure rate will increase further.
• HPC systems must now treat failures as usual events.
Motivation for resilience APIs
• Current MPI implementations do not provide fault-tolerance capabilities:
  – Standard MPI employs a fail-stop model
• When a failure occurs:
  – MPI terminates all processes
  – The user locates the failed nodes and replaces them with spare nodes
  – MPI is re-initialized
  – The last checkpoint is restored
• Applications will spend more and more time on recovery:
  – Users manually locate and replace the failed nodes with spare nodes via a machinefile
  – These manual recovery operations may introduce extra overhead and human errors
=> APIs to handle failures are critical

[Figure: recovery flow — Start → application run (with periodic checkpointing) → failure → terminate processes → locate failed node → replace failed node → MPI re-initialization → restore checkpoint → application run → End]
Resilience APIs, Architecture and the model
• Resilience APIs => Fault Tolerant Messaging Interface (FMI)

[Figure: overview — FMI provides fault-tolerant messaging between the application and the system, spanning the compute nodes and the parallel file system]
FMI: Fault Tolerant Messaging Interface [IPDPS2014]
• FMI is a survivable messaging interface providing an MPI-like interface:
  – Scalable failure detection => overlay network
  – Dynamic node allocation => FMI ranks are virtualized
  – Fast checkpoint/restart => in-memory diskless checkpoint/restart

[Figure: FMI overview — the user's view is virtual FMI ranks 0–7 behind an MPI-like interface; FMI's view maps processes P0–P9 onto Nodes 0–4 (including a spare), with in-memory RAID-5 checkpoints (data blocks P0-0 … P7-2 plus Parity 0 … Parity 7) striped across the nodes, an overlay network over the ranks for scalable failure detection, and dynamic node allocation]
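To make the diskless RAID-5 scheme concrete, here is a minimal sketch of the XOR parity encoding it relies on. The function name and plain-buffer setup are assumptions for illustration; in FMI the peer blocks would be exchanged over the network rather than sitting in local arrays.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch: XOR-encode one parity block from the
     * checkpoint blocks of the other processes in an encoding group,
     * as in RAID-5-style diskless checkpointing. */
    static void xor_parity(uint8_t *parity, uint8_t *const blocks[],
                           int nblocks, size_t block_size)
    {
        for (size_t i = 0; i < block_size; i++) {
            uint8_t p = 0;
            for (int b = 0; b < nblocks; b++)
                p ^= blocks[b][i];
            parity[i] = p;
        }
    }

    /* Recovery uses the same operation: XOR of the surviving blocks
     * and the parity block reproduces the one lost block. */

This is why a single node failure per encoding group is recoverable without touching the parallel file system.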
How do FMI applications work?
FMI example code:

    int main(int argc, char *argv[]) {
      FMI_Init(&argc, &argv);
      FMI_Comm_rank(FMI_COMM_WORLD, &rank);
      /* Application's initialization */
      while ((n = FMI_Loop(...)) < numloop) {
        /* Application's program */
      }
      /* Application's finalization */
      FMI_Finalize();
    }

• FMI_Loop enables transparent recovery and roll-back on a failure:
  – Periodically writes a checkpoint
  – Restores the last checkpoint on a failure
• Processes are launched via fmirun:
  – fmirun spawns fmirun.task on each node
  – fmirun.task fork/execs the user program
  – fmirun broadcasts connection information (endpoints) for FMI_Init(...)

[Figure: launching FMI processes — fmirun reads machine_file (node0.fmi.gov … node4.fmi.gov); fmirun.task on Nodes 0–3 forks ranks P0–P7, two per node; Node 4 is kept as a spare node]
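As a rough illustration of this launch sequence, the following is a hypothetical per-node launcher in the spirit of fmirun.task: it forks one child per local rank and execs the user program. All names, arguments, and the environment-variable handshake are assumptions for illustration, not FMI's actual implementation.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(int argc, char *argv[])
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s <ranks_per_node> <app> [args...]\n", argv[0]);
            return 1;
        }
        int ranks = atoi(argv[1]);
        for (int r = 0; r < ranks; r++) {
            pid_t pid = fork();
            if (pid == 0) {                   /* child: become one rank */
                char rankstr[16];
                snprintf(rankstr, sizeof(rankstr), "%d", r);
                setenv("FMI_LOCAL_RANK", rankstr, 1);  /* hypothetical handshake */
                execvp(argv[2], &argv[2]);    /* replace child with the app */
                perror("execvp");             /* only reached on failure */
                _exit(127);
            }
        }
        /* Parent: wait for all local ranks; a non-zero exit status is the
         * kind of event a resilient runtime must detect and handle. */
        int status;
        while (wait(&status) > 0)
            ;
        return 0;
    }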
User perspective: no failures
Code: the FMI example from the previous slide, with numloop = 4.

Timeline (ranks 0–7 on Nodes 0–3):
  FMI_Init → FMI_Comm_rank → 0 = FMI_Loop(…) [checkpoint 0] → 1 = FMI_Loop(…) → 2 = FMI_Loop(…) [checkpoint 1] → 3 = FMI_Loop(…) → 4 = FMI_Loop(…) → FMI_Finalize

• User's perspective when no failure happens:
  – Iterations: 4
  – Checkpoint frequency: every 2 iterations
  – FMI_Loop returns the incremented iteration id
User perspective: failure
Timeline (ranks 0–7 on Nodes 0–3):
  FMI_Init → FMI_Comm_rank → 0 = FMI_Loop(…) [checkpoint 0] → 1 = FMI_Loop(…) → 2 = FMI_Loop(…) [checkpoint 1] → 3 = FMI_Loop(…) → failure → [restart: 1] → 2 = FMI_Loop(…) → 3 = FMI_Loop(…) → 4 = FMI_Loop(…) → FMI_Finalize

• FMI transparently migrates FMI ranks 0 and 1 to a spare node.
• Execution restarts from the last checkpoint:
  – The second checkpoint (checkpoint 1), taken at iteration 2
• With FMI, applications keep using the same series of ranks even after failures.
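A minimal sketch of the control flow these timelines imply for FMI_Loop, written against hypothetical runtime helpers; this is illustrative pseudo-internals consistent with the slides, not FMI's actual code.

    #include <stddef.h>

    /* Stubs standing in for runtime internals; every name here is an
     * assumption for illustration. */
    static int  last_ckpt_iter = 0;
    static int  runtime_detected_failure(void) { return 0; }
    static void write_checkpoint(void **ckpt, size_t *sizes, int len, int iter)
    { (void)ckpt; (void)sizes; (void)len; last_ckpt_iter = iter; }
    static void restore_last_checkpoint(void **ckpt, size_t *sizes, int len)
    { (void)ckpt; (void)sizes; (void)len; }

    /* Checkpoint every `interval` iterations and return the incremented
     * iteration id; after a failure, restore the last checkpoint and
     * resume from the iteration at which it was taken (iteration 2 in
     * the timeline above). */
    int fmi_loop_sketch(void **ckpt, size_t *sizes, int len)
    {
        static int iter = 0;
        const int interval = 2;   /* checkpoint every 2 iterations */

        if (runtime_detected_failure()) {
            restore_last_checkpoint(ckpt, sizes, len);
            iter = last_ckpt_iter;          /* roll back, e.g. to iteration 2 */
        } else if (iter % interval == 0) {
            write_checkpoint(ckpt, sizes, len, iter);
        }
        return iter++;
    }

Tracing this by hand reproduces the return sequence 0, 1, 2, 3 … and, after a failure, 2, 3, 4, matching both timelines.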
Resilience API: FMI_Loop

    int FMI_Loop(void **ckpt, size_t *sizes, int len)

  ckpt : array of pointers to variables containing data that needs to be checkpointed
  sizes: array of sizes of the checkpointed variables
  len  : length of the arrays ckpt and sizes
  Returns the iteration id.

• FMI constructs in-memory RAID-5 checkpoints across compute nodes.
• Checkpoint group size, e.g. group_size = 4:
  – Ranks are divided into encoding groups; each group stripes data blocks and rotating parity blocks across its nodes.

[Figure: FMI checkpointing — 16 processes on Nodes 0–7 split into two encoding groups; within each group, data blocks (P0-0 … P7-2) and parity blocks (Parity 0 … Parity 7) are distributed RAID-5 style across the nodes]
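A minimal usage sketch of the signature above; the header name, grid arrays, sizes, and loop bound are assumptions for illustration.

    #include <stddef.h>
    #include "fmi.h"   /* FMI header; name assumed */

    #define NX 1024
    #define NY 1024

    int main(int argc, char *argv[])
    {
        static double grid[NX * NY], grid_new[NX * NY];
        int rank, n;

        FMI_Init(&argc, &argv);
        FMI_Comm_rank(FMI_COMM_WORLD, &rank);
        (void)rank;   /* rank would select this process's subdomain */

        /* Register the state that must survive a failure. */
        void  *ckpt[]  = { grid, grid_new };
        size_t sizes[] = { sizeof(grid), sizeof(grid_new) };

        /* FMI_Loop checkpoints ckpt[0..len-1] periodically and, after a
         * failure, restores them and returns the restart iteration id. */
        while ((n = FMI_Loop(ckpt, sizes, 2)) < 1000) {
            /* one solver iteration over grid/grid_new ... */
        }

        FMI_Finalize();
        return 0;
    }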
Application runtime with failures
• Benchmark: Poisson's equation solver using the Jacobi iteration method
  – Stencil application benchmark
  – MPI_Isend, MPI_Irecv, MPI_Wait and MPI_Allreduce within a single iteration
• For MPI, we use the SCR library for checkpointing:
  – Since MPI is not a survivable messaging interface, we write checkpoints to memory on tmpfs
• The checkpoint interval is optimized by Vaidya's model for both FMI and MPI

P2P communication performance:
        1-byte latency   Bandwidth (8 MB)
  MPI   3.555 usec       3.227 GB/s
  FMI   3.573 usec       3.211 GB/s

• FMI directly writes checkpoints via memcpy, and so can exploit the memory bandwidth.
• Even with a high failure rate (MTBF: 1 minute), FMI incurs only a 28% overhead.

[Figure: performance (GFlops) vs. number of processes (12 processes/node, up to ~1,500 processes) for MPI, FMI, MPI + checkpointing, FMI + checkpointing, and FMI + checkpointing/restart under an MTBF of 1 minute]
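Vaidya's model accounts for checkpoint overhead and latency when optimizing the interval. As a simpler stand-in (not the exact model used on the slide), the classic first-order approximation known as Young's formula illustrates how an optimal interval falls out of the checkpoint cost and the MTBF.

    #include <math.h>
    #include <stdio.h>

    /* First-order approximation of the optimal checkpoint interval:
     *   t_opt = sqrt(2 * C * MTBF)
     * where C is the checkpoint cost and MTBF the mean time between
     * failures, both in seconds. */
    static double opt_interval(double ckpt_cost_sec, double mtbf_sec)
    {
        return sqrt(2.0 * ckpt_cost_sec * mtbf_sec);
    }

    int main(void)
    {
        /* Illustrative numbers only: a 5-second in-memory checkpoint
         * under the slide's 1-minute MTBF gives roughly a 24.5 s interval. */
        printf("interval ~ %.1f s\n", opt_interval(5.0, 60.0));
        return 0;
    }

The qualitative takeaway matches the slide: the cheaper the checkpoint (FMI's memcpy-speed writes), the shorter the affordable interval, and the less work is lost per failure.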
Asynchronous multi-level checkpointing (MLC) [SC12]
• Asynchronous MLC is a technique for achieving high reliability while reducing checkpointing overhead.
• Asynchronous MLC uses storage levels hierarchically:
  – RAID-5 (level-1) checkpoints: frequent, for single-node or few-node failures
  – PFS (level-2) checkpoints: less frequent and asynchronous, for multi-node failures
• Our previous work [SC12] models asynchronous MLC.

Failure analysis on the Coastal cluster:
               MTBF        Failure rate
  L1 failure   130 hours   2.13e-6 failures/sec
  L2 failure   650 hours   4.27e-7 failures/sec

[Figure: timeline showing frequent level-1 (RAID-5) checkpoints interleaved with occasional asynchronous level-2 (PFS) checkpoints]

Sources:
• K. Sato, N. Maruyama, K. Mohror, A. Moody, T. Gamblin, B. R. de Supinski, and S. Matsuoka, "Design and Modeling of a Non-Blocking Checkpointing System," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12), Salt Lake City, Utah: IEEE Computer Society Press, 2012.
• A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '10).
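A hedged sketch of how a runtime might interleave the two levels: frequent level-1 (RAID-5) checkpoints, with every k-th checkpoint escalated to the PFS. The ratio k is fixed here for illustration; a real system would derive it from the per-level failure rates and costs (as the modeling work above does) and would write the PFS checkpoint asynchronously.

    #include <stdio.h>

    enum ckpt_level { CKPT_L1_RAID5 = 1, CKPT_L2_PFS = 2 };

    /* Escalate every l2_every-th checkpoint to the slower, more
     * durable level; all other checkpoints stay in memory. */
    static enum ckpt_level choose_level(int ckpt_count, int l2_every)
    {
        return (ckpt_count % l2_every == 0) ? CKPT_L2_PFS : CKPT_L1_RAID5;
    }

    int main(void)
    {
        /* L2 failures are ~5x rarer than L1 failures on Coastal
         * (650 h vs. 130 h MTBF), so one PFS checkpoint per several
         * RAID-5 checkpoints is a plausible illustrative ratio. */
        for (int c = 1; c <= 10; c++)
            printf("checkpoint %2d -> level %d\n", c, choose_level(c, 5));
        return 0;
    }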