Billion-Way Resiliency for Extreme Scale Computing
Seminar at German Research School for Simulation Sciences, Aachen
October 6th, 2014
Kento Sato, Lawrence Livermore National Laboratory
LLNL-PRES-662034
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Failures on HPC systems
! Exponential growth in computational power
  • Enables finer-grained simulations over shorter periods of time
! The overall failure rate increases accordingly because of the increasing system size
! 191 failures out of 5 million node-hours
  • A production application of a laser-plasma interaction code (pF3D)
  • Hera, Atlas and Coastal clusters @ LLNL
! Estimated MTBF, assuming no hardware reliability improvement per component in the future (see the scaling check after this slide):
  1,000 nodes:    1.2 days    (measured)
  10,000 nodes:   2.9 hours   (estimation)
  100,000 nodes:  17 minutes  (estimation)
  Source: A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System" (SC'10)
! It will be difficult for applications to run continuously for a long time without fault tolerance at extreme scale
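The drop from 1.2 days to 17 minutes is what one gets by assuming node failures are independent, so the system MTBF shrinks inversely with the node count; a back-of-the-envelope check under that assumption (the cited SC'10 paper's model may differ in detail):

  \mathrm{MTBF}(N) \approx \mathrm{MTBF}(N_0) \cdot \frac{N_0}{N}
  \mathrm{MTBF}(10{,}000) \approx 1.2\ \mathrm{days} \times \tfrac{1{,}000}{10{,}000} = 2.88\ \mathrm{hours} \approx 2.9\ \mathrm{hours}
  \mathrm{MTBF}(100{,}000) \approx 1.2\ \mathrm{days} \times \tfrac{1{,}000}{100{,}000} \approx 17\ \mathrm{minutes}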
Conventional fault tolerance in MPI apps
! Checkpoint/Recovery (C/R)
  • Long-running MPI applications are required to write checkpoints (a minimal C/R skeleton is sketched after this slide)
! MPI
  • De-facto communication library enabling parallel computing
  • Standard MPI employs a fail-stop model
! When a failure occurs ...
  • MPI terminates all processes
  • The user locates and replaces the failed nodes with spare nodes
  • Re-initialize MPI
  • Restore the last checkpoint
! The fail-stop model of MPI is quite simple
  • All processes synchronize at each step to restart
[Figure: conventional recovery flow - Start -> MPI initialization -> Application run (with checkpointing) -> Failure -> Terminate processes -> Locate failed node -> Replace failed node -> MPI re-initialization -> Restore checkpoint -> Application run -> End]
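For reference, a minimal sketch of what such an application-level C/R loop can look like in plain MPI; the per-rank checkpoint file, the fixed checkpoint interval, and the state array are illustrative assumptions, not taken from the slides:

  #include <mpi.h>
  #include <stdio.h>

  #define N 1024

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double state[N] = {0};
    int iter = 0;
    char path[64];
    snprintf(path, sizeof(path), "ckpt.%d", rank);   /* hypothetical per-rank checkpoint file */

    /* On restart after a failure, each rank restores the last checkpoint it wrote. */
    FILE *f = fopen(path, "rb");
    if (f) {
      fread(&iter, sizeof(iter), 1, f);
      fread(state, sizeof(double), N, f);
      fclose(f);
    }

    for (; iter < 100; iter++) {
      /* ... application computation and MPI communication ... */

      if (iter % 10 == 0) {                  /* periodic checkpoint */
        FILE *c = fopen(path, "wb");
        fwrite(&iter, sizeof(iter), 1, c);
        fwrite(state, sizeof(double), N, c);
        fclose(c);
        MPI_Barrier(MPI_COMM_WORLD);         /* keep checkpoints globally consistent */
      }
    }

    MPI_Finalize();
    return 0;
  }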
Requirement of fast and transparent recovery
! Failure rates will increase in future extreme scale systems
  • Applications will spend more time on checkpointing and recovery
    – Whenever a failure occurs, users manually locate the failed nodes and replace them with spare nodes via a machinefile
    – The manual recovery operations may introduce extra overhead and human errors
  • Resilience APIs for fast and transparent recovery are becoming more critical for extreme scale computing
[Figure: the same recovery flow as the previous slide, with the manual recovery steps (locate failed node, replace failed node, MPI re-initialization, restore checkpoint) highlighted]
Resilience APIs, Architecture and the model
! Resilience APIs: Fault Tolerant Messaging Interface (FMI)
[Figure: target architecture - compute nodes connected to a parallel file system]
Challenges for fast and transparent recovery
! Scalable failure detection
  • When recovering from a failure, all processes need to be notified (see the notification sketch after this slide)
! Survivable messaging interface
  • At extreme scale, even termination and initialization of processes will be expensive
  • Not terminating non-failed processes is important
! Transparent and dynamic node allocation
  • Manually locating and replacing failed nodes will introduce extra overhead and human errors
! Fast checkpoint/restart
[Figure: the conventional recovery flow, annotated with the steps each challenge addresses]
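To make the scalability point concrete, here is a small, self-contained simulation of failure notification over an overlay; the slides only say that an overlay network is used for detection, so the specific topology below (each notified process forwards to peers at distances 1, 2, 4, ...) is purely an illustrative assumption:

  #include <stdio.h>
  #include <string.h>

  /* Illustrative only: count how many forwarding rounds it takes until all
     N processes have heard about a failure when every notified process
     forwards the notification to overlay neighbors at power-of-two
     distances. The result is about log2(N) rounds instead of the O(N)
     hops a plain ring would need. */
  #define N 1024

  int main(void) {
    int notified[N] = {0};
    notified[0] = 1;                 /* process 0 detects the failure first */
    int rounds = 0, done = 0;

    while (!done) {
      int next[N];
      memcpy(next, notified, sizeof(next));
      /* every already-notified process forwards to its overlay neighbors */
      for (int p = 0; p < N; p++) {
        if (!notified[p]) continue;
        for (int k = 1; k < N; k <<= 1)
          next[(p + k) % N] = 1;
      }
      memcpy(notified, next, sizeof(notified));
      rounds++;
      done = 1;
      for (int p = 0; p < N; p++)
        if (!notified[p]) { done = 0; break; }
    }
    printf("all %d processes notified after %d rounds\n", N, rounds);
    return 0;
  }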
FMI: Fault Tolerant Messaging Interface [IPDPS 2014]
! FMI is a survivable messaging interface providing an MPI-like interface
  • Scalable failure detection => overlay network
  • Dynamic node allocation => FMI ranks are virtualized (see the sketch after this slide)
  • Fast checkpoint/restart => diskless checkpoint/restart
[Figure: FMI overview - the user's view is virtual FMI ranks 0-7 behind an MPI-like interface; FMI's view maps those ranks onto processes P0-P9 spread over Node 0 - Node 4 (Node 4 a spare), keeps in-memory checkpoint blocks plus parity blocks (Parity 0 - Parity 7) for fast checkpoint/restart, and connects the processes in an overlay network for scalable failure detection]
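A toy sketch of the rank-virtualization idea: a lookup table maps stable virtual FMI ranks to physical processes, so replacing a failed node only rewrites the affected entries while the application keeps using the same rank numbers. The struct and function names are hypothetical, not FMI's internal API:

  #include <stdio.h>

  /* Hypothetical sketch: a virtual FMI rank is a stable index into a table
     of physical endpoints, so replacing a failed process only rewrites one
     table entry -- application code keeps using ranks 0..N-1. */
  typedef struct {
    int node_id;          /* physical node currently hosting this rank */
    int process_id;       /* process slot on that node                 */
  } endpoint_t;

  #define NRANKS 8
  static endpoint_t rank_table[NRANKS];

  /* Re-bind a virtual rank to a process launched on a spare node. */
  static void rebind_rank(int fmi_rank, int spare_node, int spare_process) {
    rank_table[fmi_rank].node_id    = spare_node;
    rank_table[fmi_rank].process_id = spare_process;
  }

  int main(void) {
    for (int r = 0; r < NRANKS; r++) {        /* initial mapping: 2 ranks per node */
      rank_table[r].node_id    = r / 2;
      rank_table[r].process_id = r % 2;
    }
    rebind_rank(0, 4, 0);                     /* Node 0 fails: ranks 0 and 1 move  */
    rebind_rank(1, 4, 1);                     /* to spare Node 4                   */
    for (int r = 0; r < NRANKS; r++)
      printf("FMI rank %d -> node %d, proc %d\n",
             r, rank_table[r].node_id, rank_table[r].process_id);
    return 0;
  }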
How FMI applications work?
FMI example code:
  int main (int argc, char *argv[]) {
    int n, rank;
    FMI_Init(&argc, &argv);
    FMI_Comm_rank(FMI_COMM_WORLD, &rank);
    /* Application's initialization */
    while ((n = FMI_Loop(…)) < numloop) {
      /* Application's program */
    }
    /* Application's finalization */
    FMI_Finalize();
  }
! FMI_Loop enables transparent recovery and roll-back on a failure
  • Periodically writes a checkpoint
  • Restores the last checkpoint on a failure
! Processes are launched via fmirun
  • fmirun spawns fmirun.task on each node
  • fmirun.task forks and execs the user program (see the sketch after this slide)
  • fmirun broadcasts the connection information (endpoints) needed for FMI_Init(…)
[Figure: launching FMI processes - fmirun reads machine_file (node0.fmi.gov ... node4.fmi.gov), starts fmirun.task on Node 0 - Node 3, each of which launches two processes (P0-P7); Node 4 is kept as a spare node]
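The slide states that fmirun.task fork/execs the user program on each node; below is a minimal, hypothetical sketch of that per-node launch step (endpoint exchange and failure monitoring, which the real fmirun.task also handles, are omitted):

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/wait.h>

  /* Hypothetical sketch: launch `nprocs` copies of the user program on this
     node, the way a per-node launcher such as fmirun.task might. */
  int main(int argc, char **argv) {
    if (argc < 3) {
      fprintf(stderr, "usage: %s <nprocs> <user_program> [args...]\n", argv[0]);
      return 1;
    }
    int nprocs = atoi(argv[1]);

    for (int i = 0; i < nprocs; i++) {
      pid_t pid = fork();
      if (pid == 0) {                     /* child: become the user program */
        execvp(argv[2], &argv[2]);
        perror("execvp");                 /* only reached if exec fails     */
        _exit(127);
      }
    }
    /* Parent: wait for the local processes; a real launcher would also
       monitor them and report failures back to fmirun. */
    while (wait(NULL) > 0)
      ;
    return 0;
  }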
User perspective: No failures
FMI example code:
  int main (int argc, char *argv[]) {
    int n, rank;
    FMI_Init(&argc, &argv);
    FMI_Comm_rank(FMI_COMM_WORLD, &rank);
    /* Application's initialization */
    while ((n = FMI_Loop(…)) < 4) {
      /* Application's program */
    }
    /* Application's finalization */
    FMI_Finalize();
  }
Execution on ranks 0-7 (Node 0 - Node 3):
  FMI_Init, FMI_Comm_rank
  0 = FMI_Loop(…)   -> checkpoint 0 written
  1 = FMI_Loop(…)
  2 = FMI_Loop(…)   -> checkpoint 1 written
  3 = FMI_Loop(…)
  4 = FMI_Loop(…)   -> loop exits
  FMI_Finalize
• User perspective when no failure happens
• Iterations: 4
• Checkpoint frequency: every 2 iterations
• FMI_Loop returns the incremented iteration id
User perspective: Failure
Execution on ranks 0-7 with a failure (same code as the previous slide):
  FMI_Init, FMI_Comm_rank
  0 = FMI_Loop(…)   -> checkpoint 0 written
  1 = FMI_Loop(…)
  2 = FMI_Loop(…)   -> checkpoint 1 written
  3 = FMI_Loop(…)
  <failure>
  restart from checkpoint 1 -> 2 = FMI_Loop(…)
  3 = FMI_Loop(…)
  4 = FMI_Loop(…)   -> loop exits
  FMI_Finalize
• FMI transparently migrates FMI ranks 0 and 1 to a spare node
• Restart from the last checkpoint
  – the 2nd checkpoint (checkpoint 1), taken at iteration 2
• With FMI, applications still use the same series of ranks even after failures
FMI_Loop
  int FMI_Loop(void **ckpt, size_t *sizes, int len)
    ckpt : array of pointers to the variables containing data that needs to be checkpointed
    sizes: array of sizes of the checkpointed variables
    len  : length of the arrays ckpt and sizes
    returns the iteration id
  (a usage sketch follows this slide)
! FMI constructs an in-memory RAID-5 across compute nodes
! Checkpoint group size
  • e.g.) group_size = 4
[Figure: FMI checkpointing - 16 ranks on Node 0 - Node 7 (two per node), divided into encoding groups of four nodes; each rank's checkpoint is split into blocks (e.g., P0-0, P0-1, P0-2) and XOR parity blocks (Parity 0 ... Parity 7) are rotated RAID-5 style across the nodes of each encoding group]
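A hedged usage sketch built around the signature above: two illustrative arrays are registered for checkpointing via ckpt/sizes/len, and the returned iteration id drives the loop as on the earlier slides. The header name fmi.h, the loop bound, and the array contents are assumptions; how the checkpoint interval and group size are configured is not shown on this slide and is omitted here:

  #include <fmi.h>          /* assumed header name for the FMI interface */
  #include <stdlib.h>

  int main (int argc, char *argv[]) {
    int n, rank;
    FMI_Init(&argc, &argv);
    FMI_Comm_rank(FMI_COMM_WORLD, &rank);

    /* Illustrative application state: two arrays that must survive a failure. */
    double *temperature = calloc(1024, sizeof(double));
    double *pressure    = calloc(1024, sizeof(double));

    /* Register both arrays following int FMI_Loop(void **ckpt, size_t *sizes, int len). */
    void   *ckpt[2]  = { temperature, pressure };
    size_t  sizes[2] = { 1024 * sizeof(double), 1024 * sizeof(double) };

    /* On a normal iteration FMI_Loop returns the next iteration id (writing a
       checkpoint of the registered buffers at its checkpoint interval); after
       a failure it restores the buffers and returns the checkpointed id. */
    while ((n = FMI_Loop(ckpt, sizes, 2)) < 100) {
      /* ... one simulation step updating temperature and pressure ... */
    }

    FMI_Finalize();
    free(temperature);
    free(pressure);
    return 0;
  }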