Kento Sato†1, Kathryn Mohror†2, Adam Moody†2, Todd Gamblin†2, Bronis R. de Supinski†2, Naoya Maruyama†3 and Satoshi Matsuoka†1
†1 Tokyo Institute of Technology  †2 Lawrence Livermore National Laboratory  †3 RIKEN Advanced Institute for Computational Science
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-654744-DRAFT
May 27th, 2014, CCGrid2014 @ Chicago
Failures on HPC systems
• Exponential growth in computational power
  – Enables finer-grained simulations with shorter time periods
• The overall failure rate increases accordingly because of the growing system size
• 191 failures out of 5 million node-hours
  – A production application of a laser-plasma interaction code (pF3D)
  – Hera, Atlas and Coastal clusters @ LLNL
Estimated MTBF (assuming no hardware reliability improvement per component in the future):
  1,000 nodes:   1.2 days (measured)
  10,000 nodes:  2.9 hours (estimated)
  100,000 nodes: 17 minutes (estimated)
• It will be difficult for applications to run continuously for a long time without fault tolerance at extreme scale
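As a rough check of the table above, here is a minimal sketch that reproduces the estimated MTBFs, assuming the system MTBF scales inversely with node count and anchoring to the measured 1.2-day MTBF at 1,000 nodes (the scaling assumption is mine, not a statement from the slides):

    # Estimate system MTBF assuming it scales inversely with node count.
    measured_mtbf_hours = 1.2 * 24   # 1.2 days measured at 1,000 nodes
    base_nodes = 1000
    for nodes in (1_000, 10_000, 100_000):
        mtbf_hours = measured_mtbf_hours * base_nodes / nodes
        print(f"{nodes:>7} nodes: MTBF ~ {mtbf_hours:.2f} hours ({mtbf_hours * 60:.0f} minutes)")

This reproduces roughly 2.9 hours at 10,000 nodes and about 17 minutes at 100,000 nodes.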
Checkpoint/Restart (Software-Level)
• Idea of Checkpoint/Restart
  – Checkpoint: periodically save snapshots of an application state to the PFS
  – Restart: on a failure, restart the execution from the latest checkpoint
(Figure: execution with periodic checkpoints written to the parallel file system (PFS); each checkpoint adds checkpointing overhead, and after a failure the run resumes from the latest checkpoint.)
• Improved Checkpoint/Restart
  – Multi-level checkpointing [1]
  – Asynchronous checkpointing [2]
  – In-memory diskless checkpointing [3]
• We found that software-level approaches may be limited in increasing resiliency at extreme scale

[1] A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System", SC10
[2] Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama and Satoshi Matsuoka, "Design and Modeling of a Non-blocking Checkpointing System", SC12
[3] Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama and Satoshi Matsuoka, "FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery", IPDPS 2014
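The checkpoint/restart idea above can be summarized in a short sketch. This is a minimal single-process illustration with hypothetical helpers (compute_step stands in for the application's real work, and the checkpoint directory is a made-up stand-in for a PFS-backed path), not the systems described in the references:

    import glob
    import os
    import pickle

    CKPT_DIR = "pfs_ckpt"    # stand-in for a PFS-backed checkpoint directory
    INTERVAL = 100           # steps of work between checkpoints
    TOTAL_STEPS = 1000
    os.makedirs(CKPT_DIR, exist_ok=True)

    def compute_step(state, step):
        return state + step  # stand-in for the application's real work

    def write_checkpoint(state, step):
        with open(os.path.join(CKPT_DIR, f"ckpt_{step:08d}.pkl"), "wb") as f:
            pickle.dump((step, state), f)

    def load_latest_checkpoint():
        files = sorted(glob.glob(os.path.join(CKPT_DIR, "ckpt_*.pkl")))
        if not files:
            return 0, 0      # no checkpoint yet: start from the beginning
        with open(files[-1], "rb") as f:
            return pickle.load(f)

    step, state = load_latest_checkpoint()   # on restart, resume from the latest snapshot
    while step < TOTAL_STEPS:
        state = compute_step(state, step)
        step += 1
        if step % INTERVAL == 0:
            write_checkpoint(state, step)    # checkpointing overhead is paid at every interval

The cost of write_checkpoint relative to the failure rate is exactly the trade-off the later modeling slides quantify.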
Storage architectures
We consider architecture-level approaches
• Burst buffer
  – A new tier in the storage hierarchy
  – Absorbs bursty I/O requests from applications
  – Fills the performance gap between node-local storage and PFSs in both latency and bandwidth
(Figure: compute nodes write to burst buffers, which stage data to the parallel file system.)
• If you write checkpoints to burst buffers,
  – Faster checkpoint/restart time than the PFS
  – More reliable than storing on compute nodes
• However, ...
  – Adding burst buffer nodes may increase the total system size, and the failure rate accordingly
• It is not clear whether burst buffers improve overall system efficiency
  – Because burst buffers also connect to the network, they may still be a bottleneck

[4] Doraimani, Shyamala and Iamnitchi, Adriana, "File Grouping for Scientific Data Management: Lessons from Experimenting with Real Traces", HPDC '08
Goal and Contributions
• Goal:
  – Develop an interface to exploit the bandwidth of burst buffers
  – Explore the effectiveness of burst buffers
  – Find the best C/R strategy on burst buffers
• Contributions:
  – Development of IBIO, which exploits the bandwidth to burst buffers
  – A model to evaluate system resiliency given a C/R strategy and a storage configuration
  – Our experimental results show a direction for building resilient systems for extreme-scale computing
Outline
• Introduction
• Checkpoint strategies
• Storage designs
• IBIO: InfiniBand-based I/O interface
• Modeling
• Experiments
• Conclusion
Diskless checkpoint/restart (C/R)
• Diskless C/R
  – Creates redundant data across the local storage on compute nodes using an encoding technique such as XOR
  – Can restore checkpoints lost in a failure of a small number of nodes, like RAID-5
(Figure: XOR encoding example — Nodes 1-4 each hold checkpoint segments of the other nodes plus one parity segment, so a single node's lost checkpoint can be rebuilt.)
• Most failures come from a single node, or can be recovered from an XOR checkpoint
  – e.g. 1) TSUBAME2.0: 92% of failures
  – e.g. 2) LLNL clusters: 85% of failures
• The rest of the failures still require a checkpoint on a reliable PFS
(Figure: failure analysis on TSUBAME2.0 and the LLNL clusters — 92% and 85% of failures, respectively, are covered by LOCAL/XOR/PARTNER checkpoints; the remaining 8% and 15% need a PFS checkpoint.)
• Diskless checkpointing is a promising approach
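To make the XOR encoding concrete, here is a minimal RAID-5-style sketch of parity over per-node checkpoint blocks, with one block per node and made-up contents; real diskless checkpointing splits checkpoints into chunks and rotates the parity across nodes, as the figure suggests:

    def xor_blocks(blocks):
        """XOR equal-sized byte blocks together (used for both encoding and recovery)."""
        parity = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                parity[i] ^= byte
        return bytes(parity)

    # Checkpoints held in the node-local storage of four compute nodes (illustrative data).
    ckpts = [bytes([n]) * 8 for n in range(1, 5)]
    parity = xor_blocks(ckpts)            # parity block stored on a different node

    # Compute node 2 fails and its checkpoint is lost; XOR-ing the surviving
    # checkpoints with the parity block reconstructs it.
    lost = 1
    recovered = xor_blocks([c for i, c in enumerate(ckpts) if i != lost] + [parity])
    assert recovered == ckpts[lost]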
Multi-level Checkpoint/Restart (MLC/R) [1,2]
(Figure: MLC and its model — each interval of length t + c^k ends either with no failure or with a level-i failure followed by a level-k recovery of cost r^k, weighted by the probabilities and expected durations below.)

  p_0(T) = e^{-\lambda T}                                   : probability of no failure during T seconds
  t_0(T) = T                                                : expected run time given no failure
  p_i(T) = (\lambda_i / \lambda)(1 - e^{-\lambda T})        : probability of a level-i failure during T seconds
  t_i(T) = \frac{1 - (\lambda T + 1) e^{-\lambda T}}{\lambda (1 - e^{-\lambda T})} : expected run time until the failure, given one occurs

  t         : checkpoint interval
  c^k       : level-k checkpoint time
  r^k       : level-k recovery time
  \lambda_i : level-i failure rate, with \lambda = \sum_i \lambda_i

• MLC hierarchically uses the storage levels
  – Diskless (level-1) checkpoint: frequent, for single-node or few-node failures
  – PFS (level-2) checkpoint: less frequent and asynchronous, for multi-node failures
• Our evaluation showed that system efficiency drops to less than 10% when the MTBF is a few hours
(Figure: efficiency vs. scale factor (xF, xL2) — efficiency stays high while the MTBF is a day or more, but collapses once the MTBF falls to a few hours.)

[1] A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System", SC10
[2] Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama and Satoshi Matsuoka, "Design and Modeling of a Non-blocking Checkpointing System", SC12
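As a rough illustration of why efficiency collapses when the MTBF approaches the checkpoint interval, here is a minimal single-level sketch built from the expectations above; it assumes exponential failures, a fixed restart cost, and (as a simplification) no failures during recovery, and the parameter values are made up rather than taken from the paper:

    import math

    def expected_segment_time(t, c, r, lam):
        """t: compute interval, c: checkpoint cost, r: restart cost, lam: failure rate (all in seconds)."""
        T = t + c
        p0 = math.exp(-lam * T)                   # probability of no failure during the segment
        if p0 == 1.0:
            return T
        # expected time until the failure, given one occurs during the segment
        t_fail = (1 - (lam * T + 1) * math.exp(-lam * T)) / (lam * (1 - math.exp(-lam * T)))
        # E = p0*T + (1 - p0)*(t_fail + r + E)  =>  solve for E
        return (p0 * T + (1 - p0) * (t_fail + r)) / p0

    # Illustrative numbers: 1-hour compute interval, 5-minute checkpoint and restart costs.
    t, c, r = 3600.0, 300.0, 300.0
    for mtbf_hours in (24.0, 6.0, 1.0):
        lam = 1.0 / (mtbf_hours * 3600.0)
        eff = t / expected_segment_time(t, c, r, lam)
        print(f"MTBF {mtbf_hours:4.1f} h -> efficiency {eff:.2f}")

Under these made-up parameters the modeled efficiency falls from roughly 0.90 at a 24-hour MTBF to below 0.5 at a 1-hour MTBF, the same qualitative trend as the plot on this slide.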
Uncoordinated C/R + MLC
(Figure: coordinated C/R — processes P0-P3 all checkpoint together and all roll back on a failure; uncoordinated C/R — processes are grouped into clusters A and B, messages crossing cluster boundaries are logged, and only the failed cluster rolls back.)
• Coordinated C/R
  – All processes globally synchronize before taking checkpoints, and all restart on a failure
  – Restart overhead
• Uncoordinated C/R
  – Create clusters, and log messages exchanged between clusters
  – Message-logging overhead is incurred, but rolling back only one cluster can restart the execution after a failure
(Figure: efficiency vs. scale factor (xF, xL2) for coordinated and uncoordinated C/R — both degrade sharply once the MTBF drops from days to a few hours.)
⇒ MLC + uncoordinated C/R (software-level) approaches may be limited at extreme scale
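A minimal sketch of the message-logging idea behind uncoordinated C/R: only messages that cross a cluster boundary are logged, so a restarted cluster can replay them instead of forcing the other cluster to roll back. The rank-to-cluster mapping and the helpers (deliver, replay_into) are hypothetical stand-ins; real systems log at the MPI layer and also record message ordering and other nondeterministic events:

    cluster_of = {0: "A", 1: "A", 2: "B", 3: "B"}   # process rank -> cluster (illustrative)
    message_log = []                                 # kept outside the failed cluster's state

    def deliver(dst, payload):
        pass                                         # stand-in for the actual transport

    def send(src, dst, payload):
        if cluster_of[src] != cluster_of[dst]:
            # Only messages crossing cluster boundaries are logged, so a restarted
            # cluster can replay them without rolling back the other cluster.
            message_log.append((src, dst, payload))
        deliver(dst, payload)

    def replay_into(cluster):
        """Messages to re-deliver to a cluster that rolled back to its last checkpoint."""
        return [(s, d, p) for (s, d, p) in message_log if cluster_of[d] == cluster]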
Storage designs
• In addition to the software-level approaches, we also explore two architecture-level approaches:
  – Flat buffer system: the current storage system
  – Burst buffer system: a separate buffer space
(Figure: flat buffer system — compute nodes 1-4 each with a dedicated SSD 1-4 above the PFS; burst buffer system — SSDs 1-4 shared by the compute nodes above the PFS.)
Flat Buffer Systems
• Design concept
  – Each compute node has its own dedicated node-local storage
  – Scalable with an increasing number of compute nodes
(Figure: flat buffer system — a cluster of compute nodes 1-4, each with SSD 1-4, above the PFS; SSDs of nodes outside the restarting cluster sit idle.)
• This design has drawbacks:
  1. Unreliable checkpoint storage
     e.g.) If compute node 2 fails, a checkpoint on SSD 2 will be lost because SSD 2 is physically attached to the failed compute node 2
  2. Inefficient utilization of storage resources with uncoordinated checkpointing
     e.g.) If compute nodes 1 & 3 are in the same cluster and restart from a failure, the bandwidth of SSDs 2 & 4 will not be utilized
Burst Buffer Systems
• Design concept
  – A burst buffer is a storage space that bridges the gap in latency and bandwidth between node-local storage and the PFS
  – Shared by a subset of compute nodes
(Figure: burst buffer system — a cluster of compute nodes 1-4 sharing SSDs 1-4 on burst buffer nodes above the PFS; checkpoints remain accessible after a compute-node failure.)
• Although additional nodes are required, there are several advantages:
  1. More reliable, because burst buffers are located on a smaller number of nodes
     e.g.) Even if compute node 2 fails, the checkpoint of compute node 2 is accessible from another compute node, such as node 1
  2. Efficient utilization of storage resources with uncoordinated checkpointing
     e.g.) If compute nodes 1 and 3 are in the same cluster and both restart from a failure, the processes can utilize the full SSD bandwidth, unlike in a flat buffer system
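The utilization argument on these two slides can be made concrete with a back-of-the-envelope sketch; the per-SSD bandwidth and node counts below are made-up illustrative values, not measurements from the paper:

    # Restart bandwidth available to a cluster of 2 (of 4) compute nodes that
    # rolls back under uncoordinated checkpointing (illustrative numbers).
    ssd_bw_gbs = 1.0        # assumed bandwidth of one SSD, GB/s
    n_ssds = 4
    restarting_nodes = 2    # e.g., compute nodes 1 and 3 form the restarting cluster

    flat_bw = restarting_nodes * ssd_bw_gbs   # flat buffer: each node reads only its own SSD
    burst_bw = n_ssds * ssd_bw_gbs            # burst buffer: reads spread over all shared SSDs

    print(f"flat buffer restart bandwidth : {flat_bw:.1f} GB/s")
    print(f"burst buffer restart bandwidth: {burst_bw:.1f} GB/s")

Whether this gain outweighs the extra burst buffer nodes and the failures they add is the question the rest of the talk addresses.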