Kento Sato†1, Kathryn Mohror†2, Adam Moody†2, Todd Gamblin†2, Bronis R. de Supinski†2, Naoya Maruyama†3 and Satoshi Matsuoka†1
†1 Tokyo Institute of Technology  †2 Lawrence Livermore National Laboratory  †3 RIKEN Advanced Institute for Computational Science
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-654744-DRAFT
May 27th, 2014, CCGrid2014 @ Chicago
Failures on HPC systems
• Exponential growth in computational power
  – Enables finer-grained simulations with shorter time periods
• The overall failure rate increases accordingly because of the growing system size
• 191 failures out of 5 million node-hours
  – A production application of a laser-plasma interaction code (pF3D)
  – Hera, Atlas and Coastal clusters @ LLNL
Estimated MTBF (assuming no hardware reliability improvement per component in the future):
  1,000 nodes:   1.2 days (measured)
  10,000 nodes:  2.9 hours (estimated)
  100,000 nodes: 17 minutes (estimated)
• It will be difficult for applications to run continuously for a long time without fault tolerance at extreme scale
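As a rough check of the table above, here is a minimal sketch that reproduces the estimated MTBFs, assuming the system MTBF scales inversely with node count and anchoring to the measured 1.2-day MTBF at 1,000 nodes (the scaling assumption is mine, not a statement from the slides):

    # Estimate system MTBF assuming it scales inversely with node count.
    measured_mtbf_hours = 1.2 * 24   # 1.2 days measured at 1,000 nodes
    base_nodes = 1000
    for nodes in (1_000, 10_000, 100_000):
        mtbf_hours = measured_mtbf_hours * base_nodes / nodes
        print(f"{nodes:>7} nodes: MTBF ~ {mtbf_hours:.2f} hours ({mtbf_hours * 60:.0f} minutes)")

This reproduces roughly 2.9 hours at 10,000 nodes and about 17 minutes at 100,000 nodes.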
Checkpoint/Restart (Software-Level)
• Idea of Checkpoint/Restart
  – Checkpoint: periodically save snapshots of an application state to the PFS
  – Restart: on a failure, restart the execution from the latest checkpoint
(Figure: execution with periodic checkpoints written to the parallel file system (PFS); each checkpoint adds checkpointing overhead, and after a failure the run resumes from the latest checkpoint.)
• Improved Checkpoint/Restart
  – Multi-level checkpointing [1]
  – Asynchronous checkpointing [2]
  – In-memory diskless checkpointing [3]
• We found that software-level approaches may be limited in increasing resiliency at extreme scale

[1] A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System", SC10
[2] Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama and Satoshi Matsuoka, "Design and Modeling of a Non-blocking Checkpointing System", SC12
[3] Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama and Satoshi Matsuoka, "FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery", IPDPS 2014
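The checkpoint/restart idea above can be summarized in a short sketch. This is a minimal single-process illustration with hypothetical helpers (compute_step stands in for the application's real work, and the checkpoint directory is a made-up stand-in for a PFS-backed path), not the systems described in the references:

    import glob
    import os
    import pickle

    CKPT_DIR = "pfs_ckpt"    # stand-in for a PFS-backed checkpoint directory
    INTERVAL = 100           # steps of work between checkpoints
    TOTAL_STEPS = 1000
    os.makedirs(CKPT_DIR, exist_ok=True)

    def compute_step(state, step):
        return state + step  # stand-in for the application's real work

    def write_checkpoint(state, step):
        with open(os.path.join(CKPT_DIR, f"ckpt_{step:08d}.pkl"), "wb") as f:
            pickle.dump((step, state), f)

    def load_latest_checkpoint():
        files = sorted(glob.glob(os.path.join(CKPT_DIR, "ckpt_*.pkl")))
        if not files:
            return 0, 0      # no checkpoint yet: start from the beginning
        with open(files[-1], "rb") as f:
            return pickle.load(f)

    step, state = load_latest_checkpoint()   # on restart, resume from the latest snapshot
    while step < TOTAL_STEPS:
        state = compute_step(state, step)
        step += 1
        if step % INTERVAL == 0:
            write_checkpoint(state, step)    # checkpointing overhead is paid at every interval

The cost of write_checkpoint relative to the failure rate is exactly the trade-off the later modeling slides quantify.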
Storage architectures
We consider architecture-level approaches
• Burst buffer
  – A new tier in the storage hierarchy
  – Absorbs bursty I/O requests from applications
  – Fills the performance gap between node-local storage and PFSs in both latency and bandwidth
(Figure: compute nodes write to burst buffers, which stage data to the parallel file system.)
• If you write checkpoints to burst buffers,
  – Faster checkpoint/restart time than the PFS
  – More reliable than storing on compute nodes
• However, ...
  – Adding burst buffer nodes may increase the total system size, and the failure rate accordingly
• It is not clear whether burst buffers improve overall system efficiency
  – Because burst buffers also connect to the network, they may still be a bottleneck

[4] Doraimani, Shyamala and Iamnitchi, Adriana, "File Grouping for Scientific Data Management: Lessons from Experimenting with Real Traces", HPDC '08
Goal and Contributions
• Goal:
  – Develop an interface to exploit the bandwidth of burst buffers
  – Explore the effectiveness of burst buffers
  – Find the best C/R strategy on burst buffers
• Contributions:
  – Development of IBIO, which exploits the bandwidth to burst buffers
  – A model to evaluate system resiliency given a C/R strategy and a storage configuration
  – Our experimental results show a direction for building resilient systems for extreme-scale computing
Outline
• Introduction
• Checkpoint strategies
• Storage designs
• IBIO: InfiniBand-based I/O interface
• Modeling
• Experiments
• Conclusion
Diskless checkpoint/restart (C/R)
• Diskless C/R
  – Creates redundant data across the local storage on compute nodes using an encoding technique such as XOR
  – Can restore checkpoints lost in a failure of a small number of nodes, like RAID-5
(Figure: XOR encoding example — Nodes 1-4 each hold checkpoint segments of the other nodes plus one parity segment, so a single node's lost checkpoint can be rebuilt.)
• Most failures come from a single node, or can be recovered from an XOR checkpoint
  – e.g. 1) TSUBAME2.0: 92% of failures
  – e.g. 2) LLNL clusters: 85% of failures
• The rest of the failures still require a checkpoint on a reliable PFS
(Figure: failure analysis on TSUBAME2.0 and the LLNL clusters — 92% and 85% of failures, respectively, are covered by LOCAL/XOR/PARTNER checkpoints; the remaining 8% and 15% need a PFS checkpoint.)
• Diskless checkpointing is a promising approach
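To make the XOR encoding concrete, here is a minimal RAID-5-style sketch of parity over per-node checkpoint blocks, with one block per node and made-up contents; real diskless checkpointing splits checkpoints into chunks and rotates the parity across nodes, as the figure suggests:

    def xor_blocks(blocks):
        """XOR equal-sized byte blocks together (used for both encoding and recovery)."""
        parity = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                parity[i] ^= byte
        return bytes(parity)

    # Checkpoints held in the node-local storage of four compute nodes (illustrative data).
    ckpts = [bytes([n]) * 8 for n in range(1, 5)]
    parity = xor_blocks(ckpts)            # parity block stored on a different node

    # Compute node 2 fails and its checkpoint is lost; XOR-ing the surviving
    # checkpoints with the parity block reconstructs it.
    lost = 1
    recovered = xor_blocks([c for i, c in enumerate(ckpts) if i != lost] + [parity])
    assert recovered == ckpts[lost]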
Multi-level Checkpoint/Restart (MLC/R) [1,2]
(Figure: MLC and its model — each interval of length t + c^k ends either with no failure or with a level-i failure followed by a level-k recovery of cost r^k, weighted by the probabilities and expected durations below.)

  p_0(T) = e^{-\lambda T}                                   : probability of no failure during T seconds
  t_0(T) = T                                                : expected run time given no failure
  p_i(T) = (\lambda_i / \lambda)(1 - e^{-\lambda T})        : probability of a level-i failure during T seconds
  t_i(T) = \frac{1 - (\lambda T + 1) e^{-\lambda T}}{\lambda (1 - e^{-\lambda T})} : expected run time until the failure, given one occurs

  t         : checkpoint interval
  c^k       : level-k checkpoint time
  r^k       : level-k recovery time
  \lambda_i : level-i failure rate, with \lambda = \sum_i \lambda_i

• MLC hierarchically uses the storage levels
  – Diskless (level-1) checkpoint: frequent, for single-node or few-node failures
  – PFS (level-2) checkpoint: less frequent and asynchronous, for multi-node failures
• Our evaluation showed that system efficiency drops to less than 10% when the MTBF is a few hours
(Figure: efficiency vs. scale factor (xF, xL2) — efficiency stays high while the MTBF is a day or more, but collapses once the MTBF falls to a few hours.)

[1] A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System", SC10
[2] Kento Sato, Adam Moody, Kathryn Mohror, Todd Gamblin, Bronis R. de Supinski, Naoya Maruyama and Satoshi Matsuoka, "Design and Modeling of a Non-blocking Checkpointing System", SC12
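As a rough illustration of why efficiency collapses when the MTBF approaches the checkpoint interval, here is a minimal single-level sketch built from the expectations above; it assumes exponential failures, a fixed restart cost, and (as a simplification) no failures during recovery, and the parameter values are made up rather than taken from the paper:

    import math

    def expected_segment_time(t, c, r, lam):
        """t: compute interval, c: checkpoint cost, r: restart cost, lam: failure rate (all in seconds)."""
        T = t + c
        p0 = math.exp(-lam * T)                   # probability of no failure during the segment
        if p0 == 1.0:
            return T
        # expected time until the failure, given one occurs during the segment
        t_fail = (1 - (lam * T + 1) * math.exp(-lam * T)) / (lam * (1 - math.exp(-lam * T)))
        # E = p0*T + (1 - p0)*(t_fail + r + E)  =>  solve for E
        return (p0 * T + (1 - p0) * (t_fail + r)) / p0

    # Illustrative numbers: 1-hour compute interval, 5-minute checkpoint and restart costs.
    t, c, r = 3600.0, 300.0, 300.0
    for mtbf_hours in (24.0, 6.0, 1.0):
        lam = 1.0 / (mtbf_hours * 3600.0)
        eff = t / expected_segment_time(t, c, r, lam)
        print(f"MTBF {mtbf_hours:4.1f} h -> efficiency {eff:.2f}")

Under these made-up parameters the modeled efficiency falls from roughly 0.90 at a 24-hour MTBF to below 0.5 at a 1-hour MTBF, the same qualitative trend as the plot on this slide.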
Uncoordinated C/R + MLC
(Figure: coordinated C/R — processes P0-P3 all checkpoint together and all roll back on a failure; uncoordinated C/R — processes are grouped into clusters A and B, messages crossing cluster boundaries are logged, and only the failed cluster rolls back.)
• Coordinated C/R
  – All processes globally synchronize before taking checkpoints, and all restart on a failure
  – Restart overhead
• Uncoordinated C/R
  – Create clusters, and log messages exchanged between clusters
  – Message-logging overhead is incurred, but rolling back only one cluster can restart the execution after a failure
(Figure: efficiency vs. scale factor (xF, xL2) for coordinated and uncoordinated C/R — both degrade sharply once the MTBF drops from days to a few hours.)
⇒ MLC + uncoordinated C/R (software-level) approaches may be limited at extreme scale
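A minimal sketch of the message-logging idea behind uncoordinated C/R: only messages that cross a cluster boundary are logged, so a restarted cluster can replay them instead of forcing the other cluster to roll back. The rank-to-cluster mapping and the helpers (deliver, replay_into) are hypothetical stand-ins; real systems log at the MPI layer and also record message ordering and other nondeterministic events:

    cluster_of = {0: "A", 1: "A", 2: "B", 3: "B"}   # process rank -> cluster (illustrative)
    message_log = []                                 # kept outside the failed cluster's state

    def deliver(dst, payload):
        pass                                         # stand-in for the actual transport

    def send(src, dst, payload):
        if cluster_of[src] != cluster_of[dst]:
            # Only messages crossing cluster boundaries are logged, so a restarted
            # cluster can replay them without rolling back the other cluster.
            message_log.append((src, dst, payload))
        deliver(dst, payload)

    def replay_into(cluster):
        """Messages to re-deliver to a cluster that rolled back to its last checkpoint."""
        return [(s, d, p) for (s, d, p) in message_log if cluster_of[d] == cluster]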
Storage designs
• In addition to the software-level approaches, we also explore two architecture-level approaches:
  – Flat buffer system: the current storage system
  – Burst buffer system: a separate buffer space
(Figure: flat buffer system — compute nodes 1-4 each with a dedicated SSD 1-4 above the PFS; burst buffer system — SSDs 1-4 shared by the compute nodes above the PFS.)
Flat Buffer Systems
• Design concept
  – Each compute node has its own dedicated node-local storage
  – Scalable with an increasing number of compute nodes
(Figure: flat buffer system — a cluster of compute nodes 1-4, each with SSD 1-4, above the PFS; SSDs of nodes outside the restarting cluster sit idle.)
• This design has drawbacks:
  1. Unreliable checkpoint storage
     e.g.) If compute node 2 fails, a checkpoint on SSD 2 will be lost because SSD 2 is physically attached to the failed compute node 2
  2. Inefficient utilization of storage resources with uncoordinated checkpointing
     e.g.) If compute nodes 1 & 3 are in the same cluster and restart from a failure, the bandwidth of SSDs 2 & 4 will not be utilized
Burst Buffer Systems
• Design concept
  – A burst buffer is a storage space that bridges the gap in latency and bandwidth between node-local storage and the PFS
  – Shared by a subset of compute nodes
(Figure: burst buffer system — a cluster of compute nodes 1-4 sharing SSDs 1-4 on burst buffer nodes above the PFS; checkpoints remain accessible after a compute-node failure.)
• Although additional nodes are required, there are several advantages:
  1. More reliable, because burst buffers are located on a smaller number of nodes
     e.g.) Even if compute node 2 fails, the checkpoint of compute node 2 is accessible from another compute node, such as node 1
  2. Efficient utilization of storage resources with uncoordinated checkpointing
     e.g.) If compute nodes 1 and 3 are in the same cluster and both restart from a failure, the processes can utilize the full SSD bandwidth, unlike in a flat buffer system
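The utilization argument on these two slides can be made concrete with a back-of-the-envelope sketch; the per-SSD bandwidth and node counts below are made-up illustrative values, not measurements from the paper:

    # Restart bandwidth available to a cluster of 2 (of 4) compute nodes that
    # rolls back under uncoordinated checkpointing (illustrative numbers).
    ssd_bw_gbs = 1.0        # assumed bandwidth of one SSD, GB/s
    n_ssds = 4
    restarting_nodes = 2    # e.g., compute nodes 1 and 3 form the restarting cluster

    flat_bw = restarting_nodes * ssd_bw_gbs   # flat buffer: each node reads only its own SSD
    burst_bw = n_ssds * ssd_bw_gbs            # burst buffer: reads spread over all shared SSDs

    print(f"flat buffer restart bandwidth : {flat_bw:.1f} GB/s")
    print(f"burst buffer restart bandwidth: {burst_bw:.1f} GB/s")

Whether this gain outweighs the extra burst buffer nodes and the failures they add is the question the rest of the talk addresses.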