Lecture 22: I/O Performance and Checkpoints
EN 600.320/420/620
Instructor: Randal Burns
27 March 2019
Department of Computer Science, Johns Hopkins University
The I/O Crisis in HPC
In a world where FLOPS are the commodity, disk I/O often limits performance.
- Any persistent data must make it off the supercomputer
  – To magnetic or solid-state storage
- Storage is not as well connected to the high-speed network as compute
  – Because it needs to be shared with other computers
  – Because it doesn't add to TOP500 benchmarks
Where Does the I/O Come From?
- Checkpointing!
  – And writing output from the simulation (which is also checkpointing)
- Checkpoint workload
  – Every node writes local state to a shared file system
  – Using POSIX calls (with the parallelism provided by the file system) or MPI I/O (see the MPI-IO sketch below)
J. Bent et al. PLFS: A Checkpoint Filesystem for Parallel Applications. SC, 2009.
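As a concrete illustration of the shared-file pattern above, here is a minimal MPI-IO sketch in which every rank writes its local state at a rank-dependent offset of one checkpoint file. The file name, buffer size, and omitted error handling are assumptions for illustration, not details from the lecture or the PLFS paper.

```c
/* Sketch: every rank writes its local state into one shared checkpoint file
 * at a rank-dependent offset, using a collective MPI-IO write.
 * Compile with an MPI compiler, e.g.: mpicc -o ckpt ckpt.c
 * File name and buffer size are illustrative; error handling is omitted. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t local_bytes = 1 << 20;          /* 1 MB of local state (assumed) */
    char *state = malloc(local_bytes);
    memset(state, rank, local_bytes);            /* stand-in for real simulation state */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at offset rank * local_bytes; the collective call lets
     * the MPI-IO layer coordinate and aggregate the requests. */
    MPI_Offset offset = (MPI_Offset)rank * (MPI_Offset)local_bytes;
    MPI_File_write_at_all(fh, offset, state, (int)local_bytes, MPI_BYTE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(state);
    MPI_Finalize();
    return 0;
}
```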
Why Checkpointing?
- At scale, failures are inevitable
  – The MPI synchronous model means that a single failure breaks the run
  – Without a checkpoint, you lose all work since the start (or the last restart)
- Each checkpoint provides a restart point
  – Limits exposure: the loss of work is bounded by the last checkpoint
- By policy, all codes that run at scale on supercomputers MUST checkpoint
  – HPC centers want codes to do useful work
Checkpoint Approaches
- Automatic: store the contents of memory and the program counters
  – Brute force: large data, inefficient
  – But easy, with no development effort
  – Renewed interest in this approach with the emergence of VMs and containers in HPC
- Application-specific: keep only the data structures and metadata that represent current progress, hand coded by the developer (a minimal sketch follows below)
  – Smaller, faster, preferred, but tedious
  – Almost all "good" codes have application-specific checkpoints
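A minimal sketch of the application-specific style: the developer persists only the state needed to resume, here a hypothetical iteration counter plus one solution array, one file per rank. All names, sizes, and the file layout are assumptions, not the lecture's example.

```c
/* Sketch of an application-specific checkpoint: persist only what is needed
 * to resume (iteration number plus the solution array), one file per rank.
 * All names, sizes, and the file layout are illustrative assumptions. */
#include <stdio.h>
#include <stdlib.h>

/* Write the minimal restart state for this rank. Returns 0 on success. */
int write_checkpoint(int rank, long iteration, const double *field, size_t n) {
    char path[64];
    snprintf(path, sizeof path, "ckpt.rank%05d.dat", rank);
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    fwrite(&iteration, sizeof iteration, 1, f);   /* metadata: where to resume */
    fwrite(&n, sizeof n, 1, f);                   /* metadata: array length    */
    fwrite(field, sizeof *field, n, f);           /* the actual solver state   */
    return fclose(f);
}

/* Read the state back on restart. Returns 0 on success. */
int read_checkpoint(int rank, long *iteration, double *field, size_t n) {
    char path[64];
    snprintf(path, sizeof path, "ckpt.rank%05d.dat", rank);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t stored = 0;
    if (fread(iteration, sizeof *iteration, 1, f) != 1 ||
        fread(&stored, sizeof stored, 1, f) != 1 ||
        stored != n ||
        fread(field, sizeof *field, n, f) != n) { fclose(f); return -1; }
    return fclose(f);
}
```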
A Checkpoint Workload
- IOR benchmark
  – Each node transfers 512 MB (a sketch of this pattern follows below)
- Barriers between phases
- How much parallelism?
- What effects?
M. Uselton et al. Parallel I/O Performance: From Events to Ensembles. IPDPS, 2010.
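This is not IOR itself, but a small C/MPI sketch of the experiment's pattern under stated assumptions (4 MB transfers, file per process, per-rank timing): every rank starts at a barrier, writes 512 MB, and the slowest rank determines the checkpoint time examined on the next slides.

```c
/* Sketch of the IOR-style experiment: every rank writes a fixed amount of
 * data between barriers and reports its own elapsed time, so the per-rank
 * I/O rate distribution can be examined. Sizes and file naming are assumed. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TOTAL_BYTES (512L * 1024 * 1024)   /* 512 MB per rank, as in the slide */
#define XFER_BYTES  (4L * 1024 * 1024)     /* 4 MB transfers (assumed)         */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(XFER_BYTES);
    memset(buf, rank & 0xff, XFER_BYTES);

    char path[64];
    snprintf(path, sizeof path, "ior_like.rank%05d", rank);   /* file per process */
    FILE *f = fopen(path, "wb");

    MPI_Barrier(MPI_COMM_WORLD);                 /* all ranks start together */
    double t0 = MPI_Wtime();
    for (long done = 0; done < TOTAL_BYTES; done += XFER_BYTES)
        fwrite(buf, 1, XFER_BYTES, f);
    fclose(f);
    double my_time = MPI_Wtime() - t0;

    double max_time;                             /* slowest rank sets checkpoint time */
    MPI_Reduce(&my_time, &max_time, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("slowest rank: %.2f s, aggregate %.1f MB/s\n",
               max_time, nprocs * 512.0 / max_time);

    free(buf);
    MPI_Finalize();
    return 0;
}
```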
I/O Rates and PDF
- What features do you observe?
  – Lagging processes: the run does not realize peak I/O performance
  – Harmonics in the I/O distribution: unfair resource sharing
M. Uselton et al. Parallel I/O Performance: From Events to Ensembles. IPDPS, 2010.
Statistical Observations
- Order statistics
  – A fancy way of saying that the longest operation dominates overall performance
- Law of large numbers
  – I don't think the authors make this analysis cogent
  – It's right, but the Gaussian distribution is not what matters
- A better, intuitive conclusion (RB's interpretation): smaller transfers are better
  – The worst-case slowdown on a smaller transfer takes less absolute time than on a large transfer
  – As long as transfers are "big enough" to amortize startup costs
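To make the order-statistics point concrete: the checkpoint finishes only when the slowest of the N writers does. The Gaussian form below is a standard textbook approximation included only to show the scaling; as noted above, the essential point is the max itself, not the Gaussian assumption.

\[
  T_{\mathrm{ckpt}} = \max_{1 \le i \le N} t_i ,
  \qquad
  \mathbb{E}[T_{\mathrm{ckpt}}] \approx \mu + \sigma \sqrt{2 \ln N}
  \quad \text{for i.i.d., roughly Gaussian per-rank times } t_i .
\]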
Smaller Files Improve Performance
- Non-intuitive
  – Smaller operations seem like more overhead
  – But this is a property of the statistical analysis
- Smaller is better as long as the fixed costs are amortized
  – Obviously, 1 byte is too small
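A hypothetical worked example (numbers invented for illustration): assume a nominal 100 MB/s per rank and one straggler running at 10 MB/s. The extra wait the straggler imposes scales with the transfer size:

\[
  \Delta T_{512\,\mathrm{MB}} = \frac{512}{10} - \frac{512}{100} \approx 46\ \mathrm{s},
  \qquad
  \Delta T_{64\,\mathrm{MB}} = \frac{64}{10} - \frac{64}{100} \approx 5.8\ \mathrm{s},
\]

so splitting the 512 MB into eight 64 MB transfers bounds the damage of any single slow transfer, and a rank is unlikely to straggle on all eight.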
The Checkpoint Crisis
As HPC codes get larger, I/O becomes more critical.
- Some observations
  – We checkpoint to protect against failure
  – More components increase the probability of failure
  – FLOPS grow faster than I/O bandwidth
- Conclusion
  – We must take slower checkpoints more often
  – Eventually no constructive work gets done between checkpoints
- Mitigation (which only delays the problem)
  – Burst buffers: fast (SSD) storage attached to the high-speed network
  – Observe that checkpoints need to persist only until the next checkpoint, much shorter than the lifetime required for output/analysis data
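One standard way to quantify why "eventually no constructive work gets done" is Young's classic approximation for the optimal checkpoint interval; it is not on the slide and is offered only as a hedged illustration. With checkpoint cost C and mean time between failures M, the compute time between checkpoints that minimizes expected lost work is roughly

\[
  \tau_{\mathrm{opt}} \approx \sqrt{2\,C\,M}.
\]

As machines grow, C rises (more state to write while bandwidth lags FLOPS) and M shrinks (more components fail more often), so \(\tau_{\mathrm{opt}}\) collapses toward C and the machine spends most of its time checkpointing.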