Towards Scalable Application Checkpointing with Parallel File System Delegation


  1. Towards Scalable Application Checkpointing with Parallel File System Delegation
     Dulcardo Arteaga (darte003@fiu.edu), Ming Zhao (ming@cs.fiu.edu)
     School of Computing and Information Sciences, Florida International University, Miami, FL

  2. High Performance Computing Systems Background Checkpointing Modes Approach Experimental Evaluation Conclusions 1 / 25

  3. Introduction
     - Scalability
       • Large-scale applications run on HPC systems
       • One important challenge is fault tolerance
       • A common approach is checkpointing
     - Checkpointing
       • Store a snapshot of the current application state
       • Applications recover from a valid snapshot in case of failure
     - HPC systems use parallel file systems for checkpointing

  4. Parallel File Systems (PFSes)
     - Components:
       • Metadata servers: store metadata about files
       • Data servers: store the actual file data
       • Clients: run on compute nodes and provide the interface to the storage system

  5. Problem Statement
     - Problem: large-scale checkpointing causes a serious bottleneck at the metadata servers of HPC systems
     - Approach: delegate the management of the PFS storage space used for checkpointing to applications, reducing metadata overhead

  6. Outline
     1 Introduction
     2 Checkpointing Modes
     3 Approach
     4 Experimental Evaluation
     5 Conclusion

  7. Checkpointing Modes: File-per-Process
     - File-per-Process (N-N)
       • Every process writes to a different file
     - Metadata management overhead
       • Implies the creation of many files
       • One metadata operation per file and per process
     [Figure (N-N): nodes N1-N3, processes P1-P6]

  8. Checkpointing Modes: Shared-File
     - Shared-File (N-1) segmented
       • Processes write sequentially within a region of the shared file
     - Shared-File (N-1) strided
       • Processes write to different parts of the shared file
     - Metadata management overhead
       • Every process requests the same metadata every time
       • File locking
     [Figures (N-1 Segmented) and (N-1 Strided): nodes N1-N3, processes P1-P6]
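The difference between the two shared-file patterns comes down to how a process computes its write offset. The sketch below assumes fixed-size blocks and the common reading of the two terms (segmented: one contiguous region per process; strided: blocks of all processes interleaved). The function names and parameters are illustrative and do not come from the presentation.

```c
#include <assert.h>
#include <stdint.h>

/* N-1 segmented (assumed layout): each process owns one contiguous region
 * of the shared file and places its blocks back to back inside it. */
uint64_t segmented_offset(uint64_t rank, uint64_t block_idx,
                          uint64_t blocks_per_proc, uint64_t block_size)
{
    assert(block_idx < blocks_per_proc);
    return (rank * blocks_per_proc + block_idx) * block_size;
}

/* N-1 strided (assumed layout): the k-th blocks of all processes are
 * stored next to each other, so one process's blocks are nprocs apart. */
uint64_t strided_offset(uint64_t rank, uint64_t block_idx,
                        uint64_t nprocs, uint64_t block_size)
{
    assert(rank < nprocs);
    return (block_idx * nprocs + rank) * block_size;
}
```

With 6 processes and 1024-byte blocks, rank 2 writes its second block at byte 8192 in the strided layout; in the segmented layout with 4 blocks per process, the same block lands at byte 9216.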

  9. Approach - PFS-delegation
     1 Create the reserved space (only one time)
     2 Receive the metadata of the reserved space (only one time)
     3 Perform I/O directly to the data servers
     Reading and writing checkpoints requires only step 3 once the reserved space has been created.
     [Figure: compute nodes (Application / PFS-D / MPI-IO, metadata table), metadata servers, and data servers holding the RESERVED SPACE, divided into Proc. 1's ... Proc. n's checkpoint space; arrow "3. Read/write of checkpoints"]

  15. Approach - PFS-delegation
     - The application uses the PFS-delegation interfaces
     - PFS-delegation uses the MPI-IO API to communicate with the servers
     [Figure: compute nodes (Application / PFS-D / MPI-IO, metadata table), metadata servers, and data servers holding the RESERVED SPACE, divided into Proc. 1's ... Proc. n's checkpoint space]

  17. Reserving Delegated Storage Space
     The reservation is made by creating one large logical file across the PFS data servers.
     - Different techniques avoid the initial overhead at reservation time:
       • Create a sparse file by writing the last byte of the corresponding datafile (PVFS2)
       • Use fallocate (GPFS)
     - This process is executed only once
     - The size of the reserved space should take into account:
       • The size of a single checkpoint
       • The number of checkpoints
       • The storage policy
     [Figure: PFS-delegation architecture diagram]
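The "write the last byte" technique can be sketched with plain POSIX calls on a local file. This is only an illustration of the idea: the slides apply it to PVFS2 datafiles, and whether the resulting file is actually sparse depends on the underlying file system. The sizing rule (every process keeps a fixed number of equally sized checkpoints) is an assumed storage policy, and both function names are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed sizing rule: each of nprocs processes keeps `kept` checkpoints
 * of ckpt_size bytes each. */
uint64_t reserved_size(uint64_t nprocs, uint64_t ckpt_size, uint64_t kept)
{
    return nprocs * ckpt_size * kept;
}

/* Create `path` if needed and extend it to `size` bytes by writing only
 * its last byte. Returns 0 on success, -1 on failure. */
int reserve_sparse(const char *path, uint64_t size)
{
    assert(size > 0);
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    char zero = 0;
    int rc = 0;
    /* Seek to the final byte and write it; earlier bytes are never written. */
    if (lseek(fd, (off_t)(size - 1), SEEK_SET) == (off_t)-1 ||
        write(fd, &zero, 1) != 1)
        rc = -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A caller would run `reserve_sparse(path, reserved_size(nprocs, ckpt_size, kept))` once, before the first checkpoint, matching the "executed only once" point above.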

  18. Data Layout
     The layout is specified as a regular file layout using MPI-IO.
     - PFS-delegation uses the following hints for layout definition:
       • striping_factor: number of data servers involved
       • striping_unit: stripe size
     - The PVFS2 implementation uses simple striping with round-robin distribution

         MPI_Info info;
         MPI_Info_create(&info);  /* the info object must exist before hints are set */
         MPI_Info_set(info, "striping_factor", "4");    /* stripe across 4 data servers */
         MPI_Info_set(info, "striping_unit", "65536");  /* stripe size in bytes */
     [Figure: PFS-delegation architecture diagram]
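To make the round-robin striping concrete, the sketch below maps a logical offset in the reserved file to a data server and an offset within that server's datafile. It assumes the first stripe unit lands on server 0; a real PVFS2 deployment may choose a different starting server. The `stripe_loc` type and `locate` function are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t server;        /* index of the data server holding the byte */
    uint64_t local_offset;  /* offset within that server's datafile */
} stripe_loc;

/* Simple striping with round-robin distribution (assumed to start at server 0). */
stripe_loc locate(uint64_t offset, uint64_t striping_factor, uint64_t striping_unit)
{
    assert(striping_factor > 0 && striping_unit > 0);
    uint64_t unit_idx = offset / striping_unit;  /* global stripe-unit index */
    stripe_loc loc;
    loc.server = unit_idx % striping_factor;
    /* Each full round over all servers adds one stripe unit to every datafile. */
    loc.local_offset = (unit_idx / striping_factor) * striping_unit
                       + offset % striping_unit;
    return loc;
}
```

With the hint values from the slide (4 servers, 65536-byte units), logical byte 5 * 65536 + 100 falls on server 1 at local offset 65636.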

  19. Reserved-Space Distribution
     Metadata table fields: offset_start, offset_end, offset_next, revision
     - offset_start and offset_end:
       • Specify the limits of the client's assigned region
     - offset_next:
       • Specifies the next valid offset at which to write a checkpoint
     - revision:
       • Checkpoint counter
     [Figure: PFS-delegation architecture diagram]
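A per-process metadata entry with these four fields might look like the sketch below. Several details are assumptions rather than statements from the slides: regions are equally sized and assigned by rank, `offset_end` is treated as exclusive, and when a checkpoint no longer fits the entry wraps back to the start of its region. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t offset_start;  /* first byte of the process's region */
    uint64_t offset_end;    /* one past the last byte (assumed exclusive) */
    uint64_t offset_next;   /* where the next checkpoint will be written */
    uint64_t revision;      /* number of checkpoints written so far */
} ckpt_meta;

/* Assign equally sized regions by rank (assumed distribution). */
ckpt_meta meta_init(uint64_t rank, uint64_t region_size)
{
    ckpt_meta m;
    m.offset_start = rank * region_size;
    m.offset_end = m.offset_start + region_size;
    m.offset_next = m.offset_start;
    m.revision = 0;
    return m;
}

/* Claim space for one checkpoint and return the offset to write at.
 * Purely local bookkeeping: no metadata server is contacted. */
uint64_t meta_claim(ckpt_meta *m, uint64_t ckpt_size)
{
    assert(ckpt_size > 0 && ckpt_size <= m->offset_end - m->offset_start);
    if (m->offset_next + ckpt_size > m->offset_end)
        m->offset_next = m->offset_start;  /* assumed policy: reuse the region from its start */
    uint64_t where = m->offset_next;
    m->offset_next += ckpt_size;
    m->revision += 1;
    return where;
}
```

Because every field is updated locally, writing a new checkpoint needs only the direct data-server I/O of step 3 in the approach.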
