

  1. Design of Locality-aware MPI-IO for Scalable Shared File Write Performance (HPS 2020)
     Kohei Sugihara (1), Osamu Tatebe (2)
     (1) Department of Computer Science, University of Tsukuba
     (2) Center for Computational Sciences, University of Tsukuba

  2. Background
     [Figures: File Per Process vs. Single Shared File]
     ● Single Shared File (SSF)
       ○ Multiple processes access a single shared file (⇔ File Per Process; FPP)
       ○ A typical I/O access pattern in HPC applications [6]
       ○ Used to reduce the number of result files in a large-scale job
     ● Node-local Storage
       ○ Installed on the compute nodes of recent supercomputers [7, 8, 9]
       ○ Used as a temporary read cache for File Per Process (FPP)
       ○ Can minimize communication cost
       ○ Its usage for Single Shared File access is not obvious

  3. Problem
     [Figure: lock contention under file striping]
     ● Single Shared File access is slow
       ○ Reason: lock contention
       ○ The block or stripe size of the file system does not match the access size of the application (a small worked sketch follows this slide)
     ● Node-local storage cannot be used in a locality-aware way
       ○ Reason: most existing file systems employ file striping
       ○ File striping consumes network bandwidth on the compute-node side
     ● Our goal: achieve scalable write bandwidth for shared file access using node-local storage, without file striping
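To make the size mismatch concrete, here is a minimal C sketch, not taken from the talk: it maps each rank's N-1 strided write offset to a stripe index. The stripe and access sizes are assumed values for illustration; the point is that when the per-rank access is smaller than the stripe, several ranks land on the same stripe and serialize on its lock.

```c
/* Illustrative only (assumed sizes, not figures from the talk): why small
 * N-1 strided writes from many ranks contend on the same stripe lock. */
#include <stdio.h>

int main(void) {
    const long stripe_size = 1L << 20;    /* assume a 1 MiB stripe size      */
    const long access_size = 256L << 10;  /* assume 256 KiB written per rank */

    for (int rank = 0; rank < 8; rank++) {
        long offset = (long)rank * access_size;   /* N-1 strided file offset */
        printf("rank %d writes [%ld, %ld) -> stripe %ld\n",
               rank, offset, offset + access_size, offset / stripe_size);
    }
    /* Ranks 0-3 all hit stripe 0 and ranks 4-7 all hit stripe 1, so their
     * byte-range locks collide even though the data itself never overlaps. */
    return 0;
}
```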

  4. Approach
     ● Avoid lock contention in Single Shared File access
       ○ Most HPC applications do not access overlapping regions in an SSF
       ○ Locking among parallel I/O requests is therefore not essential (see the non-overlapping write sketch after this slide)
       ○ Approach: we propose a lockless format for SSF representation
     ● Utilize locality
       ○ Keep the locking of I/O requests within the compute node
       ○ Place files mostly on node-local storage and minimize remote communication for file access
       ○ Approach: use the Gfarm file system for locality-oriented file placement
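As a concrete picture of what non-overlapping SSF access means, the following hedged C/MPI sketch (not the authors' code; file name and block size are made up) has each rank write a disjoint byte range of one shared file with MPI_File_write_at. No two requests touch the same bytes, which is exactly the property that makes per-request locking non-essential.

```c
/* Sketch under assumptions (illustrative file name and block size): each rank
 * writes a disjoint region of one shared file, so requests never overlap. */
#include <mpi.h>
#include <string.h>

#define BLOCK (1 << 20)   /* 1 MiB written by each rank (assumed size) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static char buf[BLOCK];
    memset(buf, 'A' + (rank % 26), BLOCK);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Disjoint byte range per rank: [rank*BLOCK, (rank+1)*BLOCK). */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK;
    MPI_File_write_at(fh, offset, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```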

  5. Proposal: Sparse Segments
     ● Sparse Segments: an internal file format for a single shared file
       ○ Each process creates its corresponding segment, which is expected to be stored on node-local storage (a sketch of one possible layout follows this slide)
     [Figure: per-process files P0 File and P1 File containing holes, shown for the patterns (a) N-1 Segmented w/o Resize, (b) N-1 Strided w/ Resize, (c) N-1 Strided w/ Resize]
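The slides do not spell out the on-disk layout, but one plausible reading of the figure's holes is that each process keeps its data at the original shared-file offsets inside its own segment file, leaving sparse holes where other processes' data would be. The POSIX sketch below illustrates that idea only under those assumptions; the file names, sizes, and mapping are hypothetical, not the paper's actual format.

```c
/* Hypothetical sketch of a sparse per-rank segment: rank r writes its pieces
 * at their global shared-file offsets into "X.<r>", so regions belonging to
 * other ranks remain unwritten holes (the file stays sparse).
 * Names, sizes, and layout are assumptions, not the paper's format. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const int   rank   = 0;      /* pretend we are process P0             */
    const int   nprocs = 2;      /* pretend the job has two processes     */
    const off_t piece  = 4096;   /* assumed size of each contiguous piece */

    char path[64];
    snprintf(path, sizeof(path), "X.%d", rank);       /* segment file X.0 */
    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    memset(buf, 'A' + rank, sizeof(buf));

    /* Strided N-1 pattern: this rank owns every nprocs-th piece. Writing at
     * the global offset leaves holes where the other rank's pieces would go. */
    for (int i = 0; i < 3; i++) {
        off_t global_off = (off_t)(i * nprocs + rank) * piece;
        pwrite(fd, buf, sizeof(buf), global_off);
    }

    close(fd);
    return 0;
}
```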

  6. Implementation: Locality-aware MPI-IO
     ● Locality-aware MPI-IO: an MPI-IO optimization for shared file access
       ○ Implicitly converts an SSF into Sparse Segments
       ○ Generates the Sparse Segments in MPI_File_open() (see the sketch after this slide)
     [Figure: (a) Conventional MPI-IO: P0 and P1 both call MPI_File_open(X) and open the same file X; (b) Locality-aware MPI-IO (ours): MPI_File_open(X) opens the per-process segments X.0 and X.1]
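A minimal sketch of the idea in the figure, assuming the translation simply appends the rank to the requested name; the real logic inside MPI_File_open() is not shown in the slides. A hypothetical wrapper turns an open of X into a per-process open of X.<rank>. The wrapper name and the use of MPI_COMM_SELF for the segment are illustrative assumptions, not the authors' implementation.

```c
/* Sketch (not the authors' implementation): opening shared file "X" is
 * translated into opening a per-process segment "X.<rank>", mirroring the
 * figure's MPI_File_open(X) -> open(X.0), open(X.1). */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical wrapper: each rank opens its own segment of the logical file. */
int locality_aware_file_open(MPI_Comm comm, const char *name,
                             int amode, MPI_Info info, MPI_File *fh) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    char segment[256];
    snprintf(segment, sizeof(segment), "%s.%d", name, rank);  /* e.g. X.0, X.1 */

    /* The segment is private to this rank, so it can be opened on COMM_SELF
     * and, under Gfarm, placed on the node-local storage. */
    return MPI_File_open(MPI_COMM_SELF, segment, amode, info, fh);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_File fh;
    locality_aware_file_open(MPI_COMM_WORLD, "X",
                             MPI_MODE_CREATE | MPI_MODE_WRONLY,
                             MPI_INFO_NULL, &fh);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```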

  7. Locality-oriented File Placement
     ● Store each Sparse Segment in Gfarm [25]
       ○ Gfarm is a locality-oriented parallel file system
       ○ Gfarm automatically stores a full copy of each file on the storage nearest to the process
       ○ Each Sparse Segment is stored on the corresponding node's local storage
     [Figure: on Node #0 and Node #1, MPI_File_open(X) opens X.0 and X.1, which the Gfarm filesystem places on each node's local storage]

  8. Experiment
     ● Method
       ○ Issue write accesses against a single shared file (weak scaling)
     ● Applications
       ○ Microbenchmark: IOR
       ○ Application benchmarks: S3D-IO, LES-IO, VPIC-IO
     ● Environment
       ○ System: TSUBAME 3.0 supercomputer [8] at Tokyo Tech
         ■ Proposal: node-local storage on each compute node
         ■ Lustre: file system nodes, 68 OSTs (peak 50 GB/s; not an apples-to-apples comparison)
         ■ BeeOND: node-local storage on each compute node

  9. Experiment
     ● The proposed method scales in all benchmarks (aggregate bandwidth)
       ○ Lustre bandwidth saturates when the number of processes exceeds the number of OSTs
       ○ BeeOND is not scalable even though it uses the same node-local storage
       ○ The proposal is scalable
     [Figure: aggregate write bandwidth for IOR (non-collective), S3D-IO, LES-IO, and VPIC-IO]

  10. Discussion
     ● Lustre and BeeOND show low bandwidth in the application benchmarks
       ○ Accesses to the SSF in small pieces cause heavy lock contention among processes
     ● Our method demonstrated linearly scalable bandwidth
       ○ Conversion to Sparse Segments
         ■ All application benchmarks scale
         ■ Result: successfully avoids lock contention
       ○ Locality-aware file placement
         ■ All benchmarks scale linearly, even beyond the number of OST nodes
         ■ Result: successfully and effectively scales using node-local storage

  11. Conclusion
     ● We proposed Sparse Segments and Locality-aware MPI-IO
     ● Our method demonstrates scalable parallel write
       ○ In both the microbenchmark and the application benchmarks

  12. References
     [6] P. Carns, K. Harms, W. Allcock, C. Bacon, S. Lang, R. Latham, and R. Ross, "Understanding and Improving Computational Science Storage Access Through Continuous Characterization," ACM Trans. Storage, vol. 7, no. 3, pp. 8:1–8:26, Oct. 2011.
     [7] Summit. [Online]. Available: https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/
     [8] TSUBAME 3.0. [Online]. Available: https://www.t3.gsic.titech.ac.jp/en/hardware
     [9] ABCI. [Online]. Available: https://abci.ai/en/about_abci/computing_resource.html
     [25] O. Tatebe, K. Hiraga, and N. Soda, "Gfarm grid file system," New Generation Computing, vol. 28, no. 3, pp. 257–275, Jul. 2010.

  13. Contact Information
     ● sugihara@hpcs.cs.tsukuba.ac.jp
     ● tatebe@cs.tsukuba.ac.jp
