

  1. Design of Locality-aware MPI-IO for Scalable Shared File Write Performance (HPS 2020)
     Kohei Sugihara (1), Osamu Tatebe (2)
     (1) Department of Computer Science, University of Tsukuba
     (2) Center for Computational Sciences, University of Tsukuba

  2. Background
     [Figures: File Per Process vs. Single Shared File]
     ● Single Shared File (SSF)
       ○ Multiple processes access a single shared file (⇔ File Per Process; FPP)
       ○ A typical I/O access pattern in HPC applications [6]
       ○ Used to reduce the number of result files in a large-scale job
     ● Node-local Storage
       ○ Installed on the compute nodes of recent supercomputers [7, 8, 9]
       ○ Used as a temporary read cache for File Per Process (FPP)
       ○ Can minimize communication cost
       ○ Its usage for Single Shared File access is not obvious

  3. Problem
     [Figure: lock contention under file striping]
     ● Single Shared File access is slow
       ○ Reason: lock contention
       ○ The block or stripe size of the file system does not match the access size of the application (a small worked sketch follows this slide)
     ● Node-local storage cannot be used in a locality-aware way
       ○ Reason: most existing file systems employ file striping
       ○ File striping consumes network bandwidth on the compute-node side
     ● Our goal: achieve scalable write bandwidth for shared file access using node-local storage, without file striping
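To make the size mismatch concrete, here is a minimal C sketch, not taken from the talk: it maps each rank's N-1 strided write offset to a stripe index. The stripe and access sizes are assumed values for illustration; the point is that when the per-rank access is smaller than the stripe, several ranks land on the same stripe and serialize on its lock.

```c
/* Illustrative only (assumed sizes, not figures from the talk): why small
 * N-1 strided writes from many ranks contend on the same stripe lock. */
#include <stdio.h>

int main(void) {
    const long stripe_size = 1L << 20;    /* assume a 1 MiB stripe size      */
    const long access_size = 256L << 10;  /* assume 256 KiB written per rank */

    for (int rank = 0; rank < 8; rank++) {
        long offset = (long)rank * access_size;   /* N-1 strided file offset */
        printf("rank %d writes [%ld, %ld) -> stripe %ld\n",
               rank, offset, offset + access_size, offset / stripe_size);
    }
    /* Ranks 0-3 all hit stripe 0 and ranks 4-7 all hit stripe 1, so their
     * byte-range locks collide even though the data itself never overlaps. */
    return 0;
}
```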

  4. Approach
     ● Avoid lock contention in Single Shared File access
       ○ Most HPC applications do not access overlapping regions in an SSF
       ○ Locking among parallel I/O requests is therefore not essential (see the non-overlapping write sketch after this slide)
       ○ Approach: we propose a lockless format for SSF representation
     ● Utilize locality
       ○ Keep the locking of I/O requests within the compute node
       ○ Place files mostly on node-local storage and minimize remote communication for file access
       ○ Approach: use the Gfarm file system for locality-oriented file placement
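As a concrete picture of what non-overlapping SSF access means, the following hedged C/MPI sketch (not the authors' code; file name and block size are made up) has each rank write a disjoint byte range of one shared file with MPI_File_write_at. No two requests touch the same bytes, which is exactly the property that makes per-request locking non-essential.

```c
/* Sketch under assumptions (illustrative file name and block size): each rank
 * writes a disjoint region of one shared file, so requests never overlap. */
#include <mpi.h>
#include <string.h>

#define BLOCK (1 << 20)   /* 1 MiB written by each rank (assumed size) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static char buf[BLOCK];
    memset(buf, 'A' + (rank % 26), BLOCK);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Disjoint byte range per rank: [rank*BLOCK, (rank+1)*BLOCK). */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK;
    MPI_File_write_at(fh, offset, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```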

  5. Proposal: Sparse Segments
     ● Sparse Segments: an internal file format for a single shared file
       ○ Each process creates its corresponding segment, which is expected to be stored on node-local storage (a sketch of one possible layout follows this slide)
     [Figure: per-process files P0 File and P1 File containing holes, shown for the patterns (a) N-1 Segmented w/o Resize, (b) N-1 Strided w/ Resize, (c) N-1 Strided w/ Resize]
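The slides do not spell out the on-disk layout, but one plausible reading of the figure's holes is that each process keeps its data at the original shared-file offsets inside its own segment file, leaving sparse holes where other processes' data would be. The POSIX sketch below illustrates that idea only under those assumptions; the file names, sizes, and mapping are hypothetical, not the paper's actual format.

```c
/* Hypothetical sketch of a sparse per-rank segment: rank r writes its pieces
 * at their global shared-file offsets into "X.<r>", so regions belonging to
 * other ranks remain unwritten holes (the file stays sparse).
 * Names, sizes, and layout are assumptions, not the paper's format. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const int   rank   = 0;      /* pretend we are process P0             */
    const int   nprocs = 2;      /* pretend the job has two processes     */
    const off_t piece  = 4096;   /* assumed size of each contiguous piece */

    char path[64];
    snprintf(path, sizeof(path), "X.%d", rank);       /* segment file X.0 */
    int fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    memset(buf, 'A' + rank, sizeof(buf));

    /* Strided N-1 pattern: this rank owns every nprocs-th piece. Writing at
     * the global offset leaves holes where the other rank's pieces would go. */
    for (int i = 0; i < 3; i++) {
        off_t global_off = (off_t)(i * nprocs + rank) * piece;
        pwrite(fd, buf, sizeof(buf), global_off);
    }

    close(fd);
    return 0;
}
```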

  6. Implementation: Locality-aware MPI-IO
     ● Locality-aware MPI-IO: an MPI-IO optimization for shared file access
       ○ Implicitly converts an SSF into Sparse Segments
       ○ Generates the Sparse Segments in MPI_File_open() (see the sketch after this slide)
     [Figure: (a) Conventional MPI-IO: P0 and P1 both call MPI_File_open(X) and open the same file X; (b) Locality-aware MPI-IO (ours): MPI_File_open(X) opens the per-process segments X.0 and X.1]
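A minimal sketch of the idea in the figure, assuming the translation simply appends the rank to the requested name; the real logic inside MPI_File_open() is not shown in the slides. A hypothetical wrapper turns an open of X into a per-process open of X.<rank>. The wrapper name and the use of MPI_COMM_SELF for the segment are illustrative assumptions, not the authors' implementation.

```c
/* Sketch (not the authors' implementation): opening shared file "X" is
 * translated into opening a per-process segment "X.<rank>", mirroring the
 * figure's MPI_File_open(X) -> open(X.0), open(X.1). */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical wrapper: each rank opens its own segment of the logical file. */
int locality_aware_file_open(MPI_Comm comm, const char *name,
                             int amode, MPI_Info info, MPI_File *fh) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    char segment[256];
    snprintf(segment, sizeof(segment), "%s.%d", name, rank);  /* e.g. X.0, X.1 */

    /* The segment is private to this rank, so it can be opened on COMM_SELF
     * and, under Gfarm, placed on the node-local storage. */
    return MPI_File_open(MPI_COMM_SELF, segment, amode, info, fh);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_File fh;
    locality_aware_file_open(MPI_COMM_WORLD, "X",
                             MPI_MODE_CREATE | MPI_MODE_WRONLY,
                             MPI_INFO_NULL, &fh);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
```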

  7. Locality-oriented File Placement
     ● Store each Sparse Segment in Gfarm [25]
       ○ Gfarm is a locality-oriented parallel file system
       ○ Gfarm automatically stores a full copy of each file on the storage nearest to the process
       ○ Each Sparse Segment is stored on the corresponding node's local storage
     [Figure: on Node #0 and Node #1, MPI_File_open(X) opens X.0 and X.1, which the Gfarm filesystem places on each node's local storage]

  8. Experiment
     ● Method
       ○ Issue write accesses against a single shared file (weak scaling)
     ● Applications
       ○ Microbenchmark: IOR
       ○ Application benchmarks: S3D-IO, LES-IO, VPIC-IO
     ● Environment
       ○ System: TSUBAME 3.0 supercomputer [8] at Tokyo Tech
         ■ Proposal: node-local storage on each compute node
         ■ Lustre: file system nodes, 68 OSTs (peak 50 GB/s; not an apples-to-apples comparison)
         ■ BeeOND: node-local storage on each compute node

  9. Experiment
     ● The proposed method scales in all benchmarks (aggregate bandwidth)
       ○ Lustre bandwidth saturates when the number of processes exceeds the number of OSTs
       ○ BeeOND is not scalable even though it uses the same node-local storage
       ○ The proposal is scalable
     [Figure: aggregate write bandwidth for IOR (non-collective), S3D-IO, LES-IO, and VPIC-IO]

  10. Discussion
     ● Lustre and BeeOND show low bandwidth in the application benchmarks
       ○ Accesses to the SSF in small pieces cause heavy lock contention among processes
     ● Our method demonstrated linearly scalable bandwidth
       ○ Conversion to Sparse Segments
         ■ All application benchmarks scale
         ■ Result: successfully avoids lock contention
       ○ Locality-aware file placement
         ■ All benchmarks scale linearly, even beyond the number of OST nodes
         ■ Result: successfully and effectively scales using node-local storage

  11. Conclusion
     ● We proposed Sparse Segments and Locality-aware MPI-IO
     ● Our method demonstrates scalable parallel write
       ○ In both the microbenchmark and the application benchmarks

  12. References
     [6] P. Carns, K. Harms, W. Allcock, C. Bacon, S. Lang, R. Latham, and R. Ross, "Understanding and Improving Computational Science Storage Access Through Continuous Characterization," ACM Trans. Storage, vol. 7, no. 3, pp. 8:1–8:26, Oct. 2011.
     [7] Summit. [Online]. Available: https://www.olcf.ornl.gov/olcf-resources/compute-systems/summit/
     [8] TSUBAME 3.0. [Online]. Available: https://www.t3.gsic.titech.ac.jp/en/hardware
     [9] ABCI. [Online]. Available: https://abci.ai/en/about_abci/computing_resource.html
     [25] O. Tatebe, K. Hiraga, and N. Soda, "Gfarm grid file system," New Generation Computing, vol. 28, no. 3, pp. 257–275, Jul. 2010.

  13. Contact Information
     ● sugihara@hpcs.cs.tsukuba.ac.jp
     ● tatebe@cs.tsukuba.ac.jp
