
Isolated MPI-I/O Solution on top of MPI-1, by Emin Gabrielyan and Roger D. Hersch



  1. SFIO: Isolated MPI-I/O Solution on top of MPI-1. Emin Gabrielyan, Roger D. Hersch, École Polytechnique Fédérale de Lausanne, Switzerland, {Emin.Gabrielyan,RD.Hersch}@epfl.ch. Presented at the 5th Workshop on Distributed Supercomputing: Scalable Cluster Software, May 23-24, 2001, Sheraton Hyannis, Cape Cod, Hyannis MA.

  2. MPI-I/O Access Operations. [Figure: the READ and WRITE operations arranged along three axes: positioning (explicit offsets, individual file pointers, shared file pointers), synchronism (blocking, non-blocking, and the BEG/END split forms) and coordination (collective, non-collective).] The basic set of MPI-I/O interface functions consists of File Manipulation Operations, File View Operations and Data Access Operations. There are three orthogonal aspects to data access: positioning, synchronism, and coordination, which together give 12 types of read operation and 12 types of write operation.

  3. File View. [Figure: four combinations of memory view and file view: contiguous or non-contiguous in the memory, and contiguous or non-contiguous in the file.] The file view is a global concept that affects all data access operations. For each process it specifies that process's own view of the shared data file: a sequence of pieces of the common data file that are visible to that particular process. To specify the file view, the user creates a derived datatype that defines the fragmented structure of the visible part of the file. Since each access operation can use a different derived datatype to specify the fragmentation in memory, there are two additional orthogonal aspects to data access: the fragmentation in the memory and the fragmentation of the file view.

  4. Derived Datatypes. [Figure: derived datatype T4 built by nesting constructors: MPI_Type_vector(3,1,2,MPI_BYTE,&T1), then MPI_Type_contiguous(2,T1,&T2), then MPI_Type_struct(2,...,&T3), then MPI_Type_contiguous(2,T3,&T4).] MPI-1 provides techniques for creating datatype objects with an arbitrary data layout in memory. The opaque datatype object can be used in various MPI operations, but the layout information, once put in a derived datatype, cannot be decoded from the datatype.

  5. MPI-I/O Implementation. [Figure: layer diagram with the MPI-I/O Interface and MPI-I/O Implementation above the MPI-1 Interface and MPI-1 Implementation; the MPI-I/O implementation reaches into the internal operations and data structures of the MPI-1 implementation in order to decode the layout information of the file view's derived datatype.] MPI-2 operations, and the MPI-I/O subset in particular, form an extension to MPI-1. However, a developer of MPI-I/O needs access to the source code of the MPI-1 implementation on top of which he intends to implement MPI-I/O. Each MPI-1 implementation therefore requires its own specific development of MPI-I/O.

  6. Reverse Engineering or Memory Painting. [Figure: a contiguous buffer of the size of datatype T4 is sent with MPI_Send(source,size,MPI_BYTE,...) and received with MPI_Recv(destination-LB,1,T4,...) into a buffer of the size of T4's extent.] The layout information cannot be decoded from the datatype, but the behaviour of the datatype depends on the layout. We therefore define a special test for a derived datatype, analyse the behaviour of the datatype and, based on it, decode the layout information. For example, the MPI_Recv operation receives a contiguous network stream and distributes it in memory according to the data layout of the datatype. If the memory is first initialised with a "green colour" and the network stream has a "red colour", then analysing the memory after data reception gives us the necessary information on the data layout hidden in the opaque datatype. In our solution we do not use the MPI_Send and MPI_Recv operations; instead we use the standard MPI-1 operation MPI_Unpack, which avoids network transfers and the use of multiple processes.

  7. Portable MPI-I/O Solution. [Figure: layer diagram in which the MPI-I/O Implementation, together with Memory Painting, sits entirely above the MPI-1 Interface and no longer reaches into the MPI-1 Implementation.] Once we have a tool for decoding derived datatypes, it becomes possible to create an isolated MPI-I/O solution on top of any standard MPI-1. We made intensive use of the Argonne National Laboratory's MPICH implementation of MPI-I/O together with our datatype decoding technique, and have implemented an isolated solution for a limited subset of MPI-I/O operations.

  8. MPI-I/O Isolation. [Figure: the same diagram of READ and WRITE operations by positioning, synchronism and coordination as on slide 2.] The basic File Manipulation operations MPI_File_open and MPI_File_close, the File View operation MPI_File_set_view, and the blocking non-collective Data Access Operations MPI_File_write, MPI_File_write_at, MPI_File_read and MPI_File_read_at have already been implemented successfully in the form of an isolated, independent library. We are currently working on the collective counterparts of the blocking operations and trying to make use of the extended two-phase method for accessing sections of out-of-core arrays, on which the ANL implementation is based.

  9. Testing Isolated MPI-I/O. [Figure: layer diagram with the MPI-I/O Interface, MPI-I/O Implementation and Memory Painting on top of MPI-FCI on Swiss-Tx.] Test results with MPI-FCI for MPI_File_write, MPI_File_read, MPI_File_write_at and MPI_File_read_at:
     • Contiguous memory and file: all four operations Ok
     • Fragmented memory, contiguous file: all four operations Ok
     • Contiguous memory, fragmented file: all four operations Ok
     • Fragmented memory and file: all four operations Ok
     The implemented operations of the isolated MPI-I/O solution have been tested successfully with the MPI-FCI implementation of MPI-1 on the Swiss-Tx supercomputer.

  10. Gateway to the Parallel I/O of the Swiss-T1. [Figure: Swiss-T1 topology showing compute processors and I/O processors PR00 to PR63 attached to routing switches through TNET connections of ~86 MB/s.] Beneath the isolated MPI-I/O layer, we intend to switch to the Striped File I/O system (SFIO) as a high-performance I/O solution. The SFIO communication layer is implemented on top of MPI-1, so SFIO is also portable. We measured scalable performance of SFIO on the architecture of the Swiss-Tx supercomputer.

  11. SFIO on the Swiss-Tx machine. [Figure: performance in MB/s (0 to 400) against the number of compute or I/O nodes (1 to 31), with four series: read average, read maximum, write average, write maximum.] The performance of SFIO is measured for concurrent access from all compute nodes to all I/O nodes. In order to limit operating system caching effects, the total size of the striped file increases linearly with the number of I/O nodes, up to 32 GB. The stripe unit size is 200 bytes. The application's I/O performance is measured as a function of the number of compute and I/O nodes.

  12. Conclusion. The isolated solution automatically gives every MPI-1 owner an MPI-I/O, with no need to change, modify, or otherwise interfere with their current MPI-1 implementation. Future work:
     • Implementation of blocking collective file access operations.
     • Implementation of non-blocking file access operations.
     • The remaining File Manipulation Operations.
     • Switching to SFIO.
     Thank you!
