UCSB cs240b project, Fall 1999
PTMPI: Threaded MPI Execution on a Cluster of SMP Machines
Zoran Dimitrijevic
Department of Computer Science, University of California at Santa Barbara
E-mail: zoran@cs.ucsb.edu
Introduction
• Cluster of SMP machines
  o Each cluster node is an SMP machine
  o Communication between the nodes is through Ethernet TCP/IP
• Current MPI implementations for shared-memory machines:
  o TMPI – threaded MPI execution – each MPI node is a thread inside one process
    - Fast
    - Not scalable – a regular OS process can run on only one machine
  o MPICH – each MPI node is a process – communication between nodes involves operating system activity
    - Slow
    - Scalable – each node can run on a different machine
Problem Statement
• The system consists of several processes
  o Scalability – each process can run on a different machine
  o Communication between the processes is through sockets
  o Processes can run anywhere on the network
• Each MPI node is a thread inside a process
  o Fast communication between MPI nodes inside the same process – through shared memory
  o The nodes are created during startup – each process can have a different number of MPI nodes running inside it
Proposed Solution
• PTMPI startup:
  o The configuration is read from a resource file
  o Each process is started with a single initialization argument – its process ID
  o Each process gets its IP address and listening port number
  o With p processes in the system, a complete socket graph is created – p(p-1)/2 sockets (see the sketch after this list)
  o Each process creates local_MPI_count receiver queues
  o Each process creates a thread for each MPI node running on it
  o Each process creates two communication threads:
    - In communicator – reads from the sockets and dispatches messages
    - Out communicator – reads from its queues (one per MPI thread) and writes to the sockets
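The socket-mesh part of this startup can be sketched roughly as follows. This is an illustrative reconstruction, not the PTMPI source: the Peer list stands in for the resource file, and the convention that each process connects to lower-ranked peers and accepts connections from higher-ranked ones is an assumption used here to obtain the p(p-1)/2 sockets.

```cpp
// Illustrative sketch of the per-process startup (not the original PTMPI code).
// Assumption: the resource file simply lists "host port" for each of the
// p processes, in process-ID order, and the cluster is homogeneous (so a raw
// int can be exchanged without byte-order conversion).
#include <cstring>
#include <string>
#include <vector>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>

struct Peer { std::string host; int port; };

// Each process connects to every lower-ranked process and accepts one
// connection from every higher-ranked one, so each pair shares exactly one
// socket and p(p-1)/2 sockets exist in total.
static std::vector<int> build_socket_mesh(int my_id, const std::vector<Peer>& peers) {
    int p = (int)peers.size();
    std::vector<int> sock(p, -1);

    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons((uint16_t)peers[my_id].port);
    addr.sin_addr.s_addr = INADDR_ANY;
    bind(listener, (sockaddr*)&addr, sizeof(addr));
    listen(listener, p);

    // Actively connect to lower-ranked peers and announce our process ID.
    for (int j = 0; j < my_id; ++j) {
        sockaddr_in peer;
        std::memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_port = htons((uint16_t)peers[j].port);
        hostent* h = gethostbyname(peers[j].host.c_str());
        std::memcpy(&peer.sin_addr, h->h_addr_list[0], h->h_length);

        int s;
        for (;;) {
            s = socket(AF_INET, SOCK_STREAM, 0);
            if (connect(s, (sockaddr*)&peer, sizeof(peer)) == 0) break;
            close(s);          // peer not listening yet: retry with a fresh socket
            sleep(1);
        }
        write(s, &my_id, sizeof(my_id));
        sock[j] = s;
    }
    // Accept one connection from every higher-ranked peer.
    for (int k = my_id + 1; k < p; ++k) {
        int s = accept(listener, NULL, NULL);
        int sender_id = -1;
        read(s, &sender_id, sizeof(sender_id));
        sock[sender_id] = s;
    }
    close(listener);
    return sock;   // sock[j] is the socket shared with process j (j != my_id)
}
```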
• MPI node thread startup:
  o Each MPI node is an instance of class MPI_Node
  o PTMPI main creates a thread for each MPI node and passes it the local ID
  o Each thread creates a new instance of class MPI_Node
  o SPMD in shared memory (a sketch of this organization follows below):
    - All global data of the MPI program must be copied for each thread
    - This is achieved because all MPI functions are friend functions of class MPI_Node or defined in class MPI_Node, and all global MPI data are members of class MPI_Node
    - All MPI global data can be placed in mpi_global_data.h, which is included in the MPI_Node class
  o Each thread calls the method mpi_main(int argc, char **argv)
    - Arguments are passed from the PTMPI main function, except the first one (the program name is set to mpi_program)
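A minimal sketch of this organization, assuming pthreads are used; this is not the original class declaration, and the member names other than mpi_main, as well as the simplified MPI call signature, are assumptions.

```cpp
// Minimal sketch of the MPI_Node organization (not the original PTMPI source).
#include <pthread.h>
#include <cstdio>
#include <vector>

class MPI_Node {
public:
    explicit MPI_Node(int local_id) : my_id(local_id) {}

    // The user's MPI program, renamed from main() to mpi_main(); because it is
    // a member, it sees the per-node copies of the "global" MPI data.
    int mpi_main(int argc, char **argv);

    // MPI calls are members (or friends) of MPI_Node so they can reach the
    // per-node state; the signature is simplified for this sketch.
    int MPI_Comm_rank(int *rank) { *rank = my_id; return 0; }

private:
    int my_id;   // rank of this MPI node (local-to-global mapping omitted here)
    // "Global" variables of the MPI program become members here,
    // e.g. by including mpi_global_data.h in the class body.
};

// Arguments handed to each node thread by the PTMPI main function.
struct NodeArgs { int local_id; int argc; char **argv; };

// Thread start routine: every node thread builds its own MPI_Node instance
// and enters mpi_main() with the forwarded arguments.
static void *node_thread(void *p) {
    NodeArgs *a = (NodeArgs *)p;
    MPI_Node node(a->local_id);
    node.mpi_main(a->argc, a->argv);
    return NULL;
}

int MPI_Node::mpi_main(int argc, char **argv) {
    int rank;
    MPI_Comm_rank(&rank);
    std::printf("node %d started (%d args, program %s)\n", rank, argc, argv[0]);
    return 0;
}

// In the PTMPI main: one thread per MPI node that lives in this process.
static void start_local_nodes(int local_MPI_count, int argc, char **argv) {
    std::vector<pthread_t> tid(local_MPI_count);
    std::vector<NodeArgs>  args(local_MPI_count);
    for (int i = 0; i < local_MPI_count; ++i) {
        args[i].local_id = i;
        args[i].argc = argc;
        args[i].argv = argv;
        pthread_create(&tid[i], NULL, node_thread, &args[i]);
    }
    for (int i = 0; i < local_MPI_count; ++i)
        pthread_join(tid[i], NULL);
}
```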
• PTMPI system layout:
  [Figure: three processes (Process 0: IP0, Process 1: IP1, Process 2: IP2), each running several MPI node threads plus input and output daemon threads; the processes are connected to each other by sockets.]
• Process node layout:
  [Figure: the local MPI node threads (MPI_Node::mpi_main), their recv_queues, the Out communicator queues (one per MPI thread), and the In/Out communicator threads that read from and write to the sockets. Each thread reads and writes the recv_queues in shared memory.]
• Receiver queues:
  [Figure: each receiver queue is a list of MPI_QueueElem entries, each with its own mutex and condition variable, together with the recv_buffer, use_mutex, recv_request, and recv_cond fields.]
• Messages: MPI_QueueElem (sketched below)
  o Goal: minimize the number of memory copies in the system
  o All queues in the system use the same element class
  o Broadcast does not copy the message
  o Threads use the mutex and condition members of MPI_QueueElem
  o The last waiter frees the message (if the message is buffered) and deletes the element
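A sketch of such an element, assuming pthreads; the field names other than mutex and cond, and the ownership convention, are assumptions rather than the original declaration.

```cpp
// Sketch of a shared queue element along the lines described above
// (not the original PTMPI declaration).
#include <pthread.h>
#include <cstdlib>

struct MPI_QueueElem {
    void           *buffer;     // message payload (possibly a PTMPI-owned copy)
    int             count;      // payload size in bytes
    int             buffered;   // nonzero if PTMPI allocated the buffer
    int             waiters;    // receivers that still have to consume it
    pthread_mutex_t mutex;      // protects waiters and completion state
    pthread_cond_t  cond;       // receivers block here until the data arrives
};

// Called by each receiver of a broadcast message: the payload is shared,
// never copied, and the last waiter releases it and deletes the element.
static void release_elem(MPI_QueueElem *e) {
    pthread_mutex_lock(&e->mutex);
    int last = (--e->waiters == 0);
    pthread_mutex_unlock(&e->mutex);
    if (last) {
        if (e->buffered) free(e->buffer);
        pthread_mutex_destroy(&e->mutex);
        pthread_cond_destroy(&e->cond);
        delete e;
    }
}
```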
• MPI functions implemented (a minimal usage example follows below):
  o MPI_Init
  o MPI_Comm_rank
  o MPI_Comm_size
  o MPI_Finalize
  o MPI_Send
  o MPI_Isend
  o MPI_Recv
  o MPI_Irecv
  o MPI_Wait
  o MPI_Broadcast
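A minimal program exercising part of this subset. It is written against the standard MPI C signatures; the header name and how exactly PTMPI mirrors those signatures are assumptions here. The entry point is mpi_main, as described earlier.

```cpp
// Minimal test program for the implemented subset (illustrative only).
#include <cstdio>
#include "mpi.h"   // header name assumed; PTMPI may expose its own header

int mpi_main(int argc, char **argv) {
    int rank, size, token = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          // to node 1
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); // from node 0
        std::printf("node %d of %d received %d\n", rank, size, token);
    }

    MPI_Finalize();
    return 0;
}
```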
Initial Performance Evaluation
[Figure 3: Block-based matrix multiplication execution time in seconds for 16 MPI nodes running on four two-processor SMP nodes; MPICH vs. PTMPI for configurations 1024x16, 1024x32, 1024x64, 2048x16, 2048x32, 2048x64.]
[Figure 4: Block-based matrix multiplication execution time in seconds for 8 MPI nodes running on four two-processor SMP nodes; same configurations.]
[Figure 5: Block-based matrix multiplication execution time in seconds for 32 MPI nodes running on four four-processor SMP nodes; MPICH vs. PTMPI.]
[Figure 6: Block-based matrix multiplication execution time in seconds for 16 MPI nodes running on four four-processor SMP nodes; MPICH vs. PTMPI.]
[Figure 7: PTMPI block-based matrix multiplication (2048x32) execution time in seconds as a function of the number of two-processor SMP nodes, with one and two MPI nodes per CPU.]
[Figure 8: PTMPI block-based matrix multiplication (2048x32) execution time in seconds as a function of the number of four-processor SMP nodes, with one and two MPI nodes per CPU.]
[Figure 9: PTMPI block-based matrix multiplication MFLOPS rate as a function of the number of two-processor SMP nodes (one thread per processor), for 1024x16 and 2048x32.]
[Figure 10: PTMPI block-based matrix multiplication MFLOPS rate per processor as a function of the number of four-processor SMP nodes (one thread per processor), for 1024x16 and 2048x32.]
Conclusions and Future Improvements
• Basic MPI functions are implemented
• The current MPI-node-to-process mapping is a basic one; smart mapping is expected to significantly improve execution speedup for some applications
• Since communication between threads is faster than through sockets, MPI gather functions need to be implemented to exploit this
• Spin waiting for send and receive inside a process when running on a real SMP
• Sending only the message header through the socket when the message is big, and waiting for a message-data request once the receiver is ready (a rough sketch follows below)
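The last point describes a rendezvous-style protocol for large messages. Since it is future work, everything in the sketch below – message kinds, field names, and the eager threshold – is an assumption, not an implemented design.

```cpp
// Hedged sketch of the proposed large-message protocol (future work, not
// implemented in PTMPI): only the header travels eagerly; the payload follows
// once the receiver asks for it.
#include <cstddef>

enum MsgKind { MSG_EAGER, MSG_HEADER_ONLY, MSG_DATA_REQUEST, MSG_DATA };

struct MsgHeader {
    MsgKind kind;
    int     src, dest, tag;
    size_t  length;            // payload size in bytes
};

const size_t EAGER_LIMIT = 16 * 1024;   // assumed threshold

// Sender side: decide whether to ship the payload immediately.
inline MsgKind choose_kind(size_t length) {
    return length <= EAGER_LIMIT ? MSG_EAGER : MSG_HEADER_ONLY;
}
// Receiver side (conceptually): on matching a MSG_HEADER_ONLY message, send
// a MSG_DATA_REQUEST back to the sender, then wait for the MSG_DATA payload.
```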