QMPI: A Library for Multithreaded MPI Applications
Alex Brooks, Hoang-Vu Dang, Marc Snir
Outline
• Motivation
• Communication Model
• Qthreads
• QMPI
• Summary
MOTIVATION
Issue
• Large numbers of threads performing communication cause problems
  – Locking
  – Polling
  – Scheduling
• As a result, there are very few hybrid MPI+pthread applications
Current MPI Design
• MPI code is executed by the calling thread (see the sketch below)
  – Requires coarse-grain locking, which limits concurrency
  – Some implementations don't support multithreaded use at all
• Communication completion is observed through polling
  – Separate calls to the progress engine
• The scheduler is unaware of which threads have become runnable
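To make the problem concrete, here is a minimal sketch of the MPI+pthread pattern described above. The thread count, tags, and two-rank setup are illustrative choices, not from the slides; the point is that every thread calls MPI directly, so the library must serialize them internally:

```c
#include <mpi.h>
#include <pthread.h>

#define NTHREADS 4

typedef struct { int peer; int tag; } targ_t;

/* Each thread calls MPI directly.  With MPI_THREAD_MULTIPLE the
 * library must protect its internal state (typically with a
 * coarse-grain lock), so concurrent calls serialize, and each
 * blocking call polls the progress engine until it completes. */
static void *worker(void *p)
{
    targ_t *a = (targ_t *)p;
    int rank, buf;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        buf = a->tag;
        MPI_Send(&buf, 1, MPI_INT, a->peer, a->tag, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&buf, 1, MPI_INT, a->peer, a->tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1); /* implementation lacks support */

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    pthread_t t[NTHREADS];
    targ_t args[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        args[i].peer = 1 - rank;  /* run with exactly 2 ranks */
        args[i].tag  = i;
        pthread_create(&t[i], NULL, worker, &args[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    MPI_Finalize();
    return 0;
}
```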
Performance
[Figure: multithreaded bandwidth (MB/s) vs. message size (1 B–4 MB), one panel for MPICH and one for MVAPICH]
Goals
• Enable efficient use of multithreaded two-sided communication
  – Light-weight threads
  – Low-overhead scheduling upon communication completion
• Improve programmability of multithreaded MPI
COMMUNICATION MODEL
Main idea
[Diagram: multiple worker threads submitting requests to a shared communication engine]
• Light-weight tasks submit requests to the communication engine
• The communication engine marks a task as runnable when its communication completes
QTHREADS
Introduction
• Tasking model that supports millions of light-weight threads
• Three main entities
  – Task: a function to be executed
  – Worker: a thread that executes tasks
  – Shepherd: a queue of tasks
Synchronization
• Full/Empty bit (FEB) semantics
  – The FEB tracks the status of a data word
    • 0 (empty): data has not been written
    • 1 (full): data has been written
• Read: stall the task until the FEB is full, then read the data and set the bit to empty
• Write: stall the task until the FEB is empty, then write the data and set the bit to full (see the sketch below)
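A minimal sketch of FEB-based producer/consumer synchronization using the Qthreads API (qthread_readFE, qthread_writeEF, and qthread_fork as documented in the Qthreads library; treat the exact signatures as assumptions):

```c
#include <qthread/qthread.h>
#include <stdio.h>

static aligned_t slot;  /* word guarded by a full/empty bit */

/* Writer: stalls until `slot` is empty, writes, sets the bit full. */
static aligned_t producer(void *arg)
{
    aligned_t value = 42;
    qthread_writeEF(&slot, &value);
    return 0;
}

/* Reader: stalls until `slot` is full, reads, sets the bit empty. */
static aligned_t consumer(void *arg)
{
    aligned_t value;
    qthread_readFE(&value, &slot);
    printf("consumed %lu\n", (unsigned long)value);
    return 0;
}

int main(void)
{
    aligned_t done;
    qthread_initialize();
    qthread_empty(&slot);          /* start the word as empty */
    qthread_fork(consumer, NULL, &done);
    qthread_fork(producer, NULL, NULL);
    qthread_readFF(NULL, &done);   /* wait for the consumer task */
    return 0;
}
```

Note that a task blocked in qthread_readFE does not spin: it is preempted and its worker picks up another task, which is the behavior QMPI exploits.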
Task Scheduler
• Each worker is associated with a single shepherd
  – Tasks are pulled from the shepherd for execution
• Tasks can be stolen from other shepherds under certain conditions
• Tasks are preempted when waiting on synchronization
Overview
• Scalable over-subscription
  – Millions of tasks can be spawned with minimal performance overhead
• Worker idle time is reduced through task preemption at synchronization points
• "Automatic" load-balancing of tasks
• Shared-memory environment
QMPI
Overview
• Qthreads + MPI
  – The Qthreads light-weight task model, with communication through MPI
• Two threads are dedicated to the communication engine
  – One for communication, one for FEB management
Communication Model
[Diagram, built up across several slides: within a node, worker/shepherd pairs run tasks; tasks place requests on a Comm Queue serviced by a dedicated Comm Thread, which drives the network; completions are placed on an FEB Queue serviced by a dedicated FEB Thread, which updates the synchronization container to wake waiting tasks]
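The flow in the diagram can be sketched as follows. The slides do not show QMPI's API, so the qmpi_* names, the queue primitives, and the request layout below are hypothetical, meant only to illustrate the request/completion path:

```c
#include <mpi.h>
#include <qthread/qthread.h>

/* Hypothetical request descriptor; not the real QMPI structures. */
typedef struct qmpi_request {
    const void *buf; int count; MPI_Datatype type;
    int peer, tag;
    MPI_Request mpi_req;
    aligned_t   done;      /* FEB the issuing task blocks on */
    int         started;
} qmpi_request_t;

/* Hypothetical queue primitives (implementations omitted). */
extern void comm_queue_push(qmpi_request_t *r);
extern qmpi_request_t *comm_queue_pop(void);
extern void feb_queue_push(qmpi_request_t *r);
extern qmpi_request_t *feb_queue_pop(void);

/* Worker task: hand the request to the comm thread, then block on
 * the FEB; the scheduler runs other tasks instead of polling. */
static void qmpi_send(const void *buf, int count, MPI_Datatype type,
                      int dst, int tag)
{
    qmpi_request_t r = { buf, count, type, dst, tag };
    qthread_empty(&r.done);
    comm_queue_push(&r);
    qthread_readFF(NULL, &r.done);   /* task preempts until filled */
}

/* Comm thread: the only thread that calls MPI, so the MPI library
 * needs no fine-grained locking on its behalf. */
static void comm_thread_loop(void)
{
    for (;;) {
        qmpi_request_t *r = comm_queue_pop();
        if (!r->started) {           /* issue the MPI call once */
            MPI_Isend(r->buf, r->count, r->type, r->peer, r->tag,
                      MPI_COMM_WORLD, &r->mpi_req);
            r->started = 1;
        }
        int flag = 0;
        MPI_Test(&r->mpi_req, &flag, MPI_STATUS_IGNORE);
        if (flag) feb_queue_push(r);  /* done: notify FEB thread */
        else      comm_queue_push(r); /* not yet: re-queue, move on */
    }
}

/* FEB thread: fill the FEB, marking the waiting task runnable. */
static void feb_thread_loop(void)
{
    for (;;)
        qthread_fill(&feb_queue_pop()->done);
}
```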
Performance
[Figure: multithreaded bandwidth (MB/s) vs. message size (1 B–4 MB) for MPICH and MVAPICH, as in the motivation section]
Target Applications
• Not beneficial for all problems
  – Little overlap in multithreaded communication increases runtime
• Bulk-synchronous communication
• Over-subscription
  – Benefits directly from Qthreads
Simple Experiment
• 5-point stencil computation (see the sketch below)
  – Send edge values to neighbors
  – Receive edge values from neighbors
  – Compute new values
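A compact sketch of one iteration of this benchmark in plain nonblocking MPI. The 2-D decomposition, neighbor array, and packed edge buffers are assumptions; the slides only name the three phases:

```c
#include <mpi.h>

/* One iteration: post receives for the four halos, send the four
 * edges, wait for all eight transfers, then compute.  The neighbor
 * order and packed buffers are illustrative. */
void stencil_step(int n, const double *edge[4], double *halo[4],
                  const int nbr[4], MPI_Comm comm)
{
    MPI_Request req[8];

    for (int d = 0; d < 4; d++)   /* Receive phase */
        MPI_Irecv(halo[d], n, MPI_DOUBLE, nbr[d], 0, comm, &req[d]);
    for (int d = 0; d < 4; d++)   /* Send phase */
        MPI_Isend(edge[d], n, MPI_DOUBLE, nbr[d], 0, comm, &req[4 + d]);
    MPI_Waitall(8, req, MPI_STATUSES_IGNORE);

    /* Calculation phase: each new cell is computed from itself and
     * its four neighbors, using halo values on the boundary (omitted). */
}
```

Under QMPI, each of the eight transfers would instead be issued by a light-weight task that blocks on an FEB, so the send and receive phases overlap with whatever other tasks are runnable.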
Results
[Figure, three panels: execution time (µsec, log scale) vs. grid size (one side: 120, 1200, 12000) for MPI+Pthread vs. QMPI in the Receive Phase, Send Phase, and Calculation Phase]
SUMMARY
Conclusion
• Large numbers of threads performing communication cause problems
• QMPI uses a new communication model to decrease communication overhead
• QMPI performs much better than traditional MPI+pthreads in many situations
On-going/Future Work
• Test QMPI with real applications
  – MiniGhost, LULESH, UTS, etc.
• Message aggregation
• Push the QMPI model into MPI implementations as an internal feature
QUESTIONS