

  1. QMPI: A Library for Multithreaded MPI Applications Alex Brooks, Hoang-Vu Dang, Marc Snir

  2. Outline • Motivation • Communication Model • Qthreads • QMPI • Summary

  3. MOTIVATION

  4. Issue • Large numbers of threads performing communication cause problems – Locking – Polling – Scheduling • As a result, there are very few hybrid MPI+pthread applications

  5. Current MPI Design • MPI code is executed by the calling thread – Requires coarse-grain locking, which limits concurrency – Some implementations do not support this at all • Communication completion is observed through polling – Separate calls to the progress engine • The scheduler is unaware of which threads have become runnable (see the sketch below)
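
For contrast, a small hypothetical MPI+pthreads fragment (not from the talk) makes the pattern the slide criticizes concrete: every thread calls MPI directly, so the library must provide MPI_THREAD_MULTIPLE, and completion is observed by busy-polling.

```c
/* Hypothetical MPI+pthreads fragment, not from the talk: each thread
 * calls MPI directly, so the library must be initialized with
 * MPI_THREAD_MULTIPLE and completion is observed by busy-polling. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        /* Some implementations do not offer this level at all; others
         * provide it only via a coarse-grain lock around every call. */
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Inside each worker thread: poll for completion, entering the
     * progress engine on every MPI_Test call. The OS scheduler cannot
     * tell that this thread is merely waiting. */
    MPI_Request req = MPI_REQUEST_NULL;   /* e.g. from MPI_Irecv(...) */
    int done = 0;
    while (!done)
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```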

  6. Performance [Figure: bandwidth (MB/s) vs. message size (1 B to 4 MB), one panel for MPICH and one for MVAPICH]

  7. Goals • Enable efficient use of multithreaded two-sided communication – Light-weight threads – Low-overhead scheduling upon communication completion • Improve programmability of multithreaded MPI

  8. COMMUNICATION MODEL

  9. Main Idea [Diagram: worker threads submitting requests to a communication engine] • Light-weight tasks submit requests to the comm. engine • The comm. engine marks a task as runnable when its communication completes (a minimal sketch of this protocol follows)
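
A minimal sketch of the submit/complete protocol is given below; comm_request, submit_recv, and comm_engine are illustrative names, not QMPI's actual interface. Worker tasks post nonblocking operations and enqueue them, and a single engine thread polls MPI and flags completed tasks as runnable.

```c
/* Minimal sketch of the submit/complete protocol; comm_request,
 * submit_recv, and comm_engine are illustrative names, not QMPI's API. */
#include <mpi.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct comm_request {
    MPI_Request mpi_req;       /* handle for the pending transfer       */
    volatile bool *done;       /* set when the engine sees completion;
                                  QMPI would release the task's FEB     */
    struct comm_request *next;
} comm_request;

static comm_request *pending = NULL;
static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;

/* Worker task: post a nonblocking receive and hand it to the engine. */
static void submit_recv(void *buf, int count, int src, volatile bool *done)
{
    comm_request *r = malloc(sizeof *r);
    MPI_Irecv(buf, count, MPI_BYTE, src, 0, MPI_COMM_WORLD, &r->mpi_req);
    r->done = done;
    pthread_mutex_lock(&pending_lock);
    r->next = pending;
    pending = r;
    pthread_mutex_unlock(&pending_lock);
    /* the task now blocks on *done (in QMPI, on an FEB) and is
       descheduled instead of polling */
}

/* Communication engine: the only thread that touches the MPI progress
 * engine; it marks tasks runnable as their transfers complete. */
static void *comm_engine(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&pending_lock);
        for (comm_request **p = &pending; *p; ) {
            int flag;
            MPI_Test(&(*p)->mpi_req, &flag, MPI_STATUS_IGNORE);
            if (flag) {
                comm_request *r = *p;
                *p = r->next;
                *r->done = true;   /* mark the waiting task runnable */
                free(r);
            } else {
                p = &(*p)->next;
            }
        }
        pthread_mutex_unlock(&pending_lock);
    }
    return NULL;
}
```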

  10. QTHREADS

  11. Introduction • Tasking model that supports millions of light-weight threads • Three main entities – Task: a function of execution – Worker: a thread that executes tasks – Shepherd: a queue of tasks (see the example below)
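
As an illustration, here is a minimal sketch using the public Qthreads API (qthread_initialize, qthread_fork, qthread_readFF); NTASKS and task_body are invented for the example. It spawns a large number of tasks and waits on the full/empty bit of each return value.

```c
/* Minimal sketch using the public Qthreads API; NTASKS and task_body
 * are invented for the example. */
#include <qthread/qthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTASKS 1000000   /* millions of tasks are meant to be cheap */

static aligned_t task_body(void *arg)
{
    /* Task: a function of execution, scheduled onto a worker. */
    return (aligned_t)(uintptr_t)arg * 2;
}

int main(void)
{
    qthread_initialize();              /* start workers and shepherds */
    static aligned_t ret[NTASKS];

    for (uintptr_t i = 0; i < NTASKS; i++)
        qthread_fork(task_body, (void *)i, &ret[i]);  /* ret[i] FEB empty */

    for (int i = 0; i < NTASKS; i++)
        qthread_readFF(NULL, &ret[i]); /* wait until each FEB is full */

    printf("all %d tasks finished\n", NTASKS);
    return 0;
}
```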

  12. Synchronization • Full/Empty bit (FEB) semantics – the FEB tracks the status of a word of data • 0 (empty): data has not been written • 1 (full): data has been written • Read – stall the task until the FEB is full, then read the data and mark it empty • Write – stall until the FEB is empty, then write the data and mark it full
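
The same FEB semantics drive synchronization between tasks. Below is a producer/consumer sketch; slot, producer, and consumer are invented names, while the qthread_* calls are the real Qthreads operations implementing the semantics above.

```c
/* Producer/consumer sketch of FEB synchronization; slot, producer,
 * and consumer are invented names, the qthread_* calls are real. */
#include <qthread/qthread.h>
#include <stdio.h>

static aligned_t slot;   /* one word of data guarded by its FEB */

static aligned_t producer(void *arg)
{
    (void)arg;
    aligned_t v = 42;
    qthread_writeEF(&slot, &v);   /* stall until empty, write, set full */
    return 0;
}

static aligned_t consumer(void *arg)
{
    (void)arg;
    aligned_t v;
    /* Stall until full, read, set empty. The task is preempted while
       it waits; the worker runs other tasks instead of spinning. */
    qthread_readFE(&v, &slot);
    printf("got %lu\n", (unsigned long)v);
    return 0;
}

int main(void)
{
    qthread_initialize();
    qthread_empty(&slot);            /* start the FEB in the empty state */

    aligned_t r1, r2;
    qthread_fork(consumer, NULL, &r1);
    qthread_fork(producer, NULL, &r2);
    qthread_readFF(NULL, &r1);       /* join both tasks */
    qthread_readFF(NULL, &r2);
    return 0;
}
```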

  13. Task Scheduler • Each worker is associated with a single shepherd – Tasks are pulled from the shepherd to execute • Tasks can be stolen from other shepherds under certain conditions • Tasks are preempted when waiting on synchronization

  14. Overview • Scalable over-subscription – Millions of tasks can be spawned with minimal performance overhead • Worker idle time is reduced through task preemption at synchronization • “Automatic” load-balancing of tasks • Shared-memory environment

  15. QMPI

  16. Overview • Qthreads + MPI – the Qthreads light-weight task model with communication through MPI • Two threads are dedicated to the communication engine – one for communication, one for FEB management (a hypothetical task-side wrapper is sketched below)
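
Building on the queue sketch above, a hypothetical task-side wrapper shows how a blocking send can suspend only the calling light-weight task; qmpi_send and comm_enqueue are illustrative names, not QMPI's real interface.

```c
/* Hypothetical task-side wrapper; qmpi_send and comm_enqueue are
 * illustrative names, not QMPI's real interface. */
#include <mpi.h>
#include <qthread/qthread.h>

/* Assumed helper: hands (req, feb) to the dedicated communication
 * thread, e.g. via a queue like the one sketched earlier; the FEB
 * thread fills *feb once the transfer completes. */
void comm_enqueue(MPI_Request req, aligned_t *feb);

void qmpi_send(const void *buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm)
{
    aligned_t feb;
    qthread_empty(&feb);                  /* FEB starts empty           */

    MPI_Request req;
    MPI_Isend(buf, count, type, dest, tag, comm, &req);
    comm_enqueue(req, &feb);              /* engine will fill the FEB   */

    /* Only this light-weight task blocks; its worker moves on to run
       other tasks, and the task becomes runnable again when the FEB
       thread fills the FEB. */
    qthread_readFF(NULL, &feb);
}
```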

  17–22. Communication Model [Diagram, built up over six animation slides: within a Node, Worker Shepherds are connected to a Comm Thread with its Comm Queue, which talks to the Network, and to an FEB Thread with its FEB Queue, which manages the Synch Container]

  23. Performance [Figure: bandwidth (MB/s) vs. message size (1 B to 4 MB), one panel for MPICH and one for MVAPICH]

  24. Target Applications • Not beneficial for all problems – with little overlap among multithreaded communication, it can increase runtime • Bulk-synchronous communication • Oversubscription – benefits directly from Qthreads

  25. Simple Experiment • 5-point stencil computation – Send edge values to neighbors – Receive edge values from neighbors – Compute new values (a halo-exchange sketch follows)
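
For reference, one communication step of such a stencil in plain nonblocking MPI might look like the sketch below; halo_exchange and its parameters are invented for illustration. Under QMPI, each wait would instead suspend a light-weight task on its FEB.

```c
/* Sketch of one halo-exchange step in plain nonblocking MPI;
 * halo_exchange and its parameters are invented for illustration.
 * nbr[] holds the four neighbor ranks (MPI_PROC_NULL at boundaries). */
#include <mpi.h>

void halo_exchange(double *send[4], double *recv[4], int n,
                   const int nbr[4], MPI_Comm comm)
{
    MPI_Request reqs[8];
    for (int d = 0; d < 4; d++) {
        MPI_Irecv(recv[d], n, MPI_DOUBLE, nbr[d], 0, comm, &reqs[d]);
        MPI_Isend(send[d], n, MPI_DOUBLE, nbr[d], 0, comm, &reqs[4 + d]);
    }
    /* A pthread would block (or poll) here; a QMPI task would instead
       suspend on an FEB and free its worker for other tasks. */
    MPI_Waitall(8, reqs, MPI_STATUSES_IGNORE);
    /* ...then compute new values using the received edges... */
}
```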

  26. Results [Figure: execution time (usec) vs. grid size (1 side: 120, 1200, 12000) for the Receive, Send, and Calculation phases, comparing MPI+Pthread and QMPI]

  27. SUMMARY

  28. Conclusion • Large numbers of threads performing communication cause problems • QMPI uses a communication-engine model to decrease communication overhead • QMPI performs much better than traditional MPI+pthreads in many situations

  29. On-going/Future Work • Test QMPI with real applications – MiniGhost, LULESH, UTS, etc. • Message aggregation • Push the QMPI model into MPI implementations as an internal feature

  30. QUESTIONS
