1. MPI: 25 Years of Progress
   Anthony Skjellum, University of Tennessee at Chattanooga, Tony-skjellum@utc.edu
   Formerly: LLNL, MSU, MPI Software Technology, Verari/Verarisoft, UAB, and Auburn University
   Co-authors: Ron Brightwell, Sandia; Rossen Dimitrov, Intralinks

3. Outline
   • Background
   • Legacy
   • About Progress
   • MPI Taxonomy
   • A glimpse at the past
   • A look toward the future

4. Progress
   • 25 years ago we as a community set out to standardize parallel programming
   • It worked ☺
   • An amazing "collective operation" (hmm… still not complete)
   • Some things about the other progress too: moving data independently of user calls to MPI …

5. Community
   • This was close to the beginning …

6. As we all know (agree?)
   • MPI defined progress as a "weak" requirement
   • MPI implementations don't have to move the data independently of when MPI is called
   • Implementations can do so
   • There is no need for an internally concurrent schedule to comply
   • For instance: do all the data movement at "Waitall" … predictable, if it is only required to happen by then! (see the sketch below)
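A minimal sketch (not from the slides) of what weak progress means in practice: with a weak-progress library, the Isend/Irecv below may move no data at all during the compute loop, and the entire transfer is allowed to happen inside MPI_Waitall. The message length and loop bound are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

#define COUNT (1 << 20)                     /* illustrative message length */

static double sendbuf[COUNT], recvbuf[COUNT];

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int peer = rank ^ 1;                    /* pair up neighboring ranks */
    if (peer < size) {
        MPI_Request req[2];
        MPI_Irecv(recvbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);

        /* Long compute phase with no MPI calls: a weak-progress
         * implementation need not move a single byte in this region. */
        volatile double acc = 0.0;
        for (long i = 0; i < 100000000L; i++) acc += 1e-9;

        /* A compliant implementation may do all of the data movement here. */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        printf("rank %d finished exchange\n", rank);
    }
    MPI_Finalize();
    return 0;
}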

7. How programs/programmers achieve progress
   • The MPI library runs the progress engine when you make most MPI calls
   • The MPI library does it for you
     ▼ In the transport; MPI just shepherds lightly
     ▼ In an internal thread (or threads) scheduled periodically
   • You kick the progress engine yourself ("self help"; see the sketch below)
     ▼ You call MPI_Test() sporadically in your user thread
     ▼ You schedule and call MPI_Test() in a helper thread
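A hedged sketch of the "self help" variant from this slide: the application calls MPI_Test between compute chunks to nudge the progress engine. do_compute_chunk(), COUNT, and NCHUNKS are illustrative placeholders, and MPI is assumed to be initialized elsewhere.

#include <mpi.h>

#define COUNT   (1 << 18)   /* illustrative message size   */
#define NCHUNKS 64          /* illustrative compute chunks */

static double recvbuf[COUNT];

/* Stand-in for one slice of the application's real computation. */
static void do_compute_chunk(int chunk) { (void)chunk; }

void overlapped_recv(int peer) {
    MPI_Request req;
    int done = 0;

    MPI_Irecv(recvbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);

    for (int chunk = 0; chunk < NCHUNKS; chunk++) {
        do_compute_chunk(chunk);                         /* useful work   */
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);    /* kick progress */
    }
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);               /* finish up     */
}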

  8. Desirements
   • Overlap communication and computation
   • Predictability / low jitter
   • Later: overlap of communication, computation, and I/O
   • Proviso: must have the memory bandwidth

  9. MPI Implementation Taxonomy (Dimitrov)
   • Message completion notification
     ▼ Asynchronous (blocking)
     ▼ Synchronous (polling)
   • Message progress
     ▼ Asynchronous (independent)
     ▼ Synchronous (polling)

10. Segmentation
   • Common technique for implementing overlapping through pipelining (see the sketch below)
   [Diagram: an entire message of size m vs. the same message split into s segments of size m/s, with each segment's transfer overlapped with a compute step]
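A minimal sketch of sender-side segmentation (an illustration, not code from the deck): the payload is split into NSEG pieces, so computing segment k+1 overlaps with the transfer of segment k. produce_segment() stands in for the application's real computation, LEN and NSEG are illustrative, and the receiver is assumed to post matching per-segment receives.

#include <mpi.h>

#define LEN  (1 << 20)     /* total doubles in the message, illustrative */
#define NSEG 16            /* number of segments s                       */

static double buf[LEN];

static void produce_segment(double *seg, int n) {   /* stand-in compute */
    for (int i = 0; i < n; i++) seg[i] = (double)i;
}

void segmented_send(int dest) {
    const int seglen = LEN / NSEG;                   /* assume LEN % NSEG == 0 */
    MPI_Request req = MPI_REQUEST_NULL;

    for (int k = 0; k < NSEG; k++) {
        double *seg = buf + (long)k * seglen;
        produce_segment(seg, seglen);            /* compute segment k          */
        MPI_Wait(&req, MPI_STATUS_IGNORE);       /* segment k-1 must be gone   */
        MPI_Isend(seg, seglen, MPI_DOUBLE, dest, k, MPI_COMM_WORLD, &req);
        /* While segment k is in flight, the next iteration computes
         * segment k+1 -- transfer and compute overlap, pipeline style. */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);           /* drain the last segment     */
}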

11. Optimal Segmentation
   [Plot: execution time T(s) vs. number of segments s, showing T_no-overlap at s = 1 and the minimum T_best at an intermediate segment count s_b, with s_m the largest segment count shown]
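A simple two-stage pipeline model (an illustrative assumption, not given in the deck) shows why an optimal segment count exists. With total compute c, message size m, per-byte transfer time \beta, and per-segment overhead o:

T(s) \;\approx\; \frac{c}{s} \;+\; \Bigl(o + \frac{m\beta}{s}\Bigr)
      \;+\; (s-1)\,\max\!\Bigl(\frac{c}{s},\; o + \frac{m\beta}{s}\Bigr),
\qquad
T(1) \;=\; c + o + m\beta \;=\; T_{\text{no overlap}}

The serial head and tail terms shrink as s grows while the (s-1)·o overhead grows, so T(s) reaches its minimum T_best at an intermediate segment count s_b, which is the shape sketched in this slide's plot.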

12. Performance Gain from Overlapping
   • Effect of overlapping on FFT global phase (execution time in seconds), p = 2
   [Plot: max execution time [sec] vs. number of segments (1–64) for 1M, 2M, and 4M points at p = 2]
     size   speedup
     1M     1.41
     2M     1.43
     4M     1.43

13. Performance Gain from Overlapping (cont.)
   • Effect of overlapping on FFT global phase (execution time in seconds), p = 4
   [Plot: max execution time [sec] vs. number of segments (1–64) for 1M, 2M, and 4M points at p = 4]
     size   speedup
     1M     1.31
     2M     1.32
     4M     1.33

14. Performance Gain from Overlapping (cont.)
   • Effect of overlapping on FFT global phase (execution time in seconds), p = 8
   [Plot: max execution time [sec] vs. number of segments (1–64) for 1M, 2M, and 4M points at p = 8]
     size   speedup
     1M     1.32
     2M     1.32
     4M     1.33

15. Effect of Message-Passing Library on Overlapping
   • Comparison between blocking and polling modes of MPI, n = 2M, p = 2
   [Plot: execution time [sec] (0–0.5) vs. number of segments (1–64), blocking vs. polling]

16. Effect of Message-Passing Library on Overlapping (cont.)
   • Comparison between blocking and polling modes of MPI, n = 2M, p = 8
   [Plot: execution time [sec] (0–0.5) vs. number of segments (1–64), blocking vs. polling]

17. Observations/Upshots
   • Completion notification method affects latency of short messages (i.e., < 4 KB on the legacy system)
   • Notification method did not affect bandwidth of long messages
   • Short-message programs
     ▼ Strong progress, polling notification
   • Long-message programs
     ▼ Strong progress, blocking notification

18. Future (soon?)
   • MPIs should support overlap and notification modes well
   • Overlap is worth at most a factor of 2 (3 if you include I/O); see the bound below
   • It is valuable in real algorithmic situations
   • Arguably growing in value at exascale
   • We need to expose this capability broadly, without the "self help" model
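The factor-of-2 claim follows from a standard argument (worked out here as a reminder, not taken from the deck): perfect overlap can at best hide the shorter of the two phases behind the longer one.

\text{speedup from overlap} \;=\;
\frac{T_{\text{comp}} + T_{\text{comm}}}{\max(T_{\text{comp}},\, T_{\text{comm}})}
\;\le\; 2,
\qquad \text{with equality when } T_{\text{comp}} = T_{\text{comm}}

With a third fully overlappable stream (I/O), the same argument bounds the gain at a factor of 3.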

19. Thank you
   • 25 years of progress
   • And still going strong …
   • Collective!
   • Nonblocking?
   • Persistent!
   • Fault Tolerant?
