  1. Adaptive MPI: Overview & Recent Developments. Sam White, UIUC. Charm++ Workshop 2018

  2. Motivation • Exascale trends: • HW: increased node parallelism, decreased memory per thread • SW: applications becoming more complex and dynamic • How should applications and runtimes respond? • Incrementally, with MPI+X (X = OpenMP, Kokkos, MPI, etc.)? • Or rewrite in Legion, Charm++, HPX, etc.?

  3. Adaptive MPI • AMPI is an MPI implementation on top of Charm++ • AMPI offers Charm++’s application-independent features to MPI programmers: • Overdecomposition • Communication/computation overlap • Dynamic load balancing • Online fault tolerance

  4. Overview • Introduction • Features • Shared memory optimizations • Conclusions

  5. Execution Model • AMPI ranks are User-Level Threads (ULTs) • Can have multiple per core • Fast to context switch • Scheduled based on message delivery • Migratable across cores and nodes at runtime • For load balancing & fault tolerance

  6. Execution Model • [Diagram: Node 0 with Core 0 and Core 1, each running its own scheduler; Rank 0 is a user-level thread on Core 0 and Rank 1 on Core 1]

  7. Execution Model • [Diagram: Ranks 0-4 multiplexed over the schedulers on Core 0 and Core 1 of Node 0; one rank is in MPI_Send() and another in MPI_Recv()]
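As a concrete illustration of the diagram above (not code from the talk), here is a plain MPI ring exchange in C. Run under AMPI with several virtual ranks per core (e.g. "./pgm +vp 8" on two cores), each blocking MPI_Send()/MPI_Recv() becomes a scheduling point: the calling user-level thread suspends and the scheduler resumes another rank on the same core until the matching message arrives.

    #include <mpi.h>
    #include <stdio.h>

    /* Ring exchange: each rank sends a token to its right neighbor and
     * receives one from its left. Standard MPI code; under AMPI the blocking
     * calls let other virtual ranks on the same core run while this one
     * waits. Run with 2 or more ranks. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;
        int token = rank, received = -1;

        if (rank % 2 == 0) {   /* alternate send/recv order to avoid deadlock */
            MPI_Send(&token, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
            MPI_Recv(&received, 1, MPI_INT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&received, 1, MPI_INT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
        }

        printf("rank %d received token %d from rank %d\n", rank, received, left);
        MPI_Finalize();
        return 0;
    }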

  8. Execution Model • [Diagram: after AMPI_Migrate(), Ranks 0-6 are redistributed across the schedulers on Core 0 and Core 1 of Node 0]

  9. Thread Safety • AMPI virtualizes ranks as threads: is this safe?

  10. Thread Safety • AMPI virtualizes ranks as threads: is this safe? No, global variables are defined per process

  11. Thread Safety • AMPI programs are MPI programs without mutable global variables • Solutions: 1. Refactor the application to not use globals/statics; instead pass them on the stack 2. Swap ELF Global Offset Table entries at ULT context switch 3. Swap the Thread Local Storage pointer at context switch • Tag unsafe variables with C/C++ ‘thread_local’ or OpenMP ‘threadprivate’; the runtime manages the TLS • Work in progress: have the compiler privatize them for you, e.g., icc -fmpc-privatize
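As a sketch of option 3 above (not code from the talk): a mutable global is tagged with C11 _Thread_local (OpenMP codes would use threadprivate) so that each AMPI rank gets its own copy when AMPI's TLS-swapping privatization is enabled at build time.

    #include <mpi.h>
    #include <stdio.h>

    /* Without the annotation this counter would be shared by every AMPI rank
     * (user-level thread) in the OS process. Tagging it _Thread_local lets
     * AMPI's TLS-based privatization give each rank a private copy. */
    _Thread_local int iteration = 0;

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        iteration++;   /* per-rank, not per-process */
        printf("rank %d is at iteration %d\n", rank, iteration);

        MPI_Finalize();
        return 0;
    }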

  12. Conversion to AMPI • AMPI programs are MPI programs, with two caveats: 1. Without mutable global/static variables • Or with them properly handled 2. Possibly with calls to AMPI’s extensions • AMPI_Migrate() • Fortran main & command line arguments

  13. AMPI Fortran Support • AMPI implements the F77 and F90 MPI bindings • MPI -> AMPI Fortran conversion: • Rename ‘program main’ -> ‘subroutine mpi_main’ • Use the AMPI_ command line argument parsing routines • For large automatic arrays, increase the ULT stack size

  14. Overdecomposition • Bulk-synchronous codes often underutilize the network with alternating compute/communicate phases • Example: LULESH v2.0

  15. Overdecomposition • With overdecomposition, overlap communication of one rank with computation of others on its core

  16. Message-driven Execution • Overdecomposition spreads network injection over the whole timestep • [Plot: LULESH 2.0 communication over time, comparing 1 rank/core (980 KB) and 8 ranks/core (240 KB)]

  17. Migratability • AMPI ranks are migratable at runtime between address spaces • User-level thread stack + heap • Isomalloc memory allocator makes migration automatic • No user serialization code • Works everywhere but BG/Q & Windows • [Diagram: virtual address space from 0x00000000 (text, data, bss) to 0xFFFFFFFF, with the stack and heap of each user-level thread (threads 0-4) placed at distinct ranges by Isomalloc]

  18. Load Balancing • To enable load balancing in an AMPI program: 1. Insert a call to AMPI_Migrate(MPI_Info) • Info object is LB, Checkpoint, etc. 2. Link with Isomalloc and a load balancer: ampicc -memory isomalloc -module CommonLBs 3. Specify the number of virtual processes and a load balancing strategy at runtime: srun -n 100 ./pgm +vp 1000 +balancer RefineLB
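A minimal sketch of step 1 in an iterative code. The slide confirms the AMPI_Migrate(MPI_Info) call itself; the info key/value strings below are illustrative assumptions, so check the AMPI manual for the exact hints your version accepts.

    #include <mpi.h>   /* AMPI's mpi.h also declares the AMPI_ extensions */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Hints describing what to do at the migration point; per the slide
         * the info object selects "LB, Checkpoint, etc." The key and value
         * used here are illustrative assumptions. */
        MPI_Info hints;
        MPI_Info_create(&hints);
        MPI_Info_set(hints, "ampi_load_balance", "sync");

        for (int step = 0; step < 1000; step++) {
            /* ... compute and exchange data for this timestep ... */

            if (step % 100 == 99)      /* periodic synchronization point */
                AMPI_Migrate(hints);   /* runtime may migrate this rank here */
        }

        MPI_Info_free(&hints);
        MPI_Finalize();
        return 0;
    }

Steps 2 and 3 are then the link and launch lines already shown above (ampicc -memory isomalloc -module CommonLBs; srun -n 100 ./pgm +vp 1000 +balancer RefineLB).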

  19. Recent Work • AMPI can optimize for communication locality • Many ranks can reside on the same core • Same goes for process/socket/node • Load balancers can take the communication graph into consideration

  20. AMPI Shared Memory • Many AMPI ranks can share the same OS process

  21. Existing Performance • Small message latency on Quartz (LLNL) • [Plot: 1-way latency (us) vs. message size (1 B to 64 KB) for MVAPICH P2, IMPI P2, OpenMPI P2, AMPI P2, and AMPI P1] (ExaMPI 2017)

  22. Existing Performance • Large message latency on Quartz • [Plots: latency (us) vs. message size, one panel in KB and one in MB, for the same implementations] (ExaMPI 2017)

  23. Performance Analysis • Breakdown of P1 time (us) per message on Quartz • Scheduling: Charm++ scheduler & ULT context switching • Memory copy: message payload movement • Other: AMPI message creation & matching

  24. Scheduling Overhead 1. Even for P1, all AMPI messages traveled through Charm++’s scheduler • Use Charm++ [inline] tasks 2. ULT context switching overhead • Faster with Boost ULTs 3. Avoid resuming threads without real progress • MPI_Waitall: keep track of the number of requests “blocked on” • Result: P1 0-byte latency improves from 1.27 us to 0.66 us

  25. Memory Copy Overhead • Q: Even with [inline] tasks, AMPI P1 performs poorly for large messages. Why? • A: Charm++ messaging semantics do not match MPI’s • In Charm++, messages are first-class objects • Users pass ownership of messages to the runtime when sending and assume it when receiving • Only applications that can reuse message objects in their data structures can perform “zero copy” transfers

  26. Memory Copy Overhead • To overcome Charm++ messaging semantics in shared memory, use a rendezvous protocol: • The receiver performs a direct (user-space) memcpy from sendbuf to recvbuf • Benefit: avoids the intermediate copy • Cost: synchronization; the sender must suspend & be resumed upon copy completion • Result: P1 1-MB latency improves from 165 us to 82 us
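A toy model of that rendezvous copy, not AMPI's implementation: here POSIX threads and a condition variable stand in for AMPI's user-level thread suspend/resume. The sender publishes its buffer address and blocks; the receiver memcpy()s directly from sendbuf to recvbuf and then wakes the sender, so the payload is copied exactly once and never staged in an intermediate buffer.

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define MSG_BYTES (1 << 20)

    /* One-slot rendezvous mailbox shared by the two threads. */
    static struct {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        const char     *sendbuf;   /* published by the sender */
        size_t          len;
        int             done;      /* set by the receiver after the copy */
    } rdv = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL, 0, 0 };

    static char payload[MSG_BYTES];   /* sender's data */
    static char recvbuf[MSG_BYTES];   /* receiver's destination */

    static void *sender(void *arg) {
        (void)arg;
        memset(payload, 'x', sizeof payload);

        pthread_mutex_lock(&rdv.lock);
        rdv.sendbuf = payload;              /* publish the buffer address */
        rdv.len = sizeof payload;
        pthread_cond_signal(&rdv.cond);
        while (!rdv.done)                   /* "suspend" until the copy completes */
            pthread_cond_wait(&rdv.cond, &rdv.lock);
        pthread_mutex_unlock(&rdv.lock);
        return NULL;
    }

    static void *receiver(void *arg) {
        (void)arg;
        pthread_mutex_lock(&rdv.lock);
        while (rdv.sendbuf == NULL)         /* wait for the sender to arrive */
            pthread_cond_wait(&rdv.cond, &rdv.lock);
        memcpy(recvbuf, rdv.sendbuf, rdv.len);  /* one direct user-space copy */
        rdv.done = 1;
        pthread_cond_signal(&rdv.cond);     /* "resume" the suspended sender */
        pthread_mutex_unlock(&rdv.lock);
        return NULL;
    }

    int main(void) {
        pthread_t s, r;
        pthread_create(&s, NULL, sender, NULL);
        pthread_create(&r, NULL, receiver, NULL);
        pthread_join(s, NULL);
        pthread_join(r, NULL);
        printf("copied %d bytes directly, first byte = %c\n", MSG_BYTES, recvbuf[0]);
        return 0;
    }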

  27. Other Overheads • Sender-side: create a Charm++ message object & a request • Receiver-side: create a request and a matching-queue entry; dequeue from unexpectedMsgs or enqueue in postedReqs • Solution: use memory pools for fixed-size, frequently-used objects • Optimize for common usage patterns, e.g. MPI_Waitall with a mix of send and recv requests • Result: P1 0-byte latency improves from 0.66 us to 0.54 us
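A generic free-list pool of the kind the slide describes (a sketch, not AMPI's code): fixed-size request objects are recycled through a singly linked free list, so the per-message hot path avoids malloc()/free() after warm-up.

    #include <stdlib.h>
    #include <stdio.h>

    /* A stand-in for AMPI's per-message request object. */
    typedef struct Request {
        int             tag;
        int             src;
        void           *buf;
        struct Request *next;   /* free-list link, unused while allocated */
    } Request;

    static Request *free_list = NULL;   /* single-threaded sketch: no locking */

    /* Pop from the free list if possible, otherwise fall back to malloc. */
    static Request *request_alloc(void) {
        if (free_list) {
            Request *r = free_list;
            free_list = r->next;
            return r;
        }
        return malloc(sizeof(Request));
    }

    /* Return an object to the pool instead of freeing it. */
    static void request_release(Request *r) {
        r->next = free_list;
        free_list = r;
    }

    int main(void) {
        /* Typical hot path: allocate and release one request per message.
         * After the first iteration every allocation is served from the pool. */
        for (int i = 0; i < 1000000; i++) {
            Request *r = request_alloc();
            r->tag = i;
            request_release(r);
        }
        printf("done\n");
        return 0;
    }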

  28. AMPI-shm Performance • Small message latency on Quartz • AMPI-shm P2 is faster than the other implementations for messages of 2 KB and larger

  29. AMPI-shm Performance • Large message latency on Quartz • AMPI-shm P2 is fastest for all large messages, up to 2.33x faster than the process-based MPIs for 32+ MB

  30. AMPI-shm Performance • Bidirectional bandwidth on Quartz • AMPI-shm can utilize the full memory bandwidth • 26% higher peak bandwidth, and 2x the bandwidth of the others for 32+ MB • [Plot: bidirectional bandwidth (MB/s) vs. message size, with STREAM copy bandwidth as a reference]

  31. AMPI-shm Performance • Small message latency on Cori-Haswell • [Plot: latency (us) vs. message size (1 B to 64 KB) for Cray MPI P2, AMPI-shm P2, and AMPI-shm P1]

  32. AMPI-shm Performance • Large message latency on Cori-Haswell • AMPI-shm P2 is 47% faster than Cray MPI at 32+ MB

  33. AMPI-shm Performance • Bidirectional bandwidth on Cori-Haswell • Cray MPI on XPMEM performs similarly to AMPI-shm up to 16 MB • [Plot: bidirectional bandwidth (MB/s) vs. message size, with STREAM copy bandwidth as a reference]

  35. Summary • User-space communication offers portable intranode messaging performance • Lower latency: 1.5x-2.3x for large messages • Higher bandwidth: 1.3x-2x for large messages • Intermediate buffering is unnecessary for medium/large messages

  36. Conclusions • AMPI provides application-independent runtime support for existing MPI applications: • Overdecomposition • Latency tolerance • Dynamic load balancing • Automatic fault detection & recovery • See the AMPI manual for more info

  37. This material is based in part upon work supported by the Department of Energy, National Nuclear Security Administration, under Award Number DE-NA0002374.

  38. Questions? Thank you
