Adaptive MPI: Overview & Recent Developments
Sam White, UIUC
Charm++ Workshop 2018
Motivation
• Exascale trends:
  • HW: increased node parallelism, decreased memory per thread
  • SW: applications becoming more complex and dynamic
• How should applications and runtimes respond?
  • Incrementally: MPI+X (X = OpenMP, Kokkos, MPI, etc.)?
  • Rewrite in: Legion, Charm++, HPX, etc.?
Adaptive MPI
• AMPI is an MPI implementation on top of Charm++
• AMPI offers Charm++'s application-independent features to MPI programmers:
  • Overdecomposition
  • Communication/computation overlap
  • Dynamic load balancing
  • Online fault tolerance
Overview
• Introduction
• Features
• Shared memory optimizations
• Conclusions
Execution Model
• AMPI ranks are User-Level Threads (ULTs)
  • Can have multiple per core
  • Fast to context switch
  • Scheduled based on message delivery
• Migratable across cores and nodes at runtime
  • For load balancing & fault tolerance
Execution Model
[Figure: Node 0 with two cores, each running a Charm++ scheduler and hosting one AMPI rank]
Execution Model
[Figure: overdecomposition on Node 0: several ranks per core, with one rank calling MPI_Send() and another calling MPI_Recv(); each core's scheduler delivers messages between ranks]
Execution Model
[Figure: AMPI_Migrate() moving ranks between the two cores of Node 0 while each core's scheduler keeps running]
Thread Safety
• AMPI virtualizes ranks as threads: is this safe?
• No: global variables are defined per process, not per rank
Thread Safety
• AMPI programs are MPI programs without mutable global variables
• Solutions:
  1. Refactor the application to avoid globals/statics, passing state on the stack instead
  2. Swap ELF Global Offset Table (GOT) entries at ULT context switch
  3. Swap the Thread Local Storage (TLS) pointer at ULT context switch
     • Tag unsafe variables with C/C++ 'thread_local' or OpenMP 'threadprivate'; the runtime manages the TLS (see the sketch below)
• Work in progress: have the compiler privatize them for you, e.g., icc -fmpc-privatize
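As a minimal sketch of option 3 (the variable name here is hypothetical), a mutable global is tagged so that each AMPI rank gets its own copy when the program is built with AMPI's TLS-based privatization support (see the AMPI manual for the exact build flag):

    /* Sketch: privatizing a mutable global for AMPI. With TLS-based
     * privatization, the runtime swaps the TLS pointer at each ULT context
     * switch, so every AMPI rank sees its own copy of this counter. */
    #include <mpi.h>
    #include <stdio.h>

    static __thread int iteration_count = 0;   /* or C11 _Thread_local / OpenMP threadprivate */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        iteration_count++;                      /* per-rank, not per-process */
        printf("rank %d: iteration_count = %d\n", rank, iteration_count);
        MPI_Finalize();
        return 0;
    }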
Conversion to AMPI
• AMPI programs are MPI programs, with two caveats:
  1. Without mutable global/static variables
     • Or with them properly handled (see Thread Safety)
  2. Possibly with calls to AMPI's extensions
     • AMPI_Migrate()
     • Fortran main & command line arguments
AMPI Fortran Support
• AMPI implements the F77 and F90 MPI bindings
• MPI -> AMPI Fortran conversion:
  • Rename 'program main' -> 'subroutine mpi_main'
  • Use AMPI_ command line argument parsing routines
  • Automatic arrays: increase the ULT stack size
Overdecomposition
• Bulk-synchronous codes often underutilize the network during alternating compute/communicate phases
• Example: LULESH v2.0
Overdecomposition
• With overdecomposition, the communication of one rank overlaps with the computation of the other ranks on its core (sketched below)
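To make the overlap concrete, here is a hedged sketch of an ordinary bulk-synchronous halo exchange (the ring pattern and sizes are made up for illustration). The MPI code itself is unchanged; under AMPI with several ranks per core, MPI_Waitall suspends only the calling ULT, and the core's scheduler runs other ranks in the meantime:

    /* Illustrative bulk-synchronous loop: standard MPI code. Run as AMPI ULTs
     * (e.g. 8 ranks per core), a rank blocked in MPI_Waitall yields the core
     * to other ranks that still have computation left. */
    #include <mpi.h>

    #define N 1024

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int next = (rank + 1) % size, prev = (rank + size - 1) % size;
        double halo_out[N], halo_in[N];
        for (int step = 0; step < 100; step++) {
            for (int i = 0; i < N; i++)
                halo_out[i] = step + i * 0.5;        /* stand-in for real computation */
            MPI_Request reqs[2];
            MPI_Irecv(halo_in,  N, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(halo_out, N, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[1]);
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* this ULT suspends here */
        }
        MPI_Finalize();
        return 0;
    }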
Message-driven Execution
• Overdecomposition spreads network injection over the whole timestep
[Figure: LULESH 2.0 communication over time, comparing 1 rank/core (980 KB) with 8 ranks/core (240 KB)]
Migratability
• AMPI ranks are migratable at runtime between address spaces
  • User-level thread stack + heap
• Isomalloc memory allocator makes migration automatic
  • No user serialization code
  • Works everywhere but BG/Q & Windows
[Figure: two address spaces with the same layout: per-thread stacks and heaps above the bss, data, and text segments, spanning 0x00000000 to 0xFFFFFFFF]
Load Balancing
• To enable load balancing in an AMPI program:
  1. Insert a call to AMPI_Migrate(MPI_Info)
     • The Info object selects LB, checkpoint, etc. (see the sketch below)
  2. Link with Isomalloc and a load balancer:
     ampicc -memory isomalloc -module CommonLBs
  3. Specify the number of virtual processes and a load balancing strategy at runtime:
     srun -n 100 ./pgm +vp 1000 +balancer RefineLB
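A minimal sketch of step 1, assuming the "ampi_load_balance" / "sync" Info key and value as I recall them from the AMPI manual; verify the exact strings against your AMPI version:

    /* Sketch: asking AMPI to load balance at iteration boundaries.
     * Compile with ampicc, which provides the AMPI_Migrate extension.
     * The Info key/value strings below are assumptions to check against
     * the AMPI manual. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        MPI_Info hints;
        MPI_Info_create(&hints);
        MPI_Info_set(hints, "ampi_load_balance", "sync");

        for (int step = 0; step < 1000; step++) {
            /* ... compute and communicate ... */
            if (step % 100 == 0)
                AMPI_Migrate(hints);   /* collective: ranks may move between cores/nodes */
        }

        MPI_Info_free(&hints);
        MPI_Finalize();
        return 0;
    }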
Recent Work
• AMPI can optimize for communication locality
  • Many ranks can reside on the same core
  • The same goes for the same process, socket, or node
  • Load balancers can take the communication graph into consideration
AMPI Shared Memory
• Many AMPI ranks can share the same OS process
Existing Performance
• Small message latency on Quartz (LLNL)
[Figure: 1-way latency (us) vs. message size (bytes) for MVAPICH P2, IMPI P2, OpenMPI P2, AMPI P2, and AMPI P1]
(ExaMPI 2017)
Existing Performance
• Large message latency on Quartz
[Figure: latency (us) vs. message size, in two panels covering the KB and MB ranges]
(ExaMPI 2017)
Performance Analysis
• Breakdown of P1 time (us) per message on Quartz
  • Scheduling: Charm++ scheduler & ULT context switching
  • Memory copy: message payload movement
  • Other: AMPI message creation & matching
Scheduling Overhead
1. Even for P1, all AMPI messages traveled through Charm++'s scheduler
   • Use Charm++ [inline] tasks
2. ULT context switching overhead
   • Faster with Boost ULTs
3. Avoid resuming threads without real progress
   • MPI_Waitall: keep track of the number of requests it is blocked on (see the sketch below)
• P1 0-byte latency: 1.27 us -> 0.66 us
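A rough sketch of idea 3, not AMPI's actual internals: the waiting ULT records how many of its requests are incomplete and is resumed only when that count reaches zero, rather than on every message delivery. The runtime calls here are placeholders.

    /* Conceptual sketch only (not AMPI source code). */
    typedef struct {
        int   pending;     /* requests this ULT is still blocked on */
        void *ult;         /* handle for resuming the suspended user-level thread */
    } wait_state_t;

    static void resume_ult(void *ult)     { (void)ult; /* placeholder: mark ULT runnable */ }
    static void suspend_current_ult(void) { /* placeholder: yield to the scheduler */ }

    /* Runtime side: called as each request in the waited-on set completes. */
    static void on_request_complete(wait_state_t *w) {
        if (--w->pending == 0)
            resume_ult(w->ult);            /* wake the ULT exactly once */
    }

    /* Application ULT side: inside a Waitall-like call. */
    static void waitall_sketch(wait_state_t *w, int num_incomplete, void *self_ult) {
        w->pending = num_incomplete;
        w->ult     = self_ult;
        if (num_incomplete > 0)
            suspend_current_ult();         /* no spurious wakeups as requests trickle in */
    }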
Memory Copy Overhead
• Q: Even with [inline] tasks, AMPI P1 performs poorly for large messages. Why?
• A: Charm++ messaging semantics do not match MPI's
  • In Charm++, messages are first-class objects
  • Users pass ownership of messages to the runtime when sending and assume it when receiving
  • Only applications that can reuse message objects in their data structures can perform "zero copy" transfers
Memory Copy Overhead
• To overcome Charm++ messaging semantics in shared memory, use a rendezvous protocol:
  • The receiver performs a direct (userspace) memcpy from the send buffer to the receive buffer
  • Benefit: avoids the intermediate copy
  • Cost: synchronization; the sender must suspend and be resumed upon copy completion (see the sketch below)
• P1 1-MB latency: 165 us -> 82 us
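A conceptual sketch of the protocol, not AMPI's actual code: within one process, the sender publishes its buffer address and suspends; the matching receiver copies directly from that buffer and then signals completion. The helper names are hypothetical placeholders.

    #include <stddef.h>
    #include <string.h>

    typedef struct {
        const void *src;         /* sender's buffer, valid while the sender is suspended */
        size_t      bytes;
        void       *sender_ult;  /* handle for resuming the suspended sender */
    } rts_t;                     /* "ready to send" descriptor */

    static void deliver_to_receiver(rts_t *rts) { (void)rts; /* placeholder: match with a posted recv */ }
    static void suspend_current_ult(void)       { /* placeholder: yield this ULT to the scheduler */ }
    static void resume_ult(void *ult)           { (void)ult; /* placeholder: make the sender runnable */ }

    /* Sender: publish the buffer address, then block until the copy is done. */
    static void rendezvous_send(rts_t *rts, const void *sendbuf, size_t n, void *self_ult) {
        rts->src        = sendbuf;
        rts->bytes      = n;
        rts->sender_ult = self_ult;
        deliver_to_receiver(rts);
        suspend_current_ult();                   /* woken by the receiver after the memcpy */
    }

    /* Receiver: one direct userspace copy, sendbuf -> recvbuf, then wake the sender. */
    static void rendezvous_recv(rts_t *rts, void *recvbuf) {
        memcpy(recvbuf, rts->src, rts->bytes);   /* no intermediate buffer */
        resume_ult(rts->sender_ult);             /* sender may now reuse its buffer */
    }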
Other Overheads
• Sender-side:
  • Create a Charm++ message object & a request
• Receiver-side:
  • Create a request and a matching-queue entry; dequeue from unexpectedMsgs or enqueue in postedReqs
• Solution: use memory pools for fixed-size, frequently-used objects (see the sketch below)
  • Optimize for common usage patterns, e.g., MPI_Waitall with a mix of send and recv requests
• P1 0-byte latency: 0.66 us -> 0.54 us
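A minimal free-list pool, shown as an illustration of the technique rather than AMPI's implementation: fixed-size objects such as requests and matching-queue entries are recycled instead of being malloc'd and freed per message.

    /* Minimal fixed-size object pool (illustrative only). */
    #include <stdlib.h>

    typedef struct pool_node { struct pool_node *next; } pool_node_t;

    typedef struct {
        pool_node_t *free_list;   /* singly linked list of returned objects */
        size_t       obj_size;    /* must be >= sizeof(pool_node_t) */
    } pool_t;

    static void pool_init(pool_t *p, size_t obj_size) {
        p->free_list = NULL;
        p->obj_size  = obj_size < sizeof(pool_node_t) ? sizeof(pool_node_t) : obj_size;
    }

    /* Reuse a previously freed object if one is available, avoiding malloc on the fast path. */
    static void *pool_alloc(pool_t *p) {
        if (p->free_list) {
            pool_node_t *n = p->free_list;
            p->free_list = n->next;
            return n;
        }
        return malloc(p->obj_size);
    }

    /* Return the object to the pool instead of freeing it. */
    static void pool_free(pool_t *p, void *obj) {
        pool_node_t *n = (pool_node_t *)obj;
        n->next = p->free_list;
        p->free_list = n;
    }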
AMPI-shm Performance
• Small message latency on Quartz
• AMPI-shm P2 is faster than the other implementations for messages of 2 KB and larger
AMPI-shm Performance
• Large message latency on Quartz
• AMPI-shm P2 is fastest for all large messages: up to 2.33x faster than process-based MPIs for 32 MB and larger
AMPI-shm Performance
• Bidirectional bandwidth on Quartz
• AMPI-shm can utilize the full memory bandwidth
  • 26% higher peak and 2x the bandwidth of the others for 32 MB and larger
[Figure: bidirectional bandwidth (MB/s) vs. message size, with the STREAM copy bandwidth shown for reference]
AMPI-shm Performance
• Small message latency on Cori-Haswell
[Figure: latency (us) vs. message size (bytes) for Cray MPI P2, AMPI-shm P2, and AMPI-shm P1]
AMPI-shm Performance
• Large message latency on Cori-Haswell
• AMPI-shm P2 is 47% faster than Cray MPI at 32 MB and larger
AMPI-shm Performance
• Bidirectional bandwidth on Cori-Haswell
• Cray MPI over XPMEM performs similarly to AMPI-shm up to 16 MB
[Figure: bidirectional bandwidth (MB/s) vs. message size, with the STREAM copy bandwidth shown for reference]
Summary
• User-space communication offers portable intranode messaging performance
  • Lower latency: 1.5x-2.3x for large messages
  • Higher bandwidth: 1.3x-2x for large messages
• Intermediate buffering is unnecessary for medium and large messages
Conclusions
• AMPI provides application-independent runtime support for existing MPI applications:
  • Overdecomposition
  • Latency tolerance
  • Dynamic load balancing
  • Automatic fault detection & recovery
• See the AMPI manual for more information
This material is based in part upon work supported by the Department of Energy, National Nuclear Security Administration, under Award Number DE-NA0002374.
Questions?
Thank you