Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided


  1. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided
     Robert Gerstenberger, Maciej Besta, Torsten Hoefler
     spcl.inf.ethz.ch, @spcl_eth

  5. MPI-3.0 Remote Memory Access
     - MPI-3.0 supports RMA ("MPI One Sided")
     - Designed to react to hardware trends
     - Majority of HPC networks support RDMA
     - Communication is "one sided" (no involvement of the destination)
     - RMA decouples communication and synchronization
     - Different from message passing
     Figure: two-sided vs. one-sided exchange between Proc A and Proc B. Two sided: send + recv carries out communication and synchronization together. One sided: put carries out the communication, and a separate sync carries out the synchronization.
     [1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf

  6. Presentation Overview
     1. Overview of three MPI-3 RMA concepts
     2. MPI window creation
     3. Communication
     4. Synchronization
     5. Application evaluation

  7. MPI-3 RMA Communication Overview
     Figure: Process A (passive) and Process B (active) each have memory with an MPI window; Processes C and D (active) are also shown. Active processes access the window of the passive process.
     - Non-atomic communication calls: Put, Get
     - Atomic communication calls: Accumulate (Acc), Get & Accumulate, Compare-and-Swap (CAS), Fetch-and-Op (FAO)

  12. MPI-3.0 RMA Synchronization Overview
     - Active Target Mode: Fence, Post/Start/Complete/Wait
     - Passive Target Mode: Lock, Lock All
     Figure legend: active process, passive process, synchronization, communication.

  20. Scalable Protocols & Reference Implementation
     - Scalable and generic protocols
       - Can be used on any RDMA network (e.g., OFED/IB)
       - Window creation, communication and synchronization
     - foMPI, a fully functional MPI-3 RMA implementation
       - DMAPP: lowest-level networking API for Cray Gemini/Aries systems
       - XPMEM: a portable Linux kernel module
     http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI

  23. Part 1: Scalable Window Creation
     Traditional windows (backwards compatible with MPI-2)
     Figure: Processes A, B and C each expose memory at a different base address (0x123, 0x120, 0x111).
     - $p$ = total number of processes
     - Time bound: $\mathcal{O}(p)$
     - Memory bound: $\mathcal{O}(p)$

  24. Part 1: Scalable Window Creation
     Allocated windows (allows MPI to allocate the memory)
     Figure: Processes A, B and C each expose memory at the same base address (0x123, 0x123, 0x123).
     - $p$ = total number of processes
     - Time bound: $\mathcal{O}(\log p)$ (whp)
     - Memory bound: $\mathcal{O}(1)$

  25. Part 1: Scalable Window Creation
     Dynamic windows (local attach/detach, most flexible)
     Figure: Processes A, B and C expose several memory regions at differing addresses (0x129, 0x129, 0x123, 0x120, 0x111).
     - $p$ = total number of processes
     - Time bound: $\mathcal{O}(p)$
     - Memory bound: $\mathcal{O}(p)$

  26. Part 2: Communication
     - Put and Get:
       - Direct DMAPP put and get operations or local (blocking) memcpy (XPMEM)
     - Accumulate:
       - DMAPP atomic operations for 64-bit types
       - ...or fall back to remote locking protocol
     - MPI datatype handling with MPITypes library [1]
       - Fast path for contiguous data transfers of common intrinsic datatypes (e.g., MPI_DOUBLE)
     Figure: MPI_Put maps to dmapp_put_nbi towards the remote process; MPI_Compare_and_swap maps to dmapp_acswap_qw_nbi towards the remote process (contiguous memory).
     [1] Ross, Latham, Gropp, Lusk, Thakur. Processing MPI datatypes outside MPI. EuroMPI/PVM'09

  27. Performance Inter-Node: Latency
     Figure: two panels, "Put Inter-Node" and "Get Inter-Node", with the annotations "80% faster" and "20% faster". Benchmark: half ping-pong (Proc 0 issues a put to Proc 1, followed by sync memory).

  28. Performance Intra-Node: Latency
     Figure: panel "Put/Get Intra-Node" with the annotation "3x faster". Benchmark: half ping-pong (Proc 0 issues a put to Proc 1, followed by sync memory).

  29. Performance: Overlap
     Figure: panel "Inter-Node", overlap in %. Benchmark: Proc 0 issues a put to Proc 1, computes, then calls sync memory.
     Useful for, e.g., scientific codes: AWM-Olsen (seismic), 3D FFT, MILC

  30. Performance: Message Rate
     Figure: two panels, "Intra-Node" and "Inter-Node". Benchmark: Proc 0 issues a series of puts to Proc 1, followed by sync memory.

  31. Performance: Atomics
     Figure annotations (64-bit integers):
     - hardware-accelerated protocol: lower latency
     - fall back protocol: higher bandwidth
     - proprietary

  32. Part 3: Synchronization
     - Active Target Mode: Fence, Post/Start/Complete/Wait
     - Passive Target Mode: Lock, Lock All
     Figure legend: active process, passive process, synchronization, communication.

  33. Scalable Fence Implementation
     - Collective call
     - Completes all outstanding memory operations
     Figure: Proc 0 to Proc 3, spread over Node 0 and Node 1, issue puts to one another before the fence.

         int MPI_Win_fence(…) {
           asm( mfence );        /* complete local memory operations */
           dmapp_gsync_wait();   /* wait for outstanding DMAPP operations */
           MPI_Barrier(...);     /* synchronize all processes of the window */
           return MPI_SUCCESS;
         }
