spcl.inf.ethz.ch @spcl_eth
ROBERTO BELLI, TORSTEN HOEFLER
Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization
COMMUNICATION IN TODAY'S HPC SYSTEMS
§ The de-facto programming model: MPI-1
  § Using send/recv messages and collectives
§ The de-facto network standard: RDMA
  § Zero-copy, user-level, OS-bypass, fuzz-bang
PRODUCER-CONSUMER RELATIONS
§ Most important communication idiom
§ Some examples: [figures not preserved]
§ Perfectly supported by MPI-1 Message Passing
§ But how does this actually work over RDMA?
MPI-1 MESSAGE PASSING – SIMPLE EAGER
[Figure: eager protocol message flow]
Critical path: 1 latency + 1 copy
[1] T. S. Woodall, G. M. Shipman, G. Bosilca, R. L. Graham, and A. B. Maccabe, "High performance RDMA protocols in HPC", EuroMPI'06
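The eager critical path can be turned into a rough cost model. The following is a minimal Python sketch, not MPI code: the function name and the example numbers (1 µs latency, 10 GB/s copy bandwidth) are illustrative assumptions, and wire transfer time is left out, as it is on the slide.

```python
def eager_critical_path_us(latency_us, msg_bytes, copy_bw_bytes_per_us):
    """Eager protocol: one network latency, then one receiver-side copy
    from the internal receive buffer into the user buffer."""
    copy_us = msg_bytes / copy_bw_bytes_per_us
    return latency_us + copy_us


# Assumed numbers: 1 us latency, 10 GB/s copy bandwidth = 10_000 bytes/us.
small = eager_critical_path_us(1.0, 1_000, 10_000)       # copy adds 0.1 us
large = eager_critical_path_us(1.0, 1_000_000, 10_000)   # copy adds 100 us
```

Under these assumptions the copy is negligible for a 1 kB message but dominates for a 1 MB message, which is why the eager protocol is normally reserved for small messages.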
MPI-1 MESSAGE PASSING – SIMPLE RENDEZVOUS
[Figure: rendezvous protocol message flow]
Critical path: 3 latencies
[1] T. S. Woodall, G. M. Shipman, G. Bosilca, R. L. Graham, and A. B. Maccabe, "High performance RDMA protocols in HPC", EuroMPI'06
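One way to arrive at three latencies is a read-based handshake. The Python sketch below assumes that variant (ready-to-send, RDMA read request, payload return); the slide's figure is not preserved, so the exact steps are an assumption, as are the function names. It also derives the message size at which the two extra latencies cost as much as the eager protocol's copy, using L + S/B = 3L, so S = 2·L·B.

```python
def rendezvous_trace(latency_us):
    """Return (time_us, event) pairs for an assumed read-based rendezvous."""
    steps = [
        "sender -> receiver: ready-to-send (RTS) with source buffer address",
        "receiver -> sender: RDMA read request",
        "sender -> receiver: payload lands in the user buffer (zero copy)",
    ]
    # Each step crosses the network once, so it adds one latency.
    return [((i + 1) * latency_us, step) for i, step in enumerate(steps)]


def crossover_bytes(latency_us, copy_bw_bytes_per_us):
    """Size where eager (L + S/B) equals rendezvous (3L): S = 2 * L * B."""
    return 2 * latency_us * copy_bw_bytes_per_us


trace = rendezvous_trace(1.0)
critical_path_us = trace[-1][0]
# Assumed numbers: 1 us latency, 10 GB/s copy bandwidth = 10_000 bytes/us.
threshold = crossover_bytes(1.0, 10_000)
```

With the assumed numbers the threshold is 20,000 bytes; real MPI libraries pick their eager limit empirically, so this is only an order-of-magnitude illustration.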
COMMUNICATION IN TODAY'S HPC SYSTEMS
§ The de-facto programming model: MPI-1
  § Using send/recv messages and collectives
§ The de-facto hardware standard: RDMA
  § Zero-copy, user-level, OS-bypass, fuzz-bang
http://www.hpcwire.com/2006/08/18/a_critique_of_rdma-1/
REMOTE MEMORY ACCESS PROGRAMMING
§ Why not use these RDMA features more directly?
  § A global address space may simplify programming
  § … and accelerate communication
  § … and there could be a widely accepted standard
§ MPI-3 RMA ("MPI One Sided") was born
  § Just one among many others (UPC, CAF, …)
  § Designed to react to hardware trends, learn from others
  § Direct (hardware-supported) remote access
  § New way of thinking for programmers
[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
MPI-3 RMA SUMMARY
§ MPI-3 updates RMA ("MPI One Sided")
  § Significant change from MPI-2
§ Communication is "one sided" (no involvement of destination)
  § Utilize direct memory access
§ RMA decouples communication & synchronization
  § Fundamentally different from message passing
[Figure: two sided: Proc A send and Proc B recv combine communication + synchronization; one sided: Proc A put is communication only, a separate sync provides synchronization]
[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
MPI-3 RMA COMMUNICATION OVERVIEW
[Figure: Process A (passive) and Process B (active), each with an MPI window in memory; Put, Get, and Atomic operations between the windows; Process C (active), Process D (active), …]
§ Non-atomic communication calls (put, get)
§ Atomic communication calls (Acc, Get & Acc, CAS, FAO)
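The semantics of these calls can be illustrated with a single-process toy model. The Python sketch below is not MPI and ignores concurrency entirely; the `ToyWindow` class is hypothetical, and its methods only mirror what MPI_Put, MPI_Get, MPI_Accumulate (Acc), MPI_Get_accumulate (Get & Acc), MPI_Compare_and_swap (CAS), and MPI_Fetch_and_op (FAO) do to the target memory.

```python
import operator


class ToyWindow:
    """Single-process stand-in for an exposed memory region (not MPI)."""

    def __init__(self, size):
        self.mem = [0] * size

    def put(self, offset, values):
        # Non-atomic write into the window.
        self.mem[offset:offset + len(values)] = values

    def get(self, offset, count):
        # Non-atomic read from the window.
        return self.mem[offset:offset + count]

    def accumulate(self, offset, values, op=operator.add):
        # Acc: combine each incoming element with the stored element.
        for i, v in enumerate(values):
            self.mem[offset + i] = op(self.mem[offset + i], v)

    def get_accumulate(self, offset, values, op=operator.add):
        # Get & Acc: return the old contents, then accumulate.
        old = self.get(offset, len(values))
        self.accumulate(offset, values, op)
        return old

    def fetch_and_op(self, offset, value, op=operator.add):
        # FAO: single-element form of get_accumulate.
        return self.get_accumulate(offset, [value], op)[0]

    def compare_and_swap(self, offset, expected, desired):
        # CAS: store `desired` only if the slot holds `expected`.
        old = self.mem[offset]
        if old == expected:
            self.mem[offset] = desired
        return old


win = ToyWindow(4)
win.put(0, [1, 2])
win.accumulate(0, [10, 10])
ticket = win.fetch_and_op(2, 5)
first_cas = win.compare_and_swap(3, 0, 7)    # succeeds: slot held 0
second_cas = win.compare_and_swap(3, 0, 9)   # fails: slot now holds 7
```

In real MPI the atomic calls guarantee element-wise atomicity against concurrent accesses from other processes, which this sequential model cannot show.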
MPI-3 RMA SYNCHRONIZATION OVERVIEW
§ Active Target Mode: Fence, Post/Start/Complete/Wait
§ Passive Target Mode: Lock, Lock All
[Figure: timelines of active and passive processes showing synchronization and communication phases]
MPI-3 RMA SYNCHRONIZATION OVERVIEW
IN CASE YOU WANT TO LEARN MORE [callout content not preserved]
§ How to implement producer/consumer in passive mode?
ONE SIDED – PUT + SYNCHRONIZATION
[Figure: one-sided put followed by synchronization]
Critical path: 3 latencies
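A plausible reading of the three latencies is: the data put travels to the consumer, the producer waits for remote completion (a flush acknowledgement travelling back), and a second put delivers a flag that the consumer polls. The Python sketch below models that sequence. It is an assumption about the figure, not a transcription of it, and the function names and the dictionary standing in for the consumer's window are hypothetical.

```python
def produce_with_flag(window, payload, latency_us):
    """Model put + flush + flag put; return the elapsed critical path."""
    elapsed = 0.0
    window["data"] = list(payload)   # put: producer -> consumer
    elapsed += latency_us
    # flush: remote-completion ack travels consumer -> producer
    elapsed += latency_us
    window["flag"] = 1               # second put carries the notification
    elapsed += latency_us
    return elapsed


def consume(window):
    """Poll the flag; return the data once, or None if nothing is ready."""
    if window["flag"] != 1:
        return None
    window["flag"] = 0
    return window["data"]


window = {"data": None, "flag": 0}
before = consume(window)                          # nothing produced yet
elapsed_us = produce_with_flag(window, [1, 2, 3], 1.5)
after = consume(window)
```

The flush in the middle is what keeps the flag from overtaking the data; it is also what makes the critical path three latencies instead of one.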
COMPARING APPROACHES
Message Passing: 1 latency + copy / 3 latencies
One Sided: 3 latencies
IDEA: RMA NOTIFICATIONS
§ First seen in Split-C (1992)
§ Combine communication and synchronization using RDMA
§ RDMA networks can provide various notifications
  § Flags
  § Counters
  § Event Queues
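Of the three mechanisms listed, an event queue is the one that records each notification as a separate entry. The Python sketch below is a toy model of that idea only; the `EventQueue` class, its method names, and the `(source_rank, tag)` entry layout are hypothetical and do not describe any particular network API.

```python
from collections import deque


class EventQueue:
    """Toy notification queue: one entry per completed remote access."""

    def __init__(self):
        self._events = deque()

    def remote_notify(self, source_rank, tag):
        # Models the network appending an entry when an access completes.
        self._events.append((source_rank, tag))

    def poll(self):
        # Return the oldest pending notification, or None if empty.
        return self._events.popleft() if self._events else None


queue = EventQueue()
queue.remote_notify(source_rank=3, tag="halo")
queue.remote_notify(source_rank=1, tag="halo")
first = queue.poll()
second = queue.poll()
empty = queue.poll()
```

A single poll inspects one location and yields both the source and the arrival order of each notification.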
COMPARING APPROACHES
Message Passing: 1 latency + copy / 3 latencies
One Sided: 3 latencies
Notified Access: 1 latency
But how to notify?
PREVIOUS WORK: OVERWRITING INTERFACE
§ Flags (polling at the remote side)
  § Used in GASPI, DMAPP, NEON
§ Disadvantages
  § Location of the flag chosen at the sender side
  § Consumer needs at least one flag for every process
  § Polling a high number of flags is inefficient
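The polling cost can be made concrete with a toy model. The Python sketch below is not GASPI, DMAPP, or NEON code; the `FlagNotifier` class is hypothetical and simply keeps one flag per potential producer, with the producer choosing which slot it overwrites.

```python
class FlagNotifier:
    """Toy model of an overwriting (flag) interface; not a real library."""

    def __init__(self, num_producers):
        self.flags = [0] * num_producers   # one flag per potential producer
        self.slots_scanned = 0

    def remote_set(self, producer_rank):
        # Models a remote put that overwrites the producer's flag.
        self.flags[producer_rank] = 1

    def poll(self):
        """Scan every flag, returning and clearing the ones that are set."""
        ready = []
        for rank, flag in enumerate(self.flags):
            self.slots_scanned += 1
            if flag:
                ready.append(rank)
                self.flags[rank] = 0
        return ready


notifier = FlagNotifier(num_producers=1024)
notifier.remote_set(7)
notifier.remote_set(7)          # second notification overwrites the first
ready = notifier.poll()         # rank 7 is reported once
```

In this model the consumer touches all 1024 slots to find a single notification, and two notifications from the same producer collapse into one.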
PREVIOUS WORK: COUNTING INTERFACE
§ Atomic counters (accumulate notifications → scalable)
  § Used in Split-C, LAPI, SHMEM - Counting Puts, …
§ Disadvantages
  § Dataflow applications may require many counters
  § High polling overhead to identify accesses
  § Does not preserve order (may not be linearizable)
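The loss of identity and order is easy to see in a toy model. The Python sketch below is not Split-C, LAPI, or SHMEM code; the `CountingNotifier` class is hypothetical, and the increment is a plain addition standing in for a remote atomic.

```python
class CountingNotifier:
    """Toy model of a counting interface; not a real library."""

    def __init__(self):
        self.counter = 0

    def remote_increment(self, producer_rank):
        # The rank is accepted only to show that the counter discards it.
        self.counter += 1

    def reached(self, expected):
        # The consumer can only ask whether enough notifications arrived.
        return self.counter >= expected


first = CountingNotifier()
for rank in (3, 5):
    first.remote_increment(rank)

second = CountingNotifier()
for rank in (5, 3):
    second.remote_increment(rank)

same_state = first.counter == second.counter   # arrival order is not visible
```

Both arrival orders leave the counter at 2, so a consumer that needs to know which access completed has to inspect the data itself or use one counter per source.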