spcl.inf.ethz.ch @spcl_eth
Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations
Maciej Besta, Torsten Hoefler
Remote Memory Access (RMA) Programming
Process p and process q each have their own memory; p's memory holds A and q's memory holds B.
put: p writes A directly into q's memory.
get: p reads B directly from q's memory.
flush: p completes its outstanding puts and gets.
[Figure: processes p and q with their memories, exchanging A and B via put, get, and flush; photo: Cray BlueWaters]
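The put/get/flush pattern on this slide can be sketched with a toy, single-process model. This is not a real RMA library: the `Process` class and its methods are hypothetical, and the model applies all operations at `flush`, whereas a real implementation may complete them earlier.

```python
# Toy, single-process model of one-sided RMA semantics (NOT a real RMA library).
# All names here (Process, put, get, flush) are hypothetical illustrations.

class Process:
    def __init__(self, name, memory):
        self.name = name
        self.memory = dict(memory)   # locally owned memory: key -> value
        self._pending = []           # operations issued but not yet completed

    def put(self, target, key, value):
        """Queue a write of `value` into the target's memory; the target does nothing."""
        self._pending.append(("put", target, key, value, None))

    def get(self, target, key, local_key):
        """Queue a read of the target's memory into our own memory."""
        self._pending.append(("get", target, key, None, local_key))

    def flush(self):
        """Complete all outstanding operations issued by this process."""
        for op, target, key, value, local_key in self._pending:
            if op == "put":
                target.memory[key] = value
            else:
                self.memory[local_key] = target.memory[key]
        self._pending.clear()


p = Process("p", {"A": 1})
q = Process("q", {"B": 2})

p.put(q, "A", p.memory["A"])   # write A into q's memory
p.get(q, "B", "B")             # read B from q's memory
assert "A" not in q.memory     # in this toy model nothing is applied before the flush
p.flush()
print(q.memory, p.memory)
```

Note that q never calls anything: only p issues operations, which is the one-sided property the slide illustrates.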
Remote Memory Access Programming
Implemented in hardware in NICs in the majority of HPC networks (RDMA).
Supported by many HPC libraries and languages.
Enables significant speedups over message passing in many types of applications, e.g.:
Speedup of ~1.5 for communication patterns in irregular workloads
Speedup of ~1.4-2 in physics computations
[1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. SC'13.
[2] D. Petrovic et al. High-performance RMA-based broadcast on the Intel SCC. SPAA'12.
RMA vs. Message Passing
Communication in RMA is one-sided: no active participation, direct access to memory.
RMA: process p issues put(A), then flush; A is written directly into the memory of process q.
Message Passing: process p sends a message carrying A; process q needs an explicit receive, with possible queueing.
[Figure: processes p and q with their memories; top: RMA put and flush; bottom: message passing send and receive]
Remote Memory Access Programming
Is it ideal?
Consider an insert in a distributed hashtable (Proc p inserts into a table held by Proc q):
No hash collision: 1 remote atomic. Up to 5x speedup over MP [1].
A hash collision: 4 remote atomics + 2 remote puts. Significant performance drops.
Local execution, triggered by an active access. In RMA?
How to enable it?
Use "active" semantics.
Use and extend I/O MMUs and their paging capabilities.
[1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. SC'13.
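The operation counts for the hashtable insert can be turned into a rough per-insert estimate. The two constants below come from the slide; the helper function and the collision rates fed into it are hypothetical inputs, not measurements.

```python
# Remote-operation counts taken from the slide; everything else is a hypothetical helper.
OPS_NO_COLLISION = 1          # 1 remote atomic
OPS_COLLISION = 4 + 2         # 4 remote atomics + 2 remote puts


def expected_remote_ops(collision_rate):
    """Expected remote operations per insert for a given collision probability."""
    if not 0.0 <= collision_rate <= 1.0:
        raise ValueError("collision_rate must be in [0, 1]")
    return (1.0 - collision_rate) * OPS_NO_COLLISION + collision_rate * OPS_COLLISION


# Illustrative collision rates only (assumed values, not results from the talk).
for rate in (0.0, 0.1, 0.5, 1.0):
    print(f"collision rate {rate:.1f}: {expected_remote_ops(rate):.1f} remote ops per insert")
```

Even a modest collision rate raises the average number of remote operations noticeably, which is the motivation for handling collisions locally at the target.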
Use Semantics from Active Messages (AM) [1]
Examples: AM++ [2], GASNet [3].
We use it in syntax & semantics to enable the "active" behavior.
We need active puts/gets:
Invoke a handler upon accessing a given page.
Preserve one-sided RMA behavior.
[Figure: process p accesses the memory of process q, which holds a table mapping A's addr to Handler A, ..., Z's addr to Handler Z]
[1] T. von Eicken et al. Active messages: a mechanism for integrated communication and computation. ISCA'92.
[2] J. J. Willcock et al. AM++: A generalized active message framework. PACT'10.
[3] D. Bonachea. GASNet Specification, v1.1. Berkeley Technical Report. 2002.
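The idea of invoking a handler upon accessing a given page can be sketched as a lookup from page number to handler. This is a toy model under stated assumptions: `ActiveMemory`, `PAGE_SIZE`, the addresses, and the handler are all hypothetical and do not reflect the AM++ or GASNet APIs.

```python
# Toy model of "invoke a handler upon accessing a given page".
# PAGE_SIZE, the addresses, and the handler below are hypothetical.
PAGE_SIZE = 4096


class ActiveMemory:
    def __init__(self):
        self.data = {}        # address -> value
        self.handlers = {}    # page number -> handler(address, value)

    def register(self, address, handler):
        """Attach a handler to the page containing `address`."""
        self.handlers[address // PAGE_SIZE] = handler

    def active_put(self, address, value):
        """Write the value; run the owner's handler only if the page has one."""
        self.data[address] = value
        handler = self.handlers.get(address // PAGE_SIZE)
        if handler is not None:
            handler(address, value)


seen = []
mem = ActiveMemory()
mem.register(0x1000, lambda addr, val: seen.append((addr, val)))

mem.active_put(0x1008, "A")   # same page as 0x1000: the handler fires
mem.active_put(0x9000, "Z")   # no handler registered for this page
print(seen)
```

The caller uses the same put-style call in both cases; whether a handler runs is decided entirely by the target's page-to-handler table, which keeps the access one-sided from the initiator's point of view.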
Use Input/Output Memory Management Units
We propose it as a way to implement the "active" behavior.
[Figure: the CPU reaches main memory through the MMU (with its TLB), which translates virtual addresses to physical addresses; I/O devices reach main memory through the IOMMU (with its IOTLB), which translates device addresses to physical addresses]
IOMMUs and RMA
We could use IOMMUs somehow. But:
No multiplexing (single log)... BAD
No parallelism (single log)... BAD
Data is discarded... Extremely BAD
[Figure: numbered steps 1-12 of an RDMA packet arriving at the NIC and passing as PCIe packets through the IOMMU (IOTLB, Dev-to-PT cache); main memory holds the remapping structures (Dev-to-PT, PT with W and R bits), the system-wide fault log with its fault entries, and the user handlers (Handler A, ...); the CPU (SMT cores) is notified via MSI]
Active Puts
Additions to the IOMMU design (marked "+" in the figure):
Access log table: stores addresses of each access log.
IUID in the PT: maps each page to an access log.
WL and WLD bits: decide on keeping/discarding the entry/data.
Access log (private for each process): holds fault entries together with the request data.
Data can be reused.
Enables data-centric programming.
[Figure: the IOMMU diagram from the previous slide (RDMA packet, NIC, PCIe packets, IOTLB, Dev-to-PT cache, remapping structures, system-wide fault log, user handlers, MSI, CPU with SMT cores) extended with the additions above]
Active Puts
Log both the entry and the data of an incoming put. Do not modify the page.
Page settings at process q: W = 0, WL = 1, WLD = 1.
1. Process p attempts write(X) to a page of process q.
2. The IOMMU checks the accessed page.
3. Page fault! (W = 0)
4. Move(X): X is moved into the access log in main memory.
5. Process(X): the CPU processes X.
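The active-put flow above can be sketched as a small simulation. The bit names W, WL, and WLD come from the slide; the classes, the handler, and the behavior for bit combinations other than W = 0, WL = 1, WLD = 1 are assumptions made for illustration, not the actual hardware design.

```python
# Toy model of the active-put flow on the slide. Bit names (W, WL, WLD) come
# from the slide; the classes and the handler are hypothetical.
from dataclasses import dataclass, field


@dataclass
class PageEntry:
    W: int = 1      # 1: the put may modify the page
    WL: int = 0     # 1: log the access entry
    WLD: int = 0    # 1: also log the data of the put


@dataclass
class ToyIOMMU:
    pages: dict = field(default_factory=dict)       # page id -> PageEntry
    memory: dict = field(default_factory=dict)      # page id -> stored value
    access_log: list = field(default_factory=list)  # (page id, data or None)

    def put(self, page_id, data):
        entry = self.pages[page_id]
        if entry.W:
            self.memory[page_id] = data             # ordinary put (assumed behavior)
        if entry.WL:
            logged = data if entry.WLD else None    # keep or discard the data
            self.access_log.append((page_id, logged))


def process_log(iommu, handler):
    """CPU side: run the handler on every logged access, then empty the log."""
    results = [handler(page_id, data) for page_id, data in iommu.access_log]
    iommu.access_log.clear()
    return results


iommu = ToyIOMMU(pages={7: PageEntry(W=0, WL=1, WLD=1)}, memory={7: "old"})
iommu.put(7, "X")                                               # steps 1-4: X lands in the log
results = process_log(iommu, lambda page, x: f"processed {x}")  # step 5
print(iommu.memory[7], results)
```

With W = 0 the page keeps its old contents, while the logged copy of X is what the handler consumes, matching the "do not modify the page" note on the slide.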
Active Gets
[Figure: the same extended IOMMU diagram as for active puts (access log table, IUID in the PT, per-process access logs holding fault entries and request data, WL and WLD bits), with the RL and RLD bits added next to the R bit]