Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations

Maciej Besta, Torsten Hoefler. spcl.inf.ethz.ch, @spcl_eth


  1. Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations. Maciej Besta, Torsten Hoefler.

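  12. Remote Memory Access (RMA) Programming. Diagram (Cray Blue Waters): process p holds A and process q holds B, each in its own memory. Operations shown: put A (A is copied from the memory of p into the memory of q), get B (B is copied from the memory of q into the memory of p), then flush.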
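
The put/get/flush sequence on this slide can be sketched with a toy, single-process model. This is not MPI or any real RMA library: `Window`, `Origin` and their methods are hypothetical names made up for illustration. The sketch assumes the most conservative behavior, where puts and gets are only guaranteed to have completed once flush returns, so the model defers all of them until flush.

```python
class Window:
    """Toy stand-in for a remotely accessible memory region (hypothetical)."""

    def __init__(self, **cells):
        self.mem = dict(cells)


class Origin:
    """Issues one-sided operations; in this model they take effect at flush()."""

    def __init__(self):
        self.pending = []

    def put(self, target, key, value):
        # Record a write into the target's memory; nothing happens yet.
        self.pending.append(lambda: target.mem.__setitem__(key, value))

    def get(self, target, key, out):
        # Record a read from the target's memory into a local buffer.
        self.pending.append(lambda: out.__setitem__(key, target.mem[key]))

    def flush(self):
        # Complete all outstanding operations in issue order.
        for op in self.pending:
            op()
        self.pending.clear()


p_mem = Window(A=1)   # memory of process p holds A
q_mem = Window(B=2)   # memory of process q holds B
p = Origin()
local = {}            # local buffer of p for fetched values

p.put(q_mem, "A", p_mem.mem["A"])
p.get(q_mem, "B", local)
before = (dict(q_mem.mem), dict(local))
p.flush()
after = (dict(q_mem.mem), dict(local))
```

In this model q's memory and p's buffer are unchanged before the flush and contain A and B afterwards; process q never runs any code of its own.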

  13. Remote Memory Access Programming
      - Implemented in hardware in NICs in the majority of HPC networks (RDMA)

  18. Remote Memory Access Programming
      - Supported by many HPC libraries and languages

  23. Remote Memory Access Programming
      - Enables significant speedups over message passing in many types of applications, e.g.:
        - Speedup of ~1.5 for communication patterns in irregular workloads
        - Speedup of ~1.4-2 in physics computations
      [1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. SC'13.
      [2] D. Petrovic et al. High-performance RMA-based broadcast on the Intel SCC. SPAA'12.

  30. RMA vs. Message Passing
      - Communication in RMA is one-sided
      - RMA: process p issues put A, then flush; A ends up in the memory of process q. No active participation, direct access to memory.
      - Message Passing: process p sends A as a message to process q. Explicit receive, possible queueing.
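
The contrast on this slide can be illustrated with a small toy model. It is a sketch only: `Process`, `rma_put`, `send` and `recv` are hypothetical names, and everything runs in one Python process rather than over a network.

```python
from collections import deque


class Process:
    """Toy process with local memory and a message queue (hypothetical)."""

    def __init__(self):
        self.memory = {}
        self.inbox = deque()  # used only by message passing


def rma_put(target, key, value):
    # One-sided: the origin writes directly into the target's memory.
    target.memory[key] = value


def send(target, key, value):
    # Two-sided: the message is queued until the target receives it.
    target.inbox.append((key, value))


def recv(proc):
    # Explicit receive: the target itself moves the data into its memory.
    key, value = proc.inbox.popleft()
    proc.memory[key] = value


# RMA: A is in q's memory without q doing anything.
q_rma = Process()
rma_put(q_rma, "A", 42)

# Message passing: A is not visible until q calls recv().
q_mp = Process()
send(q_mp, "A", 42)
visible_before_recv = "A" in q_mp.memory
recv(q_mp)
```

Here `visible_before_recv` is false: with message passing the data sits in a queue until process q explicitly receives it, whereas the put needs no action from q.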

  34. Remote Memory Access Programming
      - Is it ideal? Consider an insert in a distributed hashtable... (Proc p inserts into a table at Proc q)
        - No hash collision: 1 remote atomic. Up to 5x speedup over MP [1]
        - A hash collision: 4 remote atomics + 2 remote puts. Significant performance drops
      - Local execution; triggered by an active access. In RMA?
      - How to enable it?
        - Use “active” semantics
        - Use and extend I/O MMUs and their paging capabilities
      [1] R. Gerstenberger et al. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided. SC'13.
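
A rough cost model can make the hashtable argument concrete. The sketch below only counts remote operations. The per-insert costs for plain RMA (1 remote atomic without a collision, 4 remote atomics + 2 remote puts with one) are taken from the slide; the exact operation sequence behind them is not given here and is not modeled. The cost of one remote put per active insert is an assumption made for illustration, as are all names (`Owner`, `rma_insert`, `active_insert`).

```python
from collections import Counter


class Owner:
    """Process q: owns the hashtable (hypothetical layout)."""

    def __init__(self, slots):
        self.table = [None] * slots
        self.overflow = {}  # slot -> list of colliding keys

    def store(self, key):
        """Insert into local memory; returns True if the slot was taken."""
        slot = key % len(self.table)
        if self.table[slot] is None:
            self.table[slot] = key
            return False
        self.overflow.setdefault(slot, []).append(key)
        return True


def rma_insert(owner, key, ops):
    # Plain RMA: the origin drives every step remotely.
    ops["atomic"] += 1              # no collision: 1 remote atomic (slide)
    if owner.store(key):
        ops["atomic"] += 3          # collision: 4 remote atomics in total (slide)
        ops["put"] += 2             # ... plus 2 remote puts (slide)


def active_insert(owner, key, ops):
    # Active access (assumed cost): one put triggers a handler at the owner,
    # which resolves any collision locally.
    ops["put"] += 1
    owner.store(key)


keys = [1, 5, 2]                    # 1 and 5 collide in a 4-slot table
rma_owner, rma_ops = Owner(4), Counter()
act_owner, act_ops = Owner(4), Counter()
for k in keys:
    rma_insert(rma_owner, k, rma_ops)
    active_insert(act_owner, k, act_ops)
```

Both variants leave the table in the same state; under these assumptions only the number of remote operations issued by the inserting process differs.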

  35. Use Semantics from Active Messages (AM) [1], AM++ [2] & GASNet [3]. We use it in syntax & semantics to enable the “active” behavior.
      - Diagram: processes p and q; in memory, a table maps addresses to handlers (A's addr: Handler A, ..., Z's addr: Handler Z)
      - We need active puts/gets:
        - Invoke a handler upon accessing a given page
        - Preserve one-sided RMA behavior
      [1] T. von Eicken et al. Active messages: a mechanism for integrated communication and computation. ISCA'92.
      [2] J. J. Willcock et al. AM++: A generalized active message framework. PACT'10.
      [3] D. Bonachea. GASNet Specification, v1.1. Berkeley Technical Report. 2002.
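
The address-to-handler table on this slide can be sketched as a per-page dispatch. This is an illustrative model only: the 4 KiB page size, the `Target` class and the function names are assumptions, not part of the proposed interface.

```python
PAGE_SIZE = 4096  # assumed page size


class Target:
    """Toy target process with a page -> handler table (hypothetical)."""

    def __init__(self):
        self.handlers = {}  # page number -> handler
        self.memory = {}
        self.log = []

    def register(self, addr, handler):
        # Associate a handler with the page containing addr.
        self.handlers[addr // PAGE_SIZE] = handler

    def on_access(self, addr, value):
        # Stands in for the hardware path: the target's application code
        # does not post a receive, it only supplies handlers.
        handler = self.handlers.get(addr // PAGE_SIZE)
        if handler is None:
            self.memory[addr] = value      # ordinary put
        else:
            handler(self, addr, value)     # active put


def active_put(target, addr, value):
    # The origin's call looks the same whether or not a handler is registered.
    target.on_access(addr, value)


def handler_a(target, addr, value):
    target.log.append(("A", addr, value))


q = Target()
q.register(0x1000, handler_a)
active_put(q, 0x1008, 7)   # same page as 0x1000: invokes handler A
active_put(q, 0x5000, 9)   # no handler on this page: plain write
```

The origin side stays one-sided in both cases; whether a handler runs is decided purely by the page that is accessed.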

  36. Use Input/Output Memory Management Units (IOMMUs). We propose it as a way to implement the “active” behavior. Diagram: the CPU issues virtual addresses, which the MMU (with its TLB) translates to physical addresses in main memory; I/O devices issue device addresses, which the IOMMU (with its IOTLB) translates to physical addresses in main memory.

  37. IOMMUs and RMA. We could use it somehow. But ...
      - Diagram (steps numbered 1-12 on the slide): an RDMA packet arrives at the NIC, which issues PCIe packets to the IOMMU (IOTLB, Dev-to-PT cache); main memory holds the remapping structures (Dev-to-PT, PT with W and R bits) and a system-wide fault log of fault entries; an MSI reaches the CPU (SMT cores), which runs user handlers (Handler A, ...)
      - No multiplexing (single log)... BAD
      - No parallelism (single log)... BAD
      - Data is discarded... Extremely BAD

  38. Active Puts. Same diagram as on slide 37, with the proposed additions marked “+”:
      - Access log table (alongside the IOTLB): stores addresses of each access log
      - WL and WLD bits (in the PT, next to W and R): decide on keeping/discarding the entry/data
      - IUID (in the PT): maps each page to an access log
      - Access log (private for each process): each fault entry is stored with its request data, so data can be reused
      - Enables data-centric programming

  39. Active Puts. Example with W = 0 (do not modify the page) and WL = 1, WLD = 1 (log both the entry and the data of an incoming put):
      (1) Process q attempts write(X) to a page of process p; the request reaches the IOMMU.
      (2) The accessed page has W = 0, WL = 1, WLD = 1.
      (3) Page fault! (W = 0)
      (4) Move(X): X is moved into the access log in main memory.
      (5) Process(X): the CPU processes X.
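
The five steps on this slide can be sketched as a small software model of the bits. It is not a description of real IOMMU hardware: the data structures and function names are hypothetical, and the behavior for bit combinations other than the one on the slide (W = 0, WL = 1, WLD = 1) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class PageEntry:
    W: bool = True      # incoming writes may modify the page
    WL: bool = False    # log an entry for each incoming write
    WLD: bool = False   # also keep the written data in the log


@dataclass
class TargetNode:
    pages: dict = field(default_factory=dict)       # page -> PageEntry
    memory: dict = field(default_factory=dict)      # page -> contents
    access_log: list = field(default_factory=list)  # (page, data or None)


def iommu_write(node, page, data):
    """Model of an incoming put passing through the IOMMU."""
    pte = node.pages[page]
    if pte.W:
        node.memory[page] = data                    # normal write
    if pte.WL:
        # Keep the entry; keep the data too only if WLD is set.
        node.access_log.append((page, data if pte.WLD else None))


def cpu_drain(node, handler):
    """Model of the CPU processing logged accesses in arrival order."""
    while node.access_log:
        page, data = node.access_log.pop(0)
        handler(page, data)


p = TargetNode()
p.pages[3] = PageEntry(W=False, WL=True, WLD=True)  # configuration from the slide
p.memory[3] = "old"

iommu_write(p, 3, "X")          # process q attempts write(X)
logged = list(p.access_log)     # X is in the access log, page untouched

processed = []
cpu_drain(p, lambda page, data: processed.append((page, data)))  # Process(X)
```

With the slide's configuration the page keeps its old contents, while X travels through the access log to the handler running on the CPU.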

  40. Active Gets. Same diagram as for active puts (access log table, IUID, private per-process access logs holding fault entries with request data, WL and WLD bits), with RL and RLD bits added to the PT.
