MVAPICH2-GDR: Pushing the Frontier of MPI Libraries Enabling GPUDirect Technologies
GPU Technology Conference (GTC 2018) presentation by Dhabaleswar K. (DK) Panda and Hari Subramoni, The Ohio State University


  1. MVAPICH2-GDR: Pushing the Frontier of MPI Libraries Enabling GPUDirect Technologies
     GPU Technology Conference (GTC 2018)
     by Dhabaleswar K. (DK) Panda and Hari Subramoni, The Ohio State University
     E-mail: panda@cse.ohio-state.edu, subramon@cse.ohio-state.edu
     http://www.cse.ohio-state.edu/~panda
     http://www.cse.ohio-state.edu/~subramon

  2. Outline
     • Overview of the MVAPICH2 Project
     • MVAPICH2-GPU with GPUDirect RDMA (GDR)
     • Current Features
       • Multi-stream Communication for IPC
       • CMA-based Intra-node Host-to-Host Communication Support
       • Maximal Overlap in MPI Datatype Processing
       • Efficient Support for Managed Memory
       • Streaming Support with InfiniBand Multicast and GDR
       • Support for Deep Learning
       • Support for OpenPOWER with NVLink
       • Support for Containers
     • Upcoming Features
       • CMA-based Intra-node Collective Communication Support
       • XPMEM-based Collective Communication Support
       • Optimized Collectives for Deep Learning
       • Out-of-core Processing for Deep Learning
     • Conclusions

  3. Overview of the MVAPICH2 Project
     • High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
       – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
       – MVAPICH2-X (MPI + PGAS), available since 2011
       – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
       – Support for Virtualization (MVAPICH2-Virt), available since 2015
       – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
       – Support for InfiniBand Network Analysis and Monitoring (OSU INAM), available since 2015
       – Used by more than 2,875 organizations in 86 countries
       – More than 460,000 (> 0.46 million) downloads from the OSU site directly
       – Empowering many TOP500 clusters (Nov '17 ranking):
         • 1st: 10,649,600-core Sunway TaihuLight at the National Supercomputing Center in Wuxi, China
         • 9th: 556,104-core Oakforest-PACS in Japan
         • 12th: 368,928-core Stampede2 at TACC
         • 17th: 241,108-core Pleiades at NASA
         • 48th: 76,032-core Tsubame 2.5 at Tokyo Institute of Technology
       – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
       – http://mvapich.cse.ohio-state.edu
     • Empowering Top500 systems for over a decade

  4. MVAPICH2 Release Timeline and Downloads
     [Chart: cumulative number of downloads (0 to 500,000) from Sep 2004 to Jan 2018, annotated with release milestones: MV 0.9.4, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV 1.0, MV2 1.0.3, MV 1.1, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2-GDR 2.0b, MV2-MIC 2.0, MV2-Virt 2.2, MV2-X 2.3b, MV2-GDR 2.3a, MV2 2.3rc1, and OSU INAM 0.9.3]

  5. MVAPICH2 Architecture
     [Diagram: high-performance parallel programming models, namely the Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk), layered over a high-performance and scalable communication runtime. The runtime provides diverse APIs and mechanisms: point-to-point primitives, collective algorithms, job startup, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, and introspection & analysis. It supports modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon Phi (MIC, KNL*), NVIDIA GPGPU) and modern networking technologies (InfiniBand, iWARP, RoCE, Omni-Path), including transport protocols (RC, XRC, UD, DC), transport mechanisms (shared memory, CMA, IVSHMEM), and modern features (SR-IOV, multi-rail, MCDRAM*, NVLink*, CAPI*, UMR, ODP*). * Upcoming]

  6. MVAPICH2 Software Family
     High-Performance Parallel Programming Libraries
     • MVAPICH2 – Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE
     • MVAPICH2-X – Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with a unified communication runtime
     • MVAPICH2-GDR – Optimized MPI for clusters with NVIDIA GPUs
     • MVAPICH2-Virt – High-performance and scalable MPI for hypervisor- and container-based HPC clouds
     • MVAPICH2-EA – Energy-aware and high-performance MPI
     • MVAPICH2-MIC – Optimized MPI for clusters with Intel KNC
     Microbenchmarks
     • OMB – Microbenchmark suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs
     Tools
     • OSU INAM – Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration
     • OEMT – Utility to measure the energy consumption of MPI applications

  7. MVAPICH2-GDR: Optimizing MPI Data Movement on GPU Clusters
     • GPUs are connected as PCIe devices – flexibility, but also complexity
     • Memory buffers can travel over many paths within and between nodes (CPU, GPU, PCIe, QPI, IB adapter):
       1. Intra-GPU
       2. Intra-socket GPU-GPU
       3. Inter-socket GPU-GPU
       4. Inter-node GPU-GPU
       5. Intra-socket GPU-Host
       6. Inter-socket GPU-Host
       7. Inter-node GPU-Host
       8. Inter-node GPU-GPU with the IB adapter on the remote socket
       ... and more
     • NVLink is leading to even more paths
     • For each path there are different schemes: shared memory, IPC, GPUDirect RDMA, pipelining, ... (a sketch of the raw CUDA IPC mechanism follows this slide)
     • It is critical for runtimes to optimize data movement while hiding this complexity
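To give a feel for one of these schemes, the following is a minimal sketch of the raw CUDA IPC mechanism that a GPU-aware MPI library can use internally for intra-node GPU-GPU transfers. This is not MVAPICH2's implementation; it is a hand-written illustration that exchanges the IPC handle over MPI and assumes exactly two ranks running on the same node with IPC-capable GPUs:

    /* cuda_ipc_sketch.c: hand-written illustration of the CUDA IPC path
     * that a GPU-aware MPI library may use for intra-node GPU-GPU copies.
     * Assumes exactly two ranks on the same node. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank;
        size_t bytes = 1 << 20;                 /* illustrative size */
        void *d_src = NULL, *d_dst = NULL;
        cudaIpcMemHandle_t handle;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Exporter: allocate device memory and share an IPC handle. */
            cudaMalloc(&d_src, bytes);
            cudaIpcGetMemHandle(&handle, d_src);
            MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Importer: map the peer's buffer and copy device-to-device. */
            void *d_remote;
            MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            cudaMalloc(&d_dst, bytes);
            cudaIpcOpenMemHandle(&d_remote, handle, cudaIpcMemLazyEnablePeerAccess);
            cudaMemcpy(d_dst, d_remote, bytes, cudaMemcpyDeviceToDevice);
            cudaIpcCloseMemHandle(d_remote);
        }

        /* Keep the exported buffer alive until the importer has copied. */
        MPI_Barrier(MPI_COMM_WORLD);
        if (d_src) cudaFree(d_src);
        if (d_dst) cudaFree(d_dst);
        MPI_Finalize();
        return 0;
    }

A GPU-aware MPI hides this handle exchange, caching, and scheme selection behind the standard MPI calls, which is the point of the next slide.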

  8. GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
     • Standard MPI interfaces used for unified data movement
     • Takes advantage of Unified Virtual Addressing (CUDA 4.0 and later)
     • Overlaps data movement from the GPU with RDMA transfers
     At Sender:   MPI_Send(s_devbuf, size, ...);
     At Receiver: MPI_Recv(r_devbuf, size, ...);
     The data movement itself happens inside MVAPICH2.
     High Performance and High Productivity
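A minimal end-to-end sketch of this pattern, assuming an MVAPICH2-GDR (or any CUDA-aware MPI) build; the buffer size, tag, and rank layout are illustrative only:

    /* cuda_aware_pingpong.c: pass device pointers straight to MPI. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank, n = 1 << 20;                  /* 1M floats (illustrative) */
        float *devbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&devbuf, n * sizeof(float));

        /* The device pointer is detected via Unified Virtual Addressing;
         * the library chooses the transfer scheme (GDR, IPC, pipelined
         * copies) internally. */
        if (rank == 0)
            MPI_Send(devbuf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(devbuf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }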

  9. CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases
     • Support for MPI communication from NVIDIA GPU device memory
     • High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
     • High-performance intra-node point-to-point communication for nodes with multiple GPU adapters (GPU-GPU, GPU-Host, and Host-GPU)
     • Takes advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication between multiple GPU adapters per node
     • Optimized and tuned collectives for GPU device buffers (see the sketch below)
     • MPI datatype support for point-to-point and collective communication from GPU device buffers
     • Unified Memory support
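To illustrate the device-buffer collectives, here is a hedged sketch, again assuming a CUDA-aware build; the element count and the choice of MPI_SUM over floats are arbitrary:

    /* device_allreduce.c: run a collective directly on GPU memory. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int n = 1024;                           /* illustrative element count */
        float *d_in, *d_out;

        MPI_Init(&argc, &argv);
        cudaMalloc((void **)&d_in,  n * sizeof(float));
        cudaMalloc((void **)&d_out, n * sizeof(float));
        cudaMemset(d_in, 0, n * sizeof(float)); /* placeholder data */

        /* Device pointers go straight into the standard collective;
         * the library selects its tuned GPU algorithms internally. */
        MPI_Allreduce(d_in, d_out, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        cudaFree(d_in);
        cudaFree(d_out);
        MPI_Finalize();
        return 0;
    }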

  10. Using the MVAPICH2-GPUDirect Version
      • MVAPICH2-2.3 with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
      • System software requirements:
        • Mellanox OFED 3.2 or later
        • NVIDIA Driver 367.48 or later
        • NVIDIA CUDA Toolkit 7.5/8.0/9.0 or later
        • Plugin for GPUDirect RDMA: http://www.mellanox.com/page/products_dyn?product_family=116
        • Strongly recommended: the GDRCOPY module from NVIDIA (https://github.com/NVIDIA/gdrcopy)
      • Contact the MVAPICH help list with any questions related to the package: mvapich-help@cse.ohio-state.edu
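Once the library and these components are installed, jobs are launched with the standard MVAPICH2 launchers. A minimal launch sketch, assuming a two-node run with the bundled mpirun_rsh launcher; the host names and the executable are placeholders, and the MV2_USE_CUDA=1 runtime parameter enables communication from GPU device buffers:

    mpirun_rsh -np 2 node01 node02 MV2_USE_CUDA=1 ./cuda_aware_pingpong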

  11. MVAPICH2-GDR 2.3a
      • Released on 11/09/2017
      • Major Features and Enhancements
        – Based on MVAPICH2 2.2
        – Support for CUDA 9.0
        – Support for Volta (V100) GPUs
        – Support for OpenPOWER with NVLink
        – Efficient multiple-CUDA-stream-based IPC communication for multi-GPU systems with and without NVLink
        – Enhanced performance of GPU-based point-to-point communication
        – Leverages the Linux Cross Memory Attach (CMA) feature for enhanced host-based communication
        – Enhanced performance of MPI_Allreduce for GPU-resident data
        – InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and streaming applications
          • Basic support for IB-MCAST designs with GPUDirect RDMA
          • Advanced support for zero-copy IB-MCAST designs with GPUDirect RDMA
          • Advanced reliability support for IB-MCAST designs
        – Efficient broadcast designs for Deep Learning applications
        – Enhanced collective tuning on Xeon, OpenPOWER, and NVIDIA DGX-1 systems

  12. Optimized MVAPICH2-GDR Design
      [Charts: OSU micro-benchmark results comparing MV2 (no GDR) against MV2-GDR 2.3a for inter-node GPU-GPU transfers over message sizes from 1 byte to a few KB:
       – GPU-GPU inter-node latency (us): down to 1.88 us, about a 10x improvement
       – GPU-GPU inter-node bi-directional bandwidth (MB/s): about 11x improvement
       – GPU-GPU inter-node bandwidth (MB/s): about 9x improvement]
      Platform: Intel Haswell (E5-2687W @ 3.10 GHz) nodes with 20 cores, NVIDIA Volta V100 GPU, Mellanox ConnectX-4 EDR HCA, CUDA 9.0, and Mellanox OFED 4.0 with GPUDirect RDMA
