Support for GPUs with GPUDirect RDMA in MVAPICH2


  1. Support for GPUs with GPUDirect RDMA in MVAPICH2
     SC’13 NVIDIA Booth
     by D.K. Panda, The Ohio State University
     E-mail: panda@cse.ohio-state.edu
     http://www.cse.ohio-state.edu/~panda

  2. Outline
     • Overview of MVAPICH2-GPU Project
     • GPUDirect RDMA with Mellanox IB adapters
     • Other Optimizations for GPU Communication
     • Support for MPI + OpenACC
     • CUDA and OpenACC extensions in OMB

  3. Drivers of Modern HPC Cluster Architectures
     • Multi-core processors are ubiquitous and InfiniBand is widely accepted
     • MVAPICH2 has constantly evolved to provide superior performance
     • Accelerators/coprocessors are becoming common in high-end systems
     • How does MVAPICH2 help development on these emerging architectures?
     [Figure: Multi-core Processors; High Performance Interconnects - InfiniBand (<1 usec latency, >100 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); example systems: Tianhe-2 (#1), Stampede (#6), Titan (#2), Tianhe-1A (#10)]

  4. InfiniBand + GPU Systems (Past)
     • Many systems today have both GPUs and high-speed networks such as InfiniBand
     • Problem: lack of a common memory registration mechanism
       – Each device has to pin the host memory it will use
       – Many operating systems do not allow multiple devices to register the same memory pages
     • Previous solution: use a different buffer for each device and copy data

  5. GPU-Direct
     • Collaboration between Mellanox and NVIDIA to converge on one memory registration technique
     • Both devices register a common host buffer
       – The GPU copies data to this buffer, and the network adapter can read directly from it (or vice versa)
     • Note that GPU-Direct does not allow you to bypass host memory
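     A minimal sketch (not from the slides, names are illustrative) of what this shared registration looks like from the application side: the staging buffer is allocated as pinned host memory with cudaMallocHost, the CUDA driver copies into it, and the IB stack registers the same pages when MPI sends from it.

         /* Hypothetical helper: stage a GPU send through a pinned host
          * buffer that both the CUDA driver and the IB HCA can register. */
         #include <mpi.h>
         #include <cuda_runtime.h>

         void staged_send(const void *d_buf, size_t bytes, int dst, MPI_Comm comm)
         {
             void *h_staging;
             cudaMallocHost(&h_staging, bytes);   /* page-locked host memory */
             cudaMemcpy(h_staging, d_buf, bytes, cudaMemcpyDeviceToHost);
             MPI_Send(h_staging, (int)bytes, MPI_BYTE, dst, 0, comm);
             cudaFreeHost(h_staging);
         }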

  6. MPI + CUDA
     • Data movement in applications with standard MPI and CUDA interfaces:
       At Sender:
         cudaMemcpy(s_hostbuf, s_devbuf, . . .);
         MPI_Send(s_hostbuf, size, . . .);
       At Receiver:
         MPI_Recv(r_hostbuf, size, . . .);
         cudaMemcpy(r_devbuf, r_hostbuf, . . .);
       High Productivity and Low Performance
     • Users can do the pipelining at the application level using non-blocking MPI and CUDA interfaces (see the sketch below)
       Low Productivity and High Performance
     [Figure: CPU and GPU connected over PCIe to the NIC; nodes connected through a switch]
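     A sketch of the user-level pipelining mentioned above: the device-to-host copy of one chunk overlaps with the network send of the previous chunk. Chunk size, stream usage and the pinned staging buffer (assumed to be allocated with cudaMallocHost by the caller) are illustrative assumptions, not code from the slides.

         #include <stdlib.h>
         #include <mpi.h>
         #include <cuda_runtime.h>

         #define CHUNK (512 * 1024)   /* assumed pipeline chunk size */

         void pipelined_send(const char *d_buf, char *h_staging, size_t bytes,
                             int dst, MPI_Comm comm)
         {
             cudaStream_t stream;
             cudaStreamCreate(&stream);

             size_t nchunks = (bytes + CHUNK - 1) / CHUNK;
             MPI_Request *reqs = malloc(nchunks * sizeof(MPI_Request));

             for (size_t i = 0; i < nchunks; i++) {
                 size_t off = i * CHUNK;
                 size_t len = (off + CHUNK <= bytes) ? CHUNK : bytes - off;
                 /* Copy chunk i to pinned host memory ... */
                 cudaMemcpyAsync(h_staging + off, d_buf + off, len,
                                 cudaMemcpyDeviceToHost, stream);
                 cudaStreamSynchronize(stream);      /* wait for this chunk only */
                 /* ... and post its send; the HCA pushes it out while the
                  * copy of chunk i+1 proceeds on the next iteration. */
                 MPI_Isend(h_staging + off, (int)len, MPI_BYTE, dst,
                           (int)i, comm, &reqs[i]);
             }
             MPI_Waitall((int)nchunks, reqs, MPI_STATUSES_IGNORE);
             free(reqs);
             cudaStreamDestroy(stream);
         }

     The receiver would mirror this with MPI_Irecv calls followed by host-to-device copies of each arriving chunk.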

  7. GPU-Aware MPI Library: MVAPICH2-GPU
     • Standard MPI interfaces used for unified data movement
     • Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
     • Optimizes data movement from GPU memory (handled inside MVAPICH2):
       At Sender:
         MPI_Send(s_devbuf, size, …);
       At Receiver:
         MPI_Recv(r_devbuf, size, …);
       High Performance and High Productivity
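     A minimal two-rank sketch of this usage, assuming a CUDA-aware MVAPICH2 build (buffer names and sizes are illustrative):

         #include <mpi.h>
         #include <cuda_runtime.h>

         int main(int argc, char **argv)
         {
             MPI_Init(&argc, &argv);
             int rank;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);

             const size_t n = 1 << 20;
             char *d_buf;
             cudaMalloc((void **)&d_buf, n);

             if (rank == 0) {
                 cudaMemset(d_buf, 1, n);
                 /* Device pointer passed directly; the library detects it
                  * via UVA and stages or pipelines the transfer itself. */
                 MPI_Send(d_buf, (int)n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
             } else if (rank == 1) {
                 MPI_Recv(d_buf, (int)n, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                          MPI_STATUS_IGNORE);
             }

             cudaFree(d_buf);
             MPI_Finalize();
             return 0;
         }

     GPU-buffer support typically also has to be enabled at run time (for example, MV2_USE_CUDA=1 in MVAPICH2 releases of that era).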

  8. Pipelined Data Movement in MVAPICH2
     • Pipelines data movement from the GPU; overlaps
       – device-to-host CUDA copies
       – inter-process data movement (network transfers or shared-memory copies)
       – host-to-device CUDA copies
     • 45% improvement compared with a naïve implementation (Memcpy+Send)
     • 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend)
     [Chart: Internode osu_latency, large messages (32K to 2M bytes), comparing Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU]

  9. Outline
     • Overview of MVAPICH2-GPU Project
     • GPUDirect RDMA with Mellanox IB adapters
     • Other Optimizations for GPU Communication
     • Support for MPI + OpenACC
     • CUDA and OpenACC extensions in OMB

  10. GPU-Direct RDMA (GDR) with CUDA
     • Network adapter can directly read/write data from/to GPU device memory
     • Avoids copies through the host
     • Fastest possible communication between GPU and IB HCA
     • Allows for better asynchronous communication
     • OFED with GDR support is under development by Mellanox and NVIDIA
     [Figure: CPU, system memory, chipset, GPU with GPU memory, and InfiniBand HCA; the HCA accesses GPU memory directly]

  11. GPU-Direct RDMA (GDR) with CUDA
     • OFED with support for GPUDirect RDMA is under work by NVIDIA and Mellanox
     • OSU has an initial design of MVAPICH2 using GPUDirect RDMA
       – Hybrid design using GPUDirect RDMA and host-based pipelining
         • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
       – Support for communication using multi-rail
       – Support for Mellanox Connect-IB and ConnectX VPI adapters
       – Support for RoCE with Mellanox ConnectX VPI adapters
     [Figure: node with CPU, system memory, chipset, GPU, GPU memory and IB adapter]
     PCIe P2P limits through the chipset:
       SNB E5-2670:    P2P write 5.2 GB/s, P2P read < 1.0 GB/s
       IVB E5-2680V2:  P2P write 6.4 GB/s, P2P read 3.5 GB/s

  12. Performance of MVAPICH2 with GPU-Direct-RDMA: Latency
     GPU-GPU internode MPI latency, comparing 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR
     [Charts: small messages (1 byte to 4K): 67% improvement with GDR, down to 5.49 usec;
      large messages (8K to 2M): 10% improvement]
     Setup: MVAPICH2-2.0b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores,
     NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA,
     CUDA 5.5, Mellanox OFED 2.0 with GPU-Direct-RDMA patch

  13. Performance of MVAPICH2 with GPU-Direct-RDMA: Bandwidth
     GPU-GPU internode MPI uni-directional bandwidth, comparing 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR
     [Charts: small messages (1 byte to 4K): 5x improvement with GDR;
      large messages (8K to 2M): up to 9.8 GB/s]
     Setup: MVAPICH2-2.0b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores,
     NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA,
     CUDA 5.5, Mellanox OFED 2.0 with GPU-Direct-RDMA patch

  14. Performance of MVAPICH2 with GPU-Direct-RDMA: Bi-Bandwidth
     GPU-GPU internode MPI bi-directional bandwidth, comparing 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR
     [Charts: small messages (1 byte to 4K): 4.3x improvement with GDR;
      large messages (8K to 2M): up to 19 GB/s, a 19% improvement]
     Setup: MVAPICH2-2.0b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores,
     NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA,
     CUDA 5.5, Mellanox OFED 2.0 with GPU-Direct-RDMA patch

  15. How can I get started with GDR Experimentation?
     • Two modules are needed:
       – Alpha version of OFED kernel and libraries with GPUDirect RDMA (GDR) support from Mellanox
       – Alpha version of MVAPICH2-GDR from OSU (currently a separate distribution)
     • Send a note to hpc@mellanox.com
     • You will get alpha versions of the GDR driver and MVAPICH2-GDR (based on the MVAPICH2 2.0a release)
     • You can get started with this version
     • The MVAPICH2 team is working on multiple enhancements (collectives, datatypes, one-sided) to exploit the advantages of GDR
     • As the GDR driver matures, successive versions of MVAPICH2-GDR with enhancements will be made available to the community

  16. Outline
     • Overview of MVAPICH2-GPU Project
     • GPUDirect RDMA with Mellanox IB adapters
     • Other Optimizations for GPU Communication
     • Support for MPI + OpenACC
     • CUDA and OpenACC extensions in OMB

  17. Multi-GPU Configurations
     • Multi-GPU node architectures are becoming common
     • Until CUDA 3.2:
       – Communication between processes staged through the host
       – Shared memory (pipelined)
       – Network loopback (asynchronous)
     • CUDA 4.0 and later:
       – Inter-Process Communication (IPC)
       – Host bypass
       – Handled by a DMA engine
       – Low latency and asynchronous
       – Requires creation, exchange and mapping of memory handles - overhead (see the sketch below)
     [Figure: Process 0 and Process 1 on one node, each using a GPU (GPU 0, GPU 1), connected through host memory, the I/O hub and the HCA]
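     A minimal sketch (illustrative, not MVAPICH2's internal code) of the handle creation, exchange and mapping referred to above, assuming exactly two MPI ranks sharing one node:

         #include <mpi.h>
         #include <cuda_runtime.h>

         int main(int argc, char **argv)
         {
             MPI_Init(&argc, &argv);
             int rank;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
             const size_t n = 1 << 20;

             if (rank == 0) {
                 /* Owner: allocate device memory and create an IPC handle. */
                 void *d_buf;
                 cudaMalloc(&d_buf, n);
                 cudaIpcMemHandle_t handle;
                 cudaIpcGetMemHandle(&handle, d_buf);
                 /* Exchange the handle (not the data) with the peer process. */
                 MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                 MPI_Barrier(MPI_COMM_WORLD);   /* keep the allocation alive */
                 cudaFree(d_buf);
             } else if (rank == 1) {
                 cudaIpcMemHandle_t handle;
                 MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                          MPI_STATUS_IGNORE);
                 /* Map the peer's allocation and copy from it directly,
                  * bypassing the host. */
                 void *d_peer, *d_local;
                 cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);
                 cudaMalloc(&d_local, n);
                 cudaMemcpy(d_local, d_peer, n, cudaMemcpyDeviceToDevice);
                 cudaIpcCloseMemHandle(d_peer);
                 cudaFree(d_local);
                 MPI_Barrier(MPI_COMM_WORLD);
             }

             MPI_Finalize();
             return 0;
         }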

  18. Designs in MVAPICH2 and Performance
     • MVAPICH2 takes advantage of CUDA IPC for MPI communication between GPUs
     • Hides the complexity and overheads of handle creation, exchange and mapping
     • Available in standard releases from MVAPICH2 1.8
     [Charts: SHARED-MEM vs. CUDA IPC designs; intranode osu_latency (small messages): 70% improvement;
      intranode osu_latency (large messages) and intranode osu_bw: 46% and 78% improvements]

  19. Collectives Optimizations in MVAPICH2: Overview
     • Optimizes data movement at the collective level for small messages
     • Pipelines data movement in each send/recv operation for large messages
     • Several collectives have been optimized:
       – Bcast, Gather, Scatter, Allgather, Alltoall, Scatterv, Gatherv, Allgatherv, Alltoallv
     • Collective-level optimizations are completely transparent to the user (see the sketch below)
     • Pipelining can be tuned using point-to-point parameters
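     To illustrate that transparency, a minimal sketch (illustrative names and sizes, not from the slides) of a collective invoked directly on a GPU buffer; a GPU-aware MVAPICH2 recognizes the device pointer via UVA and stages or pipelines the transfer internally, so the call site is identical to the host-buffer case.

         #include <mpi.h>
         #include <cuda_runtime.h>

         int main(int argc, char **argv)
         {
             MPI_Init(&argc, &argv);
             int rank;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);

             const size_t n = 1 << 22;
             float *d_data;
             cudaMalloc((void **)&d_data, n * sizeof(float));
             if (rank == 0)
                 cudaMemset(d_data, 0, n * sizeof(float));  /* root fills the buffer (zeros, for illustration) */

             /* Device pointer used directly in the collective call. */
             MPI_Bcast(d_data, (int)n, MPI_FLOAT, 0, MPI_COMM_WORLD);

             cudaFree(d_data);
             MPI_Finalize();
             return 0;
         }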
