Support for GPUs with GPUDirect RDMA in MVAPICH2


  1. Support for GPUs with GPUDirect RDMA in MVAPICH2
     SC’13 NVIDIA Booth
     by D.K. Panda, The Ohio State University
     E-mail: panda@cse.ohio-state.edu
     http://www.cse.ohio-state.edu/~panda

  2. Outline
     • Overview of MVAPICH2-GPU Project
     • GPUDirect RDMA with Mellanox IB adapters
     • Other Optimizations for GPU Communication
     • Support for MPI + OpenACC
     • CUDA and OpenACC extensions in OMB

  3. Drivers of Modern HPC Cluster Architectures
     • Multi-core processors are ubiquitous and InfiniBand is widely accepted
     • MVAPICH2 has constantly evolved to provide superior performance
     • Accelerators/coprocessors are becoming common in high-end systems
     • How does MVAPICH2 help development on these emerging architectures?
     [Figure: Multi-core Processors; High Performance Interconnects - InfiniBand (<1 usec latency, >100 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); example systems: Tianhe-2 (#1), Stampede (#6), Titan (#2), Tianhe-1A (#10)]

  4. InfiniBand + GPU Systems (Past)
     • Many systems today have both GPUs and high-speed networks such as InfiniBand
     • Problem: lack of a common memory registration mechanism
       – Each device has to pin the host memory it will use
       – Many operating systems do not allow multiple devices to register the same memory pages
     • Previous solution: use a different buffer for each device and copy data

  5. GPU-Direct
     • Collaboration between Mellanox and NVIDIA to converge on one memory registration technique
     • Both devices register a common host buffer
       – The GPU copies data to this buffer, and the network adapter can read directly from it (or vice versa)
     • Note that GPU-Direct does not allow you to bypass host memory
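     A minimal sketch (not from the slides, names are illustrative) of what this shared registration looks like from the application side: the staging buffer is allocated as pinned host memory with cudaMallocHost, the CUDA driver copies into it, and the IB stack registers the same pages when MPI sends from it.

         /* Hypothetical helper: stage a GPU send through a pinned host
          * buffer that both the CUDA driver and the IB HCA can register. */
         #include <mpi.h>
         #include <cuda_runtime.h>

         void staged_send(const void *d_buf, size_t bytes, int dst, MPI_Comm comm)
         {
             void *h_staging;
             cudaMallocHost(&h_staging, bytes);   /* page-locked host memory */
             cudaMemcpy(h_staging, d_buf, bytes, cudaMemcpyDeviceToHost);
             MPI_Send(h_staging, (int)bytes, MPI_BYTE, dst, 0, comm);
             cudaFreeHost(h_staging);
         }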

  6. MPI + CUDA
     • Data movement in applications with standard MPI and CUDA interfaces:
       At Sender:
         cudaMemcpy(s_hostbuf, s_devbuf, . . .);
         MPI_Send(s_hostbuf, size, . . .);
       At Receiver:
         MPI_Recv(r_hostbuf, size, . . .);
         cudaMemcpy(r_devbuf, r_hostbuf, . . .);
       High Productivity and Low Performance
     • Users can do the pipelining at the application level using non-blocking MPI and CUDA interfaces (see the sketch below)
       Low Productivity and High Performance
     [Figure: CPU and GPU connected over PCIe to the NIC; nodes connected through a switch]
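     A sketch of the user-level pipelining mentioned above: the device-to-host copy of one chunk overlaps with the network send of the previous chunk. Chunk size, stream usage and the pinned staging buffer (assumed to be allocated with cudaMallocHost by the caller) are illustrative assumptions, not code from the slides.

         #include <stdlib.h>
         #include <mpi.h>
         #include <cuda_runtime.h>

         #define CHUNK (512 * 1024)   /* assumed pipeline chunk size */

         void pipelined_send(const char *d_buf, char *h_staging, size_t bytes,
                             int dst, MPI_Comm comm)
         {
             cudaStream_t stream;
             cudaStreamCreate(&stream);

             size_t nchunks = (bytes + CHUNK - 1) / CHUNK;
             MPI_Request *reqs = malloc(nchunks * sizeof(MPI_Request));

             for (size_t i = 0; i < nchunks; i++) {
                 size_t off = i * CHUNK;
                 size_t len = (off + CHUNK <= bytes) ? CHUNK : bytes - off;
                 /* Copy chunk i to pinned host memory ... */
                 cudaMemcpyAsync(h_staging + off, d_buf + off, len,
                                 cudaMemcpyDeviceToHost, stream);
                 cudaStreamSynchronize(stream);      /* wait for this chunk only */
                 /* ... and post its send; the HCA pushes it out while the
                  * copy of chunk i+1 proceeds on the next iteration. */
                 MPI_Isend(h_staging + off, (int)len, MPI_BYTE, dst,
                           (int)i, comm, &reqs[i]);
             }
             MPI_Waitall((int)nchunks, reqs, MPI_STATUSES_IGNORE);
             free(reqs);
             cudaStreamDestroy(stream);
         }

     The receiver would mirror this with MPI_Irecv calls followed by host-to-device copies of each arriving chunk.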

  7. GPU-Aware MPI Library: MVAPICH2-GPU
     • Standard MPI interfaces used for unified data movement
     • Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
     • Optimizes data movement from GPU memory (handled inside MVAPICH2):
       At Sender:
         MPI_Send(s_devbuf, size, …);
       At Receiver:
         MPI_Recv(r_devbuf, size, …);
       High Performance and High Productivity
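     A minimal two-rank sketch of this usage, assuming a CUDA-aware MVAPICH2 build (buffer names and sizes are illustrative):

         #include <mpi.h>
         #include <cuda_runtime.h>

         int main(int argc, char **argv)
         {
             MPI_Init(&argc, &argv);
             int rank;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);

             const size_t n = 1 << 20;
             char *d_buf;
             cudaMalloc((void **)&d_buf, n);

             if (rank == 0) {
                 cudaMemset(d_buf, 1, n);
                 /* Device pointer passed directly; the library detects it
                  * via UVA and stages or pipelines the transfer itself. */
                 MPI_Send(d_buf, (int)n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
             } else if (rank == 1) {
                 MPI_Recv(d_buf, (int)n, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                          MPI_STATUS_IGNORE);
             }

             cudaFree(d_buf);
             MPI_Finalize();
             return 0;
         }

     GPU-buffer support typically also has to be enabled at run time (for example, MV2_USE_CUDA=1 in MVAPICH2 releases of that era).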

  8. Pipelined Data Movement in MVAPICH2
     • Pipelines data movement from the GPU; overlaps
       – device-to-host CUDA copies
       – inter-process data movement (network transfers or shared-memory copies)
       – host-to-device CUDA copies
     • 45% improvement compared with a naïve implementation (Memcpy+Send)
     • 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend)
     [Chart: Internode osu_latency, large messages (32K to 2M bytes), comparing Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU]

  9. Outline
     • Overview of MVAPICH2-GPU Project
     • GPUDirect RDMA with Mellanox IB adapters
     • Other Optimizations for GPU Communication
     • Support for MPI + OpenACC
     • CUDA and OpenACC extensions in OMB

  10. GPU-Direct RDMA (GDR) with CUDA
     • Network adapter can directly read/write data from/to GPU device memory
     • Avoids copies through the host
     • Fastest possible communication between GPU and IB HCA
     • Allows for better asynchronous communication
     • OFED with GDR support is under development by Mellanox and NVIDIA
     [Figure: CPU, system memory, chipset, GPU with GPU memory, and InfiniBand HCA; the HCA accesses GPU memory directly]

  11. GPU-Direct RDMA (GDR) with CUDA
     • OFED with support for GPUDirect RDMA is under work by NVIDIA and Mellanox
     • OSU has an initial design of MVAPICH2 using GPUDirect RDMA
       – Hybrid design using GPUDirect RDMA and host-based pipelining
         • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
       – Support for communication using multi-rail
       – Support for Mellanox Connect-IB and ConnectX VPI adapters
       – Support for RoCE with Mellanox ConnectX VPI adapters
     [Figure: node with CPU, system memory, chipset, GPU, GPU memory and IB adapter]
     PCIe P2P limits through the chipset:
       SNB E5-2670:    P2P write 5.2 GB/s, P2P read < 1.0 GB/s
       IVB E5-2680V2:  P2P write 6.4 GB/s, P2P read 3.5 GB/s

  12. Performance of MVAPICH2 with GPU-Direct-RDMA: Latency
     GPU-GPU internode MPI latency, comparing 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR
     [Charts: small messages (1 byte to 4K): 67% improvement with GDR, down to 5.49 usec;
      large messages (8K to 2M): 10% improvement]
     Setup: MVAPICH2-2.0b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores,
     NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA,
     CUDA 5.5, Mellanox OFED 2.0 with GPU-Direct-RDMA patch

  13. Performance of MVAPICH2 with GPU-Direct-RDMA: Bandwidth
     GPU-GPU internode MPI uni-directional bandwidth, comparing 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR
     [Charts: small messages (1 byte to 4K): 5x improvement with GDR;
      large messages (8K to 2M): up to 9.8 GB/s]
     Setup: MVAPICH2-2.0b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores,
     NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA,
     CUDA 5.5, Mellanox OFED 2.0 with GPU-Direct-RDMA patch

  14. Performance of MVAPICH2 with GPU-Direct-RDMA: Bi-Bandwidth
     GPU-GPU internode MPI bi-directional bandwidth, comparing 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR
     [Charts: small messages (1 byte to 4K): 4.3x improvement with GDR;
      large messages (8K to 2M): up to 19 GB/s, a 19% improvement]
     Setup: MVAPICH2-2.0b, Intel Ivy Bridge (E5-2680 v2) node with 20 cores,
     NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA,
     CUDA 5.5, Mellanox OFED 2.0 with GPU-Direct-RDMA patch

  15. How can I get started with GDR Experimentation?
     • Two modules are needed:
       – Alpha version of OFED kernel and libraries with GPUDirect RDMA (GDR) support from Mellanox
       – Alpha version of MVAPICH2-GDR from OSU (currently a separate distribution)
     • Send a note to hpc@mellanox.com
     • You will get alpha versions of the GDR driver and MVAPICH2-GDR (based on the MVAPICH2 2.0a release)
     • You can get started with this version
     • The MVAPICH2 team is working on multiple enhancements (collectives, datatypes, one-sided) to exploit the advantages of GDR
     • As the GDR driver matures, successive versions of MVAPICH2-GDR with enhancements will be made available to the community

  16. Outline
     • Overview of MVAPICH2-GPU Project
     • GPUDirect RDMA with Mellanox IB adapters
     • Other Optimizations for GPU Communication
     • Support for MPI + OpenACC
     • CUDA and OpenACC extensions in OMB

  17. Multi-GPU Configurations
     • Multi-GPU node architectures are becoming common
     • Until CUDA 3.2:
       – Communication between processes staged through the host
       – Shared memory (pipelined)
       – Network loopback (asynchronous)
     • CUDA 4.0 and later:
       – Inter-Process Communication (IPC)
       – Host bypass
       – Handled by a DMA engine
       – Low latency and asynchronous
       – Requires creation, exchange and mapping of memory handles - overhead (see the sketch below)
     [Figure: Process 0 and Process 1 on one node, each using a GPU (GPU 0, GPU 1), connected through host memory, the I/O hub and the HCA]
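     A minimal sketch (illustrative, not MVAPICH2's internal code) of the handle creation, exchange and mapping referred to above, assuming exactly two MPI ranks sharing one node:

         #include <mpi.h>
         #include <cuda_runtime.h>

         int main(int argc, char **argv)
         {
             MPI_Init(&argc, &argv);
             int rank;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);
             const size_t n = 1 << 20;

             if (rank == 0) {
                 /* Owner: allocate device memory and create an IPC handle. */
                 void *d_buf;
                 cudaMalloc(&d_buf, n);
                 cudaIpcMemHandle_t handle;
                 cudaIpcGetMemHandle(&handle, d_buf);
                 /* Exchange the handle (not the data) with the peer process. */
                 MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                 MPI_Barrier(MPI_COMM_WORLD);   /* keep the allocation alive */
                 cudaFree(d_buf);
             } else if (rank == 1) {
                 cudaIpcMemHandle_t handle;
                 MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                          MPI_STATUS_IGNORE);
                 /* Map the peer's allocation and copy from it directly,
                  * bypassing the host. */
                 void *d_peer, *d_local;
                 cudaIpcOpenMemHandle(&d_peer, handle, cudaIpcMemLazyEnablePeerAccess);
                 cudaMalloc(&d_local, n);
                 cudaMemcpy(d_local, d_peer, n, cudaMemcpyDeviceToDevice);
                 cudaIpcCloseMemHandle(d_peer);
                 cudaFree(d_local);
                 MPI_Barrier(MPI_COMM_WORLD);
             }

             MPI_Finalize();
             return 0;
         }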

  18. Designs in MVAPICH2 and Performance
     • MVAPICH2 takes advantage of CUDA IPC for MPI communication between GPUs
     • Hides the complexity and overheads of handle creation, exchange and mapping
     • Available in standard releases from MVAPICH2 1.8
     [Charts: SHARED-MEM vs. CUDA IPC designs; intranode osu_latency (small messages): 70% improvement;
      intranode osu_latency (large messages) and intranode osu_bw: 46% and 78% improvements]

  19. Collectives Optimizations in MVAPICH2: Overview
     • Optimizes data movement at the collective level for small messages
     • Pipelines data movement in each send/recv operation for large messages
     • Several collectives have been optimized:
       – Bcast, Gather, Scatter, Allgather, Alltoall, Scatterv, Gatherv, Allgatherv, Alltoallv
     • Collective-level optimizations are completely transparent to the user (see the sketch below)
     • Pipelining can be tuned using point-to-point parameters
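     To illustrate that transparency, a minimal sketch (illustrative names and sizes, not from the slides) of a collective invoked directly on a GPU buffer; a GPU-aware MVAPICH2 recognizes the device pointer via UVA and stages or pipelines the transfer internally, so the call site is identical to the host-buffer case.

         #include <mpi.h>
         #include <cuda_runtime.h>

         int main(int argc, char **argv)
         {
             MPI_Init(&argc, &argv);
             int rank;
             MPI_Comm_rank(MPI_COMM_WORLD, &rank);

             const size_t n = 1 << 22;
             float *d_data;
             cudaMalloc((void **)&d_data, n * sizeof(float));
             if (rank == 0)
                 cudaMemset(d_data, 0, n * sizeof(float));  /* root fills the buffer (zeros, for illustration) */

             /* Device pointer used directly in the collective call. */
             MPI_Bcast(d_data, (int)n, MPI_FLOAT, 0, MPI_COMM_WORLD);

             cudaFree(d_data);
             MPI_Finalize();
             return 0;
         }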
