Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication
Sreeram Potluri*, Hao Wang*, Devendar Bureddy*, Ashish Kumar Singh*, Carlos Rosales+, Dhabaleswar K. Panda*
*Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University
+Texas Advanced Computing Center
Outline
• Motivation
• Problem Statement
• Basics of CUDA IPC
• CUDA IPC based Designs in MVAPICH2
  – Two-sided Communication
  – One-sided Communication
• Experimental Evaluation
• Conclusion and Future Work
GPUs for HPC
• GPUs are becoming a common component of modern clusters – higher compute density and performance/watt
• 3 of the top 5 systems in the latest Top500 list use GPUs
• Increasing number of HPC workloads are being ported to GPUs – many of these use MPI
• MPI libraries are being extended to support communication from GPU device memory
MVAPICH/MVAPICH2 for GPU Clusters
• Earlier:
  – At Sender: cudaMemcpy (sbuf, sdev); MPI_Send (sbuf, . . . )
  – At Receiver: MPI_Recv (rbuf, . . . ); cudaMemcpy (rdev, rbuf)
• Now (see the sketch below):
  – At Sender: MPI_Send (sdev, . . . )
  – At Receiver: MPI_Recv (rdev, . . . )
(Figure: node with CPU, GPU, PCIe, NIC and network switch; MVAPICH2 moves the data internally)
• Efficiently overlaps copies over the PCIe with RDMA transfers over the network
• Allows us to select efficient algorithms for MPI collectives and MPI datatype processing
• Available with MVAPICH2 v1.8 (http://mvapich.cse.ohio-state.edu)
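A minimal sketch of the two usage models above, assuming a CUDA-aware MVAPICH2 1.8 style build; the buffer names, element count and peer rank are illustrative:

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Earlier: stage the GPU buffer through host memory around the MPI calls. */
void send_staged(const double *sdev, double *sbuf, int n, int peer)
{
    cudaMemcpy(sbuf, sdev, n * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(sbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
}

void recv_staged(double *rdev, double *rbuf, int n, int peer)
{
    MPI_Recv(rbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(rdev, rbuf, n * sizeof(double), cudaMemcpyHostToDevice);
}

/* Now: pass device pointers directly; the library pipelines the PCIe
 * copies with the network (or intra-node) transfer internally. */
void send_direct(const double *sdev, int n, int peer)
{
    MPI_Send((void *)sdev, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
}

void recv_direct(double *rdev, int n, int peer)
{
    MPI_Recv(rdev, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```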
Motivation
• Multi-GPU node architectures are becoming common
• Until CUDA 3.2, communication between processes was staged through the host:
  – Shared memory (pipelined)
  – Network loopback (asynchronous)
• CUDA 4.0 introduced Inter-Process Communication (IPC):
  – Host bypass, handled by a DMA engine
  – Low latency and asynchronous
  – Requires creation, exchange and mapping of memory handles – overhead
(Figure: two processes on one node, each driving its own GPU; GPU 0, GPU 1 and the HCA attach to the I/O hub, with host memory behind the CPU)
Comparison of Costs
• Comparison of bare copy costs between two processes on one node, each using a different GPU (outside MPI), for 8-byte messages:
  – Copy via host: 49 usec
  – CUDA IPC copy + handle creation & mapping overhead: 228 usec
  – CUDA IPC copy alone: 3 usec
Problem Statement
• Can we take advantage of CUDA IPC to improve the performance of MPI communication between GPUs on a node?
• How do we address the memory handle creation and mapping overheads?
• What kind of performance do the different MPI communication semantics deliver with CUDA IPC?
  – Two-sided semantics
  – One-sided semantics
• How do CUDA IPC based designs impact the performance of end-applications?
Basics of CUDA IPC
• Process 0 (owner of sbuf_ptr):
  – cuMemGetAddressRange (&base_ptr, sbuf_ptr) to find the base of the allocation
  – cudaIpcGetMemHandle (&handle, base_ptr)
  – cudaIpcGetEventHandle (&event_handle, event)
  – send the IPC handles to Process 1
• Process 1 (owner of rbuf_ptr):
  – cudaIpcOpenMemHandle (&base_ptr, handle)
  – cudaIpcOpenEventHandle (&ipc_event, event_handle)
  – cudaMemcpy (rbuf_ptr, base_ptr + displ)
  – cudaEventRecord (ipc_event)
• Process 0: cudaStreamWaitEvent (0, event) before any other CUDA calls that can modify sbuf
• The IPC memory handle must be closed at Process 1 before the buffer is freed at Process 0 (a minimal end-to-end sketch follows)
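A self-contained sketch of this flow for two MPI ranks on one node, one GPU each. The handle exchange over MPI, the buffer size and the tags are illustrative, and sbuf is assumed to be the base of its allocation (otherwise cuMemGetAddressRange supplies the base):

```c
#include <mpi.h>
#include <cuda_runtime.h>

#define NBYTES (1 << 20)

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank);                      /* one GPU per process */

    if (rank == 0) {                          /* owner of sbuf */
        char *sbuf;
        cudaMalloc((void **)&sbuf, NBYTES);
        cudaEvent_t event;
        cudaEventCreateWithFlags(&event,
                cudaEventDisableTiming | cudaEventInterprocess);

        /* Create the IPC handles and ship them to the peer. */
        cudaIpcMemHandle_t mh;
        cudaIpcEventHandle_t eh;
        cudaIpcGetMemHandle(&mh, sbuf);
        cudaIpcGetEventHandle(&eh, event);
        MPI_Send(&mh, sizeof(mh), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&eh, sizeof(eh), MPI_BYTE, 1, 1, MPI_COMM_WORLD);

        /* Before any CUDA call that may modify sbuf, wait on the event
         * that the peer records after issuing its copy. */
        MPI_Recv(NULL, 0, MPI_BYTE, 1, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaStreamWaitEvent(0, event, 0);

        MPI_Barrier(MPI_COMM_WORLD);          /* peer has closed its mapping */
        cudaFree(sbuf);
        cudaEventDestroy(event);
    } else {                                  /* owner of rbuf */
        char *rbuf;
        cudaMalloc((void **)&rbuf, NBYTES);
        cudaIpcMemHandle_t mh;
        cudaIpcEventHandle_t eh;
        MPI_Recv(&mh, sizeof(mh), MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&eh, sizeof(eh), MPI_BYTE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        void *base;
        cudaEvent_t ipc_event;
        cudaIpcOpenMemHandle(&base, mh, cudaIpcMemLazyEnablePeerAccess);
        cudaIpcOpenEventHandle(&ipc_event, eh);

        /* Direct GPU-to-GPU copy from the mapped remote buffer. */
        cudaMemcpyAsync(rbuf, base, NBYTES, cudaMemcpyDeviceToDevice, 0);
        cudaEventRecord(ipc_event, 0);
        MPI_Send(NULL, 0, MPI_BYTE, 0, 2, MPI_COMM_WORLD);

        cudaStreamSynchronize(0);             /* copy complete locally */
        cudaIpcCloseMemHandle(base);          /* close before the owner frees sbuf */
        MPI_Barrier(MPI_COMM_WORLD);
        cudaFree(rbuf);
    }
    MPI_Finalize();
    return 0;
}
```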
Design of Two-sided Communication
• MPI communication costs: synchronization and data movement
• Small message communication:
  – minimize synchronization overheads
  – pair-wise eager buffers for host-host communication
  – associated pair-wise IPC buffers on the GPU
  – synchronization using CUDA events
• Large message communication:
  – minimize the number of copies – rendezvous protocol
  – minimize memory mapping overheads using a mapping cache (see the sketch below)
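As a rough illustration of the mapping-cache idea, the hypothetical sketch below keeps opened IPC mappings keyed by the remote handle, so cudaIpcOpenMemHandle is paid only on the first transfer from a given remote region (the actual MVAPICH2 data structures and eviction policy differ):

```c
#include <cuda_runtime.h>
#include <string.h>

#define IPC_CACHE_SLOTS 64

typedef struct {
    int                valid;
    cudaIpcMemHandle_t handle;       /* handle received from the peer    */
    void              *mapped_base;  /* local mapping of the remote base */
} ipc_cache_entry_t;

static ipc_cache_entry_t ipc_cache[IPC_CACHE_SLOTS];

/* Return a local pointer for a remote buffer described by (handle, displ),
 * mapping it with cudaIpcOpenMemHandle only on first use. */
void *ipc_map_cached(cudaIpcMemHandle_t handle, size_t displ)
{
    int free_slot = -1;
    for (int i = 0; i < IPC_CACHE_SLOTS; i++) {
        if (ipc_cache[i].valid &&
            memcmp(&ipc_cache[i].handle, &handle, sizeof(handle)) == 0)
            return (char *)ipc_cache[i].mapped_base + displ;   /* cache hit */
        if (!ipc_cache[i].valid && free_slot < 0)
            free_slot = i;
    }

    /* Cache miss: pay the mapping overhead once (eviction and error
     * handling omitted in this sketch). */
    void *base = NULL;
    cudaIpcOpenMemHandle(&base, handle, cudaIpcMemLazyEnablePeerAccess);
    if (free_slot >= 0) {
        ipc_cache[free_slot].valid       = 1;
        ipc_cache[free_slot].handle      = handle;
        ipc_cache[free_slot].mapped_base = base;
    }
    return (char *)base + displ;
}
```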
Design of One-sided Communication
• Separates communication from synchronization
• Window: region of memory a process exposes for remote access
• Communication calls: put, get, accumulate
• Synchronization calls:
  – active – fence, post-wait/start-complete
  – passive – lock-unlock
  – the period between two synchronization calls is a communication epoch
• IPC memory handles are created and mapped during window creation
• Put/Get implemented as cudaMemcpyAsync
• Synchronization using CUDA events (see the sketch below)
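A sketch of what this model looks like from the application, assuming the MPI library's GPU-aware one-sided support so the window can be created over a device buffer; buffer names, sizes and the peer rank are illustrative:

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Expose a device buffer through a window and read the peer's copy with
 * MPI_Get inside a fence epoch (active synchronization). */
void one_sided_get(size_t nbytes, int peer)
{
    char *win_dev, *local_dev;
    cudaMalloc((void **)&win_dev, nbytes);   /* exposed through the window */
    cudaMalloc((void **)&local_dev, nbytes); /* destination of the Get     */

    MPI_Win win;
    MPI_Win_create(win_dev, (MPI_Aint)nbytes, 1 /* disp_unit */,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                   /* open the epoch  */
    MPI_Get(local_dev, (int)nbytes, MPI_BYTE,
            peer, 0, (int)nbytes, MPI_BYTE, win);
    MPI_Win_fence(0, win);                   /* close the epoch */

    MPI_Win_free(&win);
    cudaFree(win_dev);
    cudaFree(local_dev);
}
```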
Experimental Setup
• Intel Westmere node
  – 2 NVIDIA Tesla C2075 GPUs
  – Red Hat Linux 5.8 and CUDA Toolkit 4.1
• MVAPICH/MVAPICH2 – High Performance MPI Library for IB, 10GigE/iWARP and RoCE
  – Available since 2002
  – Used by more than 1,930 organizations (HPC centers, industry and universities) in 68 countries
  – More than 111,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 5th ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • 7th ranked 111,104-core cluster (Pleiades) at NASA
    • 25th ranked 62,976-core cluster (Ranger) at TACC
  – http://mvapich.cse.ohio-state.edu
Two-sided Communication Performance
(Plots: small-message latency, large-message latency and bandwidth, SHARED-MEM vs. CUDA IPC)
• Small message latency: 46% improvement
• Large message latency: 70% improvement
• Bandwidth: 78% improvement
• Considerable improvement in MPI performance due to host bypass
One-sided Communication Performance (get + active synchronization vs. send/recv)
(Plots: latency and bandwidth, SHARED-MEM-1SC vs. CUDA-IPC-1SC vs. CUDA-IPC-2SC)
• Large message latency: 30% improvement
• Bandwidth: 27% improvement
• Better performance compared to two-sided semantics
One-sided Communication Performance (get + passive synchronization)
(Plot: latency vs. target busy loop duration, SHARED-MEM vs. CUDA IPC)
• Lock + 8 Gets + Unlock with the target in a busy loop (128KB messages)
• CUDA IPC enables true asynchronous progress (see the benchmark sketch below)
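A rough sketch of this benchmark; the window setup, origin_dev and the spin helper are illustrative. The origin issues Lock + 8 Gets + Unlock on 128 KB messages while the target spins without entering MPI, so any progress the origin makes must be truly asynchronous:

```c
#include <mpi.h>
#include <sys/time.h>

#define MSG_SIZE (128 * 1024)
#define NUM_GETS 8

static double now_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

/* Spin for the given time without making any MPI calls. */
static void busy_loop(double usec)
{
    double start = now_usec();
    while (now_usec() - start < usec)
        ;
}

/* rank 0 = origin, rank 1 = target; win exposes a GPU buffer at the target
 * and origin_dev is a GPU buffer at the origin (both set up elsewhere). */
void passive_get_bench(int rank, MPI_Win win, char *origin_dev,
                       double target_busy_usec)
{
    if (rank == 0) {
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        for (int i = 0; i < NUM_GETS; i++)
            MPI_Get(origin_dev + (size_t)i * MSG_SIZE, MSG_SIZE, MPI_BYTE,
                    1, (MPI_Aint)i * MSG_SIZE, MSG_SIZE, MPI_BYTE, win);
        MPI_Win_unlock(1, win);       /* completes all Gets at the origin */
    } else {
        busy_loop(target_busy_usec);  /* target makes no MPI progress calls */
    }
}
```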
Lattice Boltzmann Method
(Plot: LB step latency for datasets of 256x256x64, 256x512x64 and 512x512x64 per GPU; 2SIDED-SHARED-MEM vs. 2SIDED-IPC vs. 1SIDED-IPC)
• Computational fluid dynamics code with support for multi-phase flows with large density ratios
• Modified to use MPI communication from GPU device memory – one-sided and two-sided semantics
• Up to 16% improvement in per-step latency
Conclusion and Future Work
• Take advantage of CUDA IPC to improve MPI communication between GPUs on a node
• 70% improvement in latency and 78% improvement in bandwidth for two-sided communication
• One-sided communication gives better performance and allows for truly asynchronous communication
• 16% improvement in execution time of a Lattice Boltzmann Method code
• Future work: studying the impact on other applications while exploiting computation-communication overlap
• Future work: exploring efficient designs for inter-node one-sided communication on GPU clusters
Thank You!
{potluri, wangh, bureddy, singhas, panda}@cse.ohio-state.edu, carlos@tacc.utexas.edu
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/