MVAPICH2-GPU: Op0mized GPU to GPU Communica0on for - PowerPoint PPT Presentation

MVAPICH2-‑GPU: ¡Op0mized ¡GPU ¡to ¡GPU ¡ Communica0on ¡for ¡InfiniBand ¡Clusters ¡ ¡ ¡ H. ¡Wang, ¡S. ¡Potluri, ¡M. ¡Luo, ¡A. ¡K. ¡Singh, ¡S. ¡Sur ¡ D. ¡K. ¡Panda ¡ ¡ Network-‑Based ¡Compu0ng ¡Laboratory ¡ The ¡Ohio ¡State ¡University ¡ ISC ¡2011 ¡Hamburg ¡ 1

Outline ¡ • Introduc0on ¡ • Problem ¡Statement ¡ • Our ¡Solu0on: ¡MVAPICH2-‑GPU ¡ ¡ • Design ¡Considera0ons ¡ • Performance ¡Evalua0on ¡ • Conclusion ¡& ¡Future ¡Work ¡ ISC 2011 Hamburg 2

InfiniBand ¡Clusters ¡in ¡Top500 ¡ • Percentage ¡share ¡of ¡InfiniBand ¡is ¡steadily ¡increasing ¡ ¡ • 41% ¡of ¡systems ¡in ¡TOP ¡500 ¡using ¡InfiniBand ¡(June ¡’11) ¡ • 61% ¡of ¡systems ¡in ¡TOP ¡100 ¡using ¡InfiniBand ¡(June ¡‘11) ¡ ISC 2011 Hamburg 3

Growth ¡in ¡GPGPUs ¡ • GPGPUs ¡are ¡gaining ¡significance ¡on ¡clusters ¡for ¡data-‑centric ¡ applica0ons: ¡ – Word ¡Occurrence, ¡Sparse ¡Integer ¡Occurrence ¡ – K-‑means ¡clustering, ¡Linear ¡regression ¡ • GPGPUs ¡+ ¡InfiniBand ¡are ¡gaining ¡momentum ¡for ¡large ¡clusters ¡ – #2 ¡(Tianhe-‑1A), ¡#4 ¡(Nebulae) ¡and ¡#5 ¡(Tsubame) ¡Petascale ¡systems ¡ • GPGPUs ¡programming ¡ – CUDA ¡or ¡OpenCL ¡+ ¡MPI ¡ – Dr. ¡Sumit ¡Gupta ¡briefed ¡industry ¡users ¡at ¡NVIDIA ¡mee0ng ¡yesterday ¡on ¡ programmability ¡advances ¡on ¡GPUs ¡ • Big ¡issues: ¡performance ¡of ¡data ¡movement ¡ ¡ – Latency ¡ – Bandwidth ¡ – Overlap ¡ ¡ ¡ ISC 2011 Hamburg 4

Data ¡movement ¡in ¡GPU ¡clusters ¡ IB IB Main Main GPU GPU Memory Memory Adapter Adapter PCI-E PCI-E PCI-E PCI-E PCI-E Hub IB Network PCI-E Hub • Data ¡movement ¡in ¡InfiniBand ¡clusters ¡with ¡GPUs ¡ – CUDA: ¡Device ¡memory ¡ à ¡Main ¡memory ¡ ¡[at ¡source ¡process] ¡ – MPI: ¡Source ¡rank ¡ à ¡Des0na0on ¡process ¡ – CUDA: ¡Main ¡memory ¡ à ¡Device ¡memory ¡ ¡[at ¡des0na0on ¡process] ¡ • GPU ¡and ¡InfiniBand ¡require ¡separate ¡memory ¡registra0on ¡ ISC 2011 Hamburg 5

GPU ¡Direct ¡ without GPU Direct with GPU Direct • Collabora0on ¡between ¡Mellanox ¡and ¡NVIDIA ¡to ¡converge ¡on ¡ one ¡memory ¡registra0on ¡technique ¡ • Both ¡devices ¡can ¡register ¡same ¡host ¡memory: ¡ – GPU ¡and ¡network ¡adapters ¡can ¡access ¡the ¡buffer ¡ ISC 2011 Hamburg 6

Problem ¡Statement ¡ • Data ¡movement ¡from/to ¡GPGPUs ¡ – Performance ¡bojleneck ¡ – Reduced ¡programmer ¡produc0vity ¡ • Hard ¡to ¡op0mize ¡at ¡the ¡applica0on ¡level ¡ – CUDA ¡and ¡MPI ¡exper0se ¡required ¡for ¡efficient ¡implementa0on ¡ – Hardware ¡dependent ¡latency ¡characteris0cs ¡ – Hard ¡to ¡support ¡and ¡op0mize ¡collec0ves ¡ – Hard ¡to ¡support ¡advanced ¡features ¡like ¡one-‑sided ¡communica0on ¡ ISC 2011 Hamburg 8

MVAPICH2-‑GPU: ¡Design ¡Goals ¡ • Support ¡GPU ¡to ¡GPU ¡communica0on ¡through ¡standard ¡MPI ¡ interfaces ¡ – e.g. ¡enable ¡MPI_Send, ¡MPI_Recv ¡from/to ¡GPU ¡memory ¡ • Provide ¡high ¡performance ¡without ¡exposing ¡low ¡level ¡details ¡ to ¡the ¡programmer ¡ – Pipelined ¡data ¡transfer ¡which ¡ automa:cally ¡provides ¡op0miza0ons ¡ inside ¡MPI ¡library ¡without ¡user ¡tuning ¡ • Available ¡to ¡work ¡with ¡ ¡ – GPU ¡Direct ¡ – Without ¡GPU ¡Direct ¡ ¡ ISC 2011 Hamburg 10

Sample ¡Code ¡-‑ ¡without ¡MPI ¡integra0on ¡ • Naïve ¡implementa0on ¡with ¡MPI ¡and ¡CUDA ¡ At ¡Sender: ¡ ¡ ¡cudaMemcpy(s_buf, ¡s_device, ¡size, ¡cudaMemcpyDeviceToHost); ¡ ¡MPI_Send(s_buf, ¡size, ¡MPI_CHAR, ¡1, ¡1, ¡MPI_COMM_WORLD); ¡ ¡ At ¡Receiver: ¡ ¡MPI_Recv(r_buf, ¡size, ¡MPI_CHAR, ¡0, ¡1, ¡MPI_COMM_WORLD, ¡&req); ¡ ¡cudaMemcpy(r_device, ¡r_buf, ¡size, ¡cudaMemcpyHostToDevice); ¡ • High ¡produc:vity ¡but ¡poor ¡performance ¡ ISC 2011 Hamburg 11

Sample ¡Code ¡– ¡User ¡op0mized ¡code ¡ ¡ • Pipelining ¡at ¡user ¡level ¡with ¡non-‑blocking ¡MPI ¡and ¡CUDA ¡interfaces ¡ • Code ¡repeated ¡at ¡receiver ¡side ¡ ¡ • Good ¡performance ¡but ¡poor ¡produc:vity ¡ At ¡Sender: ¡ ¡ ¡ ¡ ¡for ¡(j ¡= ¡0; ¡j ¡< ¡pipeline_len; ¡j++) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡cudaMemcpyAsync(s_buf ¡+ ¡j ¡* ¡block_sz, ¡s_device ¡+ ¡j ¡* ¡block_sz, ¡…); ¡ ¡ ¡ ¡for ¡(j ¡= ¡0; ¡j ¡< ¡pipeline_len; ¡j++) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡while ¡(result ¡!= ¡cudaSucess) ¡{ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡result ¡= ¡cudaStreamQuery(…); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡if(j ¡> ¡0) ¡MPI_Test(…); ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡MPI_Isend(s_buf ¡+ ¡j ¡* ¡block_sz, ¡block_sz, ¡MPI_CHAR, ¡1, ¡1, ¡….); ¡ ¡ ¡ ¡ ¡ ¡} ¡ ¡ ¡ ¡ ¡MPI_Waitall(); ¡ ISC 2011 Hamburg 12

Sample ¡Code ¡– ¡MVAPICH2-‑GPU ¡ • MVAPICH2-‑GPU: ¡provides ¡standard ¡MPI ¡interfaces ¡for ¡GPU ¡ At ¡Sender: ¡ ¡ ¡ ¡ ¡MPI_Send(s_device, ¡size, ¡…); ¡// ¡s_device ¡is ¡data ¡buffer ¡in ¡GPU ¡ ¡ ¡ ¡ At ¡Receiver: ¡ ¡ ¡ ¡ ¡ ¡MPI_Recv(r_device, ¡size, ¡…); ¡// ¡r_device ¡is ¡data ¡buffer ¡in ¡GPU ¡ • High productivity and high performance! ISC 2011 Hamburg 13

Design ¡considera0ons ¡ • Memory ¡detec0on ¡ – CUDA ¡4.0 ¡introduces ¡ Unified ¡Virtual ¡Addressing ¡(UVA) ¡ ¡ – MPI ¡library ¡can ¡differen0ate ¡between ¡device ¡memory ¡and ¡ host ¡memory ¡without ¡any ¡hints ¡from ¡the ¡user ¡ • Overlap ¡CUDA ¡copy ¡and ¡RDMA ¡transfer ¡ – Pipeline ¡DMA ¡of ¡data ¡from ¡GPU ¡and ¡InfiniBand ¡RDMA ¡ – Allow ¡for ¡progressing ¡DMAs ¡individual ¡data ¡chunks ¡ ISC 2011 Hamburg 15

Pipelined ¡Design ¡ MPI_Send ¡ MPI_Recv ¡ GPU ¡Device ¡ Host ¡Main ¡ Host ¡Main ¡ GPU ¡Device ¡ Memory ¡ Memory ¡ Memory ¡ Memory ¡ RTS/CTS ¡ cudaMemcpy ¡ Async() ¡ … ¡ RDMA ¡Write ¡ & ¡Finish ¡MSG ¡ cudaStream ¡ Query() ¡ … ¡ cudaMemcpy ¡ … ¡ Async() ¡ … ¡ cudaStream ¡ Query() ¡ … ¡ with GPU-Direct – Data ¡is ¡divided ¡into ¡chunks ¡ – Pipeline ¡CUDA ¡copies ¡with ¡RDMA ¡transfers ¡ – If ¡system ¡does ¡not ¡have ¡GPU-‑Direct, ¡an ¡extra ¡copy ¡is ¡required ¡ ISC 2011 Hamburg 16

Pipeline ¡Design ¡(Cont.) ¡ • Chunk ¡size ¡depends ¡on ¡CUDA ¡copy ¡cost ¡and ¡RDMA ¡latency ¡over ¡ the ¡network ¡ • Automa0c ¡tuning ¡of ¡chunk ¡size ¡ – Detects ¡CUDA ¡copy ¡and ¡RDMA ¡latencies ¡during ¡installa0on ¡ – Chunk ¡size ¡can ¡be ¡stored ¡in ¡configura0on ¡file ¡(mvapich.conf) ¡ • User ¡transparent ¡to ¡deliver ¡the ¡best ¡performance ¡ ¡ ISC 2011 Hamburg 17

MVAPICH2-GPU: Op0mized GPU to GPU Communica0on for - PowerPoint PPT Presentation

MVAPICH2-GPU: Op0mized GPU to GPU Communica0on for InfiniBand Clusters H. Wang, S. Potluri, M. Luo, A. K. Singh, S. Sur D. K. Panda

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC13 NVIDIA Booth by D.K. Panda The Ohio

Consumer / Producer Communica0on with Applica0on Level Framing

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning GPU Technology Conference GTC 2017

MVAPICH2-GDR: High-Performance and Scalable CUDA-Aware MPI Library for HPC and AI GPU Technology

MVAPICH2-GDR: Pushing the Frontier of MPI Libraries Enabling GPUDirect Technologies GPU

MVAPICH2-GDR: Pushing the Frontier of Designing MPI Libraries Enabling GPUDirect Technologies GPU

Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 H. Wang, S.

Building Efficient HPC Clouds with MVAPICH2 and OpenStack over SR-IOV-enabled Heterogeneous

MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds Jie Zhang, Xiaoyi

Exploiting Computation and Communication Overlap in MVAPICH2 MPI Library Keynote Talk at Charm++

Status of GPU offloading on Wayland Axel Davy FOSDEM 2014 Status of GPU offloading on Wayland

Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? Computational power of GPU vs.

UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 1 HETEROGENEOUS

Advancements in V-Ray RT GPU Vlado Koylazov, CTO & Co-founder Blagovest Taskov, RT GPU Team

Use Tesla to provide first GPU VM Service in China Feng Zhu

THEIA GPU Open Source multicore programmable GPU Problem Statement Develop an open source 3D

PCI: A Four-Letter Word of E-Commerce or: How I Learned to Stop Worrying and Love the Standard

CS6453 Data-Intensive Systems: Technology trends, Emerging challenges & opportuni=es Rachit

PAWP A Power-Aware Web Proxy for WLAN Clients Marcel Ro u, Michael Olsen, and Chandra

Towards a library of formalised undecidable problems in Coq: The undecidability of intuitionistic

Whats new in Virtio 1.1? and surrounding areas Jens Freimann Red Hat FOSDEM 2018 What

Performing with CUDA W. B. Langdon CREST lab, Department of Computer Science Pages 423-430

Big Picture Interrupts Processor IC220 Set #11: Cache Storage and I/O Memory- I/O bus Main

Wave Relay System and Wave Relay System and General Project Details General Project Details

MVAPICH2-GPU: Op0mized GPU to GPU Communica0on for - PowerPoint PPT Presentation

MVAPICH2-GPU: Op0mized GPU to GPU Communica0on for InfiniBand Clusters H. Wang, S. Potluri, M. Luo, A. K. Singh, S. Sur D. K. Panda

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC13 NVIDIA Booth by D.K. Panda The Ohio

Consumer / Producer Communica0on with Applica0on Level Framing

MVAPICH2-GDR: Pushing the Frontier of HPC and Deep Learning GPU Technology Conference GTC 2017

MVAPICH2-GDR: High-Performance and Scalable CUDA-Aware MPI Library for HPC and AI GPU Technology

MVAPICH2-GDR: Pushing the Frontier of MPI Libraries Enabling GPUDirect Technologies GPU

MVAPICH2-GDR: Pushing the Frontier of Designing MPI Libraries Enabling GPUDirect Technologies GPU

Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 H. Wang, S.

Building Efficient HPC Clouds with MVAPICH2 and OpenStack over SR-IOV-enabled Heterogeneous

MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds Jie Zhang, Xiaoyi

Exploiting Computation and Communication Overlap in MVAPICH2 MPI Library Keynote Talk at Charm++

Status of GPU offloading on Wayland Axel Davy FOSDEM 2014 Status of GPU offloading on Wayland

Motivation to Learn GPGPU Julius Parulek Why to Learn About GPU? Computational power of GPU vs.

UNIFIED MEMORY ON PASCAL AND VOLTA Nikolay Sakharnykh - May 10, 2017 1 HETEROGENEOUS

Advancements in V-Ray RT GPU Vlado Koylazov, CTO &amp; Co-founder Blagovest Taskov, RT GPU Team

Use Tesla to provide first GPU VM Service in China Feng Zhu

THEIA GPU Open Source multicore programmable GPU Problem Statement Develop an open source 3D

PCI: A Four-Letter Word of E-Commerce or: How I Learned to Stop Worrying and Love the Standard

CS6453 Data-Intensive Systems: Technology trends, Emerging challenges &amp; opportuni=es Rachit

PAWP A Power-Aware Web Proxy for WLAN Clients Marcel Ro u, Michael Olsen, and Chandra

Towards a library of formalised undecidable problems in Coq: The undecidability of intuitionistic

Whats new in Virtio 1.1? and surrounding areas Jens Freimann Red Hat FOSDEM 2018 What

Performing with CUDA W. B. Langdon CREST lab, Department of Computer Science Pages 423-430

Big Picture Interrupts Processor IC220 Set #11: Cache Storage and I/O Memory- I/O bus Main

Wave Relay System and Wave Relay System and General Project Details General Project Details

Advancements in V-Ray RT GPU Vlado Koylazov, CTO & Co-founder Blagovest Taskov, RT GPU Team

CS6453 Data-Intensive Systems: Technology trends, Emerging challenges & opportuni=es Rachit