April 4-7, 2016 | Silicon Valley
STATE OF GPUDIRECT TECHNOLOGIES
Davide Rossetti(*), Sreeram Potluri, David Fontaine
AGENDA
• GPUDirect overall
• SW architecture
• GPUDirect Async
• CUDA Async APIs
• Outlook
GPUDIRECT FAMILY [1]
• GPUDirect Shared GPU-Sysmem, for optimized inter-node copy
• GPUDirect P2P, for intra-node:
  - accelerated GPU-GPU memcpy
  - inter-GPU direct load/store access
• GPUDirect RDMA [2], for optimized inter-node communication
• GPUDirect Async, for optimized inter-node communication
[1] developer info: https://developer.nvidia.com/gpudirect
[2] http://docs.nvidia.com/cuda/gpudirect-rdma
GPUDIRECT IN THE CAVE
CERN's NA62 experiment "probes decays of the charged kaon" [*]
[*] http://apegate.roma1.infn.it/mediawiki/index.php/NaNet_overview
GPUDIRECT SCOPES
• GPUDirect P2P → data plane: GPUs both master and slave
• GPUDirect RDMA → data plane: GPU slave, 3rd party device master
• GPUDirect Async → control plane: GPU & 3rd party device both master & slave
[Diagram: GPUDirect RDMA/P2P on the data plane, GPUDirect Async on the control plane, between GPU, host, and 3rd party device]
GPUDIRECT SCOPES (2)
• GPUDirect RDMA & Async: over PCIe, for low latency
• GPUDirect P2P: over PCIe; over NVLink (Pascal only)
[Diagram: GPU, 3rd party device, and CPU connected through a PCIe switch]
GPUDIRECT RDMA ON PASCAL
peak results, optimal PCIe fabric
[Chart: RDMA read and RDMA write bandwidth (GB/s) by GPU family, GK110 vs P100]
GPUDIRECT P2P ON PASCAL
early results, P2P thru NVLink
[Chart: Open MPI intra-node GPU-to-GPU point-to-point bandwidth (MB/s) vs message size, 4KB to 4MB, peaking at 17.9 GB/s]
ASYNC: MOTIVATION
VISUAL PROFILE - TRADITIONAL
[Profiler timeline: time marked for one step; domain size/GPU = 1024, boundary = 16, ghost width = 1]
VISUAL PROFILE - TRADITIONAL, CPU bounded
[Profiler timeline: time marked for one step; domain size/GPU = 128, boundary = 16, ghost width = 1]
SW ARCHITECTURE
GPUDIRECT SW ECOSYSTEM
[Stack diagram]
• applications, benchmarks
• MVAPICH2, Open MPI
• user-mode: CUDA RT, CUDA driver | IB verbs
• kernel-mode: NV display driver, IB core driver, nv_peer_mem extensions[*] for RDMA, cxgb4, mlx5
• HW: GPU, HCA
(legend: proprietary, open-source, mixed components)
[*] MLNX OFED, Chelsio www.openfabrics.org/~swise/ofed-3.12-1-peer-direct/OFED-3.12-1-peer-direct-20150330-1122.tgz
EXTENDED STACK
[Stack diagram, adding the Async components]
• applications, benchmarks
• MVAPICH2, Open MPI, libmp
• user-mode: CUDA RT, libgdsync, CUDA driver extensions for Async | IB verbs extensions for Async
• kernel-mode: NV display driver, IB core driver, nv_peer_mem extensions[*] for RDMA/Async, cxgb4, mlx5 ext. for Async
• HW: GPU, HCA
(legend: proprietary, open-source, Async, mixed components)
[*] MLNX OFED, Chelsio www.openfabrics.org/~swise/ofed-3.12-1-peer-direct/OFED-3.12-1-peer-direct-20150330-1122.tgz
GPUDIRECT ASYNC + INFINIBAND
preview release of components
• CUDA Async extensions: preview in CUDA 8.0 EA
• Peer-direct async extension: in MLNX OFED 3.x, soon
• libgdsync: on github.com/gpudirect, soon
• libmp: on github.com/gpudirect, soon
ASYNC: APIS
GPUDIRECT ASYNC
expose the GPU front-end unit
• CPU prepares the work plan: hardly parallelizable, branch intensive
• GPU orchestrates the flow
• Runs on the optimized front-end unit: the same one scheduling GPU work, now also scheduling network communications
[Diagram: GPU front-end unit and compute engines]
STREAM MEMORY OPERATIONS
guarantee memory consistency for RDMA

polling on a 32-bit word:
  CUresult cuStreamWaitValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);
  flags: CU_STREAM_WAIT_VALUE_GEQ = 0x0, CU_STREAM_WAIT_VALUE_EQ = 0x1, CU_STREAM_WAIT_VALUE_AND = 0x2, CU_STREAM_WAIT_VALUE_FLUSH = 1<<30

32-bit word write:
  CUresult cuStreamWriteValue32(CUstream stream, CUdeviceptr addr, cuuint32_t value, unsigned int flags);
  flags: CU_STREAM_WRITE_VALUE_NO_MEMORY_BARRIER = 0x1

low-overhead batched work submission:
  CUresult cuStreamBatchMemOp(CUstream stream, unsigned int count, CUstreamBatchMemOpParams *paramArray, unsigned int flags);
  op types: CU_STREAM_MEM_OP_WAIT_VALUE_32 = 1, CU_STREAM_MEM_OP_WRITE_VALUE_32 = 2, CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES = 3
STREAM MEMORY OPERATIONS
example: CPU and GPU front-end unit synchronizing through a flag in host memory (h_flag on the CPU side, d_flag its device-pointer alias)

  *(volatile uint32_t*)h_flag = 0;
  ...
  cuStreamWaitValue32(stream, d_flag, 1, CU_STREAM_WAIT_VALUE_EQ);  // (1) stream waits until the flag equals 1
  calc_kernel<<<GSZ,BSZ,0,stream>>>();
  cuStreamWriteValue32(stream, d_flag, 2, 0);                       // (2) stream writes 2 to the flag after the kernel
  ...
  *(volatile uint32_t*)h_flag = 1;                                  // CPU releases the waiting stream
  ...
  cudaStreamSynchronize(stream);
  assert(*(volatile uint32_t*)h_flag == 2);                         // (3) flag was updated by the GPU
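A minimal setup sketch (an assumption, not shown on the slide): h_flag/d_flag above can come from mapped pinned host memory via cuMemHostAlloc plus cuMemHostGetDevicePointer; the next slide covers registering pre-existing host memory with cuMemHostRegister instead.

  #include <cuda.h>
  #include <stdint.h>

  static volatile uint32_t *h_flag;   // CPU-side view of the flag
  static CUdeviceptr        d_flag;   // device-pointer alias passed to the stream mem ops

  static void setup_flag(void)
  {
      void *p;
      // mapped pinned allocation: visible to both the CPU and the GPU front-end unit
      cuMemHostAlloc(&p, sizeof(uint32_t), CU_MEMHOSTALLOC_DEVICEMAP);
      h_flag  = (volatile uint32_t *)p;
      *h_flag = 0;
      cuMemHostGetDevicePointer(&d_flag, p, 0);
  }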
GPUDIRECT ASYNC APIs
features
• batching multiple consecutive mem ops saves ~1us per op: use cuStreamBatchMemOp (see the sketch below)
• APIs accept device pointers:
  - memory needs registration (cuMemHostRegister)
  - device pointer retrieval (cuMemHostGetDevicePointer)
• 3rd party device PCIe resources (aka BARs):
  - assumed physically contiguous & uncached
  - special flag needed
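A minimal sketch of the batched path referenced above, assuming d_wait_flag and d_done_flag are device pointers set up as on the previous slides (hypothetical names); the CUstreamBatchMemOpParams union follows the CUDA driver API.

  #include <cuda.h>
  #include <string.h>

  static void post_wait_then_signal(CUstream stream,
                                    CUdeviceptr d_wait_flag,
                                    CUdeviceptr d_done_flag)
  {
      CUstreamBatchMemOpParams ops[2];
      memset(ops, 0, sizeof(ops));

      // op 0: the GPU front-end unit polls until *d_wait_flag >= 1
      ops[0].operation         = CU_STREAM_MEM_OP_WAIT_VALUE_32;
      ops[0].waitValue.address = d_wait_flag;
      ops[0].waitValue.value   = 1;
      ops[0].waitValue.flags   = CU_STREAM_WAIT_VALUE_GEQ;

      // op 1: it then writes 1 to *d_done_flag
      ops[1].operation          = CU_STREAM_MEM_OP_WRITE_VALUE_32;
      ops[1].writeValue.address = d_done_flag;
      ops[1].writeValue.value   = 1;
      ops[1].writeValue.flags   = 0;

      // one submission instead of two, saving the ~1us per-op overhead
      cuStreamBatchMemOp(stream, 2, ops, 0);
  }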
GPU PEER MAPPING
accessing a 3rd party device PCIe resource from the GPU

  struct device_bar {
      void       *ptr;
      CUdeviceptr d_ptr;
      size_t      len;
  };

  void map_device_bar(device_bar *db)
  {
      device_driver_get_bar(&db->ptr, &db->len);
      // registration is mandatory; CU_MEMHOSTREGISTER_IOMEMORY is the new flag
      CUCHECK(cuMemHostRegister(db->ptr, db->len, CU_MEMHOSTREGISTER_IOMEMORY));
      CUCHECK(cuMemHostGetDevicePointer(&db->d_ptr, db->ptr, 0));
  }

  // GPU access to the device through the device pointer:
  cuStreamWriteValue32(stream, db->d_ptr + off, 0xfaf0, 0);
GPU PEER MAPPING + ASYNC

  cuStreamWriteValue32(stream, db->d_ptr + off, 0xfaf0, 0);

[Diagram: the GPU's 32-bit write of 0xfaf0 travels over the PCIe bus to phys_ptr+off in the 3rd party device's PCIe resources]
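Putting the pieces together, a conceptual sketch (not from the slides) of an Async send: the CPU prepares the work request, then the GPU front-end unit rings the NIC doorbell through the peer-mapped BAR and waits on a completion flag the NIC writes back via RDMA. nic_post_send() is a hypothetical stand-in for the verbs-level setup that libgdsync/libmp perform; device_bar is the struct from the peer-mapping slide.

  #include <cuda.h>
  #include <stddef.h>
  #include <stdint.h>

  // hypothetical helper: builds the send work request on the CPU and returns
  // the doorbell offset/value that will kick it off
  void nic_post_send(size_t *doorbell_off, uint32_t *doorbell_val);

  static void async_send_on_stream(CUstream stream, device_bar *db,
                                   CUdeviceptr d_cqe_flag)
  {
      size_t   doorbell_off;
      uint32_t doorbell_val;

      nic_post_send(&doorbell_off, &doorbell_val);      // CPU prepares the work plan

      // GPU front-end unit rings the NIC doorbell through the peer-mapped BAR
      cuStreamWriteValue32(stream, db->d_ptr + doorbell_off, doorbell_val, 0);

      // GPU front-end unit waits for the NIC's completion flag, flushing
      // pending RDMA writes before later work in the stream observes the data
      cuStreamWaitValue32(stream, d_cqe_flag, 1,
                          CU_STREAM_WAIT_VALUE_GEQ | CU_STREAM_WAIT_VALUE_FLUSH);
  }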
2DSTENCIL PERFORMANCE
weak scaling, RDMA vs RDMA+Async
[Chart: percentage improvement vs local lattice size (8 to 8192), for NP=2 and NP=4]
Setup: two/four nodes, IVB Xeon CPUs, K40m GPUs, Mellanox Connect-IB FDR, Mellanox FDR switch
CAVEATS
• GPUDirect RDMA & Async need a good platform: correct/reliable forwarding of PCIe transactions
• GPUDirect Async: GPU peer mapping limited to privileged processes (CUDA 8.0 EA)
• Platform: best with a PCIe switch, limited through the CPU root-complex
THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join