

  1. S7128 - HOW TO ENABLE NVIDIA CUDA STREAM SYNCHRONOUS COMMUNICATIONS USING GPUDIRECT
     Davide Rossetti, Elena Agostini

  2. GPUDIRECT ELSEWHERE AT GTC2017
     - H7130 - CONNECT WITH THE EXPERTS: NVIDIA GPUDIRECT TECHNOLOGIES ON MELLANOX NETWORK INTERCONNECTS (Today 5pm, D. Rossetti @Mellanox)
     - S7155 - OPTIMIZED INTER-GPU COLLECTIVE OPERATIONS WITH NCCL (Tue 9am, S. Jeaugey @NVIDIA)
     - S7142 - MULTI-GPU PROGRAMMING MODELS (Wed 1pm, S. Potluri, J. Kraus @NVIDIA)
     - S7489 - CLUSTERING GPUS WITH ETHERNET (Wed 4pm, F. Osman @Broadcom)
     - S7356 - MVAPICH2-GDR: PUSHING THE FRONTIER OF HPC AND DEEP LEARNING (Thu 2pm, D.K. Panda @OSU)

  3. AGENDA
     - GPUDirect technologies
     - NVLINK-enabled multi-GPU systems
     - GPUDirect P2P
     - GPUDirect RDMA
     - GPUDirect Async
     - Async benchmarks & application results

  4. INTRODUCTION TO GPUDIRECT TECHNOLOGIES

  5. GPUDIRECT FAMILY [1]
     Technologies, enabling products!!!
     - GPUDIRECT SHARED GPU-SYSMEM: GPU pinned memory shared with other RDMA-capable devices
     - GPUDIRECT P2P: accelerated GPU-GPU memory copies, inter-GPU direct load/store access, avoids intermediate copies
     - GPUDIRECT RDMA [2]: direct GPU to 3rd-party device transfers, e.g. direct I/O, optimized inter-node communication
     - GPUDIRECT ASYNC: direct GPU to 3rd-party device synchronizations, e.g. optimized inter-node communication
     [1] https://developer.nvidia.com/gpudirect
     [2] http://docs.nvidia.com/cuda/gpudirect-rdma

  6. GPUDIRECT SCOPES
     - GPUDirect P2P → data (data plane), intra-node: GPUs are both master and slave; over PCIe or NVLink
     - GPUDirect RDMA → data (data plane), inter-node: GPU is slave, 3rd-party device is master; over PCIe
     - GPUDirect Async → control (control plane): GPU and 3rd-party device are both master and slave; over PCIe

  7. NVLINK-ENABLED MULTI-GPU SERVERS

  8. NVIDIA DGX-1: AI Supercomputer-in-a-Box
     170 TFLOPS | 8x Tesla P100 16GB | NVLink Hybrid Cube Mesh
     2x Xeon | 8 TB RAID 0 | Quad IB 100Gbps, Dual 10GbE | 3U, 3200W

  9. DGX-1 SYSTEM TOPOLOGY
     - GPU – CPU link: PCIe, 12.5+12.5 GB/s effective BW
     - GPUDirect P2P: GPU – GPU link is NVLink, cube mesh topology (not all-to-all)
     - GPUDirect RDMA: GPU – NIC link is PCIe

  10. IBM MINSKY
      - 2 POWER8 CPUs with NVLink
      - 4 NVIDIA Tesla P100 GPUs
      - 256 GB system memory
      - 2 SSD storage devices
      - High-speed interconnect: IB or Ethernet
      - Optional: up to 1 TB system memory, PCIe-attached NVMe storage

  11. IBM MINSKY SYSTEM TOPOLOGY
      - GPU – CPU link: 2x NVLink, 40+40 GB/s raw BW (system memory: 115 GB/s per POWER8 socket)
      - GPUDirect P2P: GPU – GPU link is 2x NVLink, 80 GB/s, two-cliques topology
      - GPUDirect RDMA: not supported

  12. GPUDIRECT AND MULTI-GPU SYSTEMS: THE CASE OF DGX-1

  13. HOW TO'S
      - Device topology, link type and capabilities:
        GPUa – GPUb link: P2P over NVLINK vs PCIe, speed, etc.
        Same for the CPU – GPU link: NVLINK or PCIe
        Same for the NIC – GPU link (HWLOC)
      - Select an optimized GPU/CPU/NIC combination in MPI runs
      - Enable GPUDirect RDMA

  14. CUDA LINK CAPABILITIES
      Basic info, GPU-GPU links only.

      // CUDA driver API
      typedef enum CUdevice_P2PAttribute_enum {
          CU_DEVICE_P2P_ATTRIBUTE_PERFORMANCE_RANK        = 0x01,  // relative performance of the link between two GPUs (NVLINK ranks higher than PCIe)
          CU_DEVICE_P2P_ATTRIBUTE_ACCESS_SUPPORTED        = 0x02,
          CU_DEVICE_P2P_ATTRIBUTE_NATIVE_ATOMIC_SUPPORTED = 0x03   // can do remote native atomics in GPU kernels
      } CUdevice_P2PAttribute;

      cuDeviceGetP2PAttribute(int* value, CUdevice_P2PAttribute attrib, CUdevice srcDevice, CUdevice dstDevice)

      // CUDA runtime API
      cudaDeviceGetP2PAttribute(int *value, enum cudaDeviceP2PAttr attr, int srcDevice, int dstDevice)
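      A minimal sketch of how these queries might be used from the runtime API; the 0/1 device pair is an arbitrary example, not something prescribed by the slide:

      // Sketch: query P2P link attributes between GPU 0 and GPU 1 with the
      // CUDA runtime API listed above.
      #include <cstdio>
      #include <cuda_runtime.h>

      int main() {
          int perfRank = 0, accessOk = 0, nativeAtomics = 0;
          // Relative link performance (NVLINK ranks higher than PCIe)
          cudaDeviceGetP2PAttribute(&perfRank, cudaDevP2PAttrPerformanceRank, 0, 1);
          // Whether GPU 0 can access GPU 1's memory directly
          cudaDeviceGetP2PAttribute(&accessOk, cudaDevP2PAttrAccessSupported, 0, 1);
          // Whether remote native atomics work over this link
          cudaDeviceGetP2PAttribute(&nativeAtomics, cudaDevP2PAttrNativeAtomicSupported, 0, 1);
          printf("GPU0->GPU1: perf rank %d, access %d, native atomics %d\n",
                 perfRank, accessOk, nativeAtomics);
          return 0;
      }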

  15. GPUDIRECT P2P: NVLINK VS PCIE
      NVLINK is transparently picked if available.

      cudaSetDevice(0);
      cudaMalloc(&buf0, size);
      cudaDeviceCanAccessPeer(&access, 0, 1);
      assert(access == 1);
      cudaDeviceEnablePeerAccess(1, 0);
      cudaSetDevice(1);
      cudaMalloc(&buf1, size);
      ...
      cudaSetDevice(0);
      cudaMemcpy(buf0, buf1, size, cudaMemcpyDefault);

      Note: some GPU pairs are not connected, e.g. GPU0 – GPU7.
      Note 2: others have multiple potential links (NVLINK and PCIe) but cannot use both at the same time!!!
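      The same calls, assembled into a self-contained sketch; the buffer size and the 0/1 device pair are arbitrary choices for illustration:

      // Sketch: complete peer-to-peer copy between GPU 0 and GPU 1.
      #include <cassert>
      #include <cstdio>
      #include <cuda_runtime.h>

      int main() {
          const size_t size = 64 << 20;  // 64 MiB, arbitrary
          void *buf0 = nullptr, *buf1 = nullptr;
          int access = 0;

          cudaDeviceCanAccessPeer(&access, 0, 1);
          assert(access == 1);

          cudaSetDevice(0);
          cudaMalloc(&buf0, size);
          cudaDeviceEnablePeerAccess(1, 0);   // let device 0 access device 1's memory

          cudaSetDevice(1);
          cudaMalloc(&buf1, size);

          cudaSetDevice(0);
          // With peer access enabled and UVA, a plain cudaMemcpy with
          // cudaMemcpyDefault goes directly GPU-to-GPU (over NVLink if available).
          cudaMemcpy(buf0, buf1, size, cudaMemcpyDefault);
          cudaDeviceSynchronize();

          cudaFree(buf0);
          cudaSetDevice(1);
          cudaFree(buf1);
          printf("P2P copy done\n");
          return 0;
      }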

  16. MULTI GPU RUNS ON DGX-1
      Select the best GPU/CPU/NIC for each MPI rank.

      Create a wrapper script. It uses the local MPI rank (MPI-implementation dependent) to pick the GPU,
      socket, and HCA, and passes the selection down to MPI and to the application via environment variables.
      Don't use CUDA_VISIBLE_DEVICES, it hurts P2P!!!

      $ cat wrapper.sh
      if [ ! -z $OMPI_COMM_WORLD_LOCAL_RANK ]; then
          lrank=$OMPI_COMM_WORLD_LOCAL_RANK
      elif [ ! -z $MV2_COMM_WORLD_LOCAL_RANK ]; then
          lrank=$MV2_COMM_WORLD_LOCAL_RANK
      fi
      if (( $lrank > 7 )); then echo "too many ranks"; exit; fi
      case ${HOSTNAME} in
      *dgx*)
          USE_GPU=$((2*($lrank%4)+$lrank/4))   # 0,2,4,6,1,3,5,7
          export USE_SOCKET=$(($USE_GPU/4))    # 0,0,1,1,0,0,1,1
          HCA=mlx5_$(($USE_GPU/2))             # 0,1,2,3,0,1,2,3
          export OMPI_MCA_btl_openib_if_include=${HCA}
          export MV2_IBA_HCA=${HCA}
          export USE_GPU;;
      esac
      numactl --cpunodebind=${USE_SOCKET} -l $@

      In the application, select the GPU from the environment, e.g. cudaSetDevice() with the value of USE_GPU.

      Run the wrapper script:
      $ mpirun -np N wrapper.sh myapp param1 param2 ...
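      A sketch of the application side, assuming the wrapper above exported USE_GPU (the fallback to device 0 is an added convenience, not from the slide):

      // Sketch: pick the GPU chosen by the wrapper script.
      #include <cstdio>
      #include <cstdlib>
      #include <cuda_runtime.h>

      int main() {
          const char *env = getenv("USE_GPU");
          int dev = env ? atoi(env) : 0;   // default to GPU 0 if not launched via the wrapper
          cudaSetDevice(dev);
          cudaFree(0);                     // force context creation on the selected GPU
          printf("rank bound to GPU %d\n", dev);
          return 0;
      }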

  17. NVML NVLINK [1]
      Link discovery and info APIs. nvmlDevice_t handles are separate from CUDA;
      GPU ids cover all devices (vs. CUDA_VISIBLE_DEVICES). NVML_NVLINK_MAX_LINKS = 6.

      nvmlDeviceGetNvLinkVersion(nvmlDevice_t device, unsigned int link, unsigned int *version)
      nvmlDeviceGetNvLinkState(nvmlDevice_t device, unsigned int link, nvmlEnableState_t *isActive)
      nvmlDeviceGetNvLinkCapability(nvmlDevice_t device, unsigned int link, nvmlNvLinkCapability_t capability, unsigned int *capResult)   // see capabilities below
      nvmlDeviceGetNvLinkRemotePciInfo(nvmlDevice_t device, unsigned int link, nvmlPciInfo_t *pci)   // domain:bus:device.function PCI identifier of the device on the other side of the link; can be a socket PCIe bridge (IBM POWER8)

      [1] http://docs.nvidia.com/deploy/nvml-api/
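      A sketch that enumerates the links of GPU 0 with these calls; the GPU index is arbitrary and error handling is minimal (compile with -lnvidia-ml):

      // Sketch: list active NVLinks of GPU 0 and their remote endpoints.
      #include <stdio.h>
      #include <nvml.h>

      int main(void) {
          nvmlDevice_t dev;
          nvmlInit();
          nvmlDeviceGetHandleByIndex(0, &dev);

          for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
              nvmlEnableState_t active;
              if (nvmlDeviceGetNvLinkState(dev, link, &active) != NVML_SUCCESS)
                  continue;                          // link not present on this GPU
              if (active != NVML_FEATURE_ENABLED)
                  continue;

              nvmlPciInfo_t pci;
              unsigned int p2p = 0;
              nvmlDeviceGetNvLinkRemotePciInfo(dev, link, &pci);
              nvmlDeviceGetNvLinkCapability(dev, link, NVML_NVLINK_CAP_P2P_SUPPORTED, &p2p);
              printf("link %u -> %s (P2P supported: %u)\n", link, pci.busId, p2p);
          }
          nvmlShutdown();
          return 0;
      }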

  18. NVLINK CAPABILITIES
      On DGX-1: nvidia-smi nvlink -i <GPU id> -c

      typedef enum nvmlNvLinkCapability_enum {
          NVML_NVLINK_CAP_P2P_SUPPORTED  = 0,
          NVML_NVLINK_CAP_SYSMEM_ACCESS  = 1,
          NVML_NVLINK_CAP_P2P_ATOMICS    = 2,
          NVML_NVLINK_CAP_SYSMEM_ATOMICS = 3,
          NVML_NVLINK_CAP_SLI_BRIDGE     = 4,
          NVML_NVLINK_CAP_VALID          = 5,
      } nvmlNvLinkCapability_t;

  19. NVLINK COUNTERS
      On DGX-1, via nvidia-smi nvlink:
      - Per GPU (-i 0), per link (-l <0..3>)
      - Two sets of counters (-g <0|1>)
      - Per-set counter types: cycles, packets, bytes (-sc xyz)
      - Reset individually (-r <0|1>)

  20. NVML TOPOLOGY
      GPU-GPU & GPU-CPU topology query APIs [1].

      nvmlDeviceGetTopologyNearestGpus(nvmlDevice_t device, nvmlGpuTopologyLevel_t level, unsigned int *count, nvmlDevice_t *deviceArray)
      nvmlDeviceGetTopologyCommonAncestor(nvmlDevice_t device1, nvmlDevice_t device2, nvmlGpuTopologyLevel_t *pathInfo)
      nvmlSystemGetTopologyGpuSet(unsigned int cpuNumber, unsigned int *count, nvmlDevice_t *deviceArray)
      nvmlDeviceGetCpuAffinity(nvmlDevice_t device, unsigned int cpuSetSize, unsigned long *cpuSet)

      Topology levels: NVML_TOPOLOGY_INTERNAL, NVML_TOPOLOGY_SINGLE, NVML_TOPOLOGY_MULTIPLE, NVML_TOPOLOGY_HOSTBRIDGE, NVML_TOPOLOGY_CPU, NVML_TOPOLOGY_SYSTEM

      [1] http://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html
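      A sketch using two of the calls above; GPU indices and the affinity-mask size are arbitrary assumptions (compile with -lnvidia-ml):

      // Sketch: report the common topology ancestor of GPU 0 and GPU 1,
      // and the CPU affinity mask of GPU 0.
      #include <stdio.h>
      #include <nvml.h>

      int main(void) {
          nvmlDevice_t gpu0, gpu1;
          nvmlGpuTopologyLevel_t level;

          nvmlInit();
          nvmlDeviceGetHandleByIndex(0, &gpu0);
          nvmlDeviceGetHandleByIndex(1, &gpu1);

          if (nvmlDeviceGetTopologyCommonAncestor(gpu0, gpu1, &level) == NVML_SUCCESS) {
              // Lower levels (e.g. NVML_TOPOLOGY_SINGLE) mean a closer PCIe path;
              // NVML_TOPOLOGY_SYSTEM means the path crosses the inter-socket link.
              printf("GPU0/GPU1 common ancestor level: %d\n", (int)level);
          }

          // CPU affinity mask for GPU 0, useful for numactl / CPU binding decisions
          unsigned long cpuSet[16] = {0};
          nvmlDeviceGetCpuAffinity(gpu0, 16, cpuSet);
          printf("GPU0 cpu affinity word0: 0x%lx\n", cpuSet[0]);

          nvmlShutdown();
          return 0;
      }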

  21. SYSTEM TOPOLOGY
      On DGX-1: $ nvidia-smi topo -m

  22. SYSTEM TOPOLOGY
      On DGX-1, PCIe only: $ nvidia-smi topo -mp

  23. GPUDIRECT P2P

  24. DGX-1 P2P PERFORMANCE: p2pBandwidthLatencyTest
      In the CUDA toolkit samples:
      - Sources: samples/1_Utilities/p2pBandwidthLatencyTest
      - Binary: samples/bin/x86_64/linux/release/p2pBandwidthLatencyTest
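      A stripped-down sketch of the kind of measurement the sample performs; this is not the sample itself, and the size, repetition count, and 0/1 device pair are arbitrary:

      // Sketch: time repeated GPU1 -> GPU0 peer copies with CUDA events.
      #include <cstdio>
      #include <cuda_runtime.h>

      int main() {
          const size_t size = 256 << 20;   // 256 MiB per copy
          const int reps = 20;
          void *buf0, *buf1;

          cudaSetDevice(0);
          cudaMalloc(&buf0, size);
          cudaDeviceEnablePeerAccess(1, 0);   // ensure a direct P2P path
          cudaSetDevice(1);
          cudaMalloc(&buf1, size);

          cudaSetDevice(0);
          cudaEvent_t start, stop;
          cudaEventCreate(&start);
          cudaEventCreate(&stop);

          cudaEventRecord(start, 0);
          for (int i = 0; i < reps; ++i)
              cudaMemcpyPeerAsync(buf0, 0, buf1, 1, size, 0);   // direct GPU1 -> GPU0 copy
          cudaEventRecord(stop, 0);
          cudaEventSynchronize(stop);

          float ms = 0.f;
          cudaEventElapsedTime(&ms, start, stop);
          double gbps = (double)size * reps / (ms * 1e-3) / 1e9;
          printf("P2P bandwidth: %.1f GB/s\n", gbps);
          return 0;
      }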

  25. DGX-1 P2P PERFORMANCE: busGrind
      In the CUDA toolkit demo suite: /usr/local/cuda-8.0/extras/demo_suite/busGrind -h
      Usage:
      -h: print usage
      -p [0,1]: enable or disable pinned memory tests (default on)
      -u [0,1]: enable or disable unpinned memory tests (default off)
      -e [0,1]: enable or disable p2p enabled memory tests (default on)
      -d [0,1]: enable or disable p2p disabled memory tests (default off)
      -a: enable all tests
      -n: disable all tests

  26. INTRA-NODE MPI BANDWIDTH
      [Chart: bandwidth (GB/s) vs. message size (1K to 16M bytes) for k40-pcie, k40-pcie-bidir, p100-nvlink, p100-nvlink-bidir]
      - ~35 GB/s bidirectional and ~17 GB/s unidirectional on DGX-1 (P100, NVLink)
      - GPU-aware MPI running over GPUDirect P2P
      - Dual IVB Xeon 2U server (K40, PCIe) vs. DGX-1 (P100, NVLink); 8/3/16

  27. GPUDIRECT RDMA

  28. GPUDirect RDMA over RDMA networks, for better network communication latency
      - For the Linux rdma subsystem: open-source nvidia_peer_memory kernel module [1]
        (important bug fix in ver 1.0-3!!!); enables NVIDIA GPUDirect RDMA on the OpenFabrics stack
      - Multiple vendors:
        Mellanox [2]: ConnectX-3 to ConnectX-5, IB/RoCE
        Chelsio [3]: T5, iWARP
        Others to come
      [1] https://github.com/Mellanox/nv_peer_memory
      [2] http://www.mellanox.com/page/products_dyn?product_family=116
      [3] http://www.chelsio.com/gpudirect-rdma
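      A sketch of what the nvidia_peer_memory module enables on the OpenFabrics stack: registering GPU memory directly with an RDMA NIC. Device selection and error handling are minimal assumptions for illustration (compile with -libverbs -lcudart):

      // Sketch: register a cudaMalloc'd buffer with ibv_reg_mr.
      // Without nv_peer_memory loaded, this registration of a device pointer fails.
      #include <stdio.h>
      #include <infiniband/verbs.h>
      #include <cuda_runtime.h>

      int main(void) {
          // Open the first RDMA device and allocate a protection domain
          int num = 0;
          struct ibv_device **devs = ibv_get_device_list(&num);
          if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }
          struct ibv_context *ctx = ibv_open_device(devs[0]);
          struct ibv_pd *pd = ibv_alloc_pd(ctx);

          // Allocate GPU memory and register it directly with the NIC
          void *gpu_buf = NULL;
          size_t size = 1 << 20;
          cudaMalloc(&gpu_buf, size);
          struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, size,
                                         IBV_ACCESS_LOCAL_WRITE |
                                         IBV_ACCESS_REMOTE_READ |
                                         IBV_ACCESS_REMOTE_WRITE);
          printf("GPU buffer %s registered for RDMA\n", mr ? "successfully" : "NOT");

          if (mr) ibv_dereg_mr(mr);
          cudaFree(gpu_buf);
          ibv_dealloc_pd(pd);
          ibv_close_device(ctx);
          ibv_free_device_list(devs);
          return 0;
      }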
