April 4-7, 2016 | Silicon Valley

SIMPLIFYING MULTI-GPU COMMUNICATION WITH NVSHMEM
Sreeram Potluri, Nathan Luehr, and Nikolay Sakharnykh
GOAL

- Limitations for strong scaling on GPU clusters
- A possible way to address them with a GPU global address space: NVSHMEM
- Case studies using NVSHMEM
- The start of a discussion, not a finished solution
PROGRAMMING WITH NVSHMEM
GPU CLUSTER PROGRAMMING

Offload model
- Compute on GPU
- Communication from CPU
- Synchronization at boundaries

Overheads on GPU clusters
- Offload latencies
- Synchronization overheads
- Limits strong scaling
- More CPU activity means more power

    void 2dstencil (u, v, …)
    {
        for (timestep = 0; …) {
            interior_compute_kernel <<<…>>> (…);
            pack_kernel <<<…>>> (…);
            cudaStreamSynchronize(…);
            MPI_Irecv(…);
            MPI_Isend(…);
            MPI_Waitall(…);
            unpack_kernel <<<…>>> (…);
            boundary_compute_kernel <<<…>>> (…);
            …
        }
    }

(A fleshed-out sketch of this offload pattern follows.)
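A minimal compilable sketch of the CPU-orchestrated pattern above, reduced to a 1-D Jacobi stencil so each halo is a single element (no pack/unpack kernels) and without interior/boundary overlap; all names here are illustrative, not from the original:

    // Sketch: CPU-driven halo exchange for a 1-D stencil.
    // Build (roughly): nvcc -ccbin mpicxx stencil_mpi.cu
    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void compute_kernel(float *u, const float *v, int nx) {
        int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // skip halo cells
        if (i <= nx) u[i] = 0.5f * (v[i - 1] + v[i + 1]);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        const int nx = 1 << 20;                   // interior cells per rank
        float *u, *v;                             // nx + 2 cells: one halo per side
        cudaMalloc(&u, (nx + 2) * sizeof(float));
        cudaMalloc(&v, (nx + 2) * sizeof(float));
        cudaMemset(v, 0, (nx + 2) * sizeof(float));
        float halo[2] = {0.f, 0.f}, recv[2] = {0.f, 0.f};

        for (int timestep = 0; timestep < 100; ++timestep) {
            // CPU orchestrates: stage boundary cells through host buffers
            cudaMemcpy(&halo[0], v + 1,  sizeof(float), cudaMemcpyDeviceToHost);
            cudaMemcpy(&halo[1], v + nx, sizeof(float), cudaMemcpyDeviceToHost);

            MPI_Request reqs[4];
            MPI_Irecv(&recv[0], 1, MPI_FLOAT, left,  0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Irecv(&recv[1], 1, MPI_FLOAT, right, 1, MPI_COMM_WORLD, &reqs[1]);
            MPI_Isend(&halo[0], 1, MPI_FLOAT, left,  1, MPI_COMM_WORLD, &reqs[2]);
            MPI_Isend(&halo[1], 1, MPI_FLOAT, right, 0, MPI_COMM_WORLD, &reqs[3]);
            MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

            cudaMemcpy(v,          &recv[0], sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(v + nx + 1, &recv[1], sizeof(float), cudaMemcpyHostToDevice);

            compute_kernel<<<(nx + 255) / 256, 256>>>(u, v, nx);
            cudaDeviceSynchronize();              // offload + sync every step
            float *t = u; u = v; v = t;           // swap buffers
        }
        cudaFree(u); cudaFree(v);
        MPI_Finalize();
        return 0;
    }

Every iteration pays kernel-launch latency plus two host synchronizations, which is exactly the overhead that limits strong scaling.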
GPU-CENTRIC COMMUNICATION

GPU capabilities
- Compute state to hide latencies to global memory
- Implicit coalescing of loads/stores for efficiency
- CUDA helps programmers target these capabilities

The same capabilities should also benefit access to data over the network
- Direct access to remote memory simplifies programming
- Achieves efficiency while making code easier to write
- Continuous fine-grained accesses smooth traffic over the network
GPU GLOBAL ADDRESS SPACE
NVSHMEM

A subset of OpenSHMEM; interoperable with MPI/OpenSHMEM, usable in CUDA kernels and OpenACC regions

- Initialization and cleanup (host): nvstart_pes, nvstop_pes
- Allocation and deallocation (host): nvshmalloc, nvshmcleanup
- nvshmem_barrier_all (host)
- nvshmem_get_ptr (host/GPU)
- Put and get routines (GPU): nvshmem_(float/int)_(p/g), nvshmem_(float/int)_(put/get)
- nvshmem_(quiet/fence) (GPU)
- nvshmem_wait/wait_until (GPU)

(A host-side usage sketch follows this list.)
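A minimal host-side skeleton showing how these routines might fit together. Only the routine names come from the slides; the header name, argument lists, and kernel are assumptions about the prototype:

    #include <cuda_runtime.h>
    #include "nvshmem.h"   // assumed header name for the prototype library

    extern __global__ void stencil_kernel(float *u, float *v, int *sync);

    int main(int argc, char **argv)
    {
        const int N = 1 << 20;
        nvstart_pes(argc, argv);             // init: one PE per GPU (args assumed)

        // Symmetric allocations: valid at the same address on every PE
        float *u    = (float *) nvshmalloc(N * sizeof(float));
        float *v    = (float *) nvshmalloc(N * sizeof(float));
        int   *sync = (int *)   nvshmalloc(2 * sizeof(int));

        nvshmem_barrier_all();               // all PEs ready before launch

        // One long-running kernel communicates directly from the device
        stencil_kernel<<<N / 256, 256>>>(u, v, sync);
        cudaDeviceSynchronize();

        nvshmcleanup();                      // release symmetric allocations
        nvstop_pes();                        // finalize
        return 0;
    }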
COMMUNICATION FROM CUDA KERNELS

- Long-running CUDA kernels
- Communication within parallelism

    __global__ void 2dstencil (u, v, sync, …)
    {
        for (timestep = 0; …) {
            if (i+1 > nx) {
                v[i+1] = shmem_float_g (v[1], rightpe);
            }
            if (i-1 < 1) {
                v[i-1] = shmem_float_g (v[nx], leftpe);
            }
            u[i] = (u[i] + (v[i+1] + v[i-1] …
            if (i < 2) {
                shmem_int_p (sync + i, 1, peers[i]);
                shmem_quiet();
                shmem_wait_until (sync + i, EQ, 1);
            } //intra-kernel sync
            …
        }
    }

(A self-contained sketch of this device-initiated pattern follows.)
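A self-contained sketch of device-initiated communication in the spirit of the kernel above, for one timestep of a 1-D stencil. The routine names follow the API list on the previous slide; the prototypes and the EQ comparison constant are assumptions:

    // Assumed device-side prototypes; the real prototype library supplies these.
    __device__ float nvshmem_float_g(float *src, int pe);
    __device__ void  nvshmem_int_p(int *dst, int value, int pe);
    __device__ void  nvshmem_quiet(void);
    __device__ void  nvshmem_wait_until(int *ivar, int cmp, int value);
    #define EQ 0   // comparison constant as named in the slides; value assumed

    // Edge threads pull halo values from neighbor PEs, then signal/wait
    // with flags: no pack/unpack kernels, no MPI, no host involvement.
    __global__ void stencil_step(float *u, float *v, int *sync,
                                 int nx, int leftpe, int rightpe, int step)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x + 1;

        // Fine-grained remote reads replace the pack/send/recv/unpack chain
        if (i == 1)  v[0]      = nvshmem_float_g(&v[nx], leftpe);
        if (i == nx) v[nx + 1] = nvshmem_float_g(&v[1],  rightpe);

        if (i <= nx) u[i] = 0.5f * (v[i - 1] + v[i + 1]);

        // Intra-kernel sync with neighbors: one thread per side writes a
        // flag on the neighbor, then waits for the matching flag here.
        if (i == 1) {
            nvshmem_quiet();                         // order my prior stores/puts
            nvshmem_int_p(&sync[0], step, leftpe);   // tell left: my data is ready
            nvshmem_wait_until(&sync[1], EQ, step);  // wait for left's signal
        }
        if (i == nx) {
            nvshmem_quiet();
            nvshmem_int_p(&sync[1], step, rightpe);
            nvshmem_wait_until(&sync[0], EQ, step);
        }
    }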
EXPERIMENTAL PLATFORMS

Single node
- Up to 8 GPUs: 2 per card, 4 cards under the same PCIe root complex, using riser cards with a PCIe switch
- CUDA IPC and P2P

Multi-node platform
- Top-of-Rack PCIe Switch: ExpressFabric, a proprietary technology from Avago Technologies
- Inter-host communication with TWC (Tunneled Window Connection)

Source: http://www.avagotech.com/applications/datacenters/expressfabric/
CURRENT ARCHITECTURES

Communication operations on remote GPU memory (reconstructed from the slide's table):

    Operation   Kepler + P2P    Kepler + PCIe ExpressFabric   Pascal + NVLink   Volta + NVLink
                (Single Node)   (Multi Node)                  (Single Node)
    Write       ☑               ☑                             ☑                 ☑
    Read        ☑               ☒                             ☑                 ☑
    Atomics     ☒               ☒                             ☑                 ☑

Execution: inter-thread synchronization requires care
(1) Avoid intra-WARP synchronization
(2) Ensure that synchronizing blocks are scheduled

(A warp-safe wait sketch follows.)
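The execution caveats bite when device code spin-waits on a flag, as in the intra-kernel synchronization shown earlier. A sketch of one warp-safe way to wait, assuming the flag lives in peer-mapped or symmetric memory; everything here is illustrative:

    // Only one lane of the warp polls the flag. On current (pre-Volta)
    // hardware the other lanes are simply masked off until it exits the
    // branch, which is safe because the writer is another GPU or block,
    // never a lane of this same warp (caveat 1). Caveat 2 still applies:
    // only wait on a writer that is guaranteed to be resident on an SM.
    __device__ void wait_peer_flag(volatile int *flag, int expected)
    {
        if ((threadIdx.x & 31) == 0) {
            while (*flag != expected)
                ;                      // spin on the remotely written flag
        }
        __threadfence();               // all lanes reconverge; order later reads
    }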
PERFORMANCE STUDIES
CoMD MOLECULAR DYNAMICS

MPI vs. NVSHMEM for Halo Exchange in EAM Force Evaluation

[Figure: MPI path — between the EAM-1 and EAM-3 kernels, each of Exchange X, Y, and Z runs pack kernel -> MPI buffer -> MPI Send/Recv -> MPI buffer -> un-pack kernel. NVSHMEM path — the EAM-1 kernel sends data directly and the EAM-3 kernel waits on it.]

GPU-driven communication
- Fine-grained communication at the thread level
- Avoids synchronization and artificial serialization
CoMD FORCE EVALUATION

[Figure: NVProf timeline for EAM forces using Link Cells]
CoMD PERFORMANCE

[Figure: speedup (NVSHMEM vs. MPI) for Atom Redistribution, Force Exchange, and full Timestep, plotted against atoms/GPU (2048, 6912, 27436, 108000, 364500); y-axis 0 to 3.5]

4 K80s (8 GPUs) connected over PCIe
MULTI-GPU TRANSPOSE

- Bandwidth limited
- MPI version carefully pipelines local transposes and inter-process data movement
- NVSHMEM significantly reduces code complexity (see the sketch below)

[Figure: an 8x8 matrix split row-wise across GPU 0 and GPU 1, before and after transpose; bi-bandwidth (GB/sec, 0 to 25) of MPI vs. NVSHMEM for matrix dimensions 384 to 12288]

2 K40s connected over PCIe
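A sketch of why the NVSHMEM version needs so little code: each thread computes where its element lands globally and puts it straight into the owning GPU's buffer, so the MPI version's pipeline of local transposes and exchanges collapses into one kernel. nvshmem_float_p is taken from the API list earlier; its exact signature is assumed:

    __device__ void nvshmem_float_p(float *dst, float value, int pe); // assumed

    // in/out are symmetric allocations; each of the PEs owns
    // rows_per_pe consecutive rows of an n x n matrix.
    __global__ void transpose_kernel(const float *in, float *out,
                                     int n, int rows_per_pe, int mype)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= rows_per_pe * n) return;

        int row = mype * rows_per_pe + idx / n;   // global row of my element
        int col = idx % n;

        // (row, col) moves to (col, row): find the PE that owns global
        // row `col` and write the element directly into its buffer.
        int dest_pe  = col / rows_per_pe;
        int dest_idx = (col % rows_per_pe) * n + row;
        nvshmem_float_p(&out[dest_idx], in[idx], dest_pe);
    }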
COLLECTIVE COMMUNICATION

NCCL, the NVIDIA collectives communication library
- Uses fine-grained load/stores between GPUs (no DMAs used!)
- Pipelines data movement and overlaps it with computation (a virtue of WARP scheduling)
- Implemented over NVSHMEM (a toy illustration of the mechanism follows)

[Figure: results for single-node NVLink, single-node PCIe, and multi-node PCIe configurations]
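A toy illustration of the underlying mechanism, not NCCL's actual code: once peer access is enabled, a plain kernel moves data with ordinary loads and stores, which is what lets warp scheduling pipeline communication with computation:

    #include <cuda_runtime.h>

    // Device 1 pulls a buffer from device 0 with fine-grained loads;
    // no cudaMemcpy and no DMA engine involved.
    __global__ void peer_copy(const float *remote_src, float *local_dst, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            local_dst[i] = remote_src[i];   // coalesced over NVLink/PCIe
    }

    int main()
    {
        const int n = 1 << 24;
        float *src, *dst;
        cudaSetDevice(0);
        cudaMalloc(&src, n * sizeof(float));   // lives on device 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);      // map device 0's memory
        cudaMalloc(&dst, n * sizeof(float));
        peer_copy<<<256, 256>>>(src, dst, n);  // direct load/store copy
        cudaDeviceSynchronize();
        cudaFree(dst);
        cudaSetDevice(0);
        cudaFree(src);
        return 0;
    }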
HPGMG-FV

Proxy for geometric multi-grid linear solvers

Intra-level communication
- Boundary exchange is symmetric
- Point-to-point between neighbors

MPI uses 3 steps:
1 – send data (boundary->MPI buffer)
2 – local exchange (internal->internal)
3 – receive data (MPI buffer->boundary)
HPGMG-FV – BOUNDARY EXCHANGE

Implementation complexity (NVSHMEM sketch below)

MPI:
- CopyKernel(BOUNDARY-TO-BUFFER)
- cudaDeviceSync
- MPI_Irecv + MPI_Isend
- CopyKernel(INTERNAL-TO-INTERNAL)
- MPI_Waitall
- CopyKernel(BUFFER-TO-BOUNDARY)

NVSHMEM:
- CopyKernel(ALL-TO-ALL)
- nvshmem_barrier_all_offload
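A sketch of how the NVSHMEM column could collapse into a single kernel plus one barrier. The neighbor halo pointers are assumed to come from nvshmem_get_ptr (listed earlier), and nvshmem_barrier_all_offload is used as named on the slide; the rest is illustrative:

    // One kernel writes every boundary element straight into the proper
    // neighbor's halo region: no staging buffers, no MPI, and no device
    // synchronization between copy phases.
    __global__ void copy_all_to_all(const float *my_boundary,
                                    float * const *neighbor_halo,  // peer ptrs
                                    const int *counts, int nneighbors)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int offset = 0;
        for (int nb = 0; nb < nneighbors; ++nb) {
            if (tid < counts[nb])
                neighbor_halo[nb][tid] = my_boundary[offset + tid];
            offset += counts[nb];
        }
    }
    // Host: copy_all_to_all<<<grid, block>>>(boundary, halos, counts, k);
    //       nvshmem_barrier_all_offload();   // barrier queued on the GPU,
    //                                        // no host synchronization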
HPGMG CHEBYSHEV SMOOTHER

- Limited by latencies, more so at coarser levels
- Use fine-grained put/get with NVSHMEM

[Figure: Chebyshev smoother on 8 GPUs, time in msec (0 to 20) for MPI vs. NVSHMEM at grid sizes 128^3 (finer), 64^3, and 32^3 (coarser)]

4 K80s (8 GPUs) connected over PCIe
SUMMARY

- Strong scaling is important on GPU clusters
- CPU-orchestrated communication adds overheads
- NVSHMEM is a prototype library for GPU-initiated communication
- Better performance and better programmability
- Promising results with the NVIDIA collectives library (NCCL) and mini-apps
THANK YOU!