PERFORMANCE OPTIMIZATION FOR SCIENTIFIC APPLICATIONS
Alan Gray, Developer Technology Engineer, NVIDIA
GTC, March 26-29, 2018
AGENDA
• Introduction
• Single-GPU
  • Exposing parallelism
  • Memory coalescing
  • Data management optimization
  • Interoperability
  • ESCAPE performance results
• Multi-GPU
  • CUDA-aware MPI
  • AlltoAll optimization
  • DGX-2 (with NVSwitch) vs DGX-1V
2
INTRODUCTION 3
ESCAPE
• NVIDIA's role is to take existing GPU-enabled codes and optimize them.
4
ECMWF
ESCAPE Project Leaders
• The European Centre for Medium-Range Weather Forecasts (ECMWF) is an intergovernmental organisation.
• Global forecasts:
  • are used by more than 22 countries to drive their regional forecasts.
  • provide accurate weekly, monthly and seasonal predictions, including early warnings of severe events.
• In 2012 ECMWF provided the most accurate prediction for the trajectory and landfall of Hurricane Sandy: information that undoubtedly saved lives.
• We are working closely with ECMWF (and other partners) in ESCAPE to evaluate and improve the algorithms, techniques and software on GPUs.
• This is done through the use of dwarves: mini-apps designed to represent the key properties of the real simulation applications.
5
ESCAPE DWARVES
Spherical Harmonics (SH) Dwarf
• ECMWF's Integrated Forecasting System (IFS) is a global prediction system: the entire earth's atmosphere is represented as a spherical grid.
• Information in "grid-point" space can be equivalently represented in "spectral" space, i.e. in terms of the frequencies of the fluctuating waves, which is more suited to some calculations.
• IFS therefore repeatedly transforms between these representations: Fourier transforms (FFTs) in longitude and Legendre transforms (DGEMMs) in latitude, with AlltoAll data movement in between.
• This dwarf represents the spectral transforms from IFS.
• NB: the number of points varies (e.g. most round the equator, fewest at the poles). Additionally, there are multiple altitude "levels", in the third dimension away from the surface of the earth, each with 3 "fields".
6
ESCAPE DWARVES
MPDATA Dwarf
• Advection: horizontal transport.
• Uses an unstructured grid with nearest-neighbour stencils.
• The MPDATA scheme is already used within COSMO-EULAG (PSNC), and is of interest to ECMWF for future developments.
Both SH and MPDATA Dwarves
• Fortran+OpenACC+MPI. SH also has interfacing to CUDA libraries.
• Many of the optimizations I will present are transferable to other applications/languages etc.
7
SINGLE-GPU: EXPOSING PARALLELISM 8
EXPOSING PARALLELISM
OpenACC Loop Mapping
Typical structure of the application (usually spanning multiple routines/files):

Loop over timesteps
  …
  Loop over 1st dimension
    …
    Loop over 2nd dimension
      …
      Loop over fields
        …
        Operation (involving multidimensional arrays)
  …
  Another loop over dimensions…
  …

• The aim is to expose as much parallelism as possible within the timestep loop (the nested loops, boxed in red on the slide), as flexibly as possible.
9
EXPOSING PARALLELISM
OpenACC Loop Mapping
Before optimization:

Loop over 1st dimension
  …
  Loop over 2nd dimension
    …
    Loop over fields
      …
      Operation

SH
• Loop over 1st dimension was sequential: parallelism not exposed to GPU.
MPDATA
• Naïve mapping of loops, using "kernels" and/or "loop" directives without restructuring.
• Resulting decomposition chosen by the compiler did not work well, since runtime loop extents didn't map well to the GPU architecture.
10
EXPOSING PARALLELISM
OpenACC Loop Mapping

!$ACC parallel loop collapse(3)
Loop over 1st dimension
  Loop over 2nd dimension
    Loop over fields
      …
      Operation

• Assuming the loops are independent, it is better to restructure such that the loops are tightly nested, and use the "collapse" clause (a concrete sketch follows below).
• The loops will be collapsed into a single loop, and the compiler will map all the parallelism to GPU blocks/threads in an efficient way.
• This can require extensive restructuring, depending on how the application is originally written.
11
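For illustration, a minimal self-contained Fortran sketch of the collapsed mapping; the array name field and the loop extents are hypothetical, not taken from the dwarf:

program collapse_demo
  implicit none
  integer, parameter :: nlon=128, nlat=64, nfld=6
  integer :: ilon, ilat, ifld
  real(8) :: field(nfld, nlat, nlon)
  field = 1.0d0
  ! the three tightly nested loops collapse into a single parallel loop;
  ! the compiler maps the combined iteration space to gangs/workers/vectors
!$acc parallel loop collapse(3) copy(field)
  do ilon = 1, nlon
    do ilat = 1, nlat
      do ifld = 1, nfld
        field(ifld, ilat, ilon) = 2.0d0 * field(ifld, ilat, ilon)
      end do
    end do
  end do
  print *, 'checksum =', sum(field)
end program collapse_demo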
EXPOSING PARALLELISM
OpenACC Loop Mapping

!$ACC parallel loop collapse(2)
Loop over 1st dimension
  Loop over 2nd dimension
!$ACC loop seq
    Loop with dependencies
      …
      Operation

• Sometimes we have loop-carried dependencies.
• These can be performed sequentially by each thread at the innermost level (see the sketch after this slide).
• Can still perform well if there is enough parallelism in the outermost loops.
12
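A minimal self-contained sketch of this pattern, assuming a hypothetical running-sum dependency in the innermost index (names and sizes are illustrative only):

program seq_inner_demo
  implicit none
  integer, parameter :: ni=128, nj=64, nk=32
  integer :: i, j, k
  real(8) :: a(nk, nj, ni), s(nk, nj, ni)
  a = 1.0d0
!$acc parallel loop collapse(2) copyin(a) copyout(s)
  do i = 1, ni
    do j = 1, nj
      s(1, j, i) = a(1, j, i)
      ! each k iteration depends on k-1, so this innermost loop
      ! runs sequentially within the thread handling this (i,j)
!$acc loop seq
      do k = 2, nk
        s(k, j, i) = s(k-1, j, i) + a(k, j, i)
      end do
    end do
  end do
  print *, 's(nk,1,1) =', s(nk, 1, 1)
end program seq_inner_demo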
EXPOSING PARALLELISM
OpenACC Loop Mapping
• Sometimes the extent of a loop depends on the index of another (outer) loop, which prevents loop collapsing.
• Can replace the extent with its maximum value, and use a conditional statement in the loop body.

Before:
do i=1, N
  do j=1, i
    do k=1, P
      …
      Operation

After:
!$ACC parallel loop collapse(3)
do i=1, N
  do j=1, MAX_J
    do k=1, P
      …
      if(j .le. i) then
        Operation

13
SINGLE-GPU: MEMORY COALESCING 14
KERNEL OPTIMIZATION
Memory Coalescing: data layout
• For memory coalescing, the fastest-moving index in an array access should correspond to the vector level (CUDA thread), which will correspond to the innermost collapsed loop index.

!$ACC parallel loop collapse(2)
do m=1, …
  do n=1, …
!$ACC loop seq
    do p=1, …
      …
      Array(n,m,p)=…

!$ACC parallel loop collapse(3)
do k=1, …
  do j=1, …
    do i=1, …
      …
      Array(i,j,k)=…

In both examples the innermost collapsed loop index (n in the first, i in the second) is the first, fastest-moving array index, so accesses coalesce.
15
KERNEL OPTIMIZATION
Memory Coalescing: transposing with tile clause
• If you need to transpose, either the read or the write will be in the wrong layout for coalescing.
• But can use the OpenACC "tile" clause to improve performance:

!$ACC parallel loop tile(16,32)
do j=1, …
  do i=1, …
    …
    array_t(i,j)=array(j,i)

• Compiler tiles the operation by generating new innermost loops.
• For each tile, data is staged on-chip.
• Results in better global memory access patterns.
• Experiment with tile sizes.
16
KERNEL OPTIMIZATION
Memory Coalescing: transposing within CUDA BLAS
• In some of our cases, the transpose kernels were adjacent to CUDA BLAS matrix multiplication (DGEMM) calls.
• Coalescing was facilitated by replacing C = AB matrix multiplications with the equivalent C^T = B^T A^T.
• This allows the transpose operations to be pushed into the DGEMM library calls, which have much higher-performing implementations of transposed data accesses (sketch below).
17
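For illustration only (not the dwarf's actual code), a minimal sketch of the idea, assuming the PGI/NVIDIA cuBLAS Fortran module and OpenACC interoperability via host_data; matrix names and sizes are hypothetical. The transposes are handed to the library through its op flags instead of being performed in separate kernels:

program dgemm_transpose_demo
  use cublas   ! assumed: PGI/NVIDIA cuBLAS Fortran module (link with e.g. -Mcudalib=cublas)
  implicit none
  integer, parameter :: m=512, n=256, k=128
  real(8) :: a(k,m), b(n,k), c(m,n)   ! a and b held in "transposed" storage
  a = 1.0d0; b = 2.0d0; c = 0.0d0
!$acc data copyin(a,b) copy(c)
!$acc host_data use_device(a,b,c)
  ! Rather than launching explicit transpose kernels and then calling an
  ! ('N','N') DGEMM, pass transpose op flags so the library performs the
  ! transposed accesses itself: C = op(A)*op(B) with op = transpose.
  call cublasDgemm('T', 'T', m, n, k, 1.0d0, a, k, b, n, 0.0d0, c, m)
!$acc end host_data
!$acc end data
  print *, 'c(1,1) =', c(1,1)   ! expect 1*2 summed over k = 256
end program dgemm_transpose_demo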
SINGLE-GPU: DATA MANAGEMENT OPTIMIZATION 18
DATA MANAGEMENT OPTIMIZATION
Minimizing data allocation and movement
• Data allocation and movement is expensive. Many codes have:

Loop over timesteps:
  …
  loops over spatial dims
  …

• Important to keep as much data as possible resident on the GPU within the timestep loop.
• All allocations/frees should be outside the timestep loop.
• Copies for constant data should be outside the main timestep loop.
• Re-use temporary scratch arrays on the device.
• Any necessary repeated copies (e.g. halo regions in MPI code): the volume copied should be minimized.
(A sketch of these points follows below.)
19
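A minimal self-contained OpenACC sketch of these points; the arrays field (persistent), coeffs (constant) and scratch (device-only temporary) and the toy halo update are hypothetical, not from the dwarf:

program data_mgmt_demo
  implicit none
  integer, parameter :: n=1024, nhalo=2, nsteps=10
  integer :: step, i, j
  real(8), allocatable :: field(:,:), coeffs(:), scratch(:,:)
  allocate(field(n,n), coeffs(n), scratch(n,n))   ! allocate once, outside the timestep loop
  field = 1.0d0; coeffs = 0.5d0
  ! copy persistent and constant data to the device once; scratch lives
  ! only on the device (create) and is never transferred
!$acc enter data copyin(field, coeffs) create(scratch)
  do step = 1, nsteps
!$acc parallel loop collapse(2) present(field, coeffs, scratch)
    do j = 1, n
      do i = 1, n
        scratch(i,j) = coeffs(i) * field(i,j)
      end do
    end do
!$acc parallel loop collapse(2) present(field, scratch)
    do j = 1, n
      do i = 1, n
        field(i,j) = scratch(i,j)
      end do
    end do
    ! only the halo rows move each step, not the whole field
!$acc update self(field(1:nhalo,:), field(n-nhalo+1:n,:))
    ! ... an MPI halo exchange on the host would go here ...
!$acc update device(field(1:nhalo,:), field(n-nhalo+1:n,:))
  end do
!$acc exit data copyout(field) delete(coeffs, scratch)
  print *, 'field(1,1) =', field(1,1)
end program data_mgmt_demo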
SINGLE-GPU: INTEROPERABILITY 20
INTEROPERABILITY
Simple example: Calling C/CUDA from PGI Fortran

!main.f90
program main
  interface
    subroutine kernel_from_f(arg) &
        bind(C,name='kernel_wrapper')
      use iso_c_binding
      integer(c_int),value :: arg
    end subroutine kernel_from_f
  end interface
  call kernel_from_f(1)
end program main

//kernel.cu
#include <stdio.h>
__global__ void kernel(int arg){
  if (threadIdx.x==0)
    printf("hello from kernel\n");
  return;
}
extern "C" void kernel_wrapper(int arg){
  kernel<<<1,1>>>(arg);
  cudaDeviceSynchronize();
  return;
}

$ nvcc -c -arch=sm_60 kernel.cu -o kernel.o
$ pgf90 -c main.f90 -o main.o
$ pgf90 main.o kernel.o -o code -L$CUDA_HOME/lib64 -lcudart
$ ./code
hello from kernel

CUDA libraries can be called from the C code in the usual manner.
21
INTEROPERABILITY AND LIBRARIES: SH DWARF
[Diagram: the dwarf's transform pipeline alternates between OpenACC regions, cuFFT calls and cublasDgemm calls.]
Base language Fortran, MPI for multi-GPU communications.
22
BLAS/FFT LIBRARY CALLS IN SH DWARF
• At each timestep, the SH dwarf performs transforms using matrix multiplications and FFTs.
• Multiple operations, one for each:
  • Field (associated with vertical levels)
  • Longitude (Matmult) / Latitude (FFT)
• Can batch over fields, since the sizes are the same. But different longitudes/latitudes have different sizes: not supported by the batched versions of cublasDgemm/cuFFT.
• So, originally we had many small calls: low parallelism exposure and launch latency sensitivity.
• For DGEMM, we pad with zeros up to the largest size and batch over longitudes as well as fields: a single call to the library; the extra operations do not contribute to the result (see the sketch after this slide).
• But FFT does not allow padding in the same way. Worked around the launch latency problem by removing the sync after each call: allows launch latency to be hidden behind execution. As will be seen, however, this is the only part of the dwarf which remains suboptimal.
• Future: batched FFTs with differing sizes should improve performance.
23
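To illustrate the padding step only (a sketch with hypothetical names, not the dwarf's code): matrices with differing row counts are packed into one zero-padded array so that a single batched DGEMM can replace the many small calls, the padded rows contributing nothing to the result:

subroutine pad_for_batch(a_small, rows, a_pad, max_rows, ncols, nbatch)
  implicit none
  integer, intent(in)  :: max_rows, ncols, nbatch
  integer, intent(in)  :: rows(nbatch)                      ! true row count of each matrix
  real(8), intent(in)  :: a_small(max_rows, ncols, nbatch)  ! only the first rows(ib) rows are valid
  real(8), intent(out) :: a_pad(max_rows, ncols, nbatch)
  integer :: ib, j, i
!$acc parallel loop collapse(3) copyin(a_small, rows) copyout(a_pad)
  do ib = 1, nbatch
    do j = 1, ncols
      do i = 1, max_rows
        if (i <= rows(ib)) then
          a_pad(i, j, ib) = a_small(i, j, ib)
        else
          a_pad(i, j, ib) = 0.0d0   ! zero padding: adds nothing to the product
        end if
      end do
    end do
  end do
  ! a single batched DGEMM (e.g. cuBLAS's strided-batched routine) can now be
  ! issued over a_pad, replacing nbatch separate small cublasDgemm calls
end subroutine pad_for_batch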
SINGLE-GPU: ESCAPE RESULTS 24
MPDATA OPTIMIZATION: P100
[Charts: MPDATA kernel timings on P100, before and after optimization.]
25
OPTIMIZED MPDATA: P100 VS V100
[Charts: optimized MPDATA performance, P100 vs V100.]
26
ESCAPE DWARF V100 PERFORMANCE 27
MPDATA KERNEL PERFORMANCE
• 100% roofline is the STREAM benchmark throughput, since all kernels are memory bandwidth bound.
28
ESCAPE DWARF V100 PERFORMANCE 29