HIGH PERFORMANCE AND PRODUCTIVITY WITH UNIFIED MEMORY AND OPENACC: A LBM CASE STUDY

  1. April 4-7, 2016 | Silicon Valley HIGH PERFORMANCE AND PRODUCTIVITY WITH UNIFIED MEMORY AND OPENACC: A LBM CASE STUDY Jiri Kraus, Senior Devtech Compute, April 7th 2016

  2. OPENACC DIRECTIVES
     Incremental, single source, interoperable, performance portable: CPU, GPU, MIC.
     #pragma acc data copyin(a,b) copyout(c)   // Manage data movement
     {
       ...
       #pragma acc parallel                    // Initiate parallel execution
       {
         #pragma acc loop gang vector          // Optimize loop mappings
         for (i = 0; i < n; ++i) {
           z[i] = x[i] + y[i];
           ...
         }
       }
       ...
     }
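     For reference, a minimal self-contained program built around the same fragment. It is a sketch, not part of the talk: the problem size, the initialization values, and the consistent use of x, y, z (the slide mixes a, b, c with x, y, z) are assumptions added here.

       #include <stdio.h>
       #include <stdlib.h>

       int main(void) {
           int n = 1 << 20;                       /* illustrative problem size */
           float *x = (float *)malloc(n * sizeof(float));
           float *y = (float *)malloc(n * sizeof(float));
           float *z = (float *)malloc(n * sizeof(float));
           for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

           #pragma acc data copyin(x[0:n], y[0:n]) copyout(z[0:n])  /* manage data movement */
           {
               #pragma acc parallel                                 /* initiate parallel execution */
               {
                   #pragma acc loop gang vector                     /* optimize loop mappings */
                   for (int i = 0; i < n; ++i)
                       z[i] = x[i] + y[i];
               }
           }

           printf("z[0] = %f\n", z[0]);
           free(z); free(y); free(x);
           return 0;
       }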

  3. UNIFIED MEMORY
     Dramatically lower developer effort.
     (Diagram: the traditional developer view shows separate System Memory and GPU Memory; the developer view with Unified Memory shows a single Unified Memory space.)

  4. UNIFIED MEMORY
     Traditional developer view (explicit data clauses):
       void foo(FILE *fp, int N) {
         float *x, *y, *z;
         x = (float *)malloc(N * sizeof(float));
         y = (float *)malloc(N * sizeof(float));
         z = (float *)malloc(N * sizeof(float));
         fread(x, sizeof(float), N, fp);
         fread(y, sizeof(float), N, fp);
         #pragma acc kernels copy(x[0:N], y[0:N], z[0:N])
         for (int i = 0; i < N; ++i)
           z[i] = x[i] + y[i];
         use_data(z);
         free(z); free(y); free(x);
       }
     Developer view with Unified Memory (no data clauses):
       void foo(FILE *fp, int N) {
         float *x, *y, *z;
         x = (float *)malloc(N * sizeof(float));
         y = (float *)malloc(N * sizeof(float));
         z = (float *)malloc(N * sizeof(float));
         fread(x, sizeof(float), N, fp);
         fread(y, sizeof(float), N, fp);
         #pragma acc kernels
         for (int i = 0; i < N; ++i)
           z[i] = x[i] + y[i];
         use_data(z);
         free(z); free(y); free(x);
       }

  5. (Diagram: the iterative porting cycle: Identify Available Parallelism -> Express Loop Parallelism -> Express Data Movement -> Optimize Performance.)

  6. OPENACC AND UNIFIED MEMORY
     PGI support for Unified Memory with OpenACC:
     All heap allocations are placed in managed memory (the Unified Memory heap), so the same pointers can be used in GPU and CPU code.
     Enabled with the compiler switch -ta=tesla:managed,...
     More info in "OpenACC and CUDA Unified Memory" by Michael Wolfe, PGI Compiler Engineer: https://www.pgroup.com/lit/articles/insider/v6n2a4.htm

  7. OPENACC AND UNIFIED MEMORY
     Advantages:
     Unified Memory can be used in CPU and GPU code.
     No need for any data clauses.
     No need to fully understand the data flow and allocation logic of the application.
     Simplifies handling of complex data structures.
     Enables incremental, profiler-driven acceleration: data movement becomes just another optimization.

  8. OPENACC AND UNIFIED MEMORY
     Implementation details on Kepler and Maxwell:
     Managed memory covers only heap data; it does not apply to stack, static, or global data.
     Allocatable memory is limited to the available device memory, even on the host, because all heap allocations are placed in device memory, including allocations that are never needed on the GPU. Depending on the application, this can significantly limit the maximum problem size.
     Data is coherent only at kernel launches and synchronization points: host code is not allowed to access unified memory while a kernel is running, and doing so may result in a segmentation fault (see the sketch below).
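     A minimal sketch of this coherence restriction. It assumes the code is built with -ta=tesla:managed; the async queue number, array name, and size are illustrative assumptions, not taken from the talk.

       #include <stdlib.h>

       int main(void) {
           int n = 1 << 20;
           /* With -ta=tesla:managed this heap allocation lives in managed memory. */
           double *a = (double *)malloc(n * sizeof(double));
           for (int i = 0; i < n; ++i) a[i] = 1.0;

           /* Launch a kernel asynchronously on queue 1. */
           #pragma acc parallel loop async(1)
           for (int i = 0; i < n; ++i)
               a[i] *= 2.0;

           /* UNSAFE on Kepler/Maxwell: reading a[0] here, while the kernel may still
              be running, touches managed memory from the host and can segfault:
              double bad = a[0];                                                    */

           /* Safe: synchronize first, then access the data on the host. */
           #pragma acc wait(1)
           double ok = a[0];
           (void)ok;

           free(a);
           return 0;
       }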

  9. LBM D2Q37
     Lattice Boltzmann Method (LBM), D2Q37 model.
     Application developed at U Rome Tor Vergata/INFN, U Ferrara/INFN, TU Eindhoven.
     Reproduces the dynamics of a fluid by simulating virtual particles which collide and propagate.
     Simulation of large systems requires double-precision computation and many GPUs.

  10. LBM D2Q37
     Versions:
     MPI + OpenMP + vector intrinsics, using an AoS data layout.
     MPI + OpenACC, using an SoA data layout and traditional data staging with data regions and data clauses (this version, starting without the OpenACC directives, was used for the following).
     MPI + CUDA C, using an SoA data layout.
     OpenCL.
     A paper comparing these variants was presented at Euro-Par 2015: "Accelerating Lattice Boltzmann Applications with OpenACC", E. Calore, J. Kraus, S. F. Schifano and R. Tripiccione.

  11. LBM D2Q37 – INITIAL VERSION
     CPU profile (480x512), 1 MPI rank:
     Rank  Method     Time (s) Initial
     1     collide    17.01
     2     propagate  10.71
     3     other       2.26
     4     bc          0.17
     Application reported solver time: 27.85 s. Profiler total time for process: 30.15 s.

  12. LBM D2Q37 – INITIAL VERSION
     Change the build environment:
     Enable OpenACC and managed memory: -acc -ta=tesla:managed,...
     Enable accelerator information: -Minfo=accel
     Enable CPU profiling information: -Mprof=func
     A possible combined compile line is sketched below.
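     For example, the switches above might be combined as follows when building a single translation unit with the PGI toolchain; the mpicc wrapper name, source file, and output name are assumptions for illustration:

       mpicc -acc -ta=tesla:managed -Minfo=accel -Mprof=func -o lbm lbm.c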

  13. LBM D2Q37 – INITIAL VERSION
     CPU profile (480x512) using Unified Memory, 1 MPI rank:
     Rank  Method     Time (s) UM   Time (s) Initial
     1     propagate  41.18         10.71
     2     collide    16.82         17.01
     3     other       6.58          2.26
     4     bc          0.17          0.17
     Application reported solver time: 62.96 s (initial: 27.85 s). Profiler total time for process: 64.75 s (initial: 30.15 s).

  14. LBM D2Q37 – INITIAL VERSION
     NVVP timeline (480x512) using Unified Memory, 1 MPI rank. The MPI handling of the periodic boundary conditions causes a flush of data to the GPU in every iteration.

  15. LBM D2Q37 – INITIAL VERSION
     NVVP timeline (480x512) using Unified Memory, zoomed, 1 MPI rank. propagate is slowed down by Unified Memory page migrations.

  16. LBM D2Q37 – ACCELERATING PROPAGATE
     /* The ix loop is mapped to gangs and the iy loop to vector lanes; each lattice
        site copies its 37 populations from prv to nxt. */
     inline void propagate(const data_t *restrict prv, data_t *restrict nxt)
     {
       int ix, iy, site_i;
       #pragma acc kernels
       #pragma acc loop independent device_type(NVIDIA) gang
       for (ix = HX; ix < (HX + SIZEX); ix++) {
         #pragma acc loop independent device_type(NVIDIA) vector(LOCAL_WORK_SIZEX)
         for (iy = HY; iy < (HY + SIZEY); iy++) {
           site_i = (ix * NY) + iy;
           nxt[site_i]                = prv[site_i - 3 * NY + 1];
           nxt[NX * NY + site_i]      = prv[NX * NY + site_i - 3 * NY];
           //...
           nxt[35 * NX * NY + site_i] = prv[35 * NX * NY + site_i + 3 * NY];
           nxt[36 * NX * NY + site_i] = prv[36 * NX * NY + site_i + 3 * NY - 1];
         }
       }
     }

  17. LBM D2Q37 – PROPAGATE ACCELERATED
     CPU profile (480x512) using Unified Memory, 1 MPI rank (propagate now on the GPU):
     Rank  Method     Time (s) +propagate   Time (s) UM   Time (s) Initial
     1     bc         34.59                  0.17          0.17
     2     collide    16.75                 16.82         17.01
     3     other       6.94                  6.58          2.26
     4     propagate   2.14                 41.18         10.71
     Application reported solver time: 57.65 s (UM: 62.96 s). Profiler total time for process: 60.42 s (UM: 64.75 s).

  18. LBM D2Q37 – PROPAGATE ACCELERATED
     NVVP timeline (480x512) using Unified Memory, 1 MPI rank.

  19. LBM D2Q37 – PROPAGATE ACCELERATED
     NVVP timeline (480x512) using Unified Memory, zoomed, 1 MPI rank. bc is slowed down by Unified Memory page migrations.

  20. LBM D2Q37 – BC ACCELERATED
     CPU profile (480x512) using Unified Memory, 1 MPI rank (propagate and bc now on the GPU):
     Rank  Method     Time (s) +bc   Time (s) +propagate   Time (s) UM   Time (s) Initial
     1     collide    49.99          16.75                 16.82         17.01
     2     other       7.61           6.94                  6.58          2.26
     3     propagate   2.15           2.14                 41.18         10.71
     4     bc          0.11          34.59                  0.17          0.17
     Application reported solver time: 55.74 s (+propagate: 57.65 s). Profiler total time for process: 59.86 s (+propagate: 60.42 s).

  21. LBM D2Q37 – BC ACCELERATED
     NVVP timeline (480x512) using Unified Memory, 1 MPI rank.

  22. LBM D2Q37 – BC ACCELERATED
     NVVP timeline (480x512) using Unified Memory, 1 MPI rank. collide is slowed down by Unified Memory page migrations.

  23. LBM D2Q37 – COLLIDE ACCELERATED
     CPU profile (480x512) using Unified Memory, 1 MPI rank:
     Rank  Method      Time (s) Final   Time (s) UM+propagate+bc   Time (s) Initial
     0     main        7.69             2.39                        1.89
     1     collide     0.52            49.99                       17.01
     2     lbm         0.41             4.72                        0.06
     3     init        0.19             0.19                        0.04
     4     printMass   0.15             0.17                        0.01
     5     propagate   0.13             2.15                       10.71
     6     bc          0.09             0.11                        0.17
     7     projection  0.05             0.05                        0.06
     Application reported solver time: 0.96 s (+bc: 55.74 s, initial: 27.85 s). Profiler total time for process: 9.33 s (+bc: 59.86 s, initial: 30.15 s).

  24. LBM D2Q37 – COLLIDE ACCELERATED
     NVVP timeline (480x512) using Unified Memory, 1 MPI rank. The data stays on the GPU while the simulation is running.

  25. LBM D2Q37 – MULTI GPU
     Requirements:
     A CUDA-aware MPI with support for Unified Memory, e.g. Open MPI since 1.8.5 or MVAPICH2-GDR since 2.2b, with CUDA 7.0.
     Start one MPI rank per GPU.

  26. LBM D2Q37 – MULTI GPU
     Handling GPU affinity:
       int rank = 0; int size = 1;
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       #if _OPENACC
       int ngpus = acc_get_num_devices(acc_device_nvidia);
       int devicenum = rank % ngpus;
       acc_set_device_num(devicenum, acc_device_nvidia);
       acc_init(acc_device_nvidia);
       #endif /*_OPENACC*/
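     The modulo on the global rank assumes that ranks are placed consecutively on each node. A hedged alternative, not from the talk, is to derive a node-local rank with MPI-3's MPI_Comm_split_type and use that for device selection; the helper name select_gpu is an assumption for illustration.

       #include <mpi.h>
       #ifdef _OPENACC
       #include <openacc.h>
       #endif

       /* Select a GPU based on the node-local rank instead of the global rank.
          Must be called after MPI_Init. Illustrative sketch only; the LBM code
          shown above uses rank % ngpus. */
       static void select_gpu(void)
       {
       #ifdef _OPENACC
           MPI_Comm local_comm;
           int local_rank = 0;
           /* Group the ranks that share a node (shared-memory domain). */
           MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                               MPI_INFO_NULL, &local_comm);
           MPI_Comm_rank(local_comm, &local_rank);
           MPI_Comm_free(&local_comm);

           int ngpus = acc_get_num_devices(acc_device_nvidia);
           acc_set_device_num(local_rank % ngpus, acc_device_nvidia);
           acc_init(acc_device_nvidia);
       #endif
       }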

  27. LBM D2Q37 – MULTI GPU
     NVVP timeline (480x512) using Unified Memory, 2 MPI ranks.

  28. LBM D2Q37 – MULTI GPU
     NVVP timeline (480x512) using Unified Memory, zoomed, 2 MPI ranks (MPI activity visible in the timeline).

  29. LBM D2Q37 – MULTI GPU
     (Chart: strong scaling, runtime in seconds for 1000 steps on a 1440x10240 grid, on 1 GPU (1/2 K80), 2 GPUs (1 K80), 4 GPUs (2 K80), and 8 GPUs (4 K80); measured Tesla K80 runtimes compared against linear scaling.)

  30. LBM D2Q37 – MULTI GPU
     Overlapping communication and computation:
     Possible, but one needs to be careful not to use Unified Memory pointers in host code while kernels are running asynchronously.
     When using -ta=tesla:managed, all kernel launches are synchronous by default, i.e. PGI_ACC_SYNCHRONOUS=1.
     Set PGI_ACC_SYNCHRONOUS=0 to allow overlap, as in the sketch below.
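     A minimal sketch of this overlap pattern. The routine names (step, exchange_halos), the async queue numbers, and the placeholder loop bodies are illustrative assumptions and not the actual LBM code.

       /* Hypothetical CUDA-aware MPI halo exchange; not part of the talk. */
       void exchange_halos(double *buf);

       /* Sketch of overlapping the halo exchange with the interior update.
          Assumes the code is built with -ta=tesla:managed and run with
          PGI_ACC_SYNCHRONOUS=0 so that the async kernels really overlap. */
       void step(double *restrict prv, double *restrict nxt, int halo, int n)
       {
           /* 1. Update the boundary sites needed by the neighbors on async queue 2. */
           #pragma acc parallel loop async(2)
           for (int i = 0; i < halo; ++i)
               nxt[i] = prv[i];             /* placeholder for the real boundary update */

           /* 2. Update the interior, the bulk of the work, on async queue 1. */
           #pragma acc parallel loop async(1)
           for (int i = halo; i < n - halo; ++i)
               nxt[i] = prv[i];             /* placeholder for the real propagate/collide */

           /* 3. Wait only for the boundary queue, then exchange halos with a CUDA-aware,
                 Unified-Memory-capable MPI while queue 1 keeps computing. The managed
                 pointer is passed to MPI but must not be dereferenced in host code here. */
           #pragma acc wait(2)
           exchange_halos(nxt);

           /* 4. Make sure the interior update has finished before the next time step. */
           #pragma acc wait(1)
       }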

  31. LBM D2Q37 – MULTI GPU
     Overlapping communication and computation. Grid size: 1920x2048.

  32. LBM D2Q37 – MULTI GPU
     Overlapping communication and computation. Grid size: 1920x2048.
