April 4-7, 2016 | Silicon Valley

HIGH-ORDER DISCRETIZATIONS FOR GEOMETRIC MULTI-GRID METHODS
Nikolay Sakharnykh (NVIDIA), Dusan Stosic (UFPE)
MULTI-GRID METHODS
Introduction

Multi-grid solves elliptic PDEs (Ax = b) using a hierarchical approach:
- the solution to a hard problem is expressed through the solution to an easier problem
- this accelerates iterative methods and provides O(N) complexity
MULTI-GRID METHODS
Applications

- External flow over a body
- Lattice QCD (Brookhaven National Laboratory)
- Reservoir simulation
MULTI-GRID METHODS
Geometric vs algebraic (Ax = b)

GMG                                | AMG
Uses a structured grid             | Uses an arbitrary graph
Operator (A) is a stencil          | Operator (A) is a sparse matrix
Constructs coarse grid directly    | Constructs coarse grid using the RAP product
More efficient approach            | More general approach
Requires geometric information     | Performance optimization is challenging
NVIDIA AMGX
Black-box library for AMG

- Fast, scalable linear solvers, with an emphasis on iterative methods
- Flexible toolkit for GPU-accelerated Ax = b solvers
- Simple API makes it easy to solve your problems faster
- Used in industry-leading CFD packages!
NVIDIA AMGX
Example usage

Configuration file:

solver(main)=FGMRES
main:max_iters=100
main:convergence=RELATIVE_MAX
main:tolerance=0.1
main:preconditioner(amg)=AMG
amg:algorithm=AGGREGATION
amg:selector=SIZE_8
amg:cycle=V
amg:max_iters=1
amg:max_levels=10
amg:smoother(smoother)=BLOCK_JACOBI
amg:relaxation_factor=0.75
amg:presweeps=1
amg:postsweeps=2
amg:coarsest_sweeps=4
determinism_flag=1

Host code:

// One header
#include "amgx_c.h"

// Read config file
AMGX_create_config(&cfg, cfgfile);

// Create resources based on config
AMGX_resources_create_simple(&res, cfg);

// Create solver object, A, x, b, set precision
AMGX_solver_create(&solver, res, mode, cfg);
AMGX_matrix_create(&A, res, mode);
AMGX_vector_create(&x, res, mode);
AMGX_vector_create(&b, res, mode);

// Read coefficients from a file
AMGX_read_system(&A, &x, &b, matrixfile);

// Setup and Solve Loop
AMGX_solver_setup(solver, A);
AMGX_solver_solve(solver, b, x);

// Download Result
AMGX_download_vector(&x)
HPGMG
High-Performance Geometric Multi-Grid

- Lawrence Berkeley National Laboratory
- FVM and FEM variants; we focus on FVM
- Proxy for AMR and low Mach combustion codes
- Used in Top500 benchmarking
- http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/
HPGMG
Multi-grid cycles

[Figure: V-cycle and F-cycle diagrams; a few finer grids at the top (smoother, smoother & residual), many coarser grids below, and a direct solve at the coarsest level]

- HPGMG implements the F-cycle, which has a better convergence rate than the V-cycle
- Poisson or Helmholtz operator using 2nd- or 4th-order discretization
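For orientation, here is a minimal sketch of the V-cycle recursion the diagram depicts. The helper names are hypothetical, not the HPGMG API; level_type matches the struct used in the code later in this talk.

// Hypothetical helpers, assumed to exist with these signatures:
void smooth(level_type *level);
void residual(level_type *level);
void restrict_to_coarse(level_type *coarse, level_type *fine);
void interpolate_and_add(level_type *fine, level_type *coarse);
void direct_solve(level_type *level);

void v_cycle(level_type *levels, int l, int num_levels) {
  if (l == num_levels - 1) {
    direct_solve(&levels[l]);                         // coarsest grid: solve directly
    return;
  }
  smooth(&levels[l]);                                 // pre-smooth on this grid
  residual(&levels[l]);                               // r = b - Ax
  restrict_to_coarse(&levels[l + 1], &levels[l]);     // move the residual down
  v_cycle(levels, l + 1, num_levels);                 // recurse on the coarser grid
  interpolate_and_add(&levels[l], &levels[l + 1]);    // correct the fine solution
  smooth(&levels[l]);                                 // post-smooth
}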
HPGMG
High-order discretizations

Order describes the relationship between grid spacing and error.

Why do we need higher-order discretizations?
1. Improved accuracy without a significant increase in memory usage
2. A higher flop-per-byte ratio enables better utilization of modern architectures

Many solvers are moving towards high-order schemes for production codes.
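As a reminder of what "order" means here, the standard 1D central differences for u''(x), a textbook example rather than anything from the slides:

\[
\frac{u_{i-1} - 2u_i + u_{i+1}}{h^2} = u''(x_i) + O(h^2)
\]
\[
\frac{-u_{i-2} + 16u_{i-1} - 30u_i + 16u_{i+1} - u_{i+2}}{12h^2} = u''(x_i) + O(h^4)
\]

Halving h cuts the error by roughly 4x for the 2nd-order scheme and by roughly 16x for the 4th-order scheme, at the cost of a wider stencil rather than a finer (more memory-hungry) grid.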
HPGMG
v0.2: second-order

[Figure: 7-point stencil spanning planes K-1, K, K+1]

- 2nd order: 7-point stencil
- 6 x 2-point stencils
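A minimal sketch of how such a 7-point operator looks at one grid point, simplified to constant coefficients for brevity (HPGMG's actual operator has variable coefficients; the names and strides here are illustrative, chosen to match the kernels later in the talk):

static inline double apply_op_7pt(const double *x, int ijk,
                                  int iStride, int jStride, int kStride,
                                  double h2inv) {
  // six 2-point differences against the center point
  return h2inv * ( x[ijk - iStride] + x[ijk + iStride]   // left/right
                 + x[ijk - jStride] + x[ijk + jStride]   // down/up
                 + x[ijk - kStride] + x[ijk + kStride]   // front/back
                 - 6.0 * x[ijk] );                       // center
}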
HPGMG
v0.3: fourth-order

[Figure: 25-point stencil spanning planes K-2, K-1, K, K+1, K+2]

- 4th order: 25-point stencil
- 18 x 4-point stencils, flop:byte ~1.0
HPGMG
4th order vs 2nd order

- Performs 4x the FP operations
- MPI: sends 3x the messages and doubles their size (2-deep halos)
- DRAM memory footprint is the same (assuming no overfetch)
- Attains a lower relative residual: ~10^-9 for a single F-cycle
HYBRID IMPLEMENTATION
HYBRID IMPLEMENTATION
Take advantage of both architectures

[Figure: V-cycle and F-cycle with a GPU/CPU threshold; levels above the threshold run on the GPU, levels below it on the CPU]

- Fine levels are executed on throughput-optimized processors (GPU)
- Coarse levels are executed on latency-optimized processors (CPU)
HYBRID IMPLEMENTATION
What is the optimal threshold?

[Figure: HPGMG v0.3 hybrid performance, DOF/s vs. number of levels on the GPU (from all levels on GPU to all levels on CPU), for two problem sizes h and 2h]

Rule of thumb: execute a level on the GPU if it has more than 10K grid points (a sketch of this dispatch follows below).
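A sketch of that heuristic, assuming a hypothetical num_grid_points helper; level_type and the use_cuda flag appear in the real code shown later in the talk:

#define GPU_THRESHOLD 10000  // grid points; tuned empirically per system

size_t num_grid_points(level_type *level);  // hypothetical helper

void choose_processor(level_type *level) {
  // large (fine) levels go to the throughput-optimized GPU,
  // small (coarse) levels stay on the latency-optimized CPU
  level->use_cuda = (num_grid_points(level) > GPU_THRESHOLD);
}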
MEMORY MANAGEMENT
HPGMG-FV entities naturally map to the GPU hierarchy

[Figure: a multi-GPU domain decomposed into grids 0-3]

- DOMAIN -> distributed across MPI ranks (multi-process / multi-GPU)
- BOX    -> the unit of distribution, tiled into blocks
- BLOCK  -> an OMP thread on the CPU, a GPU thread block on the GPU
MEMORY MANAGEMENT

[Figure: two layouts for the ~10 vectors of a level (vectors x volume) across boxes 1-3]

- Contiguous layout: vector data within a level is contiguous; requires one copy per vector
- Disjoint layout: vector data within a level is disjoint; requires one copy per box
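A sketch of why the contiguous layout pays off, using hypothetical names rather than the HPGMG structs: when all vectors of a level live in one allocation, each vector is a single contiguous span and moves with one cudaMemcpy.

#include <cuda_runtime.h>

typedef struct {
  double *data;        // num_vectors * volume doubles, one allocation
  int     num_vectors; // ~10 per level
  size_t  volume;      // grid points per vector, all boxes concatenated
} level_vectors;

// one copy per vector instead of one copy per box
void copy_vector_to_device(level_vectors *lv, int v, double *d_dst) {
  cudaMemcpy(d_dst, lv->data + (size_t)v * lv->volume,
             lv->volume * sizeof(double), cudaMemcpyHostToDevice);
}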
MEMORY MANAGEMENT
Using Unified Memory

[Figure: developer view with Unified Memory; CPU and GPU memories appear as a single space]

- No changes to data structures
- No explicit data movements
- Single pointer for CPU and GPU data
- Use cudaMallocManaged for allocations
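A minimal, self-contained illustration of the idea (not HPGMG code): a managed allocation is usable through the same pointer in host code and in device kernels.

#include <cuda_runtime.h>

__global__ void scale(double *v, double a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) v[i] *= a;
}

int main(void) {
  int n = 1 << 20;
  double *v;
  cudaMallocManaged(&v, n * sizeof(double));  // one pointer for CPU and GPU
  for (int i = 0; i < n; i++) v[i] = 1.0;     // touch on the CPU
  scale<<<(n + 255) / 256, 256>>>(v, 2.0, n); // same pointer on the GPU
  cudaDeviceSynchronize();                    // wait before the CPU reads again
  double check = v[0];
  cudaFree(v);
  return check == 2.0 ? 0 : 1;
}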
UNIFIED MEMORY
Simplified GPU programming

Minimal modifications to the original code:
(1) malloc replaced with cudaMallocManaged for levels accessed by the GPU
(2) Invoke a CUDA kernel if the level size is greater than the threshold

void smooth(level_type *level, ...) {
  ...
  if (level->use_cuda) {
    // run on GPU
    cuda_cheby_smooth(level, ...);
  } else {
    // run on CPU
    #pragma omp parallel for
    for (block = 0; block < num_blocks; block++)
      ...
  }
}
UNIFIED MEMORY
What about performance?

[Figure: NVVP timeline for HPGMG, showing levels 0, 1, and 2]

Problem: excessive faults and migrations at CPU-GPU crossover points
UNIFIED MEMORY
Eliminating page migrations and faults

[Figure: data flow across the CPU-GPU crossover; smoother, residual, and restriction run as GPU kernels on level N (large), smoother and residual run as CPU functions on level N+1 (small), so the level N+1 data is shared between CPU and GPU]

Solution: allocate the first CPU level with cudaMallocHost (zero-copy memory)
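A sketch of the resulting allocation policy, as a hypothetical wrapper rather than the HPGMG source. On 64-bit systems with unified virtual addressing, cudaMallocHost memory is pinned and directly accessible from GPU kernels over PCIe, so no page faults or migrations occur.

#include <cuda_runtime.h>

void *level_alloc(size_t bytes, int shared_with_gpu) {
  void *p;
  if (shared_with_gpu) {
    // pinned host memory, directly accessible from GPU kernels (zero-copy):
    // used for the first CPU level at the crossover point
    cudaMallocHost(&p, bytes);
  } else {
    // levels owned by the GPU keep using managed memory as before
    cudaMallocManaged(&p, bytes);
  }
  return p;
}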
UNIFIED MEMORY

[Figure: NVVP timelines; CPU levels allocated with cudaMallocManaged show page faults at the crossover points, CPU levels allocated with cudaMallocHost show no page faults]

+20% speed-up!
KERNEL OPTIMIZATIONS
MULTI-GRID BOTTLENECK
Cost of operations

[Figure: per-level breakdown of kernel time (smoother, residual, interpolation, restriction, copy_blocks, apply_bc) as a fraction of level time and of total time across levels 0-6; volume work dominates on fine levels, surface work on coarse levels]

Most time is spent on the stencil kernels.
STENCILS ON GPU
Parallelization strategies

[Figure: 3D thread blocks vs. 2D thread blocks over an i-j-k grid]

- 3D thread blocks: provide more than enough parallelism
- 2D thread blocks: parallelize over the XY plane and march along Z
STENCILS ON GPU
Baseline 2D kernel

__global__ void smooth_kernel(level_type level, ...) {
  blockCopy_type block = level.my_blocks[blockIdx.x];

  // prologue
  const double * __restrict__ rhs = ...;
  ...

  // thread block marches from 0 to DIMZ
  for (int k = 0; k < dimz; k++) {
    // apply operator: Poisson or Helmholtz (FV2, FV4)
    const double Ax = apply_op_ijk();

    // smoother
#if defined(CHEBY)    // Chebyshev polynomials
    xo[ijk] = x[ijk] + ... + c2*lambda*(rhs[ijk] - Ax);
#elif defined(JACOBI) // Jacobi
    xo[ijk] = x[ijk] + (2.0/3.0)*lambda*(rhs[ijk] - Ax);
#elif defined(GSRB)   // Gauss-Seidel red-black
    xo[ijk] = x[ijk] + RB[ij]*lambda*(rhs[ijk] - Ax);
#endif
  }
}
STENCILS ON GPU
Register caching

The kernel uses 38 registers without the stencil; the 7-point stencil adds 18 registers and the 4th-order stencil adds 90, for a total register usage of 56 for FV2 and 128 for FV4. Up to 1.5x speed-up!

// load k and k-1 planes into registers
double xc0 = x[ijk - kStride];
double xc1 = x[ijk];
...

for (k = 0; k < dimz; k++) {
  // load k+1 plane into registers
  xc2 = x[ijk + kStride];
  ...

  // apply operator
  const double Ax = apply_op_ijk();

  // smoother
  xo[ijk] = xc1 + ...;

  // update k and k-1 planes in registers
  xc0 = xc1; xc1 = xc2; ...
}

The 7-point (FV2) operator reads the cached z-planes directly:

const double Ax = -b*h2inv*(
  ...
  + bir1 * (xr1 - xc1)
  + bic1 * (xl1 - xc1)
  + bju1 * (xu1 - xc1)
  + bjc1 * (xd1 - xc1)
  + bkc2 * (xc2 - xc1)
  + bkc1 * (xc0 - xc1)
);

The 4th-order (FV4) operator does the same with two cached planes on each side:

const double Ax = -b*h2inv*(
  STENCIL_TWELFTH*(
    + bic1 * ( 15.0*(xl1-xc1) - (xll-xr1) )
    + bir1 * ( 15.0*(xr1-xc1) - (xrr-xl1) )
    + bjc1 * ( 15.0*(xu1-xc1) - (xuu-xd1) )
    + bjd1 * ( 15.0*(xd1-xc1) - (xdd-xu1) )
    + bkc1 * ( 15.0*(xc0-xc1) - (xbb-xc2) )
    + bkc2 * ( 15.0*(xc2-xc1) - (xff-xc0) ) )
  + 0.25*STENCIL_TWELFTH*(
    + (bid  - biu ) * (xld - xd1 - xlu + xu1)
    + (bic2 - bic0) * (xl2 - xc2 - xl0 + xc0)
    + (bjr  - bjl ) * (xru - xr1 - xlu + xl1)
    + (bjc2 - bjc0) * (xu2 - xc2 - xu0 + xc0)
    + (bkr1 - bkl1) * (xr0 - xr1 - xl0 + xl1)
    + (bkd1 - bku1) * (xd0 - xd1 - xu0 + xu1)
    + (bird - biru) * (xrd - xd1 - xru + xu1)
    + (bir2 - bir0) * (xr2 - xc2 - xr0 + xc0)
    + (bjrd - bjld) * (xrd - xr1 - xld + xl1)
    + (bjd2 - bjd0) * (xd2 - xc2 - xd0 + xc0)
    + (bkr2 - bkl2) * (xr2 - xr1 - xl2 + xl1)
    + (bkd2 - bku2) * (xd2 - xd1 - xu2 + xu1)
  ));
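Putting the pattern together, a self-contained sketch for the constant-coefficient 7-point case (a hypothetical kernel, not the HPGMG source): each thread owns one (i,j) column, keeps the k-1/k/k+1 values of that column in registers, and rotates them while marching along Z, so the z-neighbors are never re-loaded from memory.

__global__ void smooth_7pt_regcache(const double * __restrict__ x,
                                    double *xo,
                                    const double * __restrict__ rhs,
                                    int dimx, int dimy, int dimz,
                                    double h2inv, double lambda) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;
  if (i < 1 || i >= dimx - 1 || j < 1 || j >= dimy - 1) return;

  int kStride = dimx * dimy;
  int ijk = 1 * kStride + j * dimx + i;        // start at k = 1

  // preload the k-1 and k planes of this thread's column into registers
  double xc0 = x[ijk - kStride];
  double xc1 = x[ijk];

  for (int k = 1; k < dimz - 1; k++, ijk += kStride) {
    double xc2 = x[ijk + kStride];             // load the k+1 plane

    // 7-point operator: x/y neighbors from memory, z neighbors from registers
    double Ax = h2inv * ( x[ijk - 1] + x[ijk + 1]
                        + x[ijk - dimx] + x[ijk + dimx]
                        + xc0 + xc2 - 6.0 * xc1 );

    xo[ijk] = xc1 + (2.0/3.0) * lambda * (rhs[ijk] - Ax);  // Jacobi smoother

    xc0 = xc1; xc1 = xc2;                      // rotate the register planes
  }
}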
STENCILS ON GPU
Using caches on the SM

By default, global loads are not cached in L1 on Kepler.

- Use -Xptxas="-dlcm=ca" to enable L1: up to 1.4x speed-up!
  128B load granularity, better for coalesced access
- Use the __ldg intrinsic for read-only accesses: up to 1.3x speed-up!
  32B load granularity, better for scattered access
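A sketch combining the two options on the same constant-coefficient stencil (again illustrative, not HPGMG code): compile with nvcc -Xptxas -dlcm=ca to cache global loads in L1, and mark read-only data with __ldg to route it through the 32B-granularity read-only cache.

__global__ void residual_7pt_ldg(const double * __restrict__ x,
                                 const double * __restrict__ rhs,
                                 double *res,
                                 int dimx, int dimy, int dimz,
                                 double h2inv) {
  int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
  int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
  int k = blockIdx.z * blockDim.z + threadIdx.z + 1;
  if (i >= dimx - 1 || j >= dimy - 1 || k >= dimz - 1) return;

  int jS = dimx, kS = dimx * dimy;
  int ijk = k * kS + j * jS + i;

  // __ldg marks these loads read-only, so they go through the texture path
  double Ax = h2inv * ( __ldg(&x[ijk - 1])  + __ldg(&x[ijk + 1])
                      + __ldg(&x[ijk - jS]) + __ldg(&x[ijk + jS])
                      + __ldg(&x[ijk - kS]) + __ldg(&x[ijk + kS])
                      - 6.0 * __ldg(&x[ijk]) );
  res[ijk] = __ldg(&rhs[ijk]) - Ax;  // r = b - Ax
}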