Unstructured-Grid CFD Algorithms on the NVIDIA Pascal and Volta Architectures

Eric Nielsen and Aaron Walden, NASA Langley Research Center
Justin Luitjens, NVIDIA Corporation
Mohammad Zubair and Jason Orender, Old Dominion University
John Wohlbier, DoD HPCMP PETTT / Engility Corporation
John Linford, ParaTools, Inc.
General Background

Collaboration across government, industry, and academia:
• NASA ARMD TTT/RCA, NASA Langley CDT and HPC Incubator efforts
• Old Dominion University
• NVIDIA and Portland Group
• Department of Defense HPCMP PETTT
• ParaTools, Inc. and University of Oregon
• ORNL GPU hackathons and others

• GPU acceleration of FUN3D kernels was attempted in the past, but limitations in the programming model and earlier hardware could not compete with traditional CPU technology
• Newer GPU hardware with improved memory bandwidth, such as Pascal and Volta, coupled with an optimized CUDA approach, has now made GPU computing a compelling alternative
Motivation

NASA Langley's FUN3D solver is used to tackle some of the nation's most complex aerodynamics problems.

[Example applications shown: Mars InSight Lander, Gulfstream G550, DoD applications, Ares I-X, NASA/Boeing Truss-Braced Wing]
Solver Background

  for i = 1 to n_time_steps do
    // Form Right Hand Side
    // Form Left Hand Side
    // Solve Ax = b
    // Update Solution
  end for

• FUN3D solves the Navier-Stokes equations of fluid dynamics using implicit time integration on general unstructured grids
• This approach gives rise to a large block-sparse system of linear equations that must be solved at each time step
• We focus on one component of the LHS and the linear solver
• Two kernels are generally the largest contributors to run time:
  • Kernel 1: Construction and storage of the compressible viscous flux Jacobians
  • Kernel 2: Multicolor point-implicit linear solver used to solve Ax = b
• This overview focuses on CUDA implementations of these two kernels
Kernel 1: Algorithm

  Initialize A_DIAG ← 0 and A_OFF ← 0
  for each cell ∈ Grid do
    Update A_DIAG for nodes in cell
    Update A_OFF for nodes in cell
  end for

A_DIAG: diagonal block matrix
A_OFF: off-diagonal block-sparse matrix
n: number of cells in grid

• Updates to A_DIAG and A_OFF require several loops over the number of edges, faces, and nodes within the current cell

Parallelization: The computation can be parallelized over the number of cells; however, atomic updates are required to avoid race conditions when writing to A_DIAG and A_OFF

Challenges: Traversal of the grid results in irregular memory accesses, complexities related to the underlying physics, and a large number of variables and temporary arrays resulting in cache and register pressure
Kernel 1: Initial Implementation

  Initialize A_DIAG ← 0 and A_OFF ← 0
  for each cell ∈ Grid do
    Update A_DIAG for nodes in cell
    Update A_OFF for nodes in cell
  end for

• Assign a thread to a cell and use atomic updates for A_DIAG and A_OFF
• Refactor code to minimize the number of variables and temporary arrays
  • This refactoring also yielded a ~2x speedup on the Xeon CPU
• Use shared memory for a few critical temporary arrays

Issues: Each thread still uses many temporary arrays and a large number of registers
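A minimal sketch of this thread-per-cell strategy is shown below. The block size, data layout, and all names (cell_nodes, cell_data, a_diag, etc.) are illustrative assumptions standing in for FUN3D's actual grid and matrix structures, not the production code.

```cuda
// Hedged sketch: one thread per cell, atomic accumulation into the diagonal
// Jacobian blocks. Off-diagonal blocks would be scattered the same way.
#define NB 5  // equations per grid point (assumed block size)

__global__ void assemble_jacobian_per_cell(
    int n_cells, int nodes_per_cell,
    const int*    __restrict__ cell_nodes,  // [n_cells * nodes_per_cell]
    const double* __restrict__ cell_data,   // per-cell geometry/state (placeholder)
    double*                    a_diag)      // [n_nodes * NB * NB] diagonal blocks
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= n_cells) return;

    // ... the real kernel loops over edges, faces, and nodes of the cell to
    //     build its contribution; a placeholder value is used here ...
    double contrib[NB * NB];
    for (int i = 0; i < NB * NB; ++i) contrib[i] = cell_data[cell];

    // Scatter with atomics: several cells touch the same node's block, so
    // plain stores would race. Double-precision atomicAdd requires sm_60+
    // (Pascal/Volta), which matches the hardware discussed here.
    for (int ln = 0; ln < nodes_per_cell; ++ln) {
        int node = cell_nodes[cell * nodes_per_cell + ln];
        for (int i = 0; i < NB * NB; ++i)
            atomicAdd(&a_diag[node * NB * NB + i], contrib[i]);
    }
}
```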
Kernel 1: Optimized Implementation

  Initialize A_DIAG ← 0 and A_OFF ← 0
  for each cell ∈ Grid do                      // parallelize across gridDim.x * blockDim.y threads
    for each node ∈ cell do                    // parallelize using blockDim.x threads
      // compute cell averages, set local arrays
    end for
    for each face ∈ cell do                    // parallelize using blockDim.x threads
      // linearize cell gradients
    end for
    for each edge ∈ cell do                    // flatten nested loops and parallelize using blockDim.x threads
      // compute edge contributions to Jacobian
      for each node ∈ cell do
        // compute gradients at dual face
      end for
    end for
    for each node ∈ cell do                    // parallelize using blockDim.x threads
      // assemble 17 contributions to Jacobian
    end for
  end for

• Assign a CTA of blockDim.x * blockDim.y threads to process blockDim.y cells
  • Increases the number of active threads and improves thread utilization
  • Coalesces the memory access pattern
  • Reduces register and shared memory pressure, increasing occupancy
• Enable reductions in inner loops using shared memory
• Auto-tuning used to choose blockDim.x and blockDim.y
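A minimal sketch of this CTA decomposition, assuming hypothetical names and placeholder loop bodies: each threadIdx.y slice of the block owns one cell, and its blockDim.x threads cooperate on the inner loops and a shared-memory reduction.

```cuda
// Hedged sketch of the blockDim.x x blockDim.y mapping: blockDim.y cells per
// CTA, blockDim.x threads cooperating per cell. Launch with dynamic shared
// memory of blockDim.x * blockDim.y * sizeof(double).
__global__ void assemble_jacobian_cta(
    int n_cells, int nodes_per_cell,
    const int* __restrict__ cell_nodes,
    double*                 a_diag)
{
    int  cell   = blockIdx.x * blockDim.y + threadIdx.y;
    bool active = (cell < n_cells);

    extern __shared__ double smem[];                         // per-cell scratch
    double* cell_scratch = smem + threadIdx.y * blockDim.x;
    cell_scratch[threadIdx.x] = 0.0;

    // Inner work is strided over blockDim.x threads so neighboring threads
    // touch neighboring data and global accesses coalesce.
    if (active) {
        for (int ln = threadIdx.x; ln < nodes_per_cell; ln += blockDim.x) {
            int node = cell_nodes[cell * nodes_per_cell + ln];
            // ... per-node contribution to a cell average (placeholder) ...
            cell_scratch[threadIdx.x] += (double)node;
        }
    }
    __syncthreads();

    // Shared-memory reduction across the blockDim.x threads of this cell
    // (the "reduction in inner loops" mentioned above).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cell_scratch[threadIdx.x] += cell_scratch[threadIdx.x + s];
        __syncthreads();
    }

    // The reduced value would feed the Jacobian assembly; atomics are still
    // needed when scattering to nodes shared by multiple cells.
    if (active && threadIdx.x == 0)
        atomicAdd(&a_diag[cell], cell_scratch[0]);           // placeholder store
}
```

The auto-tuning mentioned above would sweep candidate (blockDim.x, blockDim.y) pairs, e.g. dim3 block(8, 16) with a grid of (n_cells + 15) / 16 blocks, and keep the fastest.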
Kernel 2: Multicolor Algorithm

FUN3D uses a series of multicolor point-implicit sweeps to form an approximate solution to Ax = b

• Color the rows so that rows of the same color share no adjacent unknowns, and reorder the matrix rows so that each color is stored contiguously
• Unknowns of the same color carry no data dependency and may be updated in parallel
• Updates of unknowns of each color use the latest updated values at grid points corresponding to other colors
• The overall process may be repeated using several outer sweeps over the entire system

[Figure: example grid with 44 unknowns colored and renumbered so that each color's rows are contiguous]
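A minimal host-side sketch of this sweep structure, assuming hypothetical kernel names (spmv_residual_kernel, block_trsv_kernel) and a color_offsets array giving the first block row of each color; FUN3D's actual driver differs.

```cuda
// Hedged sketch of the outer sweep / color loop on the host. Kernels launched
// on the default stream execute in order, so each color sees the latest x
// values produced by previously processed colors.
void multicolor_sweeps(int n_sweeps, int n_colors,
                       const int* color_offsets,          // host array, size n_colors+1
                       const double* d_a_off, const double* d_a_diag_lu,
                       const double* d_b, double* d_x, double* d_q)
{
    for (int sweep = 0; sweep < n_sweeps; ++sweep) {
        for (int c = 0; c < n_colors; ++c) {
            int row_begin = color_offsets[c];
            int rows      = color_offsets[c + 1] - row_begin;
            int blocks    = (rows * 32 + 255) / 256;       // one warp per block row

            // q_c = b_c - A_OFF_c * x
            spmv_residual_kernel<<<blocks, 256>>>(row_begin, rows,
                                                  d_a_off, d_x, d_b, d_q);
            // Solve L_D y = q and U_D x = y for this color's diagonal blocks
            block_trsv_kernel<<<blocks, 256>>>(row_begin, rows,
                                               d_a_diag_lu, d_q, d_x);
        }
    }
    cudaDeviceSynchronize();
}
```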
Kernel 2: Challenges

• With an average of only a few off-diagonal blocks per row, the arithmetic intensity of the computation is quite low (≈ 0.5), making the kernel memory bound on the GPU
• The number of rows associated with a color can vary significantly; consequently, the amount of parallelism available for different colors varies significantly
• To support strong scalability, single-node performance must also be good for light workloads
• The solver uses indirect memory addressing
• The implementation must support a broad range of block sizes
BLOCK CSR Storage
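The slide itself carries no further detail here; as a point of reference, a minimal sketch of a block CSR (BSR) layout of the kind referred to throughout follows. Field names and the row-major layout within each block are illustrative assumptions.

```cuda
// Hedged sketch of a block CSR (BSR) container for the off-diagonal matrix.
struct BlockCSR {
    int     n_brows;    // number of block rows
    int     nb;         // block dimension (nb x nb dense blocks)
    int     nnzb;       // number of nonzero blocks
    int*    row_ptr;    // [n_brows + 1]   start of each block row in col_idx/vals
    int*    col_idx;    // [nnzb]          block-column index of each nonzero block
    double* vals;       // [nnzb * nb*nb]  dense blocks stored contiguously
};

// Entry (i, j) of the k-th nonzero block, assuming row-major storage within a block:
//   vals[k * nb * nb + i * nb + j]
```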
Kernel 2: Naive Single-GPU Algorithm

To solve Ax = b:

  Initialize x ← 0
  for i = 1 to number_of_sweeps do
    for k = 1 to number_of_colors do
      Compute q_k ← b_k - A_OFF_k x
      Solve for y_k in L_D_k y_k = q_k     (forward substitution)
      Solve for x_k in U_D_k x_k = y_k     (backward substitution)
    end for
  end for

x: solution vector
L_D_k: lower triangular factor of A_DIAG_k
U_D_k: upper triangular factor of A_DIAG_k

• Assign a thread to process a block row. Since all rows of a color can be processed independently, the available parallelism is determined by the number of block rows
Kernel 2: Use of BLAS

  Initialize x ← 0
  for i = 1 to number_of_sweeps do
    for k = 1 to number_of_colors do
      Compute q_k ← b_k - A_OFF_k x        // block-sparse matrix-vector product:
                                           //   cuSPARSE function cusparseSbsrmv
      Solve for y_k in L_D_k y_k = q_k     // forward/backward substitutions:
      Solve for x_k in U_D_k x_k = y_k     //   cuBLAS function cublasStrsmBatched
    end for
  end for

• Experiments show that the performance of the BLAS functions available in existing CUDA libraries is suboptimal for matrices representative of those encountered in actual simulations
• Instead, optimized versions of these functions are developed¹

¹ Zubair et al., "An Optimized Multicolor Point-Implicit Solver for Unstructured Grid Applications on Graphics Processing Units," IA³ Workshop, SC16, Salt Lake City, UT.
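For illustration, a minimal sketch of how the residual step could be expressed with the legacy cuSPARSE BSR matrix-vector routine named on the slide. The wrapper name and all variable names are assumptions; only the cusparseSbsrmv call itself refers to the real library API.

```cuda
#include <cusparse.h>

// Hedged sketch: q <- b - A_OFF * x via y = alpha*A*x + beta*y with alpha = -1,
// beta = 1, and d_q preloaded with b. Single precision to match cusparseSbsrmv.
void off_diag_residual(cusparseHandle_t handle, cusparseMatDescr_t descr,
                       int mb, int nb_cols, int nnzb, int block_dim,
                       const float* bsrVal, const int* bsrRowPtr,
                       const int* bsrColInd, const float* d_x, float* d_q)
{
    const float alpha = -1.0f;
    const float beta  =  1.0f;
    cusparseSbsrmv(handle,
                   CUSPARSE_DIRECTION_ROW,            // layout within each block
                   CUSPARSE_OPERATION_NON_TRANSPOSE,
                   mb, nb_cols, nnzb,
                   &alpha, descr,
                   bsrVal, bsrRowPtr, bsrColInd, block_dim,
                   d_x, &beta, d_q);
}
```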
Kernel 2: Implementation Matrix-Vector Product

An approach is developed for a broad range of block sizes nb; here we focus on 1 ≤ nb ≤ 16

• nbk nonzero blocks are processed by nw warps
• nrbk block rows are processed by a CTA

[Figure: block-sparse matrix with nb x nb dense blocks scattered across block rows]
Kernel 2: Implementation Matrix-Vector Product

Consider an example of a block-sparse matrix with a block size of 4:

• A single warp is used to process a block row of the matrix in a loop, where nbk = 2 consecutive nonzero blocks are processed by the warp at each iteration
• The warp handles 32 (2 x (4 x 4)) matrix entries during each iteration, allowing the warp to load the elements of the matrix in a coalesced fashion
• The appropriate elements of x are loaded from the read-only data cache, multiplied by the corresponding elements of A_OFF, and the results are accumulated
• After completion of the loop, the partial results are stored in shared memory to be aggregated later
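A minimal sketch of this warp-per-block-row product for nb = 4 and nbk = 2, assuming the BlockCSR-style layout sketched earlier; the final aggregation is done here with warp shuffles as an in-register stand-in for the shared-memory aggregation described above.

```cuda
#define NB   4    // block size
#define NBK  2    // nonzero blocks processed per warp per iteration
#define WARP 32

// Hedged sketch: each warp owns one block row; lane = lblk*16 + i*4 + j handles
// entry (i, j) of the lblk-th of the two blocks loaded each iteration.
__global__ void bsr_spmv_nb4(int row_begin, int n_brows,
                             const int*   __restrict__ row_ptr,
                             const int*   __restrict__ col_idx,
                             const float* __restrict__ vals,
                             const float* __restrict__ x,
                             float*       __restrict__ y)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / WARP;
    int lane    = threadIdx.x % WARP;
    int brow    = row_begin + warp_id;
    if (brow >= row_begin + n_brows) return;       // whole warp exits together

    int lblk = lane / (NB * NB);                   // 0 or 1: which local block
    int e    = lane % (NB * NB);                   // entry within the 4x4 block
    int i    = e / NB;                             // row within the block
    int j    = e % NB;                             // column within the block

    float acc = 0.0f;
    for (int blk = row_ptr[brow]; blk < row_ptr[brow + 1]; blk += NBK) {
        int b = blk + lblk;
        if (b < row_ptr[brow + 1]) {
            float a  = vals[b * NB * NB + e];              // coalesced matrix load
            float xj = __ldg(&x[col_idx[b] * NB + j]);     // read-only data cache
            acc += a * xj;
        }
    }

    // Aggregate the 8 partial sums targeting each of the 4 output rows:
    // first over the 4 columns j, then over the two local blocks.
    acc += __shfl_down_sync(0xffffffff, acc, 1);
    acc += __shfl_down_sync(0xffffffff, acc, 2);
    acc += __shfl_down_sync(0xffffffff, acc, 16);

    if (lblk == 0 && j == 0)                       // lanes 0, 4, 8, 12
        y[brow * NB + i] = acc;
}
```

Launching with, say, 256 threads per CTA gives 8 warps, i.e. 8 block rows per CTA, matching the nrbk-rows-per-CTA description on the previous slide.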
Kernel 2: Implementation Forward/Backward Substitution

• The columns of the lower triangular factor of A_DIAG are processed from left to right using a single warp
• The amount of parallelism available to the warp decreases as we move from left to right
• A shuffle instruction broadcasts the solution value from the previous column to the other lanes
• The upper triangular portion is processed in a similar fashion
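A minimal sketch of the forward substitution for a single nb x nb lower-triangular factor, assuming nb = 4, a unit diagonal, and column-major block storage; the backward (upper-triangular) pass mirrors it from right to left. All names are illustrative.

```cuda
#define NB 4

// Hedged sketch: lane r owns row r of the block. Columns are processed left to
// right; as each column c finishes, its solution value is broadcast with a
// shuffle, and the remaining lanes (fewer and fewer of them) update their rhs.
__device__ void forward_sub_warp(const float* __restrict__ L,  // nb x nb factor
                                 const float* __restrict__ q,  // right-hand side
                                 float*       __restrict__ y)  // solution
{
    int lane = threadIdx.x % 32;
    const unsigned mask = 0xfu;          // only lanes 0..NB-1 participate
    if (lane >= NB) return;

    float rhs = q[lane];                 // running right-hand side of row 'lane'

    for (int c = 0; c < NB; ++c) {
        float yc = __shfl_sync(mask, rhs, c);           // broadcast y_c from lane c
        if (lane == c)  y[c]  = yc;                     // unit diagonal: y_c = rhs_c
        if (lane >  c)  rhs  -= L[lane + c * NB] * yc;  // column-major L
    }
}
```

With a non-unit diagonal, lane c would divide by the diagonal entry before the broadcast; the structure of the loop is otherwise unchanged.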
Kernel 2: Implementation Using Variant of ELLPACK

ELL-S (variant of ELLPACK): Sort the rows by the number of nonzeros, then form groups of 32 consecutive rows and pad each group so that all of its rows have the same number of nonzeros. The resulting increase in the number of stored nonzeros is < 1%.

Example: a group of 32 rows with at most 4 nonzeros per row. For a matrix with a single group whose maximum nonzero count is 4, nzs = 4. For a matrix with two groups, one with a maximum of 4 nonzeros and the other with a maximum of 6, nzs = 4 + 6 = 10.

a_offELL(32, 5, 5, nzs)
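A minimal host-side sketch of building the ELL-S grouping from a block CSR row pointer, assuming illustrative names; only the permutation and per-group padded widths are computed, with nzs accumulated as in the example above.

```cuda
#include <vector>
#include <numeric>
#include <algorithm>

struct EllS {
    std::vector<int> perm;         // sorted-row permutation (new -> original row)
    std::vector<int> group_width;  // padded nonzeros per row for each 32-row group
    int nzs = 0;                   // sum of group widths (as in a_offELL(..., nzs))
};

// Hedged sketch: sort rows by nonzero count, group 32 consecutive sorted rows,
// and pad each group to the width of its widest row.
EllS build_ell_s(const std::vector<int>& row_ptr)   // block CSR row pointer
{
    int n = (int)row_ptr.size() - 1;
    EllS e;
    e.perm.resize(n);
    std::iota(e.perm.begin(), e.perm.end(), 0);

    std::sort(e.perm.begin(), e.perm.end(), [&](int a, int b) {
        return (row_ptr[a + 1] - row_ptr[a]) < (row_ptr[b + 1] - row_ptr[b]);
    });

    for (int g = 0; g < n; g += 32) {
        int width = 0;
        for (int r = g; r < std::min(g + 32, n); ++r) {
            int row = e.perm[r];
            width = std::max(width, row_ptr[row + 1] - row_ptr[row]);
        }
        e.group_width.push_back(width);
        e.nzs += width;
    }
    return e;
}
```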
Kernel 2: Implementation Using Variant of ELLPACK

Algorithm: Assign a warp to each group of 32 rows

[Chart: performance of the ELL-S storage scheme vs. block CSR for Kernel 2 (solver)]

• With ELL-S we observed a 1.32x improvement over block CSR on the Pascal P100; no significant improvement was seen on the K40
• On the Volta V100, block CSR already achieves close to peak memory bandwidth
• Additionally, the impact of the new storage scheme on Kernel 1 is not yet clear; for now, our integrated kernel is based on block CSR