An Optimized GPU Implementation of a Multicolor Point-Implicit Linear Solver
Outline
• Motivation
• Problem and Challenges
• Approach
• Performance Results
• Future Work and Conclusion
Motivation
FUN3D
• FUN3D is a suite of computational fluid dynamics software developed at the NASA Langley Research Center to solve the Navier-Stokes equations for a broad range of aerodynamics applications
• For high-energy flows, traditional perfect-gas assumptions are invalid, and the system must be further expanded to accommodate finite-rate chemistry models
• For these reasons, accurate and efficient simulations of complex aerodynamic flows are challenging and require significant computational resources
FUN3D
• FUN3D uses an implicit time-integration strategy on unstructured grids that requires the solution of a large, tightly coupled system of block-sparse linear equations at each time step
• Computing solutions to these linear systems accounts for a significant fraction of the overall runtime in virtually all FUN3D simulations
Problem and Challenges
Linear Solver
• The implicit scheme results in linear systems of equations Ax = b, where A is an n × n block matrix for n grid points; each block is of size nb × nb
• Matrix A is segregated into two separate matrices: A ≡ D + O, where D and O represent the diagonal and off-diagonal blocks of A
• Array for storing D: D(nb, nb, n)
• Prior to performing each linear solve, each diagonal block D_i is decomposed in place into lower and upper triangular matrices L_i and U_i, 1 <= i <= n
Sparse Block-Matrix Storage Format for O
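The slide's storage diagram is not reproduced in this text. Below is a minimal sketch, in CUDA C, of a compressed-row block layout of the kind the slide describes; the type name and fields are illustrative assumptions, not the FUN3D data structure:

    /* Hypothetical block-CSR (BSR-like) layout for the off-diagonal
       matrix O: the nb x nb non-zero blocks of each block row are
       stored contiguously, indexed CSR-style by block row. */
    typedef struct {
        int    n;        /* number of block rows (grid points)       */
        int    nb;       /* block dimension                          */
        int    nnzb;     /* total number of non-zero blocks          */
        int   *rowptr;   /* size n+1: start of each block row        */
        int   *colidx;   /* size nnzb: block column of each block    */
        float *val;      /* size nnzb*nb*nb: block entries           */
    } BlockCSR;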
Multicolor Solver
• Grid points are colored such that no two adjacent points are assigned the same color
• Unknowns of the same color carry no data dependency and may be updated in parallel
• Updates of unknowns of each color use the latest updated values at grid points corresponding to other colors
• The overall process may be repeated using several outer sweeps over the entire system (see the sketch below)
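A minimal host-side sketch of the loop structure these bullets describe, in CUDA C; the kernel names, the color_start array, and the launch configuration are illustrative assumptions, not the FUN3D implementation:

    /* One outer sweep visits the colors in order; rows of one color have
       no mutual dependency, so each launch is fully parallel over them. */
    for (int sweep = 0; sweep < nsweeps; ++sweep) {
        for (int c = 0; c < ncolors; ++c) {
            int first = color_start[c];               /* first row of color c */
            int nrows = color_start[c + 1] - first;   /* rows of this color   */
            int nblk  = (nrows + 127) / 128;
            /* r = b - O*x over this color's rows, using the latest x values
               already updated at grid points of the other colors */
            sparse_block_matvec<<<nblk, 128>>>(O, x, b, r, first, nrows);
            /* x = D^{-1} r via forward/backward substitution with L_i, U_i */
            block_triangular_solve<<<nblk, 128>>>(D, r, x, first, nrows);
        }
    }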
Distribution of non-zeros in the sample matrix
Solver Challenges
• With an average of only a few off-diagonal blocks per row, the arithmetic intensity of the computation is quite low (≈ 0.5 for a matrix of dimension 983,633; a rough estimate follows below) – memory bound on the GPU
• The number of rows associated with a color can vary significantly; consequently, the amount of parallelism available differs significantly from color to color
• To support strong scalability, single-node performance for light workloads must also be good
• The solver uses indirect memory addressing
• The implementation must support a broad range of block sizes
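As a rough sanity check on the ≈ 0.5 figure, assuming single precision and counting only the block-value traffic (index and vector traffic lower it further): each nb × nb block costs 2·nb² flops against 4·nb² bytes of matrix data, so

    AI ≈ (2·nb² flops) / (4·nb² bytes) = 0.5 flops/byte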
Solver Computation
Solver Computation – MPI Implementation
This computation is identical to the sequential computation.

    subroutine fun3d_solver
      do color = 1, nc
        ! Process rows of each color
        ! MPI communication: update portion of x array
      end do
    end subroutine fun3d_solver
Approach
Solver Computation
• Sparse block-matrix (O) vector operation
• Multiple triangular solves using the dense blocks of matrix D
Naive Approach
• For each row block, form the residual b − Ox and solve with the diagonal block D via forward & backward substitution
• Assign a thread to process a row block; as all rows can be processed independently, the parallelism is determined by the number of row blocks (a CUDA sketch follows)
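A minimal CUDA C sketch of this naive mapping, one thread per block row. Names are illustrative, the layout assumes the BlockCSR sketch above, and the kernel is shown over all rows (the multicolor version launches over one color's rows at a time):

    #define NB_MAX 64   /* assumed upper bound on the block size */

    /* Thread i owns block row i: it forms r_i = b_i - O_i*x, then solves
       D_i x_i = r_i by forward/backward substitution with the in-place
       LU factors of the diagonal block (L assumed unit lower). */
    __global__ void naive_row_solve(const BlockCSR O, const float *D,
                                    const float *b, float *x, int n, int nb)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float r[NB_MAX];                          /* residual for this row */
        for (int k = 0; k < nb; ++k) r[k] = b[i * nb + k];

        for (int j = O.rowptr[i]; j < O.rowptr[i + 1]; ++j) {  /* r -= O*x */
            const float *blk = O.val + (size_t)j * nb * nb;
            int col = O.colidx[j] * nb;
            for (int p = 0; p < nb; ++p)
                for (int q = 0; q < nb; ++q)
                    r[p] -= blk[p * nb + q] * x[col + q];
        }

        const float *LU = D + (size_t)i * nb * nb;
        for (int p = 0; p < nb; ++p)              /* forward: L y = r      */
            for (int q = 0; q < p; ++q) r[p] -= LU[p * nb + q] * r[q];
        for (int p = nb - 1; p >= 0; --p) {       /* backward: U x = y     */
            for (int q = p + 1; q < nb; ++q) r[p] -= LU[p * nb + q] * r[q];
            x[i * nb + p] = r[p] / LU[p * nb + p];
        }
    }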
Solver Computation
• Use sparse BLAS: cusparseSbsrmv
• Use batched dense BLAS: cublasStrsmBatched
(illustrative calls below)
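For reference, a sketch of how these two library routines are invoked; the handles, matrix descriptor, pointer arrays, and device data are assumed to be already set up, and the names here are placeholders:

    const float one = 1.0f, zero = 0.0f;

    /* y = 1*O*x + 0*y on the BSR-stored off-diagonal matrix:
       mb block rows/cols, nnzb non-zero blocks, block size nb */
    cusparseSbsrmv(sp_handle, CUSPARSE_DIRECTION_ROW,
                   CUSPARSE_OPERATION_NON_TRANSPOSE,
                   mb, mb, nnzb, &one, descr,
                   bsrVal, bsrRowPtr, bsrColInd, nb, x, &zero, y);

    /* nblocks independent nb-by-nb triangular solves, one per diagonal
       block; a second call with CUBLAS_FILL_MODE_UPPER and
       CUBLAS_DIAG_NON_UNIT completes the LU solve */
    cublasStrsmBatched(cb_handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                       CUBLAS_OP_N, CUBLAS_DIAG_UNIT, nb, 1, &one,
                       Larray, nb, rhsArray, nb, nblocks);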
Sparse and Dense BLAS Formulation
• Advantage: makes the code portable, with performance, across different architectures
• Issue: for the block sizes of interest, the BLAS routines do not perform well
• Solution: write optimized BLAS kernels for the block sizes of interest
Overall Structure of Optimized Solver
Approach for sparse matrix – vector (for nb = 1 to 16)
Approach for sparse matrix – vector (for nb = 17 to 64)
• Assign four warps to four consecutive non-zero blocks in a row
• Once the processing of the four blocks is complete, move to the next set of four blocks in the same row
• Once the whole row has been processed, aggregate the partial sums held by the four warps using shared memory (see the kernel sketch below)
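A CUDA C sketch of this scheme under stated assumptions: 128 threads (four warps) per thread block, one block row per thread block, and each warp thread accumulating up to two rows of the nb-wide partial sum. The names and the per-thread work split are assumptions, not the paper's exact kernel:

    /* Launch: bsr_matvec_large_nb<<<n, 128>>>(O, x, y, nb), 17 <= nb <= 64 */
    __global__ void bsr_matvec_large_nb(const BlockCSR O, const float *x,
                                        float *y, int nb)
    {
        __shared__ float partial[4][64];          /* one slot per warp      */
        int row  = blockIdx.x;                    /* one block row per CTA  */
        int warp = threadIdx.x / 32;
        int lane = threadIdx.x % 32;

        float acc0 = 0.f, acc1 = 0.f;             /* rows lane and lane+32  */
        int first = O.rowptr[row], last = O.rowptr[row + 1];

        /* Warp w takes non-zero blocks w, w+4, w+8, ... of this row, i.e.
           four consecutive blocks at a time across the four warps. */
        for (int j = first + warp; j < last; j += 4) {
            const float *blk = O.val + (size_t)j * nb * nb;
            int col = O.colidx[j] * nb;
            for (int q = 0; q < nb; ++q) {
                float xq = x[col + q];
                if (lane      < nb) acc0 += blk[lane * nb + q] * xq;
                if (lane + 32 < nb) acc1 += blk[(lane + 32) * nb + q] * xq;
            }
        }
        if (lane      < nb) partial[warp][lane]      = acc0;
        if (lane + 32 < nb) partial[warp][lane + 32] = acc1;
        __syncthreads();

        /* Aggregate the four warps' partial sums through shared memory. */
        for (int k = threadIdx.x; k < nb; k += blockDim.x)
            y[row * nb + k] = partial[0][k] + partial[1][k]
                            + partial[2][k] + partial[3][k];
    }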
Approach for triangular solve (for nb <= 32)
• Process the columns of the lower triangular matrix of D from left to right using a single warp
• The amount of parallelism available to the warp decreases as we move from left to right
• A value from the previous column's processing (b(j1)) must be broadcast – use the shuffle instruction
Approach for triangular solve (for nb <= 32)
• All threads of the warp execute the broadcast line (sketched below)
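A CUDA C sketch of the warp-level forward substitution these two slides describe; shown with __shfl_sync, whereas the CUDA Fortran original would use the equivalent shuffle intrinsic of its era. The function name and the unit-diagonal assumption on L are illustrative:

    /* Lane i owns row i of the unit lower-triangular factor L (nb <= 32).
       Columns are processed left to right; the just-finished value b(j)
       is broadcast to the remaining lanes with a shuffle. */
    __device__ void warp_forward_solve(const float *L, float *b, int nb)
    {
        int lane = threadIdx.x % 32;
        float bi = (lane < nb) ? b[lane] : 0.f;

        for (int j = 0; j < nb - 1; ++j) {
            /* all threads of the warp execute this broadcast line */
            float bj = __shfl_sync(0xffffffffu, bi, j);
            if (lane > j && lane < nb)         /* parallelism shrinks as  */
                bi -= L[lane * nb + j] * bj;   /* columns move rightward  */
        }
        if (lane < nb) b[lane] = bi;
    }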
Prefetching
• The first prefetch is issued outside the loop
• Subsequent prefetches are issued inside the loop (pattern sketched below)
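The slide's code is not reproduced in this text; below is a generic sketch of the software-pipelining pattern the bullets name, applied to a matvec-style inner loop. blk_val and x_val are placeholder accessors, not FUN3D names:

    /* Issue the load for iteration j+1 before the arithmetic for
       iteration j, so global-memory latency overlaps computation. */
    float nxt = blk_val(first);                  /* first prefetch, outside */
    for (int j = first; j < last; ++j) {
        float cur = nxt;
        if (j + 1 < last) nxt = blk_val(j + 1);  /* prefetch inside loop    */
        acc += cur * x_val(j);                   /* compute with current    */
    }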
Approach for triangular solve (for nb > 32)
Performance Results
Platform and Testbed
• We ran all experiments on Sandy Bridge host processors with an NVIDIA K40 GPU (NASA Pleiades)
• The codes were written in CUDA Fortran and were compiled and executed using the PGI Fortran compiler 15.10 and CUDA Toolkit 7.5
• Performance testing was done on matrix sparsity structures that occur often in the FUN3D application. For a given sparsity structure, we generated matrices with different block sizes, populating the blocks with random values. To keep the memory requirement manageable, we used a smaller matrix for large block sizes.
Future Work and Conclusion
• It is possible to benefit from emerging architectures, but doing so requires restructuring codes and designing new algorithms that map to the underlying hardware
• Explore new sparse storage formats tuned for GPUs
• Explore NVIDIA Pascal (device memory bandwidth: 720 GB/s)