
Iterative Solution of Linear Systems in Electromagnetics (and not only): Experiences with CUDA - PowerPoint PPT Presentation

UnConventional High Performance Computing (UCHPC) 2010 workshop: Iterative Solution of Linear Systems in Electromagnetics (and not only): Experiences with CUDA


  1. UnConventional High Performance Computing (UCHPC) 2010 workshop: Iterative Solution of Linear Systems in Electromagnetics (and not only): Experiences with CUDA. D. De Donno, A. Esposito, G. Monti, L. Tarricone. Innovation Engineering Department, University of Salento, Lecce, Italy. August 30th, 2010.

  2. Outline  Background on CEM issues  Motivations and objectives  Design and implementation  Experimental results  Conclusions and ongoing work

  3. Background One of the fundamental steps of numerical computing is the ability to solve linear systems: $Ax = b$. These systems arise very frequently in scientific computing, for example from finite-difference or finite-element approximations to partial differential equations. In Computational ElectroMagnetics (CEM), the process of modeling the interaction of electromagnetic fields with physical objects and the environment gives rise to linear systems with a large number of unknowns. Typical applications: scattering problems, RF components and 3D EM circuits, EMC analysis, antenna design.

  4. Method of Moments (MoM) linear systems In CEM a key role is played by the Method of Moments (MoM), which transforms the integro-differential Maxwell equations into a linear system of algebraic equations: object discretization → MoM-based EM simulator → solve the linear system $ZI = V$. Z is the MoM impedance matrix containing the complex reaction terms between basis functions. In large EM problems Z can be reduced to an unstructured and significantly sparse matrix without affecting the numerical accuracy. ITERATIVE SOLVERS (e.g., CG, BiCG) are preferred in these cases.

  5. Objective We implemented a Bi-Conjugate Gradient (BiCG) iterative solver for GPUs. It tackles unstructured sparse matrices with double-precision complex data. Recently, cheap and powerful graphics processors (GPUs) have emerged as a valid alternative to supercomputers and computational grids. Our implementation takes advantage of CUDA, a standard C language extension for parallel application development on NVIDIA GPUs. A CUDA application consists of SPMD computations (kernels) performed by threads running in parallel on the GPU streaming multiprocessors (SMs).
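To make the SPMD model concrete, here is a minimal sketch of a CUDA kernel (ours, not from the talk): every thread executes the same body on the element selected by its global index. The kernel name and launch configuration are illustrative.

```
#include <cuComplex.h>

// Scale a complex vector in place: one thread per element (SPMD).
__global__ void zscale(cuDoubleComplex *x, cuDoubleComplex a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        x[i] = cuCmul(a, x[i]);
}

// Launch enough 256-thread blocks to cover all n elements:
// zscale<<<(n + 255) / 256, 256>>>(d_x, make_cuDoubleComplex(2.0, 0.0), n);
```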

  6. The BiCG algorithm (1/2) BiCG is a generalization of the CG (Conjugate Gradient) method. CG method:  real symmetric matrices  complex Hermitian matrices. BiCG method:  real non-symmetric matrices  complex non-Hermitian matrices. In the initialization phase of BiCG, the following variables are defined: residual and bi-residual $r_0 = b - Ax_0$, $\tilde{r}_0 = r_0^*$; direction and bi-direction $p_0 = r_0$, $\tilde{p}_0 = \tilde{r}_0$; preconditioned residual and bi-residual $d_0 = M^{-1} r_0$, $\tilde{d}_0 = M^{-1} \tilde{r}_0$; initial residual error $\rho_0 = \tilde{d}_0^T r_0$. Here the tilde marks the bi- (shadow) quantities and $^*$ denotes complex conjugation.
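As a rough illustration, the initialization could map onto device vectors as below. This is our sketch: the wrapper names (spmv, sub, conj_copy, copy, precond, dotu) are hypothetical stand-ins for the CUDA kernels discussed in the implementation slides, not the authors' API.

```
spmv(d_t, A, d_x);                          // t   = A * x0 (skippable if x0 = 0)
sub(d_r, d_b, d_t, n);                      // r0  = b - t
conj_copy(d_rt, d_r, n);                    // rt0 = r0*        (bi-residual)
copy(d_p,  d_r,  n);                        // p0  = r0         (direction)
copy(d_pt, d_rt, n);                        // pt0 = rt0        (bi-direction)
precond(d_d,  d_r,  n);                     // d0  = M^-1 * r0
precond(d_dt, d_rt, n);                     // dt0 = M^-1 * rt0
cuDoubleComplex rho = dotu(d_dt, d_r, n);   // rho0 = dt0^T r0
```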

  7. The BiCG algorithm (2a/2) In the BiCG main loop, the following steps are repeated at each iteration. STEP 1: calculate the step length parameter and form the new solution estimate: $q_{i-1} = A p_{i-1}$; $\tilde{q}_{i-1} = A^H \tilde{p}_{i-1}$; $\alpha_{i-1} = \rho_{i-1} / (\tilde{p}_{i-1}^T q_{i-1})$; $x_i = x_{i-1} + \alpha_{i-1} p_{i-1}$. The $A^H$ product is handled as sketched below.
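Step 1 needs products with both $A$ and $A^H$. A column-oriented traversal of a row-stored sparse matrix performs poorly on GPUs, so a common workaround, assumed here rather than stated in the slides, is to store $A^H$ explicitly in the same sparse format and reuse the ordinary SpMV kernel:

```
spmv(d_q,  A,  d_p);   // q  = A   * p   (forward operator)
spmv(d_qt, AH, d_pt);  // qt = A^H * pt  (pre-stored conjugate transpose)
cuDoubleComplex alpha = cuCdiv(rho, dotu(d_pt, d_q, n));  // step length
axpy(d_x, alpha, d_p, n);                                 // x += alpha * p
```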

  8. The BiCG algorithm (2b/2) In the BiCG main loop, the following steps are repeated at each iteration. STEP 2: update residual and bi-residual, with and without preconditioning: $r_i = r_{i-1} - \alpha_{i-1} q_{i-1}$; $\tilde{r}_i = \tilde{r}_{i-1} - \alpha_{i-1}^* \tilde{q}_{i-1}$; $d_i = M^{-1} r_i$; $\tilde{d}_i = M^{-1} \tilde{r}_i$.
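The two residual updates in Step 2 are independent element-wise operations, so they can be fused into one kernel to halve the launch overhead. A minimal sketch; the fusion is our illustration, not necessarily what the authors did:

```
#include <cuComplex.h>

__global__ void update_residuals(cuDoubleComplex *r, cuDoubleComplex *rt,
                                 const cuDoubleComplex *q,
                                 const cuDoubleComplex *qt,
                                 cuDoubleComplex alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        r[i]  = cuCsub(r[i],  cuCmul(alpha, q[i]));           // r  -= alpha  * q
        rt[i] = cuCsub(rt[i], cuCmul(cuConj(alpha), qt[i]));  // rt -= alpha* * qt
    }
}
```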

  9. The BiCG algorithm (2c/2) In the BiCG main loop, the following steps are repeated at each iteration. STEP 3: calculate the residual error $\rho$ and the bi-conjugacy coefficient $\beta$: $\rho_i = \tilde{d}_i^T r_i$; $\beta_i = \rho_i / \rho_{i-1}$.
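The dot product behind $\rho_i$ is a parallel reduction. A standard shared-memory sketch for the complex, unconjugated case (block size fixed at 256, assumed a power of two; the per-block partial sums are combined on the host or in a second pass):

```
#include <cuComplex.h>

__global__ void zdotu_partial(const cuDoubleComplex *a, const cuDoubleComplex *b,
                              cuDoubleComplex *partial, int n)
{
    __shared__ cuDoubleComplex s[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? cuCmul(a[i], b[i]) : make_cuDoubleComplex(0.0, 0.0);
    __syncthreads();
    // tree reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] = cuCadd(s[tid], s[tid + stride]);
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = s[0];  // one partial sum per block
}
```

The division $\beta_i = \rho_i / \rho_{i-1}$ is then a single scalar operation on the host.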

  10. The BiCG algorithm (2d/2) In the BiCG main loop, the following steps are repeated at each iteration. STEP 4: update the next direction and bi-direction vectors: $p_i = d_i + \beta_i p_{i-1}$; $\tilde{p}_i = \tilde{d}_i + \beta_i^* \tilde{p}_{i-1}$. The iteration continues until the convergence criterion is satisfied: $\|r_i\|_2 / \|b\|_2 \le \varepsilon$, with values of $\varepsilon$ commonly set to $10^{-6}$–$10^{-7}$.
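Putting the four steps together, the main loop reads roughly as below from the host side. All wrapper names (spmv, spmv_h, dotu, axpy, xpay, precond, norm2) are placeholders for the CUDA kernels described on the following slides, not the authors' actual interface.

```
int iter = 0;
while (iter++ < max_iter && norm2(d_r, n) / norm_b > eps) {
    // STEP 1: matrix-vector products, step length, solution update
    spmv(d_q, A, d_p);
    spmv_h(d_qt, A, d_pt);
    cuDoubleComplex alpha = cuCdiv(rho, dotu(d_pt, d_q, n));
    axpy(d_x, alpha, d_p, n);                        // x += alpha * p

    // STEP 2: residual, bi-residual (note the conjugated step), preconditioning
    cuDoubleComplex na = make_cuDoubleComplex(-cuCreal(alpha), -cuCimag(alpha));
    axpy(d_r,  na,         d_q,  n);                 // r  -= alpha  * q
    axpy(d_rt, cuConj(na), d_qt, n);                 // rt -= alpha* * qt
    precond(d_d,  d_r,  n);
    precond(d_dt, d_rt, n);

    // STEP 3: residual error and bi-conjugacy coefficient
    cuDoubleComplex rho_new = dotu(d_dt, d_r, n);
    cuDoubleComplex beta = cuCdiv(rho_new, rho);
    rho = rho_new;

    // STEP 4: next direction and bi-direction
    xpay(d_p,  d_d,  beta,         n);               // p  = d  + beta  * p
    xpay(d_pt, d_dt, cuConj(beta), n);               // pt = dt + beta* * pt
}
```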

  11. Design of the GPU-enabled BiCG In the GPU-enabled BiCG algorithm, the main loop execution is controlled on the host side, whereas the computations inside it are performed on the GPU. INITIALIZATION consists in: reading and storing the system matrix in a given sparse format; allocating data structures on the GPU and calculating the BiCG initial variables. The BiCG MAIN LOOP consists of the iterative invocation of parallel CUDA kernels performing the BiCG operations. FINALIZATION consists in retrieving the final results from GPU global memory.
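A minimal skeleton of this host-side control flow, using only standard CUDA runtime calls (error checking omitted; the variable names are ours, and CRS storage is assumed):

```
#include <cuda_runtime.h>
#include <cuComplex.h>

cuDoubleComplex *d_vals, *d_x, *d_b;
int *d_rowptr, *d_colidx;

// INITIALIZATION: allocate device storage and upload the sparse matrix
cudaMalloc(&d_vals,   nnz * sizeof(cuDoubleComplex));
cudaMalloc(&d_rowptr, (N + 1) * sizeof(int));
cudaMalloc(&d_colidx, nnz * sizeof(int));
cudaMalloc(&d_x, N * sizeof(cuDoubleComplex));
cudaMalloc(&d_b, N * sizeof(cuDoubleComplex));
cudaMemcpy(d_vals, h_vals, nnz * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);
// ... d_rowptr, d_colidx, d_b uploaded likewise ...

// MAIN LOOP: the host drives the iteration, the kernels do the work
// (see the loop sketch on the previous slides)

// FINALIZATION: retrieve the solution from GPU global memory
cudaMemcpy(h_x, d_x, N * sizeof(cuDoubleComplex), cudaMemcpyDeviceToHost);
```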

  12. Implementation (1/3) Four basic CUDA kernels are enough to completely describe the BiCG main loop:

Operation   | Description                          | FLOPS
SpMV        | Sparse matrix-vector product         | 8·nnz
Dot product | Scalar product of two vectors        | 8·N
E-w product | Element-wise product of two vectors  | 6·N
axpy        | ax + y (a scalar, x and y vectors)   | 8·N

Recall that we are considering non-symmetric and non-Hermitian sparse matrices with N rows and nnz non-zero complex coefficients.
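Of the four kernels, axpy is the simplest; here is a sketch in double-precision complex using the fused multiply-add from cuComplex.h. The per-element cost is one complex multiply (6 flops) plus one complex add (2 flops), matching the 8·N figure in the table.

```
#include <cuComplex.h>

// y = a*x + y for complex double vectors, one thread per element
__global__ void zaxpy(cuDoubleComplex a, const cuDoubleComplex *x,
                      cuDoubleComplex *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = cuCfma(a, x[i], y[i]);  // y[i] = a * x[i] + y[i]
}
```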

  13. Implementation (2a/3) SpMV – this CUDA kernel computes $q_{i-1} = A p_{i-1}$ and $\tilde{q}_{i-1} = A^H \tilde{p}_{i-1}$ and implements the Bell and Garland algorithm (*), the best-performing code currently available for the sparse matrix-vector product. Supported sparse matrix formats: CRS (Compressed Row Storage) and HYB (hybrid ELLpack-COOrdinate format). Main modifications to the original code: double-precision complex matrix support; CUDA grid, register count, shared and texture memory exploitation optimized for double-precision complex data. (*) N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors", in Supercomputing '09, Nov. 2009.
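For reference, the simplest CRS SpMV assigns one thread per row. This naive baseline is shown only to make the data layout concrete; Bell and Garland's kernels improve on it substantially through coalesced access patterns, the ELL/COO hybrid layout, and texture caching.

```
#include <cuComplex.h>

__global__ void spmv_crs_scalar(int n, const int *rowptr, const int *colidx,
                                const cuDoubleComplex *vals,
                                const cuDoubleComplex *x, cuDoubleComplex *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    if (row < n) {
        cuDoubleComplex sum = make_cuDoubleComplex(0.0, 0.0);
        for (int j = rowptr[row]; j < rowptr[row + 1]; ++j)
            sum = cuCfma(vals[j], x[colidx[j]], sum);  // sum += A(row,col) * x(col)
        y[row] = sum;
    }
}
```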
