  1. Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA (lchien@nvidia.com)

  2. Outline • Symmetric eigenvalue solver • Experiment • Applications • Conclusions

  3. Symmetric eigenvalue solver
   The standard form is A x = λ x with A = Aᵀ (A = Aᴴ in the complex case), i.e. A = V Σ Vᵀ
   The QR algorithm is state-of-the-art, the most popular choice, and implemented in LAPACK
   The Jacobi method was introduced in 1846, prior to the QR algorithm. The parallelism of the Jacobi method makes a GPU implementation appealing
   Naming convention:
     syevd (heevd): QR algorithm
     syevj (heevj): Jacobi method

  4. QR algorithm (syevd)
   sytrd: tridiagonalization, Qᵀ A Q = T
   stedc: spectrum of the tridiagonal matrix, T = Z Σ Zᵀ
   ormtr: form the eigenvectors, V = Q Z
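
For reference, the same three-step pipeline can be driven on the CPU through LAPACKE. The routine names are the ones on the slide; the wiring below is a minimal illustrative sketch (error checking omitted), not the cuSOLVER implementation:

```cpp
// Sketch: the sytrd -> stedc -> ormtr pipeline behind syevd, via LAPACKE.
#include <lapacke.h>
#include <vector>
#include <cstdio>

int main() {
    const lapack_int n = 4, lda = 4;
    // Column-major symmetric matrix from the slides.
    std::vector<double> A = {1,2,3,4, 2,5,6,7, 3,6,8,9, 4,7,9,10};
    std::vector<double> d(n), e(n - 1), tau(n - 1);

    // Step 1: reduce A to tridiagonal T (diagonal d, off-diagonal e);
    // Q is stored in factored Householder form inside A and tau.
    LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'L', n, A.data(), lda,
                   d.data(), e.data(), tau.data());

    // Step 2: eigendecomposition of T; Z receives the eigenvectors of T.
    std::vector<double> Z(n * n);
    LAPACKE_dstedc(LAPACK_COL_MAJOR, 'I', n, d.data(), e.data(), Z.data(), n);

    // Step 3: form the eigenvectors of A as V = Q * Z.
    LAPACKE_dormtr(LAPACK_COL_MAJOR, 'L', 'L', 'N', n, n, A.data(), lda,
                   tau.data(), Z.data(), n);

    for (int i = 0; i < n; ++i) printf("lambda[%d] = %g\n", i, d[i]);
    return 0;
}
```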

  5. Example of the QR algorithm, with

  A =
  [ 1   2   3   4
    2   5   6   7
    3   6   8   9
    4   7   9  10 ]

   sytrd: Qᵀ A Q = T with

  T ≈
  [ 1      −5.39    0       0
   −5.39   22.48    2.768   0
    0       2.768   0.286  −0.18
    0       0      −0.18    0.232 ]

   stedc: T = Z Σ Zᵀ with Σ ≈ diag(−0.4, 0.109, 0.185, 24.1)

  6. Pros and cons
   Step 1 [GPU]: sytrd is about 60% BLAS2 and 40% BLAS3
   Step 2 [CPU]: stedc performs the QR algorithm sequentially
   Step 3 [GPU]: ormtr is a sequence of Householder product routines

  n = 4096, double precision:
  routine   time (sec)
  sytrd     4.59
  stedc     14.55
  ormtr     3.65

  - stedc dominates the total runtime
  - the CPU is occupied during the eigenvalue solve
  - the performance depends on the CPU as well
  Conclusion: find an alternative to replace stedc

  7. Jacobi method (syevj)
   A series of plane rotations; each rotation eliminates one off-diagonal entry:
     A(k+1) = J(k)ᵀ A(k) J(k),   A(0) = A,
     where J(k) = J(p, q, θ) is a Givens rotation chosen so that the (p, q) entry of A(k+1) is zero
   Monotone property:
     off(A(k+1))² = off(A(k))² − 2 (a(k)_pq)²,   where off(A)² = Σ_{i≠j} a_ij²
   With a proper termination condition,
     Σ = lim_{k→∞} A(k),   V = J(0) J(1) ⋯
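
Each rotation is determined by a 2x2 symmetric Schur decomposition. Below is a minimal sketch of the textbook formula (Golub & Van Loan) for picking c = cos θ and s = sin θ; the actual cuSOLVER kernel may differ:

```cpp
#include <cmath>
#include <cstdio>

// c, s of the rotation J(p,q,theta) that zeroes a_pq in J^T A J, computed
// from the 2x2 subproblem [a_pp a_pq; a_pq a_qq].
void sym_schur2(double app, double aqq, double apq, double& c, double& s) {
    if (apq == 0.0) { c = 1.0; s = 0.0; return; }
    double tau = (aqq - app) / (2.0 * apq);
    // Choose the smaller-magnitude root t = tan(theta) for stability.
    double t = (tau >= 0.0) ?  1.0 / ( tau + std::sqrt(1.0 + tau * tau))
                            : -1.0 / (-tau + std::sqrt(1.0 + tau * tau));
    c = 1.0 / std::sqrt(1.0 + t * t);
    s = t * c;
}

int main() {
    // The (1,2) elimination from the example on the next slide:
    // a_pp = 1, a_qq = 5, a_pq = 2.
    double c, s;
    sym_schur2(1.0, 5.0, 2.0, c, s);
    // The new (p,q) entry of J^T A J must vanish.
    double apq_new = c * s * (1.0 - 5.0) + (c * c - s * s) * 2.0;
    printf("c = %.4f, s = %.4f, new a_pq = %.2e\n", c, s, apq_new);
    return 0;
}
```

Applying this rotation to rows and columns p, q decreases off(A)² by exactly 2·a_pq², which is the monotone property stated above.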

  8. Example of the Jacobi method, with

  A =
  [ 1   2   3   4
    2   5   6   7
    3   6   8   9
    4   7   9  10 ],   off(A) = 19.74842

   Eliminate (1,2):

  A(1) =
  [ 0.17  0     0.48  1.02
    0     5.83  6.70  8.00
    0.48  6.70  8     9
    1.02  8.00  9    10   ],   off(A(1)) = 19.54482

   Eliminate (1,3):

  A(2) =
  [ 0.14 −0.4   0     0.47
   −0.4   5.83  6.70  8.00
    0     6.70  8.03  9.05
    0.47  8.00  9.05 10   ],   off(A(2)) = 19.53325

   The monotone property holds: off(A) decreases after every rotation

  9.  Eliminate (1,4):

  A(3) =
  [ 0.12 −0.8  −0.4   0
   −0.8   5.83  6.70  7.97
   −0.4   6.70  8.03  9.03
    0     7.97  9.03 10.02 ],   off(A(3)) = 19.52188

   Eliminate (2,3):

  A(4) =
  [ 0.12 −0.32 −0.84  0
   −0.32  0.16  0     0.23
   −0.84  0    13.7  12
    0     0.23 12    10.02 ],   off(A(4)) = 17.08458

   (1,4) and (2,3) operate on non-overlapping rows and columns, so these two rotations could run in parallel

  10. Cyclic Jacobi

  (1)  V = I
  (2)  while off(A) > tol · ||A||_F
  (3)    for p = 1 : n−1
  (4)      for q = p+1 : n
  (5)        compute the rotation J(p, q, θ)
  (6)        row rotation of A:  A = Jᵀ A
  (7)        column rotation of A and of V:  A = A J,  V = V J
  (8)      end // for q
  (9)    end // for p
  (10) end // while

   'tol' controls the accuracy
   A sweep consists of n(n−1)/2 rotations
   Quadratic convergence is measured in the number of sweeps (see the runnable sketch below)
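
A direct, unoptimized C++ transcription of this loop; the numbered comments mirror the pseudocode above, and the per-sweep printout mimics the convergence table on the next slide:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

static double off_norm(const std::vector<double>& A, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (i != j) s += A[i*n+j] * A[i*n+j];
    return std::sqrt(s);
}

int main() {
    const int n = 4;
    std::vector<double> A = {1,2,3,4, 2,5,6,7, 3,6,8,9, 4,7,9,10};
    std::vector<double> V(n*n, 0.0);
    for (int i = 0; i < n; ++i) V[i*n+i] = 1.0;            // (1) V = I

    const double tol = 1e-14;
    double normA = 0.0;
    for (double x : A) normA += x * x;
    normA = std::sqrt(normA);

    int sweep = 0;
    while (off_norm(A, n) > tol * normA) {                 // (2)
        for (int p = 0; p < n-1; ++p)                      // (3)
            for (int q = p+1; q < n; ++q) {                // (4)
                double apq = A[p*n+q];
                if (apq == 0.0) continue;
                double tau = (A[q*n+q] - A[p*n+p]) / (2*apq);     // (5)
                double t = (tau >= 0) ?  1/( tau + std::sqrt(1+tau*tau))
                                      : -1/(-tau + std::sqrt(1+tau*tau));
                double c = 1/std::sqrt(1+t*t), s = t*c;
                for (int k = 0; k < n; ++k) {              // (6) row rotation
                    double akp = A[p*n+k], akq = A[q*n+k];
                    A[p*n+k] = c*akp - s*akq;
                    A[q*n+k] = s*akp + c*akq;
                }
                for (int k = 0; k < n; ++k) {              // (7) column rotations
                    double akp = A[k*n+p], akq = A[k*n+q];
                    A[k*n+p] = c*akp - s*akq;
                    A[k*n+q] = s*akp + c*akq;
                    double vkp = V[k*n+p], vkq = V[k*n+q]; // accumulate V = V*J
                    V[k*n+p] = c*vkp - s*vkq;
                    V[k*n+q] = s*vkp + c*vkq;
                }
            }
        printf("sweep %d: off(A) = %.6e\n", ++sweep, off_norm(A, n));
    }
    return 0;
}
```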

  11. Parallel Jacobi
   n/2 pairs of non-overlapping (p, q) can be processed in parallel
   Eliminate (1,2) and (3,4) in one parallel step:

  A(1) =
  [ 0.17   0    −0.32   1.07
    0      5.83 −0.35  10.4
   −0.32  −0.35 −0.05   0
    1.07  10.4   0     18.06 ]

  Off(A) per sweep (quadratic convergence):
  sweep   Off(A)
  1       1.195820
  2       2.849376E−1
  3       2.355936E−4
  4       1.279163E−15
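
The n/2 disjoint pairs for each parallel step can be generated with a round-robin ("chess tournament") ordering, one standard scheme for parallel Jacobi; whether cuSOLVER uses exactly this ordering is not stated in the talk:

```cpp
#include <cstdio>
#include <vector>
#include <utility>

// For even n, produce n-1 rounds of n/2 disjoint (p,q) pairs. Every
// off-diagonal is hit once per sweep, and the pairs within one round touch
// non-overlapping rows/columns, so their rotations can run in parallel.
std::vector<std::vector<std::pair<int,int>>> round_robin(int n) {
    std::vector<int> ring(n);
    for (int i = 0; i < n; ++i) ring[i] = i;
    std::vector<std::vector<std::pair<int,int>>> rounds;
    for (int r = 0; r < n - 1; ++r) {
        std::vector<std::pair<int,int>> pairs;
        for (int i = 0; i < n / 2; ++i) {
            int a = ring[i], b = ring[n - 1 - i];
            pairs.emplace_back(std::min(a, b), std::max(a, b));
        }
        rounds.push_back(pairs);
        // Rotate every entry except ring[0].
        ring.insert(ring.begin() + 1, ring.back());
        ring.pop_back();
    }
    return rounds;
}

int main() {
    for (auto& round : round_robin(4)) {
        for (auto& pq : round) printf("(%d,%d) ", pq.first + 1, pq.second + 1);
        printf("\n");   // first round for n = 4: (1,4) (2,3)
    }
    return 0;
}
```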

  12. Block Jacobi
   Partition A into blocks:

  A =
  [ A11 ⋯ A1N
     ⋮  ⋱  ⋮
    AN1 ⋯ ANN ]

   Eliminate the off-diagonal block (p, q) by a block Jacobi rotation:

  [ Jpp Jpq ]ᵀ [ App Apq ] [ Jpp Jpq ]   [ Dpp  0  ]
  [ Jqp Jqq ]  [ Aqp Aqq ] [ Jqp Jqq ] = [  0  Dqq ]

   The basic building block is batched syevj
   Column and row rotations are done by efficient GEMM (see the sketch below)
   Propagate the proper termination condition
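
GEMM appears because a block rotation touches only two block rows (and two block columns) of A, so applying it is a small-by-wide matrix multiply. Below is a plain C++ stand-in for the cuBLAS GEMM call, with an assumed row-major layout:

```cpp
#include <cstdio>
#include <vector>

// Row rotation of the block pair (p,q):  [Ap; Aq] <- J^T * [Ap; Aq],
// where J is 2w x 2w and [Ap; Aq] is the 2w x n panel of block rows p,q.
// The column rotation is the mirror image on block columns. This is
// exactly a GEMM, which is why block Jacobi runs at near-GEMM speed.
void rotate_block_rows(int n, int w,
                       const std::vector<double>& J,   // 2w x 2w, row-major
                       std::vector<double>& P)         // 2w x n panel
{
    std::vector<double> out(2 * w * n, 0.0);
    for (int i = 0; i < 2 * w; ++i)              // out = J^T * P
        for (int k = 0; k < 2 * w; ++k) {
            const double jki = J[k * 2 * w + i]; // (J^T)(i,k) = J(k,i)
            for (int j = 0; j < n; ++j)
                out[i * n + j] += jki * P[k * n + j];
        }
    P.swap(out);
}

int main() {
    // w = 1 reduces to a scalar Givens rotation: rotate the two rows of a
    // 2 x 3 panel with c = 0.6, s = 0.8.
    std::vector<double> J = {0.6, 0.8, -0.8, 0.6};   // [c s; -s c]
    std::vector<double> P = {1, 2, 3,  4, 5, 6};
    rotate_block_rows(3, 1, J, P);
    for (double x : P) printf("%g ", x);  // row0 = 0.6*r0 - 0.8*r1, etc.
    printf("\n");
    return 0;
}
```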

  13. Comparison

                            syevd (QR algorithm)                    syevj (block Jacobi)
  Basic routines            sytrd, stedc, ormtr                     batched syevj, GEMM
  GPU friendly?             sytrd (yes), stedc (no: CPU,            batched syevj (yes), GEMM (yes)
                            single thread), ormtr (yes)
  Scalable on next-gen GPU  sytrd (yes), stedc (no), ormtr (yes)    batched syevj (yes), GEMM (yes)
  Computational complexity  low                                     high
  Good for small matrices   no                                      yes, with batched syevj
  Approximate eigenvalues   no, it computes exact eigenvalues       yes, accuracy is controlled by 'tol'
  Supports 's','d','c','z'  yes                                     yes
  Stable algorithm          yes                                     yes
  Quadratic convergence     yes                                     yes

  14. Complexity analysis
   The QR algorithm costs roughly as much as 2 sweeps of the Jacobi method
   To reach machine zero, the Jacobi method needs about 7 sweeps in single precision and 15 sweeps in double precision
   Although the complexity of the Jacobi method is higher, its parallelism makes it faster on small matrices. Once the matrix gets large, the Jacobi method suffers from its higher complexity (each sweep costs O(n³) flops, and many sweeps are needed)

  15. Outline • Symmetric eigenvalue solver • Experiment • Applications • Conclusions

  16. Experimental setup
   CPU: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, dual socket
   GPU: Tesla K40
   Comparison against MKL 11.0.4
   Metric: Ngflops = 2·n³ / time, i.e. gflops normalized w.r.t. GEMM

                             K40           E5-2690 v2
  Number of processor cores  2880          10
  Core clock                 754 MHz       3 GHz
  Bandwidth                  180 GB/s      12 GB/s
  SGEMM                      2768 Gflops   386 Gflops
  DGEMM                      1221 Gflops   185 Gflops

  http://ark.intel.com/products/75279/Intel-Xeon-Processor-E5-2690-v2-25M-Cache-3_00-GHz
  https://www.nvidia.com/content/PDF/kepler/Tesla-K40-Active-Board-Spec-BD-06949-001_v03.pdf

  17. Performance of syevj
   The Jacobi method is faster than the QR algorithm for small matrices, up to about size 256
   The Jacobi method falls behind for large matrices due to its higher complexity

  Ngflops, double precision:
  n      cusolver syevd   cusolver syevj   MKL syevd
  32     0.008            0.04             0.004
  64     0.03             0.15             0.04
  128    0.11             0.46             0.19
  256    0.50             1.05             1.30
  512    1.42             2.88             5.19
  1024   3.00             4.56             15.26
  2048   5.56             5.96             32.66
  4096   5.93             7.35             43.66
  8192   4.85             9.80             48.35
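
For context, here is a minimal sketch of how such a measurement looks against the CUDA 9 syevj API, using the Ngflops metric from slide 16. The test matrix, tolerance, and sweep cap are illustrative choices, and error checking is omitted:

```cpp
#include <cusolverDn.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1024, lda = n;
    std::vector<double> A(n * n);
    for (int j = 0; j < n; ++j)                 // simple symmetric test matrix
        for (int i = 0; i < n; ++i)
            A[i + j * lda] = 1.0 / (1.0 + i + j);

    cusolverDnHandle_t handle;  cusolverDnCreate(&handle);
    syevjInfo_t params;         cusolverDnCreateSyevjInfo(&params);
    cusolverDnXsyevjSetTolerance(params, 1e-12);  // the 'tol' from slide 13
    cusolverDnXsyevjSetMaxSweeps(params, 15);     // ~15 sweeps for double

    double *dA, *dW, *dWork;  int *dInfo;
    cudaMalloc(&dA, sizeof(double) * n * n);
    cudaMalloc(&dW, sizeof(double) * n);
    cudaMalloc(&dInfo, sizeof(int));
    cudaMemcpy(dA, A.data(), sizeof(double) * n * n, cudaMemcpyHostToDevice);

    int lwork = 0;
    cusolverDnDsyevj_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR,
                                CUBLAS_FILL_MODE_LOWER, n, dA, lda, dW,
                                &lwork, params);
    cudaMalloc(&dWork, sizeof(double) * lwork);

    cudaEvent_t t0, t1;  cudaEventCreate(&t0);  cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cusolverDnDsyevj(handle, CUSOLVER_EIG_MODE_VECTOR, CUBLAS_FILL_MODE_LOWER,
                     n, dA, lda, dW, dWork, lwork, dInfo, params);
    cudaEventRecord(t1);  cudaEventSynchronize(t1);

    float ms = 0;  cudaEventElapsedTime(&ms, t0, t1);
    printf("Ngflops = %.2f\n", 2.0 * n * (double)n * n / (ms * 1e6));
    // cleanup omitted for brevity
    return 0;
}
```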

  18. Performance of batched syevj
   Batched syevj relies on shared memory; the matrix dimension is limited to 32
   The performance stabilizes once the GPU is fully utilized
   Batched syevj is faster than MKL with 16 threads for 's', 'd' and 'c'

  n = 32, against MKL with 16 threads:
  data type   MKL Ngflops   speedup
  S           0.73          7.3
  D           1.59          1.4
  C           3.13          3.0
  Z           4.55          0.8

  19. Outline • Symmetric eigenvalue solver • Experiment • Applications • Conclusions

  20. Application 1: SVD
   The SVD computes the singular values and the left/right singular vectors U/V:  A = U Σ Vᵀ
   LAPACK uses the QR algorithm
   The Jacobi method can be applied to the SVD because the monotone property still holds
   Naming convention:
     gesvd: QR algorithm
     gesvdj: Jacobi method
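
The gesvdj call path mirrors syevj. Below is a condensed sketch of the CUDA 9 API (setup and error checking elided; see the syevj example above for the full pattern); the econ flag and tolerance value are illustrative:

```cpp
#include <cusolverDn.h>
#include <cuda_runtime.h>

void svd_jacobi(cusolverDnHandle_t handle, int m, int n,
                double* dA, int lda,        // device: m x n input matrix
                double* dS,                 // device: min(m,n) singular values
                double* dU, int ldu,        // device: left singular vectors
                double* dV, int ldv,        // device: right singular vectors
                int* dInfo)
{
    gesvdjInfo_t params;
    cusolverDnCreateGesvdjInfo(&params);
    cusolverDnXgesvdjSetTolerance(params, 1e-12);   // accuracy knob, as on slide 13

    int lwork = 0;
    cusolverDnDgesvdj_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR, /*econ=*/0,
                                 m, n, dA, lda, dS, dU, ldu, dV, ldv,
                                 &lwork, params);
    double* dWork = nullptr;
    cudaMalloc(&dWork, sizeof(double) * lwork);

    cusolverDnDgesvdj(handle, CUSOLVER_EIG_MODE_VECTOR, /*econ=*/0,
                      m, n, dA, lda, dS, dU, ldu, dV, ldv,
                      dWork, lwork, dInfo, params);

    cudaFree(dWork);
    cusolverDnDestroyGesvdjInfo(params);
}
```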

  21. Performance of gesvdj
   The Jacobi method is faster than the QR algorithm for small matrices, up to about size 512
   For large matrices, the Jacobi method is still not bad for 's' and 'c' compared to MKL

  Ngflops, double precision:
  n      cusolver gesvd   cusolver gesvdj   MKL gesvd
  32     0.004            0.036             0.089
  64     0.014            0.118             0.034
  128    0.064            0.395             0.118
  256    0.203            1.089             0.768
  512    0.628            1.872             1.734
  1024   1.688            2.489             5.487
  2048   3.944            2.749             13.083
  4096   6.178            2.714             18.111
  8192   8.242            2.65              11.061

  22. Performance of batched gesvdj
   The matrix size is limited to 32-by-32
   The performance stabilizes once the GPU is fully utilized
   Batched gesvdj is faster than MKL with 16 threads for 's' and 'd'

  Against MKL with 16 threads:
  data type   MKL Ngflops   speedup
  S           2.044         1.5
  D           1.527         1.1
  C           6.486         0.8
  Z           5.176         0.3

  23. Application 2: multiGPU syevj
   syevj runs on four K40s
   syevj is competitive against MKL for single precision
   For double precision, syevj reaches roughly ½ of MKL's performance

  24. Application 3: approximate eigensolver
   Use case: to know the full, inaccurate spectrum quickly, or when one cannot afford a large cluster for a dense eigensolver
   Example: the hydrogen atom, with energy E_n = −13.6 eV / n² for the state indexed by (n, l, m)
   Naming convention: syevij ('ij' stands for incomplete Jacobi)
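
The kind of sparse Hamiltonian behind the numbers on the next slide can be assembled with a 7-point finite-difference stencil, H = −½∇² − 1/r in atomic units. The box size, grid spacing, and regularization below are assumptions for illustration; only the 16³ grid and the resulting nonzero count come from the talk:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int m = 16;                 // grid points per dimension (slide 25)
    const int n = m * m * m;          // 4096 unknowns
    const double L = 16.0, h = L / (m + 1);   // assumed box and spacing (a.u.)

    std::vector<int> row, col;  std::vector<double> val;
    auto idx = [m](int i, int j, int k) { return (k * m + j) * m + i; };

    for (int k = 0; k < m; ++k)
        for (int j = 0; j < m; ++j)
            for (int i = 0; i < m; ++i) {
                const int r = idx(i, j, k);
                // Coulomb potential -1/|x| with the nucleus mid-box
                // (regularized to avoid division by zero).
                double x = (i + 1) * h - L / 2, y = (j + 1) * h - L / 2,
                       z = (k + 1) * h - L / 2;
                double rad = std::sqrt(x*x + y*y + z*z) + 1e-6;
                row.push_back(r); col.push_back(r);
                val.push_back(3.0 / (h * h) - 1.0 / rad);  // -1/2 * (-6/h^2) + V
                // Couplings -1/(2h^2) to the 6 grid neighbors.
                const int di[6] = {-1,1,0,0,0,0}, dj[6] = {0,0,-1,1,0,0},
                          dk[6] = {0,0,0,0,-1,1};
                for (int t = 0; t < 6; ++t) {
                    int ii = i + di[t], jj = j + dj[t], kk = k + dk[t];
                    if (ii < 0 || ii >= m || jj < 0 || jj >= m ||
                        kk < 0 || kk >= m) continue;
                    row.push_back(r); col.push_back(idx(ii, jj, kk));
                    val.push_back(-0.5 / (h * h));
                }
            }
    printf("n = %d, nnz = %zu\n", n, val.size());  // 4096 and 27136, as on slide 25
    return 0;
}
```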

  25. Accuracy of syevij
   The resolution is 16 grid points in each dimension; the matrix is 4096-by-4096 with 27136 nonzeros
   There are 5 bound states, but syevij reports 0 bound states
   The error bound is about 0.01 per eigenvalue on average, measured by |λ(syevij) − λ(exact)|; the computed spectrum spans −12.43 eV to 20.36 eV

  26. Performance of syevij
   The complexity is still O(n³), but syevij is much faster than a dense eigensolver
   Strong scaling across multiple GPUs is not significant in this case: 2 GPUs give a 1.4x speedup, 4 GPUs give 1.7x
   Single precision keeps the same accuracy but is 2x faster

  Double precision, runtime (seconds), n = grid points per dimension:
  n    matrix size   nnz         1 K40    2 K40    4 K40
  16   4,096         27,136      0.6      0.6      0.7
  32   32,768        223,232     27.2     18.8     16.0
  64   262,144       1,810,432   1919.6   1320.7   1016.6

  27. Conclusions
   Optimal complexity may not be best for parallel computing
   The Jacobi method is faster than MKL for small matrices, as well as in batched operations
   The Jacobi method can be applied to both the symmetric eigenvalue problem and the SVD
   The Jacobi method uses limited CPU resources
   CUDA 9 will have: syevj, batched syevj; gesvdj, batched gesvdj; multiGPU syevj; multiGPU syevij
