  1. Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA (lchien@nvidia.com)

  2. Outline • Symmetric eigenvalue solver • Experiment • Applications • Conclusions

  3. Symmetric eigenvalue solver
   The standard form is A x = λ x with A = Aᵀ (A = Aᴴ in the complex case), i.e. A = V Σ Vᵀ
   The QR algorithm is state-of-the-art, the most popular choice, and implemented in LAPACK
   The Jacobi method was introduced in 1846, prior to the QR algorithm. The parallelism of the Jacobi method makes a GPU implementation appealing
   Naming convention:
     syevd (heevd): QR algorithm
     syevj (heevj): Jacobi method

  4. QR algorithm (syevd)
   sytrd: tridiagonalization, Qᵀ A Q = T
   stedc: spectrum of the tridiagonal matrix, T = Z Σ Zᵀ
   ormtr: form the eigenvectors, V = Q Z
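
For reference, the same three-step pipeline can be driven on the CPU through LAPACKE. The routine names are the ones on the slide; the wiring below is a minimal illustrative sketch (error checking omitted), not the cuSOLVER implementation:

```cpp
// Sketch: the sytrd -> stedc -> ormtr pipeline behind syevd, via LAPACKE.
#include <lapacke.h>
#include <vector>
#include <cstdio>

int main() {
    const lapack_int n = 4, lda = 4;
    // Column-major symmetric matrix from the slides.
    std::vector<double> A = {1,2,3,4, 2,5,6,7, 3,6,8,9, 4,7,9,10};
    std::vector<double> d(n), e(n - 1), tau(n - 1);

    // Step 1: reduce A to tridiagonal T (diagonal d, off-diagonal e);
    // Q is stored in factored Householder form inside A and tau.
    LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'L', n, A.data(), lda,
                   d.data(), e.data(), tau.data());

    // Step 2: eigendecomposition of T; Z receives the eigenvectors of T.
    std::vector<double> Z(n * n);
    LAPACKE_dstedc(LAPACK_COL_MAJOR, 'I', n, d.data(), e.data(), Z.data(), n);

    // Step 3: form the eigenvectors of A as V = Q * Z.
    LAPACKE_dormtr(LAPACK_COL_MAJOR, 'L', 'L', 'N', n, n, A.data(), lda,
                   tau.data(), Z.data(), n);

    for (int i = 0; i < n; ++i) printf("lambda[%d] = %g\n", i, d[i]);
    return 0;
}
```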

  5. Example of the QR algorithm, with

  A =
  [ 1   2   3   4
    2   5   6   7
    3   6   8   9
    4   7   9  10 ]

   sytrd: Qᵀ A Q = T with

  T ≈
  [ 1      −5.39    0       0
   −5.39   22.48    2.768   0
    0       2.768   0.286  −0.18
    0       0      −0.18    0.232 ]

   stedc: T = Z Σ Zᵀ with Σ ≈ diag(−0.4, 0.109, 0.185, 24.1)

  6. Pros and cons
   Step 1 [GPU]: sytrd is about 60% BLAS2 and 40% BLAS3
   Step 2 [CPU]: stedc performs the QR algorithm sequentially
   Step 3 [GPU]: ormtr is a sequence of Householder product routines

  n = 4096, double precision:
  routine   time (sec)
  sytrd     4.59
  stedc     14.55
  ormtr     3.65

  - stedc dominates the total runtime
  - the CPU is occupied during the eigenvalue solve
  - the performance depends on the CPU as well
  Conclusion: find an alternative to replace stedc

  7. Jacobi method (syevj)
   A series of plane rotations; each rotation eliminates one off-diagonal entry:
     A(k+1) = J(k)ᵀ A(k) J(k),   A(0) = A,
     where J(k) = J(p, q, θ) is a Givens rotation chosen so that the (p, q) entry of A(k+1) is zero
   Monotone property:
     off(A(k+1))² = off(A(k))² − 2 (a(k)_pq)²,   where off(A)² = Σ_{i≠j} a_ij²
   With a proper termination condition,
     Σ = lim_{k→∞} A(k),   V = J(0) J(1) ⋯
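
Each rotation is determined by a 2x2 symmetric Schur decomposition. Below is a minimal sketch of the textbook formula (Golub & Van Loan) for picking c = cos θ and s = sin θ; the actual cuSOLVER kernel may differ:

```cpp
#include <cmath>
#include <cstdio>

// c, s of the rotation J(p,q,theta) that zeroes a_pq in J^T A J, computed
// from the 2x2 subproblem [a_pp a_pq; a_pq a_qq].
void sym_schur2(double app, double aqq, double apq, double& c, double& s) {
    if (apq == 0.0) { c = 1.0; s = 0.0; return; }
    double tau = (aqq - app) / (2.0 * apq);
    // Choose the smaller-magnitude root t = tan(theta) for stability.
    double t = (tau >= 0.0) ?  1.0 / ( tau + std::sqrt(1.0 + tau * tau))
                            : -1.0 / (-tau + std::sqrt(1.0 + tau * tau));
    c = 1.0 / std::sqrt(1.0 + t * t);
    s = t * c;
}

int main() {
    // The (1,2) elimination from the example on the next slide:
    // a_pp = 1, a_qq = 5, a_pq = 2.
    double c, s;
    sym_schur2(1.0, 5.0, 2.0, c, s);
    // The new (p,q) entry of J^T A J must vanish.
    double apq_new = c * s * (1.0 - 5.0) + (c * c - s * s) * 2.0;
    printf("c = %.4f, s = %.4f, new a_pq = %.2e\n", c, s, apq_new);
    return 0;
}
```

Applying this rotation to rows and columns p, q decreases off(A)² by exactly 2·a_pq², which is the monotone property stated above.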

  8. Example of the Jacobi method, with

  A =
  [ 1   2   3   4
    2   5   6   7
    3   6   8   9
    4   7   9  10 ],   off(A) = 19.74842

   Eliminate (1,2):

  A(1) =
  [ 0.17  0     0.48  1.02
    0     5.83  6.70  8.00
    0.48  6.70  8     9
    1.02  8.00  9    10   ],   off(A(1)) = 19.54482

   Eliminate (1,3):

  A(2) =
  [ 0.14 −0.4   0     0.47
   −0.4   5.83  6.70  8.00
    0     6.70  8.03  9.05
    0.47  8.00  9.05 10   ],   off(A(2)) = 19.53325

   The monotone property holds: off(A) decreases after every rotation

  9.  Eliminate (1,4):

  A(3) =
  [ 0.12 −0.8  −0.4   0
   −0.8   5.83  6.70  7.97
   −0.4   6.70  8.03  9.03
    0     7.97  9.03 10.02 ],   off(A(3)) = 19.52188

   Eliminate (2,3):

  A(4) =
  [ 0.12 −0.32 −0.84  0
   −0.32  0.16  0     0.23
   −0.84  0    13.7  12
    0     0.23 12    10.02 ],   off(A(4)) = 17.08458

   (1,4) and (2,3) operate on non-overlapping rows and columns, so these two rotations could run in parallel

  10. Cyclic Jacobi

  (1)  V = I
  (2)  while off(A) > tol · ||A||_F
  (3)    for p = 1 : n−1
  (4)      for q = p+1 : n
  (5)        compute the rotation J(p, q, θ)
  (6)        row rotation of A:  A = Jᵀ A
  (7)        column rotation of A and of V:  A = A J,  V = V J
  (8)      end // for q
  (9)    end // for p
  (10) end // while

   'tol' controls the accuracy
   A sweep consists of n(n−1)/2 rotations
   Quadratic convergence is measured in the number of sweeps (see the runnable sketch below)
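
A direct, unoptimized C++ transcription of this loop; the numbered comments mirror the pseudocode above, and the per-sweep printout mimics the convergence table on the next slide:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

static double off_norm(const std::vector<double>& A, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (i != j) s += A[i*n+j] * A[i*n+j];
    return std::sqrt(s);
}

int main() {
    const int n = 4;
    std::vector<double> A = {1,2,3,4, 2,5,6,7, 3,6,8,9, 4,7,9,10};
    std::vector<double> V(n*n, 0.0);
    for (int i = 0; i < n; ++i) V[i*n+i] = 1.0;            // (1) V = I

    const double tol = 1e-14;
    double normA = 0.0;
    for (double x : A) normA += x * x;
    normA = std::sqrt(normA);

    int sweep = 0;
    while (off_norm(A, n) > tol * normA) {                 // (2)
        for (int p = 0; p < n-1; ++p)                      // (3)
            for (int q = p+1; q < n; ++q) {                // (4)
                double apq = A[p*n+q];
                if (apq == 0.0) continue;
                double tau = (A[q*n+q] - A[p*n+p]) / (2*apq);     // (5)
                double t = (tau >= 0) ?  1/( tau + std::sqrt(1+tau*tau))
                                      : -1/(-tau + std::sqrt(1+tau*tau));
                double c = 1/std::sqrt(1+t*t), s = t*c;
                for (int k = 0; k < n; ++k) {              // (6) row rotation
                    double akp = A[p*n+k], akq = A[q*n+k];
                    A[p*n+k] = c*akp - s*akq;
                    A[q*n+k] = s*akp + c*akq;
                }
                for (int k = 0; k < n; ++k) {              // (7) column rotations
                    double akp = A[k*n+p], akq = A[k*n+q];
                    A[k*n+p] = c*akp - s*akq;
                    A[k*n+q] = s*akp + c*akq;
                    double vkp = V[k*n+p], vkq = V[k*n+q]; // accumulate V = V*J
                    V[k*n+p] = c*vkp - s*vkq;
                    V[k*n+q] = s*vkp + c*vkq;
                }
            }
        printf("sweep %d: off(A) = %.6e\n", ++sweep, off_norm(A, n));
    }
    return 0;
}
```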

  11. Parallel Jacobi
   n/2 pairs of non-overlapping (p, q) can be processed in parallel
   Eliminate (1,2) and (3,4) in one parallel step:

  A(1) =
  [ 0.17   0    −0.32   1.07
    0      5.83 −0.35  10.4
   −0.32  −0.35 −0.05   0
    1.07  10.4   0     18.06 ]

  Off(A) per sweep (quadratic convergence):
  sweep   Off(A)
  1       1.195820
  2       2.849376E−1
  3       2.355936E−4
  4       1.279163E−15
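
The n/2 disjoint pairs for each parallel step can be generated with a round-robin ("chess tournament") ordering, one standard scheme for parallel Jacobi; whether cuSOLVER uses exactly this ordering is not stated in the talk:

```cpp
#include <cstdio>
#include <vector>
#include <utility>

// For even n, produce n-1 rounds of n/2 disjoint (p,q) pairs. Every
// off-diagonal is hit once per sweep, and the pairs within one round touch
// non-overlapping rows/columns, so their rotations can run in parallel.
std::vector<std::vector<std::pair<int,int>>> round_robin(int n) {
    std::vector<int> ring(n);
    for (int i = 0; i < n; ++i) ring[i] = i;
    std::vector<std::vector<std::pair<int,int>>> rounds;
    for (int r = 0; r < n - 1; ++r) {
        std::vector<std::pair<int,int>> pairs;
        for (int i = 0; i < n / 2; ++i) {
            int a = ring[i], b = ring[n - 1 - i];
            pairs.emplace_back(std::min(a, b), std::max(a, b));
        }
        rounds.push_back(pairs);
        // Rotate every entry except ring[0].
        ring.insert(ring.begin() + 1, ring.back());
        ring.pop_back();
    }
    return rounds;
}

int main() {
    for (auto& round : round_robin(4)) {
        for (auto& pq : round) printf("(%d,%d) ", pq.first + 1, pq.second + 1);
        printf("\n");   // first round for n = 4: (1,4) (2,3)
    }
    return 0;
}
```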

  12. Block Jacobi
   Partition A into blocks:

  A =
  [ A11 ⋯ A1N
     ⋮  ⋱  ⋮
    AN1 ⋯ ANN ]

   Eliminate the off-diagonal block (p, q) by a block Jacobi rotation:

  [ Jpp Jpq ]ᵀ [ App Apq ] [ Jpp Jpq ]   [ Dpp  0  ]
  [ Jqp Jqq ]  [ Aqp Aqq ] [ Jqp Jqq ] = [  0  Dqq ]

   The basic building block is batched syevj
   Column and row rotations are done by efficient GEMM (see the sketch below)
   Propagate the proper termination condition
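
GEMM appears because a block rotation touches only two block rows (and two block columns) of A, so applying it is a small-by-wide matrix multiply. Below is a plain C++ stand-in for the cuBLAS GEMM call, with an assumed row-major layout:

```cpp
#include <cstdio>
#include <vector>

// Row rotation of the block pair (p,q):  [Ap; Aq] <- J^T * [Ap; Aq],
// where J is 2w x 2w and [Ap; Aq] is the 2w x n panel of block rows p,q.
// The column rotation is the mirror image on block columns. This is
// exactly a GEMM, which is why block Jacobi runs at near-GEMM speed.
void rotate_block_rows(int n, int w,
                       const std::vector<double>& J,   // 2w x 2w, row-major
                       std::vector<double>& P)         // 2w x n panel
{
    std::vector<double> out(2 * w * n, 0.0);
    for (int i = 0; i < 2 * w; ++i)              // out = J^T * P
        for (int k = 0; k < 2 * w; ++k) {
            const double jki = J[k * 2 * w + i]; // (J^T)(i,k) = J(k,i)
            for (int j = 0; j < n; ++j)
                out[i * n + j] += jki * P[k * n + j];
        }
    P.swap(out);
}

int main() {
    // w = 1 reduces to a scalar Givens rotation: rotate the two rows of a
    // 2 x 3 panel with c = 0.6, s = 0.8.
    std::vector<double> J = {0.6, 0.8, -0.8, 0.6};   // [c s; -s c]
    std::vector<double> P = {1, 2, 3,  4, 5, 6};
    rotate_block_rows(3, 1, J, P);
    for (double x : P) printf("%g ", x);  // row0 = 0.6*r0 - 0.8*r1, etc.
    printf("\n");
    return 0;
}
```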

  13. Comparison

                            syevd (QR algorithm)                    syevj (block Jacobi)
  Basic routines            sytrd, stedc, ormtr                     batched syevj, GEMM
  GPU friendly?             sytrd (yes), stedc (no: CPU,            batched syevj (yes), GEMM (yes)
                            single thread), ormtr (yes)
  Scalable on next-gen GPU  sytrd (yes), stedc (no), ormtr (yes)    batched syevj (yes), GEMM (yes)
  Computational complexity  low                                     high
  Good for small matrices   no                                      yes, with batched syevj
  Approximate eigenvalues   no, it computes exact eigenvalues       yes, accuracy is controlled by 'tol'
  Supports 's','d','c','z'  yes                                     yes
  Stable algorithm          yes                                     yes
  Quadratic convergence     yes                                     yes

  14. Complexity analysis
   The QR algorithm costs roughly as much as 2 sweeps of the Jacobi method
   To reach machine zero, the Jacobi method needs about 7 sweeps in single precision and 15 sweeps in double precision
   Although the complexity of the Jacobi method is higher, its parallelism makes it faster on small matrices. Once the matrix gets large, the Jacobi method suffers from its higher complexity (each sweep costs O(n³) flops, and many sweeps are needed)

  15. Outline • Symmetric eigenvalue solver • Experiment • Applications • Conclusions

  16. Experimental setup
   CPU: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, dual socket
   GPU: Tesla K40
   Comparison against MKL 11.0.4
   Metric: Ngflops = 2·n³ / time, i.e. gflops normalized w.r.t. GEMM

                             K40           E5-2690 v2
  Number of processor cores  2880          10
  Core clock                 754 MHz       3 GHz
  Bandwidth                  180 GB/s      12 GB/s
  SGEMM                      2768 Gflops   386 Gflops
  DGEMM                      1221 Gflops   185 Gflops

  http://ark.intel.com/products/75279/Intel-Xeon-Processor-E5-2690-v2-25M-Cache-3_00-GHz
  https://www.nvidia.com/content/PDF/kepler/Tesla-K40-Active-Board-Spec-BD-06949-001_v03.pdf

  17. Performance of syevj
   The Jacobi method is faster than the QR algorithm for small matrices, up to about size 256
   The Jacobi method falls behind for large matrices due to its higher complexity

  Ngflops, double precision:
  n      cusolver syevd   cusolver syevj   MKL syevd
  32     0.008            0.04             0.004
  64     0.03             0.15             0.04
  128    0.11             0.46             0.19
  256    0.50             1.05             1.30
  512    1.42             2.88             5.19
  1024   3.00             4.56             15.26
  2048   5.56             5.96             32.66
  4096   5.93             7.35             43.66
  8192   4.85             9.80             48.35
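
For context, here is a minimal sketch of how such a measurement looks against the CUDA 9 syevj API, using the Ngflops metric from slide 16. The test matrix, tolerance, and sweep cap are illustrative choices, and error checking is omitted:

```cpp
#include <cusolverDn.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1024, lda = n;
    std::vector<double> A(n * n);
    for (int j = 0; j < n; ++j)                 // simple symmetric test matrix
        for (int i = 0; i < n; ++i)
            A[i + j * lda] = 1.0 / (1.0 + i + j);

    cusolverDnHandle_t handle;  cusolverDnCreate(&handle);
    syevjInfo_t params;         cusolverDnCreateSyevjInfo(&params);
    cusolverDnXsyevjSetTolerance(params, 1e-12);  // the 'tol' from slide 13
    cusolverDnXsyevjSetMaxSweeps(params, 15);     // ~15 sweeps for double

    double *dA, *dW, *dWork;  int *dInfo;
    cudaMalloc(&dA, sizeof(double) * n * n);
    cudaMalloc(&dW, sizeof(double) * n);
    cudaMalloc(&dInfo, sizeof(int));
    cudaMemcpy(dA, A.data(), sizeof(double) * n * n, cudaMemcpyHostToDevice);

    int lwork = 0;
    cusolverDnDsyevj_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR,
                                CUBLAS_FILL_MODE_LOWER, n, dA, lda, dW,
                                &lwork, params);
    cudaMalloc(&dWork, sizeof(double) * lwork);

    cudaEvent_t t0, t1;  cudaEventCreate(&t0);  cudaEventCreate(&t1);
    cudaEventRecord(t0);
    cusolverDnDsyevj(handle, CUSOLVER_EIG_MODE_VECTOR, CUBLAS_FILL_MODE_LOWER,
                     n, dA, lda, dW, dWork, lwork, dInfo, params);
    cudaEventRecord(t1);  cudaEventSynchronize(t1);

    float ms = 0;  cudaEventElapsedTime(&ms, t0, t1);
    printf("Ngflops = %.2f\n", 2.0 * n * (double)n * n / (ms * 1e6));
    // cleanup omitted for brevity
    return 0;
}
```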

  18. Performance of batched syevj
   Batched syevj relies on shared memory; the matrix dimension is limited to 32
   The performance stabilizes once the GPU is fully utilized
   Batched syevj is faster than MKL with 16 threads for 's', 'd' and 'c'

  n = 32, against MKL with 16 threads:
  data type   MKL Ngflops   speedup
  S           0.73          7.3
  D           1.59          1.4
  C           3.13          3.0
  Z           4.55          0.8

  19. Outline • Symmetric eigenvalue solver • Experiment • Applications • Conclusions

  20. Application 1: SVD
   The SVD computes the singular values and the left/right singular vectors U/V:  A = U Σ Vᵀ
   LAPACK uses the QR algorithm
   The Jacobi method can be applied to the SVD because the monotone property still holds
   Naming convention:
     gesvd: QR algorithm
     gesvdj: Jacobi method
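
The gesvdj call path mirrors syevj. Below is a condensed sketch of the CUDA 9 API (setup and error checking elided; see the syevj example above for the full pattern); the econ flag and tolerance value are illustrative:

```cpp
#include <cusolverDn.h>
#include <cuda_runtime.h>

void svd_jacobi(cusolverDnHandle_t handle, int m, int n,
                double* dA, int lda,        // device: m x n input matrix
                double* dS,                 // device: min(m,n) singular values
                double* dU, int ldu,        // device: left singular vectors
                double* dV, int ldv,        // device: right singular vectors
                int* dInfo)
{
    gesvdjInfo_t params;
    cusolverDnCreateGesvdjInfo(&params);
    cusolverDnXgesvdjSetTolerance(params, 1e-12);   // accuracy knob, as on slide 13

    int lwork = 0;
    cusolverDnDgesvdj_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR, /*econ=*/0,
                                 m, n, dA, lda, dS, dU, ldu, dV, ldv,
                                 &lwork, params);
    double* dWork = nullptr;
    cudaMalloc(&dWork, sizeof(double) * lwork);

    cusolverDnDgesvdj(handle, CUSOLVER_EIG_MODE_VECTOR, /*econ=*/0,
                      m, n, dA, lda, dS, dU, ldu, dV, ldv,
                      dWork, lwork, dInfo, params);

    cudaFree(dWork);
    cusolverDnDestroyGesvdjInfo(params);
}
```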

  21. Performance of gesvdj
   The Jacobi method is faster than the QR algorithm for small matrices, up to about size 512
   For large matrices, the Jacobi method is still not bad for 's' and 'c' compared to MKL

  Ngflops, double precision:
  n      cusolver gesvd   cusolver gesvdj   MKL gesvd
  32     0.004            0.036             0.089
  64     0.014            0.118             0.034
  128    0.064            0.395             0.118
  256    0.203            1.089             0.768
  512    0.628            1.872             1.734
  1024   1.688            2.489             5.487
  2048   3.944            2.749             13.083
  4096   6.178            2.714             18.111
  8192   8.242            2.65              11.061

  22. Performance of batched gesvdj
   The matrix size is limited to 32-by-32
   The performance stabilizes once the GPU is fully utilized
   Batched gesvdj is faster than MKL with 16 threads for 's' and 'd'

  Against MKL with 16 threads:
  data type   MKL Ngflops   speedup
  S           2.044         1.5
  D           1.527         1.1
  C           6.486         0.8
  Z           5.176         0.3

  23. Application 2: multiGPU syevj
   syevj runs on four K40s
   syevj is competitive against MKL for single precision
   For double precision, syevj reaches roughly ½ of MKL's performance

  24. Application 3: approximate eigensolver
   Use case: to know the full, inaccurate spectrum quickly, or when one cannot afford a large cluster for a dense eigensolver
   Example: the hydrogen atom, with energy E_n = −13.6 eV / n² for the state indexed by (n, l, m)
   Naming convention: syevij ('ij' stands for incomplete Jacobi)
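
The kind of sparse Hamiltonian behind the numbers on the next slide can be assembled with a 7-point finite-difference stencil, H = −½∇² − 1/r in atomic units. The box size, grid spacing, and regularization below are assumptions for illustration; only the 16³ grid and the resulting nonzero count come from the talk:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int m = 16;                 // grid points per dimension (slide 25)
    const int n = m * m * m;          // 4096 unknowns
    const double L = 16.0, h = L / (m + 1);   // assumed box and spacing (a.u.)

    std::vector<int> row, col;  std::vector<double> val;
    auto idx = [m](int i, int j, int k) { return (k * m + j) * m + i; };

    for (int k = 0; k < m; ++k)
        for (int j = 0; j < m; ++j)
            for (int i = 0; i < m; ++i) {
                const int r = idx(i, j, k);
                // Coulomb potential -1/|x| with the nucleus mid-box
                // (regularized to avoid division by zero).
                double x = (i + 1) * h - L / 2, y = (j + 1) * h - L / 2,
                       z = (k + 1) * h - L / 2;
                double rad = std::sqrt(x*x + y*y + z*z) + 1e-6;
                row.push_back(r); col.push_back(r);
                val.push_back(3.0 / (h * h) - 1.0 / rad);  // -1/2 * (-6/h^2) + V
                // Couplings -1/(2h^2) to the 6 grid neighbors.
                const int di[6] = {-1,1,0,0,0,0}, dj[6] = {0,0,-1,1,0,0},
                          dk[6] = {0,0,0,0,-1,1};
                for (int t = 0; t < 6; ++t) {
                    int ii = i + di[t], jj = j + dj[t], kk = k + dk[t];
                    if (ii < 0 || ii >= m || jj < 0 || jj >= m ||
                        kk < 0 || kk >= m) continue;
                    row.push_back(r); col.push_back(idx(ii, jj, kk));
                    val.push_back(-0.5 / (h * h));
                }
            }
    printf("n = %d, nnz = %zu\n", n, val.size());  // 4096 and 27136, as on slide 25
    return 0;
}
```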

  25. Accuracy of syevij
   The resolution is 16 grid points in each dimension; the matrix is 4096-by-4096 with 27136 nonzeros
   There are 5 bound states, but syevij reports 0 bound states
   The error bound is about 0.01 per eigenvalue on average, measured by |λ(syevij) − λ(exact)|; the computed spectrum spans −12.43 eV to 20.36 eV

  26. Performance of syevij
   The complexity is still O(n³), but syevij is much faster than a dense eigensolver
   Strong scaling across multiple GPUs is not significant in this case: 2 GPUs give a 1.4x speedup, 4 GPUs give 1.7x
   Single precision keeps the same accuracy but is 2x faster

  Double precision, runtime (seconds), n = grid points per dimension:
  n    matrix size   nnz         1 K40    2 K40    4 K40
  16   4,096         27,136      0.6      0.6      0.7
  32   32,768        223,232     27.2     18.8     16.0
  64   262,144       1,810,432   1919.6   1320.7   1016.6

  27. Conclusions
   Optimal complexity may not be best for parallel computing
   The Jacobi method is faster than MKL for small matrices, as well as in batched operations
   The Jacobi method can be applied to both the symmetric eigenvalue problem and the SVD
   The Jacobi method uses limited CPU resources
   CUDA 9 will have: syevj, batched syevj; gesvdj, batched gesvdj; multiGPU syevj; multiGPU syevij
