  1. HiCMA: Hierarchical Computations on Manycore Architectures Hatem Ltaief Extreme Computing Research Center King Abdullah University of Science and Technology Thuwal, Saudi Arabia NVIDIA GTC - San Jose April 5th, 2016

  2. Outline Motivations QR-based Dynamically Weighted Halley for SVD Level 3 BLAS H -Matrices HiCMA in a nutshell

  3. Students/Collaborators/Support Academic: ◮ Extreme Computing Research Center @ KAUST W. Boukaram, A. Charara, G. Chávez, D. Keyes, D. Sukkari and G. Turkiyyah ◮ Tokyo Institute of Technology R. Yokota ◮ Institut Polytechnique de Bordeaux - INRIA Bordeaux M. Faverge ◮ Innovative Computing Laboratory - UTK Industry:

  4. Outline Motivations QR-based Dynamically Weighted Halley for SVD Level 3 BLAS H -Matrices HiCMA in a nutshell

  5. Hardware/Software Trends ◮ Flops are free ◮ On/off-chip network bandwidth is limited ◮ Increasing gap between flops and bandwidth ◮ Data movement is the most energy-consuming operation ◮ Synchronization-reducing algorithms ◮ Communication-reducing algorithms

  6. Going hierarchical all the way down the software stack ◮ Recursive formulation (increases data locality) ◮ Old concept! ◮ Tree structure (depth-first vs. breadth-first tree traversal) ◮ Reduce vertical/horizontal data motion ◮ Without compromising concurrency ◮ Trade-off between data reuse and parallelism

  7. Outline Motivations QR-based Dynamically Weighted Halley for SVD Level 3 BLAS H -Matrices HiCMA in a nutshell

  8. Standard SVD solver One stage reduction: Figure: Computational stages of the standard SVD algorithm: (a) bidiagonal reduction, (b) bidiagonal SVD solver and (c) back transformation.

  9. Two-stage SVD solver Two-stage reduction: Figure: Reduction of a general dense matrix to bidiagonal form using a two-stage approach.

  10. QDWH-SVD [1, 2] solver Three computational stages: ◮ Polar decomposition A = U_p H: iterative procedure using the matrix-inversion-free formulation based on QR/Cholesky factorizations ◮ Symmetric eigensolver H = V Σ V^⊤ to calculate the singular values and the right singular vectors ◮ Matrix-matrix multiplication U = U_p V to calculate the left singular vectors [1] Y. Nakatsukasa and N. J. Higham, Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD, SISC, 35 (2013), pp. A1325-A1349. [2] D. Sukkari, H. Ltaief and D. Keyes, A High Performance QDWH-SVD Solver Using Hardware Accelerators, submitted to Trans. on Math. Soft., 2015.
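The rational map at the heart of the QDWH polar iteration can be sketched on a single singular value, since the matrix iteration applies this map to every singular value at once. A minimal pure-Python sketch using the static Halley coefficients a = 3, b = 1, c = 3 (the dynamically weighted variant of [1] recomputes a, b, c each step from a condition estimate to converge even faster; the function names here are ours, not the library's):

```python
# Sketch: the Halley map underlying QDWH, applied to a scalar.
# The polar decomposition A = U_p H maps every singular value of A to 1;
# QDWH realizes this map on the full matrix via QR/Cholesky factorizations.

def halley_map(x, a=3.0, b=1.0, c=3.0):
    """One Halley step: drives a singular value x in (0, 1] toward 1."""
    return x * (a + b * x * x) / (1.0 + c * x * x)

def scalar_qdwh(sigma, iters=6):
    """Iterate the map; convergence is cubic, so a few steps suffice."""
    x = sigma
    for _ in range(iters):
        x = halley_map(x)
    return x
```

Even a badly scaled singular value of 0.1 is driven to 1 to machine precision in about five iterations, which is why the matrix algorithm needs so few QR/Cholesky sweeps.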

  11. Divide-and-Conquer Figure: The recursive QDWH-SVD algorithm. The matrix A_{i,j} corresponds to the submatrix indexed j at the i-th level of recursion.

  12. Performance results Figure: Performance comparisons of MAGMA-QDWH-SVD (GPU) against Intel MKL DGESVD/DGESDD (CPU), log time (s) vs. matrix size: (a) ill-conditioned matrix (2.3x speedup), (b) well-conditioned matrix (3.5x speedup).

  13. Performance results Figure: Performance comparisons against the existing MAGMA SVD solvers (MAGMA-DGESVD, MAGMA-DGESDD) on GPU, log time (s) vs. matrix size: (a) ill-conditioned matrix (18% faster), (b) well-conditioned matrix (2.1x speedup).

  14. Outline Motivations QR-based Dynamically Weighted Halley for SVD Level 3 BLAS H -Matrices HiCMA in a nutshell

  15. Recursive formulation ◮ Usually used for Level 2 BLAS algorithms (e.g., panel factorization) ◮ Increases data locality ◮ Runs at cache-level speed ◮ Again, not new, and the literature is quite rich ◮ And it pays off for Level 3 BLAS too!

  16. Triangular matrix-matrix multiplication (TRMM) Figure: Illustrating a Right-Lower-NonTranspose-NonUnit recursive TRMM, splitting along the vertical direction (N = N1 + N2): 1-RecTRMM, 2-GEMM, 3-RecTRMM. Operations are performed according to their numbering.

  17. Triangular matrix-matrix multiplication Figure: A hypothetical tree of GEMM and TRMM operations executed by the recursive algorithm. Operations are executed by traversing the tree in depth-first order.

  18. Performance results on NVIDIA GPUs Figure: Performance (GFlop/s vs. matrix dimension, x1024) of KBLAS DTRMM (in-place) against in-place and out-of-place cuBLAS DTRMM, with the theoretical peak and DGEMM as references (integration into CUDA 8.0).

  19. Performance results on higher DLA computations using GPUs Figure: Performance (GFlop/s vs. matrix dimension, x1024) of matrix inversion in the MAGMA library (DPOTRI) using KBLAS DTRMM vs. cuBLAS DTRMM, with the theoretical peak and DGEMM as references.

  20. Outline Motivations QR-based Dynamically Weighted Halley for SVD Level 3 BLAS H -Matrices HiCMA in a nutshell

  21. Low Rank Approximation using H -Matrices ◮ Introduced by E. Tyrtyshnikov and revisited later by W. Hackbusch [1, 2] ◮ R = U X V^T [1] W. Hackbusch, A Sparse Matrix Arithmetic based on H -Matrices (Part I), Computing, 62(2), pp. 89-108, 1999. [2] W. Hackbusch and B. Khoromskij, A Sparse H -Matrix Arithmetic (Part II): Application to multi-dimensional problems, Computing, 64(1), pp. 21-47, 2000.
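The point of the factored form above is that a low-rank block is never formed explicitly: it is applied through its factors. A minimal pure-Python sketch (function name ours), absorbing the small core X into U for simplicity:

```python
# Sketch: applying a rank-k block R = U * V^T to a vector x without ever
# forming R. Computing t = V^T x and then y = U t costs O(k*(m+n)) flops
# instead of the O(m*n) of a dense product -- the basic saving behind
# H-matrix arithmetic.

def lowrank_matvec(U, V, x):
    """y = (U V^T) x, with U of size m x k and V of size n x k."""
    k = len(U[0])
    t = [sum(V[i][r] * x[i] for i in range(len(V))) for r in range(k)]   # V^T x
    return [sum(U[i][r] * t[r] for r in range(k)) for i in range(len(U))]  # U t
```

For k much smaller than m and n, this is the kernel that makes the complexity bounds on the next slide possible.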

  22. Nice Properties of H -Matrices ◮ Memory footprint savings: from O(n^2) to O(k n log(n)) ◮ Near-linear arithmetic complexity: MVM from O(n^2) to O(k n log(n)); MMM and A^{-1} from O(n^3) to O(k^2 n log^2(n))
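To make the savings concrete, the bounds above can be evaluated for illustrative values (n = 2^20, k = 8 are our hypothetical choices, not numbers from the slides; constant factors are ignored):

```python
import math

# Illustrative operation/storage counts for the H-matrix complexity bounds:
# storage and MVM drop from O(n^2) to O(k n log n), and matrix-matrix
# multiply / inversion from O(n^3) to O(k^2 n log^2 n).

def dense_storage(n):
    return n * n

def h_storage(n, k):
    return k * n * int(math.log2(n))

def dense_mmm(n):
    return n ** 3

def h_mmm(n, k):
    return k * k * n * int(math.log2(n)) ** 2
```

At n = 2^20 with rank k = 8 the storage ratio is already in the thousands and the matrix-matrix ratio in the tens of millions, which is why H -matrix arithmetic scales to problem sizes that dense DLA cannot touch.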

  23. Examples of H -Matrices Figure: Example of H -matrix approximation for BEM (the numbers are the ranks of the individual blocks).

  24. Examples of H -Matrices Figure: Example of H -matrix approximation for a covariance matrix (the numbers are the ranks of the individual blocks).

  25. Tree Structure

  26. H -MVM

  27. Implementation Details Dense MVM: ◮ Calculate the products V^T x for the leaves of the tree (batch MVM operation) Upsweep: ◮ Sweep up the column basis tree, calculating the products of the inner nodes from the products of their children (block SpMV, BSR) Mult: ◮ Also a block SpMV (BSR) per level of the tree Downsweep: ◮ Transpose operation of the upsweep phase (block SpMV, BSR) Pipelining: ◮ Overlapping computations is possible within the Dense MVM / Upsweep / Mult phases; Downsweep, however, requires a sync point!
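The upsweep phase can be sketched as a bottom-up pass over a perfect binary tree. This is a deliberately toy model: node values are scalars and a single hypothetical weight E stands in for the transfer matrices of the nested basis, whereas the real code performs one block SpMV (BSR) per tree level:

```python
# Sketch of the upsweep: leaves hold the V^T x products from the dense
# phase; each inner node combines its children's products, sweeping from
# the deepest inner nodes up to the root. The tree is stored in heap
# (array) layout: node i has children 2*i + 1 and 2*i + 2.

def upsweep(leaf_vals, E=0.5):
    """Return heap-ordered node values after the bottom-up combination."""
    n_leaves = len(leaf_vals)
    nodes = [0.0] * (n_leaves - 1) + list(leaf_vals)
    for i in range(n_leaves - 2, -1, -1):   # last inner node up to the root
        nodes[i] = E * nodes[2 * i + 1] + E * nodes[2 * i + 2]
    return nodes
```

Because every node at a given level depends only on the level below, each level's work is one independent batch, which is what exposes the parallelism (and the pipelining opportunity) mentioned above.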

  28. Performance Results Figure: Performance (GB/s) of H -MVM using k = 8 and n_min = 32.

  29. Advanced Hierarchical Matrix Operations Context: ◮ Very small sizes! ◮ Batch operation executions at each level of the tree ◮ (usually) Fixed sizes ◮ Recursive formulation, stressing register usage ◮ State-of-the-art implementations are either not well optimized for this regime or lack support for it ◮ NVIDIA K40 GPU (single GPU)

  30. Advanced Hierarchical Matrix Operations H -Matrix compression: ◮ Batch QR factorizations (square and tall-and-skinny) ◮ Batch SVD H -Matrix computations: ◮ Level 3 BLAS: SYRK, TRSM ◮ Factorizations: POTRF ◮ Solves: POTRS, POSV
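As one illustration of the batched pattern, here is a minimal sequential sketch of batched QR using modified Gram-Schmidt (pure Python, function names ours). On a GPU, each small factorization might map to one thread block, with the whole batch launched as a single kernel; the production code would use a blocked Householder QR rather than Gram-Schmidt:

```python
# Sketch: many tiny independent QR factorizations, the workload shape
# behind H-matrix compression (square and tall-and-skinny blocks).

def mgs_qr(A):
    """Modified Gram-Schmidt QR of a small m x n matrix (list of rows)."""
    m, n = len(A), len(A[0])
    Q = [[A[i][j] for j in range(n)] for i in range(m)]  # columns orthogonalized
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        R[j][j] = sum(Q[i][j] ** 2 for i in range(m)) ** 0.5
        for i in range(m):
            Q[i][j] /= R[j][j]
        for k in range(j + 1, n):                # remove q_j from later columns
            R[j][k] = sum(Q[i][j] * Q[i][k] for i in range(m))
            for i in range(m):
                Q[i][k] -= R[j][k] * Q[i][j]
    return Q, R

def batch_qr(batch):
    """Factor each small matrix in the batch independently."""
    return [mgs_qr(A) for A in batch]
```

The factorizations share no data, so the batch is embarrassingly parallel; the challenge on the GPU is keeping each tiny problem in registers and shared memory, which is what the recursive formulations above target.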
