
High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures



  1. Jack Dongarra, Mathieu Faverge, Hatem Ltaief, Piotr Luszczek: High Performance Matrix Inversion Based on LU Factorization for Multicore Architectures (presented by Piotr Luszczek)

  2. Preliminaries

  3. Problem Statement: given $A \in \mathbb{R}^{n \times n}$, factor $PA = LU$, compute $U \to U^{-1}$ and $L \to L^{-1}$, and obtain $A^{-1} \in \mathbb{R}^{n \times n}$
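The steps on the Problem Statement slide can be chained into a single expression for the inverse. This is a short derivation, assuming $P$ is a permutation matrix (so $P^{-1} = P^{T}$), which the slide does not state explicitly:

```latex
\begin{aligned}
PA &= LU \\
A &= P^{T} L U \\
A^{-1} &= U^{-1} L^{-1} P
\end{aligned}
```

Multiplying by $P$ on the right permutes columns, which is consistent with the column swapping stage that appears later in the deck.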

  4. To Keep in Mind... "In the vast majority of practical computational problems, it is unnecessary and inadvisable to actually compute $A^{-1}$." (Forsythe, Malcolm, and Moler)

  5. Data Layouts for Matrix Elements: Column-major (LAPACK and derivatives) vs. Tile (PLASMA)
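The difference between the two layouts on the Data Layouts slide comes down to how an element index (i, j) maps to a memory offset. The sketch below is illustrative only: the function names are hypothetical, and it assumes square nb × nb tiles, a matrix dimension that is a multiple of nb, and column-major ordering both of the tiles and of the elements inside each tile. The slide does not specify these details.

```python
def column_major_offset(i, j, m):
    """Offset of element (i, j) in an m-row matrix stored column by column."""
    return i + j * m


def tile_offset(i, j, m, nb):
    """Offset of element (i, j) when each nb x nb tile is stored contiguously.

    Assumption: m is a multiple of nb, tiles are ordered column-major,
    and elements inside a tile are column-major as well.
    """
    mt = m // nb                  # number of tile rows
    ti, tj = i // nb, j // nb     # coordinates of the tile holding (i, j)
    li, lj = i % nb, j % nb       # coordinates of the element inside the tile
    tile_index = ti + tj * mt
    return tile_index * nb * nb + li + lj * nb


def to_tile_layout(a_colmajor, m, n, nb):
    """Copy a column-major m x n matrix (flat list) into tile layout."""
    out = [None] * (m * n)
    for j in range(n):
        for i in range(m):
            out[tile_offset(i, j, m, nb)] = a_colmajor[column_major_offset(i, j, m)]
    return out


# A 4 x 4 matrix whose column-major storage is 0, 1, ..., 15, split into 2 x 2 tiles.
tiled = to_tile_layout(list(range(16)), 4, 4, 2)
```

In the tile layout every tile occupies one contiguous chunk of memory, so a task that works on a single tile touches a compact region rather than nb strided column segments.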

  6. Tasks and DAGs

  7. Block LU Inversion vs. Tile LU Inversion

     Block LU Inversion
     ● LU factorization, for each panel: DGETF2( ), DLASWP( ), DLASWP( ), DTRSM( ), DGEMM( )
     ● Invert U, for each panel: DTRMM( ), DTRSM( ), DTRTI2( )
     ● Invert L, for each panel: DLACPY( ), DLASET( ), DGEMM( ), DTRSM( )
     ● DLASWP( ) column interchanges

     Tile LU Inversion
     ● LU factorization, for each diagonal tile:
       - DGETRFR( ) parallel recursive LU
       - for each tail tile panel: DLASWP( )
       - for each tail tile: DGEMM( )
       - for each left tile panel: DLASWP( )
     ● Invert U, for each diagonal tile:
       - for each tile in panel: DTRSM( )
       - for each tail tile: DGEMM( )
       - for each left panel tile: DTRSM( )
       - DTRTRI( )
     ● Invert L, for each left tile:
       - DLACPY( )
       - DLASET( )
       - ...
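The stages listed on slide 7 (LU factorization, inversion of U, the solve involving L, and the final column interchanges) can be illustrated with a small sequential sketch. This is not the blocked or tile algorithm from the slide and not PLASMA code: it is an unblocked, pure-Python version with hypothetical function names, meant only to show how the four stages fit together.

```python
def lu_factor(a):
    """Stage 1: in-place LU with partial pivoting, PA = LU. Returns pivot rows."""
    n = len(a)
    ipiv = []
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]          # row interchange
        ipiv.append(p)
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]           # multiplier, stored in the L part
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return ipiv


def invert_upper(a):
    """Stage 2: return the inverse of the upper triangle U stored in a."""
    n = len(a)
    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):
        inv[j][j] = 1.0 / a[j][j]
        for i in range(j - 1, -1, -1):
            s = sum(a[i][k] * inv[k][j] for k in range(i + 1, j + 1))
            inv[i][j] = -s / a[i][i]
    return inv


def solve_right_unit_lower(b, a):
    """Stage 3: solve X L = b, with L the unit lower triangle stored in a."""
    n = len(a)
    x = [row[:] for row in b]
    for j in range(n - 2, -1, -1):
        for k in range(j + 1, n):
            l_kj = a[k][j]
            for i in range(n):
                x[i][j] -= x[i][k] * l_kj
    return x


def invert(matrix):
    """Compute the inverse as U^{-1} L^{-1} P using the four stages."""
    a = [[float(v) for v in row] for row in matrix]
    ipiv = lu_factor(a)
    x = solve_right_unit_lower(invert_upper(a), a)
    # Stage 4: column interchanges, applied in reverse pivot order.
    for k in range(len(a) - 1, -1, -1):
        p = ipiv[k]
        if p != k:
            for row in x:
                row[k], row[p] = row[p], row[k]
    return x


inverse = invert([[4, 3], [6, 3]])
```

The sketch solves X L = U^{-1} instead of forming L^{-1} explicitly; this is one common way to organize the stage and is an assumption here, not something the slide spells out.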

  8. Queuing Functions with QUARK: QUARK_Insert_Task(panel_LU_task, M, matrix_1, INPUT, N, matrix_2, INOUT, 1, result, OUTPUT, K, buffer, SCRATCH, 0);
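The call on slide 8 tags each argument with an access mode (INPUT, INOUT, OUTPUT, SCRATCH). A runtime can use such tags to infer dependencies between tasks inserted in sequential order and so build the task DAG. The sketch below is a toy model of that idea, not QUARK's API or implementation: the `TaskGraph` class, its method, and the tile names such as "A00" are all hypothetical, and SCRATCH is assumed to mean private workspace that creates no dependencies.

```python
INPUT, INOUT, OUTPUT, SCRATCH = "INPUT", "INOUT", "OUTPUT", "SCRATCH"


class TaskGraph:
    """Toy dependency tracker driven by per-argument access modes."""

    def __init__(self):
        self.tasks = []          # task names, in insertion order
        self.edges = set()       # (earlier_task, later_task) index pairs
        self._last_writer = {}   # data name -> index of last writing task
        self._readers = {}       # data name -> readers since the last write

    def insert_task(self, name, *args):
        """Insert a task; args are (data_name, mode) pairs."""
        t = len(self.tasks)
        self.tasks.append(name)
        for data, mode in args:
            if mode == SCRATCH:
                continue                          # assumed dependency-free
            writer = self._last_writer.get(data)
            if mode in (INPUT, INOUT) and writer is not None:
                self.edges.add((writer, t))       # read after write
            if mode in (INOUT, OUTPUT):
                for r in self._readers.get(data, []):
                    if r != t:
                        self.edges.add((r, t))    # write after read
                if mode == OUTPUT and writer is not None:
                    self.edges.add((writer, t))   # write after write
                self._last_writer[data] = t
                self._readers[data] = []
            else:
                self._readers.setdefault(data, []).append(t)
        return t


# Illustrative 2 x 2 tile example; the kernels and tile names are placeholders.
g = TaskGraph()
g.insert_task("panel_LU", ("A00", INOUT))
g.insert_task("trsm", ("A00", INPUT), ("A01", INOUT))
g.insert_task("trsm", ("A00", INPUT), ("A10", INOUT))
g.insert_task("gemm", ("A10", INPUT), ("A01", INPUT), ("A11", INOUT))
```

In this example the two `trsm` tasks depend only on `panel_LU` and not on each other, so a scheduler would be free to run them concurrently before `gemm`.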

  9. DAGs of Tasks, Each Stage Separately: 1 – LU Factorization; 2 – Computation of $L^{-1}$; 3 – Computation of $U^{-1}$; 4 – Column swapping

  10. DAGs of Tasks, All Stages Overlapped

  11. Execution Traces: No Overlap of Stages vs. Overlap of Stages

  12. The Case for Nested Parallelism

  13. Panel Factorization as the Sequential Bottleneck (figure; task labels: xGETRF-REC, Swap + xTRSM, xGEMM)

  14. Panel Factorization is On Critical Path of DAG

  15. Parallel Panel Factorization: Data Partitioning

  16. Parallel Panel Factorization: Algorithm

  17. Quick Performance Experiment

  18. Results

  19. Performance on AMD Magny-Cours, 4 × 12 = 48 cores

  20. LU Inversion's Power Profile: LAPACK

  21. LU Inversion's Power Profile: MKL

  22. LU Inversion's Power Profile: PLASMA

  23. PLASMA, LAPACK, MKL. This work was sponsored by NSF, DOE, and Microsoft.
