
Communication-avoiding LU and QR factorizations for multicore architectures

DONFACK Simplice (INRIA Saclay). Joint work with Laura Grigori (INRIA Saclay) and Alok Kumar Gupta (BCCS, Norway-5075). 16th April 2010.


  1. Communication-avoiding LU and QR factorizations for multicore architectures. DONFACK Simplice (INRIA Saclay). Joint work with Laura Grigori (INRIA Saclay) and Alok Kumar Gupta (BCCS, Norway-5075). 16th April 2010.

  2. Outline: 1. Introduction; 2. CALU and CAQR factorization; 3. Multithreaded CALU and CAQR; 4. Experimental section; 5. Conclusion.

  4. Introduction. Architectural trends show that the cost of communication keeps increasing relative to the time needed for arithmetic operations. This has motivated the design of communication-avoiding algorithms, which minimize communication. The first results are CAQR [Demmel, Grigori, Hoemmen, Langou ’08] and CALU [Grigori, Demmel, Xiang ’08], both implemented for distributed memory. Our goal is to design multithreaded QR and LU factorizations for multicores based on communication-avoiding algorithms.

  5. LU factorization with partial pivoting. Factorization on a Pr-by-Pc grid of processors, as implemented in ScaLAPACK:

     For ib = 1 to n-1 step b
       A(ib) = A(ib:n, ib:n)
       1. Compute panel factorization (pdgetf2): O(n log2 Pr)
          - find pivot in each column, swap rows
       2. Apply all row permutations (pdlaswp): O(n/b (log2 Pc + log2 Pr))
          - broadcast pivot information along the rows
          - swap rows at left and right
       3. Compute block row of U (pdtrsm): O(n/b log2 Pc)
          - broadcast right diagonal block of L of current panel
       4. Update trailing matrix (pdgemm): O(n/b (log2 Pc + log2 Pr))
          - broadcast right block column of L
          - broadcast down block row of U

     Pivoting requires communication among processors on distributed memory and synchronisation between threads on multicores.
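
The four steps of the loop above can be illustrated with a sequential sketch. This is not the ScaLAPACK code: it is a minimal pure-Python version on a list-of-lists matrix, with no process grid and no broadcasts, and the function names `lu_partial_pivoting_blocked` and `max_residual` are hypothetical. Row swaps are applied to full rows immediately instead of being deferred to a separate step.

```python
def lu_partial_pivoting_blocked(A, b):
    """Right-looking blocked LU with partial pivoting on a square matrix.

    Returns (LU, perm). LU stores the unit-lower factor L below the diagonal
    and U on and above it; perm[i] is the original index of the row that
    ended up in position i.
    """
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))
    for ib in range(0, n, b):
        end = min(ib + b, n)
        # Step 1: panel factorization, column by column (role of pdgetf2).
        for k in range(ib, end):
            p = max(range(k, n), key=lambda i: abs(LU[i][k]))
            if LU[p][k] == 0.0:
                raise ValueError("matrix is singular")
            # Step 2, merged here: swap full rows, so columns to the left and
            # right of the panel are permuted too (role of pdlaswp).
            LU[k], LU[p] = LU[p], LU[k]
            perm[k], perm[p] = perm[p], perm[k]
            for i in range(k + 1, n):
                LU[i][k] /= LU[k][k]
                for j in range(k + 1, end):
                    LU[i][j] -= LU[i][k] * LU[k][j]
        # Step 3: block row of U by a unit-lower triangular solve (role of pdtrsm).
        for k in range(ib, end):
            for i in range(k + 1, end):
                for j in range(end, n):
                    LU[i][j] -= LU[i][k] * LU[k][j]
        # Step 4: trailing matrix update A22 -= L21 * U12 (role of pdgemm).
        for i in range(end, n):
            for k in range(ib, end):
                lik = LU[i][k]
                for j in range(end, n):
                    LU[i][j] -= lik * LU[k][j]
    return LU, perm


def max_residual(A, LU, perm):
    """Largest entry of |P*A - L*U|, with L unit lower and U upper triangular."""
    n = len(A)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            s = sum((1.0 if k == i else LU[i][k]) * LU[k][j]
                    for k in range(min(i, j) + 1))
            worst = max(worst, abs(s - A[perm[i]][j]))
    return worst
```

Calling `lu_partial_pivoting_blocked(A, 2)` on a small square matrix and then `max_residual(A, LU, perm)` gives a quick check that the permuted matrix is reproduced by the two factors. The pivot search in step 1 is the part that needs a column-by-column exchange between processors or threads, which is what the rest of the talk sets out to reduce.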

  6. CALU and CAQR approach. Communication-avoiding algorithms [Demmel, Grigori, Hoemmen, Langou, Xiang ’08] decrease the communication required for pivoting and overcome the latency bottleneck of classic algorithms by performing the factorization of a block column (a tall and skinny matrix) as a reduction operation, at the price of some redundant computation. They are communication optimal in terms of both latency and bandwidth, and they lead to significant speedups on distributed-memory computers.

  7. Goal. Combine the main ideas that reduce communication in CALU and CAQR with appropriate blocking, task identification, and dynamic scheduling. The reduction operation used for a block-column factorization is based on a binary tree with asynchronous tasks, which reduces synchronisation between threads (only O(log2(Pr))) and avoids bus contention.

  9. CAQR. Each panel factorization is computed as a reduction operation in which a QR factorization is performed at each node. The reduction tree is chosen depending on the underlying architecture; for a binary tree, log2(Pr) steps are used. Figure: Parallel TSQR.
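
The reduction described on this slide can be sketched as follows. This is a minimal sequential illustration in pure Python, assuming a binary tree and computing only the R factor (the Q factor is left implicit); the helper names `qr_r`, `tsqr_r` and `gram` are hypothetical, and each row block is assumed to have at least as many rows as columns.

```python
import math


def qr_r(A):
    """Return the n-by-n upper-triangular R of a Householder QR of A (m >= n)."""
    R = [row[:] for row in A]
    m, n = len(R), len(R[0])
    for k in range(n):
        norm = math.sqrt(sum(R[i][k] ** 2 for i in range(k, m)))
        if norm == 0.0:
            continue  # column already zero below the diagonal
        alpha = -math.copysign(norm, R[k][k])
        v = [R[i][k] for i in range(k, m)]
        v[0] -= alpha  # Householder vector v = x - alpha * e1
        vtv = sum(x * x for x in v)
        for j in range(k, n):
            s = 2.0 * sum(v[i - k] * R[i][j] for i in range(k, m)) / vtv
            for i in range(k, m):
                R[i][j] -= s * v[i - k]
    return [[R[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]


def tsqr_r(blocks):
    """Binary-tree TSQR on a list of row blocks.

    Returns (R, steps), where steps is the number of tree levels above the leaves.
    """
    level = [qr_r(B) for B in blocks]  # leaves: independent local QR factorizations
    steps = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            # Inner node: stack two R factors and factor the 2n-by-n result.
            nxt.append(qr_r(level[i] + level[i + 1]))
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired R factor moves up unchanged
        level = nxt
        steps += 1
    return level[0], steps


def gram(M):
    """Return M^T M, used to compare R factors independently of row signs."""
    n = len(M[0])
    return [[sum(row[i] * row[j] for row in M) for j in range(n)] for i in range(n)]
```

With four row blocks the tree has two levels, matching log2(Pr) for Pr = 4. Since A = QR with orthonormal Q implies A^T A = R^T R, comparing `gram(R)` with `gram(A)` is a simple way to check the result without fixing the signs of the rows of R.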

  10. CAQR. The trailing submatrix is updated using the same tree, in log2(Pr) steps. Figure: The update of the trailing submatrix is triggered by the reduction tree used during panel factorization.

  11. CALU [Grigori, Demmel, Xiang ’08]. The panel factorization is performed in two steps: (1) a preprocessing step identifies good pivot rows at low communication cost; (2) the pivot rows are permuted into the first positions of the panel, and LU without pivoting of the panel is performed. Figure: Stable parallel panel factorization.
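
The preprocessing step can be pictured as a tournament between row blocks. The sketch below is a minimal sequential illustration in pure Python, assuming a binary reduction tree in which every node runs Gaussian elimination with partial pivoting on the original rows of its candidates; the function names `gepp_pivot_rows`, `tournament_pivots` and `permute_pivots_first` are hypothetical and do not come from the slides.

```python
def gepp_pivot_rows(rows, b):
    """Return the positions in `rows` of the rows chosen as pivots by GEPP."""
    W = [r[:] for r in rows]  # work on a copy, the original rows stay intact
    idx = list(range(len(W)))
    steps = min(b, len(W))
    for k in range(steps):
        p = max(range(k, len(W)), key=lambda i: abs(W[i][k]))
        W[k], W[p] = W[p], W[k]
        idx[k], idx[p] = idx[p], idx[k]
        if W[k][k] == 0.0:
            continue  # nothing to eliminate in this column
        for i in range(k + 1, len(W)):
            f = W[i][k] / W[k][k]
            for j in range(k + 1, b):
                W[i][j] -= f * W[k][j]
    return idx[:steps]


def tournament_pivots(panel, num_blocks):
    """Select b pivot rows of an m-by-b panel with a binary-tree tournament."""
    m, b = len(panel), len(panel[0])
    size = -(-m // num_blocks)  # ceiling division: rows per leaf block
    candidates = []
    for start in range(0, m, size):
        ids = list(range(start, min(start + size, m)))
        picked = gepp_pivot_rows([panel[i] for i in ids], b)
        candidates.append([ids[p] for p in picked])
    while len(candidates) > 1:
        nxt = []
        for i in range(0, len(candidates) - 1, 2):
            # Merge two candidate sets and rerun GEPP on their original rows.
            ids = candidates[i] + candidates[i + 1]
            picked = gepp_pivot_rows([panel[r] for r in ids], b)
            nxt.append([ids[p] for p in picked])
        if len(candidates) % 2:
            nxt.append(candidates[-1])
        candidates = nxt
    return candidates[0]


def permute_pivots_first(panel, pivots):
    """Move the selected pivot rows to the top; the other rows keep their order."""
    chosen = set(pivots)
    return [panel[i] for i in pivots] + [
        row for i, row in enumerate(panel) if i not in chosen]
```

After `permute_pivots_first`, the panel can be factored by LU without pivoting, which is the second step on the slide. Only the small candidate sets travel up the tree, so the number of synchronisation points grows with the tree depth instead of with the number of columns in the panel.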

  12. CALU (Stability). Figure: Stability of binary tree based CALU factorization for random matrices. [Plot: average growth factor (y-axis, 100 to 700) against x-axis ticks 1024, 2048, 4096, 8192; curves for P=256 (b=32, 16), P=128 (b=64, 32, 16), P=64 (b=128, 64, 32, 16), GEPP, and the reference curves n^(2/3), 2*n^(2/3), 3*n^(1/2).] Extensive tests on random matrices and on a set of special matrices, using a binary tree and a flat tree, show that CALU is as stable as GEPP in practice.

  14. Multithreaded CALU. The matrix is partitioned into blocks of size Tr x b. The computation of each block is associated with a task. The task dependency graph is scheduled using a dynamic scheduler. Figure: A matrix of 4 × 4 blocks with Tr = 2, and the corresponding task dependency graph.
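
The idea of a task dependency graph driven by a dynamic scheduler can be sketched as below. This is a simplified illustration in pure Python and not the graph shown in the figure: it assumes one panel task per block column (the split of each panel into Tr tasks is left out), the task labels `P`, `U`, `S` and the function names are hypothetical, and the "scheduler" is a single-threaded ready queue that only records the order in which tasks become runnable.

```python
from collections import deque


def build_task_graph(nb):
    """Dependencies of a block LU on an nb-by-nb block matrix.

    ("P", k)       factor the panel (block column k)
    ("U", k, j)    compute block (k, j) of the block row of U
    ("S", k, i, j) update trailing block (i, j) at step k
    """
    deps = {}
    for k in range(nb):
        # Panel k needs every step k-1 update of block column k.
        deps[("P", k)] = [("S", k - 1, i, k) for i in range(k, nb)] if k else []
        for j in range(k + 1, nb):
            deps[("U", k, j)] = [("P", k)] + ([("S", k - 1, k, j)] if k else [])
            for i in range(k + 1, nb):
                deps[("S", k, i, j)] = [("P", k), ("U", k, j)] + (
                    [("S", k - 1, i, j)] if k else [])
    return deps


def dynamic_schedule(deps):
    """Run tasks as soon as all their dependencies are done; return the order."""
    remaining = {task: len(d) for task, d in deps.items()}
    children = {task: [] for task in deps}
    for task, d in deps.items():
        for parent in d:
            children[parent].append(task)
    ready = deque(task for task, count in remaining.items() if count == 0)
    order = []
    while ready:
        task = ready.popleft()  # a worker thread would execute the task here
        order.append(task)
        for child in children[task]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)  # last dependency finished: task is ready
    return order
```

For a 4 × 4 block matrix, `build_task_graph(4)` yields 24 tasks, and only the first panel is ready at the start. In a multithreaded setting, several workers would pull from the ready queue at once, so updates from step k can overlap with the panel of step k+1 as soon as their own dependencies are satisfied.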

  15. Multithreaded CALU. Panel factorization is performed in two steps: find good pivots at low communication cost, then permute them and compute the LU factorization of the panel without pivoting. The panel factorization stays on the critical path, but it is done more efficiently.

  16. Multithreaded CALU (Execution). Figure: Example of execution of CALU for a 10^5 × 1000 tall skinny matrix, using b = 100 and Tr = 1, on 8 cores. Figure: Example of execution of CALU for a 10^5 × 1000 tall skinny matrix, using b = 100 and Tr = 8, on 8 cores.

  17. Multithreaded CAQR. Same approach as CALU, but the panel factorization is performed only once, and the update of the trailing matrix is triggered by the binary tree used for the panel factorization.

  19. Environments. Tests were performed on a two-socket, quad-core machine based on the Intel Xeon EM64T processor running Linux, and on a four-socket, quad-core machine based on the AMD Opteron processor. Results are compared with MKL-10.0.4.23 and PLASMA 2.0 (with default parameters). b = MIN(n, 100) was chosen as the block size.

  20. Performance of CALU. Figure: Performance of CALU, MKL_dgetrf, PLASMA_dgetrf on 8 cores for a tall skinny matrix, with m = 10^5 and n varying from 10 to 1000. [Plot: GFlops/s against log2(n); curves: MKL_dgetf2, MKL_dgetrf, PLASMA_dgetrf, CALU(Tr=4), CALU(Tr=8).]

  21. Performance of CALU. Figure: Performance of CALU, MKL_dgetrf, PLASMA_dgetrf on 16 cores for a tall skinny matrix, with m = 10^5 and n varying from 10 to 1000. [Plot: GFlops/s against log2(n); curves as labelled in the legend: ACML_dgeqrf, PLASMA_dgeqrf, CALU(Tr=8), CALU(Tr=16).]

  22. Performance of CAQR. Figure: Performance of CAQR, MKL_dgeqrf, PLASMA_dgeqrf on 8 cores for a tall skinny matrix, with m = 10^5 and n varying from 10 to 1000. [Plot: GFlops/s against log2(n); curves: MKL_dgeqrf, PLASMA_dgeqrf, CAQR(Tr=2), CAQR(Tr=4), CAQR(Tr=8), TSQR.]

  24. Conclusion. Multithreaded CALU and CAQR lead to significant improvements for tall and skinny matrices with respect to the corresponding routines in MKL and PLASMA. PLASMA becomes more efficient as the number of columns increases. No significant improvements have been obtained so far for square matrices. Prospects: improve the performance of the trailing matrix update by increasing the block size to optimize BLAS3 operations; compare with the recent approach of [Hadri, Ltaief, Agullo, Dongarra ’09] for QR factorization, which uses a different reduction tree during panel factorization.
