dense linear algebra on heterogeneous platforms state of
play

Dense Linear Algebra on Heterogeneous Platforms: State of the Art - PowerPoint PPT Presentation

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen pauldj@aices.rwth-aachen.de ComplexHPC Spring School 2013 Heterogeneous computing - Impact on algorithms June 7th, 2013 Uppsala


  1. Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen pauldj@aices.rwth-aachen.de ComplexHPC Spring School 2013 Heterogeneous computing - Impact on algorithms June 7th, 2013 Uppsala University, Sweden Paolo Bientinesi (AICES, RWTH Aachen) 1 / 34

  2. Setting the stage 1 Part 1: blocked algorithms 2 Part 2: multithreading, fork-join 3 Part 3: multihtreading, algorithms-by-blocks 4 5 Part 4: streaming Paolo Bientinesi (AICES, RWTH Aachen) 2 / 34

  3. Dense Linear Algebra ××××××××××××   ×××××××××××× ××××××××××××   ××××××××××××   ××××××××××××   ××××××××××××   M =  ××××××××××××   ××××××××××××    ××××××××××××   ××××××××××××   ×××××××××××× ×××××××××××× Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

  4. Dense Linear Algebra ××× × ××× ×   ×××××××××××× ××××× ×××××   ××××××××××××   ××× ××××× ××   ×××× ××××× ×   M =  ××× × ××× ×   ××××××××××××    ××××× ×××××   ××××××××××××   ××× ××××× ×× ×××× ××××× × Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

  5. Dense Linear Algebra ×× × × × ×   × × ××× ×× × ×× × ××   × ×× ×× ×   × × ×× ×   × ××× × ×   M =  × × × × × ×   ×× ××××× ××    × × ××   ××× × ×××   × × × ×× ×× × ×× × ×× × Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

  6. Dense Linear Algebra ×   ×× ×××   ××××   ××××   ×××× ×   M =  ×××××××   ×× ×××××    ×××××××××   ××××××× ×   ×××××××××× ×××××××××××× Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

  7. Dense Linear Algebra ××   ××× ×××   ×××   ×××   ×××   M =  ×××   ×××    ×××   ×××   ××× ×× Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34

  8. Dense Linear Algebra Linear systems Eigenproblems Ax = b , AX = B , least squares, . . . Ax = λx , AX = BX Λ , SVD, . . . Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34

  9. Dense Linear Algebra Linear systems Eigenproblems Ax = b , AX = B , least squares, . . . Ax = λx , AX = BX Λ , SVD, . . . Support routines factorizations, reductions, . . . Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34

  10. Dense Linear Algebra Matrix equations AX + XB = C , A = A + A − 1 , . . . 2 Linear systems Eigenproblems Ax = b , AX = B , least squares, . . . Ax = λx , AX = BX Λ , SVD, . . . Support routines factorizations, reductions, . . . Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34

  11. Organization in layers Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

  12. Organization in layers BLAS Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

  13. Organization in layers BLAS x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

  14. Organization in layers BLAS A, L ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax y := L − 1 x x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

  15. Organization in layers BLAS A, B, C, L ∈ R n × n BLAS-3 : C := C + AB C := L − 1 B A, L ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax y := L − 1 x x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

  16. Organization in layers LAPACK LL T = A, Q T TQ = A, . . . LU = A, QR = A, BLAS A, B, C, L ∈ R n × n BLAS-3 : C := C + AB C := L − 1 B A, L ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax y := L − 1 x x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

  17. Organization in layers other libraries ScaLAPACK, Elemental, PETSc, . . . LAPACK LL T = A, Q T TQ = A, . . . LU = A, QR = A, BLAS A, B, C, L ∈ R n × n BLAS-3 : C := C + AB C := L − 1 B A, L ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax y := L − 1 x x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34

  18. Example: AX = B (full A ) AX = B Linear System LX = B LU = A Triangular LU System Factorization LX = B C = AB + C C = AB + C Triangular Gemm Gemm System C = AB + C Gemm Paolo Bientinesi (AICES, RWTH Aachen) 6 / 34

  19. Why BLAS-3? Why GEMM? Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  20. Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  21. Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS Level 1 2 n 3 n 2 / 3 x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  22. Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS Level 1 2 n 3 n 2 / 3 A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  23. Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS 2 n 2 n 2 2 Level 2 Level 1 2 n 3 n 2 / 3 A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  24. Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS 2 n 2 n 2 2 Level 2 Level 1 2 n 3 n 2 / 3 A, B, C, ∈ R n × n BLAS-3 : C := C + AB A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  25. Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS 2 n 3 4 n 2 Level 3 n/ 2 2 n 2 n 2 2 Level 2 Level 1 2 n 3 n 2 / 3 A, B, C, ∈ R n × n BLAS-3 : C := C + AB A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  26. Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS 2 n 3 4 n 2 Level 3 n/ 2 2 n 2 n 2 2 Level 2 Level 1 2 n 3 n 2 / 3 A, B, C, ∈ R n × n BLAS-3 : C := C + AB A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Morale BLAS-3: The larger the problem the better, as long as it fits in memory. GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK. Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34

  27. Part 1: Blocked algorithms Simple example: Cholesky factorization Input: Matrix A , symmetric and positive definite. Determine L (lower triangular matrix) such that LL T = A Goal: � L T L � L = = ? L BL L BR Paolo Bientinesi (AICES, RWTH Aachen) 8 / 34

  28. Cholesky factorization iteration i DONE DONE Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34

  29. Cholesky factorization iteration i + 1 unblocked algorithm blocked algorithm ❅ ❅ ❘ sqrt CHOL ✲ syr trsv TRSM SYRK Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34

  30. Cholesky factorization iteration i + 1 unblocked algorithm blocked algorithm DONE DONE DONE DONE Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34

  31. Cholesky: unblocked vs. blocked algorithms Paolo Bientinesi (AICES, RWTH Aachen) 10 / 34

  32. Part 2: Parallelism? fork-join Solution #1: Multithreaded BLAS (+ vector instructions) Chol LU TRSM TRSM GEMM TRSM GEMM Paolo Bientinesi (AICES, RWTH Aachen) 11 / 34

  33. Part 2: Parallelism? fork-join Solution #1: Multithreaded BLAS (+ vector instructions) Chol LU TRSM TRSM GEMM TRSM GEMM Advantage: ease of use. Legacy code! Drawback: unnecessary synchronization points OpenBLAS, ATLAS, BLIS, old versions of MKL, . . . Paolo Bientinesi (AICES, RWTH Aachen) 11 / 34

  34. Multithreaded BLAS Xeon, 32 physical cores, MKL Efficiency of GEMM 1 0.8 Efficiency 0.6 1 thread 0.4 2 threads 4 threads 8 threads 0.2 16 threads 32 threads 0 1000 2000 3000 4000 5000 6000 7000 8000 Matrix dimension Paolo Bientinesi (AICES, RWTH Aachen) 12 / 34

  35. Development of LA libraries - New architecture / new architectural features Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

  36. Development of LA libraries - New architecture / new architectural features - GEMM Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

  37. Development of LA libraries - New architecture / new architectural features - GEMM → peak performance [FFT, SpMV] Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

  38. Development of LA libraries - New architecture / new architectural features - GEMM → peak performance [FFT, SpMV] - BLAS3, factorizations, AX=B Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

  39. Development of LA libraries - New architecture / new architectural features - GEMM → peak performance [FFT, SpMV] → - BLAS3, factorizations, AX=B LINPACK benchmark Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34

Recommend


More recommend