Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen pauldj@aices.rwth-aachen.de ComplexHPC Spring School 2013 Heterogeneous computing - Impact on algorithms June 7th, 2013 Uppsala University, Sweden Paolo Bientinesi (AICES, RWTH Aachen) 1 / 34
Setting the stage 1 Part 1: blocked algorithms 2 Part 2: multithreading, fork-join 3 Part 3: multihtreading, algorithms-by-blocks 4 5 Part 4: streaming Paolo Bientinesi (AICES, RWTH Aachen) 2 / 34
Dense Linear Algebra ×××××××××××× ×××××××××××× ×××××××××××× ×××××××××××× ×××××××××××× ×××××××××××× M = ×××××××××××× ×××××××××××× ×××××××××××× ×××××××××××× ×××××××××××× ×××××××××××× Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34
Dense Linear Algebra ××× × ××× × ×××××××××××× ××××× ××××× ×××××××××××× ××× ××××× ×× ×××× ××××× × M = ××× × ××× × ×××××××××××× ××××× ××××× ×××××××××××× ××× ××××× ×× ×××× ××××× × Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34
Dense Linear Algebra ×× × × × × × × ××× ×× × ×× × ×× × ×× ×× × × × ×× × × ××× × × M = × × × × × × ×× ××××× ×× × × ×× ××× × ××× × × × ×× ×× × ×× × ×× × Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34
Dense Linear Algebra × ×× ××× ×××× ×××× ×××× × M = ××××××× ×× ××××× ××××××××× ××××××× × ×××××××××× ×××××××××××× Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34
Dense Linear Algebra ×× ××× ××× ××× ××× ××× M = ××× ××× ××× ××× ××× ×× Paolo Bientinesi (AICES, RWTH Aachen) 3 / 34
Dense Linear Algebra Linear systems Eigenproblems Ax = b , AX = B , least squares, . . . Ax = λx , AX = BX Λ , SVD, . . . Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34
Dense Linear Algebra Linear systems Eigenproblems Ax = b , AX = B , least squares, . . . Ax = λx , AX = BX Λ , SVD, . . . Support routines factorizations, reductions, . . . Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34
Dense Linear Algebra Matrix equations AX + XB = C , A = A + A − 1 , . . . 2 Linear systems Eigenproblems Ax = b , AX = B , least squares, . . . Ax = λx , AX = BX Λ , SVD, . . . Support routines factorizations, reductions, . . . Paolo Bientinesi (AICES, RWTH Aachen) 4 / 34
Organization in layers Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34
Organization in layers BLAS Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34
Organization in layers BLAS x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34
Organization in layers BLAS A, L ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax y := L − 1 x x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34
Organization in layers BLAS A, B, C, L ∈ R n × n BLAS-3 : C := C + AB C := L − 1 B A, L ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax y := L − 1 x x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34
Organization in layers LAPACK LL T = A, Q T TQ = A, . . . LU = A, QR = A, BLAS A, B, C, L ∈ R n × n BLAS-3 : C := C + AB C := L − 1 B A, L ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax y := L − 1 x x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34
Organization in layers other libraries ScaLAPACK, Elemental, PETSc, . . . LAPACK LL T = A, Q T TQ = A, . . . LU = A, QR = A, BLAS A, B, C, L ∈ R n × n BLAS-3 : C := C + AB C := L − 1 B A, L ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax y := L − 1 x x, y ∈ R n , α ∈ R BLAS-1: y := y + αx dot := α + x T y Paolo Bientinesi (AICES, RWTH Aachen) 5 / 34
Example: AX = B (full A ) AX = B Linear System LX = B LU = A Triangular LU System Factorization LX = B C = AB + C C = AB + C Triangular Gemm Gemm System C = AB + C Gemm Paolo Bientinesi (AICES, RWTH Aachen) 6 / 34
Why BLAS-3? Why GEMM? Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS Level 1 2 n 3 n 2 / 3 x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS Level 1 2 n 3 n 2 / 3 A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS 2 n 2 n 2 2 Level 2 Level 1 2 n 3 n 2 / 3 A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS 2 n 2 n 2 2 Level 2 Level 1 2 n 3 n 2 / 3 A, B, C, ∈ R n × n BLAS-3 : C := C + AB A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS 2 n 3 4 n 2 Level 3 n/ 2 2 n 2 n 2 2 Level 2 Level 1 2 n 3 n 2 / 3 A, B, C, ∈ R n × n BLAS-3 : C := C + AB A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
Why BLAS-3? Why GEMM? BLAS Mem. refs. Ratio #FLOPS 2 n 3 4 n 2 Level 3 n/ 2 2 n 2 n 2 2 Level 2 Level 1 2 n 3 n 2 / 3 A, B, C, ∈ R n × n BLAS-3 : C := C + AB A ∈ R n × n , x, y ∈ R n BLAS-2: y := y + Ax x, y ∈ R n , α ∈ R BLAS-1: y := y + αx Morale BLAS-3: The larger the problem the better, as long as it fits in memory. GEMM is the building block for all the other BLAS-3 kernels, and for LAPACK. Paolo Bientinesi (AICES, RWTH Aachen) 7 / 34
Part 1: Blocked algorithms Simple example: Cholesky factorization Input: Matrix A , symmetric and positive definite. Determine L (lower triangular matrix) such that LL T = A Goal: � L T L � L = = ? L BL L BR Paolo Bientinesi (AICES, RWTH Aachen) 8 / 34
Cholesky factorization iteration i DONE DONE Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34
Cholesky factorization iteration i + 1 unblocked algorithm blocked algorithm ❅ ❅ ❘ sqrt CHOL ✲ syr trsv TRSM SYRK Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34
Cholesky factorization iteration i + 1 unblocked algorithm blocked algorithm DONE DONE DONE DONE Paolo Bientinesi (AICES, RWTH Aachen) 9 / 34
Cholesky: unblocked vs. blocked algorithms Paolo Bientinesi (AICES, RWTH Aachen) 10 / 34
Part 2: Parallelism? fork-join Solution #1: Multithreaded BLAS (+ vector instructions) Chol LU TRSM TRSM GEMM TRSM GEMM Paolo Bientinesi (AICES, RWTH Aachen) 11 / 34
Part 2: Parallelism? fork-join Solution #1: Multithreaded BLAS (+ vector instructions) Chol LU TRSM TRSM GEMM TRSM GEMM Advantage: ease of use. Legacy code! Drawback: unnecessary synchronization points OpenBLAS, ATLAS, BLIS, old versions of MKL, . . . Paolo Bientinesi (AICES, RWTH Aachen) 11 / 34
Multithreaded BLAS Xeon, 32 physical cores, MKL Efficiency of GEMM 1 0.8 Efficiency 0.6 1 thread 0.4 2 threads 4 threads 8 threads 0.2 16 threads 32 threads 0 1000 2000 3000 4000 5000 6000 7000 8000 Matrix dimension Paolo Bientinesi (AICES, RWTH Aachen) 12 / 34
Development of LA libraries - New architecture / new architectural features Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
Development of LA libraries - New architecture / new architectural features - GEMM Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
Development of LA libraries - New architecture / new architectural features - GEMM → peak performance [FFT, SpMV] Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
Development of LA libraries - New architecture / new architectural features - GEMM → peak performance [FFT, SpMV] - BLAS3, factorizations, AX=B Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
Development of LA libraries - New architecture / new architectural features - GEMM → peak performance [FFT, SpMV] → - BLAS3, factorizations, AX=B LINPACK benchmark Paolo Bientinesi (AICES, RWTH Aachen) 13 / 34
Recommend
More recommend