Accelerating Linear Algebra on Small Matrices – from Batched BLAS to Large Scale Solvers
Stan Tomov and Ichitaro Yamazaki
Azzam Haidar, Ahmad Abdelfattah, Mark Gates, and Jack Dongarra
Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville
In collaboration with: LLNL, Livermore, CA, USA; University of Manchester, Manchester, UK; University of Paris-Sud, France
GTC 2018, San Jose, CA, March 26–29, 2018
Outline
• Introduction
• Batched BLAS
• MAGMA Batched functionalities and techniques
• Accelerating applications with Batched LA
  – MAGMA DNN, Templates, and Tensors
  – Fused Batched BLAS
  – Applications in exascale discretizations (CEED project)
• PART II: Batched computations for large scale solvers with low-rank approximations and preconditioning
• Conclusions
Dense Linear Algebra in Applications
Dense Linear Algebra (DLA) is needed in a wide variety of science and engineering applications:
• Linear systems: Solve Ax = b
  Computational electromagnetics, material science, applications using boundary integral equations, airflow past wings, fluid flow around ships and other offshore constructions, and many more
• Least squares: Find x to minimize || Ax – b ||
  Computational statistics (e.g., linear least squares or ordinary least squares), econometrics, control theory, signal processing, curve fitting, and many more
• Eigenproblems: Solve Ax = λx
  Computational chemistry, quantum mechanics, material science, face recognition, PCA, data mining, marketing, Google PageRank, spectral clustering, vibrational analysis, compression, and many more
• SVD: A = U Σ V* (Av = σu and A*u = σv)
  Information retrieval, web search, signal processing, big data analytics, low-rank matrix approximation, total least squares minimization, pseudo-inverse, and many more
• Many variations depending on the structure of A
  A can be symmetric, positive definite, tridiagonal, Hessenberg, banded, sparse with dense blocks, etc.
• DLA is crucial to the development of sparse solvers
All of the above is provided in MAGMA 2.3: http://icl.cs.utk.edu/magma, https://bitbucket.org/icl/magma
Why use GPUs in HPC? PERFORMANCE & ENERGY EFFICIENCY
MAGMA 2.3 LU factorization in double precision arithmetic:
• CPU: Intel Xeon E5-2650 v3 (Haswell), 2 x 10 cores @ 2.30 GHz
• K40: NVIDIA Kepler GPU, 15 MPs x 192 cores @ 0.88 GHz
• P100: NVIDIA Pascal GPU, 56 MPs x 64 cores @ 1.19 GHz
• V100: NVIDIA Volta GPU, 80 MPs x 64 cores @ 1.38 GHz
[Figure: performance in GFLOP/s vs. matrix size N x N, and energy efficiency in GFLOP/s per Watt, for the CPU, K40, P100, and V100. The V100 reaches roughly 10x the CPU performance and roughly 10x the CPU energy efficiency, under approximately the same power draw.]
Many applications need LA on many small matrices
Sparse/dense solvers and preconditioners, data analytics, and the linear algebra on small problems associated with them are needed in many applications:
• Machine learning, neuroscience, data mining, astrophysics, high-order FEM, quantum chemistry, numerical LA, multi-physics problems, graph analysis, signal processing, etc.
• A sparse/dense matrix system is factored with a DAG-based factorization whose tasks map to batched LAPACK routines and, ultimately, to single calls to Batched BLAS.
Machine learning
[Figure: a convolutional network with convolution, pooling, and fully connected layers mapping input data D through filters F to output predictions O.]
• Convolution of filters F_n (feature detection) with an input image D: for every filter F_n and every channel, the computation of every pixel value O_{n,k} is a tensor contraction:
      O_{n,k} = Σ_i D_{k,i} F_{n,i}
• Plenty of parallelism; small operations that must be batched.
• With a data "reshape" the computation can be transformed into a batched GEMM (for efficiency; among other approaches).
Applications using high-order FEM
• Matrix-free basis evaluation needs efficient tensor contractions:
      C_{k,i2,i3} = Σ_{i1} A_{k,i1} B_{i1,i2,i3}
• Within the ECP CEED project, we designed MAGMA batched methods to split the computation into many small high-intensity GEMMs, grouped together (batched) for efficient execution, as sketched below:
      Batch{ C_{i3} = A^T B_{i3}, for a range of i3 }
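A minimal sketch of how the reshaped contraction above could be expressed as a single strided batched GEMM, shown here with cuBLAS (cublasDgemmStridedBatched); the dimension names n1, n2, n3, m, the column-major layout, and the zero-filled placeholder data are illustrative assumptions, not the MAGMA implementation (MAGMA exposes analogous batched GEMM routines):

```c
// Illustrative sketch: tensor contraction as one strided batched GEMM.
// C(:,:,i3) = A^T * B(:,:,i3) for i3 = 0..n3-1, all arrays column-major.
// A is n1 x m (shared by every slice), B is n1 x n2 x n3, C is m x n2 x n3.
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

static void contract_batched(cublasHandle_t handle,
                             const double *dA, const double *dB, double *dC,
                             int n1, int n2, int n3, int m)
{
    const double one = 1.0, zero = 0.0;
    // One call replaces n3 small GEMMs: strideA = 0 reuses the same basis
    // matrix A for every slice; strideB/strideC step through tensor slices.
    cublasDgemmStridedBatched(handle,
        CUBLAS_OP_T, CUBLAS_OP_N,
        m, n2, n1,
        &one,
        dA, n1, 0,                        // A: n1 x m, reused (stride 0)
        dB, n1, (long long)n1 * n2,       // B slices: n1 x n2
        &zero,
        dC, m,  (long long)m * n2,        // C slices: m x n2
        n3);
}

int main(void)
{
    const int n1 = 8, n2 = 8, n3 = 512, m = 8;   // typical "small" FEM sizes
    double *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(double) * n1 * m);
    cudaMalloc(&dB, sizeof(double) * n1 * n2 * n3);
    cudaMalloc(&dC, sizeof(double) * m * n2 * n3);
    cudaMemset(dA, 0, sizeof(double) * n1 * m);           // placeholder data
    cudaMemset(dB, 0, sizeof(double) * n1 * n2 * n3);

    cublasHandle_t handle;
    cublasCreate(&handle);
    contract_batched(handle, dA, dB, dC, n1, n2, n3, m);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    printf("batched contraction done: %d slices of (%d x %d) * (%d x %d)\n",
           n3, m, n1, n1, n2);
    return 0;
}
```

Reusing A across all slices (stride 0) is what lets the many small, memory-bound contractions collapse into one high-intensity batched call.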
MAGMA Batched Computations
1. Non-batched computation
Loop over the matrices one by one and compute each with a multithreaded routine. Since the matrices are small, there is not enough work for all the cores, so we expect low performance; thread contention may also hurt performance.
    for (i = 0; i < batchcount; i++)
        dgemm(...);
There is not enough work to fill all the cores: only a low percentage of the resources is used.
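A minimal compilable sketch of this non-batched baseline, assuming the GEMMs run on the GPU through cuBLAS (the slide's dgemm(...) stands for any BLAS/vendor GEMM); the per-matrix device pointers dA, dB, dC and the square size n are illustrative assumptions:

```c
// Non-batched baseline: one library call (and one kernel launch) per GEMM.
#include <cuda_runtime.h>
#include <cublas_v2.h>

// dA, dB, dC are host arrays of device pointers, one n x n matrix each.
void gemm_loop(cublasHandle_t handle,
               double **dA, double **dB, double **dC,
               int n, int batchcount)
{
    const double one = 1.0, zero = 0.0;
    for (int i = 0; i < batchcount; i++) {
        // Equivalent of the slide's "for (i=0; i<batchcount; i++) dgemm(...)"
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &one,
                    dA[i], n, dB[i], n, &zero, dC[i], n);
    }
}
```

Each call launches a kernel that is far too small to occupy the whole GPU, which is exactly the underutilization the slide describes.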
MAGMA Batched Computations
2. Batched computation
Distribute all the matrices over the available resources by assigning each matrix to a group of cores / thread blocks (TBs) that operates on it independently:
• For very small matrices, assign one matrix per core (CPU) or per TB (GPU).
• For medium sizes, a matrix goes to a team of cores (CPU) or to many TBs (GPU).
• For large sizes, switch to the classical multithreaded approach of one matrix at a time.
    Batched_dgemm(...)
A task manager/dispatcher, based on the kernel design, decides the number of TBs or threads (GPU/CPU) and dispatches the work through the NVIDIA/OpenMP scheduler. A high percentage of the resources is used.
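For contrast with the loop above, a hedged sketch of the batched counterpart using cublasDgemmBatched (MAGMA provides analogous routines such as magma_dgemm_batched); the pointer-array arguments, which for cuBLAS must live in device memory, and the square size n are illustrative assumptions:

```c
// Batched alternative: one call hands all batchcount GEMMs to the library,
// which packs many small problems onto the GPU at once.
#include <cuda_runtime.h>
#include <cublas_v2.h>

// dA_array, dB_array, dC_array are device arrays of device pointers,
// each pointing to an n x n matrix.
void gemm_batched(cublasHandle_t handle,
                  const double *const *dA_array,
                  const double *const *dB_array,
                  double *const *dC_array,
                  int n, int batchcount)
{
    const double one = 1.0, zero = 0.0;
    // Counterpart of the slide's Batched_dgemm(...): the library decides how
    // many thread blocks to assign per matrix and dispatches them together.
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n, n, n, &one,
                       dA_array, n, dB_array, n, &zero,
                       dC_array, n, batchcount);
}
```

A single call exposes the whole batch to the scheduler, matching the dispatcher picture on the slide: the kernel design chooses the TB-per-matrix mapping rather than the caller looping over matrices.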