Experiments with Mixed Precision Algorithms in Linear Algebra
Jack Dongarra (UTK/ORNL/U Manchester), Azzam Haidar (Nvidia), Stan Tomov (UTK), Nick Higham (U of Manchester)
8/28/19
Mixed Precision
• Today there are many precisions to deal with (IEEE standard).
• Note the limited number range of half precision (16-bit floating point): the largest float16 number is 65,504, while Google's TPU bfloat16 format has roughly the same range as IEEE single precision, with a largest number of O(10^38).
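As a quick illustration of these ranges, here is a minimal NumPy sketch (not from the original slides); the bfloat16 maximum is computed by hand, since base NumPy has no bfloat16 dtype.

```python
import numpy as np

# Largest finite values for the formats mentioned above.
print(np.finfo(np.float16).max)    # 65504.0   -> half precision overflows early
print(np.finfo(np.float32).max)    # ~3.40e38  -> IEEE single precision

# bfloat16 keeps float32's 8 exponent bits but only 8 significand bits, so its
# largest finite value is (2 - 2**-7) * 2**127 ~ 3.39e38 -- roughly the same
# range as IEEE SP (computed by hand; base NumPy has no bfloat16 dtype).
bfloat16_max = (2 - 2**-7) * 2.0**127
print(bfloat16_max)
```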
Nvidia Volta Peak Rates
• Four performance levels for the different precisions:
  • 64-bit floating point (FMA): 7.5 Tflop/s
  • 32-bit floating point (FMA): 15 Tflop/s
  • 16-bit floating point (FMA): 30 Tflop/s
  • 16-bit floating point with Tensor Cores: 120 Tflop/s
• The numerical characteristics of arithmetic on the Tensor Cores are different.
[Figure: Tensor Core performance for mixed precision matrix multiply of 4x4 matrices]
4x4 matrix multiply: 32-bit floating point accuracy with 16-bit inputs
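A minimal NumPy emulation of this idea (an illustrative sketch, not the actual Tensor Core hardware path): products of FP16 inputs are accumulated either in FP16 or in FP32, and both results are compared against an FP64 reference.

```python
import numpy as np

rng = np.random.default_rng(0)
A16 = rng.standard_normal((4, 4)).astype(np.float16)   # 4x4 half-precision inputs
B16 = rng.standard_normal((4, 4)).astype(np.float16)

def matmul_fp16_accumulate(A, B):
    """Emulate a pure FP16 unit: every product and partial sum is rounded to float16."""
    n, k = A.shape
    m = B.shape[1]
    C = np.zeros((n, m), dtype=np.float16)
    for i in range(n):
        for j in range(m):
            acc = np.float16(0.0)
            for p in range(k):
                acc = np.float16(acc + np.float16(A[i, p] * B[p, j]))
            C[i, j] = acc
    return C

C_fp16  = matmul_fp16_accumulate(A16, B16)                 # FP16 accumulation
C_mixed = A16.astype(np.float32) @ B16.astype(np.float32)  # FP16 inputs, FP32 accumulation
C_ref   = A16.astype(np.float64) @ B16.astype(np.float64)  # FP64 reference

print("max error, FP16 accumulation:", np.abs(C_fp16  - C_ref).max())
print("max error, FP32 accumulation:", np.abs(C_mixed - C_ref).max())
# The gap widens further as the inner dimension (here only 4) grows.
```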
Dense Linear Algebra (DLA) is needed in a wide variety of science and engineering applications:
• Linear systems: solve Ax = b
  • Computational electromagnetics, material science, applications using boundary integral equations, airflow past wings, fluid flow around ships and other offshore constructions, and many more
• Least squares: find x to minimize ||Ax − b||
  • Computational statistics (e.g., linear least squares or ordinary least squares), econometrics, control theory, signal processing, curve fitting, and many more
• Eigenproblems: solve Ax = λx
  • Computational chemistry, quantum mechanics, material science, face recognition, PCA, data mining, marketing, Google PageRank, spectral clustering, vibrational analysis, compression, and many more
• SVD: A = UΣV* (Av = σu and A*u = σv)
  • Information retrieval, web search, signal processing, big data analytics, low-rank matrix approximation, total least squares minimization, pseudo-inverse, and many more
• Many variations depending on the structure of A
  • A can be symmetric, positive definite, tridiagonal, Hessenberg, banded, sparse with dense blocks, etc.
• DLA is crucial to the development of sparse solvers.
Leveraging Half Precision in HPC on V100
Study of the matrix-matrix multiplication (GEMM) kernel on the Nvidia V100:
• FP64 dgemm achieves about 6.4 Tflop/s
• FP32 sgemm achieves about 14 Tflop/s (~2X)
• FP16 hgemm achieves about 27 Tflop/s (~4X)
• FP16 GEMM with Tensor Cores reaches about 85 Tflop/s (~12X)
[Figure: GEMM performance (Tflop/s) vs. matrix size from 2k to 30k for FP64, FP32, FP16, and FP16 Tensor Core GEMM]
Leveraging Half Precision in HPC on V100
Study of the rank-k update used by the LU factorization algorithm on the Nvidia V100:
• In LU factorization we need a matrix multiply, but the operation is a rank-k update computing the Schur complement.
• The rank-k GEMM needed by LU does not perform as well as a square GEMM, but it is still OK.
[Figure: performance (Tflop/s) vs. m = n for square and k = 256 rank-k GEMM in FP16 with Tensor Cores, FP16, FP32, and FP64]
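For concreteness, a hedged NumPy sketch of what such a rank-k (here k = nb) update looks like: it is a GEMM whose inner dimension is only the panel width, subtracting the Schur complement contribution from the trailing matrix. The sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, nb = 2048, 256                       # trailing-matrix size and panel width (illustrative)

L21 = rng.standard_normal((n, nb))      # panel factors below the diagonal block
U12 = rng.standard_normal((nb, n))      # block row of U produced by the TRSM
A22 = rng.standard_normal((n, n))       # trailing matrix

# Schur complement / rank-nb update: an (n x nb) by (nb x n) GEMM.
# This call dominates the flops of blocked LU, but its inner dimension k = nb
# is small, which is why it runs below the peak of a large square GEMM.
A22 -= L21 @ U12
```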
Leveraging Half Precision in HPC on V100: solving a linear system Ax = b
• LU factorization is used to solve a linear system Ax = b:
  • Factor A = LU, so LUx = b
  • Solve Ly = b for y (forward substitution)
  • Then solve Ux = y for x (backward substitution)
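A minimal SciPy sketch of exactly these three steps (illustrative; the test matrix and size are made up):

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(2)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)

P, L, U = lu(A)                                   # A = P L U   (the O(n^3) factorization)
y = solve_triangular(L, P.T @ b, lower=True)      # forward substitution:  L y = P^T b
x = solve_triangular(U, y)                        # backward substitution: U x = y

print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # relative residual
```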
Leveraging Half Precision in HPC on V100: solving a linear system Ax = b
• LU factorization requires O(n^3) operations; most of them are spent in GEMM.
• Blocked algorithm: for s = 0, nb, ..., N:
  1. Panel factorization
  2. Update of the trailing matrix:
     • TRSM – triangular solve
     • GEMM – matrix multiply
[Figure: block column (panel) of width nb and trailing matrix, updated over steps 1, 2, 3, 4]
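A compact sketch of this blocked, right-looking LU in NumPy/SciPy (no pivoting, so it assumes a diagonally dominant matrix; real libraries such as MAGMA factor the panel with partial pivoting):

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, nb=128):
    """Right-looking blocked LU without pivoting (illustrative sketch).
    Returns A overwritten with the unit-lower L and upper U factors."""
    A = A.copy()
    n = A.shape[0]
    for s in range(0, n, nb):
        e = min(s + nb, n)
        # 1. panel factorization: unblocked LU of the tall panel A[s:, s:e]
        for j in range(s, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        if e < n:
            # 2a. TRSM: apply L11^{-1} to the block row of U
            A[s:e, e:] = solve_triangular(A[s:e, s:e], A[s:e, e:],
                                          lower=True, unit_diagonal=True)
            # 2b. GEMM: rank-nb Schur complement update of the trailing matrix
            A[e:, e:] -= A[e:, s:e] @ A[s:e, e:]
    return A

n = 512
A = np.random.default_rng(3).standard_normal((n, n)) + n * np.eye(n)
LU = blocked_lu(A)
L = np.tril(LU, -1) + np.eye(n)
U = np.triu(LU)
print(np.linalg.norm(L @ U - A) / np.linalg.norm(A))   # factorization residual
```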
Leveraging Half Precision in HPC on V100
Study of the LU factorization algorithm on the Nvidia V100:
• LU factorization is used to solve a linear system Ax = b: factor A = LU, solve Ly = b, then Ux = y.
• FP16-TC hgetrf (Tensor Cores), FP16 hgetrf, FP32 sgetrf, and FP64 dgetrf are compared; the figure shows roughly a 3–4X speedup of the Tensor Core version over FP64.
[Figure: LU factorization performance (Tflop/s, 0–24) vs. matrix size from 2k to 30k]
Leveraging Half Precision in HPC on V100: solving a linear system Ax = b
For s = 0, nb, ..., N: 1. panel factorization; 2. update of the trailing matrix.
• Panel factorization is performed in 32-bit floating point, using MAGMA on the front-end system.
• TRSM (triangular solve) is performed in 32-bit floating point on the V100 (no Tensor Cores).
• GEMM (matrix multiply) is performed in 16-bit floating point on the V100 with Tensor Cores.
• Most of the performance comes from the GEMM in 16-bit floating point.
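The same blocked LU sketch with this precision split emulated in NumPy (a sketch only: FP16 inputs with FP32 accumulation stand in for the Tensor Core GEMM, and everything runs on the CPU rather than the GPU):

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu_mixed(A, nb=128):
    """Blocked LU (no pivoting) with the V100-style precision split:
    panel and TRSM in FP32, trailing-matrix GEMM with FP16 inputs."""
    A = A.astype(np.float32).copy()
    n = A.shape[0]
    for s in range(0, n, nb):
        e = min(s + nb, n)
        # panel factorization in FP32 (done with MAGMA on the front-end system)
        for j in range(s, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        if e < n:
            # TRSM in FP32 (on the GPU, no Tensor Cores)
            A[s:e, e:] = solve_triangular(A[s:e, s:e], A[s:e, e:],
                                          lower=True, unit_diagonal=True)
            # GEMM with FP16 inputs and FP32 accumulation (Tensor-Core-style, emulated)
            L21 = A[e:, s:e].astype(np.float16).astype(np.float32)
            U12 = A[s:e, e:].astype(np.float16).astype(np.float32)
            A[e:, e:] -= L21 @ U12
    return A
```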
Leveraging Half Precision in HPC on V100
Use mixed precision algorithms to:
• Achieve higher performance → faster time to solution
• Reduce power consumption by decreasing the execution time → energy savings!
References:
A. Haidar, P. Wu, S. Tomov, and J. Dongarra, "Investigating Half Precision Arithmetic to Accelerate Dense Linear System Solvers," ScalA17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (SC17), ACM, Denver, Colorado, November 12–17, 2017.
A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, "Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers," SC18, Dallas, TX, IEEE, November 2018.
Leveraging Half Precision in HPC on V100
Idea: use low precision to compute the expensive flops (the O(n^3) LU factorization) and then iteratively refine the solution in order to achieve FP64 accuracy.
Iterative refinement for dense systems, Ax = b, can work this way:
  L, U = lu(A)        lower precision   O(n^3)
  x = U\(L\b)         lower precision   O(n^2)
  r = b − Ax          FP64 precision    O(n^2)
  WHILE ||r|| not small enough
    1. Find a correction z to adjust x that satisfies Az = r. Solving Az = r can be done either by:
       • z = U\(L\r)                              (classical iterative refinement)    lower precision   O(n^2)
       • GMRES preconditioned by the LU factors   (iterative refinement using GMRES)  lower precision   O(n^2)
    2. x = x + z      FP64 precision    O(n)
    3. r = b − Ax     FP64 precision    O(n^2)
  END
• Wilkinson, Moler, Stewart, and Higham provide error bounds for single precision results refined using double precision.
• Carson and Higham showed that the inner problem can be solved with an iterative method without contaminating the solution: E. Carson and N. J. Higham, "Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions," SIAM J. Sci. Comput., 40(2), A817–A847.
• It can be shown that with this approach we can compute the solution to 64-bit floating point accuracy.
• The original matrix is needed to compute the residual r, and the matrix cannot be too badly conditioned.
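A runnable sketch of the classical variant of this loop, with FP32 standing in for the low precision (NumPy/SciPy cannot factor in FP16) and FP64 as the working precision; the test matrix, tolerance, and iteration cap are illustrative choices, not part of the original algorithm statement.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=50):
    """Classical iterative refinement: factor and correct in low precision
    (FP32 here), accumulate the solution and residual in FP64."""
    A32 = A.astype(np.float32)
    lu_piv = lu_factor(A32)                                   # O(n^3), low precision
    x = lu_solve(lu_piv, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                         # O(n^2), FP64 residual
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = lu_solve(lu_piv, r.astype(np.float32)).astype(np.float64)  # O(n^2), low precision
        x = x + z                                             # O(n), FP64 update
    return x

rng = np.random.default_rng(4)
n = 1000
A = rng.standard_normal((n, n)) + n * np.eye(n)   # not too badly conditioned
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / (np.linalg.norm(A, 1) * np.linalg.norm(x)))
```

The GMRES-based variant from the slide replaces the direct triangular solve for z with GMRES preconditioned by the low-precision LU factors, which is what makes FP16 factors usable for harder problems.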
Improving the Solution
• z is the correction, i.e., (x_{i+1} − x_i).
• It is computed in lower precision and then added to the approximate solution in higher precision: x_{i+1} = x_i + z.
• Can be used in situations like this …
Recent Results Run at Scale…
• The mixed precision iterative refinement approach solved a matrix of order 10,091,520 on ORNL's Summit system.
  – Summit nodes are composed of 2 IBM Power9 processors (22 cores each) plus 6 Nvidia V100 GPUs (84 SMs each).
  – The run used 4,500 nodes of Summit: 2,466,000 cores = 4500 × (22 × 2 + 84 × 6).
  – It used a random matrix with large diagonal elements to ensure convergence of the method.
• Mixed precision HPL achieved 445 Pflop/s, or 2.95X over the double precision HPL result on the TOP500 (148 Pflop/s).
  – 43 Gflops/Watt
• Same accuracy compared to full 64-bit precision.