  1. Tensor Core Performance and Precision Josef Schüle, University Kaiserslautern, Germany, josef.schuele@rhrk.uni-kl.de

  2. Why attend this Session?
     [Chart: assumed learning curve - deviation of weights and biases from their final values over learning iterations 1-43; blue: trend in FP32; red: range according to precision loss in FP16]

  3. Why attend this Session?
     [Chart: assumed learning curve - deviation from final values; blue: trend in FP32; red: range according to precision loss in FP16; green: possible behaviours in FP16 - stagnation or divergence]

  4. Why attend this Session?
     Mixed precision:
     - Each iteration is faster
     - The number of iterations is increased
     But: does mixed precision really speed up the learning?

  5. Outline
     - Tensor Cores - Way of Operation and Consequences
     - Improving Quality of Tensor Core Usage
     - Performance
     - Results and Outlook

  6. [Figure - Source: NVIDIA]

  7. How can we use Tensor Cores?
     Easiest and fastest way - NVIDIA's BLAS library (cuBLAS):

     #include "cublas_v2.h"
     cublasHandle_t handle = 0;
     cublasStatus_t cublasStat = cublasCreate(&handle);
     cublasStat = cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // enable Tensor Core math
     cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                  &alpha, B, CUDA_R_16F, ldb,                         // FP16 inputs
                          A, CUDA_R_16F, lda,
                  &beta,  C, CUDA_R_32F, ldc,                         // FP32 output
                  CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);         // FP32 compute type

     N = 8192 -> 91 Tflops (of 120 Tflops peak)

  8. Tensor core API
     NVIDIA provides the Warp Matrix Multiply Accumulate (WMMA) API. It contains very little functionality:
     - fill_fragment - initialize an accumulator
     - load_matrix_sync - load input data
     - mma_sync - perform the multiplication
     - store_matrix_sync - store the result
     Limitations: matrices A, B, C, D may be
     - 8x16 (A), 16x32 (B), 8x32 (C, D)
     - 16x16 (A, B, C, D)
     - 32x16 (A), 16x8 (B), 32x8 (C, D)
     and - like cuBLAS - it is FORTRAN (column-major) data layout.
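
     A minimal sketch of how these four WMMA calls fit together: a single warp multiplying one 16x16x16 tile with half inputs and a float accumulator. Kernel name and launch configuration are illustrative, not taken from the talk.

     #include <mma.h>
     #include <cuda_fp16.h>
     using namespace nvcuda;

     // One warp computes D = A*B for a single 16x16 tile (FP16 inputs, FP32 accumulator).
     __global__ void wmma_tile(const half *A, const half *B, float *D)
     {
         wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
         wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
         wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

         wmma::fill_fragment(c_frag, 0.0f);                            // initialize the accumulator
         wmma::load_matrix_sync(a_frag, A, 16);                        // load input data (leading dimension 16)
         wmma::load_matrix_sync(b_frag, B, 16);
         wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);               // perform the multiplication
         wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_col_major);  // store the result
     }
     // Launch with a single warp, e.g. wmma_tile<<<1, 32>>>(dA, dB, dD); requires sm_70 or newer.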

  9. What is a FLOAT16?
     Sign, 5 bits exponent, 10 bits significand.
     - maximum absolute value ±65,504
     - machine epsilon 2^-10 (0.0009765)
     - non-uniform precision loss: 1,024 representable values in each power-of-two interval, i.e.
       - 1,024 representables between 1.0 and 2.0
       - 1,024 representables between 1,024 and 2,048
       - 32,768; 32,800; 32,832; ... are the only representables in [2^15, 2^16)
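
     The non-uniform spacing is easy to see by rounding a few FP32 values to FP16 and back. A small host-side sketch (it assumes a CUDA toolkit in which the cuda_fp16.h conversion helpers are callable from host code and is compiled with nvcc):

     #include <cstdio>
     #include <cuda_fp16.h>

     // Round a float to FP16 and back to see which value FP16 actually stores.
     static float roundtrip(float x) { return __half2float(__float2half(x)); }

     int main()
     {
         // Spacing near 1.0: 1 + 2^-10 is representable, anything closer to 1 collapses to 1.0.
         printf("%.10f -> %.10f\n", 1.0009765625f, roundtrip(1.0009765625f)); // kept exactly
         printf("%.10f -> %.10f\n", 1.0001f,       roundtrip(1.0001f));       // -> 1.0

         // Spacing in [32768, 65536) is 32: only 32768, 32800, 32832, ... survive.
         printf("%.1f -> %.1f\n", 32769.0f, roundtrip(32769.0f));             // -> 32768
         printf("%.1f -> %.1f\n", 32790.0f, roundtrip(32790.0f));             // -> 32800

         // Beyond 65,504 FP16 overflows to infinity.
         printf("%.1f -> %f\n", 70000.0f, roundtrip(70000.0f));               // -> inf
         return 0;
     }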

  10. What is a FLOAT32?
      Sign, 8 bits exponent, 23 bits significand.
      - maximum absolute value about ±3.4 · 10^38
      - machine epsilon 2^-23 (about 10^-7)
      Conversion of a FLOAT32 x to FLOAT16 produces round(x) with
      - abserr(x) = |round(x) - x|
      - significand bits b11..b23 are lost (assuming x is in the FP16 range)
      - relerr(x) = abserr(x) / |x| ≈ 2^-10 = eps

  11. Multiply-Accumulate with Float16
      x = (2^-16, 2^-16, 2^-16, 2^-16), y = (2^-15, 2^-15, 2^-15, 2^-15)
      Float32: s = x^T · y = 4 · 2^-16 · 2^-15 = 4 · 2^-31 = 2^-29 ≈ 1.9 · 10^-9
      Float16: abserr(x) = abserr(y) = 0 (no initial rounding error), but converting the intermediate product 2^-31 to Float16 yields 0 - it lies below the smallest representable FP16 value.
      The final result in Float16 is 0, with abserr(s) = 2^-29.
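
     The same effect can be reproduced on the host by converting the individual products to FP16 before accumulating them. This is an illustrative sketch, not code from the talk; the factors 2^-16 and 2^-15 mirror the dot product above.

     #include <cstdio>
     #include <cmath>
     #include <cuda_fp16.h>

     int main()
     {
         // Both factors are exactly representable in FP16 (no initial rounding error).
         float x = ldexpf(1.0f, -16), y = ldexpf(1.0f, -15);

         float prod32 = x * y;                               // 2^-31, no problem in FP32
         float prod16 = __half2float(__float2half(prod32));  // below the FP16 range -> rounds to 0

         float acc32 = 0.0f, acc16 = 0.0f;
         for (int i = 0; i < 4; ++i) { acc32 += prod32; acc16 += prod16; }

         printf("products kept in FP32:   s = %e\n", acc32); // 2^-29, about 1.9e-9
         printf("products stored as FP16: s = %e\n", acc16); // 0
         return 0;
     }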

  12. [Figure - Source: NVIDIA]
      Additional rounding errors are prevented: if abserr(__float2half(.)) = 0, it remains 0.
      Thus it is good and important that FP16 products are accumulated in FP32 precision.

  13. absolute error and matrix values
      - abserr(x) = eps = 2^-10 for x in [1, 2)
      - abserr(x) = 2 eps = 2^-9 for x in [2, 4)
      - abserr(x) = 1 for x in [2^10, 2^11)
      - abserr(x) = 32 for x in [2^15, 2^16)
      The absolute rounding error increases with the value.
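
     A quick host-side check of how the conversion error grows with the magnitude. The value 1.333333 is an arbitrary choice; any value that is not exactly representable shows the same pattern.

     #include <cstdio>
     #include <cmath>
     #include <cuda_fp16.h>

     // Absolute rounding error of an FP32 -> FP16 conversion for the same
     // significand pattern placed in different power-of-two intervals.
     int main()
     {
         const int exps[] = {0, 1, 10, 15};             // binades [1,2), [2,4), [1024,2048), [32768,65536)
         for (int e : exps) {
             float x = ldexpf(1.0f, e) * 1.333333f;     // some value inside the binade
             float r = __half2float(__float2half(x));   // round to FP16 and back
             printf("x = %12.4f   abserr = %g\n", x, fabs(x - r));
         }
         return 0;
     }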

  14. absolute error and matrix size
      x = (1 + 2^-11)
      Float32: s = x^T · x = 1 + 2^-10 + 2^-22; for a vector x of length N, s = N + N · 2^-10 + N · 2^-22.
      Float16: x rounds to 1.0, abserr(x) = 2^-11, and s = x^T · x = 1; for a vector of length N, s = N.
      The final result in Float16 is N, with abserr(s) ≈ N · 2^-10.
      Rounding errors increase with matrix size.
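
     A sketch of the same experiment in code: entries of 1 + 2^-11 round to 1.0 in FP16, and accumulating their products in FP32 leaves an error that grows roughly like N · 2^-10. The chosen sizes and output format are illustrative.

     #include <cstdio>
     #include <cmath>
     #include <cuda_fp16.h>

     int main()
     {
         const float x  = 1.0f + ldexpf(1.0f, -11);       // 1 + 2^-11
         const float xh = __half2float(__float2half(x));  // rounds to 1.0 in FP16

         const int sizes[] = {64, 1024, 8192};
         for (int N : sizes) {
             double ref = (double)N * (double)x * (double)x;  // reference sum in double
             float acc = 0.0f;
             for (int i = 0; i < N; ++i) acc += xh * xh;      // FP16 inputs, FP32 accumulation
             printf("N=%5d   abserr=%g   N*2^-10=%g\n", N, fabs(ref - acc), N * ldexp(1.0, -10));
         }
         return 0;
     }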

  15. absolute errors for C = A·B
      [Chart: absolute error for different matrix sizes (64 to 8192) and value intervals: A, B in [-1, 1]; A in [-1, 1], B in [0, 1]; A in [-1, 1], B in [-4, 4]. Larger values and larger matrix sizes both give larger errors.]

  16. But - it is the 4th digit
      [Chart: first affected decimal digit for different matrix sizes (64 to 8192), for A, B in [-1, 1] and for A in [-1, 1], B in [-4, 4]; the error shows up around the 4th significant digit (example values 0.984..., 3.98...).]

  17. range problems
      NVIDIA and others recommend scaling to prevent over- or underflow.
      Reason: small gradient values would otherwise be ignored because they are treated as 0.
      Assume all entries of A are below a threshold T. Scale A with σ: Ã = σ·A.
      D = α·A·B + β·C then becomes D = (α/σ)·Ã·B + β·C.
      Larger values, larger error: abserr(Ã·B) ≈ σ · abserr(A·B).
      The division by σ undoes this: abserr((α/σ)·Ã·B) = (α/σ) · abserr(Ã·B) ≈ abserr(α·A·B).
      So scaling introduces no additional rounding error into the product A·B - but the scaling of A itself may (next slide).

  18. scaling factor
      Scaling A may itself introduce additional rounding, so choose the scaling factor carefully.
      Scaling with 1200: 1200 · (1 + 2^-10) = 1201.171875, but FP16 has only 1,024 representables in [1024, 2048], so the result is rounded to 1201 (precision loss).
      Scaling with powers of 2 avoids this problem.

  19. scaling factor
      Scaling with 1024: 1024 · (1 + 2^-10) = 1024 + 1 = 1025, exactly representable.
      Scaling with a power of 2 corresponds to a change in the exponent; the significand is unchanged.
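
     The difference between the two scaling factors can be checked directly. A host-side sketch; 1 + 2^-10 is the FP16 value used on the slides.

     #include <cstdio>
     #include <cuda_fp16.h>

     // Scale the same FP16-exact value by a non-power-of-two and by a power of two,
     // then round the scaled value to FP16 (spacing in [1024, 2048) is 1.0).
     int main()
     {
         float x = 1.0f + 0.0009765625f;   // 1 + 2^-10, exactly representable in FP16

         float s1 = 1200.0f * x;           // 1201.171875
         float s2 = 1024.0f * x;           // 1025.0 exactly: only the exponent changes

         printf("scale by 1200: %f -> FP16 %f\n", s1, __half2float(__float2half(s1))); // 1201.0
         printf("scale by 1024: %f -> FP16 %f\n", s2, __half2float(__float2half(s2))); // 1025.0
         return 0;
     }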

  20. Outline
      Way of Operation and Consequences:
      - limited range -> scaling, but use powers of 2
      - rounding errors increase with matrix values (scaling has no influence) and with matrix size
      - the 4th digit of the result has no significance
      Next: Improving Quality of Tensor Core Usage.

  21. increasing accuracy - binomial approach (Markidis et al., March 2018)
      X(32) ≈ Xh(16) + Xl(16) with Xh(16) = (half) X(32) and Xl(16) = X(32) - (float) Xh(16)
      (Xh + Xl) · (Yh + Yl) = Xh·Yh + Xh·Yl + Xl·Yh + Xl·Yl
      - higher accuracy compared to Xh·Yh alone in FP16
      - but 4 MMAs instead of one

  22. Example - binomial approach
      x(32) = 2^-1 + 2^-11 - 2^-13 = 0.5003662
      xh(16) = 2^-1, xl(16) = 2^-11 - 2^-13
      x^2 ≈ 2^-2 + 2^-11 - 2^-13
      fp16: xh^2 = 2^-2, so abserr(fp16) = 2^-11 - 2^-13 (0.0003662)
      binomial: 2^-2 + 2^-11 - 2^-13, so abserr(binomial) ≈ 0
      Using these numbers in an 8192 x 8192 matrix product: abserr(fp16) ≈ 8192 · 0.0003662 ≈ 3
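
     A scalar host-side sketch of the split and of the four partial products, accumulated in FP32. Note that __float2half rounds to nearest, so the high part comes out as 0.50048828125 rather than the truncated 0.5 used above; the qualitative gain is the same.

     #include <cstdio>
     #include <cmath>
     #include <cuda_fp16.h>

     // Split an FP32 number into a high FP16 part and an FP16 remainder,
     // then form x*y from the four partial products, accumulated in FP32.
     int main()
     {
         float x = 0.5003662f, y = 0.5003662f;            // the value used on the slide

         half xh = __float2half(x), yh = __float2half(y);
         half xl = __float2half(x - __half2float(xh));    // remainder after the first split
         half yl = __float2half(y - __half2float(yh));

         float fp16_only = __half2float(xh) * __half2float(yh);   // what a single FP16 MMA sees
         float binomial  = __half2float(xh) * __half2float(yh)
                         + __half2float(xh) * __half2float(yl)
                         + __half2float(xl) * __half2float(yh)
                         + __half2float(xl) * __half2float(yl);   // 4 MMAs, FP32 accumulation

         printf("reference        : %.9f\n", x * y);
         printf("xh*yh only       : %.9f  (err %.2e)\n", fp16_only, fabsf(x * y - fp16_only));
         printf("binomial, 4 MMAs : %.9f  (err %.2e)\n", binomial,  fabsf(x * y - binomial));
         return 0;
     }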

  23. normal vs. binomial
      [Chart: error for different matrix sizes (64 to 8192) and intervals: [-1,1]·[-1,1] and [-1,1]·[0,1] in plain FP16, versus the full binomial and a 3-term binomial variant on [-1,1].]

  24. difference in digits
      [Chart: first affected digit for different matrix sizes (64 to 8192) and intervals: A, B in [-1, 1]; A in [-1, 1], B in [-4, 4]; 3-term binomial.]

  25. Karatsuba Algorithm
      - Fast multiplication algorithm
      - Divide a number X into two halves, the high bits Xh and the low bits Xl, with respect to a base b: X = Xh·b + Xl
      - Form H = Xh·Yh, L = Xl·Yl, D = (Xh + Xl)·(Yh + Yl) - H - L
      - The products are formed in lower precision
      - Final product in full precision: X·Y = H·b·b + D·b + L
      - Only 3 low-precision products are needed to form H, L and D (compared to 4 with the binomial approach)

  26. Karatsuba Algorithm - example
      X = 35, Y = 34, b = 10
      Xh = 3, Xl = 5, Yh = 3, Yl = 4
      H = 3·3 = 9
      L = 5·4 = 20
      D = (3+5)·(3+4) - 9 - 20 = 56 - 29 = 27
      X·Y = 9·b·b + 27·b + 20 = 900 + 270 + 20 = 1190
      Only multiplications of the single-digit halves are required - and just 3 of them (multiplication by b is a shift operation).
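
     The worked example, transcribed directly into code (plain integer arithmetic, just to check the bookkeeping):

     #include <cstdio>

     // Karatsuba with the slide's numbers: X = 35, Y = 34, base b = 10.
     // Three small multiplications (H, L, D) replace the four of the binomial scheme.
     int main()
     {
         int b = 10;
         int X = 35, Y = 34;
         int Xh = X / b, Xl = X % b;             // 3 and 5
         int Yh = Y / b, Yl = Y % b;             // 3 and 4

         int H = Xh * Yh;                        // 9
         int L = Xl * Yl;                        // 20
         int D = (Xh + Xl) * (Yh + Yl) - H - L;  // 56 - 29 = 27

         int XY = H * b * b + D * b + L;         // 900 + 270 + 20 = 1190
         printf("%d * %d = %d (expected %d)\n", X, Y, XY, X * Y);
         return 0;
     }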
