
cuTENSOR: High-Performance CUDA Tensor Primitives (Paul Springer, Chen-Han Yu)

  1. CUTENSOR: High-Performance CUDA Tensor Primitives. Paul Springer, Chen-Han Yu, March 20th, 2019. pspringer@nvidia.com and chenhany@nvidia.com

  2. ACKNOWLEDGMENTS • Colleagues at NVIDIA: Albert Di, Alex Fit-Florea, Evghenii Gaburov, Harun Bayraktar, Sharan Chetlur, Timothy Costa, Zachary Zimmerman • Collaborators outside of NVIDIA: Dmitry Liakh (TAL-SH), Jutho Haegeman (Julia), Tim Besard (Julia) • *alphabetic order

  3. WHAT IS A TENSOR? • mode-0: scalar β • mode-1: vector B_i • mode-2: matrix B_{i,j} • mode-n: general tensor B_{i,j,k}

  4. WHAT IS A TENSOR? • mode-0: scalar β • mode-1: vector B_i • mode-2: matrix B_{i,j} • mode-n: general tensor B_{i,j,k,l}

  5. WHAT IS A TENSOR? • mode-0: scalar β • mode-1: vector B_i • mode-2: matrix B_{i,j} • mode-n: general tensor B_{i,j,k,l,m}
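To make the mode picture concrete, here is a minimal C sketch (extents and values are illustrative, not from the talk) of a mode-3 tensor as a plain multi-dimensional array, with its linear memory offset spelled out:

    #include <stdio.h>

    int main(void) {
        /* A mode-3 tensor B[i][j][k] with extents 2 x 3 x 4. A mode-0  */
        /* tensor would be a scalar, mode-1 a 1-D array (vector), and   */
        /* mode-2 a 2-D array (matrix).                                 */
        float B[2][3][4];

        /* In row-major C, element B[i][j][k] sits at linear offset     */
        /* i*(3*4) + j*4 + k: every mode contributes a fixed stride.    */
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 3; ++j)
                for (int k = 0; k < 4; ++k)
                    B[i][j][k] = (float)(i * 12 + j * 4 + k);

        printf("B[1][2][3] = %.1f\n", B[1][2][3]); /* prints 23.0 */
        return 0;
    }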

  6. BASIC LINEAR ALGEBRA SUBPROGRAMS: A Success Story • 1969 – BLAS Level 1: vector-vector (e.g., y = βx + y) • 1972 – BLAS Level 2: matrix-vector (e.g., y = A ∗ x) • 1980 – BLAS Level 3: matrix-matrix (e.g., C = A ∗ B) • Now? – BLAS Level 4: tensor-tensor (D = A ∗ B for general tensors)
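For reference, the three historical BLAS levels as they look through the standard CBLAS interface; a minimal sketch assuming a CBLAS implementation such as OpenBLAS is installed (link with -lopenblas or -lcblas):

    #include <cblas.h>

    void blas_levels_demo(void) {
        float x[4] = {1, 2, 3, 4}, y[4] = {0};
        float A[16] = {0}, B[16] = {0}, C[16] = {0};

        /* Level 1 (vector-vector): y = beta*x + y                      */
        cblas_saxpy(4, 2.0f, x, 1, y, 1);

        /* Level 2 (matrix-vector): y = alpha*A*x + beta*y              */
        cblas_sgemv(CblasRowMajor, CblasNoTrans, 4, 4,
                    1.0f, A, 4, x, 1, 0.0f, y, 1);

        /* Level 3 (matrix-matrix): C = alpha*A*B + beta*C              */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    4, 4, 4, 1.0f, A, 4, B, 4, 0.0f, C, 4);

        /* The missing "Level 4" (tensor-tensor) is what cuTENSOR adds. */
    }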

  14. TENSORS ARE UBIQUITOUS: Potential Use Cases • Deep Learning • Quantum Chemistry • Condensed Matter Physics. Libraries: PYRO, LS-DALTON, TAL-SH (multi-GPU, out-of-core), TensorLy, ITensor, Julia. TAL-SH: https://github.com/DmitryLyakh/TAL_SH • ITensor: http://itensor.org • TensorLy: http://tensorly.org • Julia: https://github.com/Jutho/TensorOperations.jl & https://github.com/JuliaGPU/CUDAnative.jl

  15. CUTENSOR: A High-Performance CUDA Library for Tensor Primitives • Tensor contractions (a generalization of matrix-matrix multiplication): D = α Σ Ψ(A) ∗ Ψ(B) + β Ψ(C), with the sum running over the contracted modes • Element-wise operations (e.g., permutations, additions): D = Ψ(A) + Ψ(B) + Ψ(C) • Mixed-precision support • Generic and flexible interface

  16. Tensor Contractions

  17. TENSOR CONTRACTIONS: Examples. D = α Σ Ψ(A) ∗ Ψ(B) + β Ψ(C) • Einstein notation (einsum) • Modes that appear in both A and B are contracted • Example: E_{m,n} = α Σ_k B_{m,k} ∗ C_{k,n} // GEMM

  18. TENSOR CONTRACTIONS: Examples. D = α Σ Ψ(A) ∗ Ψ(B) + β Ψ(C) • Einstein notation (einsum) • Modes that appear in both A and B are contracted • Examples (a loop sketch of the second one follows below): • E_{m,n} = α B_{m,k} ∗ C_{k,n} // GEMM • E_{m1,n,m2} = α B_{m1,k,m2} ∗ C_{k,n} // tensor contraction • E_{m1,n1,n2,m2} = α B_{m1,k,m2} ∗ C_{k,n2,n1} // tensor contraction • E_{m1,n1,n2,m2} = α B_{m1,k1,m2,k2} ∗ C_{k2,k1,n2,n1} // multi-mode tensor contraction
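Read as loops, the second example above, E_{m1,n,m2} = α Σ_k B_{m1,k,m2} ∗ C_{k,n}, becomes the following minimal (deliberately unoptimized) C sketch; the extents are illustrative assumptions:

    #define M1 4
    #define M2 5
    #define N  6
    #define K  3

    /* E[m1][n][m2] = alpha * sum_k B[m1][k][m2] * C[k][n]              */
    /* The contracted mode k appears in both inputs and not in E;       */
    /* the free modes m1, m2, n each appear in exactly one input.       */
    void contract(float alpha,
                  const float B[M1][K][M2],
                  const float C[K][N],
                  float E[M1][N][M2]) {
        for (int m1 = 0; m1 < M1; ++m1)
            for (int n = 0; n < N; ++n)
                for (int m2 = 0; m2 < M2; ++m2) {
                    float acc = 0.0f;
                    for (int k = 0; k < K; ++k)
                        acc += B[m1][k][m2] * C[k][n];
                    E[m1][n][m2] = alpha * acc;
                }
    }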

  19. TENSOR CONTRACTIONS: Examples (cont.). D = α Σ Ψ(A) ∗ Ψ(B) + β Ψ(C) • Examples (a batched sketch follows below): • E_{m,n} = α B_m ∗ C_n // outer product • E_{m1,n,m2} = α B_{m1,m2} ∗ C_n // outer product • E_{m1,n1,b1} = α B_{m1,k,b1} ∗ C_{k,n1,b1} // batched GEMM • E_{m1,n1,b1,n2,m2} = α B_{m1,k,b1,m2} ∗ C_{k,n2,n1,b1} // single-mode batched tensor contraction • E_{m1,n1,b1,n2,m2,b2} = α B_{m1,k,b2,b1,m2} ∗ C_{k,n2,n1,b1,b2} // multi-mode batched tensor contraction
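The batch modes (b1, b2 above) are never summed: they index independent sub-problems. A minimal C sketch of the batched-GEMM example, E_{m,n,b} = α Σ_k B_{m,k,b} ∗ C_{k,n,b}, again with illustrative extents:

    #define NB 8   /* batch extent */
    #define BM 4
    #define BN 4
    #define BK 4

    /* E[m][n][b] = alpha * sum_k B[m][k][b] * C[k][n][b]               */
    /* The batch mode b appears in every tensor and is NOT contracted:  */
    /* it is an outer loop over independent GEMMs.                      */
    void batched_gemm(float alpha,
                      const float B[BM][BK][NB],
                      const float C[BK][BN][NB],
                      float E[BM][BN][NB]) {
        for (int b = 0; b < NB; ++b)        /* one GEMM per batch entry */
            for (int m = 0; m < BM; ++m)
                for (int n = 0; n < BN; ++n) {
                    float acc = 0.0f;
                    for (int k = 0; k < BK; ++k)
                        acc += B[m][k][b] * C[k][n][b];
                    E[m][n][b] = alpha * acc;
                }
    }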

  20. TENSOR CONTRACTIONS: Key Features. D = α Σ Ψ(A) ∗ Ψ(B) + β Ψ(C) • The Ψ are unary operators, e.g., Identity, RELU, CONJ, … • Mixed precision • No additional workspace required • Auto-tuning capability (similar to cublasGemmEx) • High performance

  21. TENSOR CONTRACTIONS: Key Challenges • Keep the fast FPUs busy • Reuse data in shared memory and registers as much as possible • Coalesced accesses to/from global memory

  22. TENSOR CONTRACTIONS: Key Challenges (cont.) • Loading a scalar β: ✅ • Loading a vector B_i: ✅ • Loading a matrix B_{i,j}: (✅) • Loading a general tensor B_{i,j,k}: ((✅)) • The more modes a tensor has, the harder it is to keep its global-memory accesses coalesced (see the sketch below)
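A minimal CUDA sketch (illustrative, not from the talk) of why higher-mode tensors are harder to load: threads that walk a stride-1 mode hit consecutive addresses and coalesce into few transactions, while threads that walk a strided mode scatter a warp across memory:

    #include <cuda_runtime.h>

    /* B is an I x J matrix stored column-major (stride 1 along i).     */
    /* Warp-adjacent threads index adjacent addresses: coalesced.       */
    __global__ void read_unit_stride(const float *B, float *out, int I, int J) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y;
        if (i < I && j < J)
            out[i + j * I] = B[i + j * I];  /* consecutive threads -> consecutive addresses */
    }

    /* Same data, but each thread walks the strided mode j:             */
    /* consecutive threads are I floats apart -> poorly coalesced reads. */
    __global__ void read_strided(const float *B, float *out, int I, int J) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        int i = blockIdx.y;
        if (i < I && j < J)
            out[j + i * J] = B[i + j * I];  /* consecutive threads -> stride-I addresses */
    }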

  23. TENSOR CONTRACTIONS: Technical Insight. D = A ∗ B is computed GEMM-like [1]: sub-tensors of A and B are packed into shared memory, multiplied by a GEMM-like macro-kernel, and the resulting tile of D is written back to global memory (a simplified sketch follows below). [1] Paul Springer and Paolo Bientinesi: “Design of a High-Performance GEMM-like Tensor-Tensor Multiplication” (2016)
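The matrix-only core of that idea as a simplified CUDA sketch: a classic shared-memory tiled GEMM, not cuTENSOR's actual kernel. In the GETT approach of [1], the packing step additionally performs the tensor-to-tile index remapping, so the macro-kernel only ever sees contiguous tiles:

    #define TILE 16

    /* C = A * B for M x K and K x N row-major matrices.                */
    /* Each block stages a TILE x TILE piece of A and B in shared       */
    /* memory ("to SHMEM"), multiplies, then writes its C tile back     */
    /* ("to global") -- the GEMM-like structure that GETT generalizes.  */
    __global__ void tiled_gemm(const float *A, const float *B, float *C,
                               int M, int N, int K) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
            int ak = t * TILE + threadIdx.x;
            int bk = t * TILE + threadIdx.y;
            As[threadIdx.y][threadIdx.x] = (row < M && ak < K) ? A[row * K + ak] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (bk < K && col < N) ? B[bk * N + col] : 0.0f;
            __syncthreads();

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < M && col < N)
            C[row * N + col] = acc;
    }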

  27. PERFORMANCE: Tensor Contractions (C = A ∗ B). [Plot: performance vs. arithmetic intensity] • Random tensor contractions: 3D to 6D tensors, FP64 • ~8x over a two-socket CPU • CPU baseline: TBLIS (https://github.com/devinamatthews/tblis)

  28. PERFORMANCE: Tensor Contractions (C = A ∗ B), cont. [Plot: performance vs. arithmetic intensity] • Random tensor contractions: 3D to 6D tensors, FP64 (data) & FP32 (compute) • CPU baseline: TBLIS (https://github.com/devinamatthews/tblis)
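Arithmetic intensity, the x-axis of these performance plots, is the ratio of flops to bytes moved. A small C helper using the usual 2·m·n·k flop count for a GEMM-like contraction (an illustrative formula, not from the slides):

    /* Arithmetic intensity of C[m,n] = A[m,k] * B[k,n]:                */
    /* flops = 2*m*n*k (one multiply + one add per term),               */
    /* bytes = (m*k + k*n + m*n) * sizeof(element), each tensor touched */
    /* once. High intensity -> compute-bound; low intensity ->          */
    /* bandwidth-bound, which is why small contracted modes hurt.       */
    double arithmetic_intensity(long m, long n, long k, int elem_bytes) {
        double flops = 2.0 * m * n * k;
        double bytes = (double)(m * k + k * n + m * n) * elem_bytes;
        return flops / bytes;
    }
    /* Example: m=n=k=4096, FP64 -> 2*4096^3 / (3*4096^2*8) ~ 341 flop/byte. */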

  29. Element-wise Operations

  30. ELEMENT-WISE TENSOR OPERATIONS: Examples. E = α Ψ(B) + β Ψ(C) + δ Ψ(D) • E_{a,b,c,d} = α B_{c,a,b,d} • E_{a,b,c,d} = α B_{c,a,b,d} + β C_{c,a,b,d} • E_{a,b,c,d} = min(α B_{c,a,b,d}, β C_{c,a,b,d}) • E_{a,b,c,d} = α B_{c,a,b,d} + β C_{a,b,c,d} + δ D_{a,b,c,d} • E_{a,b,c,d} = α SFMV(B_{c,a,b,d}) + β C_{a,b,c,d} + δ D_{a,b,c,d} • E_{a,b,c,d} = GQ32(α SFMV(B_{c,a,b,d}) + β C_{a,b,c,d} + δ D_{a,b,c,d}) • Enables users to fuse multiple element-wise calls (see the sketch below).
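Read as loops, the fourth example above is a single fused pass: the permutation of B and the scaled additions of C and D happen in one sweep, with no temporary for the transposed B. A minimal C sketch with illustrative extents:

    #define DA 2
    #define DB 3
    #define DC 4
    #define DD 5

    /* E[a][b][c][d] = alpha*B[c][a][b][d] + beta*C[a][b][c][d]         */
    /*               + delta*D[a][b][c][d]                              */
    void elementwise_fused(float alpha, float beta, float delta,
                           const float B[DC][DA][DB][DD],
                           const float C[DA][DB][DC][DD],
                           const float D[DA][DB][DC][DD],
                           float E[DA][DB][DC][DD]) {
        for (int a = 0; a < DA; ++a)
            for (int b = 0; b < DB; ++b)
                for (int c = 0; c < DC; ++c)
                    for (int d = 0; d < DD; ++d)
                        E[a][b][c][d] = alpha * B[c][a][b][d]   /* permuted read */
                                      + beta  * C[a][b][c][d]
                                      + delta * D[a][b][c][d];
    }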

  31. ELEMENT-WISE TENSOR OPERATIONS: Key Features. E = α Ψ(B) Φ β Ψ(C) Φ δ Ψ(D) • The Ψ are unary operators, e.g., Identity, RELU, CONJ, … • The Φ are binary operators, e.g., MAX, MIN, ADD, MUL, … • Mixed precision • High performance

  32. PERFORMANCE: Element-wise Operations (C = α Ψ(A) + β Ψ(B))* • ~5x over a two-socket CPU • CPU baseline: HPTT (https://github.com/springer13/hptt) • *FP32 tensor permutation (e.g., reformatting)

  33. CUTENSOR’s API

  34. TENSOR CONTRACTIONS: API. D = α A ∗ B + β C

    cutensorStatus_t cutensorContraction(
        cuTensorHandle_t handle,
        const void *alpha, const void *A, const cutensorTensorDescriptor_t descA, const int modeA[],
                           const void *B, const cutensorTensorDescriptor_t descB, const int modeB[],
        const void *beta,  const void *C, const cutensorTensorDescriptor_t descC, const int modeC[],
                                 void *D, const cutensorTensorDescriptor_t descD, const int modeD[],
        cutensorOperator_t opOut,
        cudaDataType_t typeCompute,
        cutensorAlgo_t algo,
        void *workspace, uint64_t workspaceSize, // workspace is optional and may be NULL
        cudaStream_t stream);

    Devin Matthews et al., “Tensor interfaces”: https://github.com/MolSSI/tensor-interfaces/blob/master/interface.md
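A hypothetical call sketch for the signature above, expressing the GEMM example E_{m,n} = α B_{m,k} ∗ C_{k,n}. The slide does not show the handle/descriptor setup API, so it is elided here, and the enum values CUTENSOR_OP_IDENTITY, CUTENSOR_ALGO_DEFAULT, and CUDA_R_32F are assumptions based on cuTENSOR's public conventions:

    /* Hypothetical wrapper: the handle, tensor descriptors, and device */
    /* pointers are assumed to have been created elsewhere.             */
    /* Computes E[m,n] = alpha * B[m,k] * C[k,n] on the given stream.   */
    cutensorStatus_t contract_gemm(cuTensorHandle_t handle,
                                   const void *dB, cutensorTensorDescriptor_t descB,
                                   const void *dC, cutensorTensorDescriptor_t descC,
                                   void *dE, cutensorTensorDescriptor_t descE,
                                   cudaStream_t stream) {
        int modeB[] = {'m', 'k'};           /* modes are labeled by ints */
        int modeC[] = {'k', 'n'};
        int modeE[] = {'m', 'n'};
        float alpha = 1.0f, beta = 0.0f;

        return cutensorContraction(handle,
            &alpha, dB, descB, modeB,       /* A-operand: B[m,k]          */
                    dC, descC, modeC,       /* B-operand: C[k,n]          */
            &beta,  dE, descE, modeE,       /* C-operand (accumulator)    */
                    dE, descE, modeE,       /* D-output:  E[m,n]          */
            CUTENSOR_OP_IDENTITY,           /* opOut (assumed enum)       */
            CUDA_R_32F,                     /* typeCompute                */
            CUTENSOR_ALGO_DEFAULT,          /* algo (assumed enum)        */
            NULL, 0,                        /* workspace: optional, none  */
            stream);
    }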
