CUTENSOR: High-Performance CUDA Tensor Primitives
Paul Springer, Chen-Han Yu
March 20th, 2019
pspringer@nvidia.com, chenhany@nvidia.com
ACKNOWLEDGMENTS
• Colleagues at NVIDIA*:
  • Albert Di
  • Alex Fit-Florea
  • Evghenii Gaburov
  • Harun Bayraktar
  • Sharan Chetlur
  • Timothy Costa
  • Zachary Zimmerman
• Collaborators outside of NVIDIA*:
  • Dmitry Liakh (TAL-SH)
  • Jutho Haegeman (Julia)
  • Tim Besard (Julia)
*alphabetic order
WHAT IS A TENSOR?
• mode-0: scalar α
• mode-1: vector A_i
• mode-2: matrix A_{i,j}
• mode-n: general tensor A_{i,j,k,l,m,...}
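To make "mode" concrete, here is a minimal plain-C sketch (not cuTENSOR code; all names such as tensor_elem are illustrative) of how a mode-n tensor is typically stored: a flat buffer plus per-mode extents and strides.

#include <stdio.h>

/* A mode-3 tensor A_{i,j,k}: element (i, j, k) lives at offset
   i*stride[0] + j*stride[1] + k*stride[2] in a flat buffer. */
float A_data[2 * 3 * 4];
long  extent[3] = {2, 3, 4};      /* sizes of modes i, j, k           */
long  stride[3] = {1, 2, 2 * 3};  /* first mode fastest (column-major) */

float tensor_elem(const float *data, const long *stride,
                  long i, long j, long k)
{
    return data[i * stride[0] + j * stride[1] + k * stride[2]];
}

int main(void)
{
    long total = extent[0] * extent[1] * extent[2];
    for (long x = 0; x < total; ++x) A_data[x] = (float)x;
    printf("A_{1,2,3} = %f\n", tensor_elem(A_data, stride, 1, 2, 3));
    return 0;
}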
BASIC LINEAR ALGEBRA SUBPROGRAMS (BLAS)
A Success Story
• 1969 – BLAS Level 1: Vector-Vector (e.g., y ← α x + y)
• 1972 – BLAS Level 2: Matrix-Vector (e.g., y ← A x)
• 1980 – BLAS Level 3: Matrix-Matrix (e.g., C ← A B)
• Now? – BLAS Level 4: Tensor-Tensor (e.g., D ← A * B); see the loop-nest sketch below
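A minimal C sketch of the progression the timeline describes, one loop nest per BLAS level; the routine bodies are reference implementations, not actual BLAS entry points:

/* Level 1 (vector-vector): y <- alpha*x + y */
void axpy(int n, float alpha, const float *x, float *y) {
    for (int i = 0; i < n; ++i) y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y <- A*x, A is m-by-n, column-major */
void gemv(int m, int n, const float *A, const float *x, float *y) {
    for (int i = 0; i < m; ++i) {
        y[i] = 0.0f;
        for (int j = 0; j < n; ++j) y[i] += A[i + (size_t)j * m] * x[j];
    }
}

/* Level 3 (matrix-matrix): C <- A*B, all column-major */
void gemm(int m, int n, int k, const float *A, const float *B, float *C) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + (size_t)p * m] * B[p + (size_t)j * k];
            C[i + (size_t)j * m] = acc;
        }
}

/* "Level 4" (tensor-tensor) generalizes this to arbitrary mode counts,
   e.g. D_{m0,n,m1} = sum_k A_{m0,k,m1} * B_{k,n}; examples follow. */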
TENSORS ARE UBIQUITOUS
Potential Use Cases
• Deep Learning: Pyro, TensorLy
• Quantum Chemistry: LS-DALTON, TAL-SH (multi-GPU, out-of-core)
• Condensed Matter Physics: ITensor, Julia (TensorOperations.jl)

TAL-SH: https://github.com/DmitryLyakh/TAL_SH
ITensor: http://itensor.org
TensorLy: http://tensorly.org
Julia: https://github.com/Jutho/TensorOperations.jl & https://github.com/JuliaGPU/CUDAnative.jl
CUTENSOR
A High-Performance CUDA Library for Tensor Primitives
• Tensor contractions (a generalization of matrix-matrix multiplication):
  D = α ∑ A * B + β C
• Element-wise operations (e.g., permutations, additions):
  D = α A + β B + γ C
• Mixed-precision support
• Generic and flexible interface
Tensor Contractions
TENSOR CONTRACTIONS
Examples

D = Ψ(α ∑ Ψ(A) * Ψ(B) + β Ψ(C))

• Einstein notation (einsum): modes that appear in both A and B (and not in D) are contracted
• Examples (a loop-nest sketch of the second one follows this slide):
  • D_{m,n} = α ∑_k A_{m,k} * B_{k,n} // GEMM
  • D_{m0,n,m1} = α A_{m0,k,m1} * B_{k,n} // tensor contraction
  • D_{m0,n0,n1,m1} = α A_{m0,k,m1} * B_{k,n1,n0} // tensor contraction
  • D_{m0,n0,n1,m1} = α A_{m0,k0,m1,k1} * B_{k1,k0,n1,n0} // multi-mode tensor contraction
  (from the second example on, the ∑ over the contracted modes is implied)
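To pin down the einsum semantics, here is a minimal C reference implementation (not how cuTENSOR computes it) of the second example, D_{m0,n,m1} = α ∑_k A_{m0,k,m1} * B_{k,n}; the function name, extent names, and first-mode-fastest layout are illustrative assumptions:

/* D_{m0,n,m1} = alpha * sum_k A_{m0,k,m1} * B_{k,n}
   Dense tensors; the first listed mode of each tensor has stride 1. */
void contract_m0km1_kn(int M0, int M1, int N, int K, float alpha,
                       const float *A,   /* extents M0 x K x M1 */
                       const float *B,   /* extents K  x N      */
                       float *D)         /* extents M0 x N x M1 */
{
    for (int m1 = 0; m1 < M1; ++m1)
        for (int n = 0; n < N; ++n)
            for (int m0 = 0; m0 < M0; ++m0) {
                float acc = 0.0f;
                for (int k = 0; k < K; ++k)   /* contracted mode */
                    acc += A[m0 + (size_t)k * M0 + (size_t)m1 * M0 * K]
                         * B[k + (size_t)n * K];
                D[m0 + (size_t)n * M0 + (size_t)m1 * M0 * N] = alpha * acc;
            }
}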
TENSOR CONTRACTIONS
Examples (cont.)

D = Ψ(α ∑ Ψ(A) * Ψ(B) + β Ψ(C))

• Examples:
  • D_{m,n} = α A_m * B_n // outer product
  • D_{m0,n,m1} = α A_{m0,m1} * B_n // outer product
  • D_{m0,n0,l0} = α A_{m0,k,l0} * B_{k,n0,l0} // batched GEMM
  • D_{m0,n0,l0,n1,m1} = α A_{m0,k,l0,m1} * B_{k,n1,n0,l0} // single-mode batched tensor contraction
  • D_{m0,n0,l0,n1,m1,l1} = α A_{m0,k,l1,l0,m1} * B_{k,n1,n0,l0,l1} // multi-mode batched tensor contraction
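Batch modes (l0, l1 above) appear in both inputs and in the output and are never summed; they simply add an independent outer loop around a smaller contraction. A minimal C sketch of the batched-GEMM example, under the same illustrative first-mode-fastest layout as before:

/* D_{m0,n0,l0} = alpha * sum_k A_{m0,k,l0} * B_{k,n0,l0}
   The batch mode l0 indexes all three tensors and is not contracted. */
void batched_gemm(int M, int N, int K, int L, float alpha,
                  const float *A,  /* extents M x K x L */
                  const float *B,  /* extents K x N x L */
                  float *D)        /* extents M x N x L */
{
    for (int l = 0; l < L; ++l)    /* independent batch loop */
        for (int n = 0; n < N; ++n)
            for (int m = 0; m < M; ++m) {
                float acc = 0.0f;
                for (int k = 0; k < K; ++k)
                    acc += A[m + (size_t)k * M + (size_t)l * M * K]
                         * B[k + (size_t)n * K + (size_t)l * K * N];
                D[m + (size_t)n * M + (size_t)l * M * N] = alpha * acc;
            }
}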
TENSOR CONTRACTIONS
Key Features

D = Ψ(α ∑ Ψ(A) * Ψ(B) + β Ψ(C))

• Ψ are unary operators (e.g., identity, RELU, CONJ, ...)
• Mixed precision
• No additional workspace required
• Auto-tuning capability (similar to cublasGemmEx)
• High performance
TENSOR CONTRACTIONS
Key Challenges
• Keep the fast FPUs busy:
  • Reuse data in shared memory & registers as much as possible
  • Coalesce accesses to/from global memory
TENSOR CONTRACTIONS
Key Challenges (cont.)
• Loading a scalar α ✅
• Loading a vector A_i ✅
• Loading a matrix A_{i,j} (✅)
• Loading a general tensor A_{i,j,k,...} ((✅))
The parentheses indicate that coalesced global-memory accesses become progressively harder to guarantee as the number of modes grows; the sketch below illustrates why.
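A minimal CUDA sketch (an illustrative kernel, not cuTENSOR code) of the coalescing point: a warp reads consecutive addresses only when threadIdx.x walks the stride-1 mode of the tensor.

/* Copy a 2D tensor A_{i,j} with extents M x N, where mode i has stride 1.
   Reading A[i + j*M] with i == threadIdx.x is coalesced: a warp touches 32
   consecutive floats. Walking the stride-M mode j with threadIdx.x instead
   would split each warp read into 32 separate memory transactions. */
__global__ void copy_coalesced(const float *A, float *out, int M, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* stride-1 mode */
    int j = blockIdx.y;                             /* stride-M mode */
    if (i < M && j < N)
        out[i + (size_t)j * M] = A[i + (size_t)j * M];
}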
TENSOR CONTRACTIONS
Technical Insight

D = A * B, computed GEMM-like [1]:
• Load sub-tensors 𝒜 and ℬ of A and B from global memory into shared memory ("to SHMEM")
• Compute GEMM-like(𝒜, ℬ) on the shared-memory tiles
• Write the accumulated result tile back to global memory ("to global")
A simplified sketch of this pipeline follows.

[1] Paul Springer and Paolo Bientinesi: "Design of a High-Performance GEMM-like Tensor-Tensor Multiplication" (2016)
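A heavily simplified CUDA sketch of that pipeline, assuming a plain GEMM: stage tiles through shared memory, run the multiply on the tiles, write back. Real GETT-style kernels as in [1] additionally remap tensor modes onto the tile indices and block in registers; all names and sizes here are illustrative:

#define TILE 16

/* C[m][n] += sum_k A[m][k] * B[k][n]; column-major; assumes M, N, K are
   multiples of TILE. Launch: dim3 block(TILE, TILE), grid(M/TILE, N/TILE). */
__global__ void gemm_like_core(const float *A, const float *B, float *C,
                               int M, int N, int K)
{
    __shared__ float As[TILE][TILE];   /* tile of A staged in shared memory */
    __shared__ float Bs[TILE][TILE];   /* tile of B staged in shared memory */

    int m = blockIdx.x * TILE + threadIdx.x;
    int n = blockIdx.y * TILE + threadIdx.y;
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        /* global -> shared ("to SHMEM") */
        As[threadIdx.y][threadIdx.x] = A[m + (size_t)(k0 + threadIdx.y) * M];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.x) + (size_t)n * K];
        __syncthreads();

        /* GEMM-like core on the shared-memory tiles */
        for (int k = 0; k < TILE; ++k)
            acc += As[k][threadIdx.x] * Bs[threadIdx.y][k];
        __syncthreads();
    }
    C[m + (size_t)n * M] += acc;       /* register -> global ("to global") */
}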
PERFORMANCE
Tensor Contractions (D = A * B + C)

[Performance plots: GFLOPS vs. arithmetic intensity for random tensor contractions]
• Random tensor contractions: 3D to 6D tensors
• FP64: ~8x over a two-socket CPU
• Also shown: FP64 (data) & FP32 (compute)
• CPU baseline: TBLIS (https://github.com/devinamatthews/tblis)
Element-wise Operations
ELEMENT-WISE TENSOR OPERATIONS
Examples

D = α Ψ(A) + β Ψ(B) + γ Ψ(C)

• D_{w,h,c,n} = α A_{c,w,h,n}
• D_{w,h,c,n} = α A_{c,w,h,n} + β B_{c,w,h,n}
• D_{w,h,c,n} = min(α A_{c,w,h,n}, β B_{c,w,h,n})
• D_{w,h,c,n} = α A_{c,w,h,n} + β B_{w,h,c,n} + γ C_{w,h,c,n}
• D_{w,h,c,n} = α RELU(A_{c,w,h,n}) + β B_{w,h,c,n} + γ C_{w,h,c,n}
• D_{w,h,c,n} = FP32(α RELU(A_{c,w,h,n}) + β B_{w,h,c,n} + γ C_{w,h,c,n})
Enables users to fuse multiple element-wise calls into one (see the sketch below).
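A scalar C sketch of the last (fully fused) example, to pin down the semantics. The assumptions here: RELU is the usual max(x, 0), the trailing FP32(...) step is modeled as an fp64-to-fp32 cast, the inputs are fp64, and each tensor's first listed mode has stride 1; none of this is cuTENSOR code.

static double relu(double x) { return x > 0.0 ? x : 0.0; }

/* D_{w,h,c,n} = FP32(alpha*RELU(A_{c,w,h,n}) + beta*B_{w,h,c,n} + gamma*C_{w,h,c,n})
   A is stored with mode order (c,w,h,n); B, C, D with mode order (w,h,c,n). */
void fused_elementwise(int W, int H, int Ch, int N,
                       double alpha, const double *A,
                       double beta,  const double *B,
                       double gamma, const double *C,
                       float *D)
{
    for (int n = 0; n < N; ++n)
        for (int c = 0; c < Ch; ++c)
            for (int h = 0; h < H; ++h)
                for (int w = 0; w < W; ++w) {
                    /* A is permuted relative to D: gather with A's strides */
                    size_t iA = c + (size_t)w * Ch + (size_t)h * Ch * W
                              + (size_t)n * Ch * W * H;
                    size_t iD = w + (size_t)h * W + (size_t)c * W * H
                              + (size_t)n * W * H * Ch;
                    D[iD] = (float)(alpha * relu(A[iA])
                                  + beta * B[iD] + gamma * C[iD]);
                }
}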
ELEMENT-WISE TENSOR OPERATIONS
Key Features

D = Φ(α Ψ(A), β Ψ(B), γ Ψ(C))

• Ψ are unary operators (e.g., identity, RELU, CONJ, ...)
• Φ are binary operators (e.g., MAX, MIN, ADD, MUL, ...)
• Mixed precision
• High performance
PERFORMANCE
Element-wise Operations (C = α A + β B)

[Performance plot: FP32 tensor permutations (e.g., reformatting)]
• ~5x over a two-socket CPU
• CPU baseline: HPTT (https://github.com/springer13/hptt)
CUTENSOR’s API
TENSOR CONTRACTIONS API

D = A * B + C

cutensorStatus_t cutensorContraction(
    cutensorHandle_t handle,
    const void *alpha, const void *A, const cutensorTensorDescriptor_t descA, const int modeA[],
                       const void *B, const cutensorTensorDescriptor_t descB, const int modeB[],
    const void *beta,  const void *C, const cutensorTensorDescriptor_t descC, const int modeC[],
                             void *D, const cutensorTensorDescriptor_t descD, const int modeD[],
    cutensorOperator_t opOut,
    cudaDataType_t typeCompute,
    cutensorAlgo_t algo,
    void *workspace, uint64_t workspaceSize, /* workspace is optional and may be NULL */
    cudaStream_t stream);

Devin Matthews et al., "Tensor interfaces": https://github.com/MolSSI/tensor-interfaces/blob/master/interface.md
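A hedged usage sketch for the GEMM-like case D_{m,n} = α A_{m,k} * B_{k,n} + β C_{m,n}, calling the signature above. The mode arrays label each tensor's modes with integer identifiers; using character codes such as 'm' is a convention assumed here. The enum values CUTENSOR_OP_IDENTITY and CUTENSOR_ALGO_DEFAULT are assumed names for this preview API, and the descriptors are taken as already created, since their creation call is not shown on the slide. Only cutensorContraction itself is taken verbatim from the API above.

/* Sketch only: error handling omitted; all pointers are device memory. */
void contraction_example(cutensorHandle_t handle,
                         const float *A_d, const float *B_d,
                         const float *C_d, float *D_d,
                         cutensorTensorDescriptor_t descA,  /* describes A_{m,k} */
                         cutensorTensorDescriptor_t descB,  /* describes B_{k,n} */
                         cutensorTensorDescriptor_t descC,  /* describes C_{m,n} */
                         cutensorTensorDescriptor_t descD,  /* describes D_{m,n} */
                         cudaStream_t stream)
{
    const int modeA[] = {'m', 'k'};  /* 'k' appears in A and B only, */
    const int modeB[] = {'k', 'n'};  /* so it is contracted          */
    const int modeC[] = {'m', 'n'};
    const int modeD[] = {'m', 'n'};
    const float alpha = 1.0f, beta = 0.0f;

    cutensorContraction(handle,
                        &alpha, A_d, descA, modeA,
                                B_d, descB, modeB,
                        &beta,  C_d, descC, modeC,
                                D_d, descD, modeD,
                        CUTENSOR_OP_IDENTITY,  /* opOut: assumed enum name */
                        CUDA_R_32F,            /* typeCompute              */
                        CUTENSOR_ALGO_DEFAULT, /* algo: assumed enum name  */
                        NULL, 0,               /* workspace is optional    */
                        stream);
}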