CUTENSOR: High-Performance CUDA Tensor Primitives
Paul Springer, Chen-Han Yu
March 20th, 2019
pspringer@nvidia.com, chenhany@nvidia.com
ACKNOWLEDGMENTS
• Colleagues at NVIDIA*:
  • Albert Di
  • Alex Fit-Florea
  • Evghenii Gaburov
  • Harun Bayraktar
  • Sharan Chetlur
  • Timothy Costa
  • Zachary Zimmerman
• Collaborators outside of NVIDIA*:
  • Dmitry Liakh (TAL-SH)
  • Jutho Haegeman (Julia)
  • Tim Besard (Julia)
*alphabetic order
WHAT IS A TENSOR?
• mode-0: scalar α
• mode-1: vector A_i
• mode-2: matrix A_{i,j}
• mode-n: general tensor A_{i,j,k,l,m,...}
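To make "mode" concrete, here is a minimal plain-C sketch (not cuTENSOR code; all names such as tensor_elem are illustrative) of how a mode-n tensor is typically stored: a flat buffer plus per-mode extents and strides.

#include <stdio.h>

/* A mode-3 tensor A_{i,j,k}: element (i, j, k) lives at offset
   i*stride[0] + j*stride[1] + k*stride[2] in a flat buffer. */
float A_data[2 * 3 * 4];
long  extent[3] = {2, 3, 4};      /* sizes of modes i, j, k           */
long  stride[3] = {1, 2, 2 * 3};  /* first mode fastest (column-major) */

float tensor_elem(const float *data, const long *stride,
                  long i, long j, long k)
{
    return data[i * stride[0] + j * stride[1] + k * stride[2]];
}

int main(void)
{
    long total = extent[0] * extent[1] * extent[2];
    for (long x = 0; x < total; ++x) A_data[x] = (float)x;
    printf("A_{1,2,3} = %f\n", tensor_elem(A_data, stride, 1, 2, 3));
    return 0;
}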
BASIC LINEAR ALGEBRA SUBPROGRAMS (BLAS)
A Success Story
• 1969 – BLAS Level 1: Vector-Vector (e.g., y ← α x + y)
• 1972 – BLAS Level 2: Matrix-Vector (e.g., y ← A x)
• 1980 – BLAS Level 3: Matrix-Matrix (e.g., C ← A B)
• Now? – BLAS Level 4: Tensor-Tensor (e.g., D ← A * B); see the loop-nest sketch below
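A minimal C sketch of the progression the timeline describes, one loop nest per BLAS level; the routine bodies are reference implementations, not actual BLAS entry points:

/* Level 1 (vector-vector): y <- alpha*x + y */
void axpy(int n, float alpha, const float *x, float *y) {
    for (int i = 0; i < n; ++i) y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y <- A*x, A is m-by-n, column-major */
void gemv(int m, int n, const float *A, const float *x, float *y) {
    for (int i = 0; i < m; ++i) {
        y[i] = 0.0f;
        for (int j = 0; j < n; ++j) y[i] += A[i + (size_t)j * m] * x[j];
    }
}

/* Level 3 (matrix-matrix): C <- A*B, all column-major */
void gemm(int m, int n, int k, const float *A, const float *B, float *C) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[i + (size_t)p * m] * B[p + (size_t)j * k];
            C[i + (size_t)j * m] = acc;
        }
}

/* "Level 4" (tensor-tensor) generalizes this to arbitrary mode counts,
   e.g. D_{m0,n,m1} = sum_k A_{m0,k,m1} * B_{k,n}; examples follow. */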
TENSORS ARE UBIQUITOUS
Potential Use Cases
• Deep Learning: Pyro, TensorLy
• Quantum Chemistry: LS-DALTON, TAL-SH (multi-GPU, out-of-core)
• Condensed Matter Physics: ITensor, Julia (TensorOperations.jl)

TAL-SH: https://github.com/DmitryLyakh/TAL_SH
ITensor: http://itensor.org
TensorLy: http://tensorly.org
Julia: https://github.com/Jutho/TensorOperations.jl & https://github.com/JuliaGPU/CUDAnative.jl
CUTENSOR
A High-Performance CUDA Library for Tensor Primitives
• Tensor contractions (a generalization of matrix-matrix multiplication):
  D = α ∑ A * B + β C
• Element-wise operations (e.g., permutations, additions):
  D = α A + β B + γ C
• Mixed-precision support
• Generic and flexible interface
Tensor Contractions
TENSOR CONTRACTIONS
Examples

D = Ψ(α ∑ Ψ(A) * Ψ(B) + β Ψ(C))

• Einstein notation (einsum): modes that appear in both A and B (and not in D) are contracted
• Examples (a loop-nest sketch of the second one follows this slide):
  • D_{m,n} = α ∑_k A_{m,k} * B_{k,n} // GEMM
  • D_{m0,n,m1} = α A_{m0,k,m1} * B_{k,n} // tensor contraction
  • D_{m0,n0,n1,m1} = α A_{m0,k,m1} * B_{k,n1,n0} // tensor contraction
  • D_{m0,n0,n1,m1} = α A_{m0,k0,m1,k1} * B_{k1,k0,n1,n0} // multi-mode tensor contraction
  (from the second example on, the ∑ over the contracted modes is implied)
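To pin down the einsum semantics, here is a minimal C reference implementation (not how cuTENSOR computes it) of the second example, D_{m0,n,m1} = α ∑_k A_{m0,k,m1} * B_{k,n}; the function name, extent names, and first-mode-fastest layout are illustrative assumptions:

/* D_{m0,n,m1} = alpha * sum_k A_{m0,k,m1} * B_{k,n}
   Dense tensors; the first listed mode of each tensor has stride 1. */
void contract_m0km1_kn(int M0, int M1, int N, int K, float alpha,
                       const float *A,   /* extents M0 x K x M1 */
                       const float *B,   /* extents K  x N      */
                       float *D)         /* extents M0 x N x M1 */
{
    for (int m1 = 0; m1 < M1; ++m1)
        for (int n = 0; n < N; ++n)
            for (int m0 = 0; m0 < M0; ++m0) {
                float acc = 0.0f;
                for (int k = 0; k < K; ++k)   /* contracted mode */
                    acc += A[m0 + (size_t)k * M0 + (size_t)m1 * M0 * K]
                         * B[k + (size_t)n * K];
                D[m0 + (size_t)n * M0 + (size_t)m1 * M0 * N] = alpha * acc;
            }
}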
TENSOR CONTRACTIONS
Examples (cont.)

D = Ψ(α ∑ Ψ(A) * Ψ(B) + β Ψ(C))

• Examples:
  • D_{m,n} = α A_m * B_n // outer product
  • D_{m0,n,m1} = α A_{m0,m1} * B_n // outer product
  • D_{m0,n0,l0} = α A_{m0,k,l0} * B_{k,n0,l0} // batched GEMM
  • D_{m0,n0,l0,n1,m1} = α A_{m0,k,l0,m1} * B_{k,n1,n0,l0} // single-mode batched tensor contraction
  • D_{m0,n0,l0,n1,m1,l1} = α A_{m0,k,l1,l0,m1} * B_{k,n1,n0,l0,l1} // multi-mode batched tensor contraction
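Batch modes (l0, l1 above) appear in both inputs and in the output and are never summed; they simply add an independent outer loop around a smaller contraction. A minimal C sketch of the batched-GEMM example, under the same illustrative first-mode-fastest layout as before:

/* D_{m0,n0,l0} = alpha * sum_k A_{m0,k,l0} * B_{k,n0,l0}
   The batch mode l0 indexes all three tensors and is not contracted. */
void batched_gemm(int M, int N, int K, int L, float alpha,
                  const float *A,  /* extents M x K x L */
                  const float *B,  /* extents K x N x L */
                  float *D)        /* extents M x N x L */
{
    for (int l = 0; l < L; ++l)    /* independent batch loop */
        for (int n = 0; n < N; ++n)
            for (int m = 0; m < M; ++m) {
                float acc = 0.0f;
                for (int k = 0; k < K; ++k)
                    acc += A[m + (size_t)k * M + (size_t)l * M * K]
                         * B[k + (size_t)n * K + (size_t)l * K * N];
                D[m + (size_t)n * M + (size_t)l * M * N] = alpha * acc;
            }
}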
TENSOR CONTRACTIONS
Key Features

D = Ψ(α ∑ Ψ(A) * Ψ(B) + β Ψ(C))

• Ψ are unary operators (e.g., identity, RELU, CONJ, ...)
• Mixed precision
• No additional workspace required
• Auto-tuning capability (similar to cublasGemmEx)
• High performance
TENSOR CONTRACTIONS
Key Challenges
• Keep the fast FPUs busy:
  • Reuse data in shared memory & registers as much as possible
  • Coalesce accesses to/from global memory
TENSOR CONTRACTIONS
Key Challenges (cont.)
• Loading a scalar α ✅
• Loading a vector A_i ✅
• Loading a matrix A_{i,j} (✅)
• Loading a general tensor A_{i,j,k,...} ((✅))
The parentheses indicate that coalesced global-memory accesses become progressively harder to guarantee as the number of modes grows; the sketch below illustrates why.
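A minimal CUDA sketch (an illustrative kernel, not cuTENSOR code) of the coalescing point: a warp reads consecutive addresses only when threadIdx.x walks the stride-1 mode of the tensor.

/* Copy a 2D tensor A_{i,j} with extents M x N, where mode i has stride 1.
   Reading A[i + j*M] with i == threadIdx.x is coalesced: a warp touches 32
   consecutive floats. Walking the stride-M mode j with threadIdx.x instead
   would split each warp read into 32 separate memory transactions. */
__global__ void copy_coalesced(const float *A, float *out, int M, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* stride-1 mode */
    int j = blockIdx.y;                             /* stride-M mode */
    if (i < M && j < N)
        out[i + (size_t)j * M] = A[i + (size_t)j * M];
}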
TENSOR CONTRACTIONS
Technical Insight

D = A * B, computed GEMM-like [1]:
• Load sub-tensors 𝒜 and ℬ of A and B from global memory into shared memory ("to SHMEM")
• Compute GEMM-like(𝒜, ℬ) on the shared-memory tiles
• Write the accumulated result tile back to global memory ("to global")
A simplified sketch of this pipeline follows.

[1] Paul Springer and Paolo Bientinesi: "Design of a High-Performance GEMM-like Tensor-Tensor Multiplication" (2016)
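A heavily simplified CUDA sketch of that pipeline, assuming a plain GEMM: stage tiles through shared memory, run the multiply on the tiles, write back. Real GETT-style kernels as in [1] additionally remap tensor modes onto the tile indices and block in registers; all names and sizes here are illustrative:

#define TILE 16

/* C[m][n] += sum_k A[m][k] * B[k][n]; column-major; assumes M, N, K are
   multiples of TILE. Launch: dim3 block(TILE, TILE), grid(M/TILE, N/TILE). */
__global__ void gemm_like_core(const float *A, const float *B, float *C,
                               int M, int N, int K)
{
    __shared__ float As[TILE][TILE];   /* tile of A staged in shared memory */
    __shared__ float Bs[TILE][TILE];   /* tile of B staged in shared memory */

    int m = blockIdx.x * TILE + threadIdx.x;
    int n = blockIdx.y * TILE + threadIdx.y;
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        /* global -> shared ("to SHMEM") */
        As[threadIdx.y][threadIdx.x] = A[m + (size_t)(k0 + threadIdx.y) * M];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.x) + (size_t)n * K];
        __syncthreads();

        /* GEMM-like core on the shared-memory tiles */
        for (int k = 0; k < TILE; ++k)
            acc += As[k][threadIdx.x] * Bs[threadIdx.y][k];
        __syncthreads();
    }
    C[m + (size_t)n * M] += acc;       /* register -> global ("to global") */
}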
PERFORMANCE
Tensor Contractions (D = A * B + C)

[Performance plots: GFLOPS vs. arithmetic intensity for random tensor contractions]
• Random tensor contractions: 3D to 6D tensors
• FP64: ~8x over a two-socket CPU
• Also shown: FP64 (data) & FP32 (compute)
• CPU baseline: TBLIS (https://github.com/devinamatthews/tblis)
Element-wise Operations
ELEMENT-WISE TENSOR OPERATIONS
Examples

D = α Ψ(A) + β Ψ(B) + γ Ψ(C)

• D_{w,h,c,n} = α A_{c,w,h,n}
• D_{w,h,c,n} = α A_{c,w,h,n} + β B_{c,w,h,n}
• D_{w,h,c,n} = min(α A_{c,w,h,n}, β B_{c,w,h,n})
• D_{w,h,c,n} = α A_{c,w,h,n} + β B_{w,h,c,n} + γ C_{w,h,c,n}
• D_{w,h,c,n} = α RELU(A_{c,w,h,n}) + β B_{w,h,c,n} + γ C_{w,h,c,n}
• D_{w,h,c,n} = FP32(α RELU(A_{c,w,h,n}) + β B_{w,h,c,n} + γ C_{w,h,c,n})
Enables users to fuse multiple element-wise calls into one (see the sketch below).
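A scalar C sketch of the last (fully fused) example, to pin down the semantics. The assumptions here: RELU is the usual max(x, 0), the trailing FP32(...) step is modeled as an fp64-to-fp32 cast, the inputs are fp64, and each tensor's first listed mode has stride 1; none of this is cuTENSOR code.

static double relu(double x) { return x > 0.0 ? x : 0.0; }

/* D_{w,h,c,n} = FP32(alpha*RELU(A_{c,w,h,n}) + beta*B_{w,h,c,n} + gamma*C_{w,h,c,n})
   A is stored with mode order (c,w,h,n); B, C, D with mode order (w,h,c,n). */
void fused_elementwise(int W, int H, int Ch, int N,
                       double alpha, const double *A,
                       double beta,  const double *B,
                       double gamma, const double *C,
                       float *D)
{
    for (int n = 0; n < N; ++n)
        for (int c = 0; c < Ch; ++c)
            for (int h = 0; h < H; ++h)
                for (int w = 0; w < W; ++w) {
                    /* A is permuted relative to D: gather with A's strides */
                    size_t iA = c + (size_t)w * Ch + (size_t)h * Ch * W
                              + (size_t)n * Ch * W * H;
                    size_t iD = w + (size_t)h * W + (size_t)c * W * H
                              + (size_t)n * W * H * Ch;
                    D[iD] = (float)(alpha * relu(A[iA])
                                  + beta * B[iD] + gamma * C[iD]);
                }
}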
ELEMENT-WISE TENSOR OPERATIONS
Key Features

D = Φ(α Ψ(A), β Ψ(B), γ Ψ(C))

• Ψ are unary operators (e.g., identity, RELU, CONJ, ...)
• Φ are binary operators (e.g., MAX, MIN, ADD, MUL, ...)
• Mixed precision
• High performance
PERFORMANCE
Element-wise Operations (C = α A + β B)

[Performance plot: FP32 tensor permutations (e.g., reformatting)]
• ~5x over a two-socket CPU
• CPU baseline: HPTT (https://github.com/springer13/hptt)
CUTENSOR’s API
TENSOR CONTRACTIONS API

D = A * B + C

cutensorStatus_t cutensorContraction(
    cutensorHandle_t handle,
    const void *alpha, const void *A, const cutensorTensorDescriptor_t descA, const int modeA[],
                       const void *B, const cutensorTensorDescriptor_t descB, const int modeB[],
    const void *beta,  const void *C, const cutensorTensorDescriptor_t descC, const int modeC[],
                             void *D, const cutensorTensorDescriptor_t descD, const int modeD[],
    cutensorOperator_t opOut,
    cudaDataType_t typeCompute,
    cutensorAlgo_t algo,
    void *workspace, uint64_t workspaceSize, /* workspace is optional and may be NULL */
    cudaStream_t stream);

Devin Matthews et al., "Tensor interfaces": https://github.com/MolSSI/tensor-interfaces/blob/master/interface.md
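A hedged usage sketch for the GEMM-like case D_{m,n} = α A_{m,k} * B_{k,n} + β C_{m,n}, calling the signature above. The mode arrays label each tensor's modes with integer identifiers; using character codes such as 'm' is a convention assumed here. The enum values CUTENSOR_OP_IDENTITY and CUTENSOR_ALGO_DEFAULT are assumed names for this preview API, and the descriptors are taken as already created, since their creation call is not shown on the slide. Only cutensorContraction itself is taken verbatim from the API above.

/* Sketch only: error handling omitted; all pointers are device memory. */
void contraction_example(cutensorHandle_t handle,
                         const float *A_d, const float *B_d,
                         const float *C_d, float *D_d,
                         cutensorTensorDescriptor_t descA,  /* describes A_{m,k} */
                         cutensorTensorDescriptor_t descB,  /* describes B_{k,n} */
                         cutensorTensorDescriptor_t descC,  /* describes C_{m,n} */
                         cutensorTensorDescriptor_t descD,  /* describes D_{m,n} */
                         cudaStream_t stream)
{
    const int modeA[] = {'m', 'k'};  /* 'k' appears in A and B only, */
    const int modeB[] = {'k', 'n'};  /* so it is contracted          */
    const int modeC[] = {'m', 'n'};
    const int modeD[] = {'m', 'n'};
    const float alpha = 1.0f, beta = 0.0f;

    cutensorContraction(handle,
                        &alpha, A_d, descA, modeA,
                                B_d, descB, modeB,
                        &beta,  C_d, descC, modeC,
                                D_d, descD, modeD,
                        CUTENSOR_OP_IDENTITY,  /* opOut: assumed enum name */
                        CUDA_R_32F,            /* typeCompute              */
                        CUTENSOR_ALGO_DEFAULT, /* algo: assumed enum name  */
                        NULL, 0,               /* workspace is optional    */
                        stream);
}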