S7255: CUTT: A HIGH-PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS
Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA
MOTIVATION
Tensor contractions
Tensor contractions are the most computationally intensive part of quantum many-body methods used in NWCHEM, DIRAC, LS-DALTON, and ACES-IV.
E(a,b,c) += A(d,a,e,c) B(b,e,d)
Sum over repeated indices d and e.
Evaluating tensor contractions directly requires implementing a lot of hard-to-write custom code.
The indirect approach transposes the tensors and uses efficient linear algebra libraries (such as cuBLAS) to perform the matrix multiply.
TENSOR CONTRACTIONS
Indirect approach
Reduction over a pair of indices shared by two tensors, e.g.
E(a,b,c) += A(d,a,e,c) B(b,e,d)
This can be evaluated as
A(a,c,d,e) ← A(d,a,e,c)   # tensor transpose
B(d,e,b)   ← B(b,e,d)     # tensor transpose
E(a,c,b)  += A(a,c,d,e) B(d,e,b)   # matrix multiply
E(a,b,c)   ← E(a,c,b)     # tensor transpose
Able to take advantage of the high-performance matrix multiply routines provided by cuBLAS.
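The matrix-multiply step can be sketched with cuBLAS as follows. This is an illustrative sketch, not code from the talk: the function name, the index sizes na..ne, and the device pointers are placeholders, and the tensors are assumed already transposed to the layouts A(a,c,d,e), B(d,e,b), E(a,c,b) shown above (leftmost index fastest, i.e. column-major once the indices are grouped).

#include <cublas_v2.h>

void contract_gemm(cublasHandle_t handle,
                   const float* dA, const float* dB, float* dE,
                   int na, int nb, int nc, int nd, int ne)
{
  // Grouped-index view of the transposed tensors:
  //   E(a*c, b) += A(a*c, d*e) * B(d*e, b)
  int m = na * nc;   // rows of E and A
  int n = nb;        // columns of E and B
  int k = nd * ne;   // contracted dimension
  const float alpha = 1.0f;
  const float beta  = 1.0f;  // "+=" accumulation
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              m, n, k, &alpha, dA, m, dB, k, &beta, dE, m);
}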
PREVIOUS WORK
No runtime high-performance tensor transpose library exists for GPUs.
The previous implementation by my co-author [1] was sub-optimal on GPU platforms.
The work in [2] relies on a compiler to build custom kernels, i.e. it is not a runtime approach.
[1] Dmitry I. Lyakh. 2015. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. Computer Physics Communications 189 (2015), 84-91. DOI: http://dx.doi.org/10.1016/j.cpc.2014.12.013
[2] Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A Tensor Transposition Compiler for Multiple Architectures. https://arxiv.org/abs/1607.01249
TENSOR TRANSPOSE ALGORITHMS
MATRIX TRANSPOSE: TILED ALGORITHM
Step 1: Read a 32x32 tile from global memory into shared memory.
__syncthreads()
Step 2: Read shared memory in transposed order and write to global memory.
Mark Harris, "An Efficient Matrix Transpose in CUDA C/C++", Parallel Forall blog: https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc
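For reference, a minimal version of this tiled transpose kernel, following Mark Harris's blog post (illustrative, not cuTT's internal kernel; names and launch configuration are assumptions):

#define TILE_DIM 32
#define BLOCK_ROWS 8

// Launch with dim3 block(TILE_DIM, BLOCK_ROWS) and
// dim3 grid((width + TILE_DIM - 1) / TILE_DIM, (height + TILE_DIM - 1) / TILE_DIM).
__global__ void transposeTiled(float* out, const float* in, int width, int height)
{
  // Padding (+1) avoids shared memory bank conflicts when reading columns.
  __shared__ float tile[TILE_DIM][TILE_DIM + 1];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;

  // Step 1: coalesced read of a 32x32 tile from global to shared memory
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    if (x < width && (y + j) < height)
      tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

  __syncthreads();

  // Step 2: read shared memory transposed, coalesced write to global memory
  x = blockIdx.y * TILE_DIM + threadIdx.x;
  y = blockIdx.x * TILE_DIM + threadIdx.y;
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    if (x < height && (y + j) < width)
      out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}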
TILED ALGORITHM
[Figure: example rank-6 input (1 2 3 4 5 6) permuted to output (5 4 1 6 3 2); dimensions d1 and d5 form the 32x32 shared memory tile, the remaining dimensions form the volume looped over by the thread block (TB).]
Constant shared memory usage (~32x32).
Performs well when d1 and d5 are fairly large (~32).
Poor performance for small (2-8) dimensions.
Would it be possible to pack multiple small dimensions into shared memory?
PACKED ALGORITHM
[Figure: same rank-6 permutation (1 2 3 4 5 6) → (5 4 1 6 3 2); dimensions 1, 2 and 5 are packed into shared memory, dimensions 3, 4 and 6 form the TB loop volume.]
No longer uses a 32x32 shared memory tile.
Loads entire dimensions into shared memory (not tiled).
As much shared memory is allocated as it takes to store the elements.
Must choose which dimensions to pack.
New problem: what if e.g. d5 is very large?
PACKED-SPLIT ALGORITHM
[Figure: same rank-6 permutation; the largest packed dimension is split, so dimensions 1, 2 and part of 5 go to shared memory, while dimensions 3, 4 and 6 form the TB loop volume.]
Split the largest dimension.
The number of splits is determined by the shared memory size (see the sketch below).
Must choose which dimensions to pack, and the number of splits.
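As an illustration of how the split count could follow from the shared memory capacity, here is a sketch under assumptions; the rounding rule and names are illustrative, not necessarily cuTT's exact heuristic.

// Illustrative only: choose enough splits of the largest packed dimension
// so that the packed shared-memory volume fits in one thread block's share.
int numSplit(size_t packedVolume, size_t sizeofType, size_t sharedMemPerBlock)
{
  size_t bytes = packedVolume * sizeofType;
  return (int)((bytes + sharedMemPerBlock - 1) / sharedMemPerBlock);  // ceil
}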
MEMORY POSITION CALCULATION
GLOBAL MEMORY POSITION CALCULATION
[Figure: mapping of positions s and p between the input (1 2 3 4 5 6) and output (5 4 1 6 3 2) tensors via glRead, shRead and glWrite.]
H = number of elements in shared memory
M = number of elements in the loop volume
Need to convert the scalar positions s = 0, ..., H-1 and p = 0, ..., M-1 to global memory positions:
glRead = global memory read position
glWrite = global memory write position
The global memory position is split into a minor part (within the shared memory volume) and a major part (within the loop volume):
glRead = glMinorRead(s) + glMajorRead(p)
glWrite = glMinorWrite(s) + glMajorWrite(p)
MAJOR POSITION CALCULATION
glMajorRead(p) maps the loop-volume position p = 0, ..., M-1 (over dimensions 3, 4, 6) to a global memory read position in the input tensor (1 2 3 4 5 6):

glMajorRead(p) = Σ_{i=1..n} ((p / c_i) mod d_i) · t_i

// int p = 0,...,M-1
// int c[n] = {1, d3, d3*d4}
// int d[n] = {d3, d4, d6}
// int t[n] = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5}
int glMajorRead = 0;
for (int i=0; i < n; i++) {
  glMajorRead += ((p / c[i]) % d[i]) * t[i];
}

Cost is O(n) in the tensor rank.
Observation: p is constant within a thread block (and therefore within a warp).
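The c, d and t arrays depend only on the tensor dimensions and the permutation, so they can be computed once on the host when the plan is built. A minimal sketch of that setup, with illustrative names (not cuTT's actual planning code):

// dim[]     : sizes of the input tensor dimensions, leftmost fastest
// loopDim[] : zero-based indices of the n dimensions forming the loop volume
//             (for the slide's example, {2, 3, 5}, i.e. dims 3, 4 and 6)
void setupMajor(const int* dim, const int* loopDim, int n,
                int* c, int* d, int* t)
{
  int cumulative = 1;
  for (int i = 0; i < n; i++) {
    int ld = loopDim[i];
    c[i] = cumulative;        // stride of this dimension inside the loop volume
    d[i] = dim[ld];           // size of this dimension
    cumulative *= dim[ld];
    t[i] = 1;                 // stride of this dimension in the input tensor
    for (int j = 0; j < ld; j++) t[i] *= dim[j];
  }
}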
WARP-PARALLEL POSITION CALCULATION
Each warp lane handles one term of the sum Σ_{i=1..n} ((p / c_i) mod d_i) · t_i; c and d are padded with ones on the unused lanes so those terms contribute zero.

// int p = 0,...,M-1
// int c = {1, d3, d3*d4, 1, ..., 1}
// int d = {d3, d4, d6, 1, ..., 1}
// int t = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5, ...}
int glMajorRead = ((p / c) % d) * t;
for (int i=16; i >= 1; i/=2) {
  glMajorRead += __shfl_xor(glMajorRead, i);
}

Single divide, modulo, and multiply per lane.
O(1), i.e. performance is independent of the tensor rank.
Works up to n = 32.
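A self-contained sketch of the same idea using the CUDA 9 __shfl_xor_sync intrinsic; the function name is illustrative, and each lane is assumed to hold its own (c, d, t) entry, padded as above.

// Each of the 32 lanes owns one dimension's (c, d, t) entry; unused lanes
// hold c = 1 and d = 1, so their term is zero.
__device__ int warpGlMajor(int p, int c, int d, int t)
{
  // One divide, modulo and multiply per lane, independent of tensor rank
  int term = ((p / c) % d) * t;
  // Butterfly reduction: after 5 steps every lane holds the full sum
  for (int i = 16; i >= 1; i /= 2)
    term += __shfl_xor_sync(0xffffffff, term, i);
  return term;
}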
MINOR POSITION CALCULATION
For the Tiled algorithm this is trivial.
For Packed and Packed-Split, pre-compute the positions glMinorRead(s), shRead(s), glMinorWrite(s) for s = 0, ..., H-1 and store them in registers.
Number of registers per thread:
numReg = (H - 1)/blockDim.x + 1
int glMinorRead[numReg]
int shRead[numReg]
int glMinorWrite[numReg]
The kernel is templated on numReg (see the sketch below).
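A sketch of the register-storage pattern; illustrative only, not cuTT's internal kernel. The lookup arrays glMinorReadTab / shReadTab / glMinorWriteTab are hypothetical stand-ins for the plan's precomputed position data, and the main transpose loop is only outlined in comments.

template <int numReg>
__global__ void packedTransposeSketch(const float* __restrict__ in,
                                      float* __restrict__ out,
                                      const int* glMinorReadTab,
                                      const int* shReadTab,
                                      const int* glMinorWriteTab,
                                      int H)
{
  extern __shared__ float shBuf[];

  // Per-thread register arrays sized by the template parameter numReg
  int glMinorRead[numReg], shRead[numReg], glMinorWrite[numReg];

#pragma unroll
  for (int r = 0; r < numReg; r++) {
    int s = threadIdx.x + r * blockDim.x;     // s = 0, ..., H-1
    if (s < H) {
      glMinorRead[r]  = glMinorReadTab[s];
      shRead[r]       = shReadTab[s];
      glMinorWrite[r] = glMinorWriteTab[s];
    }
  }

  // The main loop over the loop volume would combine these minor positions
  // with glMajorRead(p) / glMajorWrite(p) from the previous slides:
  //   shBuf[s] = in[glMinorRead[r] + glMajorRead(p)];
  //   out[glMinorWrite[r] + glMajorWrite(p)] = shBuf[shRead[r]];
}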
ALGORITHM & PARAMETER CHOICE 15
CHOOSING THE BEST ALGORITHM
Algorithm choice: Tiled, Packed, or Packed-Split.
Tiled: no free parameters.
Packed: input and output ranks (how many dimensions to pack from each side).
Packed-Split: input and output ranks, plus the number of splits.
There are large performance differences between different algorithm and parameter choices.
CUTT PLANS

cuttResult cuttPlanMeasure(cuttHandle* handle, int rank, int* dim, int* permutation,
                           size_t sizeofType, cudaStream_t stream,
                           void* idata, void* odata);

cuttResult cuttPlan(cuttHandle* handle, int rank, int* dim, int* permutation,
                    size_t sizeofType, cudaStream_t stream);

Measure plans perform all possible tensor transposes and choose the best-performing plan. LARGE overhead.
Heuristic plans choose the best plan by estimating the transpose runtime with an analytical GPU performance model. SMALL overhead.
Heuristic plans must be used in QM calculations.
Getting the heuristic planning to work accurately was a major hurdle.
A better approach is needed for choosing the heuristic plans (machine learning?).
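A minimal usage sketch: only the two planning calls appear on this slide, so cuttExecute and cuttDestroy are assumed from the cuTT distribution, and the example rank, dimensions and permutation are arbitrary.

#include <cutt.h>
#include <cuda_runtime.h>

void transposeExample(double* idata, double* odata, cudaStream_t stream)
{
  // Rank-4 tensor with dimensions (31, 549, 2, 3), permuted to (3, 0, 2, 1);
  // idata and odata are device pointers.
  int dim[4]         = {31, 549, 2, 3};
  int permutation[4] = {3, 0, 2, 1};

  cuttHandle plan;
  // Heuristic plan: small overhead, no trial transposes
  cuttPlan(&plan, 4, dim, permutation, sizeof(double), stream);
  // Alternatively, measure all candidate transposes (large one-time overhead):
  // cuttPlanMeasure(&plan, 4, dim, permutation, sizeof(double), stream,
  //                 idata, odata);

  cuttExecute(plan, idata, odata);   // odata = transpose of idata
  cuttDestroy(plan);
}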
BENCHMARKS
BENCHMARK 1
Tensor ranks 2 to 7.
Ratio between the largest and smallest tensor dimensions 1:1, 5:1, and 15:1.
Tensor volume normally distributed with an average of 200M elements and a standard deviation of 20M elements.
500 random permutations for each tensor rank and ratio.
9000 tensor transposes in total.
TESLA K20X
[Figure: Benchmark 1 results on Tesla K20X.]
* Maximum bandwidth measured using GPU-STREAM: Tom Deakin, James Price, Matt J. Martineau, and Simon N. McIntosh-Smith. 2016. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. Paper presented at the P^3MA Workshop at ISC High Performance, Frankfurt, Germany.
TESLA M40
[Figure: Benchmark 1 results on Tesla M40.]
TESLA P100
[Figure: Benchmark 1 results on Tesla P100.]
BENCHMARK 2
Tensor ranks 8 and 12.
Rank 8: dimensions (5, 3, 2, 4, 35, 33, 37, 40), 200M elements.
Rank 12: dimensions (2, 3, 4, 3, 2, 2, 3, 2, 20, 18, 22, 24), 328M elements.
500 random permutations for both tensor ranks.
Simulates a realistic workload in quantum chemistry calculations.
TESLA K20X
[Figure: Benchmark 2 results on Tesla K20X.]
TESLA M40
[Figure: Benchmark 2 results on Tesla M40.]
TESLA P100
[Figure: Benchmark 2 results on Tesla P100.]
PERFORMANCE DISTRIBUTION
BENCHMARK 3
Set of 57 tensor transposes from TTC: P. Springer, J. R. Hammond, and P. Bientinesi. TTC: A high-performance compiler for tensor transpositions. CoRR, 2016. http://arxiv.org/abs/1603.02297
A somewhat "easy" benchmark due to the small number of permutations.
TESLA K40M
TTC average: 140 GiB/s
cuTT average: 144 GiB/s
TTC data from: Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A Tensor Transposition Compiler for Multiple Architectures. https://arxiv.org/abs/1607.01249
BENCHMARK 4
Real-world tensor contractions performed with TAL-SH (Tensor Algebra Library for Shared Memory Computers) by Dmitry I. Lyakh at Oak Ridge National Laboratory.
9306 random permutations on tensors up to rank 8.
Matrix multiply performed using cuBLAS.
TESLA K20X GPU
[Figure: best/average/worst percentage of maximum performance and GFlop/s versus arithmetic intensity; panel (a) single precision, panel (b) double precision.]
For the contraction E = E + A · B, the arithmetic intensity is

Arithmetic Intensity = 2 √(vol(E) · vol(A) · vol(B)) / (vol(E) + vol(A) + vol(B))
TESLA M40
[Figure: single precision performance versus arithmetic intensity.]
TESLA P100
[Figure: best/average/worst percentage of maximum performance and GFlop/s versus arithmetic intensity; panel (a) single precision, panel (b) double precision.]
CONCLUSIONS & ACKNOWLEDGEMENTS 34