S7255: CUTT: A HIGH-PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS
Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA
MOTIVATION
Tensor contractions
Tensor contractions are the most computationally intensive part of quantum many-body methods used in NWCHEM, DIRAC, LS-DALTON, and ACES-IV.
E(a,b,c) += A(d,a,e,c) B(b,e,d)
Sum over repeated indices d and e.
Evaluating tensor contractions directly requires implementing a lot of hard-to-write custom code.
The indirect approach transposes the tensors and uses efficient linear algebra libraries (such as cuBLAS) to perform the matrix multiply.
TENSOR CONTRACTIONS
Indirect approach
Reduction over a pair of indices shared by two tensors, e.g.
E(a,b,c) += A(d,a,e,c) B(b,e,d)
This can be evaluated as
A(a,c,d,e) ← A(d,a,e,c)   # tensor transpose
B(d,e,b)   ← B(b,e,d)     # tensor transpose
E(a,c,b)  += A(a,c,d,e) B(d,e,b)   # matrix multiply
E(a,b,c)   ← E(a,c,b)     # tensor transpose
Able to take advantage of the high-performance matrix multiply routines provided by cuBLAS.
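The matrix-multiply step can be sketched with cuBLAS as follows. This is an illustrative sketch, not code from the talk: the function name, the index sizes na..ne, and the device pointers are placeholders, and the tensors are assumed already transposed to the layouts A(a,c,d,e), B(d,e,b), E(a,c,b) shown above (leftmost index fastest, i.e. column-major once the indices are grouped).

#include <cublas_v2.h>

void contract_gemm(cublasHandle_t handle,
                   const float* dA, const float* dB, float* dE,
                   int na, int nb, int nc, int nd, int ne)
{
  // Grouped-index view of the transposed tensors:
  //   E(a*c, b) += A(a*c, d*e) * B(d*e, b)
  int m = na * nc;   // rows of E and A
  int n = nb;        // columns of E and B
  int k = nd * ne;   // contracted dimension
  const float alpha = 1.0f;
  const float beta  = 1.0f;  // "+=" accumulation
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              m, n, k, &alpha, dA, m, dB, k, &beta, dE, m);
}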
PREVIOUS WORK
No runtime high-performance tensor transpose library exists for GPUs.
The previous implementation by my co-author [1] was sub-optimal on GPU platforms.
The work in [2] relies on a compiler to build custom kernels, i.e. it is not a runtime approach.
[1] Dmitry I. Lyakh. 2015. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. Computer Physics Communications 189 (2015), 84-91. DOI: http://dx.doi.org/10.1016/j.cpc.2014.12.013
[2] Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A Tensor Transposition Compiler for Multiple Architectures. https://arxiv.org/abs/1607.01249
TENSOR TRANSPOSE ALGORITHMS
MATRIX TRANSPOSE: TILED ALGORITHM
Step 1: Read a 32x32 tile from global memory into shared memory.
__syncthreads()
Step 2: Read shared memory in transposed order and write to global memory.
Mark Harris, "An Efficient Matrix Transpose in CUDA C/C++", Parallel Forall blog: https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc
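For reference, a minimal version of this tiled transpose kernel, following Mark Harris's blog post (illustrative, not cuTT's internal kernel; names and launch configuration are assumptions):

#define TILE_DIM 32
#define BLOCK_ROWS 8

// Launch with dim3 block(TILE_DIM, BLOCK_ROWS) and
// dim3 grid((width + TILE_DIM - 1) / TILE_DIM, (height + TILE_DIM - 1) / TILE_DIM).
__global__ void transposeTiled(float* out, const float* in, int width, int height)
{
  // Padding (+1) avoids shared memory bank conflicts when reading columns.
  __shared__ float tile[TILE_DIM][TILE_DIM + 1];

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;

  // Step 1: coalesced read of a 32x32 tile from global to shared memory
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    if (x < width && (y + j) < height)
      tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

  __syncthreads();

  // Step 2: read shared memory transposed, coalesced write to global memory
  x = blockIdx.y * TILE_DIM + threadIdx.x;
  y = blockIdx.x * TILE_DIM + threadIdx.y;
  for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
    if (x < height && (y + j) < width)
      out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
}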
TILED ALGORITHM
[Figure: example rank-6 input (1 2 3 4 5 6) permuted to output (5 4 1 6 3 2); dimensions d1 and d5 form the 32x32 shared memory tile, the remaining dimensions form the volume looped over by the thread block (TB).]
Constant shared memory usage (~32x32).
Performs well when d1 and d5 are fairly large (~32).
Poor performance for small (2-8) dimensions.
Would it be possible to pack multiple small dimensions into shared memory?
PACKED ALGORITHM
[Figure: same rank-6 permutation (1 2 3 4 5 6) → (5 4 1 6 3 2); dimensions 1, 2 and 5 are packed into shared memory, dimensions 3, 4 and 6 form the TB loop volume.]
No longer uses a 32x32 shared memory tile.
Loads entire dimensions into shared memory (not tiled).
As much shared memory is allocated as it takes to store the elements.
Must choose which dimensions to pack.
New problem: what if e.g. d5 is very large?
PACKED-SPLIT ALGORITHM
[Figure: same rank-6 permutation; the largest packed dimension is split, so dimensions 1, 2 and part of 5 go to shared memory, while dimensions 3, 4 and 6 form the TB loop volume.]
Split the largest dimension.
The number of splits is determined by the shared memory size (see the sketch below).
Must choose which dimensions to pack, and the number of splits.
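As an illustration of how the split count could follow from the shared memory capacity, here is a sketch under assumptions; the rounding rule and names are illustrative, not necessarily cuTT's exact heuristic.

// Illustrative only: choose enough splits of the largest packed dimension
// so that the packed shared-memory volume fits in one thread block's share.
int numSplit(size_t packedVolume, size_t sizeofType, size_t sharedMemPerBlock)
{
  size_t bytes = packedVolume * sizeofType;
  return (int)((bytes + sharedMemPerBlock - 1) / sharedMemPerBlock);  // ceil
}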
MEMORY POSITION CALCULATION
GLOBAL MEMORY POSITION CALCULATION
[Figure: mapping of positions s and p between the input (1 2 3 4 5 6) and output (5 4 1 6 3 2) tensors via glRead, shRead and glWrite.]
H = number of elements in shared memory
M = number of elements in the loop volume
Need to convert the scalar positions s = 0, ..., H-1 and p = 0, ..., M-1 to global memory positions:
glRead = global memory read position
glWrite = global memory write position
The global memory position is split into a minor part (within the shared memory volume) and a major part (within the loop volume):
glRead = glMinorRead(s) + glMajorRead(p)
glWrite = glMinorWrite(s) + glMajorWrite(p)
MAJOR POSITION CALCULATION
glMajorRead(p) maps the loop-volume position p = 0, ..., M-1 (over dimensions 3, 4, 6) to a global memory read position in the input tensor (1 2 3 4 5 6):

glMajorRead(p) = Σ_{i=1..n} ((p / c_i) mod d_i) · t_i

// int p = 0,...,M-1
// int c[n] = {1, d3, d3*d4}
// int d[n] = {d3, d4, d6}
// int t[n] = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5}
int glMajorRead = 0;
for (int i=0; i < n; i++) {
  glMajorRead += ((p / c[i]) % d[i]) * t[i];
}

Cost is O(n) in the tensor rank.
Observation: p is constant within a thread block (and therefore within a warp).
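The c, d and t arrays depend only on the tensor dimensions and the permutation, so they can be computed once on the host when the plan is built. A minimal sketch of that setup, with illustrative names (not cuTT's actual planning code):

// dim[]     : sizes of the input tensor dimensions, leftmost fastest
// loopDim[] : zero-based indices of the n dimensions forming the loop volume
//             (for the slide's example, {2, 3, 5}, i.e. dims 3, 4 and 6)
void setupMajor(const int* dim, const int* loopDim, int n,
                int* c, int* d, int* t)
{
  int cumulative = 1;
  for (int i = 0; i < n; i++) {
    int ld = loopDim[i];
    c[i] = cumulative;        // stride of this dimension inside the loop volume
    d[i] = dim[ld];           // size of this dimension
    cumulative *= dim[ld];
    t[i] = 1;                 // stride of this dimension in the input tensor
    for (int j = 0; j < ld; j++) t[i] *= dim[j];
  }
}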
WARP-PARALLEL POSITION CALCULATION
Each warp lane handles one term of the sum Σ_{i=1..n} ((p / c_i) mod d_i) · t_i; c and d are padded with ones on the unused lanes so those terms contribute zero.

// int p = 0,...,M-1
// int c = {1, d3, d3*d4, 1, ..., 1}
// int d = {d3, d4, d6, 1, ..., 1}
// int t = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5, ...}
int glMajorRead = ((p / c) % d) * t;
for (int i=16; i >= 1; i/=2) {
  glMajorRead += __shfl_xor(glMajorRead, i);
}

Single divide, modulo, and multiply per lane.
O(1), i.e. performance is independent of the tensor rank.
Works up to n = 32.
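A self-contained sketch of the same idea using the CUDA 9 __shfl_xor_sync intrinsic; the function name is illustrative, and each lane is assumed to hold its own (c, d, t) entry, padded as above.

// Each of the 32 lanes owns one dimension's (c, d, t) entry; unused lanes
// hold c = 1 and d = 1, so their term is zero.
__device__ int warpGlMajor(int p, int c, int d, int t)
{
  // One divide, modulo and multiply per lane, independent of tensor rank
  int term = ((p / c) % d) * t;
  // Butterfly reduction: after 5 steps every lane holds the full sum
  for (int i = 16; i >= 1; i /= 2)
    term += __shfl_xor_sync(0xffffffff, term, i);
  return term;
}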
MINOR POSITION CALCULATION
For the Tiled algorithm this is trivial.
For Packed and Packed-Split, pre-compute the positions glMinorRead(s), shRead(s), glMinorWrite(s) for s = 0, ..., H-1 and store them in registers.
Number of registers per thread:
numReg = (H - 1)/blockDim.x + 1
int glMinorRead[numReg]
int shRead[numReg]
int glMinorWrite[numReg]
The kernel is templated on numReg (see the sketch below).
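A sketch of the register-storage pattern; illustrative only, not cuTT's internal kernel. The lookup arrays glMinorReadTab / shReadTab / glMinorWriteTab are hypothetical stand-ins for the plan's precomputed position data, and the main transpose loop is only outlined in comments.

template <int numReg>
__global__ void packedTransposeSketch(const float* __restrict__ in,
                                      float* __restrict__ out,
                                      const int* glMinorReadTab,
                                      const int* shReadTab,
                                      const int* glMinorWriteTab,
                                      int H)
{
  extern __shared__ float shBuf[];

  // Per-thread register arrays sized by the template parameter numReg
  int glMinorRead[numReg], shRead[numReg], glMinorWrite[numReg];

#pragma unroll
  for (int r = 0; r < numReg; r++) {
    int s = threadIdx.x + r * blockDim.x;     // s = 0, ..., H-1
    if (s < H) {
      glMinorRead[r]  = glMinorReadTab[s];
      shRead[r]       = shReadTab[s];
      glMinorWrite[r] = glMinorWriteTab[s];
    }
  }

  // The main loop over the loop volume would combine these minor positions
  // with glMajorRead(p) / glMajorWrite(p) from the previous slides:
  //   shBuf[s] = in[glMinorRead[r] + glMajorRead(p)];
  //   out[glMinorWrite[r] + glMajorWrite(p)] = shBuf[shRead[r]];
}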
ALGORITHM & PARAMETER CHOICE 15
CHOOSING THE BEST ALGORITHM
Algorithm choice: Tiled, Packed, or Packed-Split.
Tiled: no free parameters.
Packed: input and output ranks (how many dimensions to pack from each side).
Packed-Split: input and output ranks, plus the number of splits.
There are large performance differences between different algorithm and parameter choices.
CUTT PLANS

cuttResult cuttPlanMeasure(cuttHandle* handle, int rank, int* dim, int* permutation,
                           size_t sizeofType, cudaStream_t stream,
                           void* idata, void* odata);

cuttResult cuttPlan(cuttHandle* handle, int rank, int* dim, int* permutation,
                    size_t sizeofType, cudaStream_t stream);

Measure plans perform all possible tensor transposes and choose the best-performing plan. LARGE overhead.
Heuristic plans choose the best plan by estimating the transpose runtime with an analytical GPU performance model. SMALL overhead.
Heuristic plans must be used in QM calculations.
Getting the heuristic planning to work accurately was a major hurdle.
A better approach is needed for choosing the heuristic plans (machine learning?).
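A minimal usage sketch: only the two planning calls appear on this slide, so cuttExecute and cuttDestroy are assumed from the cuTT distribution, and the example rank, dimensions and permutation are arbitrary.

#include <cutt.h>
#include <cuda_runtime.h>

void transposeExample(double* idata, double* odata, cudaStream_t stream)
{
  // Rank-4 tensor with dimensions (31, 549, 2, 3), permuted to (3, 0, 2, 1);
  // idata and odata are device pointers.
  int dim[4]         = {31, 549, 2, 3};
  int permutation[4] = {3, 0, 2, 1};

  cuttHandle plan;
  // Heuristic plan: small overhead, no trial transposes
  cuttPlan(&plan, 4, dim, permutation, sizeof(double), stream);
  // Alternatively, measure all candidate transposes (large one-time overhead):
  // cuttPlanMeasure(&plan, 4, dim, permutation, sizeof(double), stream,
  //                 idata, odata);

  cuttExecute(plan, idata, odata);   // odata = transpose of idata
  cuttDestroy(plan);
}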
BENCHMARKS
BENCHMARK 1
Tensor ranks 2 to 7.
Ratio between the largest and smallest tensor dimensions 1:1, 5:1, and 15:1.
Tensor volume normally distributed with an average of 200M elements and a standard deviation of 20M elements.
500 random permutations for each tensor rank and ratio.
9000 tensor transposes in total.
TESLA K20X
[Figure: Benchmark 1 results on Tesla K20X.]
* Maximum bandwidth measured using GPU-STREAM: Tom Deakin, James Price, Matt J. Martineau, and Simon N. McIntosh-Smith. 2016. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. Paper presented at the P^3MA Workshop at ISC High Performance, Frankfurt, Germany.
TESLA M40
[Figure: Benchmark 1 results on Tesla M40.]
TESLA P100
[Figure: Benchmark 1 results on Tesla P100.]
BENCHMARK 2
Tensor ranks 8 and 12.
Rank 8: dimensions (5, 3, 2, 4, 35, 33, 37, 40), 200M elements.
Rank 12: dimensions (2, 3, 4, 3, 2, 2, 3, 2, 20, 18, 22, 24), 328M elements.
500 random permutations for both tensor ranks.
Simulates a realistic workload in quantum chemistry calculations.
TESLA K20X
[Figure: Benchmark 2 results on Tesla K20X.]
TESLA M40
[Figure: Benchmark 2 results on Tesla M40.]
TESLA P100
[Figure: Benchmark 2 results on Tesla P100.]
PERFORMANCE DISTRIBUTION
BENCHMARK 3
Set of 57 tensor transposes from TTC: P. Springer, J. R. Hammond, and P. Bientinesi. TTC: A high-performance compiler for tensor transpositions. CoRR, 2016. http://arxiv.org/abs/1603.02297
A somewhat "easy" benchmark due to the small number of permutations.
TESLA K40M
TTC average: 140 GiB/s
cuTT average: 144 GiB/s
TTC data from: Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A Tensor Transposition Compiler for Multiple Architectures. https://arxiv.org/abs/1607.01249
BENCHMARK 4
Real-world tensor contractions performed with TAL-SH (Tensor Algebra Library for Shared Memory Computers) by Dmitry I. Lyakh at Oak Ridge National Laboratory.
9306 random permutations on tensors up to rank 8.
Matrix multiply performed using cuBLAS.
TESLA K20X GPU
[Figure: best/average/worst percentage of maximum performance and GFlop/s versus arithmetic intensity; panel (a) single precision, panel (b) double precision.]
For the contraction E = E + A · B, the arithmetic intensity is

Arithmetic Intensity = 2 √(vol(E) · vol(A) · vol(B)) / (vol(E) + vol(A) + vol(B))
TESLA M40
[Figure: single precision performance versus arithmetic intensity.]
TESLA P100
[Figure: best/average/worst percentage of maximum performance and GFlop/s versus arithmetic intensity; panel (a) single precision, panel (b) double precision.]
CONCLUSIONS & ACKNOWLEDGEMENTS 34