

  1. S7255: CUTT: A HIGH-PERFORMANCE TENSOR TRANSPOSE LIBRARY FOR GPUS. Antti-Pekka Hynninen, 5/10/2017, GTC2017, San Jose CA

  2. MOTIVATION: Tensor contractions. Tensor contractions are the most computationally intensive part of the quantum many-body methods used in NWCHEM, DIRAC, LS-DALTON, and ACES-IV, e.g.

       E(b,c,j) += M(b,d,j,l) S(l,c,d)

     with a sum over the repeated indices d and l. Evaluating tensor contractions directly requires implementing a lot of hard-to-write custom code. The indirect approach transposes the tensors and uses efficient linear algebra libraries (such as cuBLAS) to perform the matrix multiply.
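     As a plain illustration of the notation, a direct (unoptimized) evaluation is five nested loops; the extents db, dc, dj, dd, dl and leftmost-index-fastest storage are assumptions of this sketch, not part of the slide:

       // Direct evaluation of E(b,c,j) += M(b,d,j,l) * S(l,c,d): for every
       // output element (b,c,j), sum over the repeated indices d and l.
       for (int j = 0; j < dj; j++)
         for (int c = 0; c < dc; c++)
           for (int b = 0; b < db; b++)
             for (int l = 0; l < dl; l++)
               for (int d = 0; d < dd; d++)
                 E[b + db*(c + dc*j)] +=
                     M[b + db*(d + dd*(j + dj*l))] * S[l + dl*(c + dc*d)];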

  3. TENSOR CONTRACTIONS: Indirect approach. A contraction is a reduction over a pair of indices shared by two tensors, e.g. E(b,c,j) += M(b,d,j,l) S(l,c,d). This can be evaluated as

       M(b,d,j,l) -> M(b,j,l,d)           # tensor transpose
       S(l,c,d)   -> S(l,d,c)             # tensor transpose
       E(b,j,c)  += M(b,j,l,d) S(l,d,c)   # matrix multiply
       E(b,j,c)   -> E(b,c,j)             # tensor transpose

     which takes advantage of the high-performance matrix multiply routines provided by cuBLAS.
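     A sketch of this transpose-transpose-GEMM-transpose sequence, assuming cuTT plans planM, planS, planE already exist for the three permutations (plan creation is shown on slide 17; cuttExecute and the extent/buffer names are assumptions here, and accumulation into a pre-existing E is omitted):

       // E(b,c,j) = M(b,d,j,l) S(l,c,d) via two transposes, one GEMM, one transpose.
       cuttExecute(planM, d_M, d_Mt);     // M(b,d,j,l) -> Mt(b,j,l,d)
       cuttExecute(planS, d_S, d_St);     // S(l,c,d)   -> St(l,d,c)
       const float one = 1.0f, zero = 0.0f;
       // Et(b*j, c) = Mt(b*j, l*d) * St(l*d, c), column-major (leftmost index fastest)
       cublasSgemm(cublas, CUBLAS_OP_N, CUBLAS_OP_N,
                   db*dj, dc, dl*dd,
                   &one,  d_Mt, db*dj,
                          d_St, dl*dd,
                   &zero, d_Et, db*dj);
       cuttExecute(planE, d_Et, d_E);     // E(b,j,c) -> E(b,c,j)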

  4. PREVIOUS WORK. No runtime high-performance tensor transpose library exists for GPUs. A previous implementation by my co-author [1] was sub-optimal on GPU platforms. The work in [2] relies on a compiler to build custom kernels, i.e. it is not a runtime approach.
     [1] Dmitry I. Lyakh. 2015. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia Tesla GPU. Computer Physics Communications 189 (2015), 84-91. DOI: http://dx.doi.org/10.1016/j.cpc.2014.12.013
     [2] Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A Tensor Transposition Compiler for Multiple Architectures.

  5. TENSOR TRANSPOSE ALGORITHMS

  6. MATRIX TRANSPOSE: TILED ALGORITHM. Step 1: read a 32x32 tile from global memory into shared memory. __syncthreads(). Step 2: read the shared memory in transposed order and write it to global memory. Mark Harris, "An Efficient Matrix Transpose in CUDA C/C++", Parallel Forall blog: https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc
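     A minimal sketch of this kernel, following the Parallel Forall post (the 32x32 thread block and parameter names are illustrative; the padded tile column avoids shared memory bank conflicts):

       #define TILE_DIM 32

       __global__ void transposeTiled(float* odata, const float* idata,
                                      int width, int height) {
         __shared__ float tile[TILE_DIM][TILE_DIM + 1];
         // Step 1: read a 32x32 tile from global memory into shared memory
         int x = blockIdx.x * TILE_DIM + threadIdx.x;
         int y = blockIdx.y * TILE_DIM + threadIdx.y;
         if (x < width && y < height)
           tile[threadIdx.y][threadIdx.x] = idata[y * width + x];
         __syncthreads();
         // Step 2: read shared memory transposed, write coalesced to global memory
         x = blockIdx.y * TILE_DIM + threadIdx.x;
         y = blockIdx.x * TILE_DIM + threadIdx.y;
         if (x < height && y < width)
           odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
       }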

  7. TILED ALGORITHM. [Diagram: permutation (1 2 3 4 5 6) -> (5 4 1 6 3 2); dimensions 1 and 5 form the shared-memory tile, the remaining dimensions form the thread-block (TB) loop volume.] Constant shared memory usage (~32x32). Performs well when d1 and d5 are fairly large (~32). Poor performance for small (2-8) dimensions. Would it be possible to pack multiple small dimensions into shared memory?

  8. PACKED ALGORITHM. [Diagram: permutation (1 2 3 4 5 6) -> (5 4 1 6 3 2); dimensions 1, 2, 5 are packed into shared memory, dimensions 3, 4, 6 form the TB loop volume.] No longer uses a 32x32 shared memory tile: entire dimensions are loaded into shared memory (not tiled), and as much shared memory is allocated as it takes to store the elements. Must choose which dimensions to pack. New problem: what if e.g. d5 is very large?
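     A small host-side sketch of the packing constraint (packedDim, numPacked, and the 48 KB capacity are assumptions for illustration):

       // The product of the packed extents, times the element size, must fit
       // in the shared memory available to a thread block.
       size_t shmemBytes = sizeof(double);
       for (int i = 0; i < numPacked; i++) shmemBytes *= packedDim[i];
       bool fits = (shmemBytes <= 48 * 1024);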

  9. PACKED-SPLIT ALGORITHM. [Diagram: permutation (1 2 3 4 5 6) -> (5 4 1 6 3 2); shared memory and TB loop volumes as in the Packed case, with the largest packed dimension split.] Split the largest dimension; the number of splits is determined by the shared memory size. Must choose which dimensions to pack, and the number of splits.
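     One plausible way to pick the split count (a sketch; packedVolume and shmemCapacity are assumed names, and the library may choose differently):

       // Split the largest packed dimension until the per-block tile fits.
       size_t tileBytes = packedVolume * sizeof(double);
       int numSplit = 1;
       while (tileBytes / numSplit > shmemCapacity) numSplit++;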

  10. MEMORY POSITION CALCULATION

  11. GLOBAL MEMORY POSITION CALCULATION. [Diagram: input order (1 2 3 4 5 6); packed order (1 2 5 | 3 4 6) with dimensions 1, 2, 5 in shared memory and 3, 4, 6 in the TB loop volume; output order (5 4 1 6 3 2); the arrows glRead, shRead, and glWrite mark the reads and writes.] Let H = number of elements in shared memory and M = number of elements in the loop volume. We need to convert the scalar positions s = 0, ..., H-1 and p = 0, ..., M-1 into global memory positions (glRead = global memory read, glWrite = global memory write). The global memory position is split into

       glRead  = glMinorRead(s)  + glMajorRead(p)
       glWrite = glMinorWrite(s) + glMajorWrite(p)

  12. MAJOR POSITION CALCULATION. For the loop-volume position p = 0, ..., M-1 (loop dimensions 3, 4, 6 in the example):

       glMajorRead(p) = sum over i = 1..n of ((p / c_i) mod d_i) * t_i

       // int p = 0,...,M-1
       // int c[n] = {1, d3, d3*d4}
       // int d[n] = {d3, d4, d6}
       // int t[n] = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5}
       int glMajorRead = 0;
       for (int i = 0; i < n; i++) {
         glMajorRead += ((p / c[i]) % d[i]) * t[i];
       }

     This is O(n) in the tensor rank n. Observation: p is constant within a thread block (and therefore within a warp).
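     The c, d, and t arrays can be precomputed on the host; a sketch, where loopDim[i] is the (0-based) tensor dimension index of the i-th loop dimension (e.g. {2, 3, 5} for dimensions 3, 4, 6) and the names are assumptions:

       int cumulative = 1;
       for (int i = 0; i < n; i++) {
         d[i] = dim[loopDim[i]];   // extent of the i-th loop dimension
         c[i] = cumulative;        // product of the preceding loop extents
         cumulative *= d[i];
         t[i] = 1;                 // stride of this dimension in the input tensor
         for (int k = 0; k < loopDim[i]; k++) t[i] *= dim[k];
       }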

  13. WARP-PARALLEL POSITION CALCULATION. Distribute the terms of the sum over the lanes of a warp, one term per lane:

       // int p = 0,...,M-1
       // int c = {1, d3, d3*d4, 1, ..., 1}   (one entry per lane)
       // int d = {d3, d4, d6,   1, ..., 1}
       // int t = {d1*d2, d1*d2*d3, d1*d2*d3*d4*d5, ...}
       int glMajorRead = ((p / c) % d) * t;
       for (int i = 16; i >= 1; i /= 2) {
         glMajorRead += __shfl_xor(glMajorRead, i);
       }

     A single divide, modulo, and multiply per lane: O(1), i.e. performance independent of the tensor rank. Works up to n = 32.
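     The same idea as a self-contained device function (a sketch; lanes past the tensor rank are padded with d = 1 so their term is zero, and on CUDA 9+ the shuffle is spelled __shfl_xor_sync):

       __device__ int warpMajorPos(int p, int c, int d, int t) {
         int pos = ((p / c) % d) * t;        // one term of the sum per lane
         for (int i = 16; i >= 1; i /= 2)    // butterfly reduction, 5 steps
           pos += __shfl_xor_sync(0xffffffff, pos, i);
         return pos;                         // every lane holds the full sum
       }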

  14. MINOR POSITION CALCULATION. For the Tiled algorithm this is trivial. For Packed and Packed-Split, pre-compute the positions and store them in registers. Number of registers per thread:

       numReg = (H - 1)/blockDim.x + 1

       int glMinorRead[numReg]
       int shRead[numReg]
       int glMinorWrite[numReg]

     The kernel is templated on numReg.
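     A sketch of that register layout (kernel body abbreviated; making numReg a compile-time template parameter lets the arrays live in registers rather than local memory):

       template <int numReg>
       __global__ void transposePacked(const float* idata, float* odata) {
         int glMinorRead[numReg];
         int shRead[numReg];
         int glMinorWrite[numReg];
         // Thread t handles shared-memory elements s = t + r*blockDim.x for
         // r = 0, ..., numReg-1; the positions are computed once and reused
         // on every iteration over the TB loop volume.
       }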

  15. ALGORITHM & PARAMETER CHOICE

  16. CHOOSING THE BEST ALGORITHM. Algorithm choice: Tiled, Packed, or Packed-Split (each illustrated in the slide for the permutation (1 2 3 4 5 6) -> (5 4 1 6 3 2)). Tiled: no free parameters. Packed: input and output ranks. Packed-Split: input and output ranks, plus the number of splits. There are large performance differences between the different algorithm and parameter choices.

  17. CUTT PLANS

       cuttResult cuttPlanMeasure(cuttHandle* handle, int rank, int* dim, int* permutation, size_t sizeofType, cudaStream_t stream, void* idata, void* odata);
       cuttResult cuttPlan(cuttHandle* handle, int rank, int* dim, int* permutation, size_t sizeofType, cudaStream_t stream);

     Measure plans perform all possible tensor transposes and choose the best-performing one: LARGE overhead. Heuristic plans choose the best plan by estimating the transpose runtime with an analytical GPU performance model: SMALL overhead. Heuristic plans must be used in QM calculations. Getting the heuristic planning to work accurately was a major hurdle; a better approach for choosing heuristic plans is needed (machine learning?).
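     A minimal usage sketch based on the signatures above, together with cuTT's execute and cleanup calls (cuttExecute and cuttDestroy; the extents, permutation, and device buffers d_idata/d_odata are illustrative):

       #include <cutt.h>

       int dim[4]         = {48, 32, 48, 32};  // tensor extents (example values)
       int permutation[4] = {0, 2, 3, 1};      // index permutation (example values)
       cuttHandle plan;
       cuttPlan(&plan, 4, dim, permutation, sizeof(float), 0 /* default stream */);
       cuttExecute(plan, d_idata, d_odata);    // transpose d_idata into d_odata
       cuttDestroy(plan);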

  18. BENCHMARKS

  19. BENCHMARK 1. Tensor ranks 2 to 7. Ratio between the largest and smallest tensor dimensions: 1:1, 5:1, and 15:1. Tensor volume normally distributed with an average of 200M elements and a standard deviation of 20M elements. 500 random permutations for each tensor rank and ratio: 9000 tensor transposes in total.

  20. TESLA K20X (*) (*) Maximum bandwidth measured using GPU-STREAM: Tom Deakin, James Price, Matt J. Martineau, and Simon N. McIntosh-Smith. 2016. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. Paper presented at the P^3MA Workshop at ISC High Performance, Frankfurt, Germany.

  21. TESLA M40

  22. TESLA P100

  23. BENCHMARK 2. Tensor ranks 8 and 12. Rank 8: (5, 3, 2, 4, 35, 33, 37, 40), 200M elements. Rank 12: (2, 3, 4, 3, 2, 2, 3, 2, 20, 18, 22, 24), 328M elements. 500 random permutations for each tensor rank. Simulates a realistic workload in quantum chemistry calculations.

  24. TESLA K20X

  25. TESLA M40

  26. TESLA P100

  27. PERFORMANCE DISTRIBUTION

  28. BENCHMARK 3. Set of 57 tensor transposes from TTC: P. Springer, J. R. Hammond, and P. Bientinesi. TTC: A high performance compiler for tensor transpositions. CoRR, 2016. http://arxiv.org/abs/1603.02297. A somewhat "easy" benchmark due to the small number of permutations.

  29. TESLA K40M. TTC average: 140 GiB/s. cuTT average: 144 GiB/s. TTC data from: Paul Springer, Aravind Sankaran, and Paolo Bientinesi. 2016. TTC: A Tensor Transposition Compiler for Multiple Architectures. https://arxiv.org/abs/1607.01249

  30. BENCHMARK 4. Real-world tensor contractions performed with TAL-SH (Tensor Algebra Library for Shared Memory Computers, by Dmitry I. Lyakh at Oak Ridge National Laboratory). 9306 random permutations on tensors up to rank 8. Matrix multiply performed using cuBLAS.

  31. TESLA K20X GPU. [Figure: (a) single precision, (b) double precision; percentage of maximum performance (best, average, worst) and GFlop/s versus arithmetic intensity from 1 to 10000.] For the contraction E = E + M * S, the arithmetic intensity is

       Arithmetic Intensity = 2 sqrt(vol(E) * vol(M) * vol(S)) / (vol(E) + vol(M) + vol(S))

  32. TESLA M40 [Figure: single precision results.]

  33. TESLA P100. [Figure: (a) single precision, (b) double precision; percentage of maximum performance (best, average, worst) and GFlop/s versus arithmetic intensity from 1 to 10000.]

  34. CONCLUSIONS & ACKNOWLEDGEMENTS
