Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs
Jinsung Kim (1), Aravind Sukumaran Rajam (1), Changwan Hong (1), Ajay Panyala (2), Rohit Kumar Srivastava (1), Sriram Krishnamoorthy (2), P. Sadayappan (1)
(1) The Ohio State University, (2) Pacific Northwest National Laboratory
Contents
• Introduction
• Overview of Mapping Strategy
• Optimized Execution of Tensor Contractions
• Fusion for Symmetrized Tensor Contractions
• Experimental Results
• Conclusion
Introduction
Tensor Contraction
• Tensor
  • A multidimensional array
  • For example, a vector is a 1D tensor and a matrix is a 2D tensor
• Tensor Contractions
  • Higher-dimensional analogs of matrix multiplication
  • Arise in high-order models in quantum chemistry, deep learning, finite element methods, etc.
Tensor Contraction
• Example: C(a, b, c) = A(a, k, c) × B(k, b)  (reference loop nest sketched below)
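As a point of reference, the following sequential loop nest spells out what this contraction computes. This is an illustrative sketch only; the extents Na, Nb, Nc, Nk and the row-major layouts are assumptions, not taken from the slides.

```cuda
// Reference (host) loop nest for C(a,b,c) = sum_k A(a,k,c) * B(k,b).
// Extents and the row-major linearization are illustrative assumptions.
void contract_reference(const double* A, const double* B, double* C,
                        int Na, int Nb, int Nc, int Nk) {
  for (int a = 0; a < Na; ++a)
    for (int b = 0; b < Nb; ++b)
      for (int c = 0; c < Nc; ++c) {
        double acc = 0.0;
        for (int k = 0; k < Nk; ++k)
          acc += A[(a * Nk + k) * Nc + c] * B[k * Nb + b];
        C[(a * Nb + b) * Nc + c] = acc;
      }
}
```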
CCSD(T)
• Coupled Cluster Singles and Doubles with perturbative Triples correction
• One of the most accurate methods applicable to reasonably large molecules
• Widely used in computational chemistry suites
• A set of symmetrized tensor contractions
  • Bandwidth-bound tensor contractions
  • The computational bottleneck of the CCSD(T) coupled-cluster method
CCSD(T)
• Symmetrized Tensor Contractions in CCSD(T)
Background: GPU Memory Types
• Global Memory (grid level)
  • The slowest memory on the GPU (highest-latency accesses)
• Shared Memory (thread-block level)
  • Very fast memory located in each SM
• Registers (thread level)
  • The fastest memory
[Figure: CUDA memory model (source: NVIDIA documentation)]
Global Memory Coalescing
• To maximize global memory bandwidth
  • All threads in a warp (32 consecutive threads) execute the same instruction (SIMD)
  • For a load/store instruction, the number of memory transactions must be minimized (sketch below)
• For a tensor contraction
  • Loads of the input tensors
  • Stores of the output tensor
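A minimal CUDA sketch of what coalescing means in practice (illustrative kernels, not from the slides): consecutive threads of a warp should access consecutive addresses so each warp-wide load/store maps to as few memory transactions as possible.

```cuda
// Coalesced: thread t of a warp reads element base + t, so the warp's 32
// accesses fall into a minimal number of memory transactions.
__global__ void copy_coalesced(const double* __restrict__ in,
                               double* __restrict__ out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];                       // stride-1 across the warp
}

// Uncoalesced: a stride across the warp scatters the 32 accesses over many
// transactions and wastes global-memory bandwidth.
__global__ void copy_strided(const double* __restrict__ in,
                             double* __restrict__ out, int n, int stride) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i * stride < n) out[i * stride] = in[i * stride];
}
```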
Challenges
• Mapping of the high-dimensional iteration space to threads
• Choice of data buffering in shared memory and registers
• Choice of tile sizes for multi-level tiling
• Fusing the symmetrized tensor contractions in CCSD(T)
Our Implementations
• An efficient GPU implementation of the tensor contractions in CCSD(T)
  • Shared-memory buffering
  • Register tiling
  • Loop fusion
  • Register transposition
Overview of Mapping Strategy
Overview of Mapping Strategy
• The overall strategy
  • Each thread holds a set of output elements in its registers
  • Shared memory buffers slices of the input tensors
  • Each thread block computes a hyper-rectangular slice of the output
• Partitioning the total work of a tensor contraction among thread blocks
• Mapping the iteration space to threads
• Choice of tile sizes
Overview of Mapping Strategy
• Two-Level Tiling
  • 2D thread block: TB_X × TB_Y
  • 2D register tiles: REG_X × REG_Y (register tiling)
• Mappings
  • External indices → 2D thread block and 2D register tiles
  • Internal (contracted) indices → ∗ (iterated by all threads)
• Tile sizes
  • The extent of each index handled by a thread block
  • E.g., T_a, T_b, ...
Overview of Mapping Strategy
• Example: O(a, b, c, d) = A(a, b, k) × B(k, c, d)  (sketch below)
  • Mapping: b → TB_X, d → TB_Y, a → REG_X, c → REG_Y, k → ∗
  • Tile sizes: T_b = 16, T_a = 4, T_d = 16, T_c = 4
[Figure: a 16 × 16 thread block (TB_X × TB_Y) covers (b, d); each thread holds a 4 × 4 register tile (REG_X × REG_Y) covering (a, c)]
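A hypothetical CUDA illustration of this mapping (shared-memory buffering is omitted here; it is covered in the next section). The extents Na..Nk, the row-major layouts, and divisibility by the tile sizes are assumptions made only for the sketch.

```cuda
// Mapping sketch: b -> TB_X, d -> TB_Y, a -> REG_X, c -> REG_Y, k -> *.
// Launch (assumed): dim3 block(16, 16); dim3 grid(Nb/16, Nd/16, (Na/4)*(Nc/4)).
__global__ void map_example(const double* A, const double* B, double* O,
                            int Na, int Nb, int Nc, int Nd, int Nk) {
  int b  = blockIdx.x * blockDim.x + threadIdx.x;  // b -> TB_X (T_b = 16)
  int d  = blockIdx.y * blockDim.y + threadIdx.y;  // d -> TB_Y (T_d = 16)
  int a0 = blockIdx.z / (Nc / 4) * 4;              // base of this block's a-tile (T_a = 4)
  int c0 = blockIdx.z % (Nc / 4) * 4;              // base of this block's c-tile (T_c = 4)

  double acc[4][4] = {};                           // (a, c) register tile: a -> REG_X, c -> REG_Y
  for (int k = 0; k < Nk; ++k)                     // k -> * : every thread iterates over it
    for (int x = 0; x < 4; ++x)
      for (int y = 0; y < 4; ++y)
        acc[x][y] += A[((a0 + x) * Nb + b) * Nk + k] * B[(k * Nc + (c0 + y)) * Nd + d];

  for (int x = 0; x < 4; ++x)
    for (int y = 0; y < 4; ++y)
      O[(((a0 + x) * Nb + b) * Nc + (c0 + y)) * Nd + d] = acc[x][y];
}
```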
Optimized Execution of Tensor Contractions
Optimized Execution of Tensor Contractions
• Example: C[i, j] = A[i, k] × B[k, j]
• A thread block computes a slice of the output C (1)
• Two slices of the inputs A and B
• Portions of the slices of A and B are loaded into shared memory (1)
[Figure: T_i × T_k and T_k × T_j tiles of A and B are staged from global memory (GMEM) into shared memory (SMEM); the thread block sweeps ⌈N_k / T_k⌉ such tile pairs along k while producing a T_i × T_j slice of C]
Optimized Execution of Tensor Contractions
• Example: C[i, j] = A[i, k] × B[k, j]  (see the kernel sketch below)
• Each thread loads a vector of the A slice and a vector of the B slice from shared memory into registers (2)
• An outer-product contribution is accumulated into the register tile (3)
• Threads store the output tensor slice to GMEM (4)
[Figure: steps (1)-(4): tiles staged into SMEM, vectors loaded into registers, outer-product accumulation into the register tile, and the T_i × T_j output slice stored back to GMEM]
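A simplified CUDA sketch of the scheme in steps (1)-(4). This is an illustrative kernel, not the authors' code; a 16×16 thread block, 4×4 register tiles, T_k = 16, row-major layouts, and extents that are exact multiples of the tile sizes are all assumptions.

```cuda
// Two-level tiled kernel sketch for C[i][j] = sum_k A[i][k] * B[k][j].
#define TB 16   // thread-block extent in each dimension (TB x TB threads)
#define RT 4    // register-tile extent per thread in each dimension
#define TK 16   // tile extent along the contracted index k (T_k)

__global__ void contract_tiled(const double* __restrict__ A,
                               const double* __restrict__ B,
                               double* __restrict__ C,
                               int Ni, int Nj, int Nk) {
  __shared__ double sA[TB * RT][TK];   // staged slice of A: (i, k)
  __shared__ double sB[TK][TB * RT];   // staged slice of B: (k, j)

  int iBase = blockIdx.y * TB * RT;    // this block's output slice origin
  int jBase = blockIdx.x * TB * RT;
  double acc[RT][RT] = {};             // per-thread register tile of C

  for (int k0 = 0; k0 < Nk; k0 += TK) {
    // (1) Cooperatively stage the A and B portions into shared memory.
    for (int r = threadIdx.y; r < TB * RT; r += TB)
      for (int c = threadIdx.x; c < TK; c += TB)
        sA[r][c] = A[(iBase + r) * Nk + (k0 + c)];
    for (int r = threadIdx.y; r < TK; r += TB)
      for (int c = threadIdx.x; c < TB * RT; c += TB)
        sB[r][c] = B[(k0 + r) * Nj + (jBase + c)];
    __syncthreads();

    for (int k = 0; k < TK; ++k) {
      // (2) Load a vector of sA (varying i, fixed k) and of sB (fixed k, varying j).
      double aVec[RT], bVec[RT];
      for (int x = 0; x < RT; ++x) aVec[x] = sA[threadIdx.y * RT + x][k];
      for (int y = 0; y < RT; ++y) bVec[y] = sB[k][threadIdx.x * RT + y];
      // (3) Outer-product contribution into the register tile.
      for (int x = 0; x < RT; ++x)
        for (int y = 0; y < RT; ++y)
          acc[x][y] += aVec[x] * bVec[y];
    }
    __syncthreads();
  }

  // (4) Store the per-thread register tile back to the output slice in GMEM.
  for (int x = 0; x < RT; ++x)
    for (int y = 0; y < RT; ++y)
      C[(iBase + threadIdx.y * RT + x) * Nj + (jBase + threadIdx.x * RT + y)] = acc[x][y];
}
```

Under these assumptions the kernel would be launched with dim3 block(16, 16) and a grid of (Nj / 64) × (Ni / 64) thread blocks, so each block produces a 64 × 64 slice of C.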
Fusion for Symmetrized Tensor Contractions
Fusion for Symmetrized Tensor Contractions
• Symmetrized tensor contractions
  • All accumulate into the same output tensor (identical left-hand side, LHS)
  • It is therefore possible to fuse tensor contractions that read different parts of the input tensors
Fusion for Symmetrized Tensor Contractions
• For example: the 9 sd2 functions in CCSD(T)
  • Fused so that results are not stored from registers to global memory after each individual tensor contraction (see the sketch below)
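A minimal illustration of the payoff, as a hypothetical kernel with one output element per thread rather than the actual sd2 code; square N × N operands and row-major layouts are assumptions. The accumulator stays in a register across both contractions and is stored to global memory only once.

```cuda
// Fusing two contractions with the same output slice:
// C(i,j) = sum_k A(i,k)*B(k,j) + sum_l D(i,l)*E(l,j).
__global__ void fused_pair(const double* A, const double* B,
                           const double* D, const double* E,
                           double* C, int N) {
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N || j >= N) return;

  double acc = 0.0;                       // stays in a register across both contractions
  for (int k = 0; k < N; ++k) acc += A[i * N + k] * B[k * N + j];   // contraction 1
  for (int l = 0; l < N; ++l) acc += D[i * N + l] * E[l * N + j];   // contraction 2
  C[i * N + j] = acc;                     // single store, no intermediate global traffic
}
```

The real kernels apply the same idea with a full REG_X × REG_Y register tile per thread and all nine contractions fused.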
Fusion for Symmetrized Tensor Contractions
• Two Issues of Fusion (1/2)
  1. The size of shared memory
     • Depends on the chosen tile sizes
     • Problem: the fused tensor contractions require different amounts of shared memory
     • Issue: lower occupancy
     • Constraint: all fused contractions use an identical amount of shared memory
Fusion for Symmetrized Tensor Contractions
• Two Issues of Fusion (2/2)
  2. Arithmetic intensity
     • Register tiles of REG_X × REG_Y elements
     • # of loaded elements per step: REG_X + REG_Y
     • # of result elements: REG_X × REG_Y
     • Problem: both indices mapped to the register tile might come from the same input tensor
     • Issue: low arithmetic intensity
     • Constraint: the indices mapped to the register tile should come from different input tensors
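A worked instance of this constraint (the numbers follow from 4 × 4 register tiles): with REG_X = REG_Y = 4, each update step loads 4 + 4 = 8 elements and produces 4 × 4 = 16 multiply-add contributions, i.e., 2 FMAs per loaded element. If both register-tile indices came from the same input tensor, each of the 16 contributions would need its own loaded element, and the ratio would drop to roughly 1.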
Fusion for Symmetrized Tensor Contractions
• Example (1/2)
  • Tile sizes: T_k = T_j = T_i = T_c = T_b = T_a = 4 and T_d (T_l) = 16
  • Mapping: two of the external indices → TB_X, two → TB_Y, one → REG_X, one → REG_Y, and the contracted index d (l) → ∗
Fusion for Symmetrized Tensor Contractions
• Example (2/2)
  • Two different kernels can fuse all of them
  • Mapping #1 and Mapping #2 are identical except that one index assigned to TB_Y in Mapping #1 trades places with the REG_X index in Mapping #2
  • Partially-Fused Kernel Version
Fusion for Symmetrized Tensor Contractions
• Register Transposition
  • Within a thread block, a hyper-rectangular slice of the output can be transposed via shared memory
  • Example: 4D output tensor O(i, j, c, k)
    • Let the mapping be k → TB_X, i → TB_Y, j → REG_X, c → REG_Y
    • Let the tile sizes be T_k = T_i = T_j = T_c = 2
[Figure: layout of the 2 × 2 × 2 × 2 output slice across the (k, i) threads and their (j, c) register tiles under this mapping]
Fusion for Symmetrized Tensor Contractions
• Example: 4D output tensor O(i, j, c, k)  (transposition sketch below)

               TB_X   TB_Y   REG_X   REG_Y
  Mapping #1:   k      i      j       c
  Mapping #2:   k      j      i       c

[Figure: each thread's (j, c) register tile under Mapping #1 is written to shared memory and read back as the (i, c) register tile required under Mapping #2]
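A hypothetical sketch of such a transposition through shared memory for the 2 × 2 × 2 × 2 example (illustrative only; the actual kernels use larger tiles): the thread holding the (j, c) register tile for a given (k, i) under Mapping #1 spills it to a shared buffer, and after a barrier each thread gathers the (i, c) tile it needs for its (k, j) under Mapping #2.

```cuda
// Register transposition through shared memory (illustrative, not the authors' code).
// Tile sizes T_k = T_i = T_j = T_c = 2; thread block of 2x2 threads.
#define T 2
__device__ void transpose_reg_tile(double regIn[T][T],    // regIn[j][c] under Mapping #1
                                   double regOut[T][T]) { // regOut[i][c] under Mapping #2
  __shared__ double buf[T][T][T][T];                      // buf[k][i][j][c]: the block's output slice
  int k = threadIdx.x, i = threadIdx.y;                   // thread coords under Mapping #1

  for (int j = 0; j < T; ++j)
    for (int c = 0; c < T; ++c)
      buf[k][i][j][c] = regIn[j][c];                      // spill register tile to shared memory
  __syncthreads();

  int j = threadIdx.y;                                    // the same thread now stands for (k, j)
  for (int i2 = 0; i2 < T; ++i2)
    for (int c = 0; c < T; ++c)
      regOut[i2][c] = buf[k][i2][j][c];                   // gather the (i, c) tile for this (k, j)
}
```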
Fusion for Symmetrized Tensor Contractions
• Register Transposition
  • With it, a single kernel fuses all 9 sd1 functions or all 9 sd2 functions
  • Fully-Fused Kernel Version
Experimental Results
Experimental Results (1/3)
• Experimental Setup
  • Pascal P100 and Volta V100 GPUs (16 GB, PCI-Express)
  • CUDA 9.0 and GCC 6.2
• Two Fused Variants
  • Fully-Fused and Partially-Fused kernel versions
  • Compared against the NWChem kernels and the TAL-SH and OpenACC implementations
[Tables: problem sizes; parameters used in the Fully-Fused and Partially-Fused kernels]
Experimental Results (2/3)
• On P100 (Pascal)
  • NWChem kernels: max. 500 GFLOPS
  • TAL-SH and OpenACC: less than 300 GFLOPS
  • Fully-Fused and Partially-Fused kernel versions: max. 2.8 and 2.1 TFLOPS, respectively
[Charts: sd1 and sd2 performance (GFLOPS) on P100 (Pascal) for Size-A through Size-E, comparing Fully-Fused, Partially-Fused, NWChem, TAL-SH, and OpenACC]
Experimental Results (3/3)
• On V100 (Volta)
  • NWChem kernels: 818–1004 GFLOPS
  • TAL-SH and OpenACC kernels: max. 400 GFLOPS
  • The two fused variants: between 2.5 and 4.5 TFLOPS, with a peak of 4.5 TFLOPS
[Charts: sd1 and sd2 performance (GFLOPS) on V100 (Volta) for Size-A through Size-E, comparing Fully-Fused, Partially-Fused, NWChem, TAL-SH, and OpenACC]
Conclusion
Conclusion
• A novel strategy for executing the tensor contractions in CCSD(T) on GPUs
  • Kernel-level optimizations
  • Fusion across the symmetrized contraction kernels
  • A novel register-level transpose operation
• Experimental evaluation
  • Significant performance improvements compared to existing alternatives
  • Over 60% of peak floating-point performance on both Pascal P100 and Volta V100 GPUs
Thank you