Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs
Jinsung Kim (1), Aravind Sukumaran Rajam (1), Changwan Hong (1), Ajay Panyala (2), Rohit Kumar Srivastava (1), Sriram Krishnamoorthy (2), P. Sadayappan (1)
(1) The Ohio State University, (2) Pacific Northwest National Laboratory
Contents
• Introduction
• Overview of Mapping Strategy
• Optimized Execution of Tensor Contractions
• Fusion for Symmetrized Tensor Contractions
• Experimental Results
• Conclusion
Introduction
Tensor Contraction
• Tensor
  • A multidimensional array
  • For example, a vector is a 1D tensor and a matrix is a 2D tensor
• Tensor Contractions
  • Higher-dimensional analogs of matrix multiplication
  • Arise in high-order models in quantum chemistry, deep learning, finite element methods, etc.
Tensor Contraction
• Example: C(a, b, c) = A(a, k, c) × B(k, b)  (reference loop nest sketched below)
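As a point of reference, the following sequential loop nest spells out what this contraction computes. This is an illustrative sketch only; the extents Na, Nb, Nc, Nk and the row-major layouts are assumptions, not taken from the slides.

```cuda
// Reference (host) loop nest for C(a,b,c) = sum_k A(a,k,c) * B(k,b).
// Extents and the row-major linearization are illustrative assumptions.
void contract_reference(const double* A, const double* B, double* C,
                        int Na, int Nb, int Nc, int Nk) {
  for (int a = 0; a < Na; ++a)
    for (int b = 0; b < Nb; ++b)
      for (int c = 0; c < Nc; ++c) {
        double acc = 0.0;
        for (int k = 0; k < Nk; ++k)
          acc += A[(a * Nk + k) * Nc + c] * B[k * Nb + b];
        C[(a * Nb + b) * Nc + c] = acc;
      }
}
```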
CCSD(T)
• Coupled Cluster Singles and Doubles with perturbative Triples correction
• One of the most accurate methods applicable to reasonably large molecules
• Widely used in computational chemistry suites
• A set of symmetrized tensor contractions
  • Bandwidth-bound tensor contractions
  • The computational bottleneck of the CCSD(T) coupled-cluster method
CCSD(T)
• Symmetrized Tensor Contractions in CCSD(T)
Background: GPU Memory Types
• Global Memory (grid level)
  • The slowest memory on the GPU (highest-latency accesses)
• Shared Memory (thread-block level)
  • Very fast memory located in each SM
• Registers (thread level)
  • The fastest memory
[Figure: CUDA memory model (source: NVIDIA documentation)]
Global Memory Coalescing
• To maximize global memory bandwidth
  • All threads in a warp (32 consecutive threads) execute the same instruction (SIMD)
  • For a load/store instruction, the number of memory transactions must be minimized (sketch below)
• For a tensor contraction
  • Loads of the input tensors
  • Stores of the output tensor
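A minimal CUDA sketch of what coalescing means in practice (illustrative kernels, not from the slides): consecutive threads of a warp should access consecutive addresses so each warp-wide load/store maps to as few memory transactions as possible.

```cuda
// Coalesced: thread t of a warp reads element base + t, so the warp's 32
// accesses fall into a minimal number of memory transactions.
__global__ void copy_coalesced(const double* __restrict__ in,
                               double* __restrict__ out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];                       // stride-1 across the warp
}

// Uncoalesced: a stride across the warp scatters the 32 accesses over many
// transactions and wastes global-memory bandwidth.
__global__ void copy_strided(const double* __restrict__ in,
                             double* __restrict__ out, int n, int stride) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i * stride < n) out[i * stride] = in[i * stride];
}
```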
Challenges
• Mapping of the high-dimensional iteration space to threads
• Choice of data buffering in shared memory and registers
• Choice of tile sizes for multi-level tiling
• Fusing the symmetrized tensor contractions in CCSD(T)
Our Implementations
• An efficient GPU implementation of the tensor contractions in CCSD(T)
  • Shared-memory buffering
  • Register tiling
  • Loop fusion
  • Register transposition
Overview of Mapping Strategy
Overview of Mapping Strategy
• The overall strategy
  • Each thread holds a set of output elements in its registers
  • Shared memory buffers slices of the input tensors
  • Each thread block computes a hyper-rectangular slice of the output
• Partitioning the total work of a tensor contraction among thread blocks
• Mapping the iteration space to threads
• Choice of tile sizes
Overview of Mapping Strategy
• Two-Level Tiling
  • 2D thread block: TB_X × TB_Y
  • 2D register tiles: REG_X × REG_Y (register tiling)
• Mappings
  • External indices → 2D thread block and 2D register tiles
  • Internal (contracted) indices → ∗ (iterated by all threads)
• Tile sizes
  • The extent of each index handled by a thread block
  • E.g., T_a, T_b, ...
Overview of Mapping Strategy
• Example: O(a, b, c, d) = A(a, b, k) × B(k, c, d)  (sketch below)
  • Mapping: b → TB_X, d → TB_Y, a → REG_X, c → REG_Y, k → ∗
  • Tile sizes: T_b = 16, T_a = 4, T_d = 16, T_c = 4
[Figure: a 16 × 16 thread block (TB_X × TB_Y) covers (b, d); each thread holds a 4 × 4 register tile (REG_X × REG_Y) covering (a, c)]
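A hypothetical CUDA illustration of this mapping (shared-memory buffering is omitted here; it is covered in the next section). The extents Na..Nk, the row-major layouts, and divisibility by the tile sizes are assumptions made only for the sketch.

```cuda
// Mapping sketch: b -> TB_X, d -> TB_Y, a -> REG_X, c -> REG_Y, k -> *.
// Launch (assumed): dim3 block(16, 16); dim3 grid(Nb/16, Nd/16, (Na/4)*(Nc/4)).
__global__ void map_example(const double* A, const double* B, double* O,
                            int Na, int Nb, int Nc, int Nd, int Nk) {
  int b  = blockIdx.x * blockDim.x + threadIdx.x;  // b -> TB_X (T_b = 16)
  int d  = blockIdx.y * blockDim.y + threadIdx.y;  // d -> TB_Y (T_d = 16)
  int a0 = blockIdx.z / (Nc / 4) * 4;              // base of this block's a-tile (T_a = 4)
  int c0 = blockIdx.z % (Nc / 4) * 4;              // base of this block's c-tile (T_c = 4)

  double acc[4][4] = {};                           // (a, c) register tile: a -> REG_X, c -> REG_Y
  for (int k = 0; k < Nk; ++k)                     // k -> * : every thread iterates over it
    for (int x = 0; x < 4; ++x)
      for (int y = 0; y < 4; ++y)
        acc[x][y] += A[((a0 + x) * Nb + b) * Nk + k] * B[(k * Nc + (c0 + y)) * Nd + d];

  for (int x = 0; x < 4; ++x)
    for (int y = 0; y < 4; ++y)
      O[(((a0 + x) * Nb + b) * Nc + (c0 + y)) * Nd + d] = acc[x][y];
}
```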
Optimized Execution of Tensor Contractions
Optimized Execution of Tensor Contractions
• Example: C[i, j] = A[i, k] × B[k, j]
• A thread block computes a slice of the output C (1)
• Two slices of the inputs A and B
• Portions of the slices of A and B are loaded into shared memory (1)
[Figure: T_i × T_k and T_k × T_j tiles of A and B are staged from global memory (GMEM) into shared memory (SMEM); the thread block sweeps ⌈N_k / T_k⌉ such tile pairs along k while producing a T_i × T_j slice of C]
Optimized Execution of Tensor Contractions
• Example: C[i, j] = A[i, k] × B[k, j]  (see the kernel sketch below)
• Each thread loads a vector of the A slice and a vector of the B slice from shared memory into registers (2)
• An outer-product contribution is accumulated into the register tile (3)
• Threads store the output tensor slice to GMEM (4)
[Figure: steps (1)-(4): tiles staged into SMEM, vectors loaded into registers, outer-product accumulation into the register tile, and the T_i × T_j output slice stored back to GMEM]
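A simplified CUDA sketch of the scheme in steps (1)-(4). This is an illustrative kernel, not the authors' code; a 16×16 thread block, 4×4 register tiles, T_k = 16, row-major layouts, and extents that are exact multiples of the tile sizes are all assumptions.

```cuda
// Two-level tiled kernel sketch for C[i][j] = sum_k A[i][k] * B[k][j].
#define TB 16   // thread-block extent in each dimension (TB x TB threads)
#define RT 4    // register-tile extent per thread in each dimension
#define TK 16   // tile extent along the contracted index k (T_k)

__global__ void contract_tiled(const double* __restrict__ A,
                               const double* __restrict__ B,
                               double* __restrict__ C,
                               int Ni, int Nj, int Nk) {
  __shared__ double sA[TB * RT][TK];   // staged slice of A: (i, k)
  __shared__ double sB[TK][TB * RT];   // staged slice of B: (k, j)

  int iBase = blockIdx.y * TB * RT;    // this block's output slice origin
  int jBase = blockIdx.x * TB * RT;
  double acc[RT][RT] = {};             // per-thread register tile of C

  for (int k0 = 0; k0 < Nk; k0 += TK) {
    // (1) Cooperatively stage the A and B portions into shared memory.
    for (int r = threadIdx.y; r < TB * RT; r += TB)
      for (int c = threadIdx.x; c < TK; c += TB)
        sA[r][c] = A[(iBase + r) * Nk + (k0 + c)];
    for (int r = threadIdx.y; r < TK; r += TB)
      for (int c = threadIdx.x; c < TB * RT; c += TB)
        sB[r][c] = B[(k0 + r) * Nj + (jBase + c)];
    __syncthreads();

    for (int k = 0; k < TK; ++k) {
      // (2) Load a vector of sA (varying i, fixed k) and of sB (fixed k, varying j).
      double aVec[RT], bVec[RT];
      for (int x = 0; x < RT; ++x) aVec[x] = sA[threadIdx.y * RT + x][k];
      for (int y = 0; y < RT; ++y) bVec[y] = sB[k][threadIdx.x * RT + y];
      // (3) Outer-product contribution into the register tile.
      for (int x = 0; x < RT; ++x)
        for (int y = 0; y < RT; ++y)
          acc[x][y] += aVec[x] * bVec[y];
    }
    __syncthreads();
  }

  // (4) Store the per-thread register tile back to the output slice in GMEM.
  for (int x = 0; x < RT; ++x)
    for (int y = 0; y < RT; ++y)
      C[(iBase + threadIdx.y * RT + x) * Nj + (jBase + threadIdx.x * RT + y)] = acc[x][y];
}
```

Under these assumptions the kernel would be launched with dim3 block(16, 16) and a grid of (Nj / 64) × (Ni / 64) thread blocks, so each block produces a 64 × 64 slice of C.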
Fusion for Symmetrized Tensor Contractions
Fusion for Symmetrized Tensor Contractions
• Symmetrized tensor contractions
  • All accumulate into the same output tensor (identical left-hand side, LHS)
  • It is therefore possible to fuse tensor contractions that read different parts of the input tensors
Fusion for Symmetrized Tensor Contractions
• For example: the 9 sd2 functions in CCSD(T)
  • Fused so that results are not stored from registers to global memory after each individual tensor contraction (see the sketch below)
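A minimal illustration of the payoff, as a hypothetical kernel with one output element per thread rather than the actual sd2 code; square N × N operands and row-major layouts are assumptions. The accumulator stays in a register across both contractions and is stored to global memory only once.

```cuda
// Fusing two contractions with the same output slice:
// C(i,j) = sum_k A(i,k)*B(k,j) + sum_l D(i,l)*E(l,j).
__global__ void fused_pair(const double* A, const double* B,
                           const double* D, const double* E,
                           double* C, int N) {
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N || j >= N) return;

  double acc = 0.0;                       // stays in a register across both contractions
  for (int k = 0; k < N; ++k) acc += A[i * N + k] * B[k * N + j];   // contraction 1
  for (int l = 0; l < N; ++l) acc += D[i * N + l] * E[l * N + j];   // contraction 2
  C[i * N + j] = acc;                     // single store, no intermediate global traffic
}
```

The real kernels apply the same idea with a full REG_X × REG_Y register tile per thread and all nine contractions fused.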
Fusion for Symmetrized Tensor Contractions
• Two Issues of Fusion (1/2)
  1. The size of shared memory
     • Depends on the chosen tile sizes
     • Problem: the fused tensor contractions require different amounts of shared memory
     • Issue: lower occupancy
     • Constraint: all fused contractions use an identical amount of shared memory
Fusion for Symmetrized Tensor Contractions
• Two Issues of Fusion (2/2)
  2. Arithmetic intensity
     • Register tiles of REG_X × REG_Y elements
     • # of loaded elements per step: REG_X + REG_Y
     • # of result elements: REG_X × REG_Y
     • Problem: both indices mapped to the register tile might come from the same input tensor
     • Issue: low arithmetic intensity
     • Constraint: the indices mapped to the register tile should come from different input tensors
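A worked instance of this constraint (the numbers follow from 4 × 4 register tiles): with REG_X = REG_Y = 4, each update step loads 4 + 4 = 8 elements and produces 4 × 4 = 16 multiply-add contributions, i.e., 2 FMAs per loaded element. If both register-tile indices came from the same input tensor, each of the 16 contributions would need its own loaded element, and the ratio would drop to roughly 1.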
Fusion for Symmetrized Tensor Contractions
• Example (1/2)
  • Tile sizes: T_k = T_j = T_i = T_c = T_b = T_a = 4 and T_d (T_l) = 16
  • Mapping: two of the external indices → TB_X, two → TB_Y, one → REG_X, one → REG_Y, and the contracted index d (l) → ∗
Fusion for Symmetrized Tensor Contractions
• Example (2/2)
  • Two different kernels can fuse all of them
  • Mapping #1 and Mapping #2 are identical except that one index assigned to TB_Y in Mapping #1 trades places with the REG_X index in Mapping #2
  • Partially-Fused Kernel Version
Fusion for Symmetrized Tensor Contractions
• Register Transposition
  • Within a thread block, a hyper-rectangular slice of the output can be transposed via shared memory
  • Example: 4D output tensor O(i, j, c, k)
    • Let the mapping be k → TB_X, i → TB_Y, j → REG_X, c → REG_Y
    • Let the tile sizes be T_k = T_i = T_j = T_c = 2
[Figure: layout of the 2 × 2 × 2 × 2 output slice across the (k, i) threads and their (j, c) register tiles under this mapping]
Fusion for Symmetrized Tensor Contractions
• Example: 4D output tensor O(i, j, c, k)  (transposition sketch below)

               TB_X   TB_Y   REG_X   REG_Y
  Mapping #1:   k      i      j       c
  Mapping #2:   k      j      i       c

[Figure: each thread's (j, c) register tile under Mapping #1 is written to shared memory and read back as the (i, c) register tile required under Mapping #2]
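A hypothetical sketch of such a transposition through shared memory for the 2 × 2 × 2 × 2 example (illustrative only; the actual kernels use larger tiles): the thread holding the (j, c) register tile for a given (k, i) under Mapping #1 spills it to a shared buffer, and after a barrier each thread gathers the (i, c) tile it needs for its (k, j) under Mapping #2.

```cuda
// Register transposition through shared memory (illustrative, not the authors' code).
// Tile sizes T_k = T_i = T_j = T_c = 2; thread block of 2x2 threads.
#define T 2
__device__ void transpose_reg_tile(double regIn[T][T],    // regIn[j][c] under Mapping #1
                                   double regOut[T][T]) { // regOut[i][c] under Mapping #2
  __shared__ double buf[T][T][T][T];                      // buf[k][i][j][c]: the block's output slice
  int k = threadIdx.x, i = threadIdx.y;                   // thread coords under Mapping #1

  for (int j = 0; j < T; ++j)
    for (int c = 0; c < T; ++c)
      buf[k][i][j][c] = regIn[j][c];                      // spill register tile to shared memory
  __syncthreads();

  int j = threadIdx.y;                                    // the same thread now stands for (k, j)
  for (int i2 = 0; i2 < T; ++i2)
    for (int c = 0; c < T; ++c)
      regOut[i2][c] = buf[k][i2][j][c];                   // gather the (i, c) tile for this (k, j)
}
```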
Fusion for Symmetrized Tensor Contractions
• Register Transposition
  • With it, a single kernel fuses all 9 sd1 functions or all 9 sd2 functions
  • Fully-Fused Kernel Version
Experimental Results
Experimental Results (1/3)
• Experimental Setup
  • Pascal P100 and Volta V100 GPUs (16 GB, PCI-Express)
  • CUDA 9.0 and GCC 6.2
• Two Fused Variants
  • Fully-Fused and Partially-Fused kernel versions
  • Compared against the NWChem kernels and the TAL-SH and OpenACC implementations
[Tables: problem sizes; parameters used in the Fully-Fused and Partially-Fused kernels]
Experimental Results (2/3)
• On P100 (Pascal)
  • NWChem kernels: max. 500 GFLOPS
  • TAL-SH and OpenACC: less than 300 GFLOPS
  • Fully-Fused and Partially-Fused kernel versions: max. 2.8 and 2.1 TFLOPS, respectively
[Charts: sd1 and sd2 performance (GFLOPS) on P100 (Pascal) for Size-A through Size-E, comparing Fully-Fused, Partially-Fused, NWChem, TAL-SH, and OpenACC]
Experimental Results (3/3)
• On V100 (Volta)
  • NWChem kernels: 818–1004 GFLOPS
  • TAL-SH and OpenACC kernels: max. 400 GFLOPS
  • The two fused variants: between 2.5 and 4.5 TFLOPS, with a peak of 4.5 TFLOPS
[Charts: sd1 and sd2 performance (GFLOPS) on V100 (Volta) for Size-A through Size-E, comparing Fully-Fused, Partially-Fused, NWChem, TAL-SH, and OpenACC]
Conclusion
Conclusion
• A novel strategy for executing the tensor contractions in CCSD(T) on GPUs
  • Kernel-level optimizations
  • Fusion across the symmetrized contraction kernels
  • A novel register-level transpose operation
• Experimental evaluation
  • Significant performance improvements compared to existing alternatives
  • Over 60% of peak floating-point performance on both Pascal P100 and Volta V100 GPUs
Thank you