TensorCore and Tensorization
Siyuan Feng
Dec 5, 2019
Contents
1. TensorCore Introduction
2. TensorCore Support in TVM
3. Future Work
1. TensorCore Introduction
What are TensorCores?
TensorCores are specialized units on NVIDIA GPUs (Volta and later) that perform mixed-precision matrix multiply-accumulate in hardware, exposed to CUDA through the WMMA API.
Warp-Level Operation
A TensorCore operation is executed cooperatively by an entire warp of 32 threads: every thread in the warp must reach each call, e.g. wmma::fill_fragment(Cmat, 0.0f);
Programming TensorCore: a 16x16x16 MatMul

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c) {
  // Create fragments
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

  // Load fragments
  wmma::load_matrix_sync(Amat, a, 16);
  wmma::load_matrix_sync(Bmat, b, 16);

  // Initialize the accumulator
  wmma::fill_fragment(Cmat, 0.0f);

  // Perform MatMul
  wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

  // Store results
  wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
}
TensorCore Summary
• TensorCores are hardware accelerators
• Warp-level operation
• New memory scope: fragment
2. TensorCore Support in TVM
Steps for TensorCore Support in TVM: Memory Scope → Create Schedule → Tensorization
Current Memory Scope (diagram: the existing data-movement path through the global, shared, and local scopes)
Special Memory Scope (diagram: the extended data-movement path adding the new fragment scopes)
Traditional GPU Memory Scope Order: Global → Shared → Local → Global
Enhanced TensorCore Memory Scope Order: Global → Shared → Fragment → Local → Global
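To make the new scopes concrete, here is a minimal Python sketch of staging a mixed-precision MatMul through them, assuming a te-schedule-era TVM API; the shapes, names, and tiling are illustrative, not the schedule from the talk. The scope strings wmma.matrix_a, wmma.matrix_b, and wmma.accumulator are the fragment scopes TVM introduces for TensorCore.

import tvm
from tvm import te

# Illustrative fp16 MatMul accumulating in fp32 (shape is an assumption)
n = 256
A = te.placeholder((n, n), dtype="float16", name="A")
B = te.placeholder((n, n), dtype="float16", name="B")
k = te.reduce_axis((0, n), name="k")
C = te.compute(
    (n, n),
    lambda i, j: te.sum(A[i, k].astype("float32") * B[k, j].astype("float32"), axis=k),
    name="C",
)

s = te.create_schedule(C.op)
# Global -> Shared: stage input tiles in shared memory
AS = s.cache_read(A, "shared", [C])
BS = s.cache_read(B, "shared", [C])
# Shared -> Fragment: load tiles into TensorCore fragments
AF = s.cache_read(AS, "wmma.matrix_a", [C])
BF = s.cache_read(BS, "wmma.matrix_b", [C])
# Accumulate in a fragment before storing back toward global memory
CF = s.cache_write(C, "wmma.accumulator")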
Warp Level Schedule: blockDim.x = warp_size = 32
Warp Level Schedule (diagram: the thread block forms a 2D grid of warps along blockDim.y and blockDim.z, each warp spanning blockDim.x = warp_size = 32 threads)
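Continuing the sketch above, the warp-level convention can be expressed by keeping threadIdx.x pinned to exactly 32 lanes, so each (threadIdx.y, threadIdx.z) pair names one warp; the split factors below are illustrative assumptions, not the tuned schedule.

warp_size = 32
tx = te.thread_axis("threadIdx.x")  # lanes within one warp
ty = te.thread_axis("threadIdx.y")  # warp index along y
tz = te.thread_axis("threadIdx.z")  # warp index along z

# Output stage: 2x2 warps per block, one 16x16 wmma tile per warp
i, j = s[C].op.axis
bi, ii = s[C].split(i, factor=32)
bj, jj = s[C].split(j, factor=32)
wy, ii = s[C].split(ii, factor=16)
wz, jj = s[C].split(jj, factor=16)
s[C].reorder(bi, bj, wy, wz, ii, jj)
s[C].bind(bi, te.thread_axis("blockIdx.x"))
s[C].bind(bj, te.thread_axis("blockIdx.y"))
s[C].bind(wy, ty)
s[C].bind(wz, tz)

# Shared-memory copy stage: every lane of every warp helps load the tile
ai, aj = s[AS].op.axis
fused = s[AS].fuse(ai, aj)
fo, lane = s[AS].split(fused, factor=warp_size)
s[AS].bind(lane, tx)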
Tensorization
Before (a 16x16x16 MatMul loop nest):
for (i, 0, 16) {
  for (j, 0, 16) {
    for (k, 0, 16) {
      C[i*16 + j] = C[i*16 + j] + (float32(A[i*16 + k]) * float32(B[k*16 + j]))
    }
  }
}
After (replaced by a single warp-level intrinsic):
tvm_mma_sync(C, 0, A, 0, B, 0, C, 0)
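The rewrite above is driven by a tensor intrinsic that tells TVM which compute pattern to match and what to emit in its place. Below is a minimal sketch of such a declaration, loosely following TVM's public TensorCore tutorial rather than the exact code behind this slide; the buffer offset_factor and the elem_offset // 256 fragment-index convention are assumptions taken from that tutorial.

def intrin_wmma_gemm():
    # The 16x16x16 compute pattern that tensorize will pattern-match
    A = te.placeholder((16, 16), dtype="float16", name="A")
    B = te.placeholder((16, 16), dtype="float16", name="B")
    k = te.reduce_axis((0, 16), name="k")
    C = te.compute(
        (16, 16),
        lambda i, j: te.sum(A[i, k].astype("float32") * B[k, j].astype("float32"), axis=k),
        name="C",
    )
    # The matched tensors must live in the fragment scopes
    BA = tvm.tir.decl_buffer(A.shape, A.dtype, name="BA", scope="wmma.matrix_a", offset_factor=256)
    BB = tvm.tir.decl_buffer(B.shape, B.dtype, name="BB", scope="wmma.matrix_b", offset_factor=256)
    BC = tvm.tir.decl_buffer(C.shape, C.dtype, name="BC", scope="wmma.accumulator", offset_factor=256)

    def intrin_func(ins, outs):
        ba, bb = ins
        (bc,) = outs
        ib = tvm.tir.ir_builder.create()
        # Emit one warp-level mma in place of the whole loop nest;
        # fragment indices are element offsets divided by 16*16
        ib.emit(
            tvm.tir.call_intrin(
                "handle", "tir.tvm_mma_sync",
                bc.data, bc.elem_offset // 256,
                ba.data, ba.elem_offset // 256,
                bb.data, bb.elem_offset // 256,
                bc.data, bc.elem_offset // 256,
            )
        )
        return ib.get()

    return te.decl_tensor_intrin(C.op, intrin_func, binds={A: BA, B: BB, C: BC})

# Applied to the 16x16 warp-tile loops of the accumulator stage, e.g.:
#   s[CF].tensorize(tile_i, intrin_wmma_gemm())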
Performance Improvements over non-TensorCore (speedup, normalized to TVM w/o TensorCores = 1):
Large MatMul: 5.17x | BatchConv: 5.02x | Small MatMul: 4.97x | BatchMatMul: 4.87x
Performance Comparison vs cuDNN (TVM w/ TensorCores, normalized to cuDNN w/ TensorCores = 1):
Large MatMul: 1.44 | BatchConv: 1.16 | Small MatMul: 0.83 | BatchMatMul: 0.76
Comparable with cuDNN on traditional workloads; up to 1.4x on emerging workloads (BERT).
TVM TensorCore Support Summary
• Massive speedup over the non-TensorCore baseline
• Competitive performance with cuDNN
• Based on tensor intrinsics
3. Future Work
Future Work
1. Use TensorCore in TOPI and Relay
2. Apply TensorCore to popular ML models, such as BERT
Thank you!
Siyuan Feng
Dec 5, 2019