TensorCore and Tensorization
Siyuan Feng
Dec 5, 2019
Contents
1. TensorCore Introduction
2. TensorCore Support in TVM
3. Future Work
1. TensorCore Introduction
What are TensorCores?
TensorCores are specialized units on NVIDIA GPUs (Volta and later) that perform mixed-precision matrix multiply-accumulate in hardware, exposed to CUDA through the WMMA API.
Warp-Level Operation
A TensorCore operation is executed cooperatively by an entire warp of 32 threads: every thread in the warp must reach each call, e.g. wmma::fill_fragment(Cmat, 0.0f);
Programming TensorCore: a 16x16x16 MatMul

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c) {
  // Create fragments
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

  // Load fragments
  wmma::load_matrix_sync(Amat, a, 16);
  wmma::load_matrix_sync(Bmat, b, 16);

  // Initialize the accumulator
  wmma::fill_fragment(Cmat, 0.0f);

  // Perform MatMul
  wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

  // Store results
  wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
}
TensorCore Summary
• TensorCores are hardware accelerators
• Warp-level operation
• New memory scope: fragment
2. TensorCore Support in TVM
Steps for TensorCore Support in TVM: Memory Scope → Create Schedule → Tensorization
Current Memory Scope (diagram: the existing data-movement path through the global, shared, and local scopes)
Special Memory Scope (diagram: the extended data-movement path adding the new fragment scopes)
Traditional GPU Memory Scope Order: Global → Shared → Local → Global
Enhanced TensorCore Memory Scope Order: Global → Shared → Fragment → Local → Global
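To make the new scopes concrete, here is a minimal Python sketch of staging a mixed-precision MatMul through them, assuming a te-schedule-era TVM API; the shapes, names, and tiling are illustrative, not the schedule from the talk. The scope strings wmma.matrix_a, wmma.matrix_b, and wmma.accumulator are the fragment scopes TVM introduces for TensorCore.

import tvm
from tvm import te

# Illustrative fp16 MatMul accumulating in fp32 (shape is an assumption)
n = 256
A = te.placeholder((n, n), dtype="float16", name="A")
B = te.placeholder((n, n), dtype="float16", name="B")
k = te.reduce_axis((0, n), name="k")
C = te.compute(
    (n, n),
    lambda i, j: te.sum(A[i, k].astype("float32") * B[k, j].astype("float32"), axis=k),
    name="C",
)

s = te.create_schedule(C.op)
# Global -> Shared: stage input tiles in shared memory
AS = s.cache_read(A, "shared", [C])
BS = s.cache_read(B, "shared", [C])
# Shared -> Fragment: load tiles into TensorCore fragments
AF = s.cache_read(AS, "wmma.matrix_a", [C])
BF = s.cache_read(BS, "wmma.matrix_b", [C])
# Accumulate in a fragment before storing back toward global memory
CF = s.cache_write(C, "wmma.accumulator")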
Warp Level Schedule: blockDim.x = warp_size = 32
Warp Level Schedule (diagram: the thread block forms a 2D grid of warps along blockDim.y and blockDim.z, each warp spanning blockDim.x = warp_size = 32 threads)
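Continuing the sketch above, the warp-level convention can be expressed by keeping threadIdx.x pinned to exactly 32 lanes, so each (threadIdx.y, threadIdx.z) pair names one warp; the split factors below are illustrative assumptions, not the tuned schedule.

warp_size = 32
tx = te.thread_axis("threadIdx.x")  # lanes within one warp
ty = te.thread_axis("threadIdx.y")  # warp index along y
tz = te.thread_axis("threadIdx.z")  # warp index along z

# Output stage: 2x2 warps per block, one 16x16 wmma tile per warp
i, j = s[C].op.axis
bi, ii = s[C].split(i, factor=32)
bj, jj = s[C].split(j, factor=32)
wy, ii = s[C].split(ii, factor=16)
wz, jj = s[C].split(jj, factor=16)
s[C].reorder(bi, bj, wy, wz, ii, jj)
s[C].bind(bi, te.thread_axis("blockIdx.x"))
s[C].bind(bj, te.thread_axis("blockIdx.y"))
s[C].bind(wy, ty)
s[C].bind(wz, tz)

# Shared-memory copy stage: every lane of every warp helps load the tile
ai, aj = s[AS].op.axis
fused = s[AS].fuse(ai, aj)
fo, lane = s[AS].split(fused, factor=warp_size)
s[AS].bind(lane, tx)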
Tensorization
Before (a 16x16x16 MatMul loop nest):
for (i, 0, 16) {
  for (j, 0, 16) {
    for (k, 0, 16) {
      C[i*16 + j] = C[i*16 + j] + (float32(A[i*16 + k]) * float32(B[k*16 + j]))
    }
  }
}
After (replaced by a single warp-level intrinsic):
tvm_mma_sync(C, 0, A, 0, B, 0, C, 0)
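The rewrite above is driven by a tensor intrinsic that tells TVM which compute pattern to match and what to emit in its place. Below is a minimal sketch of such a declaration, loosely following TVM's public TensorCore tutorial rather than the exact code behind this slide; the buffer offset_factor and the elem_offset // 256 fragment-index convention are assumptions taken from that tutorial.

def intrin_wmma_gemm():
    # The 16x16x16 compute pattern that tensorize will pattern-match
    A = te.placeholder((16, 16), dtype="float16", name="A")
    B = te.placeholder((16, 16), dtype="float16", name="B")
    k = te.reduce_axis((0, 16), name="k")
    C = te.compute(
        (16, 16),
        lambda i, j: te.sum(A[i, k].astype("float32") * B[k, j].astype("float32"), axis=k),
        name="C",
    )
    # The matched tensors must live in the fragment scopes
    BA = tvm.tir.decl_buffer(A.shape, A.dtype, name="BA", scope="wmma.matrix_a", offset_factor=256)
    BB = tvm.tir.decl_buffer(B.shape, B.dtype, name="BB", scope="wmma.matrix_b", offset_factor=256)
    BC = tvm.tir.decl_buffer(C.shape, C.dtype, name="BC", scope="wmma.accumulator", offset_factor=256)

    def intrin_func(ins, outs):
        ba, bb = ins
        (bc,) = outs
        ib = tvm.tir.ir_builder.create()
        # Emit one warp-level mma in place of the whole loop nest;
        # fragment indices are element offsets divided by 16*16
        ib.emit(
            tvm.tir.call_intrin(
                "handle", "tir.tvm_mma_sync",
                bc.data, bc.elem_offset // 256,
                ba.data, ba.elem_offset // 256,
                bb.data, bb.elem_offset // 256,
                bc.data, bc.elem_offset // 256,
            )
        )
        return ib.get()

    return te.decl_tensor_intrin(C.op, intrin_func, binds={A: BA, B: BB, C: BC})

# Applied to the 16x16 warp-tile loops of the accumulator stage, e.g.:
#   s[CF].tensorize(tile_i, intrin_wmma_gemm())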
Performance Improvements over non-TensorCore (speedup, normalized to TVM w/o TensorCores = 1):
Large MatMul: 5.17x | BatchConv: 5.02x | Small MatMul: 4.97x | BatchMatMul: 4.87x
Performance Comparison vs cuDNN (TVM w/ TensorCores, normalized to cuDNN w/ TensorCores = 1):
Large MatMul: 1.44 | BatchConv: 1.16 | Small MatMul: 0.83 | BatchMatMul: 0.76
Comparable with cuDNN on traditional workloads; up to 1.4x on emerging workloads (BERT).
TVM TensorCore Support Summary
• Massive speedup over the non-TensorCore baseline
• Competitive performance with cuDNN
• Based on tensor intrinsics
3. Future Work
Future Work
1. Use TensorCore in TOPI and Relay
2. Apply TensorCore to popular ML models, such as BERT
Thank you!
Siyuan Feng
Dec 5, 2019