TensorCore and Tensorization
Siyuan Feng, Dec 5, 2019



  1. TensorCore and Tensorization. Siyuan Feng, Dec 5, 2019

  2. Contents: 1. TensorCore Introduction; 2. TensorCore Support in TVM; 3. Future Work

  3. Contents (section divider: TensorCore Introduction)

  4. What are TensorCores?

  5. Warp-Level Operation: a single warp (32 threads) executes each WMMA call together, e.g. wmma::fill_fragment(Cmat, 0.0f);

  6. Programming TensorCore
     // Device-side WMMA code for one 16x16x16 MatMul
     // (requires #include <mma.h> and using namespace nvcuda)
     __device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c) {
       // Create fragments
       wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
       wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
       wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;
       // Load fragments
       wmma::load_matrix_sync(Amat, a, 16);
       wmma::load_matrix_sync(Bmat, b, 16);
       // Perform MatMul (c is unused here; the accumulator is zero-initialized)
       wmma::fill_fragment(Cmat, 0.0f);
       wmma::mma_sync(Cmat, Amat, Bmat, Cmat);
       // Store results
       wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
     }

  7. TensorCore Summary • TensorCores are hardware accelerators • Warp-level operation • New memory scope: fragment

  8. Contents (section divider: TensorCore Support in TVM)

  9. Steps for TensorCore Support in TVM: 1) Memory Scope, 2) Create Schedule, 3) Tensorization

  10. Current Memory Scope (diagram)

  11. Special Memory Scope (diagram)

  12. Traditional GPU Memory Scope Order: Global -> Shared -> Local -> Global

  13. Enhanced TensorCore Memory Scope Order: Global -> Shared -> Fragment -> Global
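
     To make the fragment scope concrete, the sketch below shows how this enhanced scope order can be
     expressed in a TVM schedule. It is a minimal illustration, assuming the TVM 0.6-era Python API from
     the time of the talk (tvm.placeholder / tvm.create_schedule; later releases move these under tvm.te);
     the wmma.matrix_a, wmma.matrix_b, and wmma.accumulator scope names are the ones TVM's TensorCore
     support uses for fragments, and the matrix sizes are illustrative.

        import tvm

        # 1024x1024x1024 fp16 matmul accumulated in fp32 (illustrative sizes).
        n = m = l = 1024
        A = tvm.placeholder((n, l), dtype='float16', name='A')
        B = tvm.placeholder((l, m), dtype='float16', name='B')
        k = tvm.reduce_axis((0, l), name='k')
        C = tvm.compute((n, m),
                        lambda i, j: tvm.sum(A[i, k].astype('float32') *
                                             B[k, j].astype('float32'), axis=k),
                        name='C')

        s = tvm.create_schedule(C.op)

        # Scope order from the slide: Global -> Shared -> Fragment -> Global.
        AS = s.cache_read(A, 'shared', [C])            # global -> shared
        AF = s.cache_read(AS, 'wmma.matrix_a', [C])    # shared -> fragment (A operand)
        BS = s.cache_read(B, 'shared', [C])
        BF = s.cache_read(BS, 'wmma.matrix_b', [C])    # shared -> fragment (B operand)
        CF = s.cache_write(C, 'wmma.accumulator')      # accumulator fragment -> global

     Compared with a traditional GPU schedule, the only new ingredient is the three wmma.* scopes taking
     the place of local memory on the TensorCore path.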

  14. Warp Level Schedule: blockDim.x = warp_size = 32

  15. Warp Level Schedule (diagram): blockDim.x = warp_size = 32, with blockDim.y and blockDim.z indexing the warps within a thread block
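
     Continuing the schedule sketch from the memory-scope example above (same s, C, AS, BS, CF), the
     warp-level constraint could be expressed roughly as follows. The tile sizes and the exact split/bind
     structure are assumptions for illustration, not the schedule used in the talk, and the fragment
     stages would still need compute_at placement and tensorization before the kernel can be built.

        warp_size = 32          # blockDim.x is pinned to the warp size
        warp_row_tiles = 2      # hypothetical: 16x16 fragments per warp along i
        warp_col_tiles = 2
        block_row_warps = 2     # hypothetical: warps per block along i
        block_col_warps = 2

        block_x = tvm.thread_axis('blockIdx.x')
        block_y = tvm.thread_axis('blockIdx.y')
        thread_x = tvm.thread_axis('threadIdx.x')
        thread_y = tvm.thread_axis('threadIdx.y')
        thread_z = tvm.thread_axis('threadIdx.z')

        # Tile the output: block -> warp -> 16x16 fragments, and bind the warp
        # indices to threadIdx.y / threadIdx.z as in the diagram.
        i, j = s[C].op.axis
        bi, i = s[C].split(i, factor=16 * warp_row_tiles * block_row_warps)
        bj, j = s[C].split(j, factor=16 * warp_col_tiles * block_col_warps)
        wi, i = s[C].split(i, nparts=block_row_warps)
        wj, j = s[C].split(j, nparts=block_col_warps)
        s[C].reorder(bi, bj, wi, wj, i, j)
        s[C].bind(bi, block_y)
        s[C].bind(bj, block_x)
        s[C].bind(wi, thread_y)     # blockDim.y enumerates warps along i
        s[C].bind(wj, thread_z)     # blockDim.z enumerates warps along j
        s[CF].compute_at(s[C], wj)  # each warp owns its accumulator fragments

        # Cooperative global -> shared copies: threadIdx.x is split off with
        # factor warp_size, so blockDim.x == 32 throughout the kernel.
        for stage in (AS, BS):
            ax0, ax1 = s[stage].op.axis
            fused = s[stage].fuse(ax0, ax1)
            fused, tx = s[stage].split(fused, factor=warp_size)
            fused, ty = s[stage].split(fused, factor=block_row_warps)
            fused, tz = s[stage].split(fused, factor=block_col_warps)
            s[stage].bind(tx, thread_x)
            s[stage].bind(ty, thread_y)
            s[stage].bind(tz, thread_z)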

  16. Tensorization: the 16x16x16 inner loop nest

      for (i, 0, 16) {
        for (j, 0, 16) {
          for (k, 0, 16) {
            C[i*16 + j] = C[i*16 + j] + (float32(A[i*16 + k]) * float32(B[k*16 + j]))
          }
        }
      }

      is pattern-matched and replaced by a single warp-level intrinsic call:

      tvm_mma_sync(C, 0, A, 0, B, 0, C, 0);
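
     The rewrite is driven by a tensor intrinsic: a 16x16x16 matmul compute that describes the pattern to
     match, plus a lowering function that emits the tvm_mma_sync (and tvm_fill_fragment) builtins. Below
     is a minimal sketch of such a declaration, again assuming the TVM 0.6-era API (tvm.decl_tensor_intrin,
     tvm.ir_builder, tvm.call_intrin); buffer strides and offsets are simplified compared with a
     production intrinsic.

        def intrin_wmma_gemm():
            # Compute pattern to match: one 16x16x16 fp16 matmul accumulated in fp32.
            A = tvm.placeholder((16, 16), dtype='float16', name='A')
            B = tvm.placeholder((16, 16), dtype='float16', name='B')
            k = tvm.reduce_axis((0, 16), name='k')
            C = tvm.compute((16, 16),
                            lambda i, j: tvm.sum(A[i, k].astype('float32') *
                                                 B[k, j].astype('float32'), axis=k),
                            name='C')

            # Buffers bound to fragment scopes; one fragment holds 16*16 = 256 elements.
            BA = tvm.decl_buffer(A.shape, A.dtype, scope='wmma.matrix_a', offset_factor=256)
            BB = tvm.decl_buffer(B.shape, B.dtype, scope='wmma.matrix_b', offset_factor=256)
            BC = tvm.decl_buffer(C.shape, C.dtype, scope='wmma.accumulator', offset_factor=256)

            def intrin_func(ins, outs):
                ba, bb = ins
                (bc,) = outs

                def init():
                    # Zero the accumulator fragment.
                    ib = tvm.ir_builder.create()
                    ib.emit(tvm.call_intrin('handle', 'tvm_fill_fragment',
                                            bc.data, 16, 16, 16,
                                            bc.elem_offset // 256, 0.0))
                    return ib.get()

                def update():
                    # One warp-level fused multiply-accumulate on fragments.
                    ib = tvm.ir_builder.create()
                    ib.emit(tvm.call_intrin('handle', 'tvm_mma_sync',
                                            bc.data, bc.elem_offset // 256,
                                            ba.data, ba.elem_offset // 256,
                                            bb.data, bb.elem_offset // 256,
                                            bc.data, bc.elem_offset // 256))
                    return ib.get()

                return update(), init(), update()

            return tvm.decl_tensor_intrin(C.op, intrin_func, binds={A: BA, B: BB, C: BC})

     With this in place, tensorization is requested on the accumulator stage at the axis whose inner
     extent matches the 16x16x16 pattern, e.g. s[CF].tensorize(ii, intrin_wmma_gemm()) with CF and a
     suitable inner axis ii from the schedule sketched above; similar intrinsics cover load_matrix_sync
     and store_matrix_sync for moving fragments between shared memory and the fragment scope.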

  17. Performance improvement over non-TensorCore TVM: with the baseline (TVM w/o TensorCores) normalized to 1, TVM w/ TensorCores achieves 4.87x to 5.17x speedups (5.17, 5.02, 4.97, 4.87) across Large MatMul, BatchConv, Small MatMul, and BatchMatMul.

  18. Performance comparison vs. CuDNN: with CuDNN w/ TensorCores normalized to 1, TVM w/ TensorCores ranges from 0.76 to 1.44 (1.44, 1.16, 0.83, 0.76) across Large MatMul, BatchConv, Small MatMul, and BatchMatMul; comparable on traditional workloads.

  19. Performance comparison vs. CuDNN (same chart): about 1.4x faster on emerging workloads (BERT).

  20. TVM TensorCore Support Summary • Massive speedup over non-TensorCore TVM • Competitive performance with CuDNN • Based on tensor intrinsics

  21. Contents (section divider: Future Work)

  22. Future Work: 1. Use TensorCore in TOPI and Relay 2. Apply TensorCore to popular ML models, such as BERT

  23. Thank you. Siyuan Feng, Dec 5, 2019
