S8822 – OPTIMIZING NMT WITH TENSORRT


  1. S8822 – OPTIMIZING NMT WITH TENSORRT. Micah Villmow, Senior TensorRT Software Engineer

  2. OVER 100X FASTER, IS IT REALLY POSSIBLE? (original slide in Japanese: 100倍以上速く、本当に可能ですか?)

  3. DOUGLAS ADAMS – BABEL FISH: Neural Machine Translation Unit

  4. OVER 100X FASTER, IS IT REALLY POSSIBLE? Over 200 years

  5. NVIDIA TENSORRT – Programmable Inference Accelerator. [Diagram: FRAMEWORKS → TensorRT (Optimizer + Runtime) → GPU PLATFORMS: Tesla P4, Tesla V100, Jetson TX2, DRIVE PX 2, NVIDIA DLA.] developer.nvidia.com/tensorrt

  6. TENSORRT LAYERS. Built-in layer support: Convolution • LSTM and GRU • Activation (ReLU, tanh, sigmoid) • Pooling (max and average) • Scaling • Element-wise operations • LRN • Fully-connected • SoftMax • Deconvolution. Custom Layer API for everything else. [Diagram: the deployed application runs on the TensorRT Runtime plus custom layers, on top of the CUDA Runtime.]

  7. TENSORRT OPTIMIZATIONS: Layer & Tensor Fusion • Weights & Activation Precision Calibration • Kernel Auto-Tuning • Dynamic Tensor Memory. [Chart: 40x faster CNNs on V100 vs. CPU-only, under 7 ms latency (ResNet50). Throughput and latency: CPU-Only 140 images/sec at 14 ms; V100 + TensorFlow 305 images/sec at 6.83 ms; V100 + TensorRT 5,700 images/sec at 6.67 ms.] Footnote: inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.60 GHz, 3.5 GHz Turbo (Broadwell), HT on. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, same CPU. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK; score doubled to account for Intel's stated claim of 2x performance improvement on Skylake with AVX512.

  8. Agenda: • What is NMT? • What is the current state? • What are the problems? • How did we solve them? • What perf is possible?

  9. ACRONYMS AND DEFINITIONS. NMT: Neural Machine Translation. OpenNMT: open-source NMT project for academia and industry. Token: the minimum representation used for encoding (symbol, word, character, or subword). Sequence: a number of tokens wrapped by special start- and end-of-sequence tokens. Beam search: a directed, partial breadth-first tree-search algorithm. TopK: a partial sort resulting in the N min/max elements. Unk: special token that represents unknown translations.

  10. OPENNMT INFERENCE. [Pipeline diagram: Input → Input Setup → EncoderRNN → Encoder; then the decode loop: Decoder Input → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Scoring, Beam Shuffle, Batch Reduction) → Decoder Output.]

  11. DECODER EXAMPLE. [Diagram of one decode loop: Input Embedding → Decoder RNN → Attention Model → Projection → Output Embedding → TopK → Beam Search (Beam Scoring, Beam Shuffle, Batch Reduction). Example beams: Iteration 0 starts from <S> and proposes "This", "The", "He", "What"; Iteration 1+ extends them to "This is", "The house", "He ran", "What time", "The cow".]

  12. TRAINING VS INFERENCE. [Side-by-side diagram: training and inference share the Input → Input Setup → EncoderRNN → Encoder → Decoder RNN → Attention Model → Projection path; inference additionally runs the beam-search machinery (TopK, Beam Scoring, Beam Shuffle, Batch Reduction) to produce the decoder output.]

  13. Agenda: • What is NMT? • What is the current state? • What are the problems? • How did we solve them? • What perf is possible?

  14. INFERENCE TIME IS BEAM SEARCH TIME. • Wu et al., 2016, "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation", arXiv:1609.08144. • Sharan Narang, June 2017, Baidu's DeepBench, https://github.com/baidu-research/DeepBench. • Rui Zhao, December 2017, "Why does inference run 20x slower than training?", https://github.com/tensorflow/nmt/issues/204. • David Levinthal, Ph.D., January 2018, "Evaluating RNN performance across hardware platforms."

  15. • What is NMT? • What is current state? Agenda • What are the problems? • How did we solve it? • What perf is possible? 15

  16. PERF ANALYSIS

  17. KERNEL ANALYSIS

  18. Agenda: • What is NMT? • What is the current state? • What are the problems? • How did we solve them? • What perf is possible?

  19. ENCODER. [Diagram: Input → Input Setup → EncoderRNN → Encoder.]

  20. INPUT SETUP. [Diagram: tokenization splits punctuation off ("Hello." → "Hello ."; "This is a test." → "This is a test ."; "Bye." → "Bye ."); token IDs are packed into a zero-padded encoder input buffer (rows 42 23 0 0 0 0 / 73 3 8 19 23 0 / 98 23 0 0 0 0); a PrefixSumPlugin fills the sequence-length buffer (2, 5, 2); a Constant plus Gather supplies the decoder start tokens; the RNN begins from a zero state.] A sketch of this stage follows.
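
The input-setup stage can be pictured with a small host-side sketch: punctuation becomes its own token, tokens map to IDs, and each sentence is zero-padded into a fixed-width encoder buffer next to a sequence-length buffer. The vocabulary, IDs, and Unk handling below are invented for illustration; the real pipeline uses OpenNMT's dictionaries and runs on device buffers.

```cpp
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // Toy vocabulary; real OpenNMT models ship their own dictionaries.
    std::unordered_map<std::string, int> vocab = {
        {"Hello", 42}, {".", 23}, {"This", 73}, {"is", 3},
        {"a", 8}, {"test", 19}, {"Bye", 98}};
    const int maxLen = 6, unkId = 1;
    std::vector<std::string> batch = {"Hello.", "This is a test.", "Bye."};

    auto lookup = [&](const std::string& t) {
        auto it = vocab.find(t);
        return it != vocab.end() ? it->second : unkId;  // unknown -> Unk token
    };

    std::vector<int> lengths;                   // sequence-length buffer
    std::vector<std::vector<int>> encoderInput; // padded token IDs
    for (const std::string& sent : batch) {
        std::vector<int> ids;
        std::string tok;
        for (char c : sent) {                   // "Hello." -> "Hello", "."
            if (c == ' ' || c == '.') {
                if (!tok.empty()) { ids.push_back(lookup(tok)); tok.clear(); }
                if (c == '.') ids.push_back(lookup("."));
            } else tok += c;
        }
        if (!tok.empty()) ids.push_back(lookup(tok));
        lengths.push_back((int)ids.size());
        ids.resize(maxLen, 0);                  // zero-pad to the buffer width
        encoderInput.push_back(ids);
    }
    for (size_t i = 0; i < encoderInput.size(); ++i) {
        std::printf("len=%d ids:", lengths[i]);
        for (int id : encoderInput[i]) std::printf(" %d", id);
        std::printf("\n");
    }
}
```

Running this reproduces the buffers on the slide: lengths (2, 5, 2) and rows 42 23 0 0 0 0 / 73 3 8 19 23 0 / 98 23 0 0 0 0.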

  21. ENCODER. [Diagram: the padded token IDs (42 23 0 0 0 0 / 73 3 8 19 23 0 / 98 23 0 0 0 0) and sequence lengths (2, 5, 2) enter the Embedding Plugin backed by the trained embedding table; a PackedRNN consumes the embedded sequences and emits the encoder hidden state, cell state, and a zero-padded context vector for attention.] A sketch of the embedding gather follows.
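
A minimal sketch of what the Embedding Plugin conceptually does: gather one row of the trained embedding table per token ID, leaving padded timesteps at zero so the packed RNN can skip them. Dimensions, weights, and the single-sequence shape are made up; the real plugin runs batched on the GPU.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int vocab = 100, dim = 4, maxLen = 6;
    std::vector<float> embedding(vocab * dim);
    for (int i = 0; i < vocab * dim; ++i) embedding[i] = 0.01f * i;  // "trained" weights

    std::vector<int> tokens = {42, 23, 0, 0, 0, 0};  // one zero-padded sequence
    int seqLen = 2;                                  // from the length buffer

    std::vector<float> out(maxLen * dim, 0.0f);
    for (int t = 0; t < seqLen; ++t)                 // gather only real timesteps
        for (int d = 0; d < dim; ++d)
            out[t * dim + d] = embedding[tokens[t] * dim + d];

    for (int t = 0; t < maxLen; ++t) {               // padded rows stay zero
        for (int d = 0; d < dim; ++d) std::printf("%6.2f ", out[t * dim + d]);
        std::printf("\n");
    }
}
```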

  22. DECODER. [Diagram: Decoder Input → Decoder RNN → Attention Model → Projection → TopK → Beam Search (Beam Scoring, Beam Shuffle, Batch Reduction) → Decoder Output.]

  23. DECODER, 1ST ITERATION. [Diagram: every batch entry starts from the start-of-sentence token <S>; the Embedding Plugin embeds it; the RNN is seeded with the encoder hidden and cell states and produces the first decoder output scores (e.g. .124 for batch 0, .912 for batch N).]

  24. DECODER, 2ND+ ITERATION. [Diagram: the decoder input is now a batch x beam grid of tokens (the example shows Japanese candidates such as こんにちは "hello" and さようなら "goodbye"); the Embedding Plugin embeds them, the RNN advances the previous hidden and cell states to the next ones, and the decoder output becomes a batch x beam score grid (batch 0: .18 .32 .85 .39 .75; batch N: .79 .27 .81 .93 .73).] A single RNN-cell step is sketched below.
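
The decoder RNN advances a hidden and cell state once per emitted token for every (batch, beam) slot. As a reference point, here is a textbook single-element LSTM cell step; the weights are random stand-ins and the real kernels are fused, batched GPU code, so this only illustrates the state-carrying math.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

static float sigmoidf(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// One LSTM step for a single (batch, beam) element. Gate order: input,
// forget, cell candidate, output. W is [4*dim][2*dim] over the concatenated
// (x, h); biases omitted for brevity.
void lstmStep(const std::vector<float>& x, std::vector<float>& h,
              std::vector<float>& c, const std::vector<float>& W, int dim) {
    std::vector<float> gates(4 * dim, 0.0f);
    for (int g = 0; g < 4 * dim; ++g)
        for (int j = 0; j < dim; ++j)
            gates[g] += W[g * 2 * dim + j] * x[j] + W[g * 2 * dim + dim + j] * h[j];
    for (int j = 0; j < dim; ++j) {
        float i = sigmoidf(gates[j]);
        float f = sigmoidf(gates[dim + j]);
        float gcand = std::tanh(gates[2 * dim + j]);
        float o = sigmoidf(gates[3 * dim + j]);
        c[j] = f * c[j] + i * gcand;   // next cell state
        h[j] = o * std::tanh(c[j]);    // next hidden state
    }
}

int main() {
    const int dim = 4;
    std::vector<float> x(dim, 0.5f), h(dim, 0.0f), c(dim, 0.0f);
    std::vector<float> W(4 * dim * 2 * dim, 0.1f);  // stand-in weights
    lstmStep(x, h, c, W, dim);
    for (float v : h) std::printf("%.4f ", v);
    std::printf("\n");
}
```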

  25. GLOBAL ATTENTION MODEL. [Diagram: the batch x beam decoder output (batch 0: .18 .32 .85 .39 .75; batch N: .79 .27 .81 .93 .73) goes through a FullyConnected layer; a BatchedGemm against the encoder output scores each timestep; a RaggedSoftmax normalizes only up to each sentence's sequence length; a second BatchedGemm forms the context vector; Concat → FullyConnected → TanH produces the attention output.] A loop-level sketch follows.
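
A loop-level sketch of this attention path for one (batch, beam) row, assuming the dot-product form of global (Luong-style) attention: the two BatchedGemms become explicit loops, and the RaggedSoftmax is modeled by normalizing only the first seqLen positions. Shapes, weights, and values are invented.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int maxLen = 6, dim = 4;
    int seqLen = 2;                                 // ragged: only 2 real steps
    std::vector<float> enc(maxLen * dim, 0.3f);     // encoder outputs (padded)
    std::vector<float> dec(dim, 0.5f);              // one decoder output row

    // 1) Scores: dot product of decoder state with each encoder timestep.
    std::vector<float> score(maxLen, 0.0f);
    for (int t = 0; t < seqLen; ++t)
        for (int d = 0; d < dim; ++d) score[t] += dec[d] * enc[t * dim + d];

    // 2) Ragged softmax over the first seqLen positions only.
    float mx = score[0];
    for (int t = 1; t < seqLen; ++t) mx = std::max(mx, score[t]);
    std::vector<float> alpha(maxLen, 0.0f);
    float sum = 0.0f;
    for (int t = 0; t < seqLen; ++t) { alpha[t] = std::exp(score[t] - mx); sum += alpha[t]; }
    for (int t = 0; t < seqLen; ++t) alpha[t] /= sum;

    // 3) Context vector: attention-weighted sum of encoder outputs.
    std::vector<float> ctx(dim, 0.0f);
    for (int t = 0; t < seqLen; ++t)
        for (int d = 0; d < dim; ++d) ctx[d] += alpha[t] * enc[t * dim + d];

    // 4) Concat(ctx, dec) -> FC -> tanh, with degenerate stand-in weights.
    for (int d = 0; d < dim; ++d)
        std::printf("%.4f ", std::tanh(0.5f * (ctx[d] + dec[d])));
    std::printf("\n");
}
```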

  26. PROJECTION. [Diagram: the attention output (batch x beam vectors such as [.9,…,.1], [0,…,.3]) passes through a FullyConnected layer with the projection weights, then Softmax and Log, yielding log-probabilities over the vocabulary for every batch x beam entry.] A stable log-softmax sketch follows.
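
The Softmax followed by Log on the slide is equivalent to a log-softmax; subtracting the max keeps it numerically stable, and log-space probabilities let the later beam scoring add instead of multiply. A minimal sketch:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Numerically stable log-softmax over one row of vocabulary logits.
std::vector<float> logSoftmax(const std::vector<float>& logits) {
    float mx = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float v : logits) sum += std::exp(v - mx);  // shift by max for stability
    float logZ = mx + std::log(sum);                 // log of the partition sum
    std::vector<float> out(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) out[i] = logits[i] - logZ;
    return out;
}

int main() {
    std::vector<float> logits = {2.0f, 1.0f, 0.1f};
    for (float v : logSoftmax(logits)) std::printf("%.4f ", v);
    std::printf("\n");
}
```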

  27. TOPK, PART 1. [Diagram: an intra-beam TopK partially sorts each beam's projection output, keeping the top candidates per beam; the example keeps two per beam, with indices [1,3] [2,4] [9,0] [5,0] [7,6] and probabilities [.9,.8] [.99,.5] [.3,.8] [.1,.93] [.85,.99], followed by a Gather.] A partial-sort sketch follows.
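
TopK here is the "partial sort resulting in N min/max elements" from the definitions slide. A host-side sketch of the intra-beam phase using std::partial_sort over one beam's distribution (the GPU plugin does this in parallel for every beam); the probabilities are chosen so the output matches beam 0 on the slide:

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const int n = 2;  // candidates kept per beam
    std::vector<float> probs = {.1f, .9f, .2f, .8f, .3f, .05f, .4f, .6f, .0f, .7f};

    std::vector<int> idx(probs.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Partial sort: only the first n positions end up ordered by probability,
    // so the rest of the vocabulary is never fully sorted.
    std::partial_sort(idx.begin(), idx.begin() + n, idx.end(),
                      [&](int a, int b) { return probs[a] > probs[b]; });
    for (int i = 0; i < n; ++i)
        std::printf("index=%d prob=%.2f\n", idx[i], probs[idx[i]]);
}
```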

  28. TOPK, PART 2. [Diagram: the gathered per-beam probabilities [.9,.8,.99,.55,.3,.8,.1,.93,.85,.99] feed an inter-beam TopK that yields flat indices [2,9,7,0,8] and probabilities [.99,.99,.93,.9,.85]; the Beam Mapping Plugin translates each flat winner back to (beam, intra-beam index): Beam1,Idx2; Beam4,Idx6; Beam3,Idx0; Beam0,Idx1; Beam4,Idx7.] A sketch of this merge-and-map step follows.
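
The inter-beam phase merges every beam's survivors and maps each global winner back to its source beam and vocabulary index, as the Beam Mapping Plugin does. A sketch using the slide's own numbers (ties, such as the two .99 entries, may come out in either order):

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const int beams = 5, perBeam = 2;
    // Flattened candidates: beam b owns entries [b*perBeam, b*perBeam + perBeam).
    std::vector<float> prob = {.9f, .8f, .99f, .55f, .3f, .8f, .1f, .93f, .85f, .99f};
    std::vector<int> vocabIdx = {1, 3, 2, 4, 9, 0, 5, 0, 7, 6};  // from part 1

    std::vector<int> flat(prob.size());
    std::iota(flat.begin(), flat.end(), 0);
    std::partial_sort(flat.begin(), flat.begin() + beams, flat.end(),
                      [&](int a, int b) { return prob[a] > prob[b]; });

    for (int i = 0; i < beams; ++i) {
        int f = flat[i];
        int srcBeam = f / perBeam;  // which beam the winner came from
        std::printf("new beam %d <- Beam%d, vocab index %d (prob %.2f)\n",
                    i, srcBeam, vocabIdx[f], prob[f]);
    }
}
```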

  29. BEAM SEARCH – BEAM SHUFFLE. [Diagram: given the winners (Beam1,Idx2; Beam4,Idx6; Beam3,Idx0; Beam0,Idx1; Beam4,Idx7), the Beam Shuffle Plugin gathers every state tensor so that new beam 0 carries old beam 1's state, new beam 1 carries old beam 4's, new beam 2 carries old beam 3's, new beam 3 carries old beam 0's, and new beam 4 keeps old beam 4's.] A gather sketch follows.
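
Beam shuffle is a gather: slot b of every state tensor for the next iteration is copied from the winning parent beam, never updated in place, since one parent (Beam4 here) can feed several children. A sketch with the slide's mapping:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int beams = 5, dim = 3;
    std::vector<int> parent = {1, 4, 3, 0, 4};  // new beam b <- old beam parent[b]
    std::vector<float> state(beams * dim);
    for (int i = 0; i < beams * dim; ++i) state[i] = (float)(i / dim);  // tag = old beam id

    std::vector<float> next(beams * dim);
    for (int b = 0; b < beams; ++b)             // gather into a fresh buffer:
        for (int d = 0; d < dim; ++d)           // beam 4 feeds slots 1 and 4
            next[b * dim + d] = state[parent[b] * dim + d];

    for (int b = 0; b < beams; ++b)
        std::printf("new beam %d carries state of old beam %.0f\n", b, next[b * dim]);
}
```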

  30. BEAM SEARCH – BEAM SCORING. [Diagram: the Beam Scoring Plugin takes the shuffled winners and, per beam, performs EOS detection, sentence-probability update, backtrack-state storage, sequence-length increment, and the end-of-beam/batch heuristic, advancing each beam's state to state+1 and emitting a batch-finished bitmap such as 0001100011…010.] A bookkeeping sketch follows.
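
A bookkeeping sketch of the scoring step. The slide lists EOS detection, probability update, sequence-length increment, and an end-of-beam/batch heuristic without spelling the heuristic out, so plain cumulative log-probability is assumed here; backtrack storage is omitted for brevity, and all token IDs and scores are invented.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int beams = 5, EOS = 3;                       // EOS id is a stand-in
    std::vector<int> token = {7, 3, 12, 3, 9};          // this step's winners
    std::vector<float> stepLogProb = {-.01f, -.1f, -.07f, -.02f, -.3f};
    std::vector<float> score(beams, -1.0f);             // running sentence scores
    std::vector<int> seqLen(beams, 4);
    unsigned finished = 0;                              // bitmap, one bit per beam

    for (int b = 0; b < beams; ++b) {
        if (finished & (1u << b)) continue;             // beam closed earlier
        score[b] += stepLogProb[b];                     // probability update (log space)
        seqLen[b] += 1;                                 // sequence-length increment
        if (token[b] == EOS) finished |= 1u << b;       // EOS detection
    }
    std::printf("finished bitmap: 0x%X\n", finished);   // beams 1 and 3 -> 0xA
    for (int b = 0; b < beams; ++b)
        std::printf("beam %d score %.2f len %d\n", b, score[b], seqLen[b]);
}
```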

  31. BEAM SEARCH – BATCH REDUCTION. [Diagram: the 32-bit batch-finished bitmap is transferred to the host; a reduce (sum) over it gives the new batch size; the Encoder/State Reduction Plugin then gathers the TopK output, beam state, and encoder output down to the surviving sentences.] A compaction sketch follows.
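
Batch reduction in miniature: sum the finished flags to get the new batch size, then compact the still-running sentences' rows to the front so subsequent kernels launch with a smaller batch. Toy buffers stand in for the real encoder-output and beam-state tensors:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int batch = 4, width = 3;
    std::vector<int> finished = {0, 1, 0, 1};           // per-sentence flags
    std::vector<float> state(batch * width);
    for (int i = 0; i < batch * width; ++i) state[i] = (float)(i / width);

    int newBatch = 0;
    for (int f : finished) newBatch += 1 - f;           // reduce(sum) -> new batch size

    std::vector<float> compact(newBatch * width);
    int dst = 0;
    for (int b = 0; b < batch; ++b) {
        if (finished[b]) continue;                      // drop completed sentences
        for (int d = 0; d < width; ++d)
            compact[dst * width + d] = state[b * width + d];
        ++dst;
    }
    std::printf("new batch size = %d\n", newBatch);
    for (int b = 0; b < newBatch; ++b)
        std::printf("slot %d holds original sentence %.0f\n", b, compact[b * width]);
}
```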

  32. OUTPUT. [Diagram: while any sentence is unfinished, the winners (Beam1,Idx2; Beam4,Idx6; Beam3,Idx0; Beam0,Idx1; Beam4,Idx7) loop back as the next decoder input; once all are done, the beam state is copied device-to-host and the host backtracks it into the final strings, e.g. こんにちは。("Hello.") これはテストです。("This is a test.") さようなら。("Goodbye.")]

  33. TENSORRT ANALYSIS

  34. TENSORRT KERNEL ANALYSIS

  35. Agenda: • What is NMT? • What is the current state? • What are the problems? • How did we solve them? • What perf is possible?

  36. RESULTS: 140x faster language-translation RNNs on V100 vs. CPU-only inference (OpenNMT). [Chart: throughput in sentences/sec and latency: CPU-Only + Torch 4 sentences/sec at 280 ms; V100 + Torch 25 sentences/sec at 153 ms; V100 + TensorRT 550 sentences/sec at 117 ms.] Footnote: inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60 GHz, 3.5 GHz Turbo (Broadwell), HT on. V100 + Torch: Torch (FP32), batch size 4, same GPU and CPU. CPU-Only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.60 GHz, 3.5 GHz Turbo (Broadwell), HT on.

  37. SUMMARY. • TopK no longer dominates sequence inference time. • RNN inference is compute-bound, not memory-bound. • TensorRT accelerates sequence inference: over two orders of magnitude higher throughput than CPU, with latency reduced by more than half versus CPU. • developer.nvidia.com/tensorrt

  38. LEARN MORE. PRODUCT PAGE: developer.nvidia.com/tensorrt • DOCUMENTATION: docs.nvidia.com/deeplearning/sdk • TRAINING: nvidia.com/dli

  39. Q&A
