EcoRNN: Efficient Computing of LSTM RNN on GPUs
Bojian Zheng (Graduate Student), Gennady Pekhimenko (Advisor)
{bojian, pekhimenko}@cs.toronto.edu
EcoSystem Research Group, Department of Computer Science, University of Toronto
www.cs.toronto.edu/ecosystem
The 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018, Fukuoka, Japan
B. Zheng, G. Pekhimenko (EcoSystem) EcoRNN MICRO 51 SRC 1 / 12
Background: Sequence Learning
Applications: Speech Recognition, Machine Translation
Background: Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN)
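As a reference for the computation being optimized, here is a minimal numpy sketch of a single LSTM step in the standard formulation (all names are illustrative, not EcoRNN's API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,).
    Rows of W, U, b are stacked as [input, forget, cell, output] gates."""
    H = h_prev.shape[-1]
    z = x @ W.T + h_prev @ U.T + b      # two GEMMs cover all four gates
    i = sigmoid(z[..., 0*H:1*H])        # input gate
    f = sigmoid(z[..., 1*H:2*H])        # forget gate
    g = np.tanh(z[..., 2*H:3*H])        # candidate cell state
    o = sigmoid(z[..., 3*H:4*H])        # output gate
    c = f * c_prev + i * g              # new cell state
    h = o * np.tanh(c)                  # new hidden state
    return h, c
```

Per time step, the recurrence is a pair of matrix multiplies followed by a handful of elementwise operations; it is this structure that the performance discussion below targets.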
Problem Statement: (1) Performance
✖ Default has cudaLaunch overhead ⇐ Kernel Fusion.
✖ cuDNN is closed-source, which limits innovation.
Reference: cuDNN LSTM RNN. Appleyard et al.
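The kernel-fusion idea can be illustrated with a hedged numpy sketch (illustrative only; the real optimization operates on CUDA kernel launches, not numpy calls): instead of one small GEMM and one activation per gate, the four gate weight matrices are stacked so a single large GEMM plus one elementwise pass replaces eight small launches.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gates_unfused(x, Ws, bs):
    """Naive form: one small GEMM + one activation per gate,
    mirroring one kernel launch per operation on the GPU."""
    acts = [sigmoid, sigmoid, np.tanh, sigmoid]
    return [act(x @ W.T + b) for act, W, b in zip(acts, Ws, bs)]

def gates_fused(x, Ws, bs):
    """Fused form: stack the four weight matrices so one large GEMM
    and one elementwise pass compute all gates together."""
    W = np.concatenate(Ws, axis=0)          # (4H, D)
    b = np.concatenate(bs)                  # (4H,)
    z = x @ W.T + b                         # one GEMM for all gates
    i, f, g, o = np.split(z, 4, axis=-1)
    return [sigmoid(i), sigmoid(f), np.tanh(g), sigmoid(o)]
```

Both forms compute identical values; the fused one simply issues far fewer operations, which is what amortizes the per-launch cudaLaunch cost.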
Problem Statement: (2) Memory Capacity
[Figure: Training throughput (samples/s) vs. mini-batch size for ResNet-50 (TF, MXNet, CNTK), NMT (TF), and Sockeye (MXNet).]
Reference: TBD: DNN Training Benchmark Suite. Zhu et al.
Training throughput of ResNet-50 saturates at large mini-batch sizes.
Training throughput of the machine translation models increases almost linearly with mini-batch size.
✖ RNN training is Memory Capacity-bound.
EcoRNN Full Vision
EcoRNN is a new open-source implementation with performance comparable to, or even better than, cuDNN. It has a smaller memory footprint and supports auto-tuning. All changes are transparent to the programmer.
Preliminary Results: (1) Performance
The runtime bottleneck is the FC layers.
Data Layout Optimization ⇒ improves the cache hit rate.
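Why layout matters for the cache hit rate can be sketched with numpy (a conceptual illustration, not EcoRNN's actual layout): when the traversal order does not match the storage order, every access strides through memory; physically reordering the data to match the access pattern restores unit-stride, cache-friendly reads.

```python
import numpy as np

# Weights stored row-major but read column-by-column during the
# recurrence would stride through memory on every access.
W = np.random.standard_normal((16, 8))      # C-contiguous storage

col_view = W.T                              # column-wise access pattern
assert not col_view.flags['C_CONTIGUOUS']   # strided: poor locality

W_opt = np.ascontiguousarray(W.T)           # physically reorder the data
assert W_opt.flags['C_CONTIGUOUS']          # unit-stride: cache-friendly
assert np.array_equal(W_opt, W.T)           # same values, better layout
```

The one-time cost of the reordering is paid back across every time step of the recurrence, which reads the same weights repeatedly.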
Preliminary Results: (1) Performance
Training Throughput Comparison on the MXNet Language Modeling Benchmark
✓ Up to 2× faster than Default, and
✓ Up to 1.3× faster than cuDNN.
Preliminary Results: (2) Memory Capacity
Memory Consumption Profile of the Machine Translation Model
The memory bottleneck is the Feature Maps of the Attention and RNN Layers.
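A back-of-envelope calculation shows why feature maps, not weights, dominate: activations must be stashed for every time step for the backward pass, so they scale with both sequence length and batch size, while the weight footprint is fixed. A hedged sketch (the constants are illustrative; real frameworks differ in exactly which tensors they save):

```python
def lstm_activation_bytes(batch, hidden, seq_len, layers=1, dtype_bytes=4):
    """Rough activation footprint: each step keeps its four gate
    outputs plus the cell and hidden states for backpropagation.
    Illustrative estimate only."""
    per_step = batch * hidden * (4 + 2) * dtype_bytes   # i, f, g, o + c, h
    return per_step * seq_len * layers

def lstm_weight_bytes(input_dim, hidden, layers=1, dtype_bytes=4):
    """Weights per layer: W (4H x D), U (4H x H), bias (4H)."""
    per_layer = 4 * hidden * (input_dim + hidden + 1) * dtype_bytes
    return per_layer * layers
```

For example, with batch 128, hidden size 1024, and sequence length 100, this estimate gives roughly 300 MB of activations versus about 32 MB of weights, and the activation term grows linearly with batch size, which is consistent with RNN training being memory-capacity-bound.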
Future Work
Weight Parameter Reuse: the same observation was made by Baidu's Persistent RNN. ✖ Inflexibility: difficult to port to new cell types and architectures ⇐ Machine Learning Compilers (e.g., TVM, XLA).
Memory Compression ⇐ Gist (Jain et al., ISCA'18).
Summary
Problem Statement: ✖ Performance, ✖ Memory Capacity.
Key Observations: Default suffers from cudaLaunch overhead ⇐ Kernel Fusion; cuDNN has low cache utilization ⇐ Data Layout Optimization.
Future Work: Weight Parameter Reuse ⇐ Machine Learning Compilers; the memory bottleneck of the machine translation model is the Feature Maps of the Attention and RNN Layers ⇐ Gist.
Backup Slide: Experimental Settings
CUDA Toolkit 8, cuDNN 6, MXNet v0.11.0.
[Figure: DeepSpeech2 training throughput (MXNet) vs. mini-batch size.]