EcoRNN: Efficient Computing of LSTM RNN on GPUs
Bojian Zheng (Graduate Student), Gennady Pekhimenko (Advisor)
{bojian, pekhimenko}@cs.toronto.edu
EcoSystem Research Group, Department of Computer Science, University of Toronto
www.cs.toronto.edu/ecosystem
The 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018, Fukuoka, Japan
B. Zheng, G. Pekhimenko (EcoSystem) EcoRNN MICRO 51 SRC 1 / 12
Background: Sequence Learning
Applications: Speech Recognition, Machine Translation
Background: Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN)
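As a reference for the computation being optimized, here is a minimal numpy sketch of a single LSTM step in the standard formulation (all names are illustrative, not EcoRNN's API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,).
    Rows of W, U, b are stacked as [input, forget, cell, output] gates."""
    H = h_prev.shape[-1]
    z = x @ W.T + h_prev @ U.T + b      # two GEMMs cover all four gates
    i = sigmoid(z[..., 0*H:1*H])        # input gate
    f = sigmoid(z[..., 1*H:2*H])        # forget gate
    g = np.tanh(z[..., 2*H:3*H])        # candidate cell state
    o = sigmoid(z[..., 3*H:4*H])        # output gate
    c = f * c_prev + i * g              # new cell state
    h = o * np.tanh(c)                  # new hidden state
    return h, c
```

Per time step, the recurrence is a pair of matrix multiplies followed by a handful of elementwise operations; it is this structure that the performance discussion below targets.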
Problem Statement: (1) Performance
✖ Default has cudaLaunch overhead ⇐ Kernel Fusion.
✖ cuDNN is closed-source, which limits innovation.
Reference: cuDNN LSTM RNN. Appleyard et al.
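The kernel-fusion idea can be illustrated with a hedged numpy sketch (illustrative only; the real optimization operates on CUDA kernel launches, not numpy calls): instead of one small GEMM and one activation per gate, the four gate weight matrices are stacked so a single large GEMM plus one elementwise pass replaces eight small launches.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gates_unfused(x, Ws, bs):
    """Naive form: one small GEMM + one activation per gate,
    mirroring one kernel launch per operation on the GPU."""
    acts = [sigmoid, sigmoid, np.tanh, sigmoid]
    return [act(x @ W.T + b) for act, W, b in zip(acts, Ws, bs)]

def gates_fused(x, Ws, bs):
    """Fused form: stack the four weight matrices so one large GEMM
    and one elementwise pass compute all gates together."""
    W = np.concatenate(Ws, axis=0)          # (4H, D)
    b = np.concatenate(bs)                  # (4H,)
    z = x @ W.T + b                         # one GEMM for all gates
    i, f, g, o = np.split(z, 4, axis=-1)
    return [sigmoid(i), sigmoid(f), np.tanh(g), sigmoid(o)]
```

Both forms compute identical values; the fused one simply issues far fewer operations, which is what amortizes the per-launch cudaLaunch cost.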
Problem Statement: (2) Memory Capacity
[Figure: Training throughput (samples/s) vs. mini-batch size for ResNet-50 (TF, MXNet, CNTK), NMT (TF), and Sockeye (MXNet).]
Reference: TBD: DNN Training Benchmark Suite. Zhu et al.
Training throughput of ResNet-50 saturates at large mini-batch sizes.
Training throughput of the machine translation models increases almost linearly with mini-batch size.
✖ RNN training is Memory Capacity-bound.
EcoRNN Full Vision
EcoRNN is a new open-source implementation with performance comparable to, or even better than, cuDNN. It has a smaller memory footprint and supports auto-tuning. All changes are transparent to the programmer.
Preliminary Results: (1) Performance
The runtime bottleneck is the FC layers.
Data Layout Optimization ⇒ improves the cache hit rate.
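Why layout matters for the cache hit rate can be sketched with numpy (a conceptual illustration, not EcoRNN's actual layout): when the traversal order does not match the storage order, every access strides through memory; physically reordering the data to match the access pattern restores unit-stride, cache-friendly reads.

```python
import numpy as np

# Weights stored row-major but read column-by-column during the
# recurrence would stride through memory on every access.
W = np.random.standard_normal((16, 8))      # C-contiguous storage

col_view = W.T                              # column-wise access pattern
assert not col_view.flags['C_CONTIGUOUS']   # strided: poor locality

W_opt = np.ascontiguousarray(W.T)           # physically reorder the data
assert W_opt.flags['C_CONTIGUOUS']          # unit-stride: cache-friendly
assert np.array_equal(W_opt, W.T)           # same values, better layout
```

The one-time cost of the reordering is paid back across every time step of the recurrence, which reads the same weights repeatedly.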
Preliminary Results: (1) Performance
Training Throughput Comparison on the MXNet Language Modeling Benchmark
✓ Up to 2× faster than Default, and
✓ Up to 1.3× faster than cuDNN.
Preliminary Results: (2) Memory Capacity
Memory Consumption Profile of the Machine Translation Model
The memory bottleneck is the Feature Maps of the Attention and RNN Layers.
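A back-of-envelope calculation shows why feature maps, not weights, dominate: activations must be stashed for every time step for the backward pass, so they scale with both sequence length and batch size, while the weight footprint is fixed. A hedged sketch (the constants are illustrative; real frameworks differ in exactly which tensors they save):

```python
def lstm_activation_bytes(batch, hidden, seq_len, layers=1, dtype_bytes=4):
    """Rough activation footprint: each step keeps its four gate
    outputs plus the cell and hidden states for backpropagation.
    Illustrative estimate only."""
    per_step = batch * hidden * (4 + 2) * dtype_bytes   # i, f, g, o + c, h
    return per_step * seq_len * layers

def lstm_weight_bytes(input_dim, hidden, layers=1, dtype_bytes=4):
    """Weights per layer: W (4H x D), U (4H x H), bias (4H)."""
    per_layer = 4 * hidden * (input_dim + hidden + 1) * dtype_bytes
    return per_layer * layers
```

For example, with batch 128, hidden size 1024, and sequence length 100, this estimate gives roughly 300 MB of activations versus about 32 MB of weights, and the activation term grows linearly with batch size, which is consistent with RNN training being memory-capacity-bound.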
Future Work
Weight Parameter Reuse: the same observation was made by Baidu's Persistent RNN. ✖ Inflexibility: difficult to port to new cell types and architectures ⇐ Machine Learning Compilers (e.g., TVM, XLA).
Memory Compression ⇐ Gist (Jain et al., ISCA'18).
Summary
Problem Statement: ✖ Performance, ✖ Memory Capacity.
Key Observations: Default suffers from cudaLaunch overhead ⇐ Kernel Fusion; cuDNN has low cache utilization ⇐ Data Layout Optimization.
Future Work: Weight Parameter Reuse ⇐ Machine Learning Compilers; the memory bottleneck of the machine translation model is the Feature Maps of the Attention and RNN Layers ⇐ Gist.
Backup Slide: Experimental Settings
CUDA Toolkit 8, cuDNN 6, MXNet v0.11.0.
[Figure: DeepSpeech2 training throughput (MXNet) vs. mini-batch size.]