Performance Analysis of CNN Frameworks for GPUs
Heehoon Kim†, Hyoungwook Nam†, Wookeun Jung, and Jaejin Lee
Department of Computer Science and Engineering, Seoul National University, Korea
http://aces.snu.ac.kr
†The two authors contributed equally to this work as first authors.
[Figure: the three layers involved in this study — Convolutional Neural Network, Deep Learning Framework, GPU Library]
Motivation
Convolutional Neural Networks (CNNs) have been successful in machine learning tasks such as visual recognition
Previous studies reveal performance differences among deep learning frameworks
However, those studies do not identify the reasons for the differences
[Figure: bar chart of training time (ms) for Caffe, CNTK, TensorFlow, Theano, and Torch; x-axis 0–600 ms]
Goals
Analyze differences in the performance characteristics of the five deep learning frameworks in a single-GPU context
Analyze the scalability of the frameworks in a multi-GPU context
Analyze the performance characteristics of different convolution algorithms for each layer
Outline
• Convolutional Neural Network
• Deep Learning Frameworks
• Framework Comparison
• Multi-GPU Comparison
• Layer-wise Analysis of Convolution Algorithms
• Conclusions
Convolutional Neural Network
[Figure: a CNN pipeline — inputs pass through convolutional layers conv 1 … conv n (the feature extractor), then fully-connected layers fc 1 … fc n and a softmax (the classifier) to produce outputs]
Computational Complexity of Convolution
Conv2 layer
• C = 96 (input channels)
• [H, W] = [27, 27] (input dimensions)
• [R, S] = [5, 5] (kernel dimensions)
• K = 256 (output channels)
• N = 256 (batch size)
Operation count = C × H × W × R × S × K × N × 2 (multiply and add)
Ex) 96 × 27 × 27 × 5 × 5 × 256 × 256 × 2 ≈ 229 Gops
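As a quick sanity check, the operation count above can be reproduced in a few lines of Python; the variable names simply mirror the slide's notation.

```python
# Operation count for AlexNet's conv2 layer: C*H*W*R*S*K*N multiply-adds,
# counted as 2 operations each (one multiply, one add).
C, H, W = 96, 27, 27   # input channels and spatial size seen by the 5x5 kernels
R, S = 5, 5            # kernel height and width
K = 256                # output channels
N = 256                # batch size

ops = C * H * W * R * S * K * N * 2
print(f"{ops / 1e9:.0f} Gops")   # -> 229 Gops
```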
Convolution Algorithms for GPUs
Direct Convolution
• Straightforward, but hard to optimize
GEMM Convolution
• Converts convolutions into matrix multiplications
• Easier to optimize
FFT Convolution
• Reduced computational complexity: O(KN) (direct convolution) vs. O(N log N) (FFT convolution) for a length-N signal and a length-K kernel
Winograd Convolution
• Reduces the complexity of convolution, much like Strassen's algorithm does for matrix multiplication
• A specific filtering algorithm is required for each kernel dimension
AlexNet Model
Winner of ILSVRC 2012 (the ImageNet challenge)
Commonly used CNN model for benchmarking
Includes various kinds of layers
• 3x3 convolutions, 5x5 convolutions, fully-connected layers, etc.
Training a CNN
[Figure: data flows forward through the layers from the input to the loss; gradients flow backward, producing a data gradient (backward data) and a weight gradient (backward weight) at each layer, which is then used to update the parameters]
1 forward computation and 2 backward computations
Forward and backward computations are symmetric and have the same computational cost
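A minimal numpy sketch (not any framework's actual code) of why the three computations cost roughly the same: for a fully-connected layer y = xW, the forward pass and the two backward passes are matrix multiplications of the same size.

```python
import numpy as np

N, C_in, C_out = 256, 4096, 4096
x = np.random.randn(N, C_in)       # layer input
W = np.random.randn(C_in, C_out)   # layer weights
dy = np.random.randn(N, C_out)     # gradient arriving from the next layer

y = x @ W          # forward:          (N, C_in) x (C_in, C_out)
dx = dy @ W.T      # backward data:    (N, C_out) x (C_out, C_in), passed to the previous layer
dW = x.T @ dy      # backward weights: (C_in, N) x (N, C_out), used for the parameter update
```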
Outline
• Convolutional Neural Network
• Deep Learning Frameworks
• Framework Comparison
• Multi-GPU Comparison
• Layer-wise Analysis of Convolution Algorithms
• Conclusions
Five Deep Learning Frameworks

Framework    User Interface           Data Parallelism    Model Parallelism
Caffe        protobuf, C++, Python    Yes                 Limited
CNTK         BrainScript, C++, C#     Yes                 No
TensorFlow   Python, C++              Yes                 Yes
Theano       Python                   No                  No
Torch        LuaJIT                   Yes                 Yes

Popular frameworks chosen by GitHub stars
All five frameworks use cuDNN as a backend
Theano only supports a single GPU
cuDNN
Deep neural network library built on NVIDIA CUDA
Provides DNN primitives
• Convolution, pooling, normalization, activation, …
State-of-the-art performance
All five frameworks support using cuDNN as a backend
Unfortunately, not open source (distributed as binaries)
System Setup

CPU               2 x Intel Xeon E5-2650 @ 2.0 GHz
GPU               4 x NVIDIA Titan X (Maxwell)
Main memory       128 GB DDR3
GPU memory        4 x 12 GB GDDR5
Operating system  CentOS 7.2.1511 (Linux 3.10.0-327)
Outline
• Convolutional Neural Network
• Deep Learning Frameworks
• Framework Comparison
• Multi-GPU Comparison
• Layer-wise Analysis of Convolution Algorithms
• Conclusions
Execution Time Comparison (default settings)
[Figure: per-layer execution time (ms) for Caffe, CNTK, TensorFlow, Theano, and Torch, broken down into conv1–conv5 and fc1–fc3 forward (f) and backward (b) computations; x-axis 0–600 ms]
Convolution layers take up more than 70% of the training time
f: forward computation, b: backward computation
Options for Convolution Algorithms

Framework    User Selectable    Heuristic-based    Profile-based    Default
Caffe        No                 Yes                No               Heuristic-based
CNTK         No                 No                 Yes              Profile-based
TensorFlow   No                 No                 No               Heuristic-based†
Theano       Yes                Yes                Yes              GEMM
Torch        Yes                Yes                Yes              GEMM
† TensorFlow uses its own heuristic algorithm

The cuDNN Get API is a heuristic-based approach to choosing an algorithm
The cuDNN Find API is a profile-based approach to choosing an algorithm
By default, Torch and Theano use GEMM convolution
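The profile-based (Find-style) approach can be sketched in plain Python: time every candidate convolution routine on the actual problem size and keep the fastest, whereas a heuristic (Get-style) approach picks from a fixed rule table without running anything. The candidate functions below are placeholders, not cuDNN calls.

```python
import time

def pick_fastest(candidates, *args, trials=3):
    """Return the candidate routine with the lowest measured average time."""
    best_algo, best_time = None, float("inf")
    for algo in candidates:
        start = time.perf_counter()
        for _ in range(trials):
            algo(*args)
        elapsed = (time.perf_counter() - start) / trials
        if elapsed < best_time:
            best_algo, best_time = algo, elapsed
    return best_algo

# Example with dummy stand-ins for GEMM/FFT/Winograd kernels:
dummy = [lambda x: sum(x), lambda x: max(x), lambda x: sorted(x)]
best = pick_fastest(dummy, list(range(10000)))
```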
Options for Convolution Algorithms
[Figure: execution time (ms) of Theano, Theano (FFT), Theano (Heuristic), Theano (Profile), Torch, and Torch (Profile), broken down into Conv Forward, FC Forward, Conv Backward, and FC Backward; x-axis 0–600 ms]
Up to a 2x speedup by providing algorithm options
Data Layout
[Figure: TensorFlow keeps tensors in the NHWC layout; before a cuDNN call that requires NCHW it transposes NHWC → NCHW, and transposes the result back to NHWC afterwards. The accompanying bar chart compares TensorFlow with TensorFlow (NCHW); x-axis 0–300 ms]
For example, cuDNN's FFT convolution only supports the NCHW layout
If the user uses another layout, TensorFlow transposes implicitly
Changing the layout leads to a 15% speedup in TensorFlow
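The implicit conversion can be illustrated with numpy (a sketch of the data movement only): an NHWC tensor must be physically reordered to NCHW before such a cuDNN call and reordered back afterwards, and each reordering is an extra pass over memory.

```python
import numpy as np

# Small batch just for illustration: N, H, W, C
nhwc = np.random.randn(32, 27, 27, 96)
nchw = np.ascontiguousarray(nhwc.transpose(0, 3, 1, 2))  # copy into NCHW for the cuDNN-style call
# ... the convolution would run on `nchw` here ...
back = np.ascontiguousarray(nchw.transpose(0, 2, 3, 1))  # copy back to NHWC for the framework
```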
Unnecessary Backpropagation
[Figure: forward and backward passes through Layer 0 … Layer 3; the 'backward data' computation of the first layer propagates a gradient to the input, which nothing consumes]
'Backward data' is unnecessary in the first layer
Caffe, CNTK, Theano
• Automatically omitted
Torch
• User option (layer0.gradInput = nil)
TensorFlow
• No option for users
Unnecessary Backpropagation
[Figure: execution time (ms) of Torch vs. Torch (w/o first); x-axis 0–600 ms]
Speedup in the backward computation of the first layer
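A small sketch of the optimization from the two previous slides, with hypothetical layer objects (not a real framework API): during the backward pass, only the first layer can safely skip its data-gradient computation, because there is no layer below it to consume dx.

```python
class Layer:
    """Stand-in for a network layer with the two backward computations."""
    def backward_weights(self):
        pass   # gradient w.r.t. the weights: always needed for the update
    def backward_data(self):
        pass   # gradient w.r.t. the input: only needed by the layer below

layers = [Layer() for _ in range(4)]          # layer 0 is the first (input-side) layer

for i, layer in reversed(list(enumerate(layers))):
    layer.backward_weights()
    if i > 0:                                 # layer 0's data gradient is never used
        layer.backward_data()
```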
Optimized Results
[Figure: execution time (ms) of Caffe, CNTK, TensorFlow, TensorFlow (NCHW), Theano, Theano (Profile), Torch, and Torch (Profile), broken down into Conv Forward, FC Forward, Conv Backward, and FC Backward; x-axis 0–600 ms]
Framework differences are not significant if carefully optimized
The remaining differences come from other operations, such as bias addition and ReLU activation
Outline
• Convolutional Neural Network
• Deep Learning Frameworks
• Framework Comparison
• Multi-GPU Comparison
• Layer-wise Analysis of Convolution Algorithms
• Conclusions
Data-parallel SGD
[Figure: each of GPU0–GPU3 runs a replica of the CNN on its own batch (Batch 0–Batch 3); gradients are combined and the parameters are updated on every GPU]
Critical path: 2 log N transfers for N GPUs
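A numpy sketch of where the 2 log N figure comes from (GPUs are plain arrays here): gradients are summed pairwise up a binary tree in log2(N) transfer steps, and the combined result travels back down the tree in another log2(N) steps before every GPU can update its copy of the parameters.

```python
import numpy as np

num_gpus = 4
grads = [np.random.randn(1000) for _ in range(num_gpus)]   # one gradient per GPU

# Reduce phase: log2(N) rounds of pairwise sums toward GPU 0.
step = 1
while step < num_gpus:
    for i in range(0, num_gpus, 2 * step):
        grads[i] = grads[i] + grads[i + step]
    step *= 2

avg = grads[0] / num_gpus

# Broadcast phase: the averaged gradient goes back down the same tree,
# costing another log2(N) rounds of transfers.
grads = [avg.copy() for _ in range(num_gpus)]
```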
Multi-GPU Scalability
[Figure: speedup vs. batch size (128, 256, 512) for Caffe, TensorFlow, CNTK, and Torch, each comparing 1, 2, and 4 GPUs; speedups range from below 1 to about 2]
With small batches, multi-GPU training is worse than a single GPU
Even with large batches, the speedup with 4 GPUs is only around 1.5x
Communication–Compute Overlapping
[Figure: a timeline with the forward & backward computation (~200 ms with a batch size of 256) followed by gradient transfers (~45 ms for ~250 MB of gradients at ~5 GB/s); in the overlapped version, each transfer starts as soon as that layer's gradients are computed, beginning with the last layer]
Transfer overhead is not negligible
Transfer gradients as soon as those of each layer become available
TensorFlow is partly doing this
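A conceptual sketch of the overlap (the timings and the thread-based "copy engine" are made up for illustration): instead of sending all gradients after the backward pass finishes, each layer's gradient is handed off for transfer as soon as its backward computation completes, hiding most of the communication behind the remaining compute.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def backward(layer_id):       # stand-in for the GPU backward computation
    time.sleep(0.02)

def transfer(layer_id):       # stand-in for the PCIe gradient copy
    time.sleep(0.01)

with ThreadPoolExecutor(max_workers=1) as copy_engine:
    pending = []
    for layer_id in reversed(range(8)):             # backward pass, last layer first
        backward(layer_id)
        pending.append(copy_engine.submit(transfer, layer_id))
    for p in pending:                               # only the tail of the copies is exposed
        p.result()
```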
Reducing the Amount of Data Transferred
Quantization methods
• CNTK's 1-bit SGD (1/32 of the transfer volume)
Avoid fully-connected layers
• 90% of the parameters reside in fully-connected layers
• Use 1x1 convolution layers instead of fully-connected layers (e.g., GoogLeNet)
[Figure: speedup of CNTK with 1-bit SGD for batch sizes 128, 256, and 512 on 1, 2, and 4 GPUs; up to 2.62x]
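A numpy sketch in the spirit of 1-bit quantization with error feedback (this is not CNTK's implementation): only the sign of each gradient element is sent, roughly 1/32 of the data, and the quantization error is carried over so it is not lost.

```python
import numpy as np

def quantize_1bit(grad, residual):
    g = grad + residual                    # add back the error from the previous step
    scale = np.mean(np.abs(g))             # one shared magnitude per tensor
    q = np.where(g >= 0, scale, -scale)    # effectively 1 bit per element on the wire
    return q, g - q                        # quantized gradient and the new residual

grad = np.random.randn(1000)
residual = np.zeros_like(grad)
q, residual = quantize_1bit(grad, residual)
```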
Outline
• Convolutional Neural Network
• Deep Learning Frameworks
• Framework Comparison
• Multi-GPU Comparison
• Layer-wise Analysis of Convolution Algorithms
• Conclusions
Direct Convolution Algorithm
Straightforward convolution algorithm
Not supported by cuDNN, so we use cuda-convnet3 for testing
Easy to implement but hard to optimize
cuda-convnet requires the CHWN tensor layout instead of NCHW
Computation times for the forward and backward computations are not symmetric
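A naive Python/numpy version of direct convolution (single image, single output channel, stride 1, no padding), just to make the loop structure explicit; real kernels such as cuda-convnet's are heavily tiled and vectorized.

```python
import numpy as np

def direct_conv2d(x, w):
    """x: (C, H, W) input, w: (C, R, S) one output filter -> (H-R+1, W-S+1)."""
    C, H, W = x.shape
    _, R, S = w.shape
    y = np.zeros((H - R + 1, W - S + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(x[:, i:i+R, j:j+S] * w)   # one dot product per output pixel
    return y

y = direct_conv2d(np.random.randn(3, 8, 8), np.random.randn(3, 3, 3))
```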
GEMM Convolution Algorithm
Treats convolutions as vector dot products within a matrix multiplication
Forward and backward computations are symmetric
Efficiently optimized, but tiling inserts unnecessary computations
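An im2col-style sketch of the idea in numpy (cuDNN's GEMM-based algorithms are more sophisticated than this): each receptive field is unrolled into a column, after which the whole convolution is a single matrix multiplication.

```python
import numpy as np

def gemm_conv2d(x, w):
    """x: (C, H, W) input, w: (K, C, R, S) filters -> (K, H-R+1, W-S+1)."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    OH, OW = H - R + 1, W - S + 1
    cols = np.empty((C * R * S, OH * OW))          # im2col matrix (duplicates input pixels)
    for i in range(OH):
        for j in range(OW):
            cols[:, i * OW + j] = x[:, i:i+R, j:j+S].ravel()
    return (w.reshape(K, -1) @ cols).reshape(K, OH, OW)   # one big GEMM

y = gemm_conv2d(np.random.randn(3, 8, 8), np.random.randn(4, 3, 3, 3))
```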
FFT Convolution Algorithm
Convolution == FFT, then CGEMM, then inverse FFT
For 2D convolution, the computational complexity reduces from O(HWRS) to O(HW log(HW))
The computational cost does not depend on the kernel dimensions
cuDNN's FFT convolution does not support strides
[Figure: kernel operation counts (giga-operations) for each convolution layer conv1–conv5, comparing Direct, GEMM, FFT, Winograd, and the theoretical count; y-axis 0–250]
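A single-channel numpy sketch of the equivalence (real implementations batch the per-channel products into complex GEMMs): the kernel is zero-padded to the image size, both are transformed, multiplied pointwise, and the valid part of the inverse transform is kept. The check at the end compares against a directly computed convolution.

```python
import numpy as np

def fft_conv2d(x, w):
    """x: (H, W), w: (R, S) -> 'valid' 2D convolution via the FFT."""
    H, W = x.shape
    R, S = w.shape
    X = np.fft.rfft2(x)
    F = np.fft.rfft2(w, s=(H, W))          # zero-pad the kernel to the image size
    full = np.fft.irfft2(X * F, s=(H, W))  # circular convolution, same size as x
    return full[R-1:, S-1:]                # the wrap-free ("valid") region

x, w = np.random.randn(8, 8), np.random.randn(3, 3)
direct = np.array([[np.sum(x[i:i+3, j:j+3] * w[::-1, ::-1]) for j in range(6)]
                   for i in range(6)])
np.testing.assert_allclose(fft_conv2d(x, w), direct, atol=1e-10)
```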