Less is More: Accelerating Deep Neural Networks with Micro-Batching
Yosuke Oyama 1, a (Presenter), Tal Ben-Nun 2, Torsten Hoefler 2, Satoshi Matsuoka 1
1) Tokyo Institute of Technology, 2) ETH Zurich
a) oyama.y.aa@m.titech.ac.jp
2017/12/19
Background: cuDNN Convolution

• NVIDIA cuDNN: a deep learning kernel library for NVIDIA GPUs
  • Adopted by most deep learning frameworks
  • Contains multiple convolution algorithms for CNNs: GEMM, direct, FFT, Winograd, ...
  • Most algorithms use a workspace: a buffer in GPU memory to store intermediate data

2D convolution (forward), for N input images, C input channels, K output channels, and U x V filters:

    Y[n, k, h, w] = Σ_{c,u,v} W[k, c, u, v] * X[n, c, h+u, w+v]
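The definition above unrolls into seven nested loops, the outermost of which runs over the mini-batch dimension N — exactly the dimension µ-cuDNN later splits. A minimal CPU sketch in C++, assuming NCHW layout, stride 1, and no padding; the function name and index arithmetic are illustrative, not cuDNN's implementation (the cuDNN algorithms listed above compute the same result far more efficiently):

```cpp
#include <vector>

// Y[n,k,h,w] = sum over c,u,v of W[k,c,u,v] * X[n,c,h+u,w+v]
void conv2dForward(const std::vector<float>& X,   // input,  N x C x H x W
                   const std::vector<float>& Wt,  // filter, K x C x U x V
                   std::vector<float>& Y,         // output, N x K x HO x WO
                   int N, int C, int H, int W, int K, int U, int V) {
  const int HO = H - U + 1, WO = W - V + 1;       // "valid" output size
  for (int n = 0; n < N; ++n)                     // mini-batch dimension
    for (int k = 0; k < K; ++k)
      for (int h = 0; h < HO; ++h)
        for (int w = 0; w < WO; ++w) {
          float acc = 0.f;
          for (int c = 0; c < C; ++c)
            for (int u = 0; u < U; ++u)
              for (int v = 0; v < V; ++v)
                acc += Wt[((k * C + c) * U + u) * V + v] *
                       X[((n * C + c) * H + h + u) * W + w + v];
          Y[((n * K + k) * HO + h) * WO + w] = acc;
        }
}
```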
Background: cuDNN Convolution

• Concern: there are considerable performance gaps (w.r.t. time and workspace size) among convolution algorithms
  • e.g., an inappropriate workspace limit on AlexNet leads to a ~4.51x slowdown: with a workspace limit below 323 MiB, the fastest available algorithm is ~4.51x slower than with a limit of 323 MiB or more

[Figure: execution time vs. workspace size of AlexNet conv2 (forward) for cuDNN's algorithms: 0 IMPLICIT_GEMM, 1 IMPLICIT_PRECOMP_GEMM, 2 GEMM, 3 DIRECT, 4 FFT, 5 FFT_TILING, 6 WINOGRAD, 7 WINOGRAD_NONFUSED. Mini-batch size of 256, NVIDIA Tesla P100-SXM2, cuDNN 7.0]
Background: cuDNN Convolution

• Observation: Less batch size, More achievable performance
  • Faster algorithms can be enabled by dividing the mini-batch
  • e.g., FFT_TILING attains 93% of its peak performance with 58% of the workspace

[Figure: computation performance (images/time [ms^-1]) and workspace size [MiB] vs. batch size for FFT_TILING on AlexNet conv2 (forward)]
Approach and Contribution

• Approach: µ-cuDNN, a wrapper library for cuDNN
  • µ-cuDNN divides one mini-batch into finer-grained batches ("micro-batches") for cuDNN convolutions
  • µ-cuDNN optimizes micro-batch sizes and algorithms using Dynamic Programming and Integer Linear Programming
• Contribution: on an NVIDIA Tesla P100-SXM2 GPU, µ-cuDNN achieves
  • up to a 2.33x speedup for a single convolution
  • up to a 1.63x speedup for the convolutions of an entire CNN
Proposal: µ-cuDNN

• µ-cuDNN: a transparent C++ wrapper for cuDNN
  • Installed by replacing cudnnHandle_t with UcudnnHandle_t in deep learning frameworks
    • e.g., Caffe requires only 3 lines of modification
  • Overloads some of the cuDNN functions
    • Internally divides cudnnConvolution* calls into multiple convolutions
    • Delegates all other functions to cuDNN itself
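A minimal sketch of how such a drop-in wrapper can be structured, assuming the cuDNN 6/7 C API; the actual UcudnnHandle_t internals are not shown in the talk. The key idea is that the wrapper type converts implicitly to cudnnHandle_t, so unrelated cuDNN calls pass through unchanged, while a C++ overload intercepts the convolution entry points:

```cpp
#include <cudnn.h>

// Thin handle wrapper: converts implicitly to cudnnHandle_t, so every
// cuDNN function that is not overloaded below resolves to plain cuDNN.
class UcudnnHandle_t {
 public:
  UcudnnHandle_t() { cudnnCreate(&handle_); }
  ~UcudnnHandle_t() { cudnnDestroy(handle_); }
  UcudnnHandle_t(const UcudnnHandle_t&) = delete;  // owns the raw handle
  UcudnnHandle_t& operator=(const UcudnnHandle_t&) = delete;
  operator cudnnHandle_t() const { return handle_; }
 private:
  cudnnHandle_t handle_;
};

// C++ overload that is an exact match for framework code holding a
// UcudnnHandle_t; its definition (elided) splits the mini-batch along N
// into micro-batches and issues one cuDNN call per micro-batch.
cudnnStatus_t cudnnConvolutionForward(
    const UcudnnHandle_t& handle,
    const void* alpha, const cudnnTensorDescriptor_t xDesc, const void* x,
    const cudnnFilterDescriptor_t wDesc, const void* w,
    const cudnnConvolutionDescriptor_t convDesc,
    cudnnConvolutionFwdAlgo_t algo,
    void* workSpace, size_t workSpaceSizeInBytes,
    const void* beta, const cudnnTensorDescriptor_t yDesc, void* y);
```

With such a wrapper, a framework only has to change the declared type of its handle members, consistent with the 3-line Caffe modification mentioned above.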
Proposal: Workspace Policies of µ-cuDNN

• µ-cuDNN supports two different workspace policies
  • Workspace Reuse (WR): each layer reuses its own private workspace; total workspace size is O(#layers)
  • Workspace Division (WD): each layer uses part of one unified workspace; total workspace size is constant

                         Workspace Reuse (WR)                 Workspace Division (WD)
Total WS size            up to [WS limit/layer] * [#layers]   [total WS limit]
µ-batch division         optimized by DP                      optimized by DP + ILP
WS managed by            the DL framework                     µ-cuDNN
WS limit passed via      the cuDNN interface                  an environment variable
Proposal: WR using Dynamic Programming

• Problem: given a mini-batch size B and the fastest execution time T_µ(b) for b = 1, 2, ..., B, compute

    T(B) = min{ T_µ(B), min_{b = 1, 2, ..., B-1} [ T(b) + T(B - b) ] }

  and obtain the mini-batch division (a "configuration" in this work)

[Figure: a mini-batch of 256 for conv1 split into micro-batches of 60, 60, 60, 60, and 16; T(256) decomposes as T_µ(60) + T(196), with T(60) = T_µ(60)]
Proposal: WR using Dynamic Programming

1. For each b ∈ B_policy(B), benchmark the fastest execution time T_µ(b) and its micro-configuration c_µ(b) = (a, b)
   • where a is the algorithm ID and b is the micro-batch size
   • T_µ(b) and a are obtained via cudnnFindConvolution*Algorithm
   • B_all(B) = {1, 2, ..., B}, B_powerOfTwo(B) = {2^0, 2^1, ..., B}, B_undivided(B) = {B}
2. For b = 1, 2, ..., B, compute

     (b̂_1, b̂_2) = argmin_{b_1 + b_2 = b} { T_µ(b_1) + T(b_2) }
     T(b) = T_µ(b̂_1) + T(b̂_2)
     c(b) = { c_µ(b̂_1) } ∪ c(b̂_2)

3. Output the configuration (a list of micro-configurations) c(B)
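A minimal sketch of step 2 in C++, assuming the B_all benchmarking policy so that T_µ(b) and the winning algorithm ID are available for every b (the array names Tmu and amu are hypothetical):

```cpp
#include <vector>

struct MicroConfig { int algo; int batch; };  // c_mu(b) = (a, b)

// Computes T(b) and the traceback for b = 1..B, then returns c(B).
std::vector<MicroConfig> optimizeWR(int B,
                                    const std::vector<double>& Tmu,  // T_mu(b), index 0..B
                                    const std::vector<int>& amu) {   // winning algo ID per b
  std::vector<double> T(B + 1, 0.0);      // T(0) = 0
  std::vector<int> first(B + 1, 0);       // b̂_1 chosen for each b
  for (int b = 1; b <= B; ++b) {
    T[b] = Tmu[b];                        // undivided case: b_1 = b, b_2 = 0
    first[b] = b;
    for (int b1 = 1; b1 < b; ++b1)        // min over b_1 + b_2 = b of T_mu(b_1) + T(b_2)
      if (Tmu[b1] + T[b - b1] < T[b]) {
        T[b] = Tmu[b1] + T[b - b1];
        first[b] = b1;
      }
  }
  std::vector<MicroConfig> c;             // trace back c(B) = {c_mu(b̂_1)} ∪ c(b̂_2)
  for (int b = B; b > 0; b -= first[b])
    c.push_back({amu[first[b]], first[b]});
  return c;
}
```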
Proposal: WR using Dynamic Programming

Example traceback for conv1 with B = 256:

    c(256) = {(4, 60), (4, 60), (4, 60), (4, 60), (0, 16)}
      T(256) = T_µ(60) + T(196), with c_µ(60) = (4, 60)
      c(196) = {(4, 60), (4, 60), (4, 60), (0, 16)}
      T(196) = T_µ(60) + T(136), with c_µ(60) = (4, 60)
      c(136) = {(4, 60), (4, 60), (0, 16)}
Proposal: WD using Integer LP

• Problem:

    min.  T = Σ_{k ∈ K} Σ_{c ∈ C_k} T_k(c) · x_{k,c}           (total execution time)
    s.t.  Σ_{k ∈ K} Σ_{c ∈ C_k} M_k(c) · x_{k,c} ≤ M̂           (total workspace size must not exceed M̂)
          Σ_{c ∈ C_k} x_{k,c} = 1   (∀ k ∈ K)                   (exactly one configuration per kernel)
          x_{k,c} ∈ {0, 1}   (∀ k ∈ K, ∀ c ∈ C_k)               (x_{k,c} = 1 ⟺ configuration c is selected for kernel k)

  • M̂: total workspace size limit
  • K: the set of convolution kernels
  • C_k: the set of configurations for kernel k
  • T_k(c), M_k(c): execution time and workspace size of configuration c
Proposal: WD using Integer LP

[Figure: the unified workspace of size M̂ is divided among layers; each kernel k selects exactly one configuration (x_{k,c} = 1 for a single c ∈ C_k), e.g. x_{1,u} = 1 for conv1 and x_{2,v} = 1 for conv2, stacking their workspace sizes M_k(c) within M̂ while minimizing the total time T]
Proposal: WD using Integer LP

• Challenge: how to enumerate a practical number of configurations (i.e., 0-1 variables) for each kernel
  • The total number of configurations is Ω(|#algo|^B)
• Solution: pruning "undesirable" configurations
  • Definition: a configuration c is desirable among a set C ⟺ ¬∃ c' ∈ C such that T(c') < T(c) ∧ M(c') < M(c)
    (c is undesirable if some c' is both faster and requires less memory than c)
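Since desirability is exactly Pareto optimality in the (time, workspace) plane, D(C) can be computed with one sort and one sweep. A minimal sketch, with a hypothetical Config struct holding T(c) and M(c); configurations tied in time collapse to the one with the smallest workspace, which loses nothing for the ILP:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Config { double time; size_t workspace; };  // T(c), M(c)

// D(C): keep only configurations not dominated in both time and workspace.
std::vector<Config> desirable(std::vector<Config> C) {
  // Sort by (time, workspace); after that, c is desirable iff its workspace
  // is strictly smaller than that of every configuration sorted before it.
  std::sort(C.begin(), C.end(), [](const Config& a, const Config& b) {
    return a.time != b.time ? a.time < b.time : a.workspace < b.workspace;
  });
  std::vector<Config> D;
  size_t bestWS = SIZE_MAX;
  for (const Config& c : C)
    if (c.workspace < bestWS) {           // no faster config uses less memory
      D.push_back(c);
      bestWS = c.workspace;
    }
  return D;
}
```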
Proposal: WD using Integer LP

1. For each convolution kernel, enumerate all "desirable" configurations using the DP
   • We apply the pruning D(C) = { c ∈ C | ¬∃ c' ∈ C, T(c') < T(c) ∧ M(c') < M(c) } at each iteration
2. Pass the output (the desirable configurations) to the ILP problem
3. Solve the ILP problem
   • µ-cuDNN uses the GNU Linear Programming Kit (GLPK) as its ILP solver
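A minimal sketch of steps 2-3 using GLPK's C API, assuming hypothetical Kernel/Config containers that hold the desirable configurations from step 1; the real µ-cuDNN code is not shown in the talk:

```cpp
#include <glpk.h>
#include <vector>

struct Config { double time; double workspace; };   // T_k(c), M_k(c)
struct Kernel { std::vector<Config> configs; };     // C_k after pruning

// Returns the index of the chosen configuration for each kernel,
// or an empty vector if no feasible selection exists.
std::vector<int> solveWD(const std::vector<Kernel>& K, double M) {
  glp_prob* lp = glp_create_prob();
  glp_set_obj_dir(lp, GLP_MIN);                     // minimize total time

  // Row 1: sum of M_k(c) * x_kc <= M; rows 2..: one config per kernel.
  glp_add_rows(lp, 1 + (int)K.size());
  glp_set_row_bnds(lp, 1, GLP_UP, 0.0, M);
  for (int k = 0; k < (int)K.size(); ++k)
    glp_set_row_bnds(lp, 2 + k, GLP_FX, 1.0, 1.0);

  // One binary column x_kc per (kernel, configuration) pair.
  std::vector<int> ia(1), ja(1);                    // GLPK triplets are 1-based
  std::vector<double> ar(1);
  int col = 0;
  for (int k = 0; k < (int)K.size(); ++k)
    for (const Config& c : K[k].configs) {
      glp_add_cols(lp, 1);
      ++col;
      glp_set_col_kind(lp, col, GLP_BV);            // x_kc in {0, 1}
      glp_set_obj_coef(lp, col, c.time);
      ia.push_back(1);     ja.push_back(col); ar.push_back(c.workspace);
      ia.push_back(2 + k); ja.push_back(col); ar.push_back(1.0);
    }
  glp_load_matrix(lp, (int)ia.size() - 1, ia.data(), ja.data(), ar.data());

  glp_iocp parm;
  glp_init_iocp(&parm);
  parm.presolve = GLP_ON;                           // solve the LP relaxation internally
  int rc = glp_intopt(lp, &parm);

  std::vector<int> pick;
  if (rc == 0 && glp_mip_status(lp) == GLP_OPT) {
    int j = 0;
    for (const Kernel& k : K) {
      int chosen = 0;
      for (int i = 0; i < (int)k.configs.size(); ++i)
        if (glp_mip_col_val(lp, ++j) > 0.5) chosen = i;  // x_kc == 1
      pick.push_back(chosen);
    }
  }
  glp_delete_prob(lp);
  return pick;
}
```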
Evaluation

• Software: Caffe 1.0, cuDNN 6.0, CUDA 8.0
  • All CNN tensors are stored in float, NCHW format
  • Workspace size limit is set to 8, 64, or 512 MiB
• GPUs:
  • NVIDIA Tesla P100-SXM2 @ TSUBAME 3.0
    • 10.6 SP TFlop/s
    • 16 GiB HBM2 memory, 732 GiB/s bandwidth
  • NVIDIA Tesla K80 @ TSUBAME-KFC/DL
    • 8.73 SP TFlop/s
    • 24 GiB GDDR5 memory, 480 GiB/s bandwidth
Evaluation: WR using Dynamic Programming

• µ-cuDNN achieved a 2.33x speedup on forward convolution of AlexNet conv2

[Figure: execution time [ms] of cudnnConvolutionForward of AlexNet conv2 on NVIDIA Tesla P100-SXM2 under the undivided, powerOfTwo, and all policies; workspace size of 64 MiB, mini-batch size of 256. Bars are broken down by algorithm (IMPLICIT_PRECOMP_GEMM, FFT_TILING, WINOGRAD_NONFUSED); numbers on the rectangles are micro-batch sizes, e.g. the all policy splits 256 into micro-batches of 32 and 48]