Less is More: Accelerating Deep Neural Networks with Micro-Batching
Yosuke Oyama 1, a (Presenter), Tal Ben-Nun 2, Torsten Hoefler 2, Satoshi Matsuoka 1
1) Tokyo Institute of Technology, 2) ETH Zurich
a) oyama.y.aa@m.titech.ac.jp
2017/12/19
Background: cuDNN Convolution

• NVIDIA cuDNN: a deep learning kernel library for NVIDIA GPUs
  • Adopted by most deep learning frameworks
  • Contains multiple convolution algorithms for CNNs: GEMM, direct, FFT, Winograd, ...
  • Most algorithms use a workspace: a buffer in GPU memory to store intermediate data

2D convolution (forward), for N input images, C input channels, K output channels, and U x V filters:

    Y[n, k, h, w] = Σ_{c,u,v} W[k, c, u, v] * X[n, c, h+u, w+v]
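The definition above unrolls into seven nested loops, the outermost of which runs over the mini-batch dimension N — exactly the dimension µ-cuDNN later splits. A minimal CPU sketch in C++, assuming NCHW layout, stride 1, and no padding; the function name and index arithmetic are illustrative, not cuDNN's implementation (the cuDNN algorithms listed above compute the same result far more efficiently):

```cpp
#include <vector>

// Y[n,k,h,w] = sum over c,u,v of W[k,c,u,v] * X[n,c,h+u,w+v]
void conv2dForward(const std::vector<float>& X,   // input,  N x C x H x W
                   const std::vector<float>& Wt,  // filter, K x C x U x V
                   std::vector<float>& Y,         // output, N x K x HO x WO
                   int N, int C, int H, int W, int K, int U, int V) {
  const int HO = H - U + 1, WO = W - V + 1;       // "valid" output size
  for (int n = 0; n < N; ++n)                     // mini-batch dimension
    for (int k = 0; k < K; ++k)
      for (int h = 0; h < HO; ++h)
        for (int w = 0; w < WO; ++w) {
          float acc = 0.f;
          for (int c = 0; c < C; ++c)
            for (int u = 0; u < U; ++u)
              for (int v = 0; v < V; ++v)
                acc += Wt[((k * C + c) * U + u) * V + v] *
                       X[((n * C + c) * H + h + u) * W + w + v];
          Y[((n * K + k) * HO + h) * WO + w] = acc;
        }
}
```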
Background: cuDNN Convolution

• Concern: there are considerable performance gaps (w.r.t. time and workspace size) among convolution algorithms
  • e.g., an inappropriate workspace limit on AlexNet leads to a ~4.51x slowdown: with a workspace limit below 323 MiB, the fastest available algorithm is ~4.51x slower than with a limit of 323 MiB or more

[Figure: execution time vs. workspace size of AlexNet conv2 (forward) for cuDNN's algorithms: 0 IMPLICIT_GEMM, 1 IMPLICIT_PRECOMP_GEMM, 2 GEMM, 3 DIRECT, 4 FFT, 5 FFT_TILING, 6 WINOGRAD, 7 WINOGRAD_NONFUSED. Mini-batch size of 256, NVIDIA Tesla P100-SXM2, cuDNN 7.0]
Background: cuDNN Convolution

• Observation: Less batch size, More achievable performance
  • Faster algorithms can be enabled by dividing the mini-batch
  • e.g., FFT_TILING attains 93% of its peak performance with 58% of the workspace

[Figure: computation performance (images/time [ms^-1]) and workspace size [MiB] vs. batch size for FFT_TILING on AlexNet conv2 (forward)]
Approach and Contribution

• Approach: µ-cuDNN, a wrapper library for cuDNN
  • µ-cuDNN divides one mini-batch into finer-grained batches ("micro-batches") for cuDNN convolutions
  • µ-cuDNN optimizes micro-batch sizes and algorithms using Dynamic Programming and Integer Linear Programming
• Contribution: on an NVIDIA Tesla P100-SXM2 GPU, µ-cuDNN achieves
  • up to a 2.33x speedup for a single convolution
  • up to a 1.63x speedup for the convolutions of an entire CNN
Proposal: µ-cuDNN

• µ-cuDNN: a transparent C++ wrapper for cuDNN
  • Installed by replacing cudnnHandle_t with UcudnnHandle_t in deep learning frameworks
    • e.g., Caffe requires only 3 lines of modification
  • Overloads some of the cuDNN functions
    • Internally divides cudnnConvolution* calls into multiple convolutions
    • Delegates all other functions to cuDNN itself
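A minimal sketch of how such a drop-in wrapper can be structured, assuming the cuDNN 6/7 C API; the actual UcudnnHandle_t internals are not shown in the talk. The key idea is that the wrapper type converts implicitly to cudnnHandle_t, so unrelated cuDNN calls pass through unchanged, while a C++ overload intercepts the convolution entry points:

```cpp
#include <cudnn.h>

// Thin handle wrapper: converts implicitly to cudnnHandle_t, so every
// cuDNN function that is not overloaded below resolves to plain cuDNN.
class UcudnnHandle_t {
 public:
  UcudnnHandle_t() { cudnnCreate(&handle_); }
  ~UcudnnHandle_t() { cudnnDestroy(handle_); }
  UcudnnHandle_t(const UcudnnHandle_t&) = delete;  // owns the raw handle
  UcudnnHandle_t& operator=(const UcudnnHandle_t&) = delete;
  operator cudnnHandle_t() const { return handle_; }
 private:
  cudnnHandle_t handle_;
};

// C++ overload that is an exact match for framework code holding a
// UcudnnHandle_t; its definition (elided) splits the mini-batch along N
// into micro-batches and issues one cuDNN call per micro-batch.
cudnnStatus_t cudnnConvolutionForward(
    const UcudnnHandle_t& handle,
    const void* alpha, const cudnnTensorDescriptor_t xDesc, const void* x,
    const cudnnFilterDescriptor_t wDesc, const void* w,
    const cudnnConvolutionDescriptor_t convDesc,
    cudnnConvolutionFwdAlgo_t algo,
    void* workSpace, size_t workSpaceSizeInBytes,
    const void* beta, const cudnnTensorDescriptor_t yDesc, void* y);
```

With such a wrapper, a framework only has to change the declared type of its handle members, consistent with the 3-line Caffe modification mentioned above.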
Proposal: Workspace Policies of µ-cuDNN

• µ-cuDNN supports two different workspace policies
  • Workspace Reuse (WR): each layer reuses its own private workspace; total workspace size is O(#layers)
  • Workspace Division (WD): each layer uses part of one unified workspace; total workspace size is constant

                         Workspace Reuse (WR)                 Workspace Division (WD)
Total WS size            up to [WS limit/layer] * [#layers]   [total WS limit]
µ-batch division         optimized by DP                      optimized by DP + ILP
WS managed by            the DL framework                     µ-cuDNN
WS limit passed via      the cuDNN interface                  an environment variable
Proposal: WR using Dynamic Programming

• Problem: given a mini-batch size B and the fastest execution time T_µ(b) for b = 1, 2, ..., B, compute

    T(B) = min{ T_µ(B), min_{b = 1, 2, ..., B-1} [ T(b) + T(B - b) ] }

  and obtain the mini-batch division (a "configuration" in this work)

[Figure: a mini-batch of 256 for conv1 split into micro-batches of 60, 60, 60, 60, and 16; T(256) decomposes as T_µ(60) + T(196), with T(60) = T_µ(60)]
Proposal: WR using Dynamic Programming

1. For each b ∈ B_policy(B), benchmark the fastest execution time T_µ(b) and its micro-configuration c_µ(b) = (a, b)
   • where a is the algorithm ID and b is the micro-batch size
   • T_µ(b) and a are obtained via cudnnFindConvolution*Algorithm
   • B_all(B) = {1, 2, ..., B}, B_powerOfTwo(B) = {2^0, 2^1, ..., B}, B_undivided(B) = {B}
2. For b = 1, 2, ..., B, compute

     (b̂_1, b̂_2) = argmin_{b_1 + b_2 = b} { T_µ(b_1) + T(b_2) }
     T(b) = T_µ(b̂_1) + T(b̂_2)
     c(b) = { c_µ(b̂_1) } ∪ c(b̂_2)

3. Output the configuration (a list of micro-configurations) c(B)
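A minimal sketch of step 2 in C++, assuming the B_all benchmarking policy so that T_µ(b) and the winning algorithm ID are available for every b (the array names Tmu and amu are hypothetical):

```cpp
#include <vector>

struct MicroConfig { int algo; int batch; };  // c_mu(b) = (a, b)

// Computes T(b) and the traceback for b = 1..B, then returns c(B).
std::vector<MicroConfig> optimizeWR(int B,
                                    const std::vector<double>& Tmu,  // T_mu(b), index 0..B
                                    const std::vector<int>& amu) {   // winning algo ID per b
  std::vector<double> T(B + 1, 0.0);      // T(0) = 0
  std::vector<int> first(B + 1, 0);       // b̂_1 chosen for each b
  for (int b = 1; b <= B; ++b) {
    T[b] = Tmu[b];                        // undivided case: b_1 = b, b_2 = 0
    first[b] = b;
    for (int b1 = 1; b1 < b; ++b1)        // min over b_1 + b_2 = b of T_mu(b_1) + T(b_2)
      if (Tmu[b1] + T[b - b1] < T[b]) {
        T[b] = Tmu[b1] + T[b - b1];
        first[b] = b1;
      }
  }
  std::vector<MicroConfig> c;             // trace back c(B) = {c_mu(b̂_1)} ∪ c(b̂_2)
  for (int b = B; b > 0; b -= first[b])
    c.push_back({amu[first[b]], first[b]});
  return c;
}
```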
Proposal: WR using Dynamic Programming

Example traceback for conv1 with B = 256:

    c(256) = {(4, 60), (4, 60), (4, 60), (4, 60), (0, 16)}
      T(256) = T_µ(60) + T(196), with c_µ(60) = (4, 60)
      c(196) = {(4, 60), (4, 60), (4, 60), (0, 16)}
      T(196) = T_µ(60) + T(136), with c_µ(60) = (4, 60)
      c(136) = {(4, 60), (4, 60), (0, 16)}
Proposal: WD using Integer LP

• Problem:

    min.  T = Σ_{k ∈ K} Σ_{c ∈ C_k} T_k(c) · x_{k,c}           (total execution time)
    s.t.  Σ_{k ∈ K} Σ_{c ∈ C_k} M_k(c) · x_{k,c} ≤ M̂           (total workspace size must not exceed M̂)
          Σ_{c ∈ C_k} x_{k,c} = 1   (∀ k ∈ K)                   (exactly one configuration per kernel)
          x_{k,c} ∈ {0, 1}   (∀ k ∈ K, ∀ c ∈ C_k)               (x_{k,c} = 1 ⟺ configuration c is selected for kernel k)

  • M̂: total workspace size limit
  • K: the set of convolution kernels
  • C_k: the set of configurations for kernel k
  • T_k(c), M_k(c): execution time and workspace size of configuration c
Proposal: WD using Integer LP

[Figure: the unified workspace of size M̂ is divided among layers; each kernel k selects exactly one configuration (x_{k,c} = 1 for a single c ∈ C_k), e.g. x_{1,u} = 1 for conv1 and x_{2,v} = 1 for conv2, stacking their workspace sizes M_k(c) within M̂ while minimizing the total time T]
Proposal: WD using Integer LP

• Challenge: how to enumerate a practical number of configurations (i.e., 0-1 variables) for each kernel
  • The total number of configurations is Ω(|#algo|^B)
• Solution: pruning "undesirable" configurations
  • Definition: a configuration c is desirable among a set C ⟺ ¬∃ c' ∈ C such that T(c') < T(c) ∧ M(c') < M(c)
    (c is undesirable if some c' is both faster and requires less memory than c)
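Since desirability is exactly Pareto optimality in the (time, workspace) plane, D(C) can be computed with one sort and one sweep. A minimal sketch, with a hypothetical Config struct holding T(c) and M(c); configurations tied in time collapse to the one with the smallest workspace, which loses nothing for the ILP:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Config { double time; size_t workspace; };  // T(c), M(c)

// D(C): keep only configurations not dominated in both time and workspace.
std::vector<Config> desirable(std::vector<Config> C) {
  // Sort by (time, workspace); after that, c is desirable iff its workspace
  // is strictly smaller than that of every configuration sorted before it.
  std::sort(C.begin(), C.end(), [](const Config& a, const Config& b) {
    return a.time != b.time ? a.time < b.time : a.workspace < b.workspace;
  });
  std::vector<Config> D;
  size_t bestWS = SIZE_MAX;
  for (const Config& c : C)
    if (c.workspace < bestWS) {           // no faster config uses less memory
      D.push_back(c);
      bestWS = c.workspace;
    }
  return D;
}
```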
Proposal: WD using Integer LP

1. For each convolution kernel, enumerate all "desirable" configurations using the DP
   • We apply the pruning D(C) = { c ∈ C | ¬∃ c' ∈ C, T(c') < T(c) ∧ M(c') < M(c) } at each iteration
2. Pass the output (the desirable configurations) to the ILP problem
3. Solve the ILP problem
   • µ-cuDNN uses the GNU Linear Programming Kit (GLPK) as its ILP solver
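A minimal sketch of steps 2-3 using GLPK's C API, assuming hypothetical Kernel/Config containers that hold the desirable configurations from step 1; the real µ-cuDNN code is not shown in the talk:

```cpp
#include <glpk.h>
#include <vector>

struct Config { double time; double workspace; };   // T_k(c), M_k(c)
struct Kernel { std::vector<Config> configs; };     // C_k after pruning

// Returns the index of the chosen configuration for each kernel,
// or an empty vector if no feasible selection exists.
std::vector<int> solveWD(const std::vector<Kernel>& K, double M) {
  glp_prob* lp = glp_create_prob();
  glp_set_obj_dir(lp, GLP_MIN);                     // minimize total time

  // Row 1: sum of M_k(c) * x_kc <= M; rows 2..: one config per kernel.
  glp_add_rows(lp, 1 + (int)K.size());
  glp_set_row_bnds(lp, 1, GLP_UP, 0.0, M);
  for (int k = 0; k < (int)K.size(); ++k)
    glp_set_row_bnds(lp, 2 + k, GLP_FX, 1.0, 1.0);

  // One binary column x_kc per (kernel, configuration) pair.
  std::vector<int> ia(1), ja(1);                    // GLPK triplets are 1-based
  std::vector<double> ar(1);
  int col = 0;
  for (int k = 0; k < (int)K.size(); ++k)
    for (const Config& c : K[k].configs) {
      glp_add_cols(lp, 1);
      ++col;
      glp_set_col_kind(lp, col, GLP_BV);            // x_kc in {0, 1}
      glp_set_obj_coef(lp, col, c.time);
      ia.push_back(1);     ja.push_back(col); ar.push_back(c.workspace);
      ia.push_back(2 + k); ja.push_back(col); ar.push_back(1.0);
    }
  glp_load_matrix(lp, (int)ia.size() - 1, ia.data(), ja.data(), ar.data());

  glp_iocp parm;
  glp_init_iocp(&parm);
  parm.presolve = GLP_ON;                           // solve the LP relaxation internally
  int rc = glp_intopt(lp, &parm);

  std::vector<int> pick;
  if (rc == 0 && glp_mip_status(lp) == GLP_OPT) {
    int j = 0;
    for (const Kernel& k : K) {
      int chosen = 0;
      for (int i = 0; i < (int)k.configs.size(); ++i)
        if (glp_mip_col_val(lp, ++j) > 0.5) chosen = i;  // x_kc == 1
      pick.push_back(chosen);
    }
  }
  glp_delete_prob(lp);
  return pick;
}
```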
Evaluation

• Software: Caffe 1.0, cuDNN 6.0, CUDA 8.0
  • All CNN tensors are stored in float, NCHW format
  • Workspace size limit is set to 8, 64, or 512 MiB
• GPUs:
  • NVIDIA Tesla P100-SXM2 @ TSUBAME 3.0
    • 10.6 SP TFlop/s
    • 16 GiB HBM2 memory, 732 GiB/s bandwidth
  • NVIDIA Tesla K80 @ TSUBAME-KFC/DL
    • 8.73 SP TFlop/s
    • 24 GiB GDDR5 memory, 480 GiB/s bandwidth
Evaluation: WR using Dynamic Programming

• µ-cuDNN achieved a 2.33x speedup on forward convolution of AlexNet conv2

[Figure: execution time [ms] of cudnnConvolutionForward of AlexNet conv2 on NVIDIA Tesla P100-SXM2 under the undivided, powerOfTwo, and all policies; workspace size of 64 MiB, mini-batch size of 256. Bars are broken down by algorithm (IMPLICIT_PRECOMP_GEMM, FFT_TILING, WINOGRAD_NONFUSED); numbers on the rectangles are micro-batch sizes, e.g. the all policy splits 256 into micro-batches of 32 and 48]