Accelerating Deep Learning Frameworks with Micro-batches

Yosuke Oyama 1 *, Tal Ben-Nun 2, Torsten Hoefler 2, Satoshi Matsuoka 3,1
September 13, 2018

1 Tokyo Institute of Technology   2 ETH Zurich   3 RIKEN Center for Computational Science
* oyama.y.aa@m.titech.ac.jp, Presenter
Background
Background

• Convolution is one of the key operations in Convolutional Neural Networks (CNNs)

Algorithm 1 Pseudo-code of two-dimensional convolution.
1: for (n = 0; n < N; n++)                // Mini-batch loop
2:   for (k = 0; k < K; k++)              // Output channel loop
3:     for (h = 0; h < H; h++)            // Height loop
4:       for (w = 0; w < W; w++)          // Width loop
5:         for (c = 0; c < C; c++)        // Input channel loop
6:           for (v = 0; v < V; v++)      // Kernel height loop
7:             for (u = 0; u < U; u++)    // Kernel width loop
8:               Y[n, k, h, w] += W[k, c, v, u] × X[n, c, h + v, w + u];

Figure 1: 2D convolution (input X: C × H × W per sample, kernels W: K × C × V × U, output Y: C′ × H′ × W′ with C′ = K).
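For reference, Algorithm 1 translates directly into runnable code. Below is a minimal C++ sketch (flat arrays, unit stride, no padding; names follow the listing, with Wd standing in for the output width W):

    #include <vector>

    // Naive 2D convolution following Algorithm 1:
    //   Y[n,k,h,w] += W[k,c,v,u] * X[n,c,h+v,w+u]
    // X: N x C x (H+V-1) x (W+U-1), Wt: K x C x V x U, Y: N x K x H x Wd.
    void conv2d(const std::vector<float>& X, const std::vector<float>& Wt,
                std::vector<float>& Y, int N, int C, int K,
                int H, int Wd, int V, int U) {
      const int HX = H + V - 1, WX = Wd + U - 1;  // input spatial size
      for (int n = 0; n < N; n++)                 // mini-batch loop
        for (int k = 0; k < K; k++)               // output channel loop
          for (int h = 0; h < H; h++)             // height loop
            for (int w = 0; w < Wd; w++)          // width loop
              for (int c = 0; c < C; c++)         // input channel loop
                for (int v = 0; v < V; v++)       // kernel height loop
                  for (int u = 0; u < U; u++)     // kernel width loop
                    Y[((n * K + k) * H + h) * Wd + w] +=
                        Wt[((k * C + c) * V + v) * U + u] *
                        X[((n * C + c) * HX + (h + v)) * WX + (w + u)];
    }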
Background

• NVIDIA cuDNN library provides deep learning primitives for GPUs
• cuDNN provides several equivalent convolution algorithms:
  • GEMM-based: lower X to X′ via im2col, then Y = X′ · W
  • Winograd: transform X and W (via B^T, G), compute Ỹ = X̃ ◦ W̃ in the Winograd domain, transform back (via A^T)
  • FFT-based: transform X and W (via F), compute Ŷ = X̂ ◦ Ŵ in the frequency domain, transform back (via F⁻¹)
  • Each algorithm keeps its transformed data (X′; X̃, W̃; X̂, Ŵ) in a workspace

Figure 2: Three different convolution algorithms.
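As a concrete illustration of the GEMM-based variant, im2col lowers each input sample to a matrix X′ so that the convolution becomes a single matrix product. A minimal sketch for one sample (our own helper, not cuDNN's internal routine; unit stride, no padding):

    #include <vector>

    // im2col: lower input X (C x HX x WX) to a matrix X' of shape
    // (C*V*U) x (H*W). Multiplying the K x (C*V*U) filter matrix by X'
    // yields the K x (H*W) output. X' is exactly the kind of per-algorithm
    // workspace that Figure 2 refers to.
    std::vector<float> im2col(const std::vector<float>& X, int C,
                              int HX, int WX, int V, int U) {
      const int H = HX - V + 1, W = WX - U + 1;  // output spatial size
      std::vector<float> Xp((size_t)C * V * U * H * W);
      for (int c = 0; c < C; c++)
        for (int v = 0; v < V; v++)
          for (int u = 0; u < U; u++)
            for (int h = 0; h < H; h++)
              for (int w = 0; w < W; w++)
                Xp[((((size_t)c * V + v) * U + u) * H + h) * W + w] =
                    X[((size_t)c * HX + (h + v)) * WX + (w + u)];
      return Xp;
    }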
Background

• Problem statement: cuDNN may require a workspace as large as the network itself to use efficient convolution algorithms!

Figure 3: Memory consumption (bars: data, weights, workspace) and total computation time (line/points) of AlexNet's conv/fc layers on P100-SXM2 under different workspace size limits (8, 64, 512 MiB), for the IMPLICIT PRECOMP GEMM, FFT, FFT TILING, and WINOGRAD algorithms.
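Workspace demands like those in Figure 3 can be queried per algorithm through cuDNN's API. A minimal sketch (cuDNN v7-era calls; the layer shape is illustrative, roughly AlexNet's conv2):

    #include <cudnn.h>
    #include <cstdio>

    // Query cuDNN's workspace requirement for several forward-convolution
    // algorithms on one example layer shape.
    int main() {
      cudnnHandle_t handle;
      cudnnCreate(&handle);

      const int N = 256, C = 96, H = 27, W = 27;  // input: batch x channels x H x W
      const int K = 256, R = 5, S = 5;            // filters: K x C x R x S, pad 2

      cudnnTensorDescriptor_t xDesc, yDesc;
      cudnnFilterDescriptor_t wDesc;
      cudnnConvolutionDescriptor_t convDesc;
      cudnnCreateTensorDescriptor(&xDesc);
      cudnnCreateTensorDescriptor(&yDesc);
      cudnnCreateFilterDescriptor(&wDesc);
      cudnnCreateConvolutionDescriptor(&convDesc);

      cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, N, C, H, W);
      cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, K, C, R, S);
      cudnnSetConvolution2dDescriptor(convDesc, 2, 2, 1, 1, 1, 1,
                                      CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

      int n, c, h, w;  // output dimensions implied by the descriptors
      cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc, &n, &c, &h, &w);
      cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

      const cudnnConvolutionFwdAlgo_t algos[] = {
          CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
          CUDNN_CONVOLUTION_FWD_ALGO_FFT,
          CUDNN_CONVOLUTION_FWD_ALGO_FFT_TILING,
          CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD};
      for (auto algo : algos) {
        size_t bytes = 0;
        if (cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc,
                                                    yDesc, algo, &bytes) ==
            CUDNN_STATUS_SUCCESS)
          printf("algo %d: %.1f MiB workspace\n", (int)algo, bytes / (1024.0 * 1024.0));
        else
          printf("algo %d: not supported for this shape\n", (int)algo);
      }
      cudnnDestroy(handle);  // (descriptor destruction omitted for brevity)
      return 0;
    }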
Background

• Idea: Loop splitting for the convolution's outermost loop decreases workspace size (as well as computation efficiency); see the sketch after the listing

Algorithm 2 Pseudo-code of two-dimensional convolution.
1: for (n = 0; n < N; n++)                // Mini-batch loop ← split into micro-batches
2:   for (k = 0; k < K; k++)              // Output channel loop
3:     for (h = 0; h < H; h++)            // Height loop
4:       for (w = 0; w < W; w++)          // Width loop
5:         for (c = 0; c < C; c++)        // Input channel loop
6:           for (v = 0; v < V; v++)      // Kernel height loop
7:             for (u = 0; u < U; u++)    // Kernel width loop
8:               Y[n, k, h, w] += W[k, c, v, u] × X[n, c, h + v, w + u];
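Concretely, splitting the mini-batch loop means issuing the same convolution over contiguous slices of the batch, so each call only needs the workspace of a size-b problem. A schematic sketch (the convolve callback is a hypothetical stand-in for one cuDNN invocation):

    #include <algorithm>
    #include <cstddef>

    // Split a mini-batch of size N into micro-batches of at most `microBatch`
    // samples; convolve(x, y, b) stands in for one size-b convolution call.
    template <typename ConvFn>
    void convolveSplit(ConvFn convolve, const float* X, float* Y, int N,
                       int microBatch, size_t xStride, size_t yStride) {
      // xStride/yStride are per-sample element strides into X and Y.
      for (int n = 0; n < N; n += microBatch) {
        const int b = std::min(microBatch, N - n);  // last chunk may be smaller
        convolve(X + n * xStride, Y + n * yStride, b);
      }
    }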
Approach and Contribution

• Approach: µ-cuDNN, a thin wrapper library for cuDNN, which
  • divides a mini-batch into "micro-batches" by applying loop splitting
  • is based on Dynamic Programming (DP) and Integer Linear Programming (ILP)
  • provides a Python interface for high-level optimization

Timeline example (conv1 → relu1 → pool1 → conv2): plain cuDNN runs every layer at N = 256, using GEMM-based convolution for conv1; µ-cuDNN instead runs conv1 as two N = 128 micro-batches with FFT-based convolution and conv2 with N = 64 micro-batches, while relu1 and pool1 stay at N = 256.

• Contribution:
  • 1.60x speedup for AlexNet on V100-SXM2 GPU
  • up to 4.54x speedup (1.60x on average) for DeepBench on V100-SXM2 GPU
µ-cuDNN
µ-cuDNN - Software stack

• µ-cuDNN is a wrapper library for cuDNN, which can be called by
  1. a DL framework, as a low-level performance tuning library
  2. its dedicated Python frontend, for high-level performance analysis

Figure 4: µ-cuDNN software stack (user code / Python frontend → DL framework → µ-cuDNN → cuDNN → CUDA → NVIDIA GPU, with a file-based DB of results on a local or networked file system).
µ-cuDNN - Methodology

• µ-cuDNN is enabled by replacing the cuDNN handle type cudnnHandle_t
  1. The DL framework passes the layer's metadata via cudnnGetConvolution*Algorithm
  2. µ-cuDNN runs the ILP (or DP) optimizer and returns the resulting workspace size
  3. The framework calls cudnnConvolution* with that workspace size
  4. µ-cuDNN internally calls the convolution function one or more times

Figure 5: Workflow of µ-cuDNN. The framework's setup loop (cudnnGetConvolution*Algorithm, cudaMalloc of per-layer workspaces) and training loop (cudnnConvolution*) go through µ-cuDNN, which applies the DP and ILP optimizers, caches optimization results in memory, and dispatches UcudnnConvolution* with the chosen workspace pointer.
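Under the hood, step 4 amounts to replaying one cuDNN call per micro-batch. A simplified sketch of what such a wrapper might do; the names and structure are our assumptions (the slides do not show UcudnnConvolution*'s internals), and it assumes the micro-batch size divides N and that xDesc/yDesc already describe a size-b batch:

    #include <cudnn.h>
    #include <cstddef>

    // Hypothetical wrapper state: the optimizer's decision for one layer.
    struct MicroBatchPlan {
      int microBatch;                  // chosen micro-batch size b <= N
      cudnnConvolutionFwdAlgo_t algo;  // fastest algorithm at batch size b
    };

    // Sketch of a micro-batched forward convolution: one cuDNN call per slice.
    cudnnStatus_t microBatchedConvolutionForward(
        cudnnHandle_t handle, const MicroBatchPlan& plan, int N,
        const void* alpha, cudnnTensorDescriptor_t xDesc, const float* x,
        cudnnFilterDescriptor_t wDesc, const void* w,
        cudnnConvolutionDescriptor_t convDesc, void* workspace, size_t wsSize,
        const void* beta, cudnnTensorDescriptor_t yDesc, float* y,
        size_t xStride, size_t yStride) {  // per-sample element strides
      for (int n = 0; n < N; n += plan.microBatch) {
        cudnnStatus_t s = cudnnConvolutionForward(
            handle, alpha, xDesc, x + n * xStride, wDesc, w, convDesc,
            plan.algo, workspace, wsSize, beta, yDesc, y + n * yStride);
        if (s != CUDNN_STATUS_SUCCESS) return s;
      }
      return CUDNN_STATUS_SUCCESS;
    }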
Workspace policies

• µ-cuDNN employs one of two workspace utilization policies:
  • Workspace Reuse (WR): Each layer reuses a private workspace
  • Workspace Division (WD): Each layer uses a part of a unified workspace

                          WR               WD
  Maximum total WS size   O(# of layers)   constant
  Optimizer               DP               DP+ILP
  WS owner                DL framework     µ-cuDNN
µ-cuDNN WR
Workspace policies - WR

• Problem: Given a mini-batch size B and the fastest execution time T_µ(b) of a single micro-batch of size b (b = 1, 2, ..., B), compute T(B) where

$$ T(b) = \min\Bigl\{\, T_\mu(b),\ \min_{b' = 1, 2, \dots, b-1} \bigl( T(b') + T(b - b') \bigr) \Bigr\} $$

Timeline example (conv1, B = 256): T_µ(60) is the time of a single micro-batch of size b = 60; T(120) can be realized as two b = 60 runs; T(256) might be realized as micro-batches of sizes 60 + 60 + 60 + 60 + 16.
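This recurrence is a small one-dimensional dynamic program. A sketch of the WR optimizer, assuming a timeMicro(b) callback that returns the benchmarked T_µ(b) (e.g. obtained via cudnnFindConvolutionForwardAlgorithm under the workspace limit):

    #include <algorithm>
    #include <functional>
    #include <vector>

    // WR dynamic program: T(b) = min{ T_mu(b), min_{b'} T(b') + T(b - b') }.
    std::vector<double> optimizeWR(int B, std::function<double(int)> timeMicro) {
      std::vector<double> T(B + 1, 0.0);
      for (int b = 1; b <= B; b++) {
        T[b] = timeMicro(b);            // run the whole size-b batch in one call
        for (int bp = 1; bp < b; bp++)  // or split it into two sub-batches
          T[b] = std::min(T[b], T[bp] + T[b - bp]);
      }
      return T;  // T[B] is the optimal time; backtracking recovers the split
    }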