MAKING MACHINES SMARTER.™
High-Performance GPU Programming for Deep Learning
Scott Gray, Nervana Systems
7 April 2016
High-Performance GPU kernels for deep learning
• Fast matrix multiply for small minibatches
• Direct convolution leveraging GEMM advances
• Even faster convolution with Winograd
GEMM: Basics
C = AB
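For reference, this is just a plain matrix product; in a fully connected layer, A holds a minibatch of activations and B the weights. A minimal NumPy sketch (shapes are illustrative only):

import numpy as np

N, Cin, K = 64, 3072, 3072                      # minibatch, input features, output features
A = np.random.randn(N, Cin).astype(np.float32)  # activations
B = np.random.randn(Cin, K).astype(np.float32)  # weights
Y = A @ B                                       # C = AB, shape (N, K)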
GEMM: Memory Load
[Diagram: outer-product memory loads for a single tile, contiguous vs. strided; threads vs. memory load; batched GEMM case]
GEMM: Tile sizes
[Diagram: batched GEMM 32x32 tiles, 32x64 GEMM tile, 32x32 GEMM tile; threads and shared memory load]
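To see why the smaller tile helps at small minibatch, count the output tiles (one thread block each) that a tiling produces for C = AB. A back-of-the-envelope sketch in Python, with matrix sizes taken from the hGEMM plots that follow and the SM scheduling details left out:

from math import ceil

def num_blocks(rows, cols, tile_rows, tile_cols):
    # one thread block per output tile of C
    return ceil(rows / tile_rows) * ceil(cols / tile_cols)

N, K = 32, 3072                                   # small minibatch x output features
print(num_blocks(N, K, 128, 64))  # 48 blocks, and only 32 of 128 tile rows are useful
print(num_blocks(N, K, 32, 32))   # 96 blocks with no wasted rows -> better occupancy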
hGEMM Results - NN
[Plot: GFLOPS vs. batch size N for an N x 3072 x 3072 NN-op GEMM; Nervana 32x32 tile vs. cuBLAS 128x64 tile]
hGEMM Results - TN
[Plot: GFLOPS vs. batch size N for an N x 3072 x 3072 TN-op GEMM; Nervana 32x32 tile vs. cuBLAS 128x64 tile]
Direct convolution is still relevant
• Striding
• Odd-size filters
• Placeholder until a faster algorithm can be implemented
• Often faster for a single image or the first small-C layer
Direct convolution: implementation details
• Batched GEMM for efficient transpose and higher occupancy
• Compound outer product block remapping
• Square wave pattern for P,Q block mapping
• Slicing: shared memory lookup + integer division (see the sketch after this list)
• N vs C contiguous layouts
• Single P,Q vs tiled P,Q
• Bprop as an upside-down fprop
• Update-specific optimizations
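The slicing bullet can be illustrated with a toy implicit-GEMM fprop in NumPy: a flat filter-tap index is decomposed with integer division into (c, r, s), the matching input values are gathered (zero padding handled by a bounds check), and an N x K output slice is accumulated as an outer product. This is only a sketch of the indexing, not the hand-written Maxwell kernel:

import numpy as np

def conv_fprop_pixel(x, w, p, q, stride=1, pad=1):
    # x: (N, C, H, W) images, w: (K, C, R, S) filters -> (N, K) output slice at (p, q)
    N, C, H, W = x.shape
    K, _, R, S = w.shape
    acc = np.zeros((N, K), dtype=np.float32)
    for t in range(C * R * S):            # the CRS reduction a GEMM kernel would tile
        c, rs = divmod(t, R * S)          # integer division recovers (c, r, s)
        r, s = divmod(rs, S)
        yi = p * stride - pad + r         # input coordinates for this filter tap
        xi = q * stride - pad + s
        if 0 <= yi < H and 0 <= xi < W:   # zero padding: skip out-of-range taps
            acc += np.outer(x[:, c, yi, xi], w[:, c, r, s])
    return acc

x = np.random.randn(8, 4, 8, 8).astype(np.float32)
w = np.random.randn(16, 4, 3, 3).astype(np.float32)
print(conv_fprop_pixel(x, w, p=0, q=0).shape)   # (8, 16)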
Winograd: input transform
• Input transform over 4x4 tiles of the input feature map, stride 2
• 2D Winograd is a nested product of 1D transforms (sketched below)
• Transforms can be simplified to remove zeros
[Diagram: 4x4, stride-2 tiling of the input feature map]
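A NumPy sketch of this step using the standard F(2x2, 3x3) Winograd matrix B^T; this shows only the math, while the real kernel fuses the simplified add/sub form of the transform:

import numpy as np

# B^T for F(2x2, 3x3): maps a 4x4 input tile into the 4x4 transform domain
Bt = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float32)

d = np.random.randn(4, 4).astype(np.float32)   # one 4x4 input tile (tiles overlap by 2)
V = Bt @ d @ Bt.T                              # nested 1D transforms: rows, then columns
# The 0 and +/-1 entries are why this reduces to a handful of adds and subtracts.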
Winograd: filter transform
• Same as the input transform but with different coefficients (sketched below)
• Transform each feature map independently
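The corresponding filter transform, again sketched in NumPy with the standard F(2x2, 3x3) matrix G; each (c, k) filter is transformed independently:

import numpy as np

# G for F(2x2, 3x3): maps a 3x3 filter into the 4x4 transform domain
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)

g = np.random.randn(3, 3).astype(np.float32)   # one 3x3 filter (one c, k pair)
U = G @ g @ G.T                                # 4x4 transformed filter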
Winograd: batched GEMM
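After both transforms, each of the 16 positions in the 4x4 transform domain has an independent GEMM over channels; the elementwise multiply accumulates over C. A NumPy sketch with assumed, purely illustrative shapes:

import numpy as np

# 16 transform positions (4x4); T_N = input tiles x images; C in-channels; K filters
T_N, C, K = 36 * 8, 4, 16
V = np.random.randn(4, 4, T_N, C).astype(np.float32)   # transformed input tiles
U = np.random.randn(4, 4, C, K).astype(np.float32)     # transformed filters

# One independent (T_N x C) @ (C x K) GEMM per transform position -> a batch of 16 GEMMs
M = np.einsum('xytc,xyck->xytk', V, U)
print(M.shape)   # (4, 4, T_N, K)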
Winograd: output transform
• Same structure as the input and filter transforms (sketched below)
• Transform back to pixel space to obtain a 2x2 output tile
[Diagram: 2x2 tiles of the output feature map]
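The matching output transform in NumPy, using the standard F(2x2, 3x3) matrix A^T; again this is just the math, not the fused kernel:

import numpy as np

# A^T for F(2x2, 3x3): maps a 4x4 transform-domain tile back to a 2x2 output tile
At = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float32)

m = np.random.randn(4, 4).astype(np.float32)   # one accumulated transform-domain tile
Y = At @ m @ At.T                              # 2x2 tile of output pixels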
Performance: VGG
[Plot: VGG fp32, totals by operation. Algorithmic speedup vs. batch size for Winograd fp32 vs. cuDNN fp32, shown separately for fprop, bprop, and update]
Performance: Alexnet convolutional layers
[Plot: Alexnet totals. Algorithmic speedup vs. batch size for Nervana fp16/fp32 vs. cuBLAS fp16/fp32]
Compounding
Compounding inside of GEMM and conv for free (sketched below):
• alpha / beta
• bias
• relu, prelu, tanh, …
• bprop relu, …
• bprop bias
• batchnorm mean
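In NumPy terms, compounding amounts to fusing an epilogue like the one below into the GEMM while the output tile is still in registers; this sketch, with hypothetical names, only illustrates the composition, not the fused kernel:

import numpy as np

def gemm_compound(A, B, C_prev, bias, alpha=1.0, beta=0.0):
    # Fused epilogue: alpha/beta scaling, bias add, then an activation,
    # applied to the GEMM result before it is written back (shown here unfused).
    out = alpha * (A @ B) + beta * C_prev
    out += bias                       # broadcasts over the minibatch dimension
    return np.maximum(out, 0.0)       # relu; prelu/tanh/etc. slot in the same way

N, Cin, K = 64, 256, 512
y = gemm_compound(np.random.randn(N, Cin), np.random.randn(Cin, K),
                  np.zeros((N, K)), np.zeros(K))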
Summary
• Nervana has the fastest tools for deep learning
• neon with state-of-the-art Maxwell kernels
• Nervana Cloud with multi-GPU training
• Watch for Nervana Engine, our deep learning processor
Extra plots
VGG
[Plot: VGG totals. Algorithmic speedup vs. batch size for Winograd fp16/fp32 vs. cuDNN fp16/fp32]
GoogLeNetv2 - Totals
[Plot: algorithmic speedup vs. batch size for Winograd fp16/fp32 vs. cuDNN fp16/fp32]
MSRA - Totals
[Plot: algorithmic speedup vs. batch size for Winograd fp16/fp32 vs. cuDNN fp16/fp32]
About nervana
• A platform for machine intelligence
• Enable deep learning at scale
• Optimized from algorithms to silicon
GEMM
• Matrix multiply is the fundamental operation of deep learning
• Used in fully connected layers and as the basis for convolution
• Full utilization is hard to achieve for small minibatches (tall and skinny matrices)
• Carefully optimized memory access patterns
Winograd Convolution
• Similar to FFT:
  • Transform image tile
  • Transform filter
  • EW multiply the two
  • Transform output back
• Optimizations:
  • Kernels for 3x3 pixel filters, 2x2 and 4x4 output tile sizes
  • External vs internal fused transforms
• The transforms are defined in terms of matrix multiplies (see the end-to-end sketch below)
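Putting the four steps together for one tile and one filter, and checking against direct convolution; a small correctness sketch under the F(2x2, 3x3) formulation with the standard matrices:

import numpy as np

Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], dtype=np.float64)
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], dtype=np.float64)
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=np.float64)

d = np.random.randn(4, 4)            # 4x4 input tile
g = np.random.randn(3, 3)            # 3x3 filter
Y = At @ ((G @ g @ G.T) * (Bt @ d @ Bt.T)) @ At.T   # transform, EW multiply, transform back

# Reference: direct "valid" cross-correlation producing the same 2x2 output tile
ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)] for i in range(2)])
assert np.allclose(Y, ref)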