

  1. Extremely Low-bit Convolution Optimization for Quantized Neural Network on Modern Computer Architectures. Qingchang Han 1,2, Yongmin Hu 1, Fengwei Yu 2, Hailong Yang 1, Bing Liu 2, Peng Hu 1,2, Ruihao Gong 1,2, Yanfei Wang 2, Rui Wang 1, Zhongzhi Luan 1, Depei Qian 1. School of Computer Science and Engineering, Beihang University 1, Beijing, China; SenseTime Research 2

  2. Outline ◼ Background & Motivation ◼ CNN & Quantized Neural Network ◼ Low-bit Computation on Modern Computer Architectures ◼ Optimization Methods ◼ Low-bit Convolution on ARM CPU ◼ Low-bit Convolution on NVIDIA GPU ◼ Evaluation ◼ Experiment Setup ◼ Performance Analysis ◼ Conclusion

  3. Outline ◼ Background & Motivation ◼ CNN & Quantized Neural Network ◼ Low-bit Computation on Modern Computer Architectures ◼ Optimization Methods ◼ Low-bit Convolution on ARM CPU ◼ Low-bit Convolution on NVIDIA GPU ◼ Evaluation ◼ Experiment Setup ◼ Performance Analysis ◼ Conclusion

  4. Convolutional Neural Network ◼ CNNs power applications such as speech recognition, computer vision, autonomous driving, and recommendation systems [figure: a typical CNN pipeline: Input, Convolution, Pooling, Convolution, Pooling, Flatten, FC, Output] ◼ The computation complexity and memory footprint of CNNs need to be optimized ◼ Convolution layers take 90%-99% of the computation and runtime [Chen et al., ISSCC'16]

  5. Model Compression ◼ Model compression reduces computation complexity with acceptable accuracy loss; typical approaches include network pruning and model quantization ◼ Model quantization maps data to a smaller set of numerical representations, e.g., from FP32 (1-bit sign, 8-bit exponent, 23-bit mantissa) to INT8 (1-bit sign, 7-bit mantissa) ◼ It improves performance and reduces memory footprint while preserving accuracy ◼ Example: int8 Conv2d quantization: x_int = round(x_f / scale), x_q = clip(x_int, -128, 127); dequantization: x_f = scale × x_q
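
Below is a minimal sketch of that int8 quantize/dequantize mapping. The symmetric per-tensor scheme, the scale value, and the function names are our own assumptions for illustration, not the paper's implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Quantize: x_int = round(x_f / scale), x_q = clip(x_int, -128, 127)
int8_t quantize(float x_f, float scale) {
    int x_int = static_cast<int>(std::lround(x_f / scale));
    return static_cast<int8_t>(std::min(127, std::max(-128, x_int)));
}

// Dequantize: x_f = scale * x_q
float dequantize(int8_t x_q, float scale) {
    return scale * static_cast<float>(x_q);
}

int main() {
    float scale = 0.05f;   // assumed calibration result
    float x = 1.234f;
    int8_t q = quantize(x, scale);
    printf("x=%.3f -> q=%d -> dq=%.3f\n", x, (int)q, dequantize(q, scale));
}
```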

  6. Accuracy of Quantized Neural Network [figure: accuracy comparison of low-bit QNNs on ImageNet, Esser et al., ICLR'20] ◼ Recent works have demonstrated the accuracy of quantized neural networks ◼ An 8-bit quantized model can almost reach the same accuracy as the full-precision one ◼ Lower-bit quantized models (e.g., 2~4-bit) lose only a little accuracy compared to the full-precision ones ◼ However, achieving optimal performance of QNNs across different computer architectures is challenging and less studied in the literature

  7. The Target Architectures for Optimization ◼ The most widely used architectures for CNN inference: edge devices (ARM CPU) and cloud accelerators (NVIDIA GPU) [figures: shipments of ARM-based chips to date; share of cloud accelerator types] ◼ Both provide architecture support for low-bit arithmetic instructions ◼ ARM CPU: MLA / SMLAL ◼ NVIDIA GPU: dp4a / mma (Tensor Core)

  8. Low-bit Computation Support on ARM CPU ◼ Low-bit arithmetic instructions (ARMv8.1 architecture) ◼ SMLAL (signed multiply-accumulate long): multiplies 8-bit lanes and accumulates into 16-bit lanes ◼ MLA (multiply-accumulate): multiplies 16x8-bit lanes and accumulates into 16x8-bit lanes without widening ◼ SADDW (signed add wide): widens 8-bit lanes into a 16x8-bit accumulator, or 16-bit lanes into a 4x32-bit accumulator
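
For reference, these instructions are reachable from C/C++ through NEON intrinsics. The snippet below is only an illustration of the usual intrinsic-to-instruction mapping (it requires an AArch64 compiler); the function and variable names are ours, not from the paper.

```cpp
#include <arm_neon.h>
#include <cstdint>

// a and b must point to at least 16 int8 elements.
void low_bit_mac_demo(const int8_t* a, const int8_t* b,
                      int16x8_t* acc16, int8x16_t* acc8, int32x4_t* acc32) {
    int8x8_t va8 = vld1_s8(a);                        // 8 x int8
    int8x8_t vb8 = vld1_s8(b);
    *acc16 = vmlal_s8(*acc16, va8, vb8);              // SMLAL: 8-bit products -> 16-bit lanes

    int8x16_t va16 = vld1q_s8(a);                     // 16 x int8
    int8x16_t vb16 = vld1q_s8(b);
    *acc8 = vmlaq_s8(*acc8, va16, vb16);              // MLA: 8-bit products -> 8-bit lanes

    *acc32 = vaddw_s16(*acc32, vget_low_s16(*acc16)); // SADDW: widen 16-bit lanes -> 32-bit
}
```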

  9. Low-bit Computation Support on NVIDIA GPU ◼ Tensor Core [figure: Turing SM with warp schedulers, register files, CUDA cores, Tensor Cores, and L1 data cache / shared memory] ◼ Natively supports mixed-precision GEMM (INT8/INT4 inputs with INT32 accumulation) ◼ INT8/INT4/INT1 for Turing Tensor Cores ◼ Powerful inference performance: RTX 2080 Ti delivers up to 215.2 TOPS of INT8 inference performance ◼ Use of Tensor Core ◼ WMMA API ◼ PTX mma instructions (e.g., mma.m8n8k16) ◼ Vendor libraries: cuBLAS/cuDNN (fp16 only at present)

  10. Existing Frameworks/Libraries Supporting Low-bit Conv2d ◼ ARM CPU ◼ ncnn: 8-bit Conv2d (GEMM-based & Winograd) ◼ QNNPACK: 8-bit Conv2d (indirect convolution) ◼ TFLite: 8-bit Conv2d ◼ TVM: 1/2-bit Conv2d (popcount) / 8-bit Conv2d (spatial pack) ◼ NVIDIA GPU ◼ cuDNN: 8-bit Conv2d (dp4a) / 16-bit Conv2d (Tensor Core) ◼ TensorRT: 8-bit Conv2d (Tensor Core) ◼ CUTLASS: 1/4/8-bit GEMM (Tensor Core) ◼ There is no public work that supports extremely low-bit convolution covering a wide range of bit widths on ARM CPU (2~8-bit) and NVIDIA GPU (4-bit/8-bit) ◼ The missing support for extremely low-bit convolution motivates us to provide efficient implementations on ARM CPU and NVIDIA GPU

  11. Outline ◼ Background & Motivation ◼ CNN & Quantized Neural Network ◼ Low-bit Computation on Modern Computer Architectures ◼ Optimization Methods ◼ Low-bit Convolution on ARM CPU ◼ Low-bit Convolution on NVIDIA GPU ◼ Evaluation ◼ Experiment Setup ◼ Performance Analysis ◼ Conclusion

  12. Re-designing GEMM Computation on ARM CPU ◼ Re-design the GEMM micro-kernel (a scalar sketch follows below): 1. Load one column of Matrix A into Buffer A 2. Load one row of Matrix B into Buffer B, and replicate it into each row of Buffer B 3. Perform element-wise multiplication between Buffer A and each column vector of Buffer B, and accumulate the results into Buffer C 4. After all the calculations are done, copy the data of Buffer C into Matrix C [figure: Matrix A/B/C in memory, Buffers A/B/C in registers, element-wise multiplication]
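
The scalar sketch below mirrors those four steps; it is our own illustration, not the authors' assembly kernel, and the 4x4 tile size (MR, NR) is an assumption.

```cpp
#include <cstdint>
#include <cstring>

constexpr int MR = 4, NR = 4;   // micro-kernel tile size (assumption)

// A is an MR x K panel packed column-wise, B is a K x NR panel packed row-wise,
// C is the MR x NR output tile.
void micro_kernel(const int8_t* A, const int8_t* B, int32_t* C, int K) {
    int32_t bufC[MR][NR] = {};                     // Buffer C kept in registers
    for (int k = 0; k < K; ++k) {
        int8_t bufA[MR];                           // step 1: one column of A
        std::memcpy(bufA, A + k * MR, MR);
        int8_t bufB[MR][NR];                       // step 2: one row of B,
        for (int i = 0; i < MR; ++i)               //         replicated per row
            std::memcpy(bufB[i], B + k * NR, NR);
        for (int i = 0; i < MR; ++i)               // step 3: element-wise multiply
            for (int j = 0; j < NR; ++j)           //         and accumulate
                bufC[i][j] += int32_t(bufA[i]) * int32_t(bufB[i][j]);
    }
    for (int i = 0; i < MR; ++i)                   // step 4: write back to Matrix C
        for (int j = 0; j < NR; ++j)
            C[i * NR + j] = bufC[i][j];
}
```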

  13. Re-designing GEMM Computation on ARM CPU ◼ Data padding and packing optimization ◼ Perform zero-padding when a dimension of the data is not a multiple of the required dimension ◼ Perform data packing to enable contiguous data access (a sketch follows below) [figure: Matrix A is zero-padded and packed column-wise (A11 A21 A31 0, A12 A22 A32 0, ...); Matrix B is zero-padded and packed row-wise (B11 B12 B13 0, B21 B22 B23 0, ...)]
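
A possible packing routine under the same assumptions (row-major inputs, panel sizes MR and NR as in the previous sketch) is shown below; the real kernels pack in hand-written assembly, so treat this only as an index-layout reference.

```cpp
#include <cstdint>
#include <vector>

// Pack A (M x K, row-major) into MR-high panels; within a panel each column is contiguous.
std::vector<int8_t> pack_A(const int8_t* A, int M, int K, int MR) {
    int Mp = (M + MR - 1) / MR * MR;                  // zero-pad M to a multiple of MR
    std::vector<int8_t> packed(size_t(Mp) * K, 0);
    for (int mp = 0; mp < Mp; mp += MR)
        for (int k = 0; k < K; ++k)
            for (int i = 0; i < MR; ++i)
                if (mp + i < M)                       // padded rows stay zero
                    packed[size_t(mp) * K + size_t(k) * MR + i] = A[size_t(mp + i) * K + k];
    return packed;
}

// Pack B (K x N, row-major) into NR-wide panels; within a panel each row is contiguous.
std::vector<int8_t> pack_B(const int8_t* B, int K, int N, int NR) {
    int Np = (N + NR - 1) / NR * NR;                  // zero-pad N to a multiple of NR
    std::vector<int8_t> packed(size_t(K) * Np, 0);
    for (int np = 0; np < Np; np += NR)
        for (int k = 0; k < K; ++k)
            for (int j = 0; j < NR; ++j)
                if (np + j < N)                       // padded columns stay zero
                    packed[size_t(np) * K + size_t(k) * NR + j] = B[size_t(k) * N + (np + j)];
    return packed;
}
```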

  14. Instruction and Register Allocation Optimization on ARM CPU ◼ Optimized instruction schemes for GEMM ◼ For 4 to 8-bit GEMM, we choose SMLAL and SADDW: (1) SMLAL accumulates 8-bit products into 16x8-bit lanes until the 16-bit accumulator would overflow, then (2) SADDW widens the 16-bit partial sums into 4x32-bit accumulators ◼ For 2 to 3-bit GEMM, we choose MLA and SADDW: (1) MLA accumulates products into 16x8-bit lanes until overflow is possible, (2) a first SADDW widens the 8-bit partial sums into 16-bit lanes, and (3) a second SADDW widens the 16-bit sums into 4x32-bit accumulators (an intrinsics sketch of the 4~8-bit scheme follows below)
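
Below is a sketch of the 4~8-bit scheme using NEON intrinsics (vmlal_s8 for SMLAL, vaddw_s16 for SADDW). It needs an AArch64 compiler; the dot-product framing, the loop structure, and the smlal_bound parameter (how many SMLALs fit in int16 before overflow, which depends on the bit width) are our assumptions rather than the paper's exact kernel.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Dot product of two low-bit vectors stored as int8 (len assumed to be a multiple of 8).
// smlal_bound = number of SMLALs that fit in int16 without overflow; it is larger for
// narrower data (e.g., 4-bit) than for 8-bit data.
int32_t lowbit_dot(const int8_t* a, const int8_t* b, int len, int smlal_bound) {
    int32x4_t acc32_lo = vdupq_n_s32(0), acc32_hi = vdupq_n_s32(0);
    int i = 0;
    while (i < len) {
        int16x8_t acc16 = vdupq_n_s16(0);
        for (int u = 0; u < smlal_bound && i < len; ++u, i += 8) {
            int8x8_t va = vld1_s8(a + i);
            int8x8_t vb = vld1_s8(b + i);
            acc16 = vmlal_s8(acc16, va, vb);                   // step 1: SMLAL into 16-bit lanes
        }
        acc32_lo = vaddw_s16(acc32_lo, vget_low_s16(acc16));   // step 2: SADDW into 32-bit
        acc32_hi = vaddw_s16(acc32_hi, vget_high_s16(acc16));
    }
    return vaddvq_s32(vaddq_s32(acc32_lo, acc32_hi));          // horizontal sum
}
```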

  15. Instruction and Register Allocation Optimization on ARM CPU ◼ Register allocation optimization ◼ For 4~8-bit input data: Buffer A is double-buffered in v0/v1 and Buffer B in v2~v5/v6~v9; SMLAL writes 16-bit temporary results into v10~v17, and SADDW accumulates them into the 32-bit Buffer C held in v18~v31 ◼ For 2~3-bit input data: Buffer A uses v0~v3 and Buffer B uses v4~v7; MLA keeps 8-bit temporary results in v8~v11, a first SADDW widens them into 16-bit temporaries in v12~v19, and a second SADDW accumulates into the 32-bit Buffer C in v20~v31
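
The 2~3-bit scheme can be sketched with intrinsics as below (vmlaq_s8 for MLA, vaddw_s8 and vaddw_s16 for the two SADDW stages). As before, this is our own illustration: the compiler allocates registers rather than the fixed v0~v31 assignment above, and the mla_bound/saddw_bound parameters stand in for the overflow analysis described in the paper.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Dot product of two low-bit vectors stored as int8 (len assumed to be a multiple of 16).
int32_t lowbit_dot_2b(const int8_t* a, const int8_t* b, int len,
                      int mla_bound, int saddw_bound) {
    int32x4_t acc32 = vdupq_n_s32(0);
    int i = 0;
    while (i < len) {
        int16x8_t acc16_lo = vdupq_n_s16(0), acc16_hi = vdupq_n_s16(0);
        for (int s = 0; s < saddw_bound && i < len; ++s) {
            int8x16_t acc8 = vdupq_n_s8(0);
            for (int u = 0; u < mla_bound && i < len; ++u, i += 16) {
                int8x16_t va = vld1q_s8(a + i);
                int8x16_t vb = vld1q_s8(b + i);
                acc8 = vmlaq_s8(acc8, va, vb);                 // step 1: MLA in 8-bit lanes
            }
            acc16_lo = vaddw_s8(acc16_lo, vget_low_s8(acc8));  // step 2: SADDW to 16-bit
            acc16_hi = vaddw_s8(acc16_hi, vget_high_s8(acc8));
        }
        acc32 = vaddw_s16(acc32, vget_low_s16(acc16_lo));      // step 3: SADDW to 32-bit
        acc32 = vaddw_s16(acc32, vget_high_s16(acc16_lo));
        acc32 = vaddw_s16(acc32, vget_low_s16(acc16_hi));
        acc32 = vaddw_s16(acc32, vget_high_s16(acc16_hi));
    }
    return vaddvq_s32(acc32);
}
```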

  16. Winograd Optimization on ARM CPU ◼ Winograd method ◼ Achieves acceleration by reducing the number of multiplications ◼ Converts the convolution computation to the form Y = A^T [(G g G^T) ⊙ (B^T d B)] A, where g is the filter tile, d is the input tile, and ⊙ denotes element-wise multiplication ◼ Apply F(2x2, 3x3) to 4~6-bit convolution ◼ Ensure the transformed data stays within 8-bit precision (for more details, please refer to our paper): F(2x2, 3x3) requires inputs of no more than 6 bits, while F(4x4, 3x3) causes an unacceptable increase of the numerical range (see the sketch below) ◼ Why not 2 to 3-bit convolution? The maximum theoretical speedup of F(2x2, 3x3) is 2.25x, but the MLA instruction is 2x faster than the SMLAL instruction, which offsets the performance advantage of the Winograd method
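
The small program below illustrates the 6-bit bound: with the standard F(2x2, 3x3) input transform V = B^T d B (Lavin & Gray), each output element is a signed combination of at most four inputs, so a 6-bit signed range [-32, 31] grows to at most 126 in magnitude, which still fits in int8. The worst-case input pattern is our own construction.

```cpp
#include <algorithm>
#include <cstdio>
#include <cstdlib>

static const int BT[4][4] = {            // B^T for F(2x2, 3x3)
    { 1,  0, -1,  0},
    { 0,  1,  1,  0},
    { 0, -1,  1,  0},
    { 0,  1,  0, -1},
};

// V = B^T * d * B for a 4x4 integer input tile d.
void input_transform(const int d[4][4], int V[4][4]) {
    int tmp[4][4];
    for (int i = 0; i < 4; ++i)          // tmp = B^T * d
        for (int j = 0; j < 4; ++j) {
            tmp[i][j] = 0;
            for (int k = 0; k < 4; ++k) tmp[i][j] += BT[i][k] * d[k][j];
        }
    for (int i = 0; i < 4; ++i)          // V = tmp * B, using B[k][j] == BT[j][k]
        for (int j = 0; j < 4; ++j) {
            V[i][j] = 0;
            for (int k = 0; k < 4; ++k) V[i][j] += tmp[i][k] * BT[j][k];
        }
}

int main() {
    int d[4][4], V[4][4];
    for (int i = 0; i < 4; ++i)          // adversarial 6-bit signed input in [-32, 31]
        for (int j = 0; j < 4; ++j) d[i][j] = ((i < 2) == (j < 2)) ? 31 : -32;
    input_transform(d, V);
    int maxabs = 0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) maxabs = std::max(maxabs, std::abs(V[i][j]));
    printf("max |V| = %d, fits in int8: %s\n", maxabs, maxabs <= 127 ? "yes" : "no");
}
```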

  17. Implicit-precomp GEMM Method on GPU ◼ Implicit GEMM ◼ Avoids the global im2col matrix transformation and reduces the memory footprint ◼ Precomputed buffer ◼ Stores the offsets of the input elements in a precomputed buffer, so the im2col Matrix A is never materialized [figure: INPUT of shape N*IH*IW*IC, the implicit Matrix A, and the precomputed offset buffer] ◼ GEMM dimensions: M = N*OH*OW, K = KH*KW*IC, N = OC (an offset sketch follows below)
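
A sketch of the offset precomputation is shown below. The NHWC input layout, stride 1, the omission of padding, and the choice to store only the KH*KW*IC per-window offsets are our assumptions used to illustrate the idea; the paper's buffer layout may differ.

```cpp
#include <cstdint>
#include <vector>

// For each k in [0, KH*KW*IC), store the offset of the corresponding input element
// relative to the top-left pixel of the convolution window (NHWC layout).
std::vector<int32_t> precompute_offsets(int IW, int IC, int KH, int KW) {
    std::vector<int32_t> offsets(size_t(KH) * KW * IC);
    for (int kh = 0; kh < KH; ++kh)
        for (int kw = 0; kw < KW; ++kw)
            for (int ic = 0; ic < IC; ++ic)
                offsets[(size_t(kh) * KW + kw) * IC + ic] =
                    (kh * IW + kw) * IC + ic;    // step kh rows, kw columns, ic channels
    return offsets;
}

// Implicit Matrix A element A[m][k], with m = (n*OH + oh)*OW + ow (stride 1, no padding):
//   base    = ((n*IH + oh)*IW + ow) * IC
//   A[m][k] = INPUT[base + offsets[k]]
```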

  18. Data Partition along the Thread Hierarchy on GPU: (a) Grid Level ◼ Divide matrices A, B, and C into tiles of size MTile, NTile, and KTile; each thread block computes one C_Tile [figure: Matrix A/B/C in global memory (GMEM) split into A_Tile/B_Tile (shared memory, SMEM) and C_Tile; M = N*OH*OW, K = KH*KW*IC, N = OC]

  19. Data Partition along the Thread Hierarchy on GPU: (b) Block Level ◼ Divide C_Tile, A_Tile, and B_Tile into per-warp fragments by blockRowWarpNum and blockColWarpNum ◼ Split the KTile loop by KStep, staging A_Tile/B_Tile in shared memory (SMEM) and the A/B/C fragments in registers [figure: (a) grid level, (b) block level, (c) warp level; M = N*OH*OW, K = KH*KW*IC, N = OC] (a loop-structure sketch follows below)
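
To make the partition concrete, here is a scalar reference of the loop structure. It is our own sketch, not the authors' CUDA kernel: the tile sizes are illustrative assumptions, MFrag/NFrag stand for the per-warp fragment sizes, and on the GPU the innermost block corresponds to Tensor Core mma/wmma operations rather than scalar code.

```cpp
#include <cstdint>

constexpr int MTile = 128, NTile = 128, KTile = 64, KStep = 16;  // assumed tile sizes
constexpr int MFrag = 64,  NFrag = 64;                           // assumed per-warp fragment

// Row-major A (M x K), B (K x N), C (M x N); C must be zero-initialized by the caller.
void gemm_tiled(const int8_t* A, const int8_t* B, int32_t* C, int M, int N, int K) {
    for (int mt = 0; mt < M; mt += MTile)                        // (a) grid level: one C_Tile
    for (int nt = 0; nt < N; nt += NTile)                        //     per "thread block"
      for (int kt = 0; kt < K; kt += KTile)                      //     KTile loop
        for (int ks = kt; ks < kt + KTile && ks < K; ks += KStep)        // KStep: one SMEM stage
          for (int mf = mt; mf < mt + MTile && mf < M; mf += MFrag)      // (b) block level:
          for (int nf = nt; nf < nt + NTile && nf < N; nf += NFrag)      //     one fragment per warp
            for (int i = mf; i < mf + MFrag && i < M; ++i)               // (c) warp level: scalar
              for (int j = nf; j < nf + NFrag && j < N; ++j) {           //     stand-in for mma/wmma
                int32_t acc = 0;
                for (int k = ks; k < ks + KStep && k < K; ++k)
                    acc += int32_t(A[size_t(i) * K + k]) * int32_t(B[size_t(k) * N + j]);
                C[size_t(i) * N + j] += acc;                     // C accumulators live in registers on GPU
              }
}
```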
