CENG5030 Part 2-4: CNN Inaccurate Speedup-2 — Quantization




  1. CENG5030 Part 2-4: CNN Inaccurate Speedup-2 — Quantization. Bei Yu (Latest update: March 25, 2019). Spring 2019. 1 / 9

  2. These slides contain/adapt materials developed by ◮ Suyog Gupta et al. (2015). “Deep learning with limited numerical precision”. In: Proc. ICML, pp. 1737–1746 ◮ Ritchie Zhao et al. (2017). “Accelerating binarized convolutional neural networks with software-programmable FPGAs”. In: Proc. FPGA, pp. 15–24 ◮ Mohammad Rastegari et al. (2016). “XNOR-NET: Imagenet classification using binary convolutional neural networks”. In: Proc. ECCV, pp. 525–542. 2 / 9

  3. 3 / 9

  4. "What should I learn to do well in computer vision research?" "I want to research on a topic with DEAP LEARNING in it?" 3 / 9

  5. DEEP LEARNING 3 / 9

  6. GPU Server 3 / 9

  7. Ohhh No!!! 3 / 9

  8. State-of-the-art recognition methods are very expensive: memory, computation, power. 3 / 9

  9. Overview Fixed-Point Representation Binary/Ternary Network Reading List 4 / 9

  10. Overview Fixed-Point Representation Binary/Ternary Network Reading List 5 / 9

  11. Fixed-Point vs. Floating-Point 5 / 9

  12. Fixed-Point vs. Floating-Point 5 / 9

  13. Fixed-Point vs. Floating-Point 5 / 9
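To make the fixed-point side concrete, here is a minimal Python sketch of a ⟨IL, FL⟩ fixed-point format in the spirit of Gupta et al.: IL integer bits, FL fractional bits, word length WL = IL + FL, granularity ε = 2^-FL. The function name and the saturating behaviour are illustrative assumptions, not code from the lecture.

```python
def to_fixed(x, il=4, fl=12):
    """Quantize a float to the nearest value representable in <il, fl>
    fixed point: step size eps = 2^-fl, range [-2^(il-1), 2^(il-1) - eps]."""
    eps = 2.0 ** -fl
    lo = -(2.0 ** (il - 1))
    hi = 2.0 ** (il - 1) - eps
    q = round(x / eps) * eps          # round-to-nearest multiple of eps
    return min(max(q, lo), hi)        # saturate to the representable range

print(to_fixed(3.14159, il=4, fl=4))  # -> 3.125 (nearest multiple of 1/16)
```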

  14. Fixed-Point Arithmetic: number representation and granularity. ¹ Suyog Gupta et al. (2015). “Deep learning with limited numerical precision”. In: Proc. ICML, pp. 1737–1746. 6 / 9

  15. Fixed-Point Arithmetic: number representation; Multiply-and-ACCumulate (MACC) with a WL-bit multiplier feeding a wide (48-bit) accumulator; granularity. ¹ Suyog Gupta et al. (2015). “Deep learning with limited numerical precision”. In: Proc. ICML, pp. 1737–1746. 6 / 9

  16. Fixed-Point Arithmetic: Rounding Modes. Round-to-nearest. ¹ Suyog Gupta et al. (2015). “Deep learning with limited numerical precision”. In: Proc. ICML, pp. 1737–1746. 6 / 9

  17. Fixed-Point Arithmetic: Rounding Modes. Round-to-nearest vs. stochastic rounding. Stochastic rounding assigns a non-zero probability of rounding to either of the two nearest grid points (⌊x⌋ or ⌊x⌋ + ε). It is an unbiased rounding scheme: the expected rounding error is zero. ¹ Suyog Gupta et al. (2015). “Deep learning with limited numerical precision”. In: Proc. ICML, pp. 1737–1746. 6 / 9
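As a rough illustration of the stochastic mode, the sketch below rounds onto a grid of step ε = 2^-FL: x is rounded up with probability proportional to its fractional remainder, so the expected rounded value equals x. This is my own minimal Python rendering of the scheme described by Gupta et al., not their implementation.

```python
import random

def stochastic_round(x, fl=12):
    """Round x onto a grid of step eps = 2^-fl: round up with probability
    (x - floor(x)) / eps, down otherwise, so the expected error is zero."""
    eps = 2.0 ** -fl
    lo = (x // eps) * eps              # largest grid point <= x
    p_up = (x - lo) / eps              # fractional remainder in [0, 1)
    return lo + eps if random.random() < p_up else lo

# Unbiasedness check: averaging many stochastic roundings recovers x,
# even when the grid step (0.25 here) is much coarser than x itself.
x = 0.3
avg = sum(stochastic_round(x, fl=2) for _ in range(100_000)) / 100_000
print(avg)   # close to 0.3
```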

  18. MNIST: Fully-connected DNNs. [Figure: error curves for fixed-point fractional lengths FL 8, FL 10, FL 14 and a 32-bit float baseline; arrows mark decreasing precision.] ¹ Suyog Gupta et al. (2015). “Deep learning with limited numerical precision”. In: Proc. ICML, pp. 1737–1746. 6 / 9

  19. MNIST: Fully-connected DNNs. [Figure: same error curves for FL 8, FL 10, FL 14 and float.] For small fractional lengths (FL < 12), a large majority of weight updates are rounded to zero when using the round-to-nearest scheme, so convergence slows down. For FL < 12 there is a noticeable degradation in the classification accuracy. ¹ Suyog Gupta et al. (2015). “Deep learning with limited numerical precision”. In: Proc. ICML, pp. 1737–1746. 6 / 9

  20. MNIST: Fully-connected DNNs. Stochastic rounding preserves gradient information (statistically): no degradation in convergence properties, and test error nearly equal to that obtained using 32-bit floats. ¹ Suyog Gupta et al. (2015). “Deep learning with limited numerical precision”. In: Proc. ICML, pp. 1737–1746. 6 / 9

  21. FPGA prototyping: GEMM with stochastic rounding. [Block diagram: a top-level controller and memory hierarchy (8 GB DDR3 SO-DIMM, DDR3 controller, AXI interface, L2 cache in BRAM, input FIFOs for Matrix A and Matrix B) feeds an n x n wavefront systolic array of DSP-based Multiply-and-ACCumulate (MACC) units that computes the matrix product AB; arrows indicate dataflow. Implemented on a Xilinx Kintex K325T FPGA; communication and memory hierarchy are designed to maximize data reuse.] ¹ Suyog Gupta et al. (2015). “Deep learning with limited numerical precision”. In: Proc. ICML, pp. 1737–1746. 6 / 9

  22. Maximizing data reuse. Matrix A is [N x K], Matrix B is [K x M]. Inner loop: cycle through columns of Matrix B (M/n iterations). Outer loop: cycle through rows of Matrix A (K/(p·n) iterations). Re-use factor for Matrix A: M times. Re-use factor for Matrix B: p·n times. Here n is the dimension of the systolic array and p is a parameter chosen based on available BRAM resources; the L2 A-cache holds p·n rows of A and the B-cache holds n columns of B. ¹ Suyog Gupta et al. (2015). “Deep learning with limited numerical precision”. In: Proc. ICML, pp. 1737–1746. 6 / 9
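A simplified loop-nest sketch of this blocking idea is given below: a block of p·n rows of A is held on chip and reused against every column block of B. The pure-Python formulation, argument names, and block bounds are illustrative assumptions; the actual FPGA design streams these blocks through the systolic array rather than looping in software.

```python
def blocked_gemm(A, B, n=2, p=2):
    """C = A @ B with the reuse pattern sketched on the slide: an outer loop
    over row blocks of A (p*n rows cached at a time) and an inner loop over
    column blocks of B (n columns at a time)."""
    N, K, M = len(A), len(A[0]), len(B[0])
    C = [[0.0] * M for _ in range(N)]
    rows = p * n                                   # rows of A cached per block
    for i0 in range(0, N, rows):                   # outer loop: row blocks of A
        for j0 in range(0, M, n):                  # inner loop: column blocks of B
            for i in range(i0, min(i0 + rows, N)):
                for j in range(j0, min(j0 + n, M)):
                    C[i][j] = sum(A[i][k] * B[k][j] for k in range(K))
    return C
```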

  23. Stochastic rounding in the datapath. [Block diagram: the output path of the systolic array adds ROUND units (implemented in DSPs) between the MACC units' local registers and the output C FIFOs.] ¹ Suyog Gupta et al. (2015). “Deep learning with limited numerical precision”. In: Proc. ICML, pp. 1737–1746. 6 / 9

  24. Stochastic rounding in hardware. The accumulated MACC result has its LSBs rounded off: a pseudo-random number generated using an LFSR is added to the LSBs, the LSBs are then truncated, and the result is saturated to the representable limits if it exceeds the range. These operations can be implemented efficiently using a single DSP unit (the ROUND block on the output path). ¹ Suyog Gupta et al. (2015). “Deep learning with limited numerical precision”. In: Proc. ICML, pp. 1737–1746. 6 / 9
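The add-random-then-truncate trick can be sketched in a few lines of integer arithmetic; the Python below stands in for the DSP datapath, with `random.getrandbits` playing the role of the LFSR. Function and parameter names are mine.

```python
import random

def hw_stochastic_round(acc, drop_bits, out_bits):
    """Stochastically round a signed integer accumulator by drop_bits LSBs:
    add a uniform random value below the truncation point, truncate,
    then saturate to a signed out_bits-wide result."""
    acc += random.getrandbits(drop_bits)     # LFSR-style pseudo-random LSBs
    acc >>= drop_bits                        # truncate the rounded-off LSBs
    hi = (1 << (out_bits - 1)) - 1
    lo = -(1 << (out_bits - 1))
    return max(lo, min(hi, acc))             # saturate if the result overflows
```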

  25. Overview Fixed-Point Representation Binary/Ternary Network Reading List 7 / 9

  26. Binarized Neural Networks (BNN). Key differences from a CNN: 1. inputs are binarized (−1 or +1); 2. weights are binarized (−1 or +1); 3. results are binarized after batch normalization. [Figure: a CNN convolves a real-valued input map with real-valued weights to produce a real-valued output map; a BNN convolves a binary input map with binary weights to produce an integer output map, which is then batch-normalized and binarized.] Batch normalization: y_ij = γ·(x_ij − μ)/σ + β. Binarization: b_ij = +1 if y_ij ≥ 0, −1 otherwise. 7 / 9
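Step 3 (binarize after batch normalization) can be written out directly; the NumPy sketch below uses the standard batch-norm formula with a small eps for numerical stability, which is an assumption since the slide does not spell out the exact variant.

```python
import numpy as np

def bnn_activation(x, gamma, beta, mu, sigma, eps=1e-5):
    """Batch-normalize the (integer) conv output x, then binarize:
    y = gamma * (x - mu) / sqrt(sigma^2 + eps) + beta; b = +1 if y >= 0 else -1."""
    y = gamma * (x - mu) / np.sqrt(sigma ** 2 + eps) + beta
    return np.where(y >= 0, 1, -1)

print(bnn_activation(np.array([3, -7, 0, 5]), 1.0, 0.0, 0.0, 1.0))  # [ 1 -1  1  1]
```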

  27. BNN CIFAR-10 Architecture [2]. [Figure: feature map dimensions shrink 32x32 → 16x16 → 8x8 → 4x4 while the number of feature maps grows 3 → 128 → 128 → 256 → 256 → 512 → 512, followed by dense layers of 1024, 1024, and 10 units.] 6 conv layers, 3 dense layers, 3 max pooling layers. All conv filters are 3x3. The first conv layer takes in floating-point input. 13.4 Mbits total model size (after hardware optimizations). [2] M. Courbariaux et al. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv:1602.02830, Feb 2016. 7 / 9

  28. Advantages of BNN. 1. Floating point ops are replaced with binary logic ops: encode {+1, −1} as {0, 1} → multiplies become XORs; conv/dense layers do dot products via XOR and popcount (see the sketch below); operations can map to LUT fabric as opposed to DSPs.

      b1   b2   b1 x b2  |  b1   b2   b1 XOR b2
      +1   +1     +1     |   0    0       0
      +1   −1     −1     |   0    1       1
      −1   +1     −1     |   1    0       1
      −1   −1     +1     |   1    1       0

  2. Binarized weights may reduce total model size: fewer bits per weight may be offset by having more weights. 7 / 9
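The XOR-and-popcount trick looks like this in a small Python sketch (the bit packing, names, and example vectors are mine): with +1 encoded as 0 and −1 as 1, every mismatched bit contributes −1 and every match +1, so the dot product of two n-element ±1 vectors is n − 2·popcount(a XOR b).

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two n-element {+1,-1} vectors packed as integers
    (bit = 0 encodes +1, bit = 1 encodes -1)."""
    mismatches = bin(a_bits ^ b_bits).count("1")   # positions where signs differ
    return n - 2 * mismatches

# a = [+1, -1, +1, +1] -> 0b0100, b = [+1, -1, -1, +1] -> 0b0110
print(binary_dot(0b0100, 0b0110, 4))   # 2, same as (+1) + (+1) + (-1) + (+1)
```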

  29. BNN vs CNN Parameter Efficiency.

      Architecture            Depth   Param Bits (Float)   Param Bits (Fixed-Point)   Error Rate (%)
      ResNet [3] (CIFAR-10)    164         51.9M                   13.0M*                 11.26
      BNN [2]                    9           -                     13.4M                  11.40

      * Assuming each float param can be quantized to 8-bit fixed-point.

  Comparison: a conservative assumption that ResNet can use 8-bit weights; BNN is based on VGG (a less advanced architecture); BNN seems to hold promise! [2] M. Courbariaux et al. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv:1602.02830, Feb 2016. [3] K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. ECCV 2016. 7 / 9
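As a rough sanity check of the starred entry: 51.9 Mbit of 32-bit float parameters is about 51.9/32 ≈ 1.6 M parameters, and requantizing those to 8-bit fixed point gives roughly 1.6 M × 8 ≈ 13.0 Mbit, which lands in the same ballpark as the 13.4 Mbit all-binary BNN model.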

  30. XNOR-Net: operations, memory, and computation comparison.

      Network                    Input / Weights   Operations        Memory Saving   Computation Saving
      Standard convolution         R  /  R         +, −, ×               1x               1x
      Binary Weight Networks       R  /  B         +, −                 ~32x              ~2x
      XNOR-Networks                B  /  B         XNOR, bitcount       ~32x             ~58x

  ² Mohammad Rastegari et al. (2016). “XNOR-NET: Imagenet classification using binary convolutional neural networks”. In: Proc. ECCV, pp. 525–542. 8 / 9

  31. [Build-up of the previous slide's comparison table.] ² Mohammad Rastegari et al. (2016). “XNOR-NET: Imagenet classification using binary convolutional neural networks”. In: Proc. ECCV, pp. 525–542. 8 / 9
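The ~32x memory figure simply reflects replacing 32-bit floating-point weights with 1-bit binary weights (32/1 = 32); the ~2x and ~58x computation figures are the savings reported by Rastegari et al. for binary-weight convolutions and fully binarized (XNOR + bitcount) convolutions, respectively.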

  32. Binary weight approximation: W ≈ α·W^B with W^B = sign(W), so the real-valued convolution is approximated by a binary-weight one, I ∗ W ≈ (I ∗ W^B)·α. ² Mohammad Rastegari et al. (2016). “XNOR-NET: Imagenet classification using binary convolutional neural networks”. In: Proc. ECCV, pp. 525–542. 8 / 9

  33. Quantization Error. W^B = sign(W). [Figure: comparing W with its binary approximation W^B; the annotated quantity is ≈ 0.75.] ² Mohammad Rastegari et al. (2016). “XNOR-NET: Imagenet classification using binary convolutional neural networks”. In: Proc. ECCV, pp. 525–542. 8 / 9

  34. Optimal Scaling Factor: computing α for I ∗ W ≈ (I ∗ W^B)·α.

      α*, W^B* = argmin over (W^B, α) of ||W − α·W^B||²
      W^B* = sign(W)
      α*   = (1/n)·||W||_ℓ1

  ² Mohammad Rastegari et al. (2016). “XNOR-NET: Imagenet classification using binary convolutional neural networks”. In: Proc. ECCV, pp. 525–542. 8 / 9
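A direct NumPy rendering of this closed-form solution (treating sign(0) as +1, which is my convention, not the paper's):

```python
import numpy as np

def binarize_weights(W):
    """XNOR-Net binary-weight approximation: W ~= alpha * B with
    B = sign(W) and alpha = (1/n) * ||W||_l1 = mean(|W|)."""
    B = np.where(W >= 0, 1.0, -1.0)   # sign(W), mapping zeros to +1
    alpha = np.abs(W).mean()          # optimal scaling factor
    return alpha, B

alpha, B = binarize_weights(np.array([0.8, -0.3, 0.5, -1.0]))
print(alpha, B)   # 0.65 [ 1. -1.  1. -1.]
```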
