Ultra-low-bit Neural Network Quantization




  1. Ultra-low-bit Neural Network Quantization. Peisong Wang, Institute of Automation, Chinese Academy of Sciences, 2020.06.03. Collaborators: Weixiang Xu, Tianli Zhao, Fanrong Li, Xiangyu He, Gang Li, Jian Cheng, Cong Leng. Contact: peisong.wang@nlpr.ia.ac.cn

  2. Background: Deep Learning. From: Russ Salakhutdinov.

  3. Background: Applications of CNNs. Convolutional Neural Networks: • Classification • Detection • Segmentation

  4. Background: Training. Training ResNet-50 has gone from several days down to: • Facebook: 1 hour • Fast.ai: 18 min • Tencent: 6.6 min • Sony: 3.7 min • Google: 2.2 min • SenseTime: 1.5 min

  5. Background: Real-World Applications. Challenges: low inference speed, large memory/storage footprint, high power consumption. Scenarios: AR/VR, self-driving cars, intelligent surveillance, intelligent robots, face unlock.

  6. Network Acceleration and Compression • Low-rank Decomposition • Sparse/Pruning • Quantization • Knowledge Distillation • …

  7. Fixed-point representation. FP32: 1 sign bit (S), 8 exponent bits (E), 23 mantissa bits (M); value = (−1)^S × 1.M × 2^E. Int8: 1 sign bit (S) and 7 magnitude bits (M); Int4: 1 sign bit (S) and 3 magnitude bits (M); value = (−1)^S × M.
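As a concrete companion to these bit layouts, here is a minimal sketch (my own illustration, not from the slides) of symmetric fixed-point quantization; the max-abs calibration rule and the rounding scheme are assumptions, not the presenters' method:

```python
import numpy as np

def quantize_fixed_point(x, num_bits=8):
    """Symmetric fixed-point quantization: x ~= scale * q with integer q."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for Int8, 7 for Int4
    scale = np.abs(x).max() / qmax            # max-abs calibration (an assumption)
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q8, s8 = quantize_fixed_point(w, num_bits=8)
q4, s4 = quantize_fixed_point(w, num_bits=4)
print("Int8 max error:", np.abs(w - dequantize(q8, s8)).max())
print("Int4 max error:", np.abs(w - dequantize(q4, s4)).max())
```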

  8. Why fixed-point quantization? • Saving memory • Saving energy • Saving time • Saving area. Mark Horowitz, Computing's Energy Problem. ISSCC 2014.

  9. Types of quantization. An N-bit code provides 2^N values, 000…000 ~ 111…111. Scalar quantization with or without constraints on the levels:

      Code     Uniform     Logarithmic
      0…000    0           0
      0…001    1           1
      0…010    2           2
      0…011    3           4
      0…100    4           8
      0…101    5           16
      0…110    6           32
      …        …           …
      1…111    2^N − 1     2^(2^N − 2)

  Non-uniform quantization places the levels freely; uniform quantization uses equally spaced levels; logarithmic quantization uses power-of-two levels.
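To make the uniform vs. logarithmic distinction concrete, a small sketch (an illustration of the level sets above, not from the slides) that builds both grids for an N-bit code and snaps values to the nearest level:

```python
import numpy as np

def uniform_levels(num_bits):
    # equally spaced levels: 0, 1, 2, ..., 2^N - 1
    return np.arange(2 ** num_bits, dtype=float)

def log_levels(num_bits):
    # logarithmic (power-of-two) levels: 0, 1, 2, 4, ..., 2^(2^N - 2)
    return np.concatenate(([0.0], 2.0 ** np.arange(2 ** num_bits - 1)))

def quantize_to(x, levels):
    # generic scalar quantization: snap each value to its nearest level
    return levels[np.abs(x[..., None] - levels).argmin(axis=-1)]

x = np.array([0.3, 2.7, 5.0, 40.0])
print(uniform_levels(3))              # [0. 1. 2. 3. 4. 5. 6. 7.]
print(log_levels(3))                  # [0. 1. 2. 4. 8. 16. 32. 64.]
print(quantize_to(x, log_levels(3)))  # [0. 2. 4. 32.]
```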

  10. Contents • Sparsity-inducing Binarized Neural Networks. AAAI, 2020. • Soft Threshold Ternary Networks. IJCAI, 2020. • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020. • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.

  11. Contents • Sparsity-inducing Binarized Neural Networks. AAAI, 2020. • Soft Threshold Ternary Networks. IJCAI, 2020. • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020. • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.

  12. Binary: Sparsity-inducing BNN. Previous binary approaches fix the two states to −1/+1. But "binary" only means two states (b1, b2); they could equally be (0, 1), for example. Which two states should we use? Peisong Wang, Xiangyu He, Gang Li, Tianli Zhao and Jian Cheng, "Sparsity-inducing Binarized Neural Networks", AAAI, 2020.

  13. Sparsity-inducing BNN. How to accelerate a BNN with 0/1 activations? Reparameterize the two states (b1, b2) as an affine transformation of (−1, +1).
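A minimal numerical check (my own sketch, not the paper's kernel) of why the affine reparameterization works: a {0,1} activation a equals (s + 1)/2 with s in {−1,+1}, so a dot product with binary weights reduces to the standard ±1 kernel plus a weight-sum constant:

```python
import numpy as np

# A {0,1} activation a equals (s + 1) / 2 for s in {-1,+1}. For binary weights
# w in {-1,+1}, the dot product therefore reduces to the usual +/-1 kernel
# plus a per-output constant:  w . a = ( w . s + sum(w) ) / 2
rng = np.random.default_rng(0)
w = rng.choice([-1, 1], size=256)        # binary weights
a = rng.choice([0, 1], size=256)         # sparse 0/1 activations
s = 2 * a - 1                            # affine map back to {-1,+1}

direct = w @ a
via_affine = (w @ s + w.sum()) / 2
assert direct == via_affine
print(direct)
```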

  14. Sparsity-inducing BNN. How to determine the threshold of 0/1 binarization? • Binarization at the zero point: under a roughly normal activation distribution, this leaves a large quantization error. • Binarization at a threshold: choose it via the mutual information I(x; x̂) of the input x and its binarized version x̂. He Z., Fan D. Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network using Truncated Gaussian Approximation. CVPR 2019.

  15. Sparsity-inducing BNN. How to determine the threshold of 0/1 binarization? The mutual information can be formulated as a function of q = p(x̂ = 0). Ablation study on the selection of the threshold on AlexNet.
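For a deterministic binarizer, I(x; x̂) equals the entropy of the binary output and therefore depends only on q = p(x̂ = 0). The sketch below (an illustration only, not the paper's derivation or its threshold rule) evaluates that quantity while sweeping a threshold over a standard Gaussian input:

```python
import math

def mutual_info_binary(q):
    # I(x; x_hat) = H(x_hat) for a deterministic binarizer; a function of q alone
    q = min(max(q, 1e-12), 1 - 1e-12)
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def gaussian_cdf(t):
    # P(x <= t) for x ~ N(0, 1)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# Sweep a threshold t: q = p(x_hat = 0) = P(x <= t)
for t in [0.0, 0.5, 1.0, 1.5]:
    q = gaussian_cdf(t)
    print(f"threshold {t:+.1f}: q = {q:.3f}, I(x; x_hat) = {mutual_info_binary(q):.3f} bits")
```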

  16. Sparsity-inducing BNN. Experiments on AlexNet and ResNet-18, and a comparison with 2-bit methods: • the method extends to other network structures • without bells and whistles.

  17. Sparsity-inducing BNN. Run-time speedup: Tianli Zhao, Xiangyu He, Jian Cheng. BitStream: Efficient Computing Architecture for Real-Time Low-Power Inference of Binary Neural Networks on CPUs. ACM MM 2018.

  18. Contents • Sparsity-inducing Binarized Neural Networks. AAAI, 2020. • Soft Threshold Ternary Networks. IJCAI, 2020. • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020. • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.

  19. Soft Threshold Ternary Networks. Previous ternary approaches: weights are mapped to {−1, 0, +1} by a hard threshold ±Δ. Our idea: move from a hard to a soft threshold via Binary + Binary = Ternary. Weixiang Xu, Xiangyu He, Tianli Zhao, Qinghao Hu, Peisong Wang and Jian Cheng. "Soft Threshold Ternary Networks", IJCAI, 2020.

  20. Soft Threshold Ternary Networks • Ternarize both weights and activations • No constraint on Δ • Soft threshold
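A small numerical sketch (my own illustration, not the paper's training procedure) of the "Binary + Binary = Ternary" idea: the mean of two {−1,+1} tensors takes values in {−1, 0, +1}, and hard-threshold ternarization with threshold Δ is one special case of such a decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(8)

# Hard-threshold ternarization with threshold delta (the classic scheme).
delta = 0.5
t_hard = np.sign(w) * (np.abs(w) > delta)

# The same ternary tensor as the mean of two binary {-1,+1} tensors:
# b1 = b2 = sign(w) where |w| > delta, and b1 = -b2 elsewhere (so they cancel).
b1 = np.where(np.abs(w) > delta, np.sign(w), 1.0)
b2 = np.where(np.abs(w) > delta, np.sign(w), -1.0)
t_from_binaries = (b1 + b2) / 2
assert np.array_equal(t_hard, t_from_binaries)
print(t_hard)
```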

  21. Soft Threshold Ternary Networks. ImageNet results (table omitted).

  22. Contents • Sparsity-inducing Binarized Neural Networks. AAAI, 2020. • Soft Threshold Ternary Networks. IJCAI, 2020. • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020. • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.

  23. One-hot Networks. Toward a more efficient quantizer, starting from 8-bit weights and activations: reduce either the bit-width or the number of non-zero bits (sign bit S plus magnitude bits).

      Bit-width    Range        Non-zero bits    Range
      INT-8        -128 ~ 127   7-hot            -128 ~ 127
      INT-7        -64 ~ 63     6-hot            -127 ~ 126
      INT-6        -32 ~ 31     …                …
      INT-5        -16 ~ 15     Two-hot          -96 ~ 96
      INT-4        -8 ~ 7       One-hot          -64 ~ 64
      INT-3        -4 ~ 3

  Gang Li, Peisong Wang, Zejian Liu, Cong Leng, Jian Cheng. Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE 2020.

  24. One-hot Networks. One-hot weights (logarithmic) [1]: only one non-zero bit in the weights, so each multiplication becomes a bit shift of the activation. One-hot weights + one-hot activations [2]: only one non-zero bit in weights and activations, so each multiplication becomes an addition plus encoding; the effectual bits are the exponent bits plus the sign bit, e.g. 8 bit -> 3+1 bit. [1] H. Tann, S. Hashemi, R. I. Bahar, S. Reda, "Hardware-Software Codesign of Highly Accurate, Multiplier-free Deep Neural Networks", DAC'17. [2] S. Sharify et al., "Laconic Deep Learning Inference Acceleration", ISCA'19.
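To illustrate the "multiplication becomes a bit shift" point, a toy sketch (assumptions: positive toy weights, a rounded-log2 exponent, and an exponent cap max_exp; this is only a software illustration, not the DATE'20 hardware design):

```python
import numpy as np

def one_hot_quantize(w, max_exp=6):
    """Quantize each weight to +/- 2^k (one non-zero magnitude bit).
    Illustrative only; the exponent range and rounding rule are assumptions."""
    sign = np.sign(w).astype(int)
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), 0, max_exp).astype(int)
    return sign, exp

rng = np.random.default_rng(0)
w = rng.uniform(1, 64, size=5)           # toy positive weights
act = rng.integers(-16, 16, size=5)      # integer activations

sign, exp = one_hot_quantize(w)
shifted = sign * (act << exp)            # multiplication replaced by a bit shift
print(shifted)
print(sign * act * (2 ** exp))           # identical results, computed with multiplies
```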

  25. One-hot Networks. Evaluation: Xilinx ZC706 dev board, Vivado HLS 2018.2. Baselines: 16/16-bit DaDianNao [1] and 8/8-bit Laconic [2]. [1] Y. Chen et al., "DaDianNao: A Machine-Learning Supercomputer," MICRO'14. [2] S. Sharify et al., "Laconic Deep Learning Inference Acceleration," ISCA'19.

  26. Contents • Sparsity-inducing Binarized Neural Networks. AAAI, 2020. • Soft Threshold Ternary Networks. IJCAI, 2020. • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020. • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.

  27. Bit-Split for Post-training Network Quantization. Training-aware quantization: pre-trained model -> network quantization -> fine-tune using data/labels. Post-training quantization: pre-trained model -> network quantization; data-free, BP-free, hyper-parameter free, easy to use. Peisong Wang, Qiang Chen, Xiangyu He, Jian Cheng. Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML 2020.

  28. Bit-Split for Post-training Network Quantization. Post-training quantization schemes: Min-Max; Min-Max with clipping; minimize the KL distance between the full-precision and quantized distributions (Szymon Migacz. 8-bit Inference with TensorRT. GTC 2017). Problem: (figure omitted).
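To make the schemes concrete, a small sketch (my own illustration; TensorRT's calibration minimizes a KL divergence, whereas this toy version grid-searches a clipping value under an MSE criterion) comparing plain min-max calibration with a clipped scale:

```python
import numpy as np

def quant_error(x, clip_val, num_bits=8):
    # symmetric uniform quantizer with clipping threshold clip_val
    qmax = 2 ** (num_bits - 1) - 1
    scale = clip_val / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return np.mean((x - q * scale) ** 2)

x = np.random.laplace(size=100_000)          # heavy-tailed "activation" statistics
minmax_clip = np.abs(x).max()                # plain min-max calibration
candidates = np.linspace(0.1, minmax_clip, 200)
best_clip = min(candidates, key=lambda c: quant_error(x, c))

print("min-max clip:", minmax_clip, "MSE:", quant_error(x, minmax_clip))
print("best clip   :", best_clip,   "MSE:", quant_error(x, best_clip))
```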

  29. Bit-Split for Post-training Network Quantization. Problem formulation and optimization (equations omitted).
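The paper's bit-split-and-stitch solver is not reproduced here; as a hedged, generic stand-in for the kind of optimization such post-training methods perform, the sketch below alternates between integer codes q and a least-squares scale alpha to minimize ||w - alpha*q||^2:

```python
import numpy as np

def fit_scale_and_codes(w, num_bits=4, iters=20):
    """Alternating minimization of || w - alpha * q ||^2 with integer codes q.
    A generic post-training weight-quantization sketch, not the Bit-Split solver."""
    qmax = 2 ** (num_bits - 1) - 1
    alpha = np.abs(w).max() / qmax                       # initialize from min-max
    for _ in range(iters):
        q = np.clip(np.round(w / alpha), -qmax, qmax)    # codes given the scale
        alpha = (w @ q) / (q @ q + 1e-12)                # least-squares scale given codes
    return alpha, q

w = np.random.randn(4096) * 0.05
alpha, q = fit_scale_and_codes(w)
print("reconstruction MSE:", np.mean((w - alpha * q) ** 2))
```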

  30. Bit-Split for Post-training Network Quantization. Results for weight quantization and for weight-and-activation quantization (tables omitted).
