Ultra-low-bit Neural Network Quantization




  1. Ultra-low-bit Neural Network Quantization. Peisong Wang, Institute of Automation, Chinese Academy of Sciences, 2020.06.03. Collaborators: Weixiang Xu, Tianli Zhao, Fanrong Li, Xiangyu He, Gang Li, Jian Cheng, Cong Leng. Contact: peisong.wang@nlpr.ia.ac.cn

  2. Background: Deep Learning. From: Russ Salakhutdinov.

  3. Background: Applications of CNNs. Convolutional Neural Networks: • Classification • Detection • Segmentation

  4. Background: Training. Training ResNet-50 has gone from several days down to: • Facebook: 1 hour • Fast.ai: 18 min • Tencent: 6.6 min • Sony: 3.7 min • Google: 2.2 min • SenseTime: 1.5 min

  5. Background: Real-World Applications. Challenges: low inference speed, large memory/storage footprint, high power consumption. Scenarios: AR/VR, self-driving cars, intelligent surveillance, intelligent robots, face unlock.

  6. Network Acceleration and Compression • Low-rank Decomposition • Sparse/Pruning • Quantization • Knowledge Distillation • …

  7. Fixed-point representation. FP32: 1 sign bit (S), 8 exponent bits (E), 23 mantissa bits (M); value = (−1)^S × 1.M × 2^E. Int8: 1 sign bit (S) and 7 magnitude bits (M); Int4: 1 sign bit (S) and 3 magnitude bits (M); value = (−1)^S × M.
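As a concrete companion to these bit layouts, here is a minimal sketch (my own illustration, not from the slides) of symmetric fixed-point quantization; the max-abs calibration rule and the rounding scheme are assumptions, not the presenters' method:

```python
import numpy as np

def quantize_fixed_point(x, num_bits=8):
    """Symmetric fixed-point quantization: x ~= scale * q with integer q."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for Int8, 7 for Int4
    scale = np.abs(x).max() / qmax            # max-abs calibration (an assumption)
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q8, s8 = quantize_fixed_point(w, num_bits=8)
q4, s4 = quantize_fixed_point(w, num_bits=4)
print("Int8 max error:", np.abs(w - dequantize(q8, s8)).max())
print("Int4 max error:", np.abs(w - dequantize(q4, s4)).max())
```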

  8. Why fixed-point quantization? • Saving memory • Saving energy • Saving time • Saving area. Mark Horowitz, Computing's Energy Problem. ISSCC 2014.

  9. Types of quantization. An N-bit code provides 2^N values, 000…000 ~ 111…111. Scalar quantization with or without constraints on the levels:

      Code     Uniform     Logarithmic
      0…000    0           0
      0…001    1           1
      0…010    2           2
      0…011    3           4
      0…100    4           8
      0…101    5           16
      0…110    6           32
      …        …           …
      1…111    2^N − 1     2^(2^N − 2)

  Non-uniform quantization places the levels freely; uniform quantization uses equally spaced levels; logarithmic quantization uses power-of-two levels.
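To make the uniform vs. logarithmic distinction concrete, a small sketch (an illustration of the level sets above, not from the slides) that builds both grids for an N-bit code and snaps values to the nearest level:

```python
import numpy as np

def uniform_levels(num_bits):
    # equally spaced levels: 0, 1, 2, ..., 2^N - 1
    return np.arange(2 ** num_bits, dtype=float)

def log_levels(num_bits):
    # logarithmic (power-of-two) levels: 0, 1, 2, 4, ..., 2^(2^N - 2)
    return np.concatenate(([0.0], 2.0 ** np.arange(2 ** num_bits - 1)))

def quantize_to(x, levels):
    # generic scalar quantization: snap each value to its nearest level
    return levels[np.abs(x[..., None] - levels).argmin(axis=-1)]

x = np.array([0.3, 2.7, 5.0, 40.0])
print(uniform_levels(3))              # [0. 1. 2. 3. 4. 5. 6. 7.]
print(log_levels(3))                  # [0. 1. 2. 4. 8. 16. 32. 64.]
print(quantize_to(x, log_levels(3)))  # [0. 2. 4. 32.]
```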

  10. Contents • Sparsity-inducing Binarized Neural Networks. AAAI, 2020. • Soft Threshold Ternary Networks. IJCAI, 2020. • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020. • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.

  11. Contents • Sparsity-inducing Binarized Neural Networks. AAAI, 2020. • Soft Threshold Ternary Networks. IJCAI, 2020. • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020. • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.

  12. Binary: Sparsity-inducing BNN. Previous binary approaches fix the two states to −1/+1. But "binary" only means two states (b1, b2); they could equally be (0, 1), for example. Which two states should we use? Peisong Wang, Xiangyu He, Gang Li, Tianli Zhao and Jian Cheng, "Sparsity-inducing Binarized Neural Networks", AAAI, 2020.

  13. Sparsity-inducing BNN. How to accelerate a BNN with 0/1 activations? Reparameterize the two states (b1, b2) as an affine transformation of (−1, +1).
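A minimal numerical check (my own sketch, not the paper's kernel) of why the affine reparameterization works: a {0,1} activation a equals (s + 1)/2 with s in {−1,+1}, so a dot product with binary weights reduces to the standard ±1 kernel plus a weight-sum constant:

```python
import numpy as np

# A {0,1} activation a equals (s + 1) / 2 for s in {-1,+1}. For binary weights
# w in {-1,+1}, the dot product therefore reduces to the usual +/-1 kernel
# plus a per-output constant:  w . a = ( w . s + sum(w) ) / 2
rng = np.random.default_rng(0)
w = rng.choice([-1, 1], size=256)        # binary weights
a = rng.choice([0, 1], size=256)         # sparse 0/1 activations
s = 2 * a - 1                            # affine map back to {-1,+1}

direct = w @ a
via_affine = (w @ s + w.sum()) / 2
assert direct == via_affine
print(direct)
```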

  14. Sparsity-inducing BNN. How to determine the threshold of 0/1 binarization? • Binarization at the zero point: under a roughly normal activation distribution, this leaves a large quantization error. • Binarization at a threshold: choose it via the mutual information I(x; x̂) of the input x and its binarized version x̂. He Z., Fan D. Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network using Truncated Gaussian Approximation. CVPR 2019.

  15. Sparsity-inducing BNN. How to determine the threshold of 0/1 binarization? The mutual information can be formulated as a function of q = p(x̂ = 0). Ablation study on the selection of the threshold on AlexNet.
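For a deterministic binarizer, I(x; x̂) equals the entropy of the binary output and therefore depends only on q = p(x̂ = 0). The sketch below (an illustration only, not the paper's derivation or its threshold rule) evaluates that quantity while sweeping a threshold over a standard Gaussian input:

```python
import math

def mutual_info_binary(q):
    # I(x; x_hat) = H(x_hat) for a deterministic binarizer; a function of q alone
    q = min(max(q, 1e-12), 1 - 1e-12)
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def gaussian_cdf(t):
    # P(x <= t) for x ~ N(0, 1)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# Sweep a threshold t: q = p(x_hat = 0) = P(x <= t)
for t in [0.0, 0.5, 1.0, 1.5]:
    q = gaussian_cdf(t)
    print(f"threshold {t:+.1f}: q = {q:.3f}, I(x; x_hat) = {mutual_info_binary(q):.3f} bits")
```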

  16. Sparsity-inducing BNN. Experiments on AlexNet and ResNet-18, and a comparison with 2-bit methods: • the method extends to other network structures • without bells and whistles.

  17. Sparsity-inducing BNN. Run-time speedup: Tianli Zhao, Xiangyu He, Jian Cheng. BitStream: Efficient Computing Architecture for Real-Time Low-Power Inference of Binary Neural Networks on CPUs. ACM MM 2018.

  18. Contents • Sparsity-inducing Binarized Neural Networks. AAAI, 2020. • Soft Threshold Ternary Networks. IJCAI, 2020. • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020. • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.

  19. Soft Threshold Ternary Networks. Previous ternary approaches: weights are mapped to {−1, 0, +1} by a hard threshold ±Δ. Our idea: move from a hard to a soft threshold via Binary + Binary = Ternary. Weixiang Xu, Xiangyu He, Tianli Zhao, Qinghao Hu, Peisong Wang and Jian Cheng. "Soft Threshold Ternary Networks", IJCAI, 2020.

  20. Soft Threshold Ternary Networks • Ternarize both weights and activations • No constraint on Δ • Soft threshold
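A small numerical sketch (my own illustration, not the paper's training procedure) of the "Binary + Binary = Ternary" idea: the mean of two {−1,+1} tensors takes values in {−1, 0, +1}, and hard-threshold ternarization with threshold Δ is one special case of such a decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(8)

# Hard-threshold ternarization with threshold delta (the classic scheme).
delta = 0.5
t_hard = np.sign(w) * (np.abs(w) > delta)

# The same ternary tensor as the mean of two binary {-1,+1} tensors:
# b1 = b2 = sign(w) where |w| > delta, and b1 = -b2 elsewhere (so they cancel).
b1 = np.where(np.abs(w) > delta, np.sign(w), 1.0)
b2 = np.where(np.abs(w) > delta, np.sign(w), -1.0)
t_from_binaries = (b1 + b2) / 2
assert np.array_equal(t_hard, t_from_binaries)
print(t_hard)
```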

  21. Soft Threshold Ternary Networks. ImageNet results (table omitted).

  22. Contents • Sparsity-inducing Binarized Neural Networks. AAAI, 2020. • Soft Threshold Ternary Networks. IJCAI, 2020. • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020. • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.

  23. One-hot Networks. Toward a more efficient quantizer, starting from 8-bit weights and activations: reduce either the bit-width or the number of non-zero bits (sign bit S plus magnitude bits).

      Bit-width    Range        Non-zero bits    Range
      INT-8        -128 ~ 127   7-hot            -128 ~ 127
      INT-7        -64 ~ 63     6-hot            -127 ~ 126
      INT-6        -32 ~ 31     …                …
      INT-5        -16 ~ 15     Two-hot          -96 ~ 96
      INT-4        -8 ~ 7       One-hot          -64 ~ 64
      INT-3        -4 ~ 3

  Gang Li, Peisong Wang, Zejian Liu, Cong Leng, Jian Cheng. Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE 2020.

  24. One-hot Networks. One-hot weights (logarithmic) [1]: only one non-zero bit in the weights, so each multiplication becomes a bit shift of the activation. One-hot weights + one-hot activations [2]: only one non-zero bit in weights and activations, so each multiplication becomes an addition plus encoding; the effectual bits are the exponent bits plus the sign bit, e.g. 8 bit -> 3+1 bit. [1] H. Tann, S. Hashemi, R. I. Bahar, S. Reda, "Hardware-Software Codesign of Highly Accurate, Multiplier-free Deep Neural Networks", DAC'17. [2] S. Sharify et al., "Laconic Deep Learning Inference Acceleration", ISCA'19.
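To illustrate the "multiplication becomes a bit shift" point, a toy sketch (assumptions: positive toy weights, a rounded-log2 exponent, and an exponent cap max_exp; this is only a software illustration, not the DATE'20 hardware design):

```python
import numpy as np

def one_hot_quantize(w, max_exp=6):
    """Quantize each weight to +/- 2^k (one non-zero magnitude bit).
    Illustrative only; the exponent range and rounding rule are assumptions."""
    sign = np.sign(w).astype(int)
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), 0, max_exp).astype(int)
    return sign, exp

rng = np.random.default_rng(0)
w = rng.uniform(1, 64, size=5)           # toy positive weights
act = rng.integers(-16, 16, size=5)      # integer activations

sign, exp = one_hot_quantize(w)
shifted = sign * (act << exp)            # multiplication replaced by a bit shift
print(shifted)
print(sign * act * (2 ** exp))           # identical results, computed with multiplies
```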

  25. One-hot Networks. Evaluation: Xilinx ZC706 dev board, Vivado HLS 2018.2. Baselines: 16/16-bit DaDianNao [1] and 8/8-bit Laconic [2]. [1] Y. Chen et al., "DaDianNao: A Machine-Learning Supercomputer," MICRO'14. [2] S. Sharify et al., "Laconic Deep Learning Inference Acceleration," ISCA'19.

  26. Contents • Sparsity-inducing Binarized Neural Networks. AAAI, 2020. • Soft Threshold Ternary Networks. IJCAI, 2020. • Hardware Acceleration of CNN with One-Hot Quantization of Weights and Activations. DATE, 2020. • Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML, 2020.

  27. Bit-Split for Post-training Network Quantization. Training-aware quantization: pre-trained model -> network quantization -> fine-tune using data/labels. Post-training quantization: pre-trained model -> network quantization; data-free, BP-free, hyper-parameter free, easy to use. Peisong Wang, Qiang Chen, Xiangyu He, Jian Cheng. Towards Accurate Post-training Network Quantization via Bit-Split and Stitching. ICML 2020.

  28. Bit-Split for Post-training Network Quantization. Post-training quantization schemes: Min-Max; Min-Max with clipping; minimize the KL distance between the full-precision and quantized distributions (Szymon Migacz. 8-bit Inference with TensorRT. GTC 2017). Problem: (figure omitted).
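To make the schemes concrete, a small sketch (my own illustration; TensorRT's calibration minimizes a KL divergence, whereas this toy version grid-searches a clipping value under an MSE criterion) comparing plain min-max calibration with a clipped scale:

```python
import numpy as np

def quant_error(x, clip_val, num_bits=8):
    # symmetric uniform quantizer with clipping threshold clip_val
    qmax = 2 ** (num_bits - 1) - 1
    scale = clip_val / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return np.mean((x - q * scale) ** 2)

x = np.random.laplace(size=100_000)          # heavy-tailed "activation" statistics
minmax_clip = np.abs(x).max()                # plain min-max calibration
candidates = np.linspace(0.1, minmax_clip, 200)
best_clip = min(candidates, key=lambda c: quant_error(x, c))

print("min-max clip:", minmax_clip, "MSE:", quant_error(x, minmax_clip))
print("best clip   :", best_clip,   "MSE:", quant_error(x, best_clip))
```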

  29. Bit-Split for Post-training Network Quantization. Problem formulation and optimization (equations omitted).
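The paper's bit-split-and-stitch solver is not reproduced here; as a hedged, generic stand-in for the kind of optimization such post-training methods perform, the sketch below alternates between integer codes q and a least-squares scale alpha to minimize ||w - alpha*q||^2:

```python
import numpy as np

def fit_scale_and_codes(w, num_bits=4, iters=20):
    """Alternating minimization of || w - alpha * q ||^2 with integer codes q.
    A generic post-training weight-quantization sketch, not the Bit-Split solver."""
    qmax = 2 ** (num_bits - 1) - 1
    alpha = np.abs(w).max() / qmax                       # initialize from min-max
    for _ in range(iters):
        q = np.clip(np.round(w / alpha), -qmax, qmax)    # codes given the scale
        alpha = (w @ q) / (q @ q + 1e-12)                # least-squares scale given codes
    return alpha, q

w = np.random.randn(4096) * 0.05
alpha, q = fit_scale_and_codes(w)
print("reconstruction MSE:", np.mean((w - alpha * q) ** 2))
```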

  30. Bit-Split for Post-training Network Quantization. Results for weight quantization and for weight-and-activation quantization (tables omitted).
