Multi-Precision Policy Enforced Training (MuPPET)
A precision-switching strategy for quantised fixed-point training of CNNs

Aditya Rajagopal, Diederik Adriaan Vink, Stylianos I. Venieris, Christos-Savvas Bouganis
Intelligent Digital Systems Lab, Dept. of Electrical and Electronic Engineering
www.imperial.ac.uk/idsl
Training of Convolutional Neural Networks (CNNs)

Typical datasets:
• CIFAR10 - 10 categories - 60000 images
• CIFAR100 - 100 categories - 60000 images
• ImageNet - 1000 categories - 1.2 million images

Typical networks (architecture diagrams shown on the slide)
Motivation

Training time:
• Enable wider experimentation with training, e.g. Neural Architecture Search
• Increase productivity of deep learning practitioners

Power consumption:
• Reduce cost of training in large data centers
• Perform training on edge devices

Exploit low-precision hardware capabilities:
• NVIDIA Turing Architecture (GPU)
• Microsoft Brainwave (FPGA)
• Google TPU (ASIC)
Goal

Perform quantised training of CNNs while maintaining FP32 accuracy and producing a model that performs inference at FP32.
Contributions of this paper

• A generalisable policy that decides at run time the appropriate points to increase the precision of the training process without impacting final test accuracy
  - Datasets: CIFAR10, CIFAR100, ImageNet
  - Networks: AlexNet, ResNet, GoogLeNet
  - Up to 1.84x training-time improvement with negligible loss in accuracy
• Extending training to bit-widths as low as 8-bit to leverage the low-precision capabilities of modern processing systems
• An open-source PyTorch implementation of the MuPPET framework with emulated quantised computations
Background: Mixed Precision Training

• Current state of the art: mixed-precision training (Micikevicius et al., 2018) [6]
  - Maintains a master copy of the weights at FP32
  - Quantises weights and activations to FP16 for all computations
  - Accumulates the FP16 gradients into the FP32 master copy of the weights
• Incurs an accuracy drop if precision below FP16 is used

[6] Micikevicius, P. et al. Mixed Precision Training. In International Conference on Learning Representations (ICLR), 2018.
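As an illustration, a minimal PyTorch-style sketch of the FP32-master-copy scheme described above; loss scaling is omitted, the function and argument names are ours, and it assumes FP16-capable hardware:

```python
import torch

def mixed_precision_step(model, master_params, optimiser, inputs, targets):
    """One step of FP32-master-copy mixed-precision training (illustrative sketch).
    `model` holds FP16 parameters; `optimiser` is built over the FP32 master copies."""
    # 1. Refresh the FP16 working copy from the FP32 master weights
    with torch.no_grad():
        for w_master, w_model in zip(master_params, model.parameters()):
            w_model.copy_(w_master)                     # implicit cast FP32 -> FP16

    # 2. Forward and backward passes run in FP16
    loss = torch.nn.functional.cross_entropy(model(inputs.half()), targets)
    loss.backward()

    # 3. Accumulate the FP16 gradients into the FP32 master copy and update it
    with torch.no_grad():
        for w_master, w_model in zip(master_params, model.parameters()):
            w_master.grad = w_model.grad.float()
    optimiser.step()
    optimiser.zero_grad()
    model.zero_grad()
```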
Multilevel optimisation formulation

• Hierarchical formulation that progressively increases the precision of the computations: each quantised level maintains an FP32 master copy of the weights, and the final level trains at FP32.

$$\min_{w^{(q_N)} \in \mathbb{R}^{D}} \mathrm{Loss}\big(f(w^{(q_N)})\big) \;\rightarrow\; \min_{w^{(q_{N-1})} \in \mathbb{R}^{D}} \mathrm{Loss}\big(f(w^{(q_{N-1})})\big) \;\rightarrow\; \cdots \;\rightarrow\; \min_{w_{\mathrm{FP32}} \in \mathbb{R}^{D}} \mathrm{Loss}\big(f(w_{\mathrm{FP32}})\big)$$

• The proposed policy decides at run time the epochs at which these precision changes need to be made.
Background: Gradient Diversity

• Yin et al. (2018) compute the diversity between minibatches within an epoch, where $\nabla f_i(w)$ is the gradient of the weights for minibatch $i$:

$$\Delta_S(w) = \frac{\sum_{i=1}^{n} \|\nabla f_i(w)\|_2^2}{\big\|\sum_{i=1}^{n} \nabla f_i(w)\big\|_2^2} = \frac{\sum_{i=1}^{n} \|\nabla f_i(w)\|_2^2}{\sum_{i=1}^{n} \|\nabla f_i(w)\|_2^2 + \sum_{i \neq k} \langle \nabla f_i(w), \nabla f_k(w) \rangle}$$

• Modified for MuPPET to compute the diversity between minibatches across epochs, where $\nabla f_l^{k}(w)$ is the gradient of the last minibatch for layer $l$ in epoch $k$, $\mathcal{L}$ is the set of layers and $r$ is the resolution in epochs ($r \leq$ epoch $j$):

$$\Delta_S(w)^{j} = \frac{1}{|\mathcal{L}|} \sum_{l \in \mathcal{L}} \frac{\sum_{k=j-r}^{j} \|\nabla f_l^{k}(w)\|_2^2}{\big\|\sum_{k=j-r}^{j} \nabla f_l^{k}(w)\big\|_2^2}$$

i.e. the average gradient diversity across all layers over the last $r$ epochs.
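A minimal PyTorch sketch of the inter-epoch gradient diversity above; the buffer of stored per-layer gradients and the function name are assumptions made for illustration, not the released API:

```python
import torch

def inter_epoch_gradient_diversity(layer_grads):
    """layer_grads: dict mapping layer name -> list of gradient tensors,
    one per epoch in the window [j - r, j] (last minibatch of each epoch).
    Returns the diversity averaged over all layers, as in the formula above."""
    total = 0.0
    for grads in layer_grads.values():
        flat = [g.flatten() for g in grads]
        numerator = sum(g.dot(g) for g in flat)        # sum of squared L2 norms
        summed = torch.stack(flat).sum(dim=0)          # sum of the gradients
        denominator = summed.dot(summed)               # squared L2 norm of the sum
        total += (numerator / denominator).item()
    return total / len(layer_grads)
```

In MuPPET the window length $r$ (the resolution) is fixed and the diversity is recomputed every $r$ epochs.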
Precision Switching Policy: Methodology

• Every $r$ epochs:
  - The inter-epoch gradient diversity is calculated
  - Given the epoch $e$ at which the precision last switched from level $q_{n-1}$ to $q_n$, and the current epoch $j$:

$$S(j) = \{\Delta_S(w)^{i} \;\; \forall\, e \leq i \leq j\}, \qquad p = \frac{\max S(j)}{\Delta_S(w)^{j}}$$

• An empirically chosen decaying threshold is placed on $p$:

$$T = \alpha + \beta \exp(-\lambda j)$$

• If $p$ violates $T$ more than a set number of times, a precision switch is triggered and training continues at the next level $q_{n+1}$.
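A compact sketch of the decision rule described above; the threshold constants, the violation limit and all names are illustrative placeholders, not values from the paper:

```python
import math

class PrecisionSwitchPolicy:
    """Tracks the ratio p and the decaying threshold T; signals a precision switch
    after repeated violations. Constants below are illustrative placeholders."""

    def __init__(self, alpha=1.0, beta=1.0, lam=0.05, max_violations=3):
        self.alpha, self.beta, self.lam = alpha, beta, lam
        self.max_violations = max_violations
        self.diversities = []      # S(j): diversities observed since the last switch
        self.violations = 0

    def update(self, epoch, diversity):
        """Call every r epochs with the newly computed inter-epoch gradient
        diversity. Returns True when a precision switch should be triggered."""
        self.diversities.append(diversity)
        p = max(self.diversities) / diversity
        threshold = self.alpha + self.beta * math.exp(-self.lam * epoch)
        if p > threshold:
            self.violations += 1
        if self.violations > self.max_violations:
            # Switch precision and reset the state for the new level q_{n+1}
            self.diversities.clear()
            self.violations = 0
            return True
        return False
```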
Precision Switching Policy: Hypotheses

• Intuition
  - Low gradient diversity increases the value of $p$
  - The likelihood of observing $r$ gradients across $r$ epochs that have low diversity is small at the early stages of training
  - If this does happen, it may imply that information is being lost due to quantisation (high $p$ value)
• Generalisability of $p = \max S(j) \,/\, \Delta_S(w)^{j}$
  - across epochs
  - across networks and datasets
Precision Switching Policy: Generalisability

• Similar values of $p$ across various networks and datasets
• The decaying threshold accounts for volatility in the early stages of training
Precision Switching Policy: Adaptability

• Is it better than randomly switching?
• Does it tailor to the network and dataset?
Precision Switching Policy: Performance (Accuracy)

• Networks: AlexNet, ResNet18/20, GoogLeNet
• Datasets: CIFAR10, CIFAR100 (hyperparameter tuning), ImageNet (application)
• Precisions: 8-, 12-, 14- and 16-bit dynamic fixed-point (emulated) and 32-bit floating-point
• Training with MuPPET matches the accuracy of standard FP32 training when trained with identical SGD hyperparameters
Quantised Training
Quantisation

Values $x$ (weights or activations) are quantised to signed integers using a per-tensor power-of-two scale factor:

$$x_{\mathrm{quant}}^{\{\mathrm{weights},\,\mathrm{act}\}} = \Big\lfloor x^{\{\mathrm{weights},\,\mathrm{act}\}} \cdot 2^{\,s^{\{\mathrm{weights},\,\mathrm{act}\}}} \Big\rceil$$

$$s^{\{\mathrm{weights},\,\mathrm{act}\}} = \Bigg\lfloor \log_2 \bigg( \min \Big( \frac{UB + 0.5}{X_{\max}},\; \frac{LB - 0.5}{X_{\min}} \Big) \bigg) \Bigg\rfloor$$

where $UB$ and $LB$ are the wordlength upper and lower bounds of the signed integer range, $X_{\max}$ and $X_{\min}$ are the extreme values of the tensor being quantised, and $s$ is the resulting scale factor. Together, the network word length at the current optimisation level and the per-layer scale factors across the network layers form the quantisation configuration.
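A minimal sketch of the per-tensor quantisation above, assuming round-to-nearest and a tensor containing both positive and negative values; the helper name is ours:

```python
import math
import torch

def quantise_tensor(x, wordlength):
    """Quantise an FP32 tensor to signed fixed-point of the given wordlength,
    following the scale-factor rule on the slide (edge cases such as all-positive
    or all-zero tensors are not handled in this sketch)."""
    ub = 2 ** (wordlength - 1) - 1      # wordlength upper bound, e.g. 127 for 8-bit
    lb = -(2 ** (wordlength - 1))       # wordlength lower bound, e.g. -128 for 8-bit
    x_max, x_min = x.max().item(), x.min().item()
    # Largest power-of-two scaling that keeps every value inside [lb, ub]
    s = math.floor(math.log2(min((ub + 0.5) / x_max, (lb - 0.5) / x_min)))
    x_quant = torch.clamp(torch.round(x * (2.0 ** s)), lb, ub)
    return x_quant, s                   # recover an FP32 approximation as x_quant * 2**(-s)
```

For an 8-bit wordlength the integer range is $[-128, 127]$, and the original value is approximated by $x_{\mathrm{quant}} \cdot 2^{-s}$.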
Quantisation Emulation

• No ML framework support for reduced-precision hardware
  - e.g. NVIDIA Turing architecture
• GEMM profiled using NVIDIA's CUTLASS library
• Training profiled through PyTorch
  - Quantisation of weights, activations and gradients
  - All gradient diversity calculations
• 12- and 14-bit fixed-point profiled as 16-bit fixed-point
  - Included for future custom-precision hardware
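A sketch of how quantised computation can be emulated on FP32 hardware: both operands are quantised, the GEMM runs on the integer-valued FP32 tensors, and the combined scale factor is undone afterwards. It reuses the hypothetical quantise_tensor helper from the earlier sketch and mirrors the emulation idea rather than the exact released code:

```python
import torch

def emulated_quantised_matmul(a, b, wordlength=8):
    """Emulate a fixed-point GEMM: quantise both operands, multiply the
    integer-valued tensors in FP32, then rescale the result."""
    a_q, s_a = quantise_tensor(a, wordlength)
    b_q, s_b = quantise_tensor(b, wordlength)
    return (a_q @ b_q) * 2.0 ** (-(s_a + s_b))

# Example: an 8-bit emulated matrix product of two random FP32 matrices
result = emulated_quantised_matmul(torch.randn(64, 128), torch.randn(128, 256), wordlength=8)
```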
Performance (Wall-clock time): Current Implementation
Performance (Wall-clock time): Ideal