1. Memory-driven mixed low precision quantization for enabling deep inference networks on microcontrollers
Manuele Rusci*, Alessandro Capotondi, Luca Benini (*manuele.rusci@unibo.it)
Energy-Efficient Embedded Systems Laboratory, Dipartimento di Ingegneria dell'Energia Elettrica e dell'Informazione "Guglielmo Marconi" (DEI), Università di Bologna

2. Microcontrollers for smart sensors

3. Microcontrollers for smart sensors
❑ Low-power (<10-100 mW) and low-cost
❑ Smart devices are battery-operated
❑ Highly flexible (SW programmable)
❑ But limited resources(!)
  ❑ a few MB of memory
  ❑ a single RISC core up to a few hundred MHz (STM32H7: 400 MHz), with DSP SIMD instructions and an optional FPU
❑ Currently, only tiny visual DL tasks run on MCUs (visual wake words, CIFAR10)
Source: STM32H7 datasheet
Challenge: run 'complex' and 'big' (ImageNet-size) DL inference on an MCU?

4. Deep Learning for microcontrollers
"Efficient" topologies trade off accuracy vs MACs vs memory, but quantization is also essential.
Source: https://towardsdatascience.com/neural-network-architectures-156e5bad51ba
[Figure: dot product of activations a0..a3 with weights w0..w3 at decreasing precision]
Reducing the bit precision reduces both compute and memory:
❑ FP32: 4 instructions + 32 bytes
❑ INT16: 2 instructions + 16 bytes
❑ INT8: 1 instruction + 8 bytes (if the ISA provides SIMD MAC instructions)
Issue 1: an integer-only model is needed for deployment on low-power MCUs.
Issue 2: 8-16 bits are not sufficient to bring 'complex' models onto MCUs (memory!!)
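The INT16/INT8 instruction counts above rely on SIMD MAC support, e.g. the SMLAD instruction of the Arm Cortex-M DSP extension, which performs two 16-bit multiply-accumulates per instruction. A minimal sketch, assuming a Cortex-M core with the DSP extension and the CMSIS __SMLAD intrinsic; the helper name and loop structure are ours, not the paper's kernel:

    #include <stdint.h>
    #include <string.h>
    #include "cmsis_gcc.h"  /* assumed CMSIS core header providing __SMLAD */

    /* Dot product of two INT16 vectors: each __SMLAD performs two MACs.
       len is assumed even for brevity. */
    static int32_t dot_q15(const int16_t *a, const int16_t *w, int len)
    {
        int32_t acc = 0;
        for (int i = 0; i < len; i += 2) {
            uint32_t va, vw;
            memcpy(&va, &a[i], 4);   /* load two 16-bit lanes packed in one word */
            memcpy(&vw, &w[i], 4);
            /* acc += a[i]*w[i] + a[i+1]*w[i+1] in a single instruction */
            acc = (int32_t)__SMLAD(va, vw, (uint32_t)acc);
        }
        return acc;
    }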

5. Memory-Driven Mixed-Precision Quantization
[Figure: Top1 accuracy vs memory footprint of quantized MobilenetV1 variants. Labels: Best Top1: 70.1%; Best Mixed: 68%; Best Top4 Fit: 60.5%; Best Top1 Fit: 48%]
Using less than 8 bits there is still margin: apply the minimum tensor-wise quantization (≤8 bit) that fits the memory constraints, with a very low accuracy drop.
➢ Challenges:
– How to define the quantization policy
– How to combine this quantization flow with the integer-only transformation

6. End-to-end Flow & Contributions
Goal: define a design flow to bring ImageNet-size models onto an MCU device while paying a low accuracy drop.
DNN development flow for microcontrollers:
full-precision model f(x) → [Device-aware Fine-Tuning, under memory constraints] → fake-quantized model g(x) → [Graph Optim] → integer-only deployment model g'(x) → [Code Generator] → deployment C code → microcontroller
Device-aware Fine-Tuning: we define a rule-based methodology to determine the mixed-precision quantization policy, driven by a memory objective function.
Graph Optimization: we introduce the Integer Channel-Normalization (ICN) activation layer to generate an integer-only deployment graph when applying uniform sub-byte quantization.
Deployment on MCU: a latency-accuracy tradeoff on iso-memory mixed-precision networks belonging to the ImageNet MobilenetV1 family, running on an STM32H7 MCU.

7. DNN Development Flow for microcontrollers
full-precision model f(x) → fake-quantized model g(x) → integer-only deployment model g'(x) → deployment C code → microcontroller
Focus: Graph Optimization — INTEGER-ONLY W/ SUB-BYTE QUANTIZATION

8. State of the Art
❑ Inference with integer-only arithmetic (Jacob, 2018)
❑ Affine transformation between real values and (uniform) quantized parameters: t = S_t · (t_q − Z_t), where t is the real-valued tensor (or sub-tensor), t_q the quantized tensor (INT-Q), S_t the scale and Z_t the zero-point
❑ Quantization-aware retraining
❑ Folding of batch norm into conv weights + rounding of per-layer scaling parameters
☺ Almost lossless with 8 bits on image classification and detection problems. Used by TF Lite.
✗ 4-bit MobilenetV1: training collapses when folding batch norm into the convolution weights
✗ Does not support per-channel (PC) weight quantization

Integer-Only MobilenetV1_224_1.0
Quantization Method | Top1 | Weights (MB)
Full-Precision      | 70.9 | 16.8
w8a8 (Jacob, 2018)  | 70.1 | 4.06
w4a4 (Jacob, 2018)  | 0.1  | 2.05

(Jacob, 2018) Jacob, Benoit, et al. "Quantization and training of neural networks for efficient integer-arithmetic-only inference." CVPR 2018.
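As a concrete illustration of the affine mapping t = S_t · (t_q − Z_t), here is a minimal sketch of INT8 quantize/dequantize helpers; the function names and the rounding choice are ours, not from the paper:

    #include <stdint.h>
    #include <math.h>

    /* Quantize a real value under t = S_t * (t_q - Z_t):
       t_q = round(t / S_t) + Z_t, clamped to the INT8 range. */
    static int8_t quantize_i8(float t, float scale, int32_t zero_point)
    {
        int32_t q = (int32_t)lroundf(t / scale) + zero_point;
        if (q < -128) q = -128;
        if (q > 127)  q = 127;
        return (int8_t)q;
    }

    /* Recover the (approximate) real value from the quantized one. */
    static float dequantize_i8(int8_t q, float scale, int32_t zero_point)
    {
        return scale * (float)(q - zero_point);
    }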

9. Integer Channel-Normalization (ICN)
Fake-quantized sub-graph: Conv2D → BatchNorm → QuantAct, with the affine mapping t = S_t · (t_q − Z_t).
Conv2D: φ = Σ w · x
Activation (batch norm + quantization): Y_q = quant_act( (φ − μ)/σ · γ + β ), where μ, σ, γ, β are channel-wise batch-norm parameters.
On quantized operands the convolution becomes Φ = Σ (W_q − Z_w)(X_q − Z_x), where S_w is a scalar if PL (per-layer), else an array (per-channel), and the activation scales S_i, S_o are scalars. Replacing BatchNorm:
Y_q = Z_y + quant_act( (S_i S_w γ)/(S_o σ) · ( Φ + σ/(S_i S_w γ) · (β − γμ/σ) ) )
    ≈ Z_y + quant_act( M_0 · 2^{N_0} · (Φ + B_q) )
where M_0, N_0, B_q are channel-wise integer parameters.
➢ The Integer Channel-Normalization (ICN) activation function holds either for PL or PC quantization of the weights.

Integer-Only MobilenetV1_224_1.0
Quantization Method | Top1  | Weights (MB)
Full-Precision      | 70.9  | 16.8
PL+ICN w4a4         | 61.75 | 2.10
PC+ICN w4a4         | 66.41 | 2.12
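To make the integer-only evaluation concrete, a minimal sketch of the ICN activation for one output channel follows; the variable names, the unsigned clamping range, and storing N_0 as a non-negative right-shift amount are our assumptions, not the paper's generated code:

    #include <stdint.h>

    /* ICN activation, one output channel, all-integer arithmetic:
       Y_q = Z_y + quant_act( M0 * 2^N0 * (Phi + Bq) )
       phi        : integer conv accumulator, Phi = sum (W_q - Z_w)(X_q - Z_x)
       m0, n0, bq : channel-wise integer params folding batch norm and scales
       q_max      : saturation bound, e.g. 15 for 4-bit activations */
    static uint8_t icn_act(int32_t phi, int32_t m0, int32_t n0,
                           int32_t bq, int32_t z_y, int32_t q_max)
    {
        int64_t t = (int64_t)m0 * (phi + bq); /* widen to avoid overflow */
        t >>= n0;          /* the 2^N0 factor, with n0 stored as a right shift */
        t += z_y;          /* add the output zero-point */
        if (t < 0)     t = 0;       /* quant_act: clamp to the Q-bit range */
        if (t > q_max) t = q_max;
        return (uint8_t)t;
    }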

10. DNN Development Flow for microcontrollers
full-precision model f(x) → fake-quantized model g(x) → integer-only deployment model g'(x) → deployment C code → microcontroller
Focus: Device-aware Fine-Tuning — MIXED-PRECISION QUANTIZATION POLICY

11. Deployment of an integer-only graph
Problem: can this graph fit the memory constraints of our MCU device?
[Figure: example graph with nodes conv0..conv4 and add0, weight parameters weight0..weight4, input data, output data, and the two memory budgets M_ROM and M_RAM]

12. Deployment of an integer-only graph
Problem: can this graph fit the memory constraints of our MCU device?
❑ M_ROM: read-only memory for static parameters (the weights)
❑ M_RAM: read-write memory for dynamic values (input/output activation data)

13. Deployment of an integer-only graph
[M1] the weights must fit M_ROM:
Σ_{i=0..N−1} mem(W_i, Q_w^i) + mem(M_0, N_0, B_q) < M_ROM
[M2] the input and output of every node must fit M_RAM:
mem(X_j, Q_x^j) + mem(Y_j, Q_y^j) < M_RAM, ∀j
Problem formulation: find the quantization policy Q_w^j, Q_x^j, Q_y^j ∈ {2, 4, 8} bits that satisfies [M1] and [M2].
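As a sketch of how [M1] and [M2] could be checked for a chain of N layers (the struct layout and names are ours; the paper formulates the constraints, not this code):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        size_t w_elems, x_elems, y_elems;  /* tensor sizes (element counts) */
        int    q_w, q_x, q_y;              /* bit precisions, each in {2,4,8} */
        size_t icn_bytes;                  /* per-layer M0/N0/Bq parameter bytes */
    } layer_t;

    static size_t mem_bytes(size_t elems, int q_bits) {
        return (elems * q_bits + 7) / 8;   /* packed sub-byte storage */
    }

    /* [M1]: all weights + ICN params must fit the read-only budget.
       [M2]: input + output of every node must fit the read-write budget. */
    static bool fits(const layer_t *l, int n, size_t m_rom, size_t m_ram) {
        size_t rom = 0;
        for (int j = 0; j < n; j++) {
            rom += mem_bytes(l[j].w_elems, l[j].q_w) + l[j].icn_bytes;
            if (mem_bytes(l[j].x_elems, l[j].q_x) +
                mem_bytes(l[j].y_elems, l[j].q_y) > m_ram)
                return false;              /* [M2] violated at node j */
        }
        return rom < m_rom;                /* [M1] */
    }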

14. Rule-Based Mixed-Precision (weights quantization policy)
Goal: maximize memory utilization under [M1]: size(w0) + size(w1) + size(w2) + size(w3) < M_ROM, with ε = 5%.
1. Set Q_w^j = 8 for every layer j.
2. If [M1] is satisfied, stop.
3. Otherwise, compute each layer's memory occupation r_j = mem(w_j, Q_w^j) / MEM_tot and let R = max_j r_j.
4. Cut Q_w^j of the layers whose occupation is r_j > R − ε, then go back to step 2.
Example (initial state, all layers at 8 bits): w0 13%, w1 15%, w2 22%, w3 50%.

15. Rule-Based Mixed-Precision (weights quantization policy)
Iteration 1: [M1] is not satisfied. Occupations: w0 13%, w1 15%, w2 22%, w3 50%; R = 50%.
Only layer 3 exceeds R − ε = 45%, so cut layer 3: Q_w^3 goes from 8 to 4 bits.
Any cut reduces the bit precision by one step: 8→4, 4→2.
New occupations: w0 17%, w1 20%, w2 30%, w3 33%.

16. Rule-Based Mixed-Precision (weights quantization policy)
Iteration 2: [M1] is still not satisfied. Occupations: w0 17%, w1 20%, w2 30%, w3 33%; R = 33%.
Layer 2 (30%) now lies above R − ε = 28%, so cut layer 2 as well: Q_w^2 goes from 8 to 4 bits.
The loop repeats until [M1] is satisfied, yielding the weights quantization policy (see the sketch below).
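Putting slides 14-16 together, a compact C sketch of the weight-bit assignment loop as the rule is stated; the names and the termination fallback are ours, and the companion policy for activations under [M2] is handled analogously in the paper:

    #include <stdbool.h>
    #include <stddef.h>

    #define EPS 0.05  /* epsilon = 5% */

    static size_t w_bytes(size_t elems, int q) { return (elems * q + 7) / 8; }

    /* Start every layer at 8 bits; while the weights do not fit M_ROM,
       cut (8->4->2) the layers whose memory share exceeds R - EPS. */
    static bool assign_weight_bits(const size_t *w_elems, int *q_w,
                                   int n, size_t m_rom)
    {
        for (int j = 0; j < n; j++) q_w[j] = 8;       /* step 1 */

        for (;;) {
            size_t tot = 0;                           /* total weight memory */
            for (int j = 0; j < n; j++) tot += w_bytes(w_elems[j], q_w[j]);
            if (tot < m_rom) return true;             /* [M1] satisfied */

            double r_max = 0.0;                       /* largest share R */
            for (int j = 0; j < n; j++) {
                double r = (double)w_bytes(w_elems[j], q_w[j]) / (double)tot;
                if (r > r_max) r_max = r;
            }

            bool cut = false;                         /* cut layers near R */
            for (int j = 0; j < n; j++) {
                double r = (double)w_bytes(w_elems[j], q_w[j]) / (double)tot;
                if (r > r_max - EPS && q_w[j] > 2) {  /* one step: 8->4, 4->2 */
                    q_w[j] /= 2;
                    cut = true;
                }
            }
            if (!cut) return false;   /* every candidate already at 2 bits */
        }
    }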
