1. Memory-driven mixed low precision quantization for enabling deep inference networks on microcontrollers
Manuele Rusci*, Alessandro Capotondi, Luca Benini (*manuele.rusci@unibo.it)
Energy-Efficient Embedded Systems Laboratory, Dipartimento di Ingegneria dell'Energia Elettrica e dell'Informazione "Guglielmo Marconi" (DEI), Università di Bologna

2. Microcontrollers for smart sensors

3. Microcontrollers for smart sensors
❑ Low-power (<10-100 mW) and low-cost
❑ Smart devices are battery-operated
❑ Highly flexible (SW programmable)
❑ But limited resources(!)
  ❑ a few MB of memory
  ❑ a single RISC core up to a few hundred MHz (STM32H7: 400 MHz), with DSP SIMD instructions and an optional FPU
❑ Currently, only tiny visual DL tasks run on MCUs (visual wake words, CIFAR10)
Source: STM32H7 datasheet
Challenge: run 'complex' and 'big' (ImageNet-size) DL inference on an MCU?

4. Deep Learning for microcontrollers
"Efficient" topologies trade off accuracy vs MACs vs memory, but quantization is also essential.
Source: https://towardsdatascience.com/neural-network-architectures-156e5bad51ba
[Figure: dot product of activations a0..a3 with weights w0..w3 at decreasing precision]
Reducing the bit precision reduces both compute and memory:
❑ FP32: 4 instructions + 32 bytes
❑ INT16: 2 instructions + 16 bytes
❑ INT8: 1 instruction + 8 bytes (if the ISA provides SIMD MAC instructions)
Issue 1: an integer-only model is needed for deployment on low-power MCUs.
Issue 2: 8-16 bits are not sufficient to bring 'complex' models onto MCUs (memory!!)
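The INT16/INT8 instruction counts above rely on SIMD MAC support, e.g. the SMLAD instruction of the Arm Cortex-M DSP extension, which performs two 16-bit multiply-accumulates per instruction. A minimal sketch, assuming a Cortex-M core with the DSP extension and the CMSIS __SMLAD intrinsic; the helper name and loop structure are ours, not the paper's kernel:

    #include <stdint.h>
    #include <string.h>
    #include "cmsis_gcc.h"  /* assumed CMSIS core header providing __SMLAD */

    /* Dot product of two INT16 vectors: each __SMLAD performs two MACs.
       len is assumed even for brevity. */
    static int32_t dot_q15(const int16_t *a, const int16_t *w, int len)
    {
        int32_t acc = 0;
        for (int i = 0; i < len; i += 2) {
            uint32_t va, vw;
            memcpy(&va, &a[i], 4);   /* load two 16-bit lanes packed in one word */
            memcpy(&vw, &w[i], 4);
            /* acc += a[i]*w[i] + a[i+1]*w[i+1] in a single instruction */
            acc = (int32_t)__SMLAD(va, vw, (uint32_t)acc);
        }
        return acc;
    }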

5. Memory-Driven Mixed-Precision Quantization
[Figure: Top1 accuracy vs memory footprint of quantized MobilenetV1 variants. Labels: Best Top1: 70.1%; Best Mixed: 68%; Best Top4 Fit: 60.5%; Best Top1 Fit: 48%]
Using less than 8 bits there is still margin: apply the minimum tensor-wise quantization (≤8 bit) that fits the memory constraints, with a very low accuracy drop.
➢ Challenges:
– How to define the quantization policy
– How to combine this quantization flow with the integer-only transformation

6. End-to-end Flow & Contributions
Goal: define a design flow to bring ImageNet-size models onto an MCU device while paying a low accuracy drop.
DNN development flow for microcontrollers:
full-precision model f(x) → [Device-aware Fine-Tuning, under memory constraints] → fake-quantized model g(x) → [Graph Optim] → integer-only deployment model g'(x) → [Code Generator] → deployment C code → microcontroller
Device-aware Fine-Tuning: we define a rule-based methodology to determine the mixed-precision quantization policy, driven by a memory objective function.
Graph Optimization: we introduce the Integer Channel-Normalization (ICN) activation layer to generate an integer-only deployment graph when applying uniform sub-byte quantization.
Deployment on MCU: a latency-accuracy tradeoff on iso-memory mixed-precision networks belonging to the ImageNet MobilenetV1 family, running on an STM32H7 MCU.

7. DNN Development Flow for microcontrollers
full-precision model f(x) → fake-quantized model g(x) → integer-only deployment model g'(x) → deployment C code → microcontroller
Focus: Graph Optimization — INTEGER-ONLY W/ SUB-BYTE QUANTIZATION

8. State of the Art
❑ Inference with integer-only arithmetic (Jacob, 2018)
❑ Affine transformation between real values and (uniform) quantized parameters: t = S_t · (t_q − Z_t), where t is the real-valued tensor (or sub-tensor), t_q the quantized tensor (INT-Q), S_t the scale and Z_t the zero-point
❑ Quantization-aware retraining
❑ Folding of batch norm into conv weights + rounding of per-layer scaling parameters
☺ Almost lossless with 8 bits on image classification and detection problems. Used by TF Lite.
✗ 4-bit MobilenetV1: training collapses when folding batch norm into the convolution weights
✗ Does not support per-channel (PC) weight quantization

Integer-Only MobilenetV1_224_1.0
Quantization Method | Top1 | Weights (MB)
Full-Precision      | 70.9 | 16.8
w8a8 (Jacob, 2018)  | 70.1 | 4.06
w4a4 (Jacob, 2018)  | 0.1  | 2.05

(Jacob, 2018) Jacob, Benoit, et al. "Quantization and training of neural networks for efficient integer-arithmetic-only inference." CVPR 2018.
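As a concrete illustration of the affine mapping t = S_t · (t_q − Z_t), here is a minimal sketch of INT8 quantize/dequantize helpers; the function names and the rounding choice are ours, not from the paper:

    #include <stdint.h>
    #include <math.h>

    /* Quantize a real value under t = S_t * (t_q - Z_t):
       t_q = round(t / S_t) + Z_t, clamped to the INT8 range. */
    static int8_t quantize_i8(float t, float scale, int32_t zero_point)
    {
        int32_t q = (int32_t)lroundf(t / scale) + zero_point;
        if (q < -128) q = -128;
        if (q > 127)  q = 127;
        return (int8_t)q;
    }

    /* Recover the (approximate) real value from the quantized one. */
    static float dequantize_i8(int8_t q, float scale, int32_t zero_point)
    {
        return scale * (float)(q - zero_point);
    }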

9. Integer Channel-Normalization (ICN)
Fake-quantized sub-graph: Conv2D → BatchNorm → QuantAct, with the affine mapping t = S_t · (t_q − Z_t).
Conv2D: φ = Σ w · x
Activation (batch norm + quantization): Y_q = quant_act( (φ − μ)/σ · γ + β ), where μ, σ, γ, β are channel-wise batch-norm parameters.
On quantized operands the convolution becomes Φ = Σ (W_q − Z_w)(X_q − Z_x), where S_w is a scalar if PL (per-layer), else an array (per-channel), and the activation scales S_i, S_o are scalars. Replacing BatchNorm:
Y_q = Z_y + quant_act( (S_i S_w γ)/(S_o σ) · ( Φ + σ/(S_i S_w γ) · (β − γμ/σ) ) )
    ≈ Z_y + quant_act( M_0 · 2^{N_0} · (Φ + B_q) )
where M_0, N_0, B_q are channel-wise integer parameters.
➢ The Integer Channel-Normalization (ICN) activation function holds either for PL or PC quantization of the weights.

Integer-Only MobilenetV1_224_1.0
Quantization Method | Top1  | Weights (MB)
Full-Precision      | 70.9  | 16.8
PL+ICN w4a4         | 61.75 | 2.10
PC+ICN w4a4         | 66.41 | 2.12
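To make the integer-only evaluation concrete, a minimal sketch of the ICN activation for one output channel follows; the variable names, the unsigned clamping range, and storing N_0 as a non-negative right-shift amount are our assumptions, not the paper's generated code:

    #include <stdint.h>

    /* ICN activation, one output channel, all-integer arithmetic:
       Y_q = Z_y + quant_act( M0 * 2^N0 * (Phi + Bq) )
       phi        : integer conv accumulator, Phi = sum (W_q - Z_w)(X_q - Z_x)
       m0, n0, bq : channel-wise integer params folding batch norm and scales
       q_max      : saturation bound, e.g. 15 for 4-bit activations */
    static uint8_t icn_act(int32_t phi, int32_t m0, int32_t n0,
                           int32_t bq, int32_t z_y, int32_t q_max)
    {
        int64_t t = (int64_t)m0 * (phi + bq); /* widen to avoid overflow */
        t >>= n0;          /* the 2^N0 factor, with n0 stored as a right shift */
        t += z_y;          /* add the output zero-point */
        if (t < 0)     t = 0;       /* quant_act: clamp to the Q-bit range */
        if (t > q_max) t = q_max;
        return (uint8_t)t;
    }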

10. DNN Development Flow for microcontrollers
full-precision model f(x) → fake-quantized model g(x) → integer-only deployment model g'(x) → deployment C code → microcontroller
Focus: Device-aware Fine-Tuning — MIXED-PRECISION QUANTIZATION POLICY

11. Deployment of an integer-only graph
Problem: can this graph fit the memory constraints of our MCU device?
[Figure: example graph with nodes conv0..conv4 and add0, weight parameters weight0..weight4, input data, output data, and the two memory budgets M_ROM and M_RAM]

12. Deployment of an integer-only graph
Problem: can this graph fit the memory constraints of our MCU device?
❑ M_ROM: read-only memory for static parameters (the weights)
❑ M_RAM: read-write memory for dynamic values (input/output activation data)

13. Deployment of an integer-only graph
[M1] the weights must fit M_ROM:
Σ_{i=0..N−1} mem(W_i, Q_w^i) + mem(M_0, N_0, B_q) < M_ROM
[M2] the input and output of every node must fit M_RAM:
mem(X_j, Q_x^j) + mem(Y_j, Q_y^j) < M_RAM, ∀j
Problem formulation: find the quantization policy Q_w^j, Q_x^j, Q_y^j ∈ {2, 4, 8} bits that satisfies [M1] and [M2].
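As a sketch of how [M1] and [M2] could be checked for a chain of N layers (the struct layout and names are ours; the paper formulates the constraints, not this code):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        size_t w_elems, x_elems, y_elems;  /* tensor sizes (element counts) */
        int    q_w, q_x, q_y;              /* bit precisions, each in {2,4,8} */
        size_t icn_bytes;                  /* per-layer M0/N0/Bq parameter bytes */
    } layer_t;

    static size_t mem_bytes(size_t elems, int q_bits) {
        return (elems * q_bits + 7) / 8;   /* packed sub-byte storage */
    }

    /* [M1]: all weights + ICN params must fit the read-only budget.
       [M2]: input + output of every node must fit the read-write budget. */
    static bool fits(const layer_t *l, int n, size_t m_rom, size_t m_ram) {
        size_t rom = 0;
        for (int j = 0; j < n; j++) {
            rom += mem_bytes(l[j].w_elems, l[j].q_w) + l[j].icn_bytes;
            if (mem_bytes(l[j].x_elems, l[j].q_x) +
                mem_bytes(l[j].y_elems, l[j].q_y) > m_ram)
                return false;              /* [M2] violated at node j */
        }
        return rom < m_rom;                /* [M1] */
    }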

14. Rule-Based Mixed-Precision (weights quantization policy)
Goal: maximize memory utilization under [M1]: size(w0) + size(w1) + size(w2) + size(w3) < M_ROM, with ε = 5%.
1. Set Q_w^j = 8 for every layer j.
2. If [M1] is satisfied, stop.
3. Otherwise, compute each layer's memory occupation r_j = mem(w_j, Q_w^j) / MEM_tot and let R = max_j r_j.
4. Cut Q_w^j of the layers whose occupation is r_j > R − ε, then go back to step 2.
Example (initial state, all layers at 8 bits): w0 13%, w1 15%, w2 22%, w3 50%.

15. Rule-Based Mixed-Precision (weights quantization policy)
Iteration 1: [M1] is not satisfied. Occupations: w0 13%, w1 15%, w2 22%, w3 50%; R = 50%.
Only layer 3 exceeds R − ε = 45%, so cut layer 3: Q_w^3 goes from 8 to 4 bits.
Any cut reduces the bit precision by one step: 8→4, 4→2.
New occupations: w0 17%, w1 20%, w2 30%, w3 33%.

16. Rule-Based Mixed-Precision (weights quantization policy)
Iteration 2: [M1] is still not satisfied. Occupations: w0 17%, w1 20%, w2 30%, w3 33%; R = 33%.
Layer 2 (30%) now lies above R − ε = 28%, so cut layer 2 as well: Q_w^2 goes from 8 to 4 bits.
The loop repeats until [M1] is satisfied, yielding the weights quantization policy (see the sketch below).
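Putting slides 14-16 together, a compact C sketch of the weight-bit assignment loop as the rule is stated; the names and the termination fallback are ours, and the companion policy for activations under [M2] is handled analogously in the paper:

    #include <stdbool.h>
    #include <stddef.h>

    #define EPS 0.05  /* epsilon = 5% */

    static size_t w_bytes(size_t elems, int q) { return (elems * q + 7) / 8; }

    /* Start every layer at 8 bits; while the weights do not fit M_ROM,
       cut (8->4->2) the layers whose memory share exceeds R - EPS. */
    static bool assign_weight_bits(const size_t *w_elems, int *q_w,
                                   int n, size_t m_rom)
    {
        for (int j = 0; j < n; j++) q_w[j] = 8;       /* step 1 */

        for (;;) {
            size_t tot = 0;                           /* total weight memory */
            for (int j = 0; j < n; j++) tot += w_bytes(w_elems[j], q_w[j]);
            if (tot < m_rom) return true;             /* [M1] satisfied */

            double r_max = 0.0;                       /* largest share R */
            for (int j = 0; j < n; j++) {
                double r = (double)w_bytes(w_elems[j], q_w[j]) / (double)tot;
                if (r > r_max) r_max = r;
            }

            bool cut = false;                         /* cut layers near R */
            for (int j = 0; j < n; j++) {
                double r = (double)w_bytes(w_elems[j], q_w[j]) / (double)tot;
                if (r > r_max - EPS && q_w[j] > 2) {  /* one step: 8->4, 4->2 */
                    q_w[j] /= 2;
                    cut = true;
                }
            }
            if (!cut) return false;   /* every candidate already at 2 bits */
        }
    }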
