Deep-Learning Oriented Smart Sensing for the Next Generation of Embedded Applications
Manuele Rusci, Francesco Conti, Alessandro Capotondi, Luca Benini
Energy-Efficient Embedded Systems Laboratory, Dipartimento di Ingegneria dell’Energia Elettrica e dell’Informazione “Guglielmo Marconi”
IWES18, Siena, 14 September 2018
From data collectors…
[Figure: block diagram of a wireless sensing node. Labels: Node average power budget; Wireless Sensing; Power; Sensor MCU; Sensing Unit (Sensing Element, Analog Chain, A/D Conv); Wireless TX/RX Unit; External Memory.]
[Alioto, Massimo. "IoT: Bird’s Eye View, Megatrends and Perspectives." Enabling the Internet of Things. Springer International Publishing, 2017. 1-45.]
...to always-ON smart sensors
Challenge: bringing intelligence in-the-node at mW cost
[Figure: block diagram of a smart sensing node. Labels: Power System; Processing Unit (Core Region, Peripheral Subsystem, Memory Subsystem); Sensing Unit (Sensing Element, Analog Chain, A/D Conv); TX/RX Unit; External Memory.]
1. low-power “feature” / event extraction on sensor
2. event-based near-sensor processing
3. “slim” and infrequent transmission of high-level features
Ultra-Low Power Imaging (GrainCam)
Focal Plane Processing: moving an early computation stage into the sensor die to reduce the power cost of the imaging task.
- Per-pixel circuit for filtering and binarization (gradient extraction)
- Imager performing spatial filtering and binarization on the sensor die through mixed-signal sensing
- Adapting exposure
- ‘Moving’ pixel window
[Figure: per-pixel schematic with a spatial-contrast block and comparators comp1 and comp2 acting on pixel PO and its neighbours PN and PE (signals V_res, V_EDGE, V_Q, V_th); image comparison: Traditional Camera vs. GrainCam with motion detection.]
Event-Based Paradigm
- Ultra-low power consumption: <100uW
- Event-based sensing: output frame data bandwidth depends on the activity in the external context
- <10x wrt SoA imagers
[Figure: frame-based vs. event-based output; the event-based output is a list of pixel addresses {x0,y0}, {x1,y1}, {x2,y2}, {x3,y3}, ..., {xn-1,yn-1}.]
Readout modes:
- IDLE: read out the counter of asserted pixels
- ACTIVE: send the addresses of the asserted pixels (Address-Event Representation, AER)
Event-Based Data Processing:
- idle: absence of significant information from the sensor, ~100uW
- data transfer and data processing: detection of relevant information by the sensor, ~10mW power
M. Rusci et al., "A sub-mW IoT-endnode for always-on visual monitoring and smart triggering," in IEEE Internet of Things Journal, 2017
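The two readout modes can be illustrated with a minimal software sketch. It only models the data flow described on the slide (a pixel count in IDLE, pixel addresses in ACTIVE); the function names and the `wake_threshold` parameter are hypothetical and are not taken from the sensor or from the cited paper.

```python
from typing import List, Tuple

Frame = List[List[int]]  # binary frame: 1 = asserted pixel, 0 = no event


def idle_readout(frame: Frame) -> int:
    """IDLE mode: return only the number of asserted pixels."""
    return sum(sum(row) for row in frame)


def active_readout(frame: Frame) -> List[Tuple[int, int]]:
    """ACTIVE mode: return the (x, y) address of every asserted pixel."""
    return [(x, y)
            for y, row in enumerate(frame)
            for x, value in enumerate(row)
            if value]


def read_sensor(frame: Frame, wake_threshold: int):
    """Stay in IDLE unless the pixel count reaches a (hypothetical) threshold."""
    count = idle_readout(frame)
    if count < wake_threshold:
        return "IDLE", count, []          # nothing relevant: no addresses sent
    return "ACTIVE", count, active_readout(frame)


if __name__ == "__main__":
    frame = [[0, 1, 0],
             [0, 0, 0],
             [1, 0, 1]]
    print(read_sensor(frame, wake_threshold=2))
    print(read_sensor(frame, wake_threshold=5))
```

In this sketch the amount of data leaving the sensor scales with the number of asserted pixels rather than with the frame size, which is the property the event-based paradigm relies on.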
Deep Learning at the Edge
Convolutional Neural Networks are state-of-the-art for visual recognition, detection and classification tasks.
[Figure: Imager -> Multi-Dimensional Data -> Inference Engine -> Output Class Label (e.g. "bike").]
How to exploit CNNs on always-on devices with a power envelope of a few mW or sub-mW?
Issues:
- Large memory footprint to store weights (the ‘program’) and intermediate results (up to hundreds of MBs), greater than the memory available on ultra-low power engines (100’s of kBs)
- High-complexity CNN implementation, demanding floating-point precision
- Imager power costs of tens to hundreds of mWs
Deep Learning at the Edge
“Extreme” example: ResNet-34
- classifies 224x224 images into 1000 classes
- ~ human-level performance once trained
- ~ 21M parameters
- ~ 3.6G MAC
Performance for 1 fps: ~3.6 GMAC/s
Energy efficiency for 1 fps @ 20 mW: ~180 GMAC/s/W = ~5pJ/MAC
Quantization (VGG-16 @ CIFAR-10):
Precision: accuracy loss
- full precision / 8bit: 0
- 6bit: -1.3%
- 4bit: -3.3%
Specialized HW: parallelism and HW acceleration are key paradigms to achieve low energy
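The efficiency target above follows from simple arithmetic, reproduced here as a short sketch. The MAC count, frame rate, power budget and parameter count are the slide's figures; the weight-storage estimate is an added illustration that assumes every parameter is stored at the stated bit width with no other overhead.

```python
MACS_PER_FRAME = 3.6e9   # ResNet-34, from the slide
FPS = 1.0                # target frame rate, from the slide
POWER_W = 20e-3          # power budget, from the slide
N_PARAMS = 21e6          # ResNet-34 parameters, from the slide

# Required throughput and the efficiency implied by the power budget.
throughput_mac_s = MACS_PER_FRAME * FPS                # MAC/s
efficiency_gmac_s_w = throughput_mac_s / POWER_W / 1e9  # GMAC/s/W
energy_per_mac_pj = POWER_W / throughput_mac_s * 1e12   # pJ/MAC


def weight_footprint_mb(bits_per_weight: int) -> float:
    """Weight storage in MB (10^6 bytes), assuming no packing overhead."""
    return N_PARAMS * bits_per_weight / 8 / 1e6


if __name__ == "__main__":
    print(f"throughput : {throughput_mac_s / 1e9:.1f} GMAC/s")
    print(f"efficiency : {efficiency_gmac_s_w:.0f} GMAC/s/W")
    print(f"energy/MAC : {energy_per_mac_pj:.2f} pJ")
    for bits in (32, 8, 4):
        print(f"{bits:2d}-bit weights: {weight_footprint_mb(bits):.1f} MB")
```

Even at 4 bits per weight the parameters alone exceed the hundreds of kB typically available on ultra-low power engines, which is why quantization is paired with other techniques in the rest of the talk.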
Quantization: no free lunch
Running INT-Q convolution on an ARM Cortex-M7 core -> huge opportunity for HW/SW codesign
- Lower power consumption when fitting into L1 thanks to compression: lower bandwidth from L2-SRAM
- Impact of low-bitwidth precision: overhead for casting INT-4/2 to INT-16 for 2x16bit vectorized MAC instructions
- INT-1 kernel exploits bitwise operations and does not pay casting overhead because XNOR convolutions are supported by the ISA
Open Source: https://github.com/EEESlab/CMSIS_NN-INTQ
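The reason the INT-1 kernel avoids the casting overhead can be seen in a small sketch of a binary dot product computed with XNOR and popcount. This is a generic illustration of the technique, not code from the CMSIS_NN-INTQ repository; the encoding (bit 1 for +1, bit 0 for -1) and the helper names are assumptions made for the example.

```python
from typing import List


def pack_bits(values: List[int]) -> int:
    """Pack a list of +1/-1 values into an integer (bit i = 1 means +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_dot(a_word: int, w_word: int, n: int) -> int:
    """Dot product of two packed +1/-1 vectors of length n.

    XNOR marks the positions where the two signs agree; each agreement
    contributes +1 and each disagreement -1, so the result is
    matches - (n - matches) = 2 * matches - n.
    """
    mask = (1 << n) - 1
    agree = ~(a_word ^ w_word) & mask
    matches = bin(agree).count("1")   # popcount
    return 2 * matches - n


if __name__ == "__main__":
    acts = [1, -1, 1, 1, -1, -1, 1, -1]
    wgts = [1, 1, -1, 1, -1, 1, 1, -1]
    reference = sum(a * w for a, w in zip(acts, wgts))
    fast = xnor_dot(pack_bits(acts), pack_bits(wgts), len(acts))
    print(reference, fast)
```

The operands stay packed one bit per value throughout, so no unpacking to 16-bit lanes is needed, in contrast to the INT-4/2 case described above.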
Quantization + Acceleration = ❤
More efficient than any ULP MCU…
[Figure: bubble chart; bubble size = pJ/op (smaller is better).]
F. Conti et al., https://arxiv.org/abs/1612.05974
Quantization + Acceleration = ❤
… and even more if compared to a commercial high-perf MCU
[Figure: bubble chart; bubble size = pJ/op (smaller is better). Bubble labels: 865 pJ/op, 143 pJ/op, 50 pJ/op, 23 pJ/op, 11 pJ/op, 6 pJ/op. Axis range shown: 0.001 to 1000.]
F. Conti et al., https://arxiv.org/abs/1612.05974
Flying a Drone with DL (in <10mW)
DroNet: a ResNet-based CNN to drive a drone in the environment
• original implementation: 20fps on external CPU, requires a big drone (e.g. DJI, Parrot)
FPS:
- GAP8, 8 Cores (200MHz): 32 fps
- GAP8, HWCE (200MHz): 51 fps
DroNet on GAP8/PULP:
- Fixed-Point 16bit (Q3.13)
- Removed Batch Normalization
- Max Pooling layer 2x2
- Striding support in HW
- Support for HWCE
- Comparable accuracy w.r.t. baseline
[Figure: example nano-drone, from D. Palossi et al., https://arxiv.org/abs/1805.01831]
F. Conti, M. Rusci, A. Capotondi, D. Rossi, L. Benini
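As an illustration of the 16-bit fixed-point format mentioned above, the following sketch converts values to and from Q3.13 and multiplies them. It assumes Q3.13 means a signed 16-bit word with 13 fractional bits (range [-4, 4)) and saturating arithmetic; the helper names are hypothetical and the rounding and saturation policy of the actual DroNet port may differ.

```python
FRAC_BITS = 13                    # assumed: 13 fractional bits in a 16-bit word
SCALE = 1 << FRAC_BITS            # 8192
INT16_MIN, INT16_MAX = -32768, 32767


def saturate(value: int) -> int:
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def to_fixed(x: float) -> int:
    """Quantize a float to Q3.13 with rounding and saturation."""
    return saturate(int(round(x * SCALE)))


def to_float(q: int) -> float:
    """Convert a Q3.13 value back to a float."""
    return q / SCALE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two Q3.13 values; the wide product is shifted back by 13 bits."""
    return saturate((a * b) >> FRAC_BITS)


if __name__ == "__main__":
    a, b = to_fixed(1.5), to_fixed(2.0)
    print(a, b, fixed_mul(a, b), to_float(fixed_mul(a, b)))
    print(to_fixed(10.0))  # out of range, saturates to INT16_MAX
```

With 13 fractional bits the resolution is 2^-13 (about 0.00012), and values outside [-4, 4) saturate in this sketch.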
Thanks for your attention. Questions?
Special acks to: Davide Rossi (UNIBO), Daniele Palossi (ETHZ), Eric Flamand (GreenWaves Technologies), all the PULP team
https://github.com/pulp-platform
Twitter @pulp_platform