Ultra Low Power Inference at the Very Edge of the Network
Tiny ML Summit, March 20-21 2019, Sunnyvale
Eric Flamand, CTO & Co-Founder of GreenWaves Technologies
Who are we?
• French-based startup created in 2015
• First product, GAP8, launched in February 2018
Our Market Vision
Rich sensor data:
• Linear PCM = 1.4 Mbit/s: keyword spotting, beam forming, speech pre-processing
• 24-bit @ 50 kHz = 1.2 Mbit/s: vibration analysis, fault detection
• 8-bit, 160x120 @ 10 fps = 4.6 Mbit/s: face detection, presence detection, counting, emotion detection
The IoT pipe: NB-IoT, LTE-M, Sigfox, LoRa, etc., carrying B/day to kB/day to battery-operated sensors.
Our Market Vision
Processing demand: CNN, SVM, Bayesian boosting, cepstral analysis, ... Issue: this takes way more MIPS than an MCU can deliver, but still needs to fit within an MCU power envelope.
Market demand: low operation cost, low deployment cost, low installation cost; massive deployment of intelligent rich-data sensors.
Enablers already in place: mW-class radios with duty-cycling capability (B/day to kB/day over the IoT pipe) and mW-class sensors available for sound, image, radar, ...
GAP8: An IoT Application Processor
An integrated, hierarchical architecture in TSMC 55LP, with two independent clock and voltage domains, from 0-133 MHz at 1.0 V up to 0-250 MHz at 1.2 V.
MCU function (FC clock & voltage domain):
• Extended RISC-V core
• Extensive I/O set (LVDS, UART, SPI, I2C, I2S, CPI, HyperBus, GPIO/PWM) served by a micro DMA
• L2 memory
• Embedded DC/DC converters, PMU, RTC
• Secured execution / e-fuses, ROM, debug
Computation engine (cluster clock & voltage domain):
• 8 extended RISC-V cores, fully programmable, with efficient parallelization
• Shared L1 memory behind a logarithmic interconnect
• Shared instruction cache
• Multi-channel DMA and HW synchronization
• HW convolution engine (HWCE)
Power modes: deep sleep 1 µA; retentive 1 µA + x*8 µA; pre-analysis 1 mW; inference a few 10s of mW. Max frequency 133 MHz to 250 MHz; up to 12.8 GOPS.
GAP8: The open source heritage
• RISC-V: best-in-class instruction set architecture (ISA), originated at UC Berkeley; GWT is a member of the RISC-V Foundation
• PULP: open source computing platform created by ETH Zurich and the University of Bologna; permissive license (Solderpad); multiple tape-outs; GWT contributes to PULP
• GreenWaves: innovating on RISC-V and PULP; a proprietary, balanced system solution (SoC) based on PULP open source elements plus GWT proprietary elements on both the HW and SW/tools side
How to optimize energy efficiency
• Being strictly energy-proportional to the demand
• Lightweight ISA specialization
• Going parallel
• Hardwired accelerators
• Explicit memory management
Being proportional to the demand
Duty cycling: MCU sleep mode, 1 to 50 µW
• Low quiescent LDO
• Real-time clock (32 kHz) only
• L2 memory partially retentive
Coarse-grain classification: MCU active mode, 0.5 to 5 mW
• Embedded DC/DC, high current
• Voltage can change dynamically
• One clock generator active, frequency can change dynamically
• Systematic clock gating
Full-blown analysis: MCU + parallel processor active mode, 5 to 50 mW
• Embedded DC/DC, high current
• Voltage can change dynamically
• Two clock generators active, frequencies can change dynamically
• Systematic clock gating
Ultra-fast switching from one mode to another, and ultra-fast voltage and frequency changes, give highly optimized system-level power consumption.
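A minimal sketch of the duty-cycled control flow this slide implies. All API names here (rtc_wakeup_in_ms, sensor_read_window, coarse_classify, cluster_full_analysis) are hypothetical placeholders for illustration, not the actual GAP8 SDK:

    /* Hypothetical duty-cycling loop; names are illustrative, not the GAP8 SDK API. */
    #include <stdbool.h>
    #include <stdint.h>

    extern void  rtc_wakeup_in_ms(uint32_t ms);     /* arm the RTC, then deep sleep (~1 uA) */
    extern void  sensor_read_window(int16_t *buf, unsigned n);
    extern bool  coarse_classify(const int16_t *buf, unsigned n);      /* FC only, ~mW      */
    extern void  cluster_full_analysis(const int16_t *buf, unsigned n);/* 8 cores, 10s of mW */

    void event_loop(void)
    {
        static int16_t window[1024];
        for (;;) {
            sensor_read_window(window, 1024);        /* MCU active mode                  */
            if (coarse_classify(window, 1024))       /* cheap pre-screen on the FC       */
                cluster_full_analysis(window, 1024); /* power the cluster up only on a hit */
            rtc_wakeup_in_ms(100);                   /* back to sleep mode until the RTC fires */
        }
    }

The point is that the expensive mode is entered only when the cheap mode finds something, so average power tracks the event rate rather than the peak workload.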
Lightweight ISA specialization
• We started from the RISC-V (IMC) ISA and boosted core performance for:
  • DSP kernels
  • Linear algebra
  • SIMD-type vectorization
• Datapath gate count increased by approx. 30%
Lightweight ISA specialization
[Charts: cycle-count speedup and energy improvement of the extended ISA over plain RISC-V across benchmark kernels, without and with vectorization ("RV to GAP8 w/o Vect" vs "RV to GAP8 w/ Vect"); gains range from roughly 1.3x to 7.1x.]
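To make the kind of kernel these extensions target concrete, here is a 16-bit fixed-point dot product written with portable GCC vector extensions; on GAP8 the compiler can map this pattern onto a SIMD multiply-accumulate and a hardware loop. This is an illustrative sketch, not GAP8 builtin code:

    /* Illustration of the kernel class the ISA extensions target: a Q15 dot product. */
    /* Portable GCC vector extensions; on the extended ISA the body collapses to one  */
    /* SIMD MAC per pair of elements inside a zero-overhead hardware loop.            */
    #include <stdint.h>

    typedef int16_t v2s __attribute__((vector_size(4)));  /* two shorts per 32-bit register */

    int32_t dot_q15(const int16_t *a, const int16_t *b, unsigned n)
    {
        int32_t acc = 0;
        for (unsigned i = 0; i < n / 2; i++) {
            v2s va = *(const v2s *)&a[2 * i];              /* assumes 4-byte alignment */
            v2s vb = *(const v2s *)&b[2 * i];
            acc += va[0] * vb[0] + va[1] * vb[1];          /* one SIMD dot-product op  */
        }
        if (n & 1)
            acc += a[n - 1] * b[n - 1];                    /* odd tail element         */
        return acc;
    }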
Going Parallel
• Goal: quasi-perfect performance scaling as a function of the number of cores involved
• Obvious gain:
  • Dynamic power = C * V^2 * Freq => parallelism lets us scale down voltage and frequency at constant throughput (a worked example follows below)
• Less obvious:
  • Make sure synchronization overhead stays invisible, since synchronization is serial by nature (Amdahl)
  • Maximize instruction cache reuse in a context where we perform a lot of partial evaluation => shared instruction cache with broadcast capability
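A back-of-the-envelope worked example (the voltage figures are illustrative, taken from the 1.0 V / 1.2 V operating points mentioned earlier, not datasheet measurements): dynamic power scales as P ≈ C * V^2 * f, so energy per task scales as E ≈ C * V^2 * N_cycles. With quasi-perfect scaling, 8 cores each running at f/8 finish a task in the same time and with the same total cycle count as 1 core at f, but the lower per-core frequency allows a reduced supply, e.g. 1.0 V instead of 1.2 V. The energy ratio is then E_8 / E_1 ≈ (1.0 / 1.2)^2 ≈ 0.7, i.e. roughly a 30% energy saving at equal throughput; alternatively, keeping frequency and voltage fixed buys up to 8x the throughput.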
Going Parallel – Synchronization
• The master core dispatches a function Foo, with its arguments, on a group of cores (a sketch follows below)
• All cores blocked on a synchronization barrier are instantly clock-gated
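A sketch of this fork/barrier pattern. The API names (core_id, cluster_fork, barrier_wait) are illustrative placeholders, not the exact GAP8 SDK calls:

    /* Fork/barrier dispatch sketch; names are placeholders, not the real SDK API.   */
    #include <stdint.h>

    typedef struct { const int16_t *in; int16_t *out; unsigned n; } foo_args_t;

    extern unsigned core_id(void);
    extern void cluster_fork(void (*fn)(void *), void *args); /* master dispatches fn  */
    extern void barrier_wait(void);  /* blocked cores are clock-gated until release    */

    static void Foo(void *vargs)
    {
        foo_args_t *args = (foo_args_t *)vargs;
        unsigned chunk = args->n / 8;                 /* static split across 8 cores   */
        unsigned first = core_id() * chunk;
        for (unsigned i = first; i < first + chunk; i++)
            args->out[i] = args->in[i] >> 1;          /* each core works on its slice  */
        barrier_wait();  /* HW barrier: each arriving core is clock-gated; the last    */
                         /* arrival releases all of them in a single cycle             */
    }

    /* On the master core: foo_args_t a = {...}; cluster_fork(Foo, &a);               */

Because the barrier is in hardware and gated cores draw essentially no dynamic power, the serial synchronization cost Amdahl's law worries about stays negligible.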
Going Parallel – Performance Scaling
[Chart: measured speedup vs number of cores, showing quasi-perfect scaling.]
Going Parallel – Energy Scaling
• Convolution: 80% of a CNN workload
• Average energy gain from the ISA extensions: 3.4x
• Amplified by parallelism: 7.4x
Putting Everything Together vs an MCU
Running CIFAR10, same network, same precision:

    What                 | Freq (MHz) | Exec time (ms) | Cycles     | Power (mW)
    40nm dual-issue MCU  | 216        | 99.1           | 21,400,000 | 60
    GAP8 @ 1.0V          | 15.4       | 99.1           | 1,500,000  | 3.7
    GAP8 @ 1.2V          | 175        | 8.7            | 1,500,000  | 70
    GAP8 @ 1.0V w/ HWCE  | 4.7        | 99.1           | 460,000    | 0.8

At equal execution time, GAP8 @ 1.0V draws 16x less power than the MCU; at 1.2V, GAP8 runs the same network 11x faster.
Explicit Memory Management
• GAP8 is not equipped with data caches
  • Saves silicon area
  • More important: caches hurt energy efficiency, which depends mostly on hit ratio
• We can turn this weakness into an (energy) benefit if we can automate data transfers
• In practice the vast majority of traffic is predictable => we have a way to optimize memory allocation and bandwidth
• Memory hierarchy: external L3 (RAM/Flash) -> µDMA -> L2 -> cluster DMA -> shared L1
• Automatic data tiling, with pipelined memory transfers (L3 to L2, L2 to L1, execute) interleaved with parallel calls to compute kernels, is solved by our "AutoTiler" tool (a sketch of the generated pipeline follows below)
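A sketch of the double-buffered pipeline the AutoTiler generates, for the L2-to-L1 stage only. The DMA helpers (dma_memcpy_async, dma_wait) and compute_tile are illustrative placeholder names, not the actual cluster DMA API:

    /* Double-buffered L2-to-L1 tiling: the copy of tile t+1 overlaps with the      */
    /* computation of tile t. Helper names are placeholders, not the real API.      */
    #include <stdint.h>

    #define TILE 2048                                     /* tile size in elements   */

    extern int  dma_memcpy_async(void *l1_dst, void *l2_src, unsigned bytes);
    extern void dma_wait(int id);
    extern void compute_tile(int16_t *tile, unsigned n);  /* parallel basic kernel   */

    void process(int16_t *l2_data, unsigned n_tiles)
    {
        static int16_t l1_buf[2][TILE];                   /* ping-pong buffers in L1 */
        int id = dma_memcpy_async(l1_buf[0], l2_data, sizeof(l1_buf[0]));
        for (unsigned t = 0; t < n_tiles; t++) {
            int next = -1;
            if (t + 1 < n_tiles)                          /* prefetch tile t+1       */
                next = dma_memcpy_async(l1_buf[(t + 1) & 1],
                                        l2_data + (t + 1) * TILE,
                                        sizeof(l1_buf[0]));
            dma_wait(id);                                 /* tile t has landed in L1 */
            compute_tile(l1_buf[t & 1], TILE);            /* overlaps with the DMA   */
            id = next;
        }
    }

With tiles sized so the DMA finishes before the compute does, the cores never stall on memory, which is the "weakness turned into a benefit" of skipping data caches.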
Explicit Memory Management: AutoTiler
How to handle a parametric tile.
Basic kernels (usually seen as libraries):
• Vectorized + parallelized
• No assumption about where the actual data are located
User kernels (can be grouped and organized as generators) pass actual data to basic kernels and have data circulate between them:
• A multi-dimensional iteration space (2D, 3D, 4D, 5D, ...) and a traversal order
• Each argument is a subspace of the iteration space, with actual dimensions, a location (L2, external) and properties; its order may differ from that of the iteration space
• Given a memory budget, the AutoTiler "tiles" each argument and generates a fully pipelined implementation interleaving processing and data transfers
• Basic kernels are inserted at defined locations in the iteration space (prologue, body, epilogue, ...)
• Generated tiles are passed to basic kernels (see the sketch below)
Graph (CNN + pre/post processing): connected user kernels, constants, input and output features
• Optimal static memory allocation for all dynamic objects
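A sketch of what "no assumption on where data are located" means for a basic kernel: it receives a tile descriptor with pointers and actual dimensions filled in per call by the tiler. The struct layout and names here are illustrative, not the actual AutoTiler argument format:

    /* Illustrative basic kernel: operates on whatever tile it is handed.           */
    /* The descriptor layout is a placeholder, not the real AutoTiler format.       */
    #include <stdint.h>

    typedef struct {
        int16_t *in;        /* tile of the input feature map, wherever the tiler put it */
        int16_t *out;       /* tile of the output                                       */
        unsigned w, h;      /* actual tile dimensions, set per call (edge tiles shrink) */
    } KerReLU_Args_t;

    void KerReLU(KerReLU_Args_t *args)    /* vectorized + parallelized in practice     */
    {
        for (unsigned i = 0; i < args->w * args->h; i++) {
            int16_t v = args->in[i];
            args->out[i] = v > 0 ? v : 0;
        }
    }

Because the kernel only sees pointers and sizes, the same code runs unchanged whether the tiler placed the tile in L1, L2, or fetched it from external memory.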
Explicit Memory Management: AutoTiler
Flow: user kernels, groups of user kernels and the graph, together with the AutoTiler library (generators, basic kernels), are expressed as C programs calling the AutoTiler's model API. These are fed to the AutoTiler (a constraints solver and C code generator), compiled and run on a PC, and emit C code for the target that handles the data transfers and dispatches basic kernels on the cluster's cores. The working set is tiled in a way that maximizes reuse at minimum distance from the datapath.

    #include "AutoTilerLib.h"
    #include "CNN_Generator.h"

    void Mnist()
    {
        CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_0", 5,  1, 32, 28, 28, 1);
        CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_1", 5, 32, 64, 12, 12, 1);
        CNN_TiledLinearLayer             ("LinearLayerRL_2", 64, 4, 4, 10, 1, 0, 0);
    }