Ultra Low Power Inference at the very edge of the network

  1. Ultra Low Power Inference at the very edge of the network. Tiny ML Summit, March 20-21, 2019, Sunnyvale. Eric Flamand, CTO & Co-Founder of GreenWaves Technologies

  2. Who are we?
     • French-based startup created in 2015
     • First product, GAP8, launched in February 2018

  3. Our Market Vision
     • Rich sensor data: linear PCM = 1.4 Mbit/s; 24-bit @ 50 kHz = 1.2 Mbit/s; 8-bit 160x120 images @ 10 fps = 4.6 Mbit/s
     • The IoT pipe: NB-IoT, LTE-M, Sigfox, LoRa, etc., carrying B/day to kB/day, with battery-operated sensors
     • Market demand: keyword spotting, beam forming, speech pre-processing, vibration analysis, fault detection, face detection, presence detection, counting, emotion detection

  4. Our Market Vision
     • Market demand: low operation cost, low deployment cost, low installation cost => massive deployment of intelligent rich-data sensors
     • The building blocks exist: mW-class radios with duty-cycling capability (NB-IoT, LTE-M, Sigfox, LoRa, etc.) and mW-class sensors available for sound, image, radar, …
     • The missing piece: turning Mbit/s of rich sensor data into B/day to kB/day of results, using CNN, SVM, Bayesian methods, boosting, cepstral analysis
     • Issue: this takes way more MIPS than an MCU can deliver, yet it still needs to fit within an MCU power envelope

  5. GAP8: An IoT Application Processor
     • Two independent clock and voltage domains, from 0-133 MHz @ 1.0 V up to 0-250 MHz @ 1.2 V
     • MCU function (FC clock and voltage domain): extended RISC-V fabric controller, L2 memory, micro-DMA, extensive I/O set (LVDS, UART, SPI, I2C, I2S, CPI, HyperBus, GPIO/PWM), embedded DC/DC converters, secured execution / e-fuses, PMU, RTC, ROM, debug
     • Computation engine (cluster clock and voltage domain): 8 extended RISC-V cores, shared L1 memory over a logarithmic interconnect, shared instruction cache, multi-channel DMA, HW synchronization, HW convolution engine (HWCE); fully programmable, efficient parallelization
     • An integrated, hierarchical architecture: TSMC 55LP, 1.0 V to 1.2 V, max frequency 133 MHz to 250 MHz, up to 12.8 GOPS
     • Operating modes: deep sleep 1 µA, retentive sleep 1 µA + x*8 µA (depending on retained memory), pre-analysis around 1 mW, inference a few tens of mW

  6. GAP8: The Open Source Heritage
     • RISC-V (ISA): best-in-class instruction set, UC Berkeley originated; GWT is a member of the RISC-V Foundation
     • PULP (architecture): open source computing platform created by ETHZ and UniBo, permissive license (Solderpad), multiple tape-outs; GWT contributes to PULP
     • GreenWaves (SoC): innovating on RISC-V and PULP; proprietary, balanced system solution based on PULP open source elements plus GWT proprietary elements on both the HW and SW/tools side

  7. How to optimize energy efficiency
     • Being strictly energy-proportional to the demand
     • Lightweight ISA specialization
     • Going parallel
     • Hardwired accelerators
     • Explicit memory management

  8. Being proportional to the demand
     • MCU sleep mode (1 to 50 µW), duty cycling: low-quiescent LDO, 32 kHz real-time clock only, L2 memory partially retentive
     • MCU active mode (0.5 to 5 mW), coarse-grain classification: embedded DC/DC (high current), voltage can change dynamically, one clock generator active with dynamically adjustable frequency, systematic clock gating
     • MCU + parallel processor active mode (5 to 50 mW), full-blown analysis: embedded DC/DC (high current), voltage can change dynamically, two clock generators active with dynamically adjustable frequencies, systematic clock gating
     • Ultra-fast switching between modes and ultra-fast voltage/frequency changes => highly optimized system-level power consumption (see the duty-cycling arithmetic below)
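     To make the duty-cycling arithmetic concrete (the workload below is an illustrative assumption, not a figure from the slide), the average power is the time-weighted sum of the mode powers:

         Average power = d_active * P_active + (1 - d_active) * P_sleep
         e.g. 10 ms of full-blown analysis at 20 mW once per second, sleeping at 10 µW otherwise:
              0.01 * 20 mW + 0.99 * 0.01 mW ≈ 0.21 mW average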

  9. Lightweight ISA Specialization
     • We started from the RISC-V ISA (RV32IMC) and boosted core performance (illustrated below) for:
       - DSP kernels
       - Linear algebra
       - SIMD-type vectorization
     • Datapath gate count increased by approx. 30%
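     As a concrete illustration, the inner loop of most DSP and CNN kernels is a fixed-point dot product like the plain C routine below; on the baseline RV32IMC it executes as a scalar multiply/add loop, while the extended ISA lets the compiler map it onto packed 8-bit SIMD multiply-accumulates, post-increment loads and hardware loops (generic C only, no GAP8 intrinsics shown):

         #include <stdint.h>

         /* Fixed-point dot product: the kernel shape the ISA extensions target.  */
         /* Baseline RV32IMC: several scalar instructions per element.            */
         /* Extended ISA: 4 elements per SIMD MAC, addresses updated by           */
         /* post-increment loads, loop overhead removed by a hardware loop.       */
         int32_t dot_i8(const int8_t *a, const int8_t *b, int n)
         {
             int32_t acc = 0;
             for (int i = 0; i < n; i++)
                 acc += (int32_t)a[i] * (int32_t)b[i];
             return acc;
         }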

  10. Lightweight ISA Specialization
      [Bar charts: "Extended ISA - Cycle Count Speedup" and "Extended ISA - Energy Improvement", comparing RISC-V to GAP8 without and with vectorization across a set of kernels; gains range from roughly 1.3x to 7x]

  11. Going Parallel
      • Goal: quasi-perfect performance scaling as a function of the number of cores involved
      • Obvious gain: dynamic power ∝ V^2 * Freq => scale down the voltage (worked example below)
      • Less obvious:
        - Make sure synchronization overhead stays negligible, since it is serial by nature (Amdahl)
        - Maximize instruction cache reuse in a context where we perform a lot of partial evaluation => shared instruction cache with broadcast capability
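      A worked example of that trade-off, using the GAP8 operating points from slide 5 (figures are illustrative, dynamic power only, leakage ignored):

          P_dyn ∝ C * V^2 * f, so energy per operation ∝ V^2
          1 core @ 250 MHz, 1.2 V  vs  8 cores @ ~31 MHz each, 1.0 V (same total throughput):
          energy ratio ≈ (1.0 / 1.2)^2 ≈ 0.69, i.e. about 30% less energy for the same work,
          or, at equal power, the 8 cores can instead be clocked higher for more throughput.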

  12. Going Parallel - Synchronization
      • The master core dispatches a function Foo, with its arguments, on a group of cores (a minimal sketch of the pattern follows below)
      • All cores blocked on a synchronization barrier are instantly clock gated
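      A minimal sketch of that dispatch pattern in C; the names cluster_fork, cluster_core_id, NUM_CORES are placeholders for illustration, not the actual GAP8 runtime API:

          #include <stdint.h>

          typedef struct { const int16_t *in; int16_t *out; int len; } foo_arg_t;

          /* Worker entry: each cluster core processes its own slice of the data.  */
          /* When a core finishes it blocks on the join barrier and is clock gated */
          /* by hardware until the next dispatch.                                  */
          static void Foo(void *raw)
          {
              foo_arg_t *a = (foo_arg_t *)raw;
              int id       = cluster_core_id();                 /* placeholder */
              int chunk    = (a->len + NUM_CORES - 1) / NUM_CORES;
              int first    = id * chunk;
              int last     = first + chunk < a->len ? first + chunk : a->len;
              for (int i = first; i < last; i++)
                  a->out[i] = a->in[i] >> 1;
          }

          /* Master side (fabric controller or cluster core 0): fork Foo with its  */
          /* argument on all cluster cores and wait for the join. The dispatch     */
          /* itself is the only serial part (Amdahl).                              */
          void halve_buffer(const int16_t *in, int16_t *out, int len)
          {
              foo_arg_t arg = { .in = in, .out = out, .len = len };
              cluster_fork(NUM_CORES, Foo, &arg);               /* placeholder */
          }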

  13. Going Parallel - Performance Scaling
      [Chart: quasi-perfect scaling of performance with the number of cores]

  14. Going Parallel - Energy Scaling
      • Convolution: 80% of the CNN workload
      • Average energy gain from the ISA extensions: 3.4x, amplified by parallelism to 7.4x

  15. Putting Everything Together vs. an MCU
      Running CIFAR-10, same network, same precision:

      What                   | Freq (MHz) | Exec time (ms)   | Cycles     | Power (mW)
      40 nm dual-issue MCU   | 216        | 99.1             | 21,400,000 | 60
      GAP8 @ 1.0 V           | 15.4       | 99.1             |  1,500,000 | 3.7 (16x lower power)
      GAP8 @ 1.2 V           | 175        | 8.7 (11x faster) |  1,500,000 | 70
      GAP8 @ 1.0 V with HWCE | 4.7        | 99.1             |    460,000 | 0.8

  16. Explicit Memory Management
      • GAP8 has no data caches: this saves silicon area and, more importantly, energy, since a cache's efficiency depends mostly on its hit ratio
      • We can turn this weakness into an (energy) benefit if we automate the data transfers; in practice the vast majority of the traffic is predictable => we have a way to optimize memory allocation and bandwidth
      • Memory hierarchy: external L3 (RAM/Flash) to L2 through the micro-DMA, L2 to shared L1 through the cluster DMA
      • Automatic data tiling and pipelined memory transfers, interleaved with parallel calls to the compute kernels, are handled by our "AutoTiler" tool: execution of one tile overlaps with the L2-to-L1 transfer of the next and the L3-to-L2 transfer behind it (a simplified sketch follows below)
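      A simplified sketch of the pipelined scheme described above, showing double buffering between L2 and L1; dma_req_t, dma_l2_to_l1_async, dma_wait, dma_l1_to_l2 and compute_kernel are placeholder names, not the GAP8 DMA API, and the data size is assumed to be a multiple of the tile size:

          #include <stdint.h>

          #define TILE 4096                          /* illustrative L1 tile size (bytes) */

          void process_tiled(const int8_t *l2_in, int8_t *l2_out, int size)
          {
              static int8_t l1_buf[2][TILE];         /* ping-pong buffers in shared L1 */
              int n_tiles = size / TILE;             /* size assumed a multiple of TILE */

              /* Prologue: fetch the first tile into L1. */
              dma_req_t req = dma_l2_to_l1_async(l1_buf[0], l2_in, TILE);

              for (int t = 0; t < n_tiles; t++) {
                  dma_wait(req);                     /* tile t is now resident in L1 */
                  if (t + 1 < n_tiles)               /* overlap: prefetch tile t+1 */
                      req = dma_l2_to_l1_async(l1_buf[(t + 1) & 1],
                                               l2_in + (t + 1) * TILE, TILE);
                  compute_kernel(l1_buf[t & 1], TILE);                   /* parallel basic kernel */
                  dma_l1_to_l2(l2_out + t * TILE, l1_buf[t & 1], TILE);  /* write results back */
              }
          }

      The AutoTiler generates this structure automatically from the kernel description, also pipelining the L3-to-L2 transfers through the micro-DMA.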

  17. Explicit Memory Management: AutoTiler
      • Basic Kernels, how to handle a parametric tile: vectorization + parallelization, no assumption on where the actual data are located; usually packaged as libraries
      • User Kernels: pass actual data to basic kernels and let data circulate between them; can be grouped and organized as generators
        - A multi-dimensional iteration space (2D, 3D, 4D, 5D, …) and a traversal order
        - Each argument is a sub-space of the iteration space, with actual dimensions, a location (L2, external) and properties; its order may differ from that of the iteration space
        - Given a memory budget, the AutoTiler "tiles" each argument and generates a fully pipelined implementation interleaving processing and data transfers
        - Basic kernels are inserted at defined locations in the iteration space (prologue, body, epilogue, …) and receive the generated tiles
      • Graph: connected user kernels, constants, input and output features (e.g. a CNN plus pre/post-processing); optimal static memory allocation for all dynamic objects

  18. Explicit Memory Management: AutoTiler
      • The model (user kernels, groups of user kernels, the graph) is expressed as a C program calling the AutoTiler's Model API, on top of the AutoTiler library (generators, basic kernels) and its C libraries (constraints solver, C code generator)
      • Example model for MNIST, compiled and run on a PC:

          #include "AutoTilerLib.h"
          #include "CNN_Generator.h"

          void Mnist() {
              CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_0", 5, 1, 32, 28, 28, 1);
              CNN_TiledConvNxNReLUPool2x2_SW_fp("Conv5x5RLMP_1", 5, 32, 64, 12, 12, 1);
              CNN_TiledLinearLayer("LinearLayerRL_2", 64, 4, 4, 10, 1, 0, 0);
          }

      • The output is C code for the target that handles the data transfers and dispatches the basic kernels on the cluster's cores; the working set is tiled so as to maximize reuse at minimum distance from the datapath
