From Small to Tiny: How to Co-design ML Models, Computational Precision and Circuits in the Energy-Accuracy Trade-off Space (PowerPoint presentation, Marian Verhelst, Marian.Verhelst@kuleuven.be)


  1. From Small to Tiny: How to Co-design ML Models, Computational Precision and Circuits in the Energy-Accuracy Trade-off Space. Marian Verhelst, Marian.Verhelst@kuleuven.be

  2. Embedded Deep Neural Networks: keyword and speaker recognition, augmented reality, face and owner recognition. Today, raw data is streamed to a cloud GPU, which sends the extracted information back.

  3. Embedded Deep Neural Networks: the same applications (keyword and speaker recognition, augmented reality, face and owner recognition), but with local processing on the device.

  4. Towards embedded Deep Neural Networks: minimize TOTAL energy at a target performance by innovating at the circuit level, the architecture level, and the algorithmic level of the application ... without giving up flexibility!

  5. Towards embedded Deep Neural Networks: minimize TOTAL energy at a target performance (TOPs/Watt!?) by innovating at the circuit level, the architecture level, and the algorithmic level of the application ... without giving up flexibility!

  6. Circuit level choices: analog MAC [Bankman, ISSCC18], 1-bit digital MAC [Moons, CICC18], or multi-precision digital MAC (2-16 bit) [Moons, ISSCC17]. (MAC = multiply-accumulate.)

  7. Circuit level implications, with Stanford (Murmann). Analog MAC [Bankman, ISSCC18] vs. 1-bit digital MAC [Moons, CICC18] vs. multi-precision digital MAC [Moons, ISSCC17]:
     – Area: large / small / medium
     – Energy: 500 TOPs/W / 200 TOPs/W / 0.5 TOPs/W (16b), 1 TOPs/W (8b), 5 TOPs/W (4b), 10 TOPs/W (2b)
     – Flexibility: low / medium / high
     – Accuracy: ?
     Which is the best one?
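The efficiency figures above convert directly into energy per operation (1 TOPs/W = 1 pJ/op). A minimal sketch using the slide's numbers; the 100 MMAC example network is hypothetical:

```python
# Per-MAC-style efficiencies as reported on the slide (TOPs/W).
TOPS_PER_WATT = {
    "analog_mac": 500.0,          # [Bankman, ISSCC18]
    "binary_digital_mac": 200.0,  # [Moons, CICC18]
    "multi_precision_16b": 0.5,   # [Moons, ISSCC17]
    "multi_precision_8b": 1.0,
    "multi_precision_4b": 5.0,
    "multi_precision_2b": 10.0,
}

def energy_per_op_pj(tops_per_watt: float) -> float:
    """pJ per operation: 1 TOPs/W = 1e12 ops/s per W = 1 pJ/op."""
    return 1.0 / tops_per_watt

def inference_energy_uj(num_macs: float, tops_per_watt: float) -> float:
    """Compute-only energy (uJ) for one inference of num_macs MAC ops."""
    return num_macs * energy_per_op_pj(tops_per_watt) * 1e-6

# Example: a hypothetical 100 MMAC network on the 4-bit configuration (~20 uJ).
print(inference_energy_uj(100e6, TOPS_PER_WATT["multi_precision_4b"]))
```

Note this counts compute energy only; memory access energy (slide 17) usually dominates at these efficiencies.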

  8. Towards embedded Deep Neural Networks: minimize TOTAL energy at a target performance by innovating at the circuit level (analog or digital? optimal precision?), the architecture level, and the algorithmic level of the application ... without giving up flexibility!

  9. Architecture level choices: a configurable systolic accelerator (controller, weight memory, activation memories, MAC array) or a programmable processor (ASIP) [Moons, CICC18/ISSCC18], [Moons, ISSCC17]:
     – Area: small vs. large(r)
     – Energy efficiency: high vs. lower
     – Flexibility (utilization): low vs. high
     – Memory vs. compute balance

  10. Architecture level choices (2): for a programmable processor (ASIP), how to spend area?
     – More MACs in parallel?
     – Larger memory?
     – Local or global memories?
     Which is the best one?

  11. Towards embedded Deep Neural Networks: minimize TOTAL energy at a target performance by innovating at the circuit level (analog or digital? optimal precision?), the architecture level (data parallelism? memory hierarchy?), and the algorithmic level of the application ... without giving up flexibility!

  12. Algorithm level choices: the same task can be implemented with many network topologies (Layer1, Layer2, Layer3, ..., LayerN): vary network width, vary network depth, vary layer topology.

  13. [Moons, Asilomar17] Algorithm level choices: implications. Graph for CIFAR-10: every parameter combination is one dot; together they trace a Pareto-optimal energy-accuracy curve.
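Extracting that Pareto-optimal curve from the cloud of dots is a simple dominance filter. A sketch with hypothetical (energy, error) points, not the talk's CIFAR-10 data:

```python
def pareto_front(points):
    """Keep points not dominated in (energy, error); lower is better on both.

    Sort by energy, then sweep, keeping each point that improves on the
    best error seen so far."""
    best_err = float("inf")
    front = []
    for energy, err in sorted(points):
        if err < best_err:
            front.append((energy, err))
            best_err = err
    return front

# Hypothetical (energy [uJ], top-1 error [%]) dots, one per
# depth/width/precision combination of the same task:
dots = [(10, 30), (12, 22), (15, 25), (20, 18), (22, 21), (40, 17)]
print(pareto_front(dots))  # [(10, 30), (12, 22), (20, 18), (40, 17)]
```

The dropped points, e.g. (15, 25), cost more energy than some other point with lower error.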

  14. Algorithm level choices: precision. The same task can be implemented with many network topologies (Layer1, Layer2, Layer3, ..., LayerN): vary computational precision, vary network width, vary network depth, vary layer topology.

  15. [Moons, Asilomar17] Algorithm level choices: implications. Graph for CIFAR-10: int1/int2/... networks need more operations, but simpler operations, and more, but smaller, memory accesses. What is the impact on parallelism and data reuse? On compute vs. memory cost? Which network is most energy efficient?

  16. Towards embedded Deep Neural Networks: minimize TOTAL energy at a target performance by innovating at the circuit level (analog or digital? optimal precision?), the architecture level (data parallelism? memory hierarchy?), and the algorithmic level (network depth & width? layer topology? bit resolution?). Optimize ACROSS all levels ... without giving up flexibility!

  17. [Moons, Asilomar17] Parametrized HW energy/latency/area model: per-operation energies for DRAM access, SRAM access, and multiply-accumulate, parametrized across the circuit and architecture options. A similar approach yields models for latency/delay/throughput and, respectively, area.
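The three cost terms named on the slide suggest a model of the form E = N_dram·E_dram + N_sram·E_sram + N_mac·E_mac(bits). A minimal sketch; the per-access energies and the roughly quadratic precision scaling of digital MAC energy are illustrative assumptions, not the talk's calibrated values:

```python
def mac_energy_pj(bits: int, e_mac_16b_pj: float = 2.0) -> float:
    # Assumption: digital MAC energy scales roughly quadratically with
    # operand precision, anchored at a placeholder 2 pJ for 16 bit.
    return e_mac_16b_pj * (bits / 16) ** 2

def network_energy_uj(n_mac: float, n_sram: float, n_dram: float,
                      bits: int,
                      e_sram_pj: float = 5.0,     # placeholder SRAM access cost
                      e_dram_pj: float = 200.0):  # placeholder DRAM access cost
    """Total inference energy (uJ) from MAC and memory-access counts.

    Memory access energy is scaled linearly with word width: narrower
    operands mean cheaper (or fewer parallel) accesses."""
    e_pj = (n_mac * mac_energy_pj(bits)
            + n_sram * e_sram_pj * bits / 16
            + n_dram * e_dram_pj * bits / 16)
    return e_pj * 1e-6

# Example: 100 MMACs, 10 M SRAM accesses, 1 M DRAM accesses at 8 bit.
print(round(network_energy_uj(100e6, 10e6, 1e6, bits=8), 1))  # 175.0 uJ
```

Even in this toy version, the DRAM term dominates, which is why the architecture-level memory hierarchy questions of slide 11 matter as much as the MAC circuit.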

  18. [Moons, Asilomar17] Energy-based cross-layer optimization. Combining the CIFAR-10 graphs with the HW model jointly determines the most energy-efficient network and circuit parameters: 4-bit! But the optimum varies over accuracies and applications, hence the need for flexible hardware. A similar study [Moons, Asilomar17] finds the optimum memory vs. datapath size, the optimum layer topology, ...
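The joint determination can be sketched as a grid sweep over bit width and network width, keeping the cheapest configuration that meets an accuracy target. Both models below are invented placeholders, tuned only so that the toy search lands on a 4-bit optimum like the slide; they are not the measured CIFAR-10 results:

```python
def energy_uj(bits, width):
    # Placeholder: energy grows with width^2 (MAC count) and bits^2 (MAC cost).
    return 0.002 * (width ** 2) * (bits ** 2)

def accuracy(bits, width):
    # Placeholder: accuracy saturates with width; very low precision
    # costs a fixed accuracy penalty.
    base = 0.95 - 8.0 / width
    return base - (0.12 if bits < 4 else 0.0)

target = 0.85
configs = [(b, w) for b in (1, 2, 4, 8, 16) for w in (32, 64, 128, 256)]
feasible = [(energy_uj(b, w), b, w) for b, w in configs
            if accuracy(b, w) >= target]
e, b, w = min(feasible)  # tuples compare by energy first
print(f"best: {b}-bit, width {w}, {e:.1f} uJ")  # best: 4-bit, width 128, 524.3 uJ
```

Changing `target` shifts the optimum, which is the slide's point: the best precision varies over accuracies and applications, so the hardware must stay flexible.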

  19. Needs for flexible systems, addressed with the cross-layer framework (HW models feeding cross-layer optimization of the HW configuration and NN topology), led to three chips with structural and precision scalability:
     – Envision: precision-scalable CNN processor, gen2, 10-100 mW power consumption [VLSI'16, ISSCC'17]
     – BinarEye: machine-learned wake-up image processor, ~1 mW [ISSCC'18, CICC'18]
     – LSTMacc: machine-learned wake-up audio processor, ~10 uW [ESSCIRC'18]

  20. Towards embedded Deep Neural Networks: minimize TOTAL energy at a target performance by innovating at the circuit level (analog or digital? optimal precision?), the architecture level (data parallelism? memory hierarchy?), and the algorithmic level (network depth & width? layer topology? bit resolution?). Optimize ACROSS all levels and adapt dynamically (data dependent). Example: face recognition.

  21. Cascaded networks for efficient face recognition:
     – Algorithmic level: a cascade of face detection and owner detection stages; each stage only wakes the next on a positive ('y?') result.
     – Architecture level: binary, 125 MMACs/frame, 17 kB → binary, 2 GMACs/frame, 260 kB → 6-bit, 15 GMACs/frame, 15 MB.
     – Circuit level: the binary stages run on the BinarEye accelerator, the 6-bit stage on the Envision processor. <1 mW average.
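The sub-milliwatt average follows from cascade statistics: the cheap stage runs on every frame, the expensive ones only when triggered. A sketch of the expected-cost computation, using the slide's per-stage MAC counts but hypothetical trigger probabilities:

```python
# (stage name, MACs per frame from the slide, hypothetical probability that
# the stage fires and wakes the next one; None for the final stage)
stages = [
    ("face detector (binary)",   125e6, 0.05),
    ("face stage 2 (binary)",      2e9, 0.50),
    ("owner detector (6-bit)",    15e9, None),
]

def expected_macs_per_frame(stages):
    """Expected compute per frame when each stage runs only if all
    previous stages fired."""
    p_reach = 1.0  # probability that the current stage runs at all
    total = 0.0
    for _name, macs, p_fire in stages:
        total += p_reach * macs
        if p_fire is not None:
            p_reach *= p_fire
    return total

print(f"{expected_macs_per_frame(stages)/1e6:.0f} MMACs/frame on average")
```

With these assumed rates, the average is 600 MMACs/frame versus a worst case of about 17 GMACs/frame if every stage ran on every frame, roughly a 28x saving.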

  22. Towards embedded Deep Neural Networks: minimize TOTAL energy at a target performance by innovating at the circuit level (analog or digital? optimal precision?), the architecture level (data parallelism? memory hierarchy?), and the algorithmic level (network depth & width? layer topology? bit resolution?). Optimize ACROSS all levels and adapt dynamically (data dependent). Example: keyword & speaker recognition.

  23. Cascaded ML models for efficient keyword & speaker recognition:
     – Algorithmic level: voice detection → keyword detection → speaker identification → speech recognition; each stage only wakes the next on a positive ('y?') result.
     – Architecture level: 1-4-bit, 40 kMACs/sec, ~2 kB → 4-8-bit LSTM, 2 MMACs/sec, 64 kB → 8-bit GMM, 70 MMACs/sec, 500 kB.
     – Circuit level: runs on a cascade of embedded accelerators. <20 uW average [VLSI2019].

  24. Towards embedded Deep Neural Networks: minimize TOTAL energy at a target performance by innovating at the circuit, architecture, and algorithmic levels; optimize ACROSS all levels and adapt dynamically (data dependent). The system matters, not TOPs/W!

  25. Contact: Marian.Verhelst@kuleuven.be
