
Flexible and Scalable Acceleration Techniques for Low-Power Edge Computing

Flexible and Scalable Acceleration Techniques for Low-Power Edge Computing. 2nd Italian Workshop on Embedded Systems, 8.9.2017, Università degli Studi di Roma "La Sapienza". Francesco Conti (1,2), Davide Rossi (1), Luca Benini (1,2)


  1. Flexible and Scalable Acceleration Techniques for Low-Power Edge Computing. 2nd Italian Workshop on Embedded Systems, 8.9.2017, Università degli Studi di Roma "La Sapienza". Francesco Conti (1,2), Davide Rossi (1), Luca Benini (1,2); f.conti@unibo.it. (1) Energy Efficient Embedded Systems Laboratory, (2) Integrated Systems Laboratory.

  2. Computing for the Internet of Things. Battery + harvesting powered → a few mW power envelope. F.Conti @ IWES 2017 | 08/09/17 | 2

  3. Computing for the Internet of Things. Sense: MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT; 100 µW to 2 mW. Battery + harvesting powered → a few mW power envelope.

  4. Computing for the Internet of Things. Sense: MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT; 100 µW to 2 mW. Analyze and Classify: µController (e.g. Cortex-M) with IOs; 1 to 25 MOPS, 1 to 10 mW. Battery + harvesting powered → a few mW power envelope.

  5. Computing for the Internet of Things. Sense: MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT; 100 µW to 2 mW. Analyze and Classify: µController (e.g. Cortex-M) with IOs; 1 to 25 MOPS, 1 to 10 mW. Transmit: short range, medium BW; long range, low BW; low rate (periodic) data; SW update, commands; idle: ~1 µW, active: ~50 mW. Battery + harvesting powered → a few mW power envelope.
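The power figures on this slide can be combined into a rough budget. The sketch below is only illustrative: it takes the radio idle (~1 µW) and active (~50 mW) figures from the slide, while the duty-cycle model and the function names are assumptions introduced here, not something the presentation states.

```python
# Illustrative duty-cycle budget; the linear averaging model is an assumption.
RADIO_IDLE_MW = 0.001   # ~1 µW idle, from the slide
RADIO_ACTIVE_MW = 50.0  # ~50 mW active, from the slide


def average_power_mw(active_mw, idle_mw, duty_cycle):
    """Time-averaged power of a block that is active a fraction duty_cycle of the time."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be in [0, 1]")
    return duty_cycle * active_mw + (1.0 - duty_cycle) * idle_mw


def radio_average_mw(duty_cycle):
    """Average radio power for a given transmit duty cycle."""
    return average_power_mw(RADIO_ACTIVE_MW, RADIO_IDLE_MW, duty_cycle)


def max_radio_duty_cycle(budget_mw):
    """Largest duty cycle that keeps the average radio power within budget_mw."""
    d = (budget_mw - RADIO_IDLE_MW) / (RADIO_ACTIVE_MW - RADIO_IDLE_MW)
    return min(1.0, max(0.0, d))
```

Under this model, a radio allowed 1 mW on average can be active only about 2% of the time, which is one way to read why local analysis before transmission matters within a few-mW envelope.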

  6. The Road to Efficiency. [Figure: energy efficiency [Gop/s/W], active leakage power [mW], maximum frequency [MHz], and total power [W] versus supply voltage [V] from 0.2 V to 1.4 V, for 65nm CMOS at 50 °C; the subthreshold, near-threshold, and normal regions are marked, with a 320 mV marker; annotations: "performance constraint", "parallel computing". Adapted from Borkar and Chien, The Future of Microprocessors, Communications of the ACM, May 2011.] Parallel computing is particularly attractive for analytics workloads, which often expose natural parallelism, and is naturally coupled with near-threshold computing.
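The coupling between parallelism and near-threshold operation can be illustrated with a toy model. Everything numeric here is an assumption made for illustration: a linear frequency model f = k (Vdd - Vth), a threshold of 0.3 V, ideal parallel speedup, and dynamic energy per operation proportional to C·Vdd². Leakage, which the figure shows becoming significant at low voltage, is ignored.

```python
# Toy model only: vth, k_mhz_per_v and c_eff are made-up illustrative parameters.

def max_frequency_mhz(vdd, vth=0.3, k_mhz_per_v=500.0):
    """Assumed linear model: f = k * (Vdd - Vth), zero at or below threshold."""
    return max(0.0, k_mhz_per_v * (vdd - vth))


def dynamic_energy_per_op(vdd, c_eff=1.0):
    """Dynamic switching energy per operation, E = C * Vdd^2 (arbitrary units)."""
    return c_eff * vdd ** 2


def min_vdd_for_throughput(target_mops, n_cores, vth=0.3, k_mhz_per_v=500.0):
    """Lowest Vdd at which n_cores (1 op/cycle each, ideal speedup) reach target_mops."""
    per_core_mhz = target_mops / n_cores
    return vth + per_core_mhz / k_mhz_per_v


def energy_ratio(target_mops, n_cores):
    """Single-core energy per op divided by n_cores energy per op at equal throughput."""
    e_single = dynamic_energy_per_op(min_vdd_for_throughput(target_mops, 1))
    e_parallel = dynamic_energy_per_op(min_vdd_for_throughput(target_mops, n_cores))
    return e_single / e_parallel
```

With these assumed parameters, a 400 MOPS target needs about 1.1 V on one core but only about 0.5 V on four cores, so `energy_ratio(400, 4)` comes out near 4.8 in dynamic energy per operation. Real silicon will differ, in particular once leakage is taken into account.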

  7. Computing for the Internet of Things. Sense: MEMS IMU, MEMS microphone, ULP imager, EMG/ECG/EIT; 100 µW to 2 mW. Analyze and Classify: µController (e.g. Cortex-M) with L2 memory and IOs; 1 to 2000 MOPS (1 to 25 MOPS on the earlier slides), 1 to 10 mW. Transmit: short range, medium BW; long range, low BW; low rate (periodic) data; SW update, commands; idle: ~1 µW, active: ~50 mW. Battery + harvesting powered → a few mW power envelope.

  9. PULP architecture outline. [Diagram: Core #1 to Core #N, each with its own instruction cache, connected through the TCDM logarithmic interconnect to a row of memory banks.] Parallel Ultra Low Power (PULP) in a nutshell: energy efficiency for the IoT through near-threshold ULP execution, parallel computing, and an architecture targeted at low power. Targeting 100-1000 GOPS/W (> 100x current MCUs). A joint effort of University of Bologna, ETH Zurich and other academic and industrial partners. Parallel access to shared memory → Flexibility.
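To make "parallel access to shared memory" concrete, the sketch below models a multi-banked memory in which each bank serves one request per cycle. The word-level interleaving scheme, the bank count of 8, the 4-byte word size, and the function names are assumptions for illustration; the slide itself only shows banks behind a logarithmic interconnect.

```python
from collections import Counter

# Assumed mapping: consecutive 32-bit words go to consecutive banks.

def bank_index(addr, n_banks=8, word_bytes=4):
    """Bank holding the word at byte address addr under word-level interleaving."""
    return (addr // word_bytes) % n_banks


def cycles_to_serve(addresses, n_banks=8, word_bytes=4):
    """Cycles to serve simultaneous requests if each bank handles one per cycle.

    This equals the largest number of requests that map to a single bank.
    """
    if not addresses:
        return 0
    hits = Counter(bank_index(a, n_banks, word_bytes) for a in addresses)
    return max(hits.values())
```

In this model, four cores reading consecutive words (`[0, 4, 8, 12]`) hit four different banks and are served in one cycle, whereas a stride of eight words (`[0, 32, 64, 96]`) sends every request to the same bank and serializes them.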

  10. PULP architecture outline. [Diagram: Core #1 to Core #N, each with an L0 fetch buffer in front of shared instruction caches, connected through the TCDM logarithmic interconnect to the memory banks.] Shared I$ + L0 fetch buffer → Efficiency.

  11. PULP architecture outline. [Diagram: the memory banks are now a mix of SCM and SRAM, connected through the TCDM logarithmic interconnect to Core #1 to Core #N, each with an L0 buffer and a shared instruction cache.] Hybrid memory: SRAM + SCM → can work at very low Vdd.

  12. PULP architecture outline. [Diagram: SCM and SRAM banks, TCDM logarithmic interconnect, Core #1 to Core #N with L0 buffers and a shared instruction cache, plus a HW Synch block next to the cores.] HW Synch → Faster core shutdown + parallelism.

  13. PULP architecture outline. [Diagram: SCM and SRAM banks, TCDM logarithmic interconnect, Core #1 to Core #N with L0 buffers, a shared instruction cache, and a HW Synch block.] Fine-grain Clk-Gating + Body-Bias → Less Power.

  14. PULP architecture outline. [Diagram: SCM and SRAM banks, TCDM logarithmic interconnect, Core #1 to Core #N with L0 buffers, a shared instruction cache, and HW Synch; around the cluster: DMA, cluster bus, instruction bus, bus adapter, L2 memory, QSPI master, and QSPI slave.] Add infrastructure to access off-cluster memory.

  15. How to get even more efficient? [Figure: energy efficiency [Gop/s/W], active leakage power [mW], maximum frequency [MHz], and total power [W] versus supply voltage [V] from 0.2 V to 1.4 V, for 65nm CMOS at 50 °C; the subthreshold region and a 320 mV marker are shown; annotations: "parallel computing", "heterogeneous computing". Adapted from Borkar and Chien, The Future of Microprocessors, Communications of the ACM, May 2011.]

  16. HW Acceleration in Tightly-Coupled Clusters. [Diagram: memory banks, TCDM logarithmic interconnect, Core #1 to Core #N with instruction caches; DMA, cluster bus, instruction bus, bus adapter, and L2 memory; a host processor attached through the cluster interface.] A host processor outside the cluster.

  17. HW Acceleration in Tightly-Coupled Clusters. [Diagram: memory banks, TCDM logarithmic interconnect, Core #1 to Core #N with instruction caches, and a HW Processing Engine next to the cores; DMA, cluster bus, instruction bus, bus adapter, and L2 memory; a host processor attached through the cluster interface.] HW Processing Engines inside the cluster.

  18. HW Acceleration in Tightly-Coupled Clusters. [Diagram: memory banks, TCDM logarithmic interconnect, Core #1 to Core #N with instruction caches, and a HW Processing Engine next to the cores; DMA, cluster bus, instruction bus, bus adapter, and L2 memory; a host processor attached through the cluster interface.]

  19. HW Processing Engines. On the data plane, cores see HWPEs as a set of SW cores: HW Processing Engine = "virtual" Core #N+1, Core #N+2, Core #N+3. On the control plane, cores control HWPEs as a memory-mapped peripheral (e.g. a DMA): HW Processing Engine = "virtual" DMA.
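The two views of an HWPE on this slide can be mimicked in a small software model. This is a sketch under stated assumptions, not the actual HWPE interface: the register offsets, the `ToyHwpe` class, the "double every word" job, and the `offload_double` driver are all hypothetical names invented here. The only ideas taken from the slide are that cores program the engine through memory-mapped registers (control plane) and that the engine reads and writes the same shared memory as the cores (data plane).

```python
# Hypothetical register map; offsets are illustrative only.
REG_SRC, REG_DST, REG_LEN, REG_TRIGGER, REG_STATUS = 0x00, 0x04, 0x08, 0x0C, 0x10


class ToyHwpe:
    """Software stand-in for an engine that shares memory with the cores."""

    def __init__(self, shared_mem):
        self.mem = shared_mem  # data plane: the same word array the cores use
        self.regs = {REG_SRC: 0, REG_DST: 0, REG_LEN: 0, REG_STATUS: 0}

    def write(self, offset, value):
        """Control plane: a core writes a job parameter or triggers the job."""
        if offset == REG_TRIGGER:
            self._run()
        elif offset in (REG_SRC, REG_DST, REG_LEN):
            self.regs[offset] = value
        else:
            raise ValueError(f"unknown or read-only register offset {offset:#x}")

    def read(self, offset):
        """Control plane: a core reads back a register, e.g. the status flag."""
        return self.regs[offset]

    def _run(self):
        """Made-up job: write 2 * mem[src + i] to mem[dst + i] for i < len."""
        src, dst, n = self.regs[REG_SRC], self.regs[REG_DST], self.regs[REG_LEN]
        for i in range(n):
            self.mem[dst + i] = 2 * self.mem[src + i]
        self.regs[REG_STATUS] = 1  # job done


def offload_double(shared_mem, src, dst, n):
    """Core-side driver: program the registers, trigger, and return the status."""
    hwpe = ToyHwpe(shared_mem)
    hwpe.write(REG_SRC, src)
    hwpe.write(REG_DST, dst)
    hwpe.write(REG_LEN, n)
    hwpe.write(REG_TRIGGER, 1)
    return hwpe.read(REG_STATUS)
```

The driver never copies data to the engine: it passes only word indices into the shared array, which is the point of placing the engine inside the cluster next to the cores.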
