Exploring Computation-Communication Tradeoffs in Camera Systems


  1. Exploring Computation-Communication Tradeoffs in Camera Systems. Amrita Mazumdar, Armin Alaghi, Thierry Moreau, Luis Ceze, Sung Kim, Mark Oskin, Meghan Cowan, Visvesh Sathe. IISWC 2017.

  2. Camera applications are a prominent workload with tight constraints: augmented reality glasses (real-time, low-power, lightweight processing); energy-harvesting cameras (low-power, lightweight, real-time processing); surveillance cameras (real-time processing, large video data size); 3D-360 virtual reality camera rigs (large data size).

  3. Hardware implementations compound the camera system design space: each camera system implementation (CPU, GPU, DSP, FPGA, ASIC) must be weighed against its constraints (power, bandwidth, time, size). Running example: DogChat™.

  4. We can represent camera applications as camera processing pipelines to clarify design space exploration: sensor → block 1 → block 2 → block 3 → block 4, where each block is a function in the application.

  5. We can represent camera applications as camera processing pipelines to clarify design space exploration. For DogChat™: sensor → image processing → face detection → feature tracking → image rendering.
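
A minimal sketch (in Python) of how a pipeline like the DogChat™ example could be written down for design-space exploration. The Stage fields and every number below are hypothetical placeholders for illustration, not values from the talk.

```python
# Sketch: a camera processing pipeline as an ordered list of stages.
# Stage names follow the slide; costs and sizes are made-up placeholders.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    compute_cost: float   # energy or time per frame, arbitrary units (assumed)
    output_bytes: int     # data this stage emits per frame (assumed)

dogchat_pipeline = [
    Stage("image processing", compute_cost=1.0, output_bytes=2_000_000),
    Stage("face detection",   compute_cost=5.0, output_bytes=10_000),
    Stage("feature tracking", compute_cost=2.0, output_bytes=1_000),
    Stage("image rendering",  compute_cost=8.0, output_bytes=500_000),
]

for stage in dogchat_pipeline:
    print(f"{stage.name:>18}: compute={stage.compute_cost}, out={stage.output_bytes} B")
```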

  6. Developers can trade off between computation and communication costs: sensor → image processing → face detection → feature tracking → image rendering, with the processing stages offloaded to the cloud (DogChat™).

  7. Developers can trade off between computation and communication costs: sensor → image processing → face detection → feature tracking → image rendering, split between in-camera processing and stages offloaded to the cloud (DogChat™).
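
A sketch of the computation/communication split this slide describes: pick a split point, run the stages before it in-camera, offload the rest, and compare total cost. The cost model, sizes, and weights here are illustrative assumptions, not the paper's model.

```python
# Sketch: evaluate every split point of a hypothetical pipeline.
# Stages before the split run in-camera; the rest are offloaded.
pipeline = [                      # (name, compute_cost, output_bytes), all assumed
    ("image processing", 1.0, 2_000_000),
    ("face detection",   5.0, 10_000),
    ("feature tracking", 2.0, 1_000),
    ("image rendering",  8.0, 500_000),
]
SENSOR_BYTES = 4_000_000          # raw frame size off the sensor (assumed)
COST_PER_BYTE = 1e-6              # relative cost of transmitting one byte (assumed)

def split_cost(split):
    compute = sum(c for _, c, _ in pipeline[:split])
    # Data crossing the link: the raw sensor frame if nothing runs locally,
    # otherwise the output of the last in-camera stage.
    crossing = SENSOR_BYTES if split == 0 else pipeline[split - 1][2]
    return compute + COST_PER_BYTE * crossing

for split in range(len(pipeline) + 1):
    local = [name for name, _, _ in pipeline[:split]] or ["(nothing in-camera)"]
    print(f"split={split}: cost={split_cost(split):.2f}  in-camera={local}")
```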

  8. Optional and required blocks in camera pipelines introduce more tradeoffs. Required: sensor → image processing → face detection → feature tracking → image rendering. Optional: edge detection, motion tracking, motion detection.

  9. Custom hardware platforms explode the camera system design space. Required pipeline: sensor → image processing → face detection → feature tracking → image rendering; optional blocks: edge detection, motion tracking, motion detection; each block can map to a CPU, GPU, DSP, FPGA, or ASIC.

  10. Custom hardware platforms explode the camera system design space: the same required and optional blocks, each mappable to a CPU, GPU, DSP, FPGA, or ASIC. In-camera processing pipelines can help us evaluate these tradeoffs!
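
To make the "explodes the design space" claim concrete, here is a small enumeration sketch: every subset of the optional blocks combined with a platform choice per block. The counting scheme is illustrative only, not how the paper frames its search space.

```python
# Sketch: count (block subset, platform assignment) configurations.
# Block and platform names come from the slide; the enumeration is just itertools.
from itertools import product

required = ["image processing", "face detection", "feature tracking", "image rendering"]
optional = ["edge detection", "motion tracking", "motion detection"]
platforms = ["CPU", "GPU", "DSP", "FPGA", "ASIC"]

count = 0
for include in product([False, True], repeat=len(optional)):
    blocks = required + [b for b, keep in zip(optional, include) if keep]
    count += len(platforms) ** len(blocks)   # one platform choice per included block

print(f"{count} distinct (block subset, platform assignment) configurations")
```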

  11. Challenges for modern camera systems. Low power: face authentication for energy-harvesting cameras with ASIC design (motion detection → face detection → neural network). Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration (prep → align → depth → stitch).

  12. Challenges for modern camera systems (repeated). First case study: low-power face authentication for energy-harvesting cameras with ASIC design (motion detection → face detection → neural network).

  13. Face authentication with energy-harvesting cameras: WISPCam, an energy-harvesting camera powered by RF; 1 frame per second; ~1 mW of processing per frame.

  14. Face authentication with energy-harvesting cameras: Is this Armin? ✅

  15. CPU-based face authentication neural networks can exceed the WISPCam power budget. Pipeline: sensor → neural network → other application functions, with the neural network on an on-chip CPU and the rest in the cloud.

  16. CPU-based face authentication neural networks can exceed the WISPCam power budget. Pipeline: sensor → motion detection → face detection → neural network → other application functions, with the added blocks as an on-chip ASIC hardware circuit and the rest in the cloud. Adding optional blocks can reduce power consumption for a neural network.

  17. Exploring design tradeoffs in ASIC accelerators: a streaming Viola-Jones (VJ) face-detection accelerator and a SNNAP neural-network accelerator; many more details in the paper! [Figure: block diagrams of the VJ accelerator (integral image accumulator, window buffer SRAM, feature, threshold, and classifier units) and the SNNAP accelerator (PE array, scheduler, sigmoid unit, DMA master, bus interface).] For face detection, we explored classifier and other algorithm parameters to optimize energy. For the neural network, we evaluated topology and hardware accelerator impact on energy and accuracy, selecting a 400-8-1 network topology and 8-bit datapaths for the optimal energy/accuracy point.
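
A toy sketch of the 400-8-1 topology with 8-bit weights mentioned on this slide. The quantization scheme, random weights, and sigmoid activation are assumptions for illustration; they are not SNNAP's actual fixed-point format.

```python
# Sketch: a 400-input, 8-hidden, 1-output MLP with weights rounded to 8 bits,
# mimicking an 8-bit datapath. All formats and values here are assumed.
import numpy as np

rng = np.random.default_rng(0)

def quantize8(w, scale=64.0):
    """Round weights to signed 8-bit integers at a fixed scale (assumed format)."""
    return np.clip(np.round(w * scale), -128, 127).astype(np.int8), scale

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w1, s1 = quantize8(rng.normal(0, 0.1, (400, 8)))   # input -> hidden
w2, s2 = quantize8(rng.normal(0, 0.1, (8, 1)))     # hidden -> output

def forward(window):
    """window: 400 pixel values in [0, 1], e.g. a 20x20 candidate face window."""
    h = sigmoid(window @ (w1.astype(np.float32) / s1))
    y = sigmoid(h @ (w2.astype(np.float32) / s2))
    return float(y[0])          # score > 0.5 would be read as "face"

print(forward(rng.random(400)))
```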

  18. Evaluation: which pipeline achieves the lowest overall power? We synthesized the ASIC accelerators in Synopsys, constructed a simulator to evaluate power consumption on real-world video input, and computed the power for computation and for transfer of the resulting data for each pipeline configuration.
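
A sketch of the kind of per-configuration power accounting described here: compute power for the blocks that run on-camera, plus radio power for whatever data leaves the camera. The per-block powers and the radio energy per byte are placeholders, not the paper's measured values.

```python
# Sketch: total pipeline power = on-camera compute power + transmit power.
TRANSFER_NJ_PER_BYTE = 100.0      # radio energy per byte (assumed)
FRAME_RATE = 1.0                  # WISPCam-style operation: 1 frame per second

def pipeline_power_uw(block_powers_uw, bytes_out_per_frame):
    compute_uw = sum(block_powers_uw)
    # nJ/byte * bytes/frame * frames/s = nW; divide by 1000 to get microwatts.
    transfer_uw = TRANSFER_NJ_PER_BYTE * bytes_out_per_frame * FRAME_RATE / 1000.0
    return compute_uw + transfer_uw

# Example: sensor + motion detection + face detection, sending one byte per frame.
print(pipeline_power_uw([2.0, 30.0, 100.0], bytes_out_per_frame=1))
```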

  19. Which pipeline achieves the lowest power consumption? (log-scale power, 1 to 1,000,000 µW; compute/transfer ratios)

      pipeline configuration                 power (µW)   compute   transfer
      sensor                                     11,340      <1%       >99%
      sensor + motion                             3,731      <1%       >99%
      sensor + face detect                          374      10%        90%
      sensor + NN                               782,090      16%        84%
      sensor + motion + face detect                 132     >99%        <1%
      sensor + motion + NN                      257,236     >99%        <1%
      sensor + face detect + NN                     419     >99%        <1%
      sensor + motion + face detect + NN            160     >99%        <1%

  20. Which pipeline achieves the lowest power consumption? (same table as slide 19) Annotation: prefilters reduce overall power.

  21. Which pipeline achieves the lowest power consumption? (same table) Annotations: "just using NN" (sensor + NN, 782,090 µW) vs. "prefilters with NN use less power" (sensor + motion + NN, sensor + face detect + NN, sensor + motion + face detect + NN).

  22. Which pipeline achieves the lowest power consumption? (same table) Annotations: the most power-efficient pipeline is sensor + motion + face detect (132 µW); the most power-efficient with an on-chip NN is sensor + motion + face detect + NN (160 µW).

  23. In-camera processing for face authentication (motion detection → face detection → neural network): in isolation, even well-designed hardware can show sub-optimal performance. Optional blocks can improve the overall cost if they balance compute and communication better than the original design.

  24. Challenges for modern camera systems (recap). Low power: face authentication for energy-harvesting cameras with ASIC design (motion detection → face detection → neural network). Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration (prep → align → depth → stitch).

  25. Challenges for modern camera systems (repeated). Second case study: low-latency real-time virtual reality for multi-camera rigs with FPGA acceleration (prep → align → depth → stitch).

  26. Producing real-time VR video from a camera rig. Goal: 30 fps 3D-360 stereo video, 1.8 GB/s output. Input: 16 GoPro cameras at 4K-30 fps, 3.6 GB/s of raw video.
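
A back-of-envelope check of the slide's bandwidth figures, assuming 3840x2160 frames at roughly one byte per pixel of raw data; the actual raw format is not stated on the slide.

```python
# Sketch: sanity-check the rig's aggregate raw bandwidth under assumed parameters.
cameras = 16
width, height, fps = 3840, 2160, 30      # "4K-30fps" streams (assumed resolution)
bytes_per_pixel = 1                      # assumption, e.g. 8-bit raw data

raw_gb_per_s = cameras * width * height * fps * bytes_per_pixel / 1e9
print(f"~{raw_gb_per_s:.1f} GB/s raw input")   # ~4.0 GB/s, same ballpark as the quoted 3.6 GB/s
```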

  27. Producing real-time VR video from a camera rig: cloud processing prevents real-time video. Goal: 30 fps 3D-360 stereo video, 1.8 GB/s output. Input: 16 GoPro cameras at 4K-30 fps, 3.6 GB/s of raw video.

  28. The VR pipeline is usually offloaded to perform heavy computation. Offloaded to cloud: sensor → image prep → align → depth from flow → image stitch → stream to viewer, with processing time split roughly 5% (prep), 20% (align), 70% (depth from flow), 5% (stitch). We need to accelerate "depth from flow" to achieve high performance.
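
A quick Amdahl's-law style calculation showing why "depth from flow" is the stage worth accelerating, using the slide's 70% processing-time share; the speedup factors are arbitrary illustrative values.

```python
# Sketch: overall speedup when only the depth-from-flow stage (70% of time) is accelerated.
def overall_speedup(k, depth_fraction=0.70):
    return 1.0 / ((1.0 - depth_fraction) + depth_fraction / k)

for k in (2, 5, 10, 100):
    print(f"depth-from-flow {k:>3}x faster -> pipeline {overall_speedup(k):.2f}x faster")
# Even an unbounded speedup of this one stage caps the pipeline at ~3.3x.
```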

  29. Offloading before the costly step doesn't avoid compute-communication tradeoffs: the image alignment step produces significant intermediate data, so offloading early on is still 2x the final output size. [Figure: Video Frame Size (MB), 0 to 600, across the stages sensor → image prep → align → depth from flow → image stitch → stream to viewer.]

  30. Evaluation: which pipeline achieves the highest frame rate? We designed a simple parallel accelerator for a Xilinx Zynq SoC implementation, simulated for Virtex UltraScale+ (details in the paper), evaluated it against CPU and GPU implementations in Halide, and assumed a 2 GB/s network link for communication.
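
A sketch of a simple frame-rate model consistent with this setup: the achievable rate is bounded by the slowest pipeline stage and by the network link carrying whatever data is offloaded. The 2 GB/s link comes from the slide; the stage times and offloaded frame size below are hypothetical.

```python
# Sketch: frame rate limited by the slowest stage and by the offload link.
LINK_GBPS = 2.0                          # 2 GB/s network link (from the slide)

def achievable_fps(stage_times_ms, offloaded_bytes_per_frame):
    compute_fps = 1000.0 / max(stage_times_ms)          # pipelined stages: slowest dominates
    link_fps = LINK_GBPS * 1e9 / offloaded_bytes_per_frame
    return min(compute_fps, link_fps)

# Example: four hypothetical stage times, offloading ~120 MB of aligned frames per output frame.
print(f"{achievable_fps([10, 25, 40, 8], offloaded_bytes_per_frame=120e6):.1f} fps")
```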
