Exploring Computation-Communication Tradeoffs in Camera Systems. Amrita Mazumdar, Armin Alaghi, Thierry Moreau, Luis Ceze, Sung Kim, Mark Oskin, Meghan Cowan, Visvesh Sathe. IISWC 2017
Camera applications are a prominent workload with tight constraints. Example applications: augmented reality glasses, energy-harvesting cameras, video surveillance cameras, 3D-360 virtual reality camera rigs. Constraints: real-time processing, low power, lightweight processing, large data size.
Hardware implementations compound the camera system design space. Camera system (DogChat™) implementations: ASIC, FPGA, GPU, DSP, CPU. Constraints: power, bandwidth, time, size.
We can represent camera applications as camera processing pipelines to clarify design space exploration: sensor -> block 1 -> block 2 -> block 3 -> block 4, where the blocks are functions in the application.
Example pipeline (DogChat™): sensor -> image processing -> face detection -> feature tracking -> image rendering.
Developers can trade off between computation and communication costs. For the DogChat™ pipeline (sensor -> image processing -> face detection -> feature tracking -> image rendering), the early blocks can run as in-camera processing and the remaining blocks can be offloaded to the cloud.
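The trade-off on this slide can be made concrete with a small cost model. The sketch below is illustrative only: the `Block` class, the `cut_costs` helper, and every number in it are hypothetical and do not come from the talk. It assumes that cutting the pipeline after block k costs the compute of the first k blocks plus the cost of sending block k's output to the cloud.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    compute_cost: float  # hypothetical cost of running this block in-camera
    output_size: float   # hypothetical size of the data this block emits


def cut_costs(sensor_output_size, blocks, cost_per_unit_data):
    """Return (in_camera_block_names, total_cost) for every cut point.

    Cut k means the first k blocks run in-camera and the output of block k
    (or the raw sensor output when k == 0) is sent to the cloud.
    """
    results = []
    compute = 0.0
    size = sensor_output_size
    for k in range(len(blocks) + 1):
        if k > 0:
            compute += blocks[k - 1].compute_cost
            size = blocks[k - 1].output_size
        total = compute + cost_per_unit_data * size
        results.append(([b.name for b in blocks[:k]], total))
    return results


# Placeholder numbers, chosen only to exercise the model.
pipeline = [
    Block("image processing", compute_cost=2.0, output_size=80.0),
    Block("face detection", compute_cost=5.0, output_size=10.0),
    Block("feature tracking", compute_cost=4.0, output_size=8.0),
    Block("image rendering", compute_cost=20.0, output_size=100.0),
]
costs = cut_costs(sensor_output_size=100.0, blocks=pipeline, cost_per_unit_data=1.0)
best_blocks, best_cost = min(costs, key=lambda item: item[1])
print(best_blocks, best_cost)
```

With these placeholder values the cheapest cut falls after face detection, because that block shrinks the data far more than it costs to run. Real costs would have to be measured per platform.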
Optional and required blocks in camera pipelines introduce more tradeoffs. Required: sensor -> image processing -> face detection -> feature tracking -> image rendering. Optional: edge detection, motion tracking, motion detection.
Custom hardware platforms explode the camera system design space. Required blocks (sensor -> image processing -> face detection -> feature tracking -> image rendering) and optional blocks (edge detection, motion tracking, motion detection) are each labeled with candidate platforms: CPU, GPU, DSP, FPGA, ASIC. In-camera processing pipelines can help us evaluate these tradeoffs!
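To get a feel for how quickly this design space grows, the sketch below counts configurations under one simplified, assumed model: any subset of the optional blocks may be added, the pipeline may be cut at any point, and each in-camera block is mapped to exactly one platform. The counting model and the `count_configurations` helper are hypothetical illustrations, not the methodology of the talk.

```python
from itertools import combinations

REQUIRED = ["image processing", "face detection", "feature tracking", "image rendering"]
OPTIONAL = ["edge detection", "motion detection", "motion tracking"]
PLATFORMS = ["CPU", "GPU", "DSP", "FPGA", "ASIC"]


def count_configurations(required, optional, platforms):
    """Count (optional subset, cut point, platform mapping) combinations."""
    total = 0
    for k in range(len(optional) + 1):
        for chosen in combinations(optional, k):
            n_blocks = len(required) + len(chosen)
            # Cut c: the first c blocks run in-camera, each on one platform;
            # the remaining blocks are offloaded and need no platform choice.
            for c in range(n_blocks + 1):
                total += len(platforms) ** c
    return total


print(count_configurations(REQUIRED, OPTIONAL, PLATFORMS))
```

Even this coarse model yields well over a hundred thousand candidates, which is why a structured pipeline view is useful for pruning.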
Challenges for modern camera systems. Low-power: face authentication for energy-harvesting cameras with ASIC design (motion detection -> face detection -> neural network). Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration (prep -> align -> depth -> stitch).
Face authentication with energy-harvesting cameras. WISPCam: energy-harvesting camera powered by RF, 1 frame / second, ~1 mW processing / frame.
Face authentication with energy-harvesting cameras: Is this Armin? ✅
CPU-based face authentication neural networks can exceed WISPCam power budgets. Pipeline: sensor -> neural network -> other application functions, split between an on-chip CPU and the cloud.
CPU-based face authentication neural networks can exceed WISPCam power budgets. Pipeline: sensor -> motion detection -> face detection -> neural network -> other application functions, split between an on-chip ASIC hardware circuit and the cloud. Adding optional blocks can reduce power consumption for a neural network.
Exploring design tradeoffs in ASIC accelerators
[Figure: block diagrams of the two accelerators. Neural network (SNNAP): processing elements PE0 to PE3 with weights, multiply and add units, accumulator FIFO, sigmoid unit, scheduler, bus, and DMA master. Streaming face detection (VJ): integral image accumulator, window buffer SRAM, feature units, threshold unit, classifier and stage units, yes/no output. Many more details in paper!]
Neural network: evaluated NN topology and hardware accelerator impact on energy and accuracy; selected a 400-8-1 network topology and used 8-bit datapaths for the optimal energy/accuracy point.
Streaming face detection: explored classifier and other algorithm parameters for energy optimality.
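The face detection diagram labels an integral image accumulator that combines the current input row with the previous integral row. A minimal software sketch of that idea follows. It is a generic textbook integral image, not the accelerator's actual design; the function names `integral_image` and `rect_sum` are hypothetical.

```python
def integral_image(rows):
    """Build an integral image one input row at a time.

    Each output row is the previous integral row plus the running
    (prefix) sum of the current input row.
    """
    integral = []
    previous = None
    for row in rows:
        running = 0
        current = []
        for x, pixel in enumerate(row):
            running += pixel
            above = previous[x] if previous is not None else 0
            current.append(running + above)
        integral.append(current)
        previous = current
    return integral


def rect_sum(integral, top, left, bottom, right):
    """Sum of pixels in the inclusive rectangle, using four corner lookups."""
    def at(y, x):
        return integral[y][x] if y >= 0 and x >= 0 else 0

    return (at(bottom, right) - at(top - 1, right)
            - at(bottom, left - 1) + at(top - 1, left - 1))


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ii = integral_image(image)
print(ii)
print(rect_sum(ii, 1, 1, 2, 2))
```

Because only the previous integral row is needed to produce the next one, the computation suits a streaming design that never buffers the whole frame, and any rectangular feature sum then costs four lookups regardless of its size.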
Evaluation: Which pipeline achieves the lowest overall power? Synthesized ASIC accelerators in Synopsys. Constructed simulator to evaluate power consumption on real-world video input. Computed power for computation and transfer of resulting data for each pipeline configuration.
Which pipeline achieves the lowest power consumption? (power in µW, plotted on a log scale; compute and transfer given as ratios of total power)

platform configuration                   power (µW)   compute   transfer
sensor                                       11,340       <1%       >99%
sensor + motion                               3,731       <1%       >99%
sensor + face detect                            374       10%        90%
sensor + NN                                 782,090       16%        84%
sensor + motion + face detect                   132      >99%        <1%
sensor + motion + NN                        257,236      >99%        <1%
sensor + face detect + NN                       419      >99%        <1%
sensor + motion + face detect + NN              160      >99%        <1%

Prefilters reduce overall power. Prefilters with NN use less power than just using NN. Most power-efficient: sensor + motion + face detect (132 µW). Most power-efficient with on-chip NN: sensor + motion + face detect + NN (160 µW).
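The comparison can be replayed directly from the eight power figures on this slide. The sketch below copies those values and picks out the lowest-power configurations. The 1,000 µW threshold is an assumption: it treats the ~1 mW WISPCam figure quoted earlier as a hard budget, which the slides do not state explicitly.

```python
# Power per configuration in microwatts, copied from the slide.
POWER_UW = {
    ("sensor",): 11_340,
    ("sensor", "motion"): 3_731,
    ("sensor", "face detect"): 374,
    ("sensor", "NN"): 782_090,
    ("sensor", "motion", "face detect"): 132,
    ("sensor", "motion", "NN"): 257_236,
    ("sensor", "face detect", "NN"): 419,
    ("sensor", "motion", "face detect", "NN"): 160,
}

# Lowest-power configuration overall.
lowest = min(POWER_UW, key=POWER_UW.get)

# Lowest-power configuration that still runs the NN on-chip.
with_nn = {cfg: p for cfg, p in POWER_UW.items() if "NN" in cfg}
lowest_with_nn = min(with_nn, key=with_nn.get)

# How much more power the NN-only pipeline draws than the best NN pipeline.
nn_only_ratio = POWER_UW[("sensor", "NN")] / POWER_UW[lowest_with_nn]

BUDGET_UW = 1_000  # assumption: ~1 mW treated as a hard budget
within_budget = sorted(cfg for cfg, p in POWER_UW.items() if p <= BUDGET_UW)

print(lowest, lowest_with_nn, round(nn_only_ratio), within_budget)
```

On these figures, running the NN directly on sensor output draws nearly 4,900 times the power of the fully prefiltered pipeline, and only the four configurations that include face detection stay under the assumed budget.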
In-camera processing for face authentication (motion detection -> face detection -> neural network). In isolation, even well-designed hardware can show sub-optimal performance. Optional blocks can improve the overall cost if they balance compute and communication better than the original design.
Challenges for modern camera systems. Low-power: face authentication for energy-harvesting cameras with ASIC design (motion detection -> face detection -> neural network). Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration (prep -> align -> depth -> stitch).
Producing real-time VR video from a camera rig. Input: 16 GoPro cameras, 4K at 30 fps, 3.6 GB/s raw video. Goal: 30 fps 3D-360 stereo video, 1.8 GB/s output. Cloud processing prevents real-time video.
VR pipeline is usually offloaded to perform heavy computation. Pipeline (offloaded to cloud): sensor -> image prep -> align -> depth from flow -> image stitch -> stream to viewer. Processing time: image prep 5%, align 20%, depth from flow 70%, image stitch 5%. Need to accelerate "depth from flow" to achieve high performance.
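The 70% share spent in depth from flow bounds how much the whole pipeline can gain from accelerating that one block. The sketch below applies Amdahl's law to that figure. The block speedups of 2x, 10x, and 100x are arbitrary illustrative values, and `overall_speedup` is a hypothetical helper name.

```python
def overall_speedup(fraction_accelerated, block_speedup):
    """Amdahl's law: whole-pipeline speedup when one fraction is accelerated."""
    return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / block_speedup)


DEPTH_FRACTION = 0.70  # share of processing time in depth from flow (from the slide)

for block_speedup in (2, 10, 100):  # illustrative values, not measurements
    print(block_speedup, round(overall_speedup(DEPTH_FRACTION, block_speedup), 2))

# Limit as the accelerated block's time goes to zero.
upper_bound = 1.0 / (1.0 - DEPTH_FRACTION)
print(round(upper_bound, 2))
```

Under this model the pipeline can speed up by at most about 3.3x from depth from flow alone, so the remaining 30% (prep, align, stitch) limits further gains.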
Offloading before the costly step doesn't avoid compute-communication tradeoffs. [Figure: video frame size (MB), axis from 0 to 600, after each stage: sensor, image prep, align, depth from flow, image stitch, stream to viewer.] The image alignment step produces significant intermediate data. Offloading early on is still 2x the final output size.
Evaluation: Which pipeline achieves the highest frame rate? Designed a simple parallel accelerator for Xilinx Zynq SoC, simulated for Virtex UltraScale+ (implementation details in paper). Evaluated against CPU and GPU implementations in Halide. Assumed 2 GB/s network link for communication.
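The 2 GB/s link assumption can be checked against the data rates quoted earlier in the talk (3.6 GB/s raw video, 1.8 GB/s final output, 16 cameras at 30 fps). The sketch below is back-of-the-envelope arithmetic only: it assumes frame rate scales linearly with available bandwidth and uses decimal units (1 GB = 1000 MB). `sustainable_fps` is a hypothetical helper.

```python
RAW_GB_PER_S = 3.6     # raw video from the rig (from the slides)
OUTPUT_GB_PER_S = 1.8  # final 3D-360 stereo video (from the slides)
LINK_GB_PER_S = 2.0    # assumed network link (from the slides)
TARGET_FPS = 30
CAMERAS = 16


def sustainable_fps(stream_gb_per_s, link_gb_per_s, target_fps):
    """Frame rate the link can carry for a stream that needs
    stream_gb_per_s at target_fps, capped at target_fps."""
    return min(target_fps, target_fps * link_gb_per_s / stream_gb_per_s)


raw_fps = sustainable_fps(RAW_GB_PER_S, LINK_GB_PER_S, TARGET_FPS)
output_fps = sustainable_fps(OUTPUT_GB_PER_S, LINK_GB_PER_S, TARGET_FPS)

# Average raw data per camera per frame, in MB.
per_camera_frame_mb = RAW_GB_PER_S * 1000 / (CAMERAS * TARGET_FPS)

print(round(raw_fps, 1), output_fps, round(per_camera_frame_mb, 1))
```

Under these assumptions the link carries the final output at the full 30 fps but can move raw video at only about 17 fps, so some in-camera processing is needed before offloading to hit the real-time target.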