Algorithm-SoC Co-Design for Mobile Continuous Vision

Yuhao Zhu, Department of Computer Science, University of Rochester
with Anand Samajdar (Georgia Tech), Matthew Mattina (ARM Research), and Paul Whatmough (ARM Research)

Mobile Continuous Vision:


  2. SoC Architecture
     ▸ Data flow: Photons → Imaging → RGB Frames → Vision Kernels → Semantic Results
     ▸ Imaging blocks: Camera Sensor, Sensor Interface, Image Signal Processor
     ▸ Vision block: CNN Accelerator
     ▸ Blocks connect through the On-chip Interconnect

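  15. SoC Architecture
      ▸ Compute and I/O blocks: CPU (Host), Camera Sensor, Sensor Interface, Image Signal Processor, CNN Accelerator, Display
      ▸ Memory path: DMA Engine, Memory Controller, DRAM holding the Frame Buffer and Metadata
      ▸ Blocks connect through the On-chip Interconnect
      ▸ New in the final build: a Motion Controller block, plus numbered markers 1 and 2 on the diagram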

  24. ISP Augmentation
      ▸ Expose motion vectors (MVs) to the rest of the SoC
      ▸ Design decision: transfer MVs through DRAM
        ▹ One 1080p frame: 8KB MV traffic vs. ~6MB pixel data
        ▹ Easy to piggyback on the existing SoC communication scheme
      ▸ Light-weight modification to ISP Sequencer
      ▸ Figure: ISP Pipeline. Labels: Demosaic; Temporal Denoising Stage (Motion Estimation, Motion Compensation); Color Balance; Noisy Frame; Denoised Frame; Prev. Noisy Frame; Prev. Denoised Frame; MVs; ISP Sequencer; SRAM; DMA; ISP Internal Interconnect; Frame Buffer (DRAM); SoC Interconnect
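
The 8KB vs. ~6MB comparison on this slide can be reproduced with a short back-of-the-envelope calculation. The sketch below is illustrative only: the slide does not state the pixel format, the macroblock size, or the bytes stored per motion vector, so 3 bytes per RGB pixel, 16x16 macroblocks, and 1 byte per MV are assumptions chosen because they land close to the quoted figures.

```python
import math

# Assumed parameters (not given on the slide).
WIDTH, HEIGHT = 1920, 1080   # one 1080p frame
BYTES_PER_PIXEL = 3          # assumed 8-bit RGB
MACROBLOCK = 16              # assumed 16x16-pixel macroblocks
BYTES_PER_MV = 1             # assumed storage per motion vector


def pixel_bytes(width, height, bytes_per_pixel=BYTES_PER_PIXEL):
    """Bytes needed to move one full frame of pixel data."""
    return width * height * bytes_per_pixel


def mv_bytes(width, height, block=MACROBLOCK, bytes_per_mv=BYTES_PER_MV):
    """Bytes needed to move one motion vector per macroblock."""
    blocks_x = math.ceil(width / block)
    blocks_y = math.ceil(height / block)
    return blocks_x * blocks_y * bytes_per_mv


frame = pixel_bytes(WIDTH, HEIGHT)  # 6,220,800 bytes
mvs = mv_bytes(WIDTH, HEIGHT)       # 8,160 bytes
print(f"pixel data: {frame / 1e6:.2f} MB")
print(f"MV data:    {mvs / 1e3:.2f} KB")
print(f"ratio:      {frame / mvs:.0f}x")
```

Under these assumptions the MV traffic is roughly three orders of magnitude smaller than the pixel traffic, which is consistent with the slide's argument that routing MVs through DRAM adds little load.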

  34. Motion Controller IP
      ▸ Why a new IP instead of directly augmenting the CNN accelerator?
        ▹ Independent of vision algo./arch implementation
      ▸ Why a new IP instead of synthesizing on the CPU?
        ▹ Switch off the CPU to enable “always-on” vision
      ▸ Figure: Motion Controller block diagram. Labels: Sequencer (FSM); ROI Selection; Extrapolation Unit (4-Way SIMD Unit, Scalar); Motion Vector Buffer; MVs; ROI; New ROI; MMap Regs (ROI Winsize, Base Addrs, Conf); DMA
      ▸ The Motion Controller sits on the SoC Interconnect alongside the ISP and the CNN Accelerator
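
The diagram labels (MVs in, ROI Selection, Extrapolation Unit, New ROI out) suggest that the Motion Controller moves a region of interest according to the motion vectors that fall inside it. The following is a minimal sketch of that idea, not the talk's actual algorithm: the averaging rule, the 16-pixel macroblock size, the MV sign convention (displacement from the previous frame to the current one), and the names `ROI` and `extrapolate_roi` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ROI:
    x: int  # left edge, in pixels
    y: int  # top edge, in pixels
    w: int  # width, in pixels
    h: int  # height, in pixels


def extrapolate_roi(roi, mvs, block=16):
    """Shift `roi` by the mean motion vector of the macroblocks it covers.

    `mvs[row][col]` is an assumed (dx, dy) displacement per macroblock.
    """
    rows = len(mvs)
    cols = len(mvs[0]) if rows else 0
    # Macroblock index range covered by the ROI, clamped to the grid.
    c0 = max(roi.x // block, 0)
    c1 = min((roi.x + roi.w - 1) // block, cols - 1)
    r0 = max(roi.y // block, 0)
    r1 = min((roi.y + roi.h - 1) // block, rows - 1)

    sum_dx = sum_dy = count = 0
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            dx, dy = mvs[r][c]
            sum_dx += dx
            sum_dy += dy
            count += 1

    if count == 0:
        return roi  # ROI lies outside the MV grid: leave it unchanged
    return ROI(roi.x + round(sum_dx / count),
               roi.y + round(sum_dy / count),
               roi.w, roi.h)


# Usage: a 4x4 grid of macroblocks where only the ROI's blocks move.
grid = [[(0, 0)] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        grid[r][c] = (4, -2)
print(extrapolate_roi(ROI(16, 16, 32, 32), grid))
```

A per-block loop like this maps naturally onto a small SIMD datapath, which may be why the diagram shows a 4-way SIMD unit, but that reading is an inference from the labels rather than something the slide states.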

  35. Euphrates: An Algorithm-SoC Co-Designed System for Energy-Efficient Mobile Continuous Vision
      ▸ Algorithm: Motion-based tracking and detection synthesis.
      ▸ SoC: Exploits synergies across IP blocks. Enables task autonomy.
      ▸ Results: 66% energy saving & 1% accuracy loss with RTL/measurement.

  41. Experimental Setup
      ▸ In-house simulator modeling a commercial mobile SoC: Nvidia Tegra X2
        ▹ Real board measurement
      ▸ Develop RTL models for IPs unavailable on TX2
        ▹ CNN Accelerator (651 mW, 1.58 mm²)
        ▹ Motion Controller (2.2 mW, 0.035 mm²)
      ▸ Evaluate on Object Tracking and Object Detection
        ▹ Important domains that are building blocks for many vision applications
        ▹ IP vendors have started shipping standalone tracking/detection IPs
      ▸ Object Detection
        ▹ Baseline CNN: YOLOv2 (state-of-the-art detection results)
      ▸ SCALE-Sim: a systolic array-based, cycle-accurate CNN accelerator simulator. https://github.com/ARM-software/SCALE-Sim

  48. Evaluation Results
      ▸ Figure: Accuracy (axis 0.1 to 0.7) and Norm. Energy (axis 0 to 1) for YOLOv2, EW-2, EW-4, EW-8, EW-16, EW-32, and Tiny YOLO (the scale-down CNN)
      ▸ EW = Extrapolation Window
      ▸ 66% system energy saving with ~1% accuracy loss.
      ▸ More efficient than simply scaling down the CNN.
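
The EW-N labels in the figure can be read as a per-frame schedule that alternates full CNN inference with motion-based extrapolation. The sketch below assumes that EW-N means one CNN inference every N frames, with the remaining frames extrapolated; the slide only defines EW as "Extrapolation Window", so this interpretation and the names `schedule` and `inference_fraction` are illustrative assumptions.

```python
def schedule(num_frames, ew):
    """Label each frame 'I' (CNN inference) or 'E' (extrapolation).

    Assumes one inference at the start of every window of `ew` frames.
    """
    if num_frames < 1 or ew < 1:
        raise ValueError("num_frames and ew must be positive")
    return ["I" if i % ew == 0 else "E" for i in range(num_frames)]


def inference_fraction(num_frames, ew):
    """Fraction of frames that still run the CNN under this schedule."""
    plan = schedule(num_frames, ew)
    return plan.count("I") / len(plan)


# Usage: how often the CNN runs for each window size in the figure.
for ew in (1, 2, 4, 8, 16, 32):
    print(f"EW-{ew}: {inference_fraction(64, ew):.3f} of frames use the CNN")
```

This only counts how often the CNN would run; it does not model energy, since system energy also depends on the ISP, DRAM, and the other blocks the talk evaluates.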

  52. Conclusions
      ▸ We must expand our focus from isolated accelerators to holistic SoC architecture.
      ▸ Euphrates co-designs the SoC with a motion-based synthesis algorithm.
