SoC Architecture
[Figure: continuous-vision pipeline. Photons → Camera Sensor → Sensor Interface → Image Signal Processor (imaging) → RGB frames → CNN Accelerator (vision kernels) → semantic results.]
[Figure: SoC block diagram. Blocks: CPU (Host), Camera Sensor, Sensor Interface, Image Signal Processor, CNN Accelerator, Motion Controller, Memory Controller, DMA Engine, Display, and the On-chip Interconnect; DRAM holds the Frame Buffer and Metadata. Numbered callouts 1 and 2 are overlaid in the final builds.]

ISP Augmentation
▸ Expose motion vectors (MVs) to the rest of the SoC
▸ Design decision: transfer MVs through DRAM
  ▹ One 1080p frame: 8 KB of MV traffic vs. ~6 MB of pixel data
  ▹ Easy to piggyback on the existing SoC communication scheme
▸ Lightweight modification to the ISP Sequencer
[Figure: ISP pipeline. Stages: Demosaic, Temporal Denoising Stage (Motion Estimation, Motion Compensation), Color Balance. Frame labels: Noisy Frame, Denoised Frame, Prev. Noisy Frame, Prev. Denoised Frame. Supporting blocks: ISP Sequencer, SRAM, and DMA on the ISP Internal Interconnect; Frame Buffer (DRAM) reached over the SoC Interconnect. The final build adds a path labeled MVs.]

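The 8 KB vs. ~6 MB comparison can be reproduced with a short back-of-the-envelope calculation. The slide does not state the pixel format, macroblock size, or motion-vector width, so the values below (8-bit RGB, 16×16 macroblocks, one byte per MV) are assumptions chosen because they land near the quoted figures.

```python
import math

WIDTH, HEIGHT = 1920, 1080   # 1080p frame
BYTES_PER_PIXEL = 3          # assumption: 8-bit RGB
MACROBLOCK = 16              # assumption: 16x16-pixel macroblocks
BYTES_PER_MV = 1             # assumption: one byte per motion vector

# Pixel traffic for one frame.
pixel_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL

# One motion vector per macroblock; partial blocks at the frame edge are rounded up.
num_blocks = math.ceil(WIDTH / MACROBLOCK) * math.ceil(HEIGHT / MACROBLOCK)
mv_bytes = num_blocks * BYTES_PER_MV

print(f"pixel data: {pixel_bytes / 1e6:.2f} MB")   # 6.22 MB
print(f"MV data:    {mv_bytes / 1e3:.2f} KB")      # 8.16 KB
print(f"ratio:      {pixel_bytes / mv_bytes:.0f}x")
```

Under these assumptions the MV traffic is roughly three orders of magnitude smaller than the pixel traffic, which is consistent with routing MVs through DRAM without a dedicated channel.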
Motion Controller IP
▸ Why a new IP instead of directly augmenting the CNN accelerator?
  ▹ Keeps it independent of the vision algorithm and accelerator architecture
▸ Why a new IP instead of running the synthesis on the CPU?
  ▹ The CPU can be switched off to enable "always-on" vision
[Figure: Motion Controller block diagram. Internal blocks: Sequencer (FSM), ROI Selection, Extrapolation Unit (4-Way SIMD Unit, Scalar), Motion Vector Buffer, memory-mapped registers (labels: ROI, Conf, Winsize, Base Addrs), DMA. MVs enter the controller and a New ROI is produced. The final build shows the Motion Controller next to the ISP and CNN Accelerator on the SoC Interconnect.]

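To make the ROI Selection and Extrapolation Unit blocks more concrete, here is a minimal software sketch of one plausible extrapolation rule: shift the region of interest by the mean motion vector of the macroblocks it overlaps. This rule, the `ROI` type, the `extrapolate_roi` function, and the 16-pixel block size are all illustrative assumptions; the slides do not specify the actual arithmetic or the motion-vector sign convention (here a vector is taken to be the displacement from the previous frame to the current one).

```python
from dataclasses import dataclass


@dataclass
class ROI:
    x: int  # top-left corner, in pixels
    y: int
    w: int  # width and height, in pixels
    h: int


def extrapolate_roi(roi, mvs, block=16):
    """Shift `roi` by the mean motion vector of the macroblocks it overlaps.

    `mvs[r][c]` is the (dx, dy) motion of the macroblock at row r, column c.
    Hypothetical rule for illustration only.
    """
    rows, cols = len(mvs), len(mvs[0])
    # Range of macroblock columns/rows covered by the ROI, clamped to the grid.
    c0, c1 = max(roi.x // block, 0), min((roi.x + roi.w - 1) // block, cols - 1)
    r0, r1 = max(roi.y // block, 0), min((roi.y + roi.h - 1) // block, rows - 1)
    selected = [mvs[r][c] for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]
    if not selected:  # ROI lies outside the motion-vector grid
        return roi
    dx = sum(v[0] for v in selected) / len(selected)
    dy = sum(v[1] for v in selected) / len(selected)
    return ROI(round(roi.x + dx), round(roi.y + dy), roi.w, roi.h)


# Usage: a 4x4 macroblock grid where the four blocks under the ROI move by (4, -2).
mvs = [[(0, 0)] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        mvs[r][c] = (4, -2)
print(extrapolate_roi(ROI(16, 16, 32, 32), mvs))  # ROI(x=20, y=14, w=32, h=32)
```

The per-block averaging is independent across blocks, which is the kind of work a small SIMD unit could handle, though the mapping to the 4-way SIMD unit in the figure is only a guess.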
Euphrates: An Algorithm-SoC Co-Designed System for Energy-Efficient Mobile Continuous Vision
▸ Algorithm: Motion-based tracking and detection synthesis.
▸ SoC: Exploits synergies across IP blocks. Enables task autonomy.
▸ Results: 66% energy saving & 1% accuracy loss with RTL/measurement.

Experimental Setup
▸ In-house simulator modeling a commercial mobile SoC: Nvidia Tegra X2
  ▹ Real board measurement
▸ Develop RTL models for IPs unavailable on TX2
  ▹ CNN Accelerator (651 mW, 1.58 mm²)
  ▹ Motion Controller (2.2 mW, 0.035 mm²)
▸ Evaluate on Object Tracking and Object Detection
  ▹ Important domains that are building blocks for many vision applications
  ▹ IP vendors have started shipping standalone tracking/detection IPs
▸ Object Detection
  ▹ Baseline CNN: YOLOv2 (state-of-the-art detection results)
▸ SCALESim: a systolic-array-based, cycle-accurate CNN accelerator simulator. https://github.com/ARM-software/SCALE-Sim

Evaluation Results
EW = Extrapolation Window
[Figure: bar chart of Accuracy (axis 0.1 to 0.7) and Norm. Energy (axis 0 to 1) per configuration. Configurations recoverable from the axis labels: YOLOv2, EW-2, EW-4, EW-8, EW-16, EW-32, and Tiny YOLO (annotated "Scale-down CNN").]
▸ 66% system energy saving with ~1% accuracy loss.
▸ More efficient than simply scaling down the CNN.

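The EW-N labels can be read as a frame schedule. The sketch below assumes that EW-N means the CNN runs on one frame out of every N and the remaining frames are extrapolated from motion vectors; the slide only expands the abbreviation, so this reading, the fixed window, and the `frame_schedule` helper are assumptions for illustration. It reports how often the CNN runs, not energy, since energy also depends on the other SoC components.

```python
def frame_schedule(num_frames, ew):
    """Label each frame 'I' (CNN inference) or 'E' (extrapolated from MVs).

    Assumes a fixed extrapolation window: one inference every `ew` frames.
    """
    if ew < 1:
        raise ValueError("extrapolation window must be >= 1")
    return ["I" if i % ew == 0 else "E" for i in range(num_frames)]


for ew in (1, 2, 4, 8, 16, 32):
    sched = frame_schedule(64, ew)
    frac = sched.count("I") / len(sched)
    print(f"EW-{ew:<2}  CNN runs on {frac:.1%} of frames")
```

With EW-1 every frame goes through the CNN, which corresponds to the plain YOLOv2 baseline under this reading.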
Conclusions
▸ We must expand our focus from isolated accelerators to holistic SoC architecture.
▸ Euphrates co-designs the SoC with a motion-based synthesis algorithm.