Algorithm-Hardware Co-design for Deformable Convolution
Qijing Huang*, Dequan Wang*, Yizhao Gao†, Yaohui Cai‡, Zhen Dong, Bichen Wu, Kurt Keutzer, John Wawrzynek
University of California, Berkeley
† University of Chinese Academy of Science
‡ Peking University
EMC2 Workshop @ NeurIPS 2019

Motivation
• Deformable convolution is an input-adaptive dynamic operation that samples inputs from variable spatial locations
• Its sampling locations vary with:
  • Different input images
  • Different output pixel locations
• It captures the spatial variance of objects with different:
  • Scales
  • Aspect ratios
  • Rotation angles
• Challenges:
  • Increased compute and memory requirements
  • Irregular, input-dependent memory access patterns
  • Not friendly to dataflows that leverage spatial reuse
[Figure: 1. Generate offsets; 2. Sample from input feature map]
[Figure: Sampling locations (in red) for different output pixels (in green); variable receptive fields]
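The two steps named on this slide (generate offsets, then sample from the input feature map) can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it assumes a single channel, a 3x3 kernel, stride 1, zero padding, and offsets that are already given as an array. The names `bilinear_sample` and `deformable_conv2d` and the `(H, W, 9, 2)` offset layout are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate feat (H, W) at fractional (y, x); zero outside."""
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    wy, wx = y - y0, x - x0
    value = 0.0
    # Four neighbouring pixels, each with its interpolation weight.
    for dy, dx, w in ((0, 0, (1 - wy) * (1 - wx)), (0, 1, (1 - wy) * wx),
                      (1, 0, wy * (1 - wx)), (1, 1, wy * wx)):
        yy, xx = y0 + dy, x0 + dx
        if 0 <= yy < H and 0 <= xx < W:
            value += w * feat[yy, xx]
    return value

def deformable_conv2d(feat, weight, offsets):
    """Single-channel 3x3 deformable convolution (assumed layout, see lead-in).

    feat: (H, W), weight: (3, 3), offsets: (H, W, 9, 2) holding one
    (dy, dx) pair per output pixel and per kernel tap.
    """
    H, W = feat.shape
    out = np.zeros((H, W))
    taps = [(ky, kx) for ky in (-1, 0, 1) for kx in (-1, 0, 1)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for k, (ky, kx) in enumerate(taps):
                dy, dx = offsets[i, j, k]
                # The sampling location depends on the data-dependent offset.
                acc += weight[ky + 1, kx + 1] * bilinear_sample(
                    feat, i + ky + dy, j + kx + dx)
            out[i, j] = acc
    return out
```

With all offsets set to zero the function reduces to an ordinary 3x3 convolution. Nonzero offsets make every read address depend on the input, which is the irregular access pattern listed under Challenges.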

Algorithm-Hardware Co-design
Algorithm Modification:
0. Original Deformable
• Accuracy¹ (mIoU ↑): 79.9
Hardware Optimization:
• Preloads weights to on-chip buffer
• Loads input and offsets directly from DRAM
[Figure: example offsets (-2, 2) and (2, 0.75); input buffer]
¹ Accuracy for semantic segmentation on CityScapes

Algorithm-Hardware Co-design
Algorithm Modification:
1. Rounded Offsets
• Accuracy¹ (mIoU ↑): 79.6 (↓ 0.3)
Hardware Optimization:
• Reduces the computation for bilinear interpolation
[Figure: example offsets (-2, 2.4) and (2, 1); input buffer]
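A small sketch of why rounded offsets reduce the interpolation work: once an offset is an integer, the sample is a single read from the feature map instead of a weighted sum of four neighbours. The rounding mode used by the authors is not stated on the slide, so `np.rint` here is an assumption, and `round_offsets` and `sample_integer` are hypothetical helper names. The example offsets are the ones shown in the slide figures.

```python
import numpy as np

def round_offsets(offsets):
    """Round fractional (dy, dx) offsets to the nearest integers (assumed mode)."""
    return np.rint(offsets).astype(np.int64)

def sample_integer(feat, y, x):
    """Read feat (H, W) at an integer location; zero outside the map."""
    H, W = feat.shape
    return float(feat[y, x]) if 0 <= y < H and 0 <= x < W else 0.0

# Offsets taken from the slide figures: (-2, 2.4) and (2, 0.75).
offsets = np.array([[-2.0, 2.4], [2.0, 0.75]])
rounded = round_offsets(offsets)  # [[-2, 2], [2, 1]]

feat = np.arange(25, dtype=float).reshape(5, 5)
center = (2, 2)
# One memory read per tap, with no interpolation weights to compute.
value = sample_integer(feat, center[0] + rounded[1, 0], center[1] + rounded[1, 1])
```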

Algorithm-Hardware Co-design
Algorithm Modification:
2. Bounded Range (Δx ≤ 2, Δy ≤ 2)
• Accuracy¹ (mIoU ↑): 79.4 (↓ 0.2)
Hardware Optimization:
• Buffers inputs in the on-chip line buffer to allow spatial reuse
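The link between a bounded offset range and a line buffer can be illustrated with a short sketch. It assumes the bound applies to the offset magnitude (offsets clipped to [-2, 2]), a 3x3 kernel, and integer offsets. Neither helper name comes from the slides, and how the bound is enforced during training is not described in this deck, so clipping is only a stand-in.

```python
import numpy as np

def bound_offsets(offsets, bound=2):
    """Clip offsets so every sample stays within `bound` pixels of its tap."""
    return np.clip(offsets, -bound, bound)

def window_size(kernel=3, bound=2):
    """Rows (or columns) of input that a single output pixel can touch.

    A kernel tap reaches kernel // 2 pixels from the centre, and the offset
    adds at most `bound` more on each side.
    """
    return kernel + 2 * bound

clipped = bound_offsets(np.array([-3.5, 0.5, 4.0]))  # [-2.0, 0.5, 2.0]
rows_needed = window_size()                          # 7
```

Under these assumptions each output pixel reads from a fixed 7x7 neighbourhood, so a fixed number of input rows can be held on chip and reused across neighbouring output pixels.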

Algorithm-Hardware Co-design
Algorithm Modification:
3. Rectangular Shape
• Accuracy¹ (mIoU ↑): 78.7 (↓ 0.7)
Hardware Optimization:
• Improves on-chip memory bandwidth

Results
4. Efficient Feature Extractor
5. Depthwise Convolution
• Reduce the total MACs
Hardware Performance
• Our algorithm-hardware co-design methodology for deformable convolution achieves 1.36× and 9.76× speedups on FPGA for the full and depthwise deformable convolution, respectively
Email: qijing.huang@berkeley.edu
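To illustrate how a depthwise convolution reduces the total MACs, the sketch below counts multiply-accumulates for one layer using the standard formulas. The layer shape (64x64 feature map, 128 channels, 3x3 kernel) is a hypothetical example and does not come from the slides. The counts cover only the convolution arithmetic, not offset generation or memory traffic.

```python
def conv_macs(h, w, c_in, c_out, k=3):
    """MACs of a full k x k convolution: every output channel sees every input channel."""
    return h * w * c_in * c_out * k * k

def depthwise_macs(h, w, c, k=3):
    """MACs of a depthwise k x k convolution: each channel is filtered on its own."""
    return h * w * c * k * k

# Hypothetical layer shape, chosen only for illustration.
full = conv_macs(64, 64, 128, 128)   # 603979776
depth = depthwise_macs(64, 64, 128)  # 4718592
ratio = full // depth                # 128, i.e. the number of output channels
```

The MAC ratio is not a speedup prediction. The 1.36× and 9.76× figures reported on the slide are measured results of the co-design for the full and depthwise variants, respectively.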