Hardware, AI and Neural-nets. Point-Voxel CNN for Efficient 3D Deep Learning. Zhijian Liu*, Haotian Tang*, Yujun Lin, and Song Han. Project Page: http://pvcnn.mit.edu/
3D Deep Learning. 3D Part Segmentation (for Robotic Systems); 3D Semantic Segmentation (for VR/AR Headsets); 3D Object Detection (for Self-Driving Cars). 3D deep learning has been used in various applications on resource-constrained edge devices.
Efficient 3D Deep Learning. [Figure: energy (pJ) of a 32b Mult and Add, a 32b SRAM Read, and a 32b DRAM Read; bandwidth (GB/s); address-bus and data-bus timelines of DRAM access with and without bank conflicts.] Off-chip DRAM access is much more expensive than arithmetic operation! Random memory access is inefficient due to the potential bank conflicts! Efficient 3D deep learning models should have small memory footprints and avoid random memory access.
Voxel-Based Models: Cubically-Growing Memory. Voxel-based models: 3D ShapeNets [CVPR’15], VoxNet [IROS’15], 3D U-Net [MICCAI’16]. [Figure: GPU Memory (GB) and Voxel Resolution plotted against Distinguishable Points (%).] 64 x 64 x 64 resolution: 11 GB (Titan XP x 1), 42% information loss. 128 x 128 x 128 resolution: 83 GB (Titan XP x 7), 7% information loss. Low resolutions lead to significant information loss. High resolutions lead to large memory consumption.
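The trade-off on this slide can be illustrated with a minimal sketch. It assumes a dense float32 grid and treats a point as "distinguishable" when it is the only point in its voxel; both the function names and that definition are assumptions made for illustration, and the sketch does not reproduce the 11 GB and 83 GB figures, which depend on the model and batch size. It only shows that doubling the resolution multiplies dense-grid memory by 8 while raising the share of distinguishable points.

```python
import numpy as np

def dense_grid_bytes(resolution, channels=1, batch=1, bytes_per_value=4):
    """Memory of one dense voxel grid; grows with resolution cubed."""
    return batch * channels * resolution ** 3 * bytes_per_value

def distinguishable_fraction(coords01, resolution):
    """Fraction of points that are alone in their voxel (assumed definition).

    coords01: (N, 3) array of coordinates already scaled into [0, 1].
    """
    idx = np.clip((coords01 * resolution).astype(np.int64), 0, resolution - 1)
    flat = (idx[:, 0] * resolution + idx[:, 1]) * resolution + idx[:, 2]
    counts = np.bincount(flat, minlength=resolution ** 3)
    return float(np.mean(counts[flat] == 1))

# Synthetic stand-in for a point cloud: uniformly random points.
rng = np.random.default_rng(0)
points = rng.random((20000, 3))
for res in (16, 32, 64, 128):
    print(res, dense_grid_bytes(res, channels=64), distinguishable_fraction(points, res))
```

Real scans are concentrated on surfaces rather than spread uniformly, so the fractions printed here are only indicative of the trend.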
Point-Based Models: Sparsity Overheads. Point-based models: PointNet [CVPR’17], PointCNN [NeurIPS’18], DGCNN [SIGGRAPH’19]. [Figure: Runtime (%) split into Irregular Access, Dynamic Kernel, and Actual Computation for DGCNN, PointCNN, SpiderCNN, and Ours.] Up to 80% of the time is wasted on structuring the sparse data, not on the actual feature extraction.
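A rough sketch of where the irregular access comes from: before a point-based layer can compute anything, it has to find each point's neighbors and gather their features from scattered memory locations. The brute-force search and the function names below are assumptions chosen for brevity, not how any of the listed models implement it.

```python
import numpy as np

def knn_indices(coords, k):
    """Brute-force k nearest neighbors of every point, excluding itself."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)           # (N, N) squared distances
    np.fill_diagonal(dist2, np.inf)            # a point is not its own neighbor
    return np.argsort(dist2, axis=1)[:, :k]    # (N, k) neighbor indices

def gather_neighbors(feats, idx):
    """Gather neighbor features; each row reads k scattered locations."""
    return feats[idx]                          # (N, k, C)

rng = np.random.default_rng(0)
coords = rng.random((512, 3))
feats = rng.random((512, 16))
neighbors = gather_neighbors(feats, knn_indices(coords, k=8))
print(neighbors.shape)  # (512, 8, 16)
```

Both steps are bookkeeping on the sparse structure; the actual feature extraction only starts once `neighbors` exists.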
Point-Voxel Convolution (PVConv). Voxel-Based Feature Aggregation (Coarse-Grained): Normalize, Voxelize, Convolve, Devoxelize. Point-Based Feature Transformation (Fine-Grained): Multi-Layer Perceptron. The outputs of the two branches are fused. PVCNN combines the advantages of point-based models (small memory footprint) and voxel-based models (regularity).
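The two branches on this slide can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the learned 3D convolution is replaced by a fixed 3x3x3 mean filter, devoxelization is a nearest-voxel lookup, the MLP is a single linear layer with ReLU, and fusion is assumed to be element-wise addition. All function names are hypothetical.

```python
import numpy as np

def normalize(coords):
    """Scale point coordinates into the unit cube [0, 1]^3."""
    centered = coords - coords.mean(axis=0)
    radius = np.linalg.norm(centered, axis=1).max()
    return centered / (2.0 * radius + 1e-8) + 0.5

def voxelize(coords01, feats, r):
    """Average the features of all points that fall into the same voxel."""
    idx = np.clip((coords01 * r).astype(np.int64), 0, r - 1)
    flat = (idx[:, 0] * r + idx[:, 1]) * r + idx[:, 2]
    grid = np.zeros((r ** 3, feats.shape[1]))
    count = np.zeros(r ** 3)
    np.add.at(grid, flat, feats)               # sum features per voxel
    np.add.at(count, flat, 1.0)                # count points per voxel
    grid /= np.maximum(count, 1.0)[:, None]    # mean; empty voxels stay zero
    return grid.reshape(r, r, r, -1), flat

def box_conv3d(grid):
    """Stand-in for a learned 3x3x3 convolution: a fixed mean filter."""
    r = grid.shape[0]
    padded = np.pad(grid, ((1, 1), (1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(grid)
    for dx in range(3):
        for dy in range(3):
            for dz in range(3):
                out += padded[dx:dx + r, dy:dy + r, dz:dz + r]
    return out / 27.0

def devoxelize(grid, flat):
    """Nearest-voxel lookup: each point reads back its own voxel's feature."""
    return grid.reshape(-1, grid.shape[-1])[flat]

def point_mlp(feats, weight, bias):
    """Per-point linear layer followed by ReLU (fine-grained branch)."""
    return np.maximum(feats @ weight + bias, 0.0)

def pvconv_sketch(coords, feats, weight, bias, r=8):
    grid, flat = voxelize(normalize(coords), feats, r)
    voxel_feats = devoxelize(box_conv3d(grid), flat)   # coarse-grained branch
    point_feats = point_mlp(feats, weight, bias)       # fine-grained branch
    return voxel_feats + point_feats                   # fusion by addition (assumed)

rng = np.random.default_rng(0)
coords = rng.normal(size=(1024, 3))
feats = rng.normal(size=(1024, 16))
weight = rng.normal(size=(16, 16)) * 0.1
bias = np.zeros(16)
out = pvconv_sketch(coords, feats, weight, bias)
print(out.shape)  # (1024, 16)
```

The voxel branch only ever touches a small, low-resolution grid with regular access patterns, while the point branch keeps per-point detail without any neighbor search, which is the combination the slide describes.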
Point-Voxel Convolution (PVConv). Features from Voxel-Based Branch: the voxel-based branch captures large, contiguous parts. Features from Point-Based Branch: the point-based branch captures isolated, discontinuous details.
Results: 3D Part Segmentation (ShapeNet). [Figure: Mean IoU vs. GPU Latency (ms) and Mean IoU vs. GPU Memory (GB) for PVCNN, PointCNN, DGCNN, RSNet, 3D-UNet, SpiderCNN, PointNet++, and PointNet.] PVCNN outperforms PointCNN with 2.7x measured speedup and 1.5x memory reduction (on a GTX 1080Ti GPU).
Results: 3D Part Segmentation (ShapeNet). [Figure: Objects per Second on Jetson Nano, Jetson TX2, and Jetson AGX Xavier; one panel compares PointCNN (86.1 mIoU) with PVCNN (86.2 mIoU), the other compares PointNet (83.7 mIoU) with 0.25 PVCNN (85.2 mIoU).] 0.25 PVCNN runs with real-time performance (20 FPS) on the lightweight edge device (Jetson Nano).
Results: 3D Semantic Segmentation (S3DIS). [Figure: Mean IoU vs. GPU Latency (ms) and Mean IoU vs. GPU Memory (GB) for PVCNN, PVCNN++, 3D-UNet, PointCNN, RSNet, DGCNN, and PointNet.] PVCNN++ outperforms PointCNN with 6.9x measured speedup and 5.7x memory reduction (on a GTX 1080Ti GPU).
Results: 3D Semantic Segmentation (S3DIS). [Figure: panels Input Scene, PointNet, 0.25 PVCNN (Ours), and Ground Truth.] 0.25 PVCNN outperforms PointNet with 1.8x measured speedup and 1.4x memory reduction (on a GTX 1080Ti GPU).
Results: 3D Object Detection (KITTI).
F-PointNet++: Latency (GPU) 105.2 ms; Memory (GPU) 2.0 GB
    Car (Easy / Moderate / Hard): 83.8 / 70.9 / 63.7
    Pedestrian (Easy / Moderate / Hard): 70.0 / 61.3 / 53.6
    Cyclist (Easy / Moderate / Hard): 77.2 / 56.5 / 53.4
PVCNN (efficient): Latency (GPU) 58.9 ms (1.8x); Memory (GPU) 1.4 GB (1.4x)
    Car (Easy / Moderate / Hard): 84.2 (+0.4) / 71.1 (+0.2) / 63.6 (-0.1)
    Pedestrian (Easy / Moderate / Hard): 69.2 (-0.8) / 60.3 (-1.0) / 52.5 (-1.1)
    Cyclist (Easy / Moderate / Hard): 78.7 (+1.5) / 57.8 (+1.3) / 54.2 (+1.2)
PVCNN (complete): Latency (GPU) 69.6 ms (1.5x); Memory (GPU) 1.4 GB (1.4x)
    Car (Easy / Moderate / Hard): 84.0 (+0.2) / 71.5 (+0.6) / 63.8 (+0.1)
    Pedestrian (Easy / Moderate / Hard): 73.2 (+3.2) / 64.7 (+3.4) / 56.8 (+3.2)
    Cyclist (Easy / Moderate / Hard): 81.4 (+4.2) / 60.0 (+3.5) / 56.3 (+2.9)
PVCNN outperforms F-PointNet++ by 2.4% mAP with 1.5x measured speedup and 1.4x memory reduction.
Results: 3D Object Detection (KITTI). [Figure: side-by-side panels for F-PointNet++ and PVCNN (Ours).] PVCNN outperforms F-PointNet++ by 2.4% mAP with 1.5x measured speedup and 1.4x memory reduction.
Point-Voxel CNN for Efficient 3D Deep Learning. Bottleneck Analysis; Hardware-Efficient Primitive. Project Page: http://pvcnn.mit.edu/