Point-Voxel CNN for Efficient 3D Deep Learning

Hardware, AI and Neural-nets
Point-Voxel CNN for Efficient 3D Deep Learning
Zhijian Liu*, Haotian Tang*, Yujun Lin, and Song Han
Project Page: http://pvcnn.mit.edu/

3D Deep Learning
3D Part Segmentation (for Robotic Systems)
3D Semantic Segmentation (for VR/AR Headsets)
3D Object Detection (for Self-Driving Cars)
3D deep learning has been used in various applications on resource-constrained edge devices.

Efficient 3D Deep Learning
[Figure: Energy (pJ) and Bandwidth (GB/s) for a 32b Mult and Add, a 32b SRAM Read, and a 32b DRAM Read. Energy: 3, 5, and 640 pJ. Bandwidth: 668, 167, and 30 GB/s.]
Off-chip DRAM access is much more expensive than arithmetic operation!
[Figure: address bus and data bus timelines, w/ and w/o bank conflicts, showing the wait for DRAM access.]
Random memory access is inefficient due to the potential bank conflicts!
Efficient 3D deep learning models should have small memory footprints and avoid random memory access.
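The energy gap on this slide can be made concrete with a little arithmetic. The sketch below assumes the chart's energy bars read 3 pJ for a 32b multiply-and-add, 5 pJ for a 32b SRAM read, and 640 pJ for a 32b DRAM read; the dictionary keys and the helper name are illustrative and do not come from the slides.

```python
# Energy per 32-bit operation in picojoules, as read off the slide's chart.
# The pairing of values to operations is an assumption based on the chart order.
ENERGY_PJ = {
    "mult_add_32b": 3.0,
    "sram_read_32b": 5.0,
    "dram_read_32b": 640.0,
}


def energy_ratio(op_a, op_b):
    """Return how many times more energy op_a costs than op_b."""
    return ENERGY_PJ[op_a] / ENERGY_PJ[op_b]


if __name__ == "__main__":
    print("DRAM read vs. mult-add:", energy_ratio("dram_read_32b", "mult_add_32b"))
    print("DRAM read vs. SRAM read:", energy_ratio("dram_read_32b", "sram_read_32b"))
```

Under these assumed values, one DRAM read costs roughly as much energy as two hundred arithmetic operations, which is why the slide treats memory footprint as the first-order concern.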

Voxel-Based Models: Cubically-Growing Memory
[Figure: GPU Memory (GB) and Voxel Resolution plotted against Distinguishable Points (%).]
64 x 64 x 64 resolution: 11 GB (Titan XP x 1), 42% information loss
128 x 128 x 128 resolution: 83 GB (Titan XP x 7), 7% information loss
Low resolutions lead to significant information loss. High resolutions lead to large memory consumption.
Voxel-based models: 3D ShapeNets [CVPR’15], VoxNet [IROS’15], 3D U-Net [MICCAI’16]
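The trade-off on this slide can be illustrated with a small sketch. It assumes that a point counts as "distinguishable" when it is the only point in its voxel, which is one plausible reading of the chart's x-axis; the function names are hypothetical and the points are expected to lie in the unit cube.

```python
from collections import Counter


def voxel_count(resolution):
    """Number of cells in a dense cubic grid; grows with the cube of the resolution."""
    return resolution ** 3


def cell(point, resolution):
    """Integer grid coordinates of a point whose coordinates lie in [0, 1]."""
    return tuple(min(int(c * resolution), resolution - 1) for c in point)


def distinguishable_fraction(points, resolution):
    """Fraction of points that do not share their voxel with any other point."""
    cells = [cell(p, resolution) for p in points]
    occupancy = Counter(cells)
    return sum(1 for c in cells if occupancy[c] == 1) / len(points)


if __name__ == "__main__":
    # Doubling the resolution multiplies the number of voxels by eight.
    print(voxel_count(128) / voxel_count(64))
    # Memory figures reported on the slide grow by a similar factor.
    print(83 / 11)
    pts = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (0.9, 0.9, 0.9)]
    print(distinguishable_fraction(pts, 2), distinguishable_fraction(pts, 8))
```

The eightfold growth in voxel count from 64 to 128 per side is close to the jump from 11 GB to 83 GB reported on the slide, while the toy point set shows how a coarse grid merges nearby points.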

Point-Based Models: Sparsity Overheads
[Figure: Runtime (%) split into Irregular Access, Dynamic Kernel, and Actual Computation for DGCNN, PointCNN, SpiderCNN, and Ours.]
Up to 80% of the time is wasted on structuring the sparse data, not on the actual feature extraction.
Point-based models: PointNet [CVPR’17], PointCNN [NeurIPS’18], DGCNN [SIGGRAPH’19]
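To show what "structuring the sparse data" can look like in practice, here is a minimal sketch of a brute-force neighbor search followed by a feature gather. The slide does not say which operations were profiled beyond the chart's categories, so this is only one assumed example of irregular access; `knn_indices` and `gather` are hypothetical names.

```python
def knn_indices(points, k):
    """Brute-force k-nearest-neighbor search using squared Euclidean distance."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points)
            if j != i
        )
        result.append([j for _, j in dists[:k]])
    return result


def gather(feats, indices):
    """Fetch neighbor features; the rows read are scattered, not contiguous."""
    return [[feats[j] for j in row] for row in indices]


if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
    fts = [[0.0], [10.0], [20.0], [30.0]]
    idx = knn_indices(pts, k=1)
    print(idx)
    print(gather(fts, idx))
```

Neither function computes a feature: both only decide which rows to read, which is the kind of work the slide counts as overhead rather than actual computation.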

Point-Voxel Convolution (PVConv)
Voxel-Based Feature Aggregation (Coarse-Grained): Normalize, Voxelize, Convolve, Devoxelize
Point-Based Feature Transformation (Fine-Grained): Multi-Layer Perceptron
Fuse: the outputs of the two branches are combined.
PVCNN combines the advantages of point-based models (small memory footprint) and voxel-based models (regularity).
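The two-branch structure named on this slide can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the 3D convolution is replaced by a fixed 3x3x3 mean over occupied voxels, devoxelization uses a nearest-voxel lookup, the point branch is a single linear layer with ReLU, and the branches are fused by element-wise addition (which requires a square weight matrix). All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def normalize(points):
    """Shift and uniformly scale coordinates into the unit cube."""
    lo = [min(p[a] for p in points) for a in range(3)]
    hi = [max(p[a] for p in points) for a in range(3)]
    extent = max(h - l for l, h in zip(lo, hi)) or 1.0
    return [tuple((p[a] - lo[a]) / extent for a in range(3)) for p in points]


def voxel_index(point, resolution):
    """Integer grid coordinates of a normalized point."""
    return tuple(min(int(c * resolution), resolution - 1) for c in point)


def voxelize(points, feats, resolution):
    """Average the features of all points that fall into the same voxel."""
    sums, counts = {}, defaultdict(int)
    for p, f in zip(points, feats):
        v = voxel_index(p, resolution)
        acc = sums.setdefault(v, [0.0] * len(f))
        for c, value in enumerate(f):
            acc[c] += value
        counts[v] += 1
    return {v: [s / counts[v] for s in acc] for v, acc in sums.items()}


def box_convolve(grid, channels):
    """Stand-in for a learned 3D convolution: mean over occupied 3x3x3 neighbors."""
    out = {}
    for v in grid:
        acc, n = [0.0] * channels, 0
        for d in product((-1, 0, 1), repeat=3):
            nb = grid.get((v[0] + d[0], v[1] + d[1], v[2] + d[2]))
            if nb is not None:
                acc = [a + b for a, b in zip(acc, nb)]
                n += 1
        out[v] = [a / n for a in acc]
    return out


def devoxelize(points, grid, resolution):
    """Nearest-voxel lookup: each point reads the feature of the voxel it lies in."""
    return [grid[voxel_index(p, resolution)] for p in points]


def point_mlp(feats, weights, bias):
    """Shared per-point linear layer followed by ReLU."""
    out = []
    for f in feats:
        row = [sum(w * x for w, x in zip(w_row, f)) + b
               for w_row, b in zip(weights, bias)]
        out.append([max(0.0, y) for y in row])
    return out


def pvconv(points, feats, weights, bias, resolution=4):
    """Coarse voxel branch plus fine point branch, fused by addition."""
    norm = normalize(points)
    grid = box_convolve(voxelize(norm, feats, resolution), len(feats[0]))
    voxel_feats = devoxelize(norm, grid, resolution)
    point_feats = point_mlp(feats, weights, bias)
    return [[a + b for a, b in zip(vf, pf)]
            for vf, pf in zip(voxel_feats, point_feats)]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(pvconv([(0.0, 0.0, 0.0)], [[1.0, 2.0]], identity, [0.0, 0.0]))
```

Only the voxel branch touches a grid, and it can do so at a low resolution because the point branch keeps a per-point feature at full precision; that division of labor is the design choice the slide describes.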

Point-Voxel Convolution (PVConv)
[Figure: Features from Voxel-Based Branch and Features from Point-Based Branch.]
Voxel-based branch captures large, contiguous parts. Point-based branch captures isolated, discontinuous details.

Results: 3D Part Segmentation (ShapeNet)
[Figure: Mean IoU vs. GPU Latency (ms) and Mean IoU vs. GPU Memory (GB) for PVCNN, PointCNN, DGCNN, RSNet, 3D-UNet, SpiderCNN, PointNet++, and PointNet.]
PVCNN outperforms PointCNN with 2.7x measured speedup and 1.5x memory reduction (on a GTX 1080Ti GPU).

Results: 3D Part Segmentation (ShapeNet)
[Figure: Objects per Second on Jetson Nano, Jetson TX2, and Jetson AGX Xavier, in two panels: PointCNN (86.1 mIoU) vs. PVCNN (86.2 mIoU), and PointNet (83.7 mIoU) vs. 0.25 PVCNN (85.2 mIoU).]
0.25 PVCNN runs with real-time performance (20 FPS) on the lightweight edge device (Jetson Nano).

Results: 3D Semantic Segmentation (S3DIS)
[Figure: Mean IoU vs. GPU Latency (ms) and Mean IoU vs. GPU Memory (GB) for PVCNN, PVCNN++, 3D-UNet, PointCNN, RSNet, DGCNN, and PointNet.]
PVCNN++ outperforms PointCNN with 6.9x measured speedup and 5.7x memory reduction (on a GTX 1080Ti GPU).

Results: 3D Semantic Segmentation (S3DIS)
[Figure: qualitative comparison of Input Scene, PointNet, 0.25 PVCNN (Ours), and Ground Truth.]
0.25 PVCNN outperforms PointNet with 1.8x measured speedup and 1.4x memory reduction (on a GTX 1080Ti GPU).

Results: 3D Object Detection (KITTI)
Model | Latency (GPU) | Memory (GPU) | Car Easy | Car Moderate | Car Hard | Pedestrian Easy | Pedestrian Moderate | Pedestrian Hard | Cyclist Easy | Cyclist Moderate | Cyclist Hard
F-PointNet++ | 105.2 ms | 2.0 GB | 83.8 | 70.9 | 63.7 | 70.0 | 61.3 | 53.6 | 77.2 | 56.5 | 53.4
PVCNN (efficient) | 58.9 ms (1.8x) | 1.4 GB (1.4x) | 84.2 (+0.4) | 71.1 (+0.2) | 63.6 (-0.1) | 69.2 (-0.8) | 60.3 (-1.0) | 52.5 (-1.1) | 78.7 (+1.5) | 57.8 (+1.3) | 54.2 (+1.2)
PVCNN (complete) | 69.6 ms (1.5x) | 1.4 GB (1.4x) | 84.0 (+0.2) | 71.5 (+0.6) | 63.8 (+0.1) | 73.2 (+3.2) | 64.7 (+3.4) | 56.8 (+3.2) | 81.4 (+4.2) | 60.0 (+3.5) | 56.3 (+2.9)
PVCNN outperforms F-PointNet++ by 2.4% mAP with 1.5x measured speedup and 1.4x memory reduction.
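The headline numbers for the complete PVCNN model can be checked against the table values. The sketch below assumes that the quoted mAP gain is the unweighted mean of the nine per-class, per-difficulty differences; the slide does not state how it was averaged, and the variable names are illustrative.

```python
# Values copied from the slide: latency (ms), memory (GB), and the nine AP
# columns in the order Car, Pedestrian, Cyclist x Easy, Moderate, Hard.
BASELINE = {
    "latency_ms": 105.2,
    "memory_gb": 2.0,
    "ap": [83.8, 70.9, 63.7, 70.0, 61.3, 53.6, 77.2, 56.5, 53.4],
}
COMPLETE = {
    "latency_ms": 69.6,
    "memory_gb": 1.4,
    "ap": [84.0, 71.5, 63.8, 73.2, 64.7, 56.8, 81.4, 60.0, 56.3],
}

speedup = BASELINE["latency_ms"] / COMPLETE["latency_ms"]
memory_reduction = BASELINE["memory_gb"] / COMPLETE["memory_gb"]
# Assumption: mAP gain is the plain average of the nine AP differences.
mean_ap_gain = sum(c - b for c, b in zip(COMPLETE["ap"], BASELINE["ap"])) / len(BASELINE["ap"])

if __name__ == "__main__":
    print(round(speedup, 1), round(memory_reduction, 1), round(mean_ap_gain, 1))
```

Under that averaging assumption the three rounded values agree with the 1.5x, 1.4x, and 2.4% figures in the slide's summary sentence.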

Results: 3D Object Detection (KITTI)
[Figure: qualitative comparison of F-PointNet++ and PVCNN (Ours).]
PVCNN outperforms F-PointNet++ by 2.4% mAP with 1.5x measured speedup and 1.4x memory reduction.

Hardware, AI and Neural-nets
Point-Voxel CNN for Efficient 3D Deep Learning
Bottleneck Analysis
Hardware-Efficient Primitive
Project Page: http://pvcnn.mit.edu/
