Deep Watershed Transform for Instance Segmentation Min Bai & Raquel Urtasun To appear at IEEE CVPR 2017 in Hawaii Presented at NVIDIA GTC 2017
Semantic Segmentation ● Input: RGB Image ● Output at each pixel: ○ Semantic label
Instance Segmentation ● Input: RGB Image ● Output at each pixel: ○ Semantic label ○ Instance label ■ Same for every pixel in an object ■ Different across objects ○ Difficulty: How to phrase the problem?
Applications ● Object tracking Image credit: Davi Frossard
Applications ● Interacting with the environment Image credit: http://www.rethinkrobotics.com/build-a-bot/
Applications ● Useful information for other algorithms, such as optical flow Image credit: Shenlong Wang
Semantic Segmentation ● Semantic segmentation is a well-studied problem ○ Our instance segmentation method leverages an existing technique ○ H. Zhao et al., Pyramid Scene Parsing Network, https://arxiv.org/abs/1612.01105 Image credit: H. Zhao et al.
Watershed Transform ● Classical image segmentation technique Image (left) credit: Adrian Fisher
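A minimal one-dimensional sketch of the classical idea, assuming the steepest-descent formulation in which every sample belongs to the basin of the local minimum it drains to. The function name `watershed_1d` and the height profile are illustrative and do not come from the talk.

```python
def watershed_1d(height):
    """Assign each sample to the local minimum reached by steepest descent."""
    n = len(height)
    basin = []
    for i in range(n):
        pos = i
        while True:
            # In 1D each sample has at most two neighbours.
            neighbours = [j for j in (pos - 1, pos + 1) if 0 <= j < n]
            lowest = min(neighbours, key=lambda j: height[j])
            if height[lowest] >= height[pos]:
                break  # no lower neighbour: pos is a local minimum
            pos = lowest
        basin.append(pos)
    return basin


# Two valleys (minima at indices 1 and 5) separated by a ridge at index 3.
profile = [3, 1, 2, 5, 3, 0, 1, 4]
basins = watershed_1d(profile)
print(basins)
```

Samples that share a basin index form one segment; the ridge between the two valleys is where the segmentation boundary falls.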
Scalar Field and Gradient ● Scalar field: a single number at each pixel ● Gradient: a vector at each pixel, pointing in the direction of steepest ascent Image source: Wikipedia: by Vivekj78 - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=15346899
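As a small illustration of these two definitions, the sketch below builds a toy scalar field with NumPy and takes its finite-difference gradient. The field (a single bump centred at pixel (2, 2)) is an assumption chosen for the example.

```python
import numpy as np

# Toy scalar field: one number per pixel, highest at the centre (2, 2).
ys, xs = np.mgrid[0:5, 0:5]
field = -((ys - 2) ** 2 + (xs - 2) ** 2).astype(float)

# np.gradient returns one array per axis: rows (y) first, then columns (x).
gy, gx = np.gradient(field)

# At the left edge of the middle row the field rises toward the centre,
# so the gradient there points in +x and has no y component.
print(gy[2, 0], gx[2, 0])
```

The pair `(gx, gy)` at each pixel is the gradient vector; at the peak itself both components vanish.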
Overview of Approach (figure panels: Input Image; Gradient of Energy Landscape; Energy Landscape; Predicted Instances; Semantic Segmentation)
Why Predict Direction First? (figure panels: Input Image; Direction of Gradient; Energy Landscape) ● The direction label differs much more sharply at the instance boundary than the energy value does!
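The point can be seen on a one-dimensional cross-section through two touching objects. This is a sketch under an assumption the slides do not state: the energy is taken to be each pixel's distance to the nearest pixel outside its own instance. In 1D the gradient direction reduces to a sign.

```python
import numpy as np

# Cross-section: background (0), instance 1, instance 2, background (0).
labels = np.array([0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0])
idx = np.arange(labels.size)

# Assumed energy: distance to the nearest pixel with a different label.
energy = np.zeros(labels.size)
for i in idx[labels > 0]:
    outside = idx[labels != labels[i]]
    energy[i] = np.abs(outside - i).min()

# In 1D the "direction of the gradient" is just its sign.
direction = np.sign(np.gradient(energy))

# Pixels 5 and 6 sit on either side of the boundary between the instances.
print(energy[5], energy[6])        # same energy on both sides
print(direction[5], direction[6])  # opposite directions
```

The two boundary pixels are indistinguishable by energy but point in opposite directions, which is the contrast the slide highlights.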
Overall Network
Direction Prediction Network (figure panels: Input Image; Ground Truth Directions; Predicted Directions; Semantic Segmentation)
Energy Prediction Network (figure panels: Ground Truth Energy; Ground Truth Instances; Predicted Energy; Predicted Instances)
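One plausible way to go from an energy map to instances is to cut the energy at a fixed level and label the connected regions that remain. The slides do not spell out this step, so the sketch below is an assumption; `instances_from_energy`, the threshold values, and the toy energy row are all hypothetical.

```python
from collections import deque

import numpy as np


def instances_from_energy(energy, threshold):
    """Label 4-connected regions where energy > threshold (0 = unassigned)."""
    keep = energy > threshold
    out = np.zeros(energy.shape, dtype=int)
    rows, cols = energy.shape
    next_id = 0
    for seed in zip(*np.nonzero(keep)):
        if out[seed]:
            continue  # already reached from an earlier seed
        next_id += 1
        out[seed] = next_id
        queue = deque([seed])
        while queue:  # breadth-first flood fill
            r, c = queue.popleft()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < rows and 0 <= nc < cols
                        and keep[nr, nc] and not out[nr, nc]):
                    out[nr, nc] = next_id
                    queue.append((nr, nc))
    return out


# Toy energy for two touching objects: low at boundaries, high at centres.
energy = np.array([[0, 1, 2, 3, 2, 1, 1, 2, 3, 2, 1, 0]], dtype=float)
merged = instances_from_energy(energy, threshold=0)  # cut too low: one blob
split = instances_from_energy(energy, threshold=1)   # cut higher: two regions
print(merged)
print(split)
```

Cutting at a higher level separates the touching objects but leaves their low-energy rims unassigned, so a practical pipeline would need some way to grow the regions back out.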
Training and Inference ● Pre-train both networks ● End-to-end fine-tuning ● Network trained on NVIDIA DGX-1 ○ Approximately 25 hours total for training on one GP100 core ○ ~0.1s per image for forward pass ○ Thank you NVIDIA for the generous gift! Image source: www.nvidia.com
Cityscapes Dataset ● 2975 training / 500 validation / 1525 testing images ● Instances: car, truck, bus, train, person, rider, motorcycle, bicycle
Cityscapes Instance Segmentation Leaderboard
| Method | AP* | AP* @ 50% | AP* @ 50m | AP* @ 100m |
| van den Brand et al. | 2.3% | 3.7% | 3.9% | 4.9% |
| Cordts et al. | 4.6% | 12.9% | 7.7% | 10.3% |
| Uhrig et al. | 8.9% | 21.1% | 15.3% | 16.7% |
| Ours | 19.4% | 35.3% | 31.4% | 36.8% |
* Average Precision (AP): higher is better
Recently, new approaches have achieved even higher performance.
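The "@ 50%" column refers to an overlap threshold between a predicted mask and a ground-truth mask. As a rough sketch of that overlap test only (this is not the official Cityscapes evaluation script, and `mask_iou` is a hypothetical helper), intersection-over-union for two boolean masks can be computed as:

```python
import numpy as np


def mask_iou(pred, gt):
    """Intersection-over-union of two boolean masks of equal shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0


gt = np.zeros((4, 4), dtype=bool)
gt[0:2, 0:4] = True    # ground-truth object: 8 pixels

pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:3] = True  # prediction misses one column: 6 pixels

iou = mask_iou(pred, gt)
print(iou, iou >= 0.5)  # would count as a match at the 50% threshold
```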
Sample Output (figure panels: Input RGB; Direction Prediction; Energy Prediction; Semantic Segmentation; Predicted Instances; Ground Truth Instances)
Preliminary TorontoCity Aerial Instance Segmentation (figure panels: Semantic Segmentation (ResNet); Input RGB; Predicted Building Instances)
Preliminary TorontoCity Aerial Instance Segmentation
| Method | Weighted Coverage* | AP* | Recall* @ 50% | Precision* @ 50% |
| FCN-8 | 41.92% | 11.37% | 21.50% | 36.00% |
| ResNet-56 | 40.65% | 12.13% | 18.90% | 45.36% |
| Ours | 56.22% | 21.22% | 67.16% | 63.67% |
* higher is better
In Summary... ● Simple technique for instance segmentation ● Encodes object instances as energy map ● Predicts gradient direction as intermediate task for better supervision