VALSE Webinar: Recent Progress in Visual Segmentation. Yunchao Wei, ReLER Lab, Australian Artificial Intelligence Institute, University of Technology Sydney
The importance of visual segmentation: Medical Imagery, Agriculture, Autonomous Vehicles, Satellite Imagery, Video Editing, Robotics. UTS ReLER Lab | VALSE Webinar
Outline Part I: Semantic Segmentation Part II: Interactive Image Segmentation Part III: Video Object Segmentation
Part I: Semantic Segmentation
Semantic Segmentation: Pascal VOC, ADE20K, LIP, Cityscapes
Context Modeling in FCN Structures: non-adaptive context modeling [Long et al. CVPR 2015] [Ronneberger et al. MICCAI 2015] [Chen et al. PAMI 2018] [Zhao et al. CVPR 2017] [Chen et al. ECCV 2018]
Graph Neural Networks: adaptive context modeling, but high computational complexity [Wang et al. CVPR 2018]
Criss-Cross Attention: the criss-cross attention block, a.k.a. sparsely connected self-attention [Huang et al. ICCV 2019]
Recurrent Criss-Cross Attention: criss-cross attention with R=2 is equivalent to a non-local block. Time & space complexity reduced from O((H×W)×(H×W)) to O((H×W)×(H+W−1))
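The sparse connectivity above can be made concrete with a minimal NumPy sketch (assumptions: identity query/key/value projections, a single head, no learnable weights — the real CCNet block uses 1×1 convolutions for these). Each position attends only to the H + W − 1 positions in its own row and column:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def criss_cross_attention(feat):
    """One criss-cross attention pass over an (H, W, C) feature map.

    Each position attends to the H + W - 1 positions in its own row and
    column, instead of all H*W positions as in a non-local block.
    """
    H, W, C = feat.shape
    out = np.zeros_like(feat)
    for i in range(H):
        for j in range(W):
            q = feat[i, j]                    # query at (i, j)
            col = feat[:, j, :]               # H positions in the column
            row = np.delete(feat[i, :, :], j, axis=0)  # row, minus (i, j)
            keys = np.concatenate([col, row], axis=0)  # (H + W - 1, C)
            attn = softmax(keys @ q)          # affinities -> weights
            out[i, j] = attn @ keys           # aggregate values
    return out
```

Applying the block twice (R=2) lets information propagate between any two positions: first along a row/column to an intermediate pixel, then on to the target, which is why the recurrent version matches a non-local block's receptive field at much lower cost.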
CCNet: Criss-Cross Network
Results on Cityscapes: more accurate, with only 15% of the FLOPs & 9% of the memory cost of a non-local block
Results on ADE20K, LIP & COCO: human parsing results on LIP; scene parsing results on ADE20K; instance segmentation results on COCO
From Image to Video: CCNet3D; video semantic segmentation results on CamVid [Huang et al. PAMI 2020]
Visualization of the Learned Context on Cityscapes [Figure columns: Image, Ground Truth, R=1, R=2]
Follow-up Works: Axial Attention [Ho et al. arXiv 2019]; Axial-DeepLab [Wang et al. ECCV 2020]
Recent Hotspots: Boundary modeling for better segmentation [Cheng et al. ECCV 2020] [Cheng et al. CVPR 2020] [Chen et al. CVPR 2020] [Li et al. ECCV 2020] [Kirillov et al. CVPR 2020]
Part II: Interactive Image Segmentation
What is Interactive Image Segmentation? • Semi-automated, class-agnostic segmentation • The target object depends on the user inputs (e.g. points) • Allows iterative refinement until the result is satisfactory [Figure: target object vs. unrelated region]
Why should we consider interactive image segmentation? Manual annotation costs ≈ 60s per instance, ≈ 1.5 hours per image: unaffordable!
Why should we consider interactive image segmentation? Accurately & efficiently
Standard pipeline • The RGB image and user interactions are used as the network input • Train end-to-end with FCNs (e.g. the DeepLab series, PSPNet) [Figure: image + user interactions → fully convolutional network (FCN) → ground truth] [Xu et al. CVPR 2016]
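A minimal sketch of that input construction, roughly following the iFCN recipe [Xu et al. CVPR 2016]: clicks are encoded as truncated Euclidean distance maps and stacked with the RGB image into a 5-channel input (the truncation value 255 follows the paper; `build_input` and its layout are this sketch's own names, not the authors' code):

```python
import numpy as np

def click_map(h, w, clicks, truncate=255.0):
    """Encode a set of (y, x) clicks as a truncated Euclidean
    distance map; with no clicks, the map is constant at `truncate`."""
    if not clicks:
        return np.full((h, w), truncate, dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.full((h, w), np.inf, dtype=np.float32)
    for cy, cx in clicks:
        d = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
        dist = np.minimum(dist, d)       # distance to the nearest click
    return np.minimum(dist, truncate)

def build_input(rgb, pos_clicks, neg_clicks):
    """Stack RGB with positive/negative click maps into the
    5-channel (H, W, 5) array fed to the FCN."""
    h, w, _ = rgb.shape
    pos = click_map(h, w, pos_clicks)    # foreground clicks
    neg = click_map(h, w, neg_clicks)    # background clicks
    return np.dstack([rgb, pos, neg])
```

The network then only needs its first convolution widened from 3 to 5 input channels; everything else stays a standard segmentation FCN.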
Common types of user interaction • Sparse clicks • Bounding box • Scribbles
Common types of user interaction (vs. manual annotation at ≈ 60s per instance) • Sparse clicks: ≈ 2s per instance • Bounding box: ≈ 7s per instance • Scribbles: ≈ 17s per instance
Existing State-of-the-Art Method: DEXTR • DEXTR (Deep Extreme Cut) • Takes 4 extreme points (top, bottom, leftmost and rightmost pixels) as inputs [Figure: cropped image + location cues → segmentation network] [Maninis et al. CVPR 2018]
Existing State-of-the-Art Method: DEXTR • Problems • Multiple extreme points can appear at similar locations • Unrelated objects may lie inside the target object [Maninis et al. CVPR 2018]
Inside-Outside Guidance (IOG) • Inside guidance (1 click): an interior point located roughly at the object center, to disambiguate the segmentation target • Outside guidance (2 clicks): 2 corner clicks of a box enclosing the object, to indicate the background region; the remaining 2 corners can be inferred automatically [Zhang et al. CVPR 2020]
Clicking Paradigm • Click on a corner point • Click on the symmetrical corner • Click on the object center | Annotation time: outside clicks 6.7s, inside click 1.5s
Input Representation • Follow the practice of DEXTR: enlarge the bounding box by 10 pixels to include context; crop and resize the inputs to 512×512 • Input representation: 2 separate Gaussian heatmaps for the inside and outside clicks, fed to the segmentation network together with the RGB image
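The heatmap encoding above can be sketched as follows (assumptions: the Gaussian width `sigma=10` is a placeholder for whatever fixed value the released code uses, and the merge-by-max of overlapping Gaussians follows common keypoint-heatmap practice, not a detail confirmed by the slide):

```python
import numpy as np

def gaussian_heatmap(h, w, points, sigma=10.0):
    """Render (y, x) clicks as one heatmap channel: a 2D Gaussian
    centred on each click, merged with an element-wise max."""
    ys, xs = np.mgrid[0:h, 0:w]
    hm = np.zeros((h, w), dtype=np.float32)
    for cy, cx in points:
        g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        hm = np.maximum(hm, g)
    return hm

# Inside guidance: a single Gaussian at the interior click.
# Outside guidance: Gaussians at the two clicked corners plus the two
# corners inferred from them (opposite corners share coordinates).
h, w = 512, 512
inside = gaussian_heatmap(h, w, [(256, 260)])
(y1, x1), (y2, x2) = (40, 30), (470, 480)            # the two clicked corners
corners = [(y1, x1), (y2, x2), (y1, x2), (y2, x1)]   # inferred corners added
outside = gaussian_heatmap(h, w, corners)
```

Stacking `inside` and `outside` with the cropped 512×512 RGB image yields the 5-channel network input.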
Network Architecture • Segmentation errors mostly occur around the object boundaries
Network Architecture • Segmentation errors mostly occur around the object boundaries • Use a coarse-to-fine network structure [(a) CoarseNet, (b) FineNet] [Chen et al. CVPR 2018]
Beyond Three Clicks • Our IOG naturally supports interactively adding new clicks • Add a lightweight branch to accept the additional inputs • Train with an iterative training strategy [(a) CoarseNet, (b) FineNet, with an optional click for refinement]
IOG vs. Extreme Clicks • Observations • IOG is more effective than extreme points across different backbones • Using a coarse-to-fine network structure further improves the performance
Comparison with SOTA [Bar chart: IoU (%) on PASCAL and GrabCut for Graph cut, Random walker, Geodesic matting, iFCN, RIS-Net, DEXTR, IOG (3 clicks) and IOG (4 clicks); IOG achieves the best results on both benchmarks, e.g. 93.2 on PASCAL vs. 91.5 for DEXTR]
Generalization • Our IOG performs well even on unseen categories • Performs well across different domains even without fine-tuning • Can be further improved using 10% of the domain data for fine-tuning [Charts: PASCAL→COCO transfer on seen/unseen categories, where IOG outperforms Curve-GCN and DEXTR; results on aerial imagery, the medical domain and autonomous driving, with and without fine-tuning]
Qualitative Results: Cityscapes, Agriculture-Vision, Rooftop, ssTEM, general object scenes
Demo [YouTube] [Bilibili]
Automated Mode of IOG [Figure: RGB image + inside guidance + outside guidance → segmentation network]
Automated Mode of IOG • Without user interaction, our IOG can still harvest high-quality masks from off-the-shelf datasets with box annotations (e.g. ImageNet) • Solution: two-stage training: (S1) train a network that takes the box as input; (S2) infer interior clicks from the masks produced in S1 and apply IOG | Inputs vs. IoU (PASCAL): w/ human 93.2; w/o human 91.1
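Step S2 needs a simulated interior click derived from the coarse S1 mask. One plausible heuristic, a sketch rather than the rule used in the paper, is the mask pixel farthest from the background (the distance-transform maximum), which lands near the object center and away from holes:

```python
import numpy as np

def inside_click_from_mask(mask):
    """Simulate IOG's interior click from a binary coarse mask: pick
    the foreground pixel farthest from any background pixel.

    Brute-force O(F*B) distance computation; fine for a sketch, use
    scipy.ndimage.distance_transform_edt for real images.
    """
    fg = np.argwhere(mask > 0)
    bg = np.argwhere(mask == 0)
    # Distance from every foreground pixel to its nearest background pixel.
    d = np.sqrt(((fg[:, None, :] - bg[None, :, :]) ** 2).sum(-1)).min(axis=1)
    cy, cx = fg[np.argmax(d)]
    return int(cy), int(cx)
```

The two outside clicks come for free from the box annotation, so the whole IOG input can be assembled without any human in the loop, which is how the w/o-human row of the table (91.1 IoU) is obtained.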
Pixel-ImageNet https://github.com/shiyinzhang/Pixel-ImageNet • Characteristics: #Classes: 1,000; #Instances: >600K • Possible applications: image classification, instance segmentation, semantic segmentation, salient object detection, and more
Failure Cases