Deep Fisher Networks and Class Saliency Maps for Object Classification and Localisation – PowerPoint presentation transcript

SLIDE 1

Deep Fisher Networks and Class Saliency Maps for Object Classification and Localisation

Karén Simonyan, Andrea Vedaldi, Andrew Zisserman Visual Geometry Group, University of Oxford

SLIDE 2

Outline

  • Classification challenge
    • Can Fisher Vector encodings be improved by a deep architecture?
    • The deep Fisher Network (FN)
    • Combination of two deep models: a Convolutional Network (CN) and the deep Fisher Network
  • Localisation challenge
    • Visualisation of class saliency maps and per-image foreground pixels from a single classification CN
    • Bounding boxes computed from foreground pixels
    • Weak supervision: only image class labels used for training
SLIDE 3
Shallow Image Encoding & Classification

  • Bag of Visual Words (BoW) pipeline: dense SIFT features → vector quantisation (VQ) → BoW histogram → linear SVM → class label (e.g. "dogs")

[Leung & Malik, 1999] [Varma & Zisserman, 2003] [Csurka et al., 2004] [Vogel & Schiele, 2004] [Jurie & Triggs, 2005] [Lazebnik et al., 2006] [Bosch et al., 2006]

SLIDE 4

Fisher Vector (FV) – Encoding

[Perronnin et al., CVPR 2007 & 2010; ECCV 2010]

Dense set of local SIFT features → Fisher vector (high-dimensional)

  • soft-assignment of each descriptor x_i to the GMM: q_k(x_i), the posterior of the k-th Gaussian
  • 1st-order stats (k-th Gaussian, 80-D): Φ_k^(1) = (1 / (N √π_k)) Σ_i q_k(x_i) (x_i − μ_k) / σ_k
  • 2nd-order stats (k-th Gaussian, 80-D): Φ_k^(2) = (1 / (N √(2 π_k))) Σ_i q_k(x_i) ((x_i − μ_k)² / σ_k² − 1)
  • stacking all per-Gaussian blocks, e.g. with each SIFT descriptor x reduced to 80 dimensions by PCA

FV dimensionality: 80 × 2 × 512 = 81,920 (for a mixture of 512 Gaussians)
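The encoding above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: a randomly initialised diagonal-covariance GMM and random descriptors stand in for a GMM fitted to real PCA-reduced SIFT features.

```python
import numpy as np

def fisher_vector(X, w, mu, sigma2):
    """FV of local descriptors X (N x D) under a diagonal-covariance GMM
    with mixture weights w (K,), means mu (K x D), variances sigma2 (K x D)."""
    N, _ = X.shape
    # Soft assignment: q[i, k] = posterior of Gaussian k given descriptor x_i.
    diff = X[:, None, :] - mu[None]                       # N x K x D
    log_p = (np.log(w)[None]
             - 0.5 * np.log(2 * np.pi * sigma2).sum(-1)[None]
             - 0.5 * (diff ** 2 / sigma2[None]).sum(-1))
    q = np.exp(log_p - log_p.max(1, keepdims=True))
    q /= q.sum(1, keepdims=True)

    # 1st- and 2nd-order statistics per Gaussian (each K x D).
    u = (q[:, :, None] * diff / np.sqrt(sigma2)[None]).sum(0)
    u /= N * np.sqrt(w)[:, None]
    v = (q[:, :, None] * (diff ** 2 / sigma2[None] - 1)).sum(0)
    v /= N * np.sqrt(2 * w)[:, None]
    return np.concatenate([u.ravel(), v.ravel()])         # 2*K*D dimensions

rng = np.random.default_rng(0)
D, K, N = 80, 512, 100            # 80-D PCA-reduced SIFT, 512 Gaussians
w = np.full(K, 1.0 / K)
mu = rng.normal(size=(K, D))
sigma2 = np.ones((K, D))
fv = fisher_vector(rng.normal(size=(N, D)), w, mu, sigma2)
print(fv.shape)                   # (81920,) = 80 x 2 x 512
```

The output dimensionality matches the slide: 80 × 2 × 512 = 81,920.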

SLIDE 5
Projection Learning

Fisher vector (high-dimensional) → low-dimensional representation

  • Learn a projection onto a low-dimensional space where the classes are well-separated
  • Joint learning of the projection and the projected-space classifiers (WSABIE)
  • Or project onto the space of classifier scores: the rows of the projection are one-vs-rest linear SVM classifiers in the high-dimensional FV space, which are fast to learn

SLIDE 6

Deep Fisher Network

input image
→ 0-th layer: dense feature extraction (SIFT, colour)
→ 1st Fisher layer (local & global pooling): low-dim FV encoder → SSR & L2 norm. → spatial stacking → L2 norm. & PCA
→ 2nd Fisher layer (global pooling): FV encoder → SSR & L2 norm.
→ classifier layer: one-vs-rest linear SVMs

Shallow Fisher Vector baseline: dense feature extraction (SIFT, raw patches, …) → FV encoder → SSR & L2 norm. → one-vs-rest linear SVMs

SLIDE 7

Fisher Layer

input feature map (w × h × 80)
→ compressed local Fisher encoding: w/2 × h/2 × 82,000, compressed to w/2 × h/2 × 1,000
→ spatial stacking (2×2): w/2 × h/2 × 4,000
→ L2 normalisation & PCA decorrelation: w/2 × h/2 × 256
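A rough NumPy sketch of the stacking and decorrelation steps. The border handling (a sliding 2×2 neighbourhood that drops the last row/column) and the random orthogonal projection standing in for a learnt PCA basis are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w2, h2, d = 8, 8, 1000            # compressed local FV map: (w/2) x (h/2) x 1000
fmap = rng.normal(size=(w2, h2, d))

# Spatial stacking: concatenate each sliding 2x2 neighbourhood of 1000-D
# vectors into one 4000-D vector (the border row/column is simply dropped here).
stacked = np.concatenate(
    [fmap[:-1, :-1], fmap[:-1, 1:], fmap[1:, :-1], fmap[1:, 1:]], axis=-1)

# L2-normalise each location, then decorrelate and reduce to 256-D.
stacked /= np.linalg.norm(stacked, axis=-1, keepdims=True)
basis, _ = np.linalg.qr(rng.normal(size=(4 * d, 256)))   # stand-in for PCA basis
out = stacked @ basis
print(out.shape)                  # (7, 7, 256)
```

The channel counts track the slide: 1,000 → 4,000 after 2×2 stacking → 256 after decorrelation.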

SLIDE 8

Deep Fisher Network

(Architecture diagram repeated from Slide 6.)

SLIDE 9

Classification Results for the Fisher Network

ImageNet 2010 challenge dataset:

  • 1.2M images, 1K classes
  • SIFT & colour features
  • Learning: 2–3 days on 200 CPU cores (MATLAB + MEX implementation)

Adding a Fisher layer improves classification accuracy.

SLIDE 10

Deep ConvNet Implementation

  • Based on cuda-convnet [Krizhevsky et al., 2012]
  • 8 weight layers (rather narrow): conv64–conv256–conv256–conv256–conv256–full4096–full4096–full1000
  • Jittering: cropping, flipping, PCA-aligned noise, random occlusion
  • Single ConvNet instance
SLIDE 11

Classification Results

ImageNet 2012 challenge dataset:

  • 1.2M images, 1K classes
  • top-5 classification accuracy

  Method                                  | top-5 accuracy
  ----------------------------------------|------------------------------
  FV encoding (our 2012 entry)            | 72.7%
  Deep Fisher Network                     | 76.9%
  Deep ConvNet [Krizhevsky et al., 2012]  | 81.8% (83.6% with 5 ConvNets)
  Deep ConvNet (our implementation)       | 82.3%
  Deep ConvNet + Deep Fisher Network      | 84.8%

ConvNet and FisherNet are complementary.

SLIDE 12

Outline

  • Classification challenge
    • Can Fisher Vector encodings be improved by a deep architecture?
    • The deep Fisher Network (FN)
    • Combination of two deep models: a Convolutional Network (CN) and the deep Fisher Network
  • Localisation challenge
    • Visualisation of class saliency maps and per-image foreground pixels from a single classification CN
    • Bounding boxes computed from foreground pixels
    • Weak supervision: only image class labels used for training
SLIDE 13

Deep Inside ConvNets: What Has Been Learnt?

ConvNet class model visualisation

  • find a (regularised) image I with a high class score S_c(I): maximise S_c(I) − λ‖I‖₂², with the learnt model fixed
  • compute the gradient ∂S_c/∂I using back-prop

Cf. ConvNet training:

  • maximise the log-likelihood of the correct class
  • compute gradients w.r.t. the weights using back-prop

(S_c is the unnormalised score from the fully-connected classifier layer, taken before the soft-max layer.)

Visualizing higher-layer features of a deep network. Erhan, D., Bengio, Y., Courville, A., Vincent, P. Technical report, University of Montreal, 2009.
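The optimisation scheme can be shown on a toy model. Everything here is invented for the sketch: a plain linear class score stands in for a real ConvNet's non-linear S_c, so only the gradient-ascent procedure itself is illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_classes = 64, 5
W = rng.normal(size=(n_classes, n_pixels))   # fixed "learnt" weights (toy model):
                                             # here the class score is S_c(I) = W[c] @ I

def score_and_grad(I, c):
    """Class score S_c(I) and its gradient dS_c/dI (trivial for a linear
    score; a real ConvNet computes this gradient by back-prop)."""
    return W[c] @ I, W[c]

# Gradient ascent on the regularised objective S_c(I) - lam * ||I||_2^2,
# starting from a zero image; only the image is updated, never the weights.
c, lam, lr = 2, 0.1, 0.5
I = np.zeros(n_pixels)
for _ in range(200):
    _, grad = score_and_grad(I, c)
    I += lr * (grad - 2 * lam * I)

# For this linear score the regularised optimum is W[c] / (2 * lam).
print(np.allclose(I, W[c] / (2 * lam)))      # True
```

For a linear score the loop converges to a closed-form optimum, which makes the sketch easy to check; with a real ConvNet the same loop produces the regularised class-model images shown on the following slides.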

SLIDE 14

fox

SLIDE 15

pepper

SLIDE 16

dumbbell

SLIDE 17

Deep Inside ConvNets: What Has Been Learnt?

ConvNet class model visualisation

  • find a (regularised) image I with a high class score S_c(I): maximise S_c(I) − λ‖I‖₂², with the learnt model fixed
  • compute the gradient ∂S_c/∂I using back-prop

NB: the unnormalised score S_c from the fully-connected classifier layer is used, not the soft-max output; maximising the soft-max probability gives a less prominent visualisation, as it concentrates on reducing the scores of the other classes.

Visualizing higher-layer features of a deep network. Erhan, D., Bengio, Y., Courville, A., Vincent, P. Technical report, University of Montreal, 2009.

SLIDE 18

Deep Inside ConvNets: What Makes an Image Belong to a Class?

  • ConvNets are highly non-linear → use a local linear approximation
  • 1st-order expansion of the class score S_c around a given image I₀: S_c(I) ≈ S_c(I₀) + gᵀ(I − I₀), where g = ∂S_c/∂I|_{I₀} is computed using back-prop
  • g has the same dimensions as the image
  • the magnitude of g defines a saliency map for the image and class c

How to Explain Individual Classification Decisions. Baehrens, D., Schroeter, T., Harmeling, S., Kawanabe, M., Hansen, K., Müller, K.-R. JMLR, 2010.
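The first-order expansion can be illustrated on a tiny ReLU network with the back-prop step written out by hand. The architecture, weights, and image here are invented for the sketch; a real ConvNet saliency map is computed the same way, just with a deep network and automatic back-prop.

```python
import numpy as np

rng = np.random.default_rng(1)
H, Wd = 6, 6                                   # toy "image" size
W1 = rng.normal(size=(16, H * Wd))             # invented fixed weights
W2 = rng.normal(size=(3, 16))                  # 3 classes

def score_and_saliency(img, c):
    """Class score S_c and saliency map |dS_c/dI| for a tiny ReLU net."""
    x = img.ravel()
    h = np.maximum(W1 @ x, 0)                  # hidden ReLU layer
    s = W2 @ h                                 # unnormalised class scores
    g = W1.T @ ((h > 0) * W2[c])               # manual back-prop: dS_c/dx
    return s[c], np.abs(g).reshape(H, Wd)      # gradient magnitude = saliency

img = rng.normal(size=(H, Wd))
s, sal = score_and_saliency(img, 0)

# Sanity check: a ReLU net is piecewise linear, so for a small bump the
# finite difference should match the back-propped gradient magnitude.
eps, i, j = 1e-6, 2, 3
bumped = img.copy()
bumped[i, j] += eps
s2, _ = score_and_saliency(bumped, 0)
diff_fd = abs(abs(s2 - s) / eps - sal[i, j])
print(diff_fd)
```

The finite-difference check makes the "local linear approximation" point concrete: the saliency value at a pixel is exactly how fast the class score changes when that pixel changes.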

SLIDE 19

Saliency Maps For Top‐1 Class

SLIDE 20

Saliency Maps For Top‐1 Class

SLIDE 21

Saliency Maps For Top‐1 Class

SLIDE 22
  • Weakly supervised
    • computed using a classification ConvNet, trained on image class labels
    • no additional annotation required (e.g. boxes or masks)
  • Highlights discriminative object parts
  • Instant computation – no sliding window
  • Fires on several object instances
  • Related to deconvnet [Zeiler and Fergus, 2013]
    • very similar for convolution, max-pooling, and ReLU layers
    • but we also back-prop through fully-connected layers

(Figure: image / saliency map pairs.)

SLIDE 23

Saliency Maps for Object Localisation

  • Image → top‐k class → class saliency map → object box
SLIDE 24
BBox Localisation for ILSVRC Submission

  • Given an image and a saliency map:

SLIDE 25
BBox Localisation for ILSVRC Submission

  • Given an image and a saliency map:

  1. Foreground/background mask using thresholds on the saliency

(blue – foreground, cyan – background, red – undefined)

SLIDE 26
BBox Localisation for ILSVRC Submission

  • Given an image and a saliency map:

  1. Foreground/background mask using thresholds on the saliency
  2. GraphCut colour segmentation [Boykov and Jolly, 2001]

SLIDE 27
BBox Localisation for ILSVRC Submission

  • Given an image and a saliency map:

  1. Foreground/background mask using thresholds on the saliency
  2. GraphCut colour segmentation [Boykov and Jolly, 2001]
  3. Bounding box of the largest connected component

  • Colour information propagates the segmentation from the most discriminative areas
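A simplified sketch of steps 1 and 3 above. This is not the submission's pipeline: a plain threshold stands in for the GraphCut colour segmentation of step 2, and a BFS pass finds the largest connected component.

```python
import numpy as np
from collections import deque

def bbox_from_saliency(sal, fg_thresh):
    """Bounding box of the largest 4-connected foreground component
    of a thresholded saliency map."""
    fg = sal >= fg_thresh                     # step 1: foreground mask
    seen = np.zeros_like(fg)
    best = []
    for si, sj in zip(*np.nonzero(fg)):
        if seen[si, sj]:
            continue
        comp, queue = [], deque([(si, sj)])   # BFS over one component
        seen[si, sj] = True
        while queue:
            i, j = queue.popleft()
            comp.append((i, j))
            for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                if (0 <= ni < fg.shape[0] and 0 <= nj < fg.shape[1]
                        and fg[ni, nj] and not seen[ni, nj]):
                    seen[ni, nj] = True
                    queue.append((ni, nj))
        if len(comp) > len(best):             # keep the largest component
            best = comp
    rows, cols = [p[0] for p in best], [p[1] for p in best]
    return int(min(rows)), int(min(cols)), int(max(rows)), int(max(cols))

# Toy saliency map with two blobs; the larger one should define the box.
sal = np.zeros((10, 10))
sal[1:3, 1:3] = 1.0               # small blob (4 pixels)
sal[5:9, 4:9] = 1.0               # large blob (20 pixels)
print(bbox_from_saliency(sal, 0.5))   # (5, 4, 8, 8) = (top, left, bottom, right)
```

Taking only the largest component is what makes the method fail on images with several object instances, as the failure-case slides below show.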

SLIDE 28

Segmentation‐Localisation Examples

SLIDE 29

Segmentation‐Localisation Examples

SLIDE 30

Segmentation‐Localisation Failure Cases

  • Several object instances
SLIDE 31
Segmentation‐Localisation Failure Cases

  • Segmentation isn’t propagated from the salient parts

SLIDE 32
Segmentation‐Localisation Failure Cases

  • Limitations of GraphCut segmentation

SLIDE 33

Summary

  • Fisher encoding benefits from stacking
  • Deep FishNet is complementary to Deep ConvNet
  • Class saliency maps are useful for localisation
    • location of discriminative object parts
    • weakly supervised: bounding boxes not used for training
    • fast to compute