All That Glisters Is Not Convnets: Hybrid Architectures For Faster, Better Solvers
SLIDE 1

All That Glisters Is Not Convnets:

Hybrid Architectures For Faster, Better Solvers

Prof Tom Drummond

SLIDE 2

Image classification network structure

Conv+ReLU → Conv+Downsample → Conv+ReLU → Reshape → FC+ReLU → FC+Softmax → #classes

Stages: Feature detection → Interpretation → Solving
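A minimal PyTorch sketch of this pipeline (the layer widths, input size, and all names here are illustrative assumptions, not the network from the talk):

```python
# Hypothetical sketch of the slide's classification pipeline.
import torch
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(               # Feature detection
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                         # Conv+Downsample stage
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.solver = nn.Sequential(                 # Interpretation + Solving
            nn.Flatten(),                            # Reshape
            nn.Linear(64 * 16 * 16, 128), nn.ReLU(), # FC+ReLU
            nn.Linear(128, num_classes),             # one score per class
        )

    def forward(self, x):
        # FC+Softmax: turn per-class scores into probabilities
        return torch.softmax(self.solver(self.features(x)), dim=1)

probs = Classifier(num_classes=10)(torch.randn(1, 3, 32, 32))  # 32x32 RGB input
```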

SLIDE 3

Semantic segmentation network structure

Conv+ReLU → Conv+Downsample → Conv+ReLU → 1×1 Conv+ReLU → Upsample → Conv+Softmax

Stages: Feature detection → Interpretation → Solving

SLIDE 4

Two issues:

1) ReLU doesn't have much nonlinearity

– so we need lots of layers to fit complex functions

2) Softmax solvers are inefficient

– the number of classes is "baked in"
– classes have to be axis-aligned

SLIDE 5

Two solutions:

1) Use solvers with much more compactly represented nonlinearities (decision forests)

2) Replace the axis-aligned representation with an embedding representation, and solve using nearest-neighbor examples

SLIDE 6

Nonlinearity in ReLU

[Plot of the ReLU function: the part left of zero is linear … and so is the part to the right.]

SLIDE 7

Nonlinearity in ReLU

[Plot of ReLU again: both parts are linear, even for leaky ReLU; all the nonlinearity is concentrated at the single kink at zero.]

This means a ReLU network can only build piecewise linear functions.

SLIDE 8

Nonlinearity in ReLU

Piecewise linear functions can work pretty well in one dimension, and also in two dimensions, although more pieces are needed. But our problems live in thousands or millions of dimensions, not one or two!
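To see the piecewise linearity concretely, here is a small sketch (random weights; all names are made up) showing that a one-hidden-layer ReLU network on a 1D input is linear everywhere except at a handful of kinks:

```python
# Sketch: a one-hidden-layer ReLU net on a 1D input is piecewise linear.
# Weights are random; the point is the shape of the function, not a fit.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 1)), rng.standard_normal(8)  # 8 hidden units
W2, b2 = rng.standard_normal((1, 8)), rng.standard_normal(1)

x = np.linspace(-3, 3, 601)
h = np.maximum(W1 @ x[None, :] + b1[:, None], 0.0)  # ReLU hidden layer
y = (W2 @ h + b2[:, None]).ravel()

# Between kinks the slope is constant, so the second difference of y is
# zero; it is nonzero only where some hidden unit crosses zero.
print(np.sum(np.abs(np.diff(y, 2)) > 1e-9), "kink samples for 8 hidden units")
```

With 8 hidden units the function has at most 8 kinks; fitting a genuinely curved surface in high dimensions this way needs a huge number of pieces, hence many layers.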

SLIDE 9

Semantic segmentation network structure

[Diagram: feature maps → Decision Forest → Sum vectors → Softmax]

• Each pixel is passed through the decision forest.
• Each question tests a single channel at that pixel against a threshold.
• Tree leaves contain a marginal log distribution.
• The selected log distributions are summed over the forest.
• Softmax gives the probabilities for every class at each pixel.
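A minimal sketch of that per-pixel forest solver (the flat-array tree layout and every name here are illustrative assumptions):

```python
# Hypothetical sketch of the per-pixel decision-forest solver above.
import numpy as np

class Tree:
    """Complete binary tree: internal nodes test one feature channel
    against a threshold; leaves hold marginal log class distributions."""
    def __init__(self, channel, threshold, leaf_logp, depth):
        self.channel = channel      # channel index per internal node
        self.threshold = threshold  # threshold per internal node
        self.leaf_logp = leaf_logp  # (2**depth, n_classes) log distributions
        self.depth = depth

    def route(self, pixel_features):
        node = 0
        for _ in range(self.depth):  # children of node n are 2n+1, 2n+2
            right = pixel_features[self.channel[node]] > self.threshold[node]
            node = 2 * node + 1 + int(right)
        return self.leaf_logp[node - (2 ** self.depth - 1)]  # leaf's log dist

def forest_probs(pixel_features, forest):
    # Sum the selected log distributions over the forest, then softmax.
    logp = sum(t.route(pixel_features) for t in forest)
    e = np.exp(logp - logp.max())
    return e / e.sum()

# One depth-2 tree over 3 channels and 2 classes, applied to one pixel:
t = Tree([0, 1, 2], [0.5, 0.5, 0.5],
         np.log([[.9, .1], [.6, .4], [.4, .6], [.1, .9]]), depth=2)
print(forest_probs(np.array([0.2, 0.8, 0.3]), [t]))
```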

SLIDE 10

Semantic segmentation network structure

[Diagram: the segmentation network with the softmax solver replaced: Decision Forest → Sum vectors → Upsample → Sum]

SLIDE 11

Benefits

• Decision trees pack a lot of nonlinearity into a small space.
• They are responsive to complex joint distributions across different channels.
• Trees are very quick to train (minutes or hours, not days).
• Trees outperform softmax solvers.

SLIDE 12

Performance

NYUDv2            Pixel Acc (%)   Mean Acc (%)   Mean I/U (%)   Time to train
FCN-8s-heavy      60.9            43.1           30.2           20 hrs
FCN-16s-heavy     61.5            42.4           30.5           –
RRF (Ours)        66.6            49.9           36.5           30 mins

Pascal VOC        Mean I/U (%)
FCN-8s-heavy      67.2
Deconvnet         69.6
RRF (Ours)        68.9

SLIDE 13

Softmax solvers are inefficient

Softmax demands that classes are axis-aligned: it only allows one class per dimension.

[Diagram: each class, e.g. Bicycles, gets its own axis of the score vector.]

$$P(\text{class}=i \mid \mathbf{x}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$
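The formula in numpy, to make "one class per dimension" concrete (a trivial sketch; the three scores are made-up numbers):

```python
# Softmax over a score vector: class i owns dimension i of the output.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # one score per class, fixed count
```

Note that the length of the score vector is the number of classes, which is exactly how the class count gets "baked in".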

SLIDE 14

Embedding spaces are more efficient

[Diagram: three well-separated clusters in an embedding space: Cats, Dogs, Bicycles.]

SLIDE 15

Embedding spaces are more efficient

[Diagram: a fourth cluster, Cars, added to the same embedding space.]

SLIDE 16

Embedding spaces are more efficient

[Diagram: clusters for Bicycles, Cars, Cats in the embedding space.]

This is even more true in higher dimensions. We are embedding in 4096D.

SLIDE 17

Neural Network as Embedding

Record the embedding of each example of the training set.

SLIDE 18

Neural Network as Embedding

To classify, put a Gaussian kernel at each recorded location and sum the distributions at the point of interest. In practice, just use the K nearest neighbors to compute P:

$$P(\mathbf{x} \in \text{class } c) = \frac{\sum_{i:\,y_i=c} \exp\left(-\lVert \mathbf{x}-\mathbf{e}_i \rVert^2 / 2\sigma^2\right)}{\sum_j \exp\left(-\lVert \mathbf{x}-\mathbf{e}_j \rVert^2 / 2\sigma^2\right)}$$

where the $\mathbf{e}_i$ are the recorded embeddings and $y_i$ their class labels.
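A sketch of that K-NN kernel classifier (the function name, parameters, and the bandwidth sigma are assumptions for illustration):

```python
# Sketch: Gaussian-weighted K-NN classification in an embedding space.
import numpy as np

def knn_class_probs(query, embeddings, labels, n_classes, k=10, sigma=1.0):
    d2 = np.sum((embeddings - query) ** 2, axis=1)  # squared distances
    nearest = np.argsort(d2)[:k]                    # K nearest neighbors
    w = np.exp(-d2[nearest] / (2 * sigma ** 2))     # Gaussian kernel weights
    probs = np.zeros(n_classes)
    np.add.at(probs, labels[nearest], w)            # sum weights per class
    return probs / probs.sum()
```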

SLIDE 19

Neural Network as Embedding

Benefits of this approach:

1. Semantically related instances get clustered together (instead of strung out across a large space).
2. Classes can happily form more than one cluster if that's appropriate (e.g. two very different kinds of dogs in the dataset).
3. New classes can be added, or learning transferred to new domains, on the fly (just pump examples through the network).

SLIDE 20

Embedding transfer learning results

Stanford Cars

• 196 classes, ~82 images per class
• Train on 98 classes, test on the remaining 98

CUB Birds - 2011

• 200 classes, ~58 images per class
• Train on 100 classes, test on the remaining 100

SLIDE 21

Classification Results

CUB Birds - 2011

• Conventional classification task
• 200 classes
• 5,994 training images
• 5,794 test images

SLIDE 22

Finding K nearest neighbors

We could use a sparse SVM as the classifier, but we have a state-of-the-art approximate nearest neighbor algorithm: FANNG.

FANNG can find the nearest neighbor with 90% accuracy in 1M examples of SIFT descriptors at a rate of 2.5M queries/s, or at 99.7% accuracy at 300K queries/s. It can also return the K nearest neighbors.

SLIDE 23

FANNG: Fast Approximate Nearest Neighbour Graph

Based on two insights:

1. If a reference point is near a query point, it's worth checking its neighbors to see if they're closer.
   • So build a graph linking points to their neighbors.

2. If there are many neighbors in the same direction, it's not necessary to have edges to all of them.
   • If the nearest of them isn't closer to the query, then the others probably aren't either (see the pruning sketch below).
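A sketch of that edge-pruning idea as I read it (an occlusion-style rule; the function and all names are assumptions): an edge from p to q is kept only if no already-kept neighbor of p is closer to q than p is.

```python
# Sketch of direction-based edge pruning: drop the edge p -> q whenever
# an already-kept neighbor r of p is closer to q than p itself is.
import numpy as np

def prune_edges(p, candidates, points):
    """candidates: indices of p's neighbors, sorted by distance from p."""
    kept = []
    for q in candidates:
        occluded = any(np.linalg.norm(points[r] - points[q])
                       < np.linalg.norm(points[p] - points[q])
                       for r in kept)
        if not occluded:        # q covers a genuinely new direction
            kept.append(q)
    return kept
```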

SLIDE 24

FANNG Algorithm

Start somewhere in the graph.
While (not bored) {
    Measure the distance to the target (query point).
    Insert this point into a priority queue of visited vertices, ordered by distance from the target.
    Move to the nearest unvisited neighbor of the highest-priority vertex that still has unvisited neighbors.
}

[Diagram: the first six steps of the walk toward the query point, with visited vertices numbered 1–6 and shown in the priority queue ordered by distance.]
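A runnable sketch of that loop, with a heap as the priority queue (the graph layout, names, and the stopping rule are assumptions; the real implementation is far more optimised):

```python
# Sketch of FANNG-style greedy graph search toward a query point.
import heapq
import numpy as np

def search(graph, points, query, start, max_steps=200):
    """graph: dict mapping vertex index -> list of neighbor indices."""
    dist = lambda v: float(np.linalg.norm(points[v] - query))
    visited = {start}
    pq = [(dist(start), start)]        # visited vertices, nearest first
    for _ in range(max_steps):         # "while (not bored)"
        for _, v in sorted(pq):        # highest-priority vertex first
            unvisited = [u for u in graph[v] if u not in visited]
            if unvisited:
                nxt = min(unvisited, key=dist)  # nearest unvisited neighbor
                visited.add(nxt)
                heapq.heappush(pq, (dist(nxt), nxt))
                break
        else:
            break                      # nothing left to expand
    return sorted(pq)                  # candidates, nearest first
```

The first K entries of the returned list are the approximate K nearest neighbors, matching the remark on slide 28.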

SLIDE 25

FANNG Algorithm

[Diagram: the same walk continued; the priority queue now holds vertices 1–10, ordered by distance to the query point.]

(The search loop is unchanged from the previous slide.)

SLIDE 26

High dimensional data is VERY counterintuitive.

Fill a 128D hypercube with 1,000,000 points of random data.

SLIDE 27

High dimensional data is VERY counterintuitive

Fill a 128D hypercube with 1,000,000 points of random data. Inflate a hypersphere centred in the hypercube until it touches a data point, i.e. find the largest sphere containing no data.
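A scaled-down Monte Carlo sketch of that experiment (50,000 points rather than the slide's 1,000,000, so it runs in seconds; all names are made up):

```python
# How far is the centre of a 128D unit hypercube from the nearest of
# many uniform random points? Surprisingly far.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50_000, 128
points = rng.random((n, d))              # uniform in the unit hypercube
centre = np.full(d, 0.5)

# Radius of the largest empty sphere centred at the cube's centre:
radius = np.min(np.linalg.norm(points - centre, axis=1))
print(f"nearest point is {radius:.2f} away, yet the cube edge is 1.0")
```

Each coordinate contributes about 1/12 to the squared distance, so a typical point sits around $\sqrt{128/12} \approx 3.3$ edge-lengths from the centre, and even the nearest of tens of thousands of points is not much closer: the empty sphere's radius is roughly three times the cube's edge.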

SLIDE 28

FANNG Performance

FANNG can find the nearest neighbor with 90% accuracy in 1M examples of SIFT descriptors at a rate of 2.5M queries/s, or at 99.7% accuracy at 300K queries/s. It can also return the K nearest neighbors (the highest K entries in the priority queue when searching is finished).

SLIDE 29

FANNG Performance

For 128D SIFT data, search time scales as the 0.2 power of the dataset size:

$$t \propto N^{0.2}$$
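To put that exponent in perspective (simple arithmetic from the stated law): growing the dataset from $10^6$ to $10^9$ points multiplies the search time by only $(10^3)^{0.2} \approx 4$.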

SLIDE 30

FANNG Additional Uses

In addition, we are using FANNG for:

• Camera relocalization for SLAM
  – finding keypoints or scenes previously visited
• Hard negative mining for training triplet networks
  – prioritising badly classified negative examples
• K-means clustering (K = 1,000,000; N = 20,000,000; D = 128)