All That Glisters Is Not Convnets: Hybrid Architectures For Faster, Better Solvers
Prof Tom Drummond
Image classification network structure
Conv+RELU → Conv+Downsample → Conv+RELU → Reshape → FC+RELU → FC+Softmax → #classes
Stages: Feature detection → Interpretation → Solving
Semantic segmentation network structure
Conv+RELU → Conv+Downsample → Conv+RELU → 1x1 Conv+RELU → Upsample → Conv+Softmax
Stages: Feature detection → Interpretation → Solving
Two issues:
1) RELU doesn't have much nonlinearity, so lots of layers are needed to fit complex functions
2) Softmax solvers are inefficient: the number of classes is "baked in", and classes have to be axis-aligned
Two solutions:
1) Use solvers with much more compactly represented non-linearities (decision forests)
2) Replace the axis-aligned representation with an embedding representation and solve using nearest-neighbor examples
Nonlinearity in RELU
Either side of the kink is linear (even for leaky RELU); all the non-linearity is concentrated at the kink. A RELU network can therefore only build piecewise linear functions.
Nonlinearity in RELU
Piecewise linear fits can work pretty well in 1 dimension, and also in 2 dimensions (although more pieces are needed). But real problems are in thousands or millions of dimensions, not 1 or 2!
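To make the piecewise-linear point concrete, here is a minimal numpy sketch (a toy illustration, not from the talk): a one-hidden-layer RELU network in 1D is a sum of hinge functions, so its output is linear between kinks and only the kinks add nonlinearity.

```python
# Toy sketch: a small ReLU network in 1D produces a piecewise linear function.
import numpy as np

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=8), rng.normal(size=8)   # 8 hidden units -> up to 8 kinks
w2, b2 = rng.normal(size=8), rng.normal()

def relu_net(x):
    # hidden: max(0, w1*x + b1); each unit contributes one hinge
    h = np.maximum(0.0, np.outer(x, w1) + b1)
    return h @ w2 + b2                            # linear combination of hinges

x = np.linspace(-3, 3, 7)
print(relu_net(x))   # values lie on straight segments between the kinks
```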
Semantic segmentation network structure
Decision Forest solver (sum vectors → softmax):
- Each pixel is passed through the decision forest
- Each question tests a single channel at that pixel against a threshold
- Tree leaves contain a marginal log distribution
- Selected log distributions are summed over the forest
- Softmax gives the probabilities for every class at each pixel
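A hedged sketch of the per-pixel forest solver described above; the Node structure and function names are illustrative assumptions, not the authors' implementation.

```python
# Sketch: pass one pixel's feature vector through a forest, sum leaf
# log-distributions, and apply softmax to get per-class probabilities.
import numpy as np

class Node:
    def __init__(self, channel=None, threshold=None, left=None, right=None, log_dist=None):
        self.channel, self.threshold = channel, threshold
        self.left, self.right = left, right
        self.log_dist = log_dist              # leaves store a marginal log distribution

def tree_lookup(node, pixel_features):
    # Each internal node tests a single channel at this pixel against a threshold.
    while node.log_dist is None:
        node = node.left if pixel_features[node.channel] < node.threshold else node.right
    return node.log_dist

def forest_predict(forest, pixel_features):
    # Sum the selected leaf log-distributions over the forest, then softmax.
    logits = sum(tree_lookup(tree, pixel_features) for tree in forest)
    e = np.exp(logits - logits.max())
    return e / e.sum()                        # class probabilities for this pixel
```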
Semantic segmentation network structure
[Diagram: segmentation pipeline with the softmax solver replaced by Decision Forest + Sum vectors + Upsample]
Benefits
- Decision trees pack a lot of nonlinearity into a small space
- They are responsive to complex joint distributions across different channels
- Trees are very quick to train (minutes or hours, not days)
- Trees outperform softmax solvers
Performance
NYUDv2:
Method          Pixel Acc (%)   Mean Acc (%)   Mean I/U (%)   Time to train
FCN-8s-heavy    60.9            43.1           30.2           20 hrs
FCN-16s-heavy   61.5            42.4           30.5
RRF (Ours)      66.6            49.9           36.5           30 mins

Pascal VOC:
Method          Mean I/U (%)
FCN-8s-heavy    67.2
Deconvnet       69.6
RRF (Ours)      68.9
Softmax solvers are inefficient
Softmax demands that classes are axis-aligned: it only allows one class per dimension.
P(\text{class} = i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}
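A small sketch (with assumed layer sizes) of why the class count is "baked in": the final FC+softmax layer owns one weight row per class, so adding a class means changing the weight matrix and retraining.

```python
# Sketch: the softmax solver's weight matrix has exactly one row per class.
import numpy as np

n_features, n_classes = 4096, 21          # assumed sizes, for illustration only
W = np.zeros((n_classes, n_features))     # one axis-aligned direction per class
b = np.zeros(n_classes)

def softmax_classify(x):
    logits = W @ x + b
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # P(class = i) = exp(x_i) / sum_j exp(x_j)
```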
Embedding spaces are more efficient
[Figures: 2D embedding spaces containing clusters for Cats, Dogs, Bicycles and Cars]
Even more so in higher dimensions: we are embedding in 4096D.
Neural Network as Embedding
Record the embedding of each example of the training set.
Neural Network as Embedding
To classify, put a Gaussian kernel at each training location and sum the distributions at the point of interest. In practice, just use the K nearest neighbors to compute P:
P(x \in \text{class } c) = \frac{\sum_{i \in c} \exp\left(-\|x - x_i\|^2 / 2\right)}{\sum_{i} \exp\left(-\|x - x_i\|^2 / 2\right)}
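A minimal sketch of the kernel/K-nearest-neighbor classifier above; the function and parameter names (K, sigma) are illustrative assumptions rather than the talk's code.

```python
# Sketch: soft classification in the embedding space using Gaussian kernel
# weights over the K nearest training embeddings.
import numpy as np

def knn_class_probs(query_emb, train_embs, train_labels, K=50, sigma=1.0):
    d2 = np.sum((train_embs - query_emb) ** 2, axis=1)   # squared distances
    nn = np.argsort(d2)[:K]                               # K nearest neighbours
    w = np.exp(-d2[nn] / (2 * sigma ** 2))                # Gaussian kernel weights
    probs = np.zeros(train_labels.max() + 1)
    np.add.at(probs, train_labels[nn], w)                 # sum kernel mass per class
    return probs / probs.sum()
```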
Neural Network as Embedding
Benefits of this approach:
1. Semantically related instances get clustered together (instead of strung out across a large space)
2. Classes can happily form more than one cluster if that's appropriate (e.g. two very different kinds of dogs in the dataset)
3. Can add new classes or transfer learning to new domains on the fly, just by pumping examples through the network (see the sketch below)
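A sketch of point 3, adding a class on the fly: no retraining is needed, only embedding the new examples and appending them to the stored set. Here embed() is an assumed wrapper around the trained network, not part of the talk's code.

```python
# Sketch: extend the stored embeddings with a brand-new class at run time.
import numpy as np

def add_class(train_embs, train_labels, new_images, new_label, embed):
    new_embs = np.stack([embed(img) for img in new_images])  # pump examples through the network
    return (np.vstack([train_embs, new_embs]),
            np.concatenate([train_labels, np.full(len(new_images), new_label)]))
```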
Embedding transfer learning results
Stanford Cars
196 classes, ~82 images per class. Train on 98 classes, test on the remaining 98.
CUB Birds - 2011
200 classes, ~58 images per class. Train on 100 classes, test on the remaining 100.
Classification Results
CUB Birds - 2011
Conventional classification task: 200 classes, 5994 training images, 5794 test images.
Finding K nearest neighbors
We could use a sparse SVM as a classifier, but we have a state-of-the-art approximate nearest neighbor algorithm: FANNG.
FANNG can find the nearest neighbor with 90% accuracy in 1M examples of SIFT descriptors at a rate of 2.5M queries/s, or at 99.7% accuracy at 300K queries/s. It can also return the K nearest neighbors.
FANNG: Fast Approximate Nearest Neighbour Graph
Based on two insights:
1. If a reference point is near a query point, it's worth checking its neighbors to see if they're closer
   - so build a graph linking points to their neighbors
2. If there are many neighbors in the same direction, it's not necessary to have edges to all of them
   - if the nearest of them isn't closer to the query, then the others probably aren't either (a pruning sketch follows this list)
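A hedged sketch of the edge-pruning idea in insight 2, a simplified version of the occlusion rule from the FANNG paper; this brute-force O(n²) construction is for clarity only and is not the paper's build algorithm.

```python
# Sketch: keep an edge p->q only if no closer, already-kept neighbour of p
# is itself closer to q than p is (that neighbour already "covers" the direction).
import numpy as np

def build_pruned_edges(points, max_edges=10):
    n = len(points)
    edges = {i: [] for i in range(n)}
    for p in range(n):
        d = np.linalg.norm(points - points[p], axis=1)
        for q in np.argsort(d)[1:]:              # candidate neighbours, nearest first
            occluded = any(np.linalg.norm(points[r] - points[q]) < d[q]
                           for r in edges[p])
            if not occluded:
                edges[p].append(int(q))
            if len(edges[p]) >= max_edges:
                break
    return edges
```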
FANNG Algorithm
Start somewhere in the graph
While (not bored) {
    Measure distance to target (query point)
    Insert this point into a priority* queue of vertices visited (*in order of distance from target)
    Find the nearest unvisited neighbor of the highest-priority vertex with unvisited neighbors
}
[Figure: query point and graph vertices numbered 1-6 in the order they are visited]
FANNG Algorithm
[Figure: the same search continued, visiting vertices 7-10 as it approaches the query point]
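A minimal Python sketch of the traversal shown on the previous two slides: greedy best-first search over the neighbour graph, keeping a priority queue of visited vertices ordered by distance to the query. Function and parameter names here are assumptions, not the FANNG implementation.

```python
# Sketch: best-first graph search; stop after a fixed budget of distance checks.
import heapq
import numpy as np

def fanng_search(query, points, edges, start=0, max_checks=100):
    dist = lambda i: float(np.linalg.norm(points[i] - query))
    visited = {start}
    pq = [(dist(start), start)]            # visited vertices, keyed by distance to query
    while len(visited) < max_checks:
        # Nearest unvisited neighbour of the highest-priority vertex that has one.
        step = None
        for _, v in sorted(pq):
            unvisited = [u for u in edges[v] if u not in visited]
            if unvisited:
                step = min(unvisited, key=dist)
                break
        if step is None:
            break                          # no unvisited neighbours remain anywhere
        visited.add(step)
        heapq.heappush(pq, (dist(step), step))
    return [v for _, v in sorted(pq)[:5]]  # top of the queue: approximate nearest neighbours
```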
High dimensional data is VERY counterintuitive
Fill a 128D hypercube with 1,000,000 points of random data. Inflate a hypersphere centred in the hypercube until it touches a data point (i.e. find the largest sphere with no data in it).
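A rough sketch of this experiment (it allocates roughly 0.5 GB): the radius of the largest empty sphere centred in the cube is simply the distance from the centre to the nearest data point.

```python
# Sketch: largest empty sphere centred in a 128D unit hypercube filled with 1M random points.
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 1_000_000
points = rng.random((n, d), dtype=np.float32)            # uniform in [0,1]^128
centre = np.full(d, 0.5, dtype=np.float32)

# Process in chunks to limit temporary memory; the radius is the minimum distance.
radius = min(np.linalg.norm(chunk - centre, axis=1).min()
             for chunk in np.array_split(points, 50))
print(radius)   # far larger than 0.5, the distance from the centre to the cube faces
```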
FANNG Performance
FANNG can find the nearest neighbor with 90% accuracy in 1M examples of SIFT descriptors at a rate of 2.5M queries/s, or at 99.7% accuracy at 300K queries/s. It can also return the K nearest neighbors (the highest K entries in the priority queue when the search is finished).
FANNG Performance
For 128D SIFT data, search time scales as the 0.2 power of the data set size:
t \propto N^{0.2}
FANNG Additional Uses
In addition, we are using FANNG for:
- Camera relocalization for SLAM (finding keypoints or scenes previously visited)
- Hard negative mining for training triplet networks (prioritising badly classified negative examples)
- K-means clustering (K = 1,000,000, N = 20,000,000, D = 128)