All That Glisters Is Not Convnets: Hybrid Architectures For Faster, Better Solvers


  1. All That Glisters Is Not Convnets: Hybrid Architectures For Faster, Better Solvers. Prof Tom Drummond.

  2. Image classification network structure: [diagram] repeated Conv+ReLU and Downsample blocks (feature detection), then Reshape → FC+ReLU (interpretation) → FC+Softmax over #classes (solving).

  3. Semantic segmentation network structure: [diagram] repeated Conv+ReLU and Downsample blocks (feature detection), then 1x1 Conv+ReLU (interpretation), then Upsample → Conv+Softmax (solving).
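For concreteness, a minimal PyTorch sketch of the kind of structure these two slides describe; the layer counts and channel widths here are assumptions for illustration, not the networks from the talk.

```python
import torch.nn as nn

# Minimal sketch of the segmentation structure named on the slide
# (hypothetical sizes): conv+ReLU blocks with downsampling for feature
# detection, a 1x1 conv for interpretation, then upsample + conv + softmax
# as the solver that produces per-pixel class probabilities.
def make_segmentation_net(in_channels=3, n_classes=21):
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                                   # downsample
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                                   # downsample
        nn.Conv2d(64, 64, 1), nn.ReLU(),                   # 1x1 conv (interpretation)
        nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
        nn.Conv2d(64, n_classes, 3, padding=1),
        nn.Softmax(dim=1),                                 # per-pixel class probabilities
    )
```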

  4. Two issues: 1) ReLU doesn't have much nonlinearity, so lots of layers are needed to fit complex functions. 2) Softmax solvers are inefficient: the number of classes is "baked in", and classes have to be axis-aligned.

  5. Two solutions: 1) Use solvers with much more compactly represented nonlinearities (decision forests). 2) Replace the axis-aligned representation with an embedding representation and solve using nearest-neighbor examples.

  6. Nonlinearity in ReLU: [plot] one side of the kink is linear, and so is the other.

  7. Nonlinearity in ReLU: [plot] both sides of the kink are linear, even for leaky ReLU; all the nonlinearity sits at the kink, so stacks of ReLUs can only build piecewise linear functions.

  8. Nonlinearity in ReLU: piecewise linear fits can work pretty well in 1 dimension, and in 2 dimensions too, although more pieces are needed. But real problems live in thousands or millions of dimensions, not 1 or 2!
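A small NumPy sketch of the point made on slides 6-8, using a hypothetical randomly initialised two-layer network: whatever the weights, the function of the input has zero curvature almost everywhere, i.e. it is piecewise linear.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer ReLU network on a 1-D input (weights chosen at random).
W1, b1 = rng.normal(size=(8, 1)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def relu_net(x):
    h = np.maximum(W1 @ np.atleast_1d(x) + b1, 0.0)   # ReLU: linear on each side of 0
    return (W2 @ h + b2)[0]

# Numerical second derivative: zero almost everywhere, so the function is
# piecewise linear -- all the curvature is concentrated at the ReLU kinks.
xs = np.linspace(-3, 3, 2001)
ys = np.array([relu_net(x) for x in xs])
curvature = np.abs(np.diff(ys, 2))
print("fraction of points with (numerically) zero curvature:",
      np.mean(curvature < 1e-9))
```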

  9. Semantic segmentation network structure with a decision forest solver: each pixel is passed through the decision forest, where each question tests a single channel at that pixel against a threshold. Tree leaves contain a marginal log distribution; the selected log distributions are summed over the forest, and a softmax gives the probabilities for every class at each pixel.
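A sketch of the solver described on this slide, with an assumed tree layout (nested dicts) invented purely for illustration: each internal node tests one channel of the pixel's feature vector against a threshold, leaves hold log class distributions, the selected leaves are summed over the forest, and a softmax produces per-class probabilities.

```python
import numpy as np

def tree_leaf(tree, features):
    """Route one pixel's feature vector to a leaf.
    Internal nodes: {'channel': c, 'threshold': t, 'left': ..., 'right': ...}
    Leaves:         {'log_dist': np.ndarray of shape (n_classes,)}
    (This node layout is an assumption made for the sketch.)"""
    node = tree
    while 'log_dist' not in node:
        # Each question tests a single channel at this pixel against a threshold.
        if features[node['channel']] < node['threshold']:
            node = node['left']
        else:
            node = node['right']
    return node['log_dist']

def forest_predict(forest, features):
    """Sum the selected leaf log-distributions over the forest, then softmax."""
    summed = sum(tree_leaf(tree, features) for tree in forest)
    e = np.exp(summed - summed.max())           # numerically stable softmax
    return e / e.sum()                          # per-class probabilities for this pixel

# Toy usage: 2 trees of depth 1, 3 input channels, 2 classes.
leaf = lambda *d: {'log_dist': np.log(np.array(d))}
forest = [
    {'channel': 0, 'threshold': 0.5, 'left': leaf(0.9, 0.1), 'right': leaf(0.2, 0.8)},
    {'channel': 2, 'threshold': 0.0, 'left': leaf(0.6, 0.4), 'right': leaf(0.3, 0.7)},
]
print(forest_predict(forest, np.array([0.3, 1.0, -0.2])))
```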

  10. Semantic segmentation network structure: [diagram] decision forest → sum vectors → upsample → sum.

  11. Benefits: decision trees pack a lot of nonlinearity into a small space; they are responsive to complex joint distributions across different channels; trees are very quick to train (minutes or hours, not days); and trees outperform softmax solvers.

  12. Performance
      NYUDv2           Pixel Acc (%)   Mean Acc (%)   Mean I/U (%)   Time to train
      FCN-8s-heavy     60.9            43.1           30.2           20 hrs
      FCN-16s-heavy    61.5            42.4           30.5
      RRF (Ours)       66.6            49.9           36.5           30 mins

      Pascal VOC       Mean I/U (%)
      FCN-8s-heavy     67.2
      Deconvnet        69.6
      RRF (Ours)       68.9

  13. Softmax solvers are inefficient. Softmax demands that classes are axis-aligned: $P(\text{class } i) = \exp(x_i) \,/\, \sum_j \exp(x_j)$, which only allows one class per dimension. [diagram: each class, e.g. Bicycles, occupies its own axis]

  14. Embedding spaces are more efficient. [diagram: clusters for Cats, Bicycles, Dogs]

  15. Embedding spaces are more efficient. [diagram: clusters for Cars, Cats, Bicycles, Dogs]

  16. Embedding spaces are more efficient, even more so in higher dimensions; we are embedding in 4096D. [diagram: clusters for Bicycles, Cats, Cars]

  17. Neural Network as Embedding: record the embedding of each example in the training set.

  18. Neural Network as Embedding: to classify, put a Gaussian kernel at each stored location and sum the distributions at the point of interest:
      $P(x \in \text{class } i) = \dfrac{\sum_{j \in \text{class } i} \exp\!\big(-\|x - x_j\|^2 / 2\sigma^2\big)}{\sum_{j} \exp\!\big(-\|x - x_j\|^2 / 2\sigma^2\big)}$
      In practice, just use the K nearest neighbors to compute P.
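A NumPy sketch of this classification rule, with array names assumed for illustration: Gaussian kernel weights on the stored embeddings, restricted in practice to the K nearest neighbours.

```python
import numpy as np

def knn_gaussian_classify(query, embeddings, labels, n_classes, k=20, sigma=1.0):
    """P(query in class i) ~ sum over the K nearest stored embeddings of class i
    of exp(-||query - x_j||^2 / (2 sigma^2)), normalised over all K neighbours."""
    d2 = np.sum((embeddings - query) ** 2, axis=1)       # squared distances
    nn = np.argsort(d2)[:k]                               # K nearest neighbours
    w = np.exp(-d2[nn] / (2.0 * sigma ** 2))              # Gaussian kernel weights
    probs = np.bincount(labels[nn], weights=w, minlength=n_classes)
    return probs / probs.sum()

# Toy usage: 2-D "embeddings" standing in for the 4096-D ones from the talk.
rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(knn_gaussian_classify(np.array([2.8, 3.1]), embeddings, labels, n_classes=2))
```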

  19. Neural Network as Embedding. Benefits of this approach: 1. Semantically related instances get clustered together (instead of strung out across a large space). 2. Classes can happily form more than one cluster if that's appropriate (e.g. two very different kinds of dogs in the dataset). 3. New classes can be added, or learning transferred to new domains, on the fly (just pump examples through the network).

  20. Embedding transfer learning results.
      Stanford Cars: 196 classes, ~82 images/class; train on 98 classes, test on the remaining 98.
      CUB Birds 2011: 200 classes, ~58 images/class; train on 100 classes, test on the remaining 100.

  21. Classification results. CUB Birds 2011, conventional classification task: 200 classes, 5994 training images, 5794 test images.

  22. Finding K nearest neighbors. We could use a sparse SVM as a classifier, but we have a state-of-the-art approximate nearest neighbor algorithm: FANNG. FANNG can find the nearest neighbor with 90% accuracy in 1M examples of SIFT descriptors at a rate of 2.5M queries/s, or at 99.7% accuracy at 300K queries/s. It can also return the K nearest neighbors.

  23. FANNG: Fast Approximate Nearest Neighbour Graph. Based on two insights: 1. If a reference point is near a query point, it's worth checking its neighbors to see if they're closer, so build a graph linking points to their neighbors. 2. If there are many neighbors in the same direction, it's not necessary to have edges to all of them: if the nearest of them isn't closer to the query, then the others probably aren't either.

  24. FANNG algorithm: [diagram of a greedy walk through the graph toward the query point]
      Start somewhere in the graph.
      While (not bored) {
          Measure the distance to the target (query point).
          Insert this point into a priority* queue of visited vertices (*ordered by distance from the target).
          Find the nearest unvisited neighbor of the highest-priority vertex with unvisited neighbors.
      }

  25. FANNG algorithm: [diagram, the same greedy walk continued for more steps toward the query point]; the loop is the same as on the previous slide (a sketch in code follows below).
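A Python sketch of the search loop on slides 24-25, under assumed inputs (a neighbour graph stored as an adjacency list and points as NumPy rows); it illustrates the greedy walk, not the actual FANNG implementation.

```python
import heapq
import numpy as np

def fanng_style_search(points, graph, query, start, max_steps=100):
    """Greedy graph search in the spirit of the slides.
    points: (N, D) array; graph: list of neighbour-index lists; start: start vertex.
    Returns visited vertices sorted by distance to the query (closest first)."""
    dist = lambda i: float(np.linalg.norm(points[i] - query))
    visited = set()
    pq = []                                    # priority queue of (distance, vertex)
    current = start
    for _ in range(max_steps):                 # "while (not bored)"
        if current not in visited:
            visited.add(current)
            heapq.heappush(pq, (dist(current), current))   # measure distance, enqueue
        # Nearest unvisited neighbour of the highest-priority vertex
        # that still has unvisited neighbours.
        current = None
        for _, v in sorted(pq):
            unvisited = [n for n in graph[v] if n not in visited]
            if unvisited:
                current = min(unvisited, key=dist)
                break
        if current is None:                    # the whole reachable graph was explored
            break
    return sorted(pq)                          # top K entries give the K-NN estimate

# Toy usage on random 2-D points with a 3-nearest-neighbour graph.
rng = np.random.default_rng(0)
pts = rng.random((200, 2))
graph = [list(np.argsort(np.linalg.norm(pts - p, axis=1))[1:4]) for p in pts]
print(fanng_style_search(pts, graph, query=np.array([0.5, 0.5]), start=0)[:3])
```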

  26. High dimensional data is VERY counterintuitive Fill a 128D hypercube with 1,000,000 points of random data

  27. High dimensional data is VERY counterintuitive Fill a 128D hypercube with 1,000,000 points of random data Inflate a hypersphere centred in the hypercube until it touches a data point (find the largest sphere with no data in it)
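A scaled-down NumPy version of this experiment (fewer points than the 1M on the slide, to keep memory modest). With data like this, the nearest point to the centre typically turns out to be far further away than the distance from the centre to the nearest face (0.5), so the "empty" sphere actually pokes out of the cube, which is the counterintuitive part.

```python
import numpy as np

# Scaled-down version of the slide's experiment: 100k points instead of 1M.
rng = np.random.default_rng(0)
d, n = 128, 100_000
points = rng.random((n, d))                     # uniform random data in the unit hypercube
centre = np.full(d, 0.5)

# Radius of the largest data-free sphere centred in the hypercube
# = distance from the centre to its nearest data point.
nearest = np.sqrt(((points - centre) ** 2).sum(axis=1)).min()
print("radius of largest empty sphere centred in the cube:", nearest)
print("distance from the centre to the nearest cube face:  0.5")
```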

  28. FANNG performance. FANNG can find the nearest neighbor with 90% accuracy in 1M examples of SIFT descriptors at a rate of 2.5M queries/s, or at 99.7% accuracy at 300K queries/s. It can also return the K nearest neighbors (the highest K entries in the priority queue when the search has finished).

  29. FANNG performance. For 128D SIFT data the search time scales as the 0.2 power of the dataset size: $t \propto N^{0.2}$.

  30. FANNG additional uses. In addition, we are using FANNG for:
      • Camera relocalization for SLAM: finding keypoints or scenes previously visited.
      • Hard negative mining for training triplet networks: prioritising badly classified negative examples.
      • K-means clustering (K = 1,000,000; N = 20,000,000; D = 128).
