All That Glisters Is Not Convnets: Hybrid Architectures For Faster, Better Solvers
Prof Tom Drummond
Image classification network structure
Conv+RELU → Conv+Downsample → Conv+RELU → Reshape → FC+RELU → FC+Softmax → #classes
Stages: Feature detection → Interpretation → Solving
Semantic segmentation network structure
Conv+RELU → Conv+Downsample → Conv+RELU → 1x1 Conv+RELU → Upsample → Conv+Softmax
Stages: Feature detection → Interpretation → Solving
Two issues:
1) RELU doesn't have much nonlinearity, so lots of layers are needed to fit complex functions
2) Softmax solvers are inefficient: the number of classes is "baked in", and classes have to be axis-aligned
Two solutions:
1) Use solvers with much more compactly represented non-linearities (decision forests)
2) Replace the axis-aligned representation with an embedding representation and solve using nearest-neighbor examples
Nonlinearity in RELU
Either side of the kink is linear (even for leaky RELU); all the non-linearity is concentrated at the kink. A RELU network can therefore only build piecewise linear functions.
Nonlinearity in RELU
Piecewise linear fits can work pretty well in 1 dimension, and also in 2 dimensions (although more pieces are needed). But real problems are in thousands or millions of dimensions, not 1 or 2!
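To make the piecewise-linear point concrete, here is a minimal numpy sketch (a toy illustration, not from the talk): a one-hidden-layer RELU network in 1D is a sum of hinge functions, so its output is linear between kinks and only the kinks add nonlinearity.

```python
# Toy sketch: a small ReLU network in 1D produces a piecewise linear function.
import numpy as np

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=8), rng.normal(size=8)   # 8 hidden units -> up to 8 kinks
w2, b2 = rng.normal(size=8), rng.normal()

def relu_net(x):
    # hidden: max(0, w1*x + b1); each unit contributes one hinge
    h = np.maximum(0.0, np.outer(x, w1) + b1)
    return h @ w2 + b2                            # linear combination of hinges

x = np.linspace(-3, 3, 7)
print(relu_net(x))   # values lie on straight segments between the kinks
```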
Semantic segmentation network structure
Decision Forest solver (sum vectors → softmax):
- Each pixel is passed through the decision forest
- Each question tests a single channel at that pixel against a threshold
- Tree leaves contain a marginal log distribution
- Selected log distributions are summed over the forest
- Softmax gives the probabilities for every class at each pixel
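A hedged sketch of the per-pixel forest solver described above; the Node structure and function names are illustrative assumptions, not the authors' implementation.

```python
# Sketch: pass one pixel's feature vector through a forest, sum leaf
# log-distributions, and apply softmax to get per-class probabilities.
import numpy as np

class Node:
    def __init__(self, channel=None, threshold=None, left=None, right=None, log_dist=None):
        self.channel, self.threshold = channel, threshold
        self.left, self.right = left, right
        self.log_dist = log_dist              # leaves store a marginal log distribution

def tree_lookup(node, pixel_features):
    # Each internal node tests a single channel at this pixel against a threshold.
    while node.log_dist is None:
        node = node.left if pixel_features[node.channel] < node.threshold else node.right
    return node.log_dist

def forest_predict(forest, pixel_features):
    # Sum the selected leaf log-distributions over the forest, then softmax.
    logits = sum(tree_lookup(tree, pixel_features) for tree in forest)
    e = np.exp(logits - logits.max())
    return e / e.sum()                        # class probabilities for this pixel
```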
Semantic segmentation network structure
[Diagram: segmentation pipeline with the softmax solver replaced by Decision Forest + Sum vectors + Upsample]
Benefits
- Decision trees pack a lot of nonlinearity into a small space
- They are responsive to complex joint distributions across different channels
- Trees are very quick to train (minutes or hours, not days)
- Trees outperform softmax solvers
Performance
NYUDv2:
Method          Pixel Acc (%)   Mean Acc (%)   Mean I/U (%)   Time to train
FCN-8s-heavy    60.9            43.1           30.2           20 hrs
FCN-16s-heavy   61.5            42.4           30.5
RRF (Ours)      66.6            49.9           36.5           30 mins

Pascal VOC:
Method          Mean I/U (%)
FCN-8s-heavy    67.2
Deconvnet       69.6
RRF (Ours)      68.9
Softmax solvers are inefficient
Softmax demands that classes are axis-aligned: it only allows one class per dimension.
P(\text{class} = i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}
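A small sketch (with assumed layer sizes) of why the class count is "baked in": the final FC+softmax layer owns one weight row per class, so adding a class means changing the weight matrix and retraining.

```python
# Sketch: the softmax solver's weight matrix has exactly one row per class.
import numpy as np

n_features, n_classes = 4096, 21          # assumed sizes, for illustration only
W = np.zeros((n_classes, n_features))     # one axis-aligned direction per class
b = np.zeros(n_classes)

def softmax_classify(x):
    logits = W @ x + b
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # P(class = i) = exp(x_i) / sum_j exp(x_j)
```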
Embedding spaces are more efficient
[Figures: 2D embedding spaces containing clusters for Cats, Dogs, Bicycles and Cars]
Even more so in higher dimensions: we are embedding in 4096D.
Neural Network as Embedding
Record the embedding of each example of the training set.
Neural Network as Embedding
To classify, put a Gaussian kernel at each training location and sum the distributions at the point of interest. In practice, just use the K nearest neighbors to compute P:
P(x \in \text{class } c) = \frac{\sum_{i \in c} \exp\left(-\|x - x_i\|^2 / 2\right)}{\sum_{i} \exp\left(-\|x - x_i\|^2 / 2\right)}
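A minimal sketch of the kernel/K-nearest-neighbor classifier above; the function and parameter names (K, sigma) are illustrative assumptions rather than the talk's code.

```python
# Sketch: soft classification in the embedding space using Gaussian kernel
# weights over the K nearest training embeddings.
import numpy as np

def knn_class_probs(query_emb, train_embs, train_labels, K=50, sigma=1.0):
    d2 = np.sum((train_embs - query_emb) ** 2, axis=1)   # squared distances
    nn = np.argsort(d2)[:K]                               # K nearest neighbours
    w = np.exp(-d2[nn] / (2 * sigma ** 2))                # Gaussian kernel weights
    probs = np.zeros(train_labels.max() + 1)
    np.add.at(probs, train_labels[nn], w)                 # sum kernel mass per class
    return probs / probs.sum()
```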
Neural Network as Embedding
Benefits of this approach:
1. Semantically related instances get clustered together (instead of strung out across a large space)
2. Classes can happily form more than one cluster if that's appropriate (e.g. two very different kinds of dogs in the dataset)
3. Can add new classes or transfer learning to new domains on the fly, just by pumping examples through the network (see the sketch below)
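A sketch of point 3, adding a class on the fly: no retraining is needed, only embedding the new examples and appending them to the stored set. Here embed() is an assumed wrapper around the trained network, not part of the talk's code.

```python
# Sketch: extend the stored embeddings with a brand-new class at run time.
import numpy as np

def add_class(train_embs, train_labels, new_images, new_label, embed):
    new_embs = np.stack([embed(img) for img in new_images])  # pump examples through the network
    return (np.vstack([train_embs, new_embs]),
            np.concatenate([train_labels, np.full(len(new_images), new_label)]))
```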
Embedding transfer learning results
Stanford Cars
196 classes, ~82 images per class. Train on 98 classes, test on the remaining 98.
CUB Birds - 2011
200 classes, ~58 images per class. Train on 100 classes, test on the remaining 100.
Classification Results
CUB Birds - 2011
Conventional classification task: 200 classes, 5994 training images, 5794 test images.
Finding K nearest neighbors
We could use a sparse SVM as a classifier, but we have a state-of-the-art approximate nearest neighbor algorithm: FANNG.
FANNG can find the nearest neighbor with 90% accuracy in 1M examples of SIFT descriptors at a rate of 2.5M queries/s, or at 99.7% accuracy at 300K queries/s. It can also return the K nearest neighbors.
FANNG: Fast Approximate Nearest Neighbour Graph
Based on two insights:
1. If a reference point is near a query point, it's worth checking its neighbors to see if they're closer
   - so build a graph linking points to their neighbors
2. If there are many neighbors in the same direction, it's not necessary to have edges to all of them
   - if the nearest of them isn't closer to the query, then the others probably aren't either (a pruning sketch follows this list)
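A hedged sketch of the edge-pruning idea in insight 2, a simplified version of the occlusion rule from the FANNG paper; this brute-force O(n²) construction is for clarity only and is not the paper's build algorithm.

```python
# Sketch: keep an edge p->q only if no closer, already-kept neighbour of p
# is itself closer to q than p is (that neighbour already "covers" the direction).
import numpy as np

def build_pruned_edges(points, max_edges=10):
    n = len(points)
    edges = {i: [] for i in range(n)}
    for p in range(n):
        d = np.linalg.norm(points - points[p], axis=1)
        for q in np.argsort(d)[1:]:              # candidate neighbours, nearest first
            occluded = any(np.linalg.norm(points[r] - points[q]) < d[q]
                           for r in edges[p])
            if not occluded:
                edges[p].append(int(q))
            if len(edges[p]) >= max_edges:
                break
    return edges
```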
FANNG Algorithm
Start somewhere in the graph
While (not bored) {
    Measure distance to target (query point)
    Insert this point into a priority* queue of vertices visited (*in order of distance from target)
    Find the nearest unvisited neighbor of the highest-priority vertex with unvisited neighbors
}
[Figure: query point and graph vertices numbered 1-6 in the order they are visited]
FANNG Algorithm
[Figure: the same search continued, visiting vertices 7-10 as it approaches the query point]
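A minimal Python sketch of the traversal shown on the previous two slides: greedy best-first search over the neighbour graph, keeping a priority queue of visited vertices ordered by distance to the query. Function and parameter names here are assumptions, not the FANNG implementation.

```python
# Sketch: best-first graph search; stop after a fixed budget of distance checks.
import heapq
import numpy as np

def fanng_search(query, points, edges, start=0, max_checks=100):
    dist = lambda i: float(np.linalg.norm(points[i] - query))
    visited = {start}
    pq = [(dist(start), start)]            # visited vertices, keyed by distance to query
    while len(visited) < max_checks:
        # Nearest unvisited neighbour of the highest-priority vertex that has one.
        step = None
        for _, v in sorted(pq):
            unvisited = [u for u in edges[v] if u not in visited]
            if unvisited:
                step = min(unvisited, key=dist)
                break
        if step is None:
            break                          # no unvisited neighbours remain anywhere
        visited.add(step)
        heapq.heappush(pq, (dist(step), step))
    return [v for _, v in sorted(pq)[:5]]  # top of the queue: approximate nearest neighbours
```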
High dimensional data is VERY counterintuitive
Fill a 128D hypercube with 1,000,000 points of random data. Inflate a hypersphere centred in the hypercube until it touches a data point (i.e. find the largest sphere with no data in it).
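A rough sketch of this experiment (it allocates roughly 0.5 GB): the radius of the largest empty sphere centred in the cube is simply the distance from the centre to the nearest data point.

```python
# Sketch: largest empty sphere centred in a 128D unit hypercube filled with 1M random points.
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 1_000_000
points = rng.random((n, d), dtype=np.float32)            # uniform in [0,1]^128
centre = np.full(d, 0.5, dtype=np.float32)

# Process in chunks to limit temporary memory; the radius is the minimum distance.
radius = min(np.linalg.norm(chunk - centre, axis=1).min()
             for chunk in np.array_split(points, 50))
print(radius)   # far larger than 0.5, the distance from the centre to the cube faces
```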
FANNG Performance
FANNG can find the nearest neighbor with 90% accuracy in 1M examples of SIFT descriptors at a rate of 2.5M queries/s, or at 99.7% accuracy at 300K queries/s. It can also return the K nearest neighbors (the highest K entries in the priority queue when the search is finished).
FANNG Performance
For 128D SIFT data, search time scales as the 0.2 power of the data set size:
t \propto N^{0.2}
FANNG Additional Uses
In addition, we are using FANNG for:
- Camera relocalization for SLAM (finding keypoints or scenes previously visited)
- Hard negative mining for training triplet networks (prioritising badly classified negative examples)
- K-means clustering (K = 1,000,000, N = 20,000,000, D = 128)