Layers as embeddings
In the classification case, the network can be seen as a series of processing stages aiming at disentangling the classes, so that they become easily separable for the final decision. In this perspective, it makes sense to look at how the samples are distributed spatially after each layer.
The main issue in doing so is the dimensionality of the signal. If we look at the total number of dimensions in each layer:

• a MNIST sample in a LeNet goes from 784 to up to 18k dimensions,
• an ILSVRC12 sample in ResNet152 goes from 150k to up to 800k dimensions.

This requires a means to project a [very] high-dimension point cloud into a 2d or 3d "human-brain accessible" representation.
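These per-layer sizes can be checked with forward hooks that print the number of activations per sample. A minimal sketch, assuming a small LeNet-style model (the architecture below is illustrative, not the exact LeNet used for the figures):

import torch
from torch import nn

# Illustrative LeNet-style model; any nn.Module works the same way
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size = 5), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size = 5), nn.ReLU(),
    nn.MaxPool2d(2),
)

def print_nb_activations(module, input, output):
    # Total number of activations for one sample after this layer
    print(module.__class__.__name__, output[0].numel())

for layer in model:
    layer.register_forward_hook(print_nb_activations)

model(torch.randn(1, 1, 28, 28))  # a MNIST-sized input

The first convolution already maps the 784 input values to 32 × 24 × 24 ≈ 18k activations, consistent with the figures above.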
We have already seen PCA and k-means as two standard methods for dimension reduction, but they poorly convey the structure of a smooth, non-flat, low-dimension manifold.

There exists a plethora of methods that aim at reflecting in low dimension the structure of data points in high dimension. A popular one is t-SNE, developed by van der Maaten and Hinton (2008).
Given data points in high dimension

\[ \mathcal{D} = \{\, x_n \in \mathbb{R}^D,\; n = 1, \dots, N \,\}, \]

the objective of data visualization is to find a set of corresponding low-dimension points

\[ \mathcal{E} = \{\, y_n \in \mathbb{R}^C,\; n = 1, \dots, N \,\} \]

such that the positions of the $y$s "reflect" those of the $x$s.
The t-Distributed Stochastic Neighbor Embedding (t-SNE) proposed by van der Maaten and Hinton (2008) optimizes the $y_i$s with SGD so that the distances to the close neighbors of each point are preserved.

It actually matches, in the $D_{KL}$ sense, two distance-dependent distributions: Gaussian in the original space, and Student's t-distribution in the low-dimension one.
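For reference, the quantities matched in van der Maaten and Hinton (2008) are the symmetrized Gaussian similarities in the original space

\[
p_{j \mid i} = \frac{\exp\big(-\lVert x_i - x_j \rVert^2 / 2 \sigma_i^2\big)}{\sum_{k \neq i} \exp\big(-\lVert x_i - x_k \rVert^2 / 2 \sigma_i^2\big)},
\qquad
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2N},
\]

the Student-t similarities in the embedding

\[
q_{ij} = \frac{\big(1 + \lVert y_i - y_j \rVert^2\big)^{-1}}{\sum_{k \neq l} \big(1 + \lVert y_k - y_l \rVert^2\big)^{-1}},
\]

and the minimized objective

\[
C = D_{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\]

where each $\sigma_i$ is set so that $p_{\cdot \mid i}$ has a user-specified perplexity.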
The scikit-learn toolbox

http://scikit-learn.org/

is built around SciPy, and provides many machine learning algorithms, in particular embeddings, among which an implementation of t-SNE. The only catch when using it with PyTorch is the conversion to and from numpy arrays:

from sklearn.manifold import TSNE

# x is the tensor of the original high-dimension points
x_np = x.numpy()
y_np = TSNE(n_components = 2, perplexity = 50).fit_transform(x_np)
y = torch.from_numpy(y_np)

n_components specifies the embedding dimension, and perplexity states [crudely] how many points are considered neighbors of each point.
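To produce per-layer visualizations such as the ones that follow, one can capture the activations of a hidden layer with a forward hook and embed them. A minimal sketch, assuming a recent PyTorch, a model, and an input batch x are available (model.features is a hypothetical sub-module name):

from sklearn.manifold import TSNE
import torch

activations = []

def store_output(module, input, output):
    # Flatten each sample's activation map into one vector
    activations.append(output.view(output.size(0), -1))

# 'model.features' stands for whichever layer is to be visualized
handle = model.features.register_forward_hook(store_output)
model(x)
handle.remove()

h = activations[0].detach()
y = torch.from_numpy(TSNE(n_components = 2, perplexity = 50).fit_transform(h.numpy()))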
[Figures: t-SNE unrolling of the Swiss roll (with one noise dimension)]
[Figures: t-SNE for LeNet on MNIST, at the input and after layers #1, #4, #7, and #9]
[Figures: t-SNE for a home-baked resnet (no pooling, 66 layers) on CIFAR10, at the input and after layers #5, #10, #15, #20, #25, and #30 to #37]
Occlusion sensitivity
Another approach to understanding the functioning of a network is to look at its behavior "around" an image.

For instance, we can get a simple estimate of the importance of a part of the input image by computing the difference between:

1. the value of the maximally responding output unit on the image, and
2. the value of the same unit with that part occluded.

This is computationally intensive, since it requires as many forward passes as there are locations of the occlusion mask, ideally one per pixel. A sketch of the procedure follows.
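A minimal sketch, assuming a recent PyTorch, a model, an image img of size 1 × 3 × H × W, and the index c of the unit of interest (the mask size and stride below are illustrative):

import torch

mask_size, stride = 32, 2
reference = model(img)[0, c].item()
h, w = img.size(2), img.size(3)
sensitivity = torch.zeros((h - mask_size) // stride + 1,
                          (w - mask_size) // stride + 1)

for i in range(sensitivity.size(0)):
    for j in range(sensitivity.size(1)):
        occluded = img.clone()
        # Erase a square patch (here with zeros; a gray value also works)
        occluded[:, :, i*stride : i*stride + mask_size,
                       j*stride : j*stride + mask_size] = 0
        # Drop of the unit's response when this patch is hidden
        sensitivity[i, j] = reference - model(occluded)[0, c].item()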
[Figure: original images, and the 32 × 32 occlusion mask]
[Figures: occlusion sensitivity, 32 × 32 mask, stride of 2, for AlexNet, VGG16, and VGG19]
Saliency maps
An alternative is to compute the gradient of the maximally responding output unit with respect to the input (Erhan et al., 2009; Simonyan et al., 2013), e.g.

\[ \nabla_{|x} f(x; w), \]

where $f$ is the activation of the output unit with maximum response, and $|x$ stresses that the gradient is computed with respect to the input $x$, and not, as usual, with respect to the parameters $w$.
This can be implemented by specifying that we need the gradient with respect to the input. We use here the unit of the correct class, not the maximally responding one.

Using torch.autograd.grad to compute the gradient with respect to the input image, instead of torch.autograd.backward, has the advantage of not changing the model's parameter gradients.

input = Variable(img, requires_grad = True)
output = model(input)
loss = nllloss(output, target)
grad_input, = torch.autograd.grad(loss, input)

Note that since torch.autograd.grad computes the gradient of a function with possibly multiple inputs, the returned result is a tuple.
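The returned gradient has the same shape as the image. To display it as a 2d map, one possible convention, consistent with the channel reduction used in the SmoothGrad code below, is to sum absolute values across the color channels:

# Reduce the 1 x C x H x W gradient to a 2d saliency map
saliency = grad_input.data.abs().sum(1).squeeze(0)
saliency = saliency / saliency.max()  # normalize to [0, 1] for display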
The resulting maps are quite noisy. For instance, with AlexNet:

[Figure: saliency maps computed with AlexNet]
This is due to the local irregularity of the network's response as a function of the input.

[Figure 2 of Smilkov et al. (2017): the partial derivative of $S_c$ with respect to the RGB values of a single pixel, as a fraction of the maximum entry in the gradient vector, $\max_i \partial S_c / \partial x_i(t)$ (middle plot), as one slowly moves away from a baseline image $x$ (left plot) to a fixed location $x + \epsilon$ (right plot). $\epsilon$ is one random sample from $\mathcal{N}(0, 0.01^2)$. The final image $x + \epsilon$ is indistinguishable to a human from the original image $x$.]
Smilkov et al. (2017) proposed to smooth the gradient with respect to the input image by averaging over slightly perturbed versions of the latter:

\[
\tilde{\nabla}_{|x} f_y(x; w) = \frac{1}{N} \sum_{n=1}^{N} \nabla_{|x} f_y(x + \epsilon_n; w),
\]

where $\epsilon_1, \dots, \epsilon_N$ are i.i.d. with distribution $\mathcal{N}(0, \sigma^2 I)$, and $\sigma$ is a fraction of the gap $\Delta$ between the maximum and the minimum of the pixel values.
A simple version of this "SmoothGrad" approach can be implemented as follows:

nb_smooth = 100
std = smooth_std * (img.max() - img.min())
acc_grad = img.new(img.size()).zero_()

for q in range(nb_smooth):
    # This should be done with mini-batches ...
    noisy_input = img + img.new(img.size()).normal_(0, std)
    noisy_input = Variable(noisy_input, requires_grad = True)
    output = model(noisy_input)
    loss = nllloss(output, target)
    grad_input, = torch.autograd.grad(loss, noisy_input)
    acc_grad += grad_input.data

acc_grad = acc_grad.abs().sum(1)  # sum across channels
[Figures: original images, plain gradient (AlexNet), and SmoothGrad (AlexNet, σ = Δ/4)]