Applied Machine Learning
Convolutional Neural Networks
Siamak Ravanbakhsh
COMP 551 (winter 2020)
Learning objectives
understand the convolution layer and the architecture of conv-nets:
- its inductive bias
- its derivation from the fully connected layer
- different types of convolution
MLP and image data
- we can apply an MLP to image data: first vectorize the input, $x \to \mathrm{vec}(x) \in \mathbb{R}^{784}$
- feed it to the MLP (with $L$ layers) and predict the labels: $\mathrm{softmax} \circ W^{\{L\}} \circ \dots \circ \mathrm{ReLU} \circ W^{\{1\}}\, \mathrm{vec}(x)$
- the model knows nothing about the image structure: we could shuffle all pixels (using the same fixed permutation for every image) and learn an MLP with similar performance
- how do we bias the model so that it "knows" its input is an image?
- an image is like a 2D version of sequence data; let's find the right model for sequences first...
image: https://medium.com/@rajatjain0807/machine-learning-6ecde3bfd2f4
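The shuffling claim can be checked on a single fully connected layer. The sketch below is only an illustration under assumed toy sizes (6 "pixels", 3 hidden units, both hypothetical): permuting the pixels and permuting the columns of W in the same way leaves the layer output unchanged, so a fully connected layer has no preference for any pixel ordering.

```python
import random

def linear(W, x):
    """Fully connected layer without the nonlinearity: returns W x."""
    return [sum(w_cd * x_d for w_cd, x_d in zip(row, x)) for row in W]

random.seed(0)
D, H = 6, 3  # hypothetical sizes: 6 input "pixels", 3 hidden units
x = [random.random() for _ in range(D)]
W = [[random.gauss(0, 1) for _ in range(D)] for _ in range(H)]

perm = list(range(D))
random.shuffle(perm)  # one fixed shuffle of the pixel positions

x_shuffled = [x[i] for i in perm]
W_shuffled = [[row[i] for i in perm] for row in W]  # same shuffle on the columns

out = linear(W, x)
out_shuffled = linear(W_shuffled, x_shuffled)
print(max(abs(a - b) for a, b in zip(out, out_shuffled)))  # numerically zero
```

Since training can find `W_shuffled` as easily as `W`, the shuffled dataset is no harder for the MLP than the original one.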
Parameter-sharing
- suppose we want to convert one sequence to another, $\mathbb{R}^D \to \mathbb{R}^D$ (e.g., convert one voice to another)
- suppose we have a dataset of input-output pairs $\{(x^{(n)}, y^{(n)})\}_n$
- consider only a single layer $y = g(Wx)$
- we may assume each output unit is the same function shifted along the sequence (when is this a good assumption?)
- elements of $W$ of the same color are tied together (parameter-sharing)
[figure: input and output sequences connected through $W$, tied weights drawn in the same color]
Locality & sparse weight
- we may assume each output unit is the same function shifted along the sequence
- we may further assume each output is a local function of the input
- larger receptive field with multiple layers
[figure annotations: size of the receptive field is 3; size of the receptive field is 5]
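The growth of the receptive field with depth can be sketched by tracking which input positions one output unit depends on. This is a minimal illustration that assumes an odd kernel size K, stride 1 and the same K in every layer; the helper name `receptive_field` is hypothetical.

```python
def receptive_field(num_layers, K):
    """Input offsets that a single output unit depends on after stacking
    num_layers local layers of (odd) kernel size K with stride 1."""
    half = K // 2
    deps = {0}  # start from the output unit at offset 0
    for _ in range(num_layers):
        # every unit depends on its K neighbours in the layer below
        deps = {d + o for d in deps for o in range(-half, half + 1)}
    return sorted(deps)

for layers in (1, 2, 3):
    print(layers, len(receptive_field(layers, K=3)))
```

With K = 3 this gives sizes 3 and 5 for one and two layers, matching the sizes annotated on the slide.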
Cross-correlation (1D)
- we may further assume each output is a local function of the input
- with parameter-sharing and locality, $W$ is very sparse
- instead of the whole matrix we can keep the one set of nonzero values (for odd $K$): $w = [w_1, \dots, w_K] = [W_{c,\, c - \lfloor K/2 \rfloor}, \dots, W_{c,\, c + \lfloor K/2 \rfloor}]$
- we can write the matrix multiplication as the cross-correlation of $w$ and $x$: $y_c = g\left(\sum_{d=1}^{D} W_{c,d}\, x_d\right) = g\left(\sum_{k=1}^{K} w_k\, x_{c - \lfloor K/2 \rfloor + k - 1}\right)$
- slide $w$ along the input, calculate the inner product and apply the nonlinearity
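The equivalence between the sparse, tied matrix W and a sliding filter can be checked numerically. The sketch below assumes an odd kernel size, treats x as zero outside its bounds, and omits the nonlinearity g; the helper names are illustrative only.

```python
def cross_correlate_same(w, x):
    """Slide w along x; output c uses inputs c - K//2, ..., c + K//2
    (0-based indices), with x treated as zero outside its bounds."""
    K, D = len(w), len(x)
    half = K // 2
    y = []
    for c in range(D):
        total = 0.0
        for k in range(K):
            d = c - half + k
            if 0 <= d < D:
                total += w[k] * x[d]
        y.append(total)
    return y

def banded_matrix(w, D):
    """The D x D parameter-sharing matrix W: row c holds w centred on column c."""
    K, half = len(w), len(w) // 2
    W = [[0.0] * D for _ in range(D)]
    for c in range(D):
        for k in range(K):
            d = c - half + k
            if 0 <= d < D:
                W[c][d] = w[k]
    return W

def matvec(W, x):
    return [sum(W_cd * x_d for W_cd, x_d in zip(row, x)) for row in W]

w = [1.0, 2.0, -1.0]
x = [3.0, 0.0, 1.0, 2.0, 5.0]
y_slide = cross_correlate_same(w, x)
y_matrix = matvec(banded_matrix(w, len(x)), x)
print(y_slide)   # [6.0, 2.0, 0.0, 0.0, 12.0]
print(y_matrix)  # same values
```

The matrix version stores D x D entries while the filter stores only K, which is the saving that parameter-sharing and locality buy.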
Convolution (1D)
- cross-correlation is similar to convolution; $w$ is called the filter or kernel
- cross-correlation, written $w \star x$: $y(c) = \sum_{k=-\infty}^{\infty} w(k)\, x(c + k)$
  (ignoring the activation for simpler notation, and assuming $w$ and $x$ are zero for any index outside the input and filter bounds)
- convolution, written $w * x$, flips $w$ or $x$ (to be commutative): $y(c) = \sum_{k=-\infty}^{\infty} w(k)\, x(c - k) = \sum_{d=-\infty}^{\infty} w(c - d)\, x(d)$, by the change of variable $d = c - k$, so $w * x = x * w$
- since we learn $w$, flipping it makes no difference; in practice, we use cross-correlation rather than convolution
- convolution is equivariant with respect to translation, i.e., shifting $x$ shifts $w * x$
[figure: outputs of $w \star x$, $x \star w$, $w * x$ and $x * w$]
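The relationship between the two operations can be verified on small lists. This is a plain-Python sketch of the "full" versions of both sums, under the slide's assumption that w and x are zero outside their bounds; it is meant for checking the identities, not for speed.

```python
def convolve_full(w, x):
    """y(c) = sum_k w(k) x(c - k) for every c with a nonzero term."""
    y = [0.0] * (len(w) + len(x) - 1)
    for k, wk in enumerate(w):
        for d, xd in enumerate(x):
            y[k + d] += wk * xd  # the term w(k) x(d) lands on c = k + d
    return y

def cross_correlate_full(w, x):
    """y(c) = sum_k w(k) x(c + k) for c = -(K-1), ..., D-1."""
    K, D = len(w), len(x)
    return [sum(w[k] * x[c + k] for k in range(K) if 0 <= c + k < D)
            for c in range(-(K - 1), D)]

w = [1.0, 2.0, 3.0]
x = [1.0, 0.0, -1.0, 2.0]

# convolution is commutative, cross-correlation is not
print(convolve_full(w, x) == convolve_full(x, w))                # True
print(cross_correlate_full(w, x) == cross_correlate_full(x, w))  # False

# cross-correlation equals convolution with the flipped filter
print(cross_correlate_full(w, x) == convolve_full(w[::-1], x))   # True

# translation equivariance: shifting x by one shifts the output by one
print(convolve_full(w, [0.0] + x) == [0.0] + convolve_full(w, x))  # True
```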
Convolution (2D)
- a similar idea of parameter-sharing and locality extends to 2 dimensions (i.e., image data): $y_{d_1, d_2} = \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} x_{d_1 + k_1 - 1,\, d_2 + k_2 - 1}\, w_{k_1, k_2}$
- some inputs participate in all outputs, while others participate in a single output; this is related to the borders
image credit: Vincent Dumoulin, Francesco Visin
Convolution (2D)
- there are different ways of handling the borders:
  - full: zero-pad the input and produce all non-zero outputs; the output is larger than the input (by how much?) and each input participates in the same number of output elements
  - same: zero-pad the input to keep the output dims similar to the input
  - valid: no padding at all; the output is smaller than the input (by how much?)
- output length (for one dimension): $\lfloor D + \text{padding} - K + 1 \rfloor$
[figure: a 3x3 kernel $w$ sliding over the input $x$ to produce the output $y$]
image credit: Vincent Dumoulin, Francesco Visin
Winter 2020 | Applied Machine Learning (COMP551)
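The three border conventions can be compared on a toy input. The sketch below assumes that "padding" in the output-length formula counts the zeros added on both ends combined, and uses a 5x5 input of ones with a 3x3 kernel of ones so the outputs simply count how many real pixels each window covers; all helper names are illustrative.

```python
def output_length(D, K, padding=0):
    """Stride-1 output length for one dimension; padding is the total
    number of zeros added to both ends (an assumption of this sketch)."""
    return D + padding - K + 1

def zero_pad(x, p):
    """Add p zeros on every side of a 2D list."""
    width = len(x[0]) + 2 * p
    rows = [[0.0] * p + list(r) + [0.0] * p for r in x]
    border = [[0.0] * width for _ in range(p)]
    return border + rows + [[0.0] * width for _ in range(p)]

def correlate2d_valid(x, w):
    """y[i][j] = sum over k1, k2 of x[i + k1][j + k2] * w[k1][k2]."""
    K1, K2 = len(w), len(w[0])
    return [[sum(x[i + k1][j + k2] * w[k1][k2]
                 for k1 in range(K1) for k2 in range(K2))
             for j in range(len(x[0]) - K2 + 1)]
            for i in range(len(x) - K1 + 1)]

x = [[1.0] * 5 for _ in range(5)]  # 5x5 input of ones
w = [[1.0] * 3 for _ in range(3)]  # 3x3 kernel of ones

valid = correlate2d_valid(x, w)              # no padding
same = correlate2d_valid(zero_pad(x, 1), w)  # (K-1)/2 = 1 zero per side
full = correlate2d_valid(zero_pad(x, 2), w)  # K-1 = 2 zeros per side

print(len(valid), len(same), len(full))  # 3 5 7
```

For K = 3 the output shrinks by K - 1 = 2 in valid mode and grows by K - 1 = 2 in full mode, which answers the two "by how much?" questions for this kernel size.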
Pooling
- sometimes we would like to reduce the size of the output, e.g., from D x D to D/2 x D/2
- a combination of pooling and downsampling is used:
  1. calculate the output $\tilde{y}_d = g\left(\sum_{k=1}^{K} x_{d + k - 1}\, w_k\right)$
  2. aggregate the output over different regions, $y_d = \mathrm{pool}\{\tilde{y}_d, \dots, \tilde{y}_{d + p}\}$; two common aggregation functions are max and mean
  3. often this is followed by subsampling using the same step size
- pooling results in some degree of invariance to translation
- the same idea extends to higher dimensions
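Steps 2 and 3 can be sketched in a few lines. The window follows the slide's notation (p + 1 consecutive values), while the subsampling step of 2 used below is an assumption chosen for the example; the two spike inputs differ by a one-position shift.

```python
def pool(y_tilde, p, agg=max):
    """y_d = agg{y~_d, ..., y~_{d+p}} for every d where the window fits."""
    return [agg(y_tilde[d:d + p + 1]) for d in range(len(y_tilde) - p)]

def mean(values):
    return sum(values) / len(values)

def subsample(y, step):
    """Keep every step-th element."""
    return y[::step]

spike_a = [0, 0, 1, 0, 0, 0]
spike_b = [0, 0, 0, 1, 0, 0]  # spike_a shifted by one position

out_a = subsample(pool(spike_a, p=1), step=2)
out_b = subsample(pool(spike_b, p=1), step=2)
mean_a = pool(spike_a, p=1, agg=mean)

print(out_a, out_b)  # [0, 1, 0] [0, 1, 0]
print(mean_a)        # [0.0, 0.5, 0.5, 0.0, 0.0]
```

The two shifted inputs give the same pooled output, which illustrates "some degree of invariance"; a larger shift would still change the result.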
Strided convolution
- alternatively we can directly subsample the output
- stride 1: $\tilde{y}_d = g\left(\sum_{k=1}^{K} x_{(d - 1) + k}\, w_k\right)$
- stride $p$: $y_d = g\left(\sum_{k=1}^{K} x_{p(d - 1) + k}\, w_k\right) = \tilde{y}_{p(d - 1) + 1}$
[figure: the strided outputs $y_1, y_2, y_3$ shown as equivalent to a subsampled sequence of the $\tilde{y}$ values]
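The equivalence between striding and subsampling can be checked directly. The sketch below uses 0-based indices, no padding and no nonlinearity; the function name `correlate_valid` is illustrative.

```python
def correlate_valid(w, x, stride=1):
    """Output d (0-based) is the inner product of w with the window of x
    starting at index stride * d."""
    K = len(w)
    return [sum(w[k] * x[start + k] for k in range(K))
            for start in range(0, len(x) - K + 1, stride)]

w = [1.0, 2.0, 1.0]
x = [1.0, 0.0, 2.0, 3.0, 1.0, 4.0, 0.0]

y_tilde = correlate_valid(w, x)          # stride 1
y = correlate_valid(w, x, stride=2)      # stride p = 2

print(y_tilde)  # [3.0, 7.0, 9.0, 9.0, 9.0]
print(y)        # [3.0, 9.0, 9.0]
```

Here `y` equals `y_tilde[::2]`, but the strided version never computes the outputs that subsampling would throw away.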
Strided convolution
- the same idea extends to higher dimensions, with different step sizes for different dimensions: $y_{d_1, d_2} = \sum_{k_1=1}^{K_1} \sum_{k_2=1}^{K_2} x_{p_1(d_1 - 1) + k_1,\, p_2(d_2 - 1) + k_2}\, w_{k_1, k_2}$
- output length (for one dimension): $\left\lfloor \frac{D + \text{padding} - K}{\text{stride}} + 1 \right\rfloor$
[figure: input with padding and the resulting output]
image: Dumoulin & Visin'16
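The output-length formula can be compared against an explicit 2D strided cross-correlation. In this sketch the input is assumed to be already padded (so padding = 0 in the formula), and the sizes 7x6 and the strides (2, 1) are arbitrary choices for illustration.

```python
def strided_output_length(D, K, stride=1, padding=0):
    """floor((D + padding - K) / stride + 1) for one dimension."""
    return (D + padding - K) // stride + 1

def correlate2d_strided(x, w, p1=1, p2=1):
    """2D cross-correlation with stride p1 along rows and p2 along columns."""
    K1, K2 = len(w), len(w[0])
    D1, D2 = len(x), len(x[0])
    return [[sum(x[i + k1][j + k2] * w[k1][k2]
                 for k1 in range(K1) for k2 in range(K2))
             for j in range(0, D2 - K2 + 1, p2)]
            for i in range(0, D1 - K1 + 1, p1)]

x = [[float(i * 6 + j) for j in range(6)] for i in range(7)]  # 7x6 input
w = [[1.0] * 3 for _ in range(3)]                             # 3x3 kernel

y = correlate2d_strided(x, w, p1=2, p2=1)
print(len(y), len(y[0]))  # 3 4
print(strided_output_length(7, 3, stride=2), strided_output_length(6, 3, stride=1))  # 3 4
```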
Channels
- so far we assumed a single input and output sequence or image
- with RGB data, we have 3 input channels ($M = 3$); in general $x \in \mathbb{R}^{M \times D_1 \times D_2}$ (this example: 2 input channels)
- we have one $K_1 \times K_2$ filter per input-output channel combination: $w \in \mathbb{R}^{M \times M' \times K_1 \times K_2}$
- add the results of convolution from the different input channels
- similarly we can produce multiple output channels (here $M' = 3$): $y \in \mathbb{R}^{M' \times D_1' \times D_2'}$
image: Dumoulin & Visin'16
Channels
- we can also add a bias parameter ($b$), one per output channel: $b \in \mathbb{R}^{M'}$
- $y_{m', d_1, d_2} = g\left(\sum_{m=1}^{M} \sum_{k_1} \sum_{k_2} w_{m, m', k_1, k_2}\, x_{m,\, d_1 + k_1 - 1,\, d_2 + k_2 - 1} + b_{m'}\right)$
- with $x \in \mathbb{R}^{M \times D_1 \times D_2}$, $w \in \mathbb{R}^{M \times M' \times K_1 \times K_2}$ and $y \in \mathbb{R}^{M' \times D_1' \times D_2'}$
[figure: worked example on RGB channels, with labels for $D_1$, $D_2$, $K_1$, $K_2$, $M$ and $M'$]
image: https://cs231n.github.io/convolutional-networks/
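The multi-channel formula translates into nested loops almost symbol for symbol. The sketch below is a deliberately slow reference version with no padding and stride 1, taking g to be ReLU as an assumption; the toy sizes (2 input channels, 3 output channels, 3x3 input, 2x2 filters) and the constant filter values are hypothetical.

```python
def relu(a):
    return max(0.0, a)

def conv_layer(x, w, b, g=relu):
    """x[m][i][j], w[m][m_out][k1][k2], b[m_out] -> y[m_out][i][j]."""
    M, D1, D2 = len(x), len(x[0]), len(x[0][0])
    M_out, K1, K2 = len(w[0]), len(w[0][0]), len(w[0][0][0])
    y = []
    for mo in range(M_out):
        feature_map = []
        for i in range(D1 - K1 + 1):
            row = []
            for j in range(D2 - K2 + 1):
                total = b[mo]  # one bias per output channel
                for m in range(M):  # add results over input channels
                    for k1 in range(K1):
                        for k2 in range(K2):
                            total += w[m][mo][k1][k2] * x[m][i + k1][j + k2]
                row.append(g(total))
            feature_map.append(row)
        y.append(feature_map)
    return y

M, M_out, K = 2, 3, 2
x = [[[1.0] * 3 for _ in range(3)],   # input channel 0: all ones
     [[2.0] * 3 for _ in range(3)]]   # input channel 1: all twos
values = [1.0, -1.0, 0.5]             # constant filter value per output channel
w = [[[[values[mo]] * K for _ in range(K)] for mo in range(M_out)]
     for _ in range(M)]
b = [0.0, 1.0, -1.0]

y = conv_layer(x, w, b)
print(len(y), len(y[0]), len(y[0][0]))         # 3 2 2
print(y[0][0][0], y[1][0][0], y[2][0][0])      # 12.0 0.0 5.0
```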
Convolutional Neural Network (CNN)
- a CNN or convnet is a neural network with convolutional layers (so it's a special type of MLP)
- it could be applied to 1D sequence, 2D image or 3D volumetric data
- example: conv-net architecture (derived from AlexNet) for image classification
  [figure labels: fully connected layers; number of classes]
- visualization of the convolution kernel at the first layer (11x11x3x96): 96 filters, each one 11x11x3; each of these is responsible for one of the 96 feature maps in the second layer
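As a small worked count for the first-layer kernel quoted above, the arithmetic below multiplies out the 11x11x3x96 shape. The bias count assumes one bias per output channel, as on the Channels slide; whether the pictured network uses biases is not stated here.

```python
K1, K2 = 11, 11   # spatial size of each filter
M = 3             # input channels (RGB)
M_out = 96        # output channels, one feature map each

num_weights = K1 * K2 * M * M_out
num_biases = M_out  # assumption: one bias per output channel

print(num_weights)               # 34848
print(num_weights + num_biases)  # 34944
```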
Convolutional Neural Network (CNN)
- deeper units represent more abstract features
Application: image classification
- convnets have achieved super-human performance in image classification
- ImageNet challenge: > 1M images, 1000 classes
image credit: He et al'15, https://semiengineering.com/new-vision-technologies-for-real-world-applications/
Application: image classification
- a variety of increasingly deeper architectures have been proposed
image credit: He et al'15, https://semiengineering.com/new-vision-technologies-for-real-world-applications/
Training: backpropagation through convolution
- consider the strided 1D convolution op: $y_{m', d'} = \sum_{m} \sum_{k} w_{m, m', k}\, x_{m,\, p(d' - 1) + k}$, where $m'$ is the output channel index, $m$ the input channel index, $k$ the filter index and $p$ the stride
- using backprop, we have $\frac{\partial J}{\partial y_{m', d'}}$ so far, and we need:
  1) $\frac{\partial J}{\partial w_{m, m', k}} = \sum_{d'} \frac{\partial J}{\partial y_{m', d'}} \frac{\partial y_{m', d'}}{\partial w_{m, m', k}}$, with $\frac{\partial y_{m', d'}}{\partial w_{m, m', k}} = x_{m,\, p(d' - 1) + k}$, so as to get the gradients
  2) $\frac{\partial J}{\partial x_{m, d}} = \sum_{d', m'} \frac{\partial J}{\partial y_{m', d'}} \frac{\partial y_{m', d'}}{\partial x_{m, d}}$, with $\frac{\partial y_{m', d'}}{\partial x_{m, d}} = \sum_{k} w_{m, m', k}$ such that $p(d' - 1) + k = d$, to backpropagate to the previous layer
- this operation is similar to multiplication by the transpose of the parameter-sharing matrix (transposed convolution)
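Both gradients can be sketched and checked against finite differences. To keep the example short, the code below simplifies to a single input and output channel (dropping the sums over m and m'), uses 0-based indices, and assumes a hypothetical loss J = 0.5 * sum(y^2) so that the upstream gradient dJ/dy is just y.

```python
def forward(w, x, p):
    """y_d = sum_k w_k * x_{p*d + k} (0-based d and k, no padding)."""
    K = len(w)
    return [sum(w[k] * x[p * d + k] for k in range(K))
            for d in range((len(x) - K) // p + 1)]

def backward(w, x, p, dJ_dy):
    """Accumulate dJ/dw and dJ/dx from the upstream gradient dJ/dy."""
    dJ_dw = [0.0] * len(w)
    dJ_dx = [0.0] * len(x)
    for d, grad in enumerate(dJ_dy):
        for k in range(len(w)):
            dJ_dw[k] += grad * x[p * d + k]  # dy_d / dw_k = x_{p*d + k}
            dJ_dx[p * d + k] += grad * w[k]  # dy_d / dx_{p*d + k} = w_k
    return dJ_dw, dJ_dx

def loss(w, x, p):
    """Hypothetical loss J = 0.5 * sum(y^2), so that dJ/dy = y."""
    return 0.5 * sum(v * v for v in forward(w, x, p))

def numeric_grad(f, v, eps=1e-6):
    """Central finite differences of f with respect to each entry of v."""
    grad = []
    for i in range(len(v)):
        hi = v[:i] + [v[i] + eps] + v[i + 1:]
        lo = v[:i] + [v[i] - eps] + v[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.0, -1.0, 3.0, 1.0, 2.0]
p = 2

dJ_dw, dJ_dx = backward(w, x, p, forward(w, x, p))
num_dw = numeric_grad(lambda v: loss(v, x, p), w)
num_dx = numeric_grad(lambda v: loss(w, v, p), x)

print(max(abs(a - b) for a, b in zip(dJ_dw, num_dw)))  # close to zero
print(max(abs(a - b) for a, b in zip(dJ_dx, num_dx)))  # close to zero
```

The scatter `dJ_dx[p * d + k] += grad * w[k]` is the transposed use of the filter: each output gradient is spread back over the K inputs it was computed from.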