Convolutional Neural Networks

Kaitlin Palmer, San Diego State University

  1. Convolutional Neural Networks
     Kaitlin Palmer, San Diego State University
     Outline
     • What are Convolutional Neural Networks (CNNs)
     • Why use a CNN
     • Typical Layout
       – Kernel Size
       – Stride Size/Padding
       – Pooling
     • Keras Implementation

  2. What are CNNs?
     • Neural networks that use convolution (or cross-correlation) with a kernel of weights, plus a bias, rather than general matrix multiplication
     • Spaceship example (ISS tracking data): $s(t) = \int x(a)\, w(t - a)\, da$
       – s(t): smoothed estimate of the position
       – x(t): radar position
       – a: age of measurement
       – w(t − a): weighting function
       – w is considered a probability density function
       – w is 0 for all negative arguments

  3. What are CNNs?
     • Discretized: $s(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a)$
       – s(t): feature map
       – x(a): input (multi-dimensional array)
       – w(t − a): kernel (multi-dimensional array)
     • What is convolution?
     • Practice example, 1D (summation of the products): ∗ 0 1 1 2 5 3 0 0 1 0 1 1 0 2 3 7 8 3 0
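The discretized sum can be tried out by hand on a short sequence. The sketch below is a minimal pure-Python version of the "summation of the products"; the function name `conv1d_full` and the input and kernel values are illustrative assumptions, not the numbers from the slide.

```python
def conv1d_full(x, w):
    """Full discrete convolution: s[t] = sum over a of x[a] * w[t - a]."""
    n_out = len(x) + len(w) - 1
    s = [0] * n_out
    for t in range(n_out):
        for a in range(len(x)):
            k = t - a  # index into the kernel
            if 0 <= k < len(w):  # the kernel is zero outside its support
                s[t] += x[a] * w[k]
    return s


x = [1, 2, 3]  # illustrative input
w = [1, 1]     # illustrative kernel
s = conv1d_full(x, w)
print(s)  # [1, 3, 5, 3]
```

Each output is the sum of the products of the overlapping input and (flipped) kernel entries.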

  4. What are CNNs?
     • Multi-dimensional array (convolution): $S(i, j) = \sum_{m} \sum_{n} I(i - m, j - n)\, K(m, n)$
     • Beware matrix flipping, convolution vs. cross-correlation. Cross-correlation: $S(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n)$
     Why CNNs?
     • Sparse Interactions
     • Parameter sharing
     • Equivariant Representations
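The difference between the two formulas above is only whether the kernel is flipped. A minimal sketch, assuming "valid" output (no padding) and illustrative values; the helper names are hypothetical:

```python
def cross_correlate2d(image, kernel):
    """Valid cross-correlation: out[i][j] = sum of image[i+m][j+n] * kernel[m][n]."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(ow)]
            for i in range(oh)]


def flip(kernel):
    """Reverse the kernel along both axes."""
    return [row[::-1] for row in kernel[::-1]]


def convolve2d(image, kernel):
    """Valid convolution, computed as cross-correlation with the flipped kernel."""
    return cross_correlate2d(image, flip(kernel))


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

print(cross_correlate2d(image, kernel))  # [[-4, -4], [-4, -4]]
print(convolve2d(image, kernel))         # [[4, 4], [4, 4]]
```

For a symmetric kernel the two operations give the same result, which is why many libraries implement cross-correlation and still call it convolution.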

  5. Why CNNs
     • Sparse Interactions (Fig. 9.2, Goodfellow et al.)
       – AKA sparse connectivity or sparse weights
       – Fewer parameters
       – Kernels are smaller than the input, so less storage is needed
       – Tens or hundreds of parameters to learn vs. millions
     • Parameter Sharing
       – The same parameter is used for more than one function in a model
       – Weights applied to one input are applied elsewhere
       – Each member of the kernel is used at every position
       – One set of parameters is learned, regardless of location

  6. Why CNNs
     • Sparse Interactions (Fig. 9.4, Goodfellow et al.)
       – Receptive field
       – Few direct connections, but units in deeper layers are indirectly connected to most of the input image
     • Parameter Sharing (Fig. 9.5, Goodfellow et al.)
       – Each kernel value is used at every position of the input
       – Convolution example
         • 280 x 320 input, 280 x 319 output: 319 * 280 * 3 = 267,960 operations (two multiplications and one addition per output pixel)
         • Equivalent matrix multiplication: 320 x 280 x 319 x 280 = more than 8 billion parameters, so convolution is about 4 billion times more efficient at representing the transformation
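The arithmetic in the convolution example can be checked directly. This is a small sketch; the two-element kernel is an assumption inferred from "two multiplications and one addition", and the variable names are illustrative.

```python
in_h, in_w = 280, 320    # input image size from the slide
out_h, out_w = 280, 319  # output size from the slide

# Convolution: two multiplications and one addition per output pixel.
conv_ops = out_w * out_h * 3

# Dense matrix mapping every input pixel to every output pixel.
dense_entries = in_w * in_h * out_w * out_h

kernel_params = 2  # assumed two-element kernel

print(conv_ops)                        # 267960
print(dense_entries)                   # 8003072000
print(dense_entries // kernel_params)  # 4001536000
```

The last ratio is where the "about 4 billion times" figure comes from: it compares stored parameters, not operations.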

  7. Why CNNs
     • Equivariance to translation
       – If the input shifts, the output shifts by the same amount
       – An event that moves later in time (or to another location) in the input shifts the same way in the output
       – Not naturally invariant to rotation or scale
     • Edge Detection Example (Fig. 9.6, Goodfellow et al.)
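Equivariance to translation can be seen on a 1D signal: shifting the input and then filtering gives the same result as filtering and then shifting. A minimal sketch with illustrative values and hypothetical helper names; zero-filling at the border is an assumption of this example.

```python
def correlate1d(x, w):
    """Valid 1D cross-correlation."""
    n = len(x) - len(w) + 1
    return [sum(x[i + k] * w[k] for k in range(len(w))) for i in range(n)]


def shift_right(x, d):
    """Shift a sequence right by d samples, filling with zeros."""
    return [0] * d + x[:len(x) - d]


x = [0, 0, 1, 3, 1, 0, 0, 0]  # an "event" in the signal
w = [1, -1]                   # simple difference kernel

y = correlate1d(x, w)
y_from_shifted_input = correlate1d(shift_right(x, 2), w)

print(y)                     # [0, -1, -2, 2, 1, 0, 0]
print(y_from_shifted_input)  # [0, 0, 0, -1, -2, 2, 1]
print(y_from_shifted_input == shift_right(y, 2))  # True
```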

  8. Edge detection example (Andrew Ng 2017)
     8 8 8 0 0 0
     8 8 8 0 0 0                    0 24 24 0
     8 8 8 0 0 0       1 0 -1       0 24 24 0
     8 8 8 0 0 0   ∗   1 0 -1   =   0 24 24 0
     8 8 8 0 0 0       1 0 -1       0 24 24 0
     8 8 8 0 0 0
     CNN Layout
     • Kernel size (typically odd)
     • Stride/padding
     • Pooling
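The edge-detection numbers on this slide can be reproduced with a few lines of pure Python. This is a sketch that assumes cross-correlation with no padding; the function name is illustrative.

```python
def cross_correlate2d(image, kernel):
    """Valid cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(ow)]
            for i in range(oh)]


# 6 x 6 image: bright left half, dark right half.
image = [[8, 8, 8, 0, 0, 0] for _ in range(6)]
# 3 x 3 vertical-edge kernel.
kernel = [[1, 0, -1] for _ in range(3)]

edges = cross_correlate2d(image, kernel)
for row in edges:
    print(row)  # each of the 4 rows is [0, 24, 24, 0]
```

The nonzero columns mark where the window straddles the bright-to-dark boundary.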

  9. CNN Layout
     • Padding
     • Why padding?
       – Input shrinks at each layer
       – Edge effects
     • Padding types and terminology
       – Valid: no padding
       – Same: enough padding to make the output size the same as the input size
       – Full: enough padding for every pixel to be visited k times in each direction
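The three padding types differ only in how many zeros are added on each side. The sketch below computes the resulting output width for stride 1; the function name is hypothetical, and "same" is assumed to use an odd kernel size.

```python
def padded_output_size(n, k, mode):
    """Output width for input width n and kernel width k at stride 1."""
    if mode == "valid":
        pad = 0                  # no padding
    elif mode == "same":
        if k % 2 == 0:
            raise ValueError("this sketch assumes an odd kernel size for 'same'")
        pad = (k - 1) // 2       # output width equals input width
    elif mode == "full":
        pad = k - 1              # every pixel is visited k times
    else:
        raise ValueError(f"unknown padding mode: {mode}")
    return n + 2 * pad - k + 1


sizes = {mode: padded_output_size(6, 3, mode) for mode in ("valid", "same", "full")}
print(sizes)  # {'valid': 4, 'same': 6, 'full': 8}
```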

  10. Layout: Step Size
      • AKA stride
      • Equivalent to hop size (advance)
      • Downsamples within the neural network
      Convolutions on RGB (Andrew Ng)
      • Input ∗ kernel = 4 x 4 output
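A multi-channel (RGB) convolution sums over the channels as well as over the spatial window, and the stride sets how far the window hops. In this sketch the 6 x 6 x 3 input and 3 x 3 x 3 kernel are assumptions chosen so that the stride-1 output is 4 x 4, as on the slide; the function name is illustrative.

```python
def conv_multichannel(image, kernel, stride=1):
    """Valid cross-correlation; image[c][i][j] and kernel[c][m][n] share channels."""
    channels = len(image)
    kh, kw = len(kernel[0]), len(kernel[0][0])
    oh = (len(image[0]) - kh) // stride + 1
    ow = (len(image[0][0]) - kw) // stride + 1
    return [[sum(image[c][i * stride + m][j * stride + n] * kernel[c][m][n]
                 for c in range(channels)
                 for m in range(kh)
                 for n in range(kw))
             for j in range(ow)]
            for i in range(oh)]


# Assumed shapes: 3 channels of 6 x 6 ones, 3 channels of 3 x 3 ones.
image = [[[1] * 6 for _ in range(6)] for _ in range(3)]
kernel = [[[1] * 3 for _ in range(3)] for _ in range(3)]

out = conv_multichannel(image, kernel)               # stride 1
out_strided = conv_multichannel(image, kernel, stride=2)

print(len(out), len(out[0]))                  # 4 4
print(out[0][0])                              # 27
print(len(out_strided), len(out_strided[0]))  # 2 2
```

The three channels collapse into a single 2D feature map per kernel, and a stride of 2 halves the output from 4 x 4 to 2 x 2.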

  11. CNN Layout: Pooling
      • Pooling layers
      • Invariant to small translations of the input
      • Replace the net output with a summary statistic
        – Max pooling
        – Neighborhood average
        – L2 norm
        – Weighted average based on distance from the central pixel
      Pooling
      • Equivalent to an infinitely strong prior
      • Max pooling example (Fig. 9.8, Goodfellow et al.)
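The summary statistics listed above can be compared on one small detector output. A minimal sketch with illustrative values; `pool1d` and the helper names are hypothetical, and the weighted-average variant is left out.

```python
import math


def pool1d(x, width, stat):
    """Slide a window of the given width over x and summarize each window."""
    return [stat(x[i:i + width]) for i in range(len(x) - width + 1)]


def mean(values):
    return sum(values) / len(values)


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


detector = [3, 4, 0, 0, 6, 8]  # illustrative detector-stage outputs

print(pool1d(detector, 2, max))      # [4, 4, 0, 6, 8]
print(pool1d(detector, 2, mean))     # [3.5, 2.0, 0.0, 3.0, 7.0]
print(pool1d(detector, 2, l2_norm))  # [5.0, 4.0, 0.0, 6.0, 10.0]
```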

  12. Pooling
      • Downsampling (Fig. 9.10, Goodfellow et al.)
        – Computational efficiency
        – Possible to use fewer pooling units than detector units
        – Pool over k pixels
      • Invariance to translation (Fig. 9.9, Goodfellow et al.)

  13. Pooling Invariance (Yann LeCun)
      CNN Layout (Fig. 9.9, Goodfellow et al.)

  14. Keras Implementation
      • LeCun et al. 1998, Gradient-Based Learning Applied to Document Recognition
      • LeNet-5:
        # input_shape and num_classes are assumed to be defined for the dataset
        from keras.models import Sequential
        from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

        model = Sequential()
        model.add(Conv2D(6, kernel_size=(5, 5), activation='tanh', input_shape=input_shape))
        model.add(MaxPooling2D(pool_size=(2, 2)))  # pooling layers take no filter count
        model.add(Conv2D(16, (5, 5), activation='tanh'))
        model.add(MaxPooling2D(pool_size=(2, 2)))
        model.add(Conv2D(120, (5, 5), activation='tanh'))
        model.add(Flatten())  # flatten the feature map before the dense layers
        model.add(Dense(84, activation='sigmoid'))
        model.add(Dense(num_classes, activation='softmax'))
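To see why three 5 x 5 convolutions and two 2 x 2 poolings fit together, the feature-map sizes and parameter counts of the model above can be traced by hand. This sketch assumes a 32 x 32 single-channel input, 10 classes, no padding, and fully connected channels in every convolution (as a Keras `Conv2D` layer would be); it uses no Keras calls.

```python
size, channels = 32, 1  # assumed input: 32 x 32, one channel
num_classes = 10        # assumed number of classes

# (kind, number of filters, window size)
layers = [("conv", 6, 5), ("pool", None, 2),
          ("conv", 16, 5), ("pool", None, 2),
          ("conv", 120, 5)]

shapes = []
params = 0
for kind, filters, k in layers:
    if kind == "conv":
        params += filters * (k * k * channels + 1)  # weights plus one bias per filter
        size = size - k + 1                         # valid convolution, stride 1
        channels = filters
    else:
        size = size // k                            # non-overlapping pooling
    shapes.append((size, size, channels))

flat = size * size * channels
params += 84 * (flat + 1)         # dense layer with 84 units
params += num_classes * (84 + 1)  # softmax output layer

print(shapes)  # [(28, 28, 6), (14, 14, 6), (10, 10, 16), (5, 5, 16), (1, 1, 120)]
print(params)  # 61706
```

The last convolution reduces the map to 1 x 1 x 120, which is why flattening it before the dense layers loses nothing.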

  15. Convolution Backpropagation
      • Three operations: convolution, backpropagation from output to weights, backpropagation from output to input
      • Kernel stack K
      • Multidimensional input (e.g. image) V
      • Stride s
      • Convolution output (feature map) Z
      • Loss function J
      Backpropagation
      • Convolution: $Z = c(K, V, s)$
      • Loss function: $J(V, K)$
      • Tensor, change in loss with respect to the feature map: $G_{i,j,k} = \frac{\partial}{\partial Z_{i,j,k}} J(V, K)$
      • Backpropagation from output to kernel (derivatives with respect to the kernel): $g(G, V, s)_{i,j,k,l} = \frac{\partial}{\partial K_{i,j,k,l}} J(V, K) = \sum_{m,n} G_{i,m,n}\, V_{j,(m-1)s+k,(n-1)s+l}$
      • Backpropagation through hidden layer: $h(K, G, s)_{i,j,k} = \frac{\partial}{\partial V_{i,j,k}} J(V, K) = \sum_{l,m \text{ s.t. } (l-1)s+m=j} \; \sum_{n,p \text{ s.t. } (n-1)s+p=k} \; \sum_{q} K_{q,i,m,p}\, G_{q,l,n}$
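The three operations can be checked numerically in a reduced setting. The sketch below is a single-channel, 1D, zero-indexed analogue of c, g, and h; the squared-error loss, the values, and all function names are assumptions made for illustration, and the analytic gradients are compared against finite differences.

```python
def conv(V, K, s):
    """Strided 1D analogue of c(K, V, s): Z[j] = sum over m of V[j*s + m] * K[m]."""
    n_out = (len(V) - len(K)) // s + 1
    return [sum(V[j * s + m] * K[m] for m in range(len(K))) for j in range(n_out)]


def loss(V, K, s):
    """Assumed loss J = 0.5 * sum(Z**2), so that dJ/dZ = Z."""
    return 0.5 * sum(z * z for z in conv(V, K, s))


def grad_kernel(G, V, s, k_len):
    """Analogue of g(G, V, s): dJ/dK[m] = sum over j of G[j] * V[j*s + m]."""
    return [sum(G[j] * V[j * s + m] for j in range(len(G))) for m in range(k_len)]


def grad_input(K, G, s, v_len):
    """Analogue of h(K, G, s): dJ/dV[i] sums K[m] * G[j] over all j*s + m == i."""
    dV = [0.0] * v_len
    for j in range(len(G)):
        for m in range(len(K)):
            dV[j * s + m] += K[m] * G[j]
    return dV


def numeric_grad(f, x, eps=1e-6):
    """Central finite differences of f with respect to each entry of x."""
    grads = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


V = [1.0, 2.0, 0.0, -1.0, 3.0]
K = [0.5, -1.0, 2.0]
s = 2

Z = conv(V, K, s)
G = Z  # dJ/dZ for the assumed loss
dK = grad_kernel(G, V, s, len(K))
dV = grad_input(K, G, s, len(V))

print(Z)   # [-1.5, 7.0]
print(dK)  # [-1.5, -10.0, 21.0]
print(dV)  # [-0.75, 1.5, 0.5, -7.0, 14.0]

dK_numeric = numeric_grad(lambda k: loss(V, k, s), K)
dV_numeric = numeric_grad(lambda v: loss(v, K, s), V)
```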

  16. Structured Output (Fig. 9.17, Goodfellow et al.)
      • For pixel-wise labeling of images, pooling is not always necessary
      Locally Connected Layers (Fig. 9.14, Goodfellow et al.)
      • AKA unshared convolution
      • Each feature is a function of a small portion of space, but is not shared across all of space
      • Example: look for a chin in the bottom half of an image
      • $Z_{i,j,k} = \sum_{l,m,n} [V_{l,j+m-1,k+n-1}\, w_{i,j,k,l,m,n}]$
      • Figure panels: locally connected layer (patch size 2), convolutional layer, fully connected layer

  17. Tiled Convolution (Fig. 9.16, Goodfellow et al.)
      • Midway between locally connected layers and convolutional layers
      • Learn a set of kernels to rotate through
      • Immediate neighbors have different filters, but memory size increases only by a factor of the size of the set of kernels
      • $Z_{i,j,k} = \sum_{l,m,n} V_{l,j+m-1,k+n-1}\, K_{i,l,m,n,j\%t+1,k\%t+1}$
      • Figure panels: locally connected layer (patch size 2), tiled convolution (t=2), traditional convolution (equivalent to tiled convolution with t=1)
      Data Types
      • Flexibility in CNNs
      • Multiple input sizes
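The kernel index `j % t` is what makes the layer rotate through its set of kernels. A minimal single-channel 1D sketch, zero-indexed, with illustrative kernels; the function name is hypothetical.

```python
def tiled_conv1d(V, kernels):
    """Output position j uses kernels[j % t], where t is the number of kernels."""
    t = len(kernels)
    k = len(kernels[0])
    return [sum(V[j + m] * kernels[j % t][m] for m in range(k))
            for j in range(len(V) - k + 1)]


V = [1, 2, 3, 4, 5]

# t = 2: neighboring outputs alternate between two different kernels.
print(tiled_conv1d(V, [[1, 0], [0, 1]]))  # [1, 3, 3, 5]

# t = 1: a single shared kernel, i.e. traditional convolution.
print(tiled_conv1d(V, [[1, 0]]))          # [1, 2, 3, 4]
```

Setting t equal to the number of output positions would give every position its own kernel, which is the locally connected case.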

  18. Data Types
      • 1D: multi-channel, single channel (figure labels: position, rotation, scale)
      • 2D: multi-channel, single channel

  19. Data Types
      • 3D: multi-channel, single channel
      Random or Unsupervised Features
      • Learning features is expensive
        – Every gradient step requires full forward/back propagation
      • Use features not trained in a supervised fashion

  20. Random or Unsupervised Features
      • Three options
        – Random kernel initialization
        – Design kernels by hand
        – Learn kernels with an unsupervised criterion
      • Random kernel initialization
        – As before, random weights typically perform well
        – Need to test multiple architectures
      • Good approach:
        – Build multiple architectures
        – Set random weights
        – Train only the last layer
        – Pick the best architecture and train it using full back propagation

  21. Random or Unsupervised Features
      • Learn kernels (k) using an unsupervised criterion
        – Allows features to be determined separately from the classifier later in the architecture
        – What unsupervised tools have we used so far?
        – Apply k-means clustering to image patches and use each centroid as a convolution kernel
        – Extract k-means features for the entire training set and use them as the last layer before classification
      • Hand-designed features
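The k-means idea can be sketched on a 1D signal: cut the signal into patches, cluster them, and treat each centroid as a kernel. Everything here is an illustrative assumption (the signal, patch width, two clusters, and initializing from the first two patches); it is a toy version, not a recipe from the slides.

```python
def extract_patches(signal, width):
    """All contiguous patches of the given width."""
    return [signal[i:i + width] for i in range(len(signal) - width + 1)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, init_centroids, iters=10):
    """Plain k-means with caller-supplied initial centroids."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


signal = [0, 4, 1, 5, 0, 3]
patches = extract_patches(signal, 2)
centroids = kmeans(patches, init_centroids=patches[:2])

# Use each learned centroid as a kernel (valid cross-correlation).
feature_maps = [[sum(p[m] * kernel[m] for m in range(len(kernel))) for p in patches]
                for kernel in centroids]

print(centroids[1])  # [4.5, 0.5]
```

No labels are used anywhere, which is the point: the kernels are fixed before the classifier is trained.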

  22. Neurobiologically Inspired Networks
      • Hubel and Wiesel, 1959, 1962, 1968
      Neurobiological Basis
      • Simple cells
        – Roughly linear
        – Feature selection
      • Complex cells
        – Nonlinear
        – Invariant to some transformations of simple cell features

