Convolutional Neural Nets II EECS 442 – Prof. David Fouhey Winter 2019, University of Michigan http://web.eecs.umich.edu/~fouhey/teaching/EECS442_W19/
Previously – Backpropagation
f(x) = (−x + 3)²
Forward pass (compute the function): x → −x → −x + 3 → (−x + 3)²
Backward pass (compute the derivative of all parts of the function): 2x − 6 ← −2x + 6 ← −2x + 6 ← 1
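The forward/backward pattern on this slide can be sketched in code. This is an illustrative sketch, not course code (the names `forward`/`backward` are mine):

```python
# Forward pass: compute f(x) = (-x + 3)^2 piece by piece, caching intermediates.
def forward(x):
    a = -x        # negate
    b = a + 3     # add 3
    y = b ** 2    # square
    return y, (a, b)

# Backward pass: walk the graph in reverse, multiplying local derivatives.
def backward(x, cache):
    a, b = cache
    dy = 1.0            # gradient at the output
    db = dy * 2 * b     # through n^2: multiply by 2n  -> -2x + 6
    da = db * 1.0       # through n + 3: multiply by 1 -> -2x + 6
    dx = da * -1.0      # through -n: multiply by -1   ->  2x - 6
    return dx

y, cache = forward(5.0)    # y = (-5 + 3)^2 = 4
dx = backward(5.0, cache)  # dx = 2*5 - 6 = 4
```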
Setting Up A Neural Net
[Diagram: input layer x1–x2, hidden layer h1–h4, output layer y1–y3]
Setting Up A Neural Net
[Diagram: input layer x1–x2, hidden layer 1 a1–a4, hidden layer 2 h1–h4, output layer y1–y3]
Fully Connected Network
Each neuron connects to each neuron in the previous layer
[Diagram: x1–x2 → a1–a4 → h1–h4 → y1–y3]
Fully Connected Network
Define New Block: "Linear Layer" (OK, technically it's affine)
f(n) = Wn + b
Can get the gradient with respect to all the inputs (do on your own; useful trick: you have to be able to do a matrix multiply)
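A minimal numpy sketch of the linear layer and its gradients (my own illustration of the matrix-multiply trick; not course code):

```python
import numpy as np

# Affine ("linear") layer: f(n) = W n + b.
def linear_forward(W, b, n):
    return W @ n + b

# Gradients for backprop, given the upstream gradient dL_df.
def linear_backward(W, b, n, dL_df):
    dL_dn = W.T @ dL_df          # gradient w.r.t. the input n
    dL_dW = np.outer(dL_df, n)   # gradient w.r.t. the weights W
    dL_db = dL_df                # gradient w.r.t. the bias b
    return dL_dn, dL_dW, dL_db

W = np.array([[1., 2.], [3., 4.]])
b = np.array([0.5, -0.5])
n = np.array([1., 1.])
f = linear_forward(W, b, n)                        # [3.5, 6.5]
dn, dW, db = linear_backward(W, b, n, np.ones(2))
```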
Fully Connected Network
[Diagram: x → Linear(W1, b1) → f(n) → Linear(W2, b2) → f(n) → Linear(W3, b3) → f(n)]
Convolutional Layer
New Block: 2D Convolution
f(n) = n ∗ W + b
Convolution Layer
out(x, y) = b + Σ_{i=1}^{F_h} Σ_{j=1}^{F_w} Σ_{k=1}^{c} F_{i,j,k} · I_{y+i, x+j, k}
[Diagram: 32×32×3 input, F_h×F_w×c filter]
Slide credit: Karpathy and Fei-Fei
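The triple sum can be written directly as a (deliberately naive) loop. A sketch assuming stride 1, no padding, and a single filter; `conv2d` is my own illustrative name:

```python
import numpy as np

# Naive 2D convolution over a multi-channel image, mirroring the slide's sum:
# out[y, x] = b + sum_{i,j,k} F[i, j, k] * I[y+i, x+j, k]
def conv2d(I, F, b):
    H, W, C = I.shape
    Fh, Fw, Fc = F.shape
    assert Fc == C, "filter depth must match image channels"
    out = np.zeros((H - Fh + 1, W - Fw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = b + np.sum(F * I[y:y + Fh, x:x + Fw, :])
    return out

I = np.ones((5, 5, 3))          # toy 5x5x3 image
F = np.ones((3, 3, 3))          # one 3x3x3 filter
out = conv2d(I, F, b=1.0)       # every output = 3*3*3 * 1 + 1 = 28
```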
Convolutional Neural Network (CNN)
[Diagram: x → Conv(W1, b1) → f(n) → Conv(W2, b2) → f(n) → Conv(W3, b3) → f(n)]
Today
[H×W×C image → CNN → 1×1×F vector]
Convert an H×W×C image into an F-dimensional vector:
• What's the probability this image is a cat? (F=1)
• Which of 1000 categories is this image? (F=1000)
• At what GPS coordinate was this image taken? (F=2)
• Identify the X,Y coordinates of 28 body joints of an image of a human (F=56)
Today's Running Example: Classification
[H×W×C image → CNN → 1×1×F vector]
Running example: image classification
P(image is class #1), P(image is class #2), …, P(image is class #F)
Today's Running Example: Classification
["Hippo" image, y_i: class #0] → CNN → [0.5, 0.2, 0.1, 0.2]
Loss function: −log( exp(s_{y_i}) / Σ_j exp(s_j) )
Today's Running Example: Classification
["Baboon" image, y_i: class #3] → CNN → [0.5, 0.2, 0.1, 0.2]
Loss function: −log( exp(s_{y_i}) / Σ_j exp(s_j) )
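The loss above is the softmax (cross-entropy) loss: the negative log-probability the network assigns to the correct class. A small numpy sketch (with the standard max-subtraction trick for numerical stability, which the slide doesn't show):

```python
import numpy as np

# Softmax loss: L = -log( exp(s_{y}) / sum_j exp(s_j) )
def softmax_loss(scores, y):
    scores = scores - scores.max()   # shift for numerical stability (no effect on result)
    log_probs = scores - np.log(np.sum(np.exp(scores)))
    return -log_probs[y]

# Scores whose softmax gives the slide's probabilities [0.5, 0.2, 0.1, 0.2]
scores = np.log(np.array([0.5, 0.2, 0.1, 0.2]))
loss_hippo = softmax_loss(scores, 0)   # correct class has prob 0.5 -> loss = -log(0.5)
```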
Model For Your Head
[H×W×C image → CNN → 1×1×F vector]
• Provide:
  • Examples of images and desired outputs
  • A sequence of layers producing a 1×1×F output
  • A loss function that measures success
• Train the network → the network figures out the parameters that make this work
Layer Collection
You can construct functions out of layers. The only requirement is that the layers "fit" together. Optimization figures out what the parameters of the layers are.
Image credit: lego.com
Review – Pooling
Idea: just want the spatial resolution of the activations/images smaller; applied per-channel
Max-pool, 2×2 filter, stride 2:
1 1 2 4
5 6 7 8   →   6 8
3 2 1 0       3 4
1 1 3 4
Slide credit: Karpathy and Fei-Fei
Review – Pooling
Max-pool, 2×2 filter, stride 2:
1 1 2 4
5 6 7 8   →   6 8
3 2 1 0       3 4
1 1 3 4
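The pooling example can be checked with a short numpy sketch (illustrative; assumes the pool size evenly tiles the input):

```python
import numpy as np

# Max-pool with a square filter, reproducing the slide's 2x2/stride-2 example.
def maxpool(x, size=2, stride=2):
    H, W = x.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # take the max over each size x size window
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 1, 3, 4]], dtype=float)
pooled = maxpool(x)   # [[6, 8], [3, 4]]
```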
Other Layers – Fully Connected
1×1×C → 1×1×F
Map a C-dimensional feature to an F-dimensional feature using a linear transformation: W (F×C matrix) + b (F×1 vector)
How can we write this as a convolution?
Everything's a Convolution
1×1×C → 1×1×F
Set F_h = 1, F_w = 1: a 1×1 convolution with F filters
b + Σ_{i=1}^{F_h} Σ_{j=1}^{F_w} Σ_{k=1}^{C} F_{i,j,k} · I_{y+i, x+j, k}  reduces to  b + Σ_{k=1}^{C} F_k · I_k
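A quick numerical check (my own sketch) that a fully connected layer on a 1×1×C feature matches F separate 1×1 convolutions:

```python
import numpy as np

np.random.seed(0)
C, F = 4, 3
W = np.random.randn(F, C)    # FxC weight matrix of the FC layer
b = np.random.randn(F)       # Fx1 bias
feat = np.random.randn(C)    # a 1x1xC input feature, flattened

fc_out = W @ feat + b        # fully connected layer

# F separate 1x1 "convolutions": filter f is just row f of W, plus bias b[f]
conv_out = np.array([b[f] + np.sum(W[f] * feat) for f in range(F)])
```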
Converting to a Vector
H×W×C → 1×1×F
How can we do this?
Converting to a Vector* – Pool
H×W×C → 1×1×F
Average pool, H×W filter, stride 1:
1 1 2 4
5 6 7 8   →   3.1
3 2 1 0
1 1 3 4
*(If F == C)
Converting to a Vector – Convolve
H×W×C → 1×1×F
An H×W convolution with F filters: a single value per filter
Looking At Networks
• We'll look at 3 landmark networks, each trained to solve a 1000-way classification task (ImageNet)
• AlexNet (2012)
• VGG-16 (2014)
• ResNet (2015)
AlexNet
Input 227×227×3 → Conv1 55×55×96 → Conv2 27×27×256 → Conv3 13×13×384 → Conv4 13×13×384 → Conv5 13×13×256 → FC6 1×1×4096 → FC7 1×1×4096 → Output 1×1×1000
Each block is an H×W×C volume. You transform one volume to another with convolution.
CNN Terminology
[Same AlexNet diagram]
Each entry in a volume is called an "activation"/"neuron"/"feature"
AlexNet – Conv1
Input 227×227×3 → Conv1 55×55×96
96 filters, 11×11, stride of 4, followed by ReLU
(227 − 11)/4 + 1 = 55
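The arithmetic generalizes to the standard output-size formula, out = (in + 2·pad − filter)/stride + 1. A small sketch:

```python
# Output-size formula for a convolution (or pooling) layer.
def conv_out_size(in_size, filt, stride, pad=0):
    return (in_size + 2 * pad - filt) // stride + 1

conv1 = conv_out_size(227, 11, 4)   # (227 - 11)/4 + 1 = 55
pool1 = conv_out_size(55, 3, 2)     # 3x3 maxpool, stride 2: (55 - 3)/2 + 1 = 27
```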
AlexNet
[Same AlexNet diagram]
All layers followed by ReLU
Some layers (shown in red on the slide) are followed by maxpool
Early layers have "normalization"
AlexNet – Details
C: size of conv filter, P: size of pool
Conv1: C=11, P=3; Conv2: C=5, P=3; Conv3: C=3; Conv4: C=3; Conv5: C=3, P=3
AlexNet
FC6: 13×13 input, 1×1 output. How?
AlexNet – How Many Parameters?
[Same AlexNet diagram]
AlexNet – How Many Parameters?
Conv1: 96 11×11 filters on a 3-channel input
11×11 × 3 × 96 + 96 = 34,944
AlexNet – How Many Parameters?
FC6 (note: max pool to 6×6 first): 4096 6×6 filters on a 256-channel input
6×6 × 256 × 4096 + 4096 ≈ 38 million
AlexNet – How Many Parameters?
FC7: 4096 1×1 filters on a 4096-channel input
1×1 × 4096 × 4096 + 4096 ≈ 17 million
AlexNet – How Many Parameters?
How long would it take you to list the parameters of AlexNet at 4s/parameter? 1 year? 4 years? 8 years? 16 years? (About 8 years.)
• 62.4 million parameters
• Vast majority in the fully connected layers
• But... the paper notes that removing the convolutions is disastrous for performance.
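The per-layer counts worked through above can be reproduced with a one-line helper (my own sketch; counts weights plus one bias per output channel):

```python
# Parameters of a conv (or conv-as-FC) layer: fh*fw*c_in weights per filter,
# c_out filters, plus one bias per filter.
def conv_params(fh, fw, c_in, c_out):
    return fh * fw * c_in * c_out + c_out

conv1 = conv_params(11, 11, 3, 96)     # 34,944
fc6   = conv_params(6, 6, 256, 4096)   # ~38 million
fc7   = conv_params(1, 1, 4096, 4096)  # ~17 million
```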
Dataset – ILSVRC
• ImageNet Large Scale Visual Recognition Challenge
• 1000 categories
• 1.4M images
Dataset – ILSVRC
Figure credit: O. Russakovsky
Visualizing Filters
[Input 227×227×3 → Conv1 55×55×96]
Conv1 filters:
• Q: How many input dimensions?
• A: 3
• What does the input mean?
• R, G, B, duh.
What's Learned
First-layer filters of a network trained to distinguish 1000 categories of objects
Remember: these filters go over color.
Figure credit: Karpathy and Fei-Fei
Visualizing Later Filters
[Input 227×227×3 → Conv1 55×55×96 → Conv2 27×27×256]
Conv2 filters:
• Q: How many input dimensions?
• A: 96… hmmm
• What does the input mean?
• Uh, the, uh, previous slide
Visualizing Later Filters
• Understanding the meaning of the later filters from their values is typically impossible: too many input dimensions, and it's not even clear what the input means.
Understanding Later Filters
[Same AlexNet diagram]
View the network as: a CNN that extracts a 13×13×256 output, followed by a 2-hidden-layer neural network.
Understanding Later Filters
[Same AlexNet diagram]
Or: a CNN that extracts a 1×1×4096 feature, followed by a 1-hidden-layer NN.
Understanding Later Filters
[Input 227×227×3 → Conv1 through Conv5 → 13×13×256]
The CNN that extracts the 13×13×256 output.
Understanding Later Filters
Feed an image in, see what score the filter gives it. (A more pleasant version of a real neuroscience procedure.)
Which one's bigger? What image makes the output biggest?
[Two 13×13×256 activation volumes being compared]
Figure Credit: Girshick et al. CVPR 2014.
What's Up With the White Boxes?
[Diagram: 227×227×3 input, 13×13×384 activation volume]