Administrative
- A2 has a number of corrections on Piazza. They are fixed in the most recent .zip file.
- Btw, CNNs in Matlab: http://www.vlfeat.org/matconvnet/
Fei-Fei Li & Andrej Karpathy, Lecture 8, 2 Feb 2015
[Simonyan et al. 2014]
Where we are...
[Figure: the "before" network (input layer, hidden layer 1, hidden layer 2, output layer) contrasted with the "now" ConvNet layout]
Every stage in a ConvNet has activations with three dimensions: height, width, and depth.
[Figure: example ConvNet: CONV-ReLU, CONV-ReLU, POOL, CONV-ReLU, CONV-ReLU, POOL, CONV-ReLU, CONV-ReLU, POOL, FC (fully-connected)]
Typical ConvNets look like:
[CONV-RELU-POOL]xN,[FC-RELU]xM,FC,SOFTMAX
or
[CONV-RELU-CONV-RELU-POOL]xN,[FC-RELU]xM,FC,SOFTMAX
N >= 0, M >= 0
Note: the last FC layer should not have a RELU; these are the class scores.
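The two patterns above can be expanded mechanically. The following is a minimal sketch, not part of the lecture; the function name `convnet_layers` and the `double_conv` flag are hypothetical and only serve to illustrate the notation.

```python
def convnet_layers(n, m, double_conv=False):
    """Expand [CONV-RELU(-CONV-RELU)-POOL]xN, [FC-RELU]xM, FC, SOFTMAX
    into a flat list of layer names (hypothetical helper)."""
    if n < 0 or m < 0:
        raise ValueError("N and M must be >= 0")
    # One conv block is CONV-RELU (optionally repeated twice) followed by POOL.
    conv_block = ["CONV", "RELU"] * (2 if double_conv else 1) + ["POOL"]
    layers = conv_block * n + ["FC", "RELU"] * m
    # The last FC layer produces the class scores, so no RELU follows it.
    layers += ["FC", "SOFTMAX"]
    return layers


print(convnet_layers(2, 1))
print(convnet_layers(1, 1, double_conv=True))
```

With N = 0 and M = 0 the list collapses to FC followed by SOFTMAX, i.e. a linear classifier.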
Convolutional Layer
Just like a normal hidden layer, BUT:
- Connect neurons to the input in a local receptive field
- All neurons in a single depth slice share weights
The weights of this neuron visualized
Convolving the first filter with the input gives the first depth slice of the output volume.
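To make the local receptive field and weight sharing concrete, here is a deliberately slow NumPy sketch of a convolutional forward pass. The function name, the (H, W, C) memory layout, and the random inputs are assumptions made for illustration; the lecture does not prescribe an implementation.

```python
import numpy as np


def conv_forward_naive(x, w, b, stride=1, pad=0):
    """x: (H, W, C) input volume; w: (K, F, F, C) filters; b: (K,) biases.
    Returns an output volume of shape (H_out, W_out, K)."""
    H, W, C = x.shape
    K, F, _, _ = w.shape
    # Zero-pad only the two spatial dimensions.
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    H_out = (H + 2 * pad - F) // stride + 1
    W_out = (W + 2 * pad - F) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for k in range(K):              # filter k produces depth slice k
        for i in range(H_out):
            for j in range(W_out):
                # Local receptive field: an F x F window spanning the full depth.
                patch = xp[i * stride:i * stride + F, j * stride:j * stride + F, :]
                # The same weights w[k] are reused at every (i, j): weight sharing.
                out[i, j, k] = np.sum(patch * w[k]) + b[k]
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))     # stand-in input volume
w = rng.standard_normal((5, 3, 3, 3))    # five 3x3 filters over 3 input channels
b = np.zeros(5)
out = conv_forward_naive(x, w, b, stride=1, pad=1)
print(out.shape)
```

Here `out[:, :, 0]` is the first depth slice, produced entirely by the first filter `w[0]`.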
Max Pooling Layer
Downsampling: 32x32 -> 16x16
Single depth slice (x across, y down):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4
Max pool with 2x2 filters and stride 2:
6 8
3 4
The pooling layer downsamples every activation map in the input independently with max.
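The 4x4 example from the slide can be reproduced with a few lines of NumPy. This is a sketch for a single depth slice; the function name `max_pool_2d` is hypothetical.

```python
import numpy as np


def max_pool_2d(a, size=2, stride=2):
    """Max-pool one 2D activation map with a square window."""
    H, W = a.shape
    H_out = (H - size) // stride + 1
    W_out = (W - size) // stride + 1
    out = np.empty((H_out, W_out), dtype=a.dtype)
    for i in range(H_out):
        for j in range(W_out):
            window = a[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out


depth_slice = np.array([[1, 1, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])
pooled = max_pool_2d(depth_slice)
print(pooled)  # rows: [6 8] and [3 4]
```

For a full volume, the same function would be applied to each depth slice independently, so the depth is unchanged.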
Modern CNNs trend toward:
- Small filter sizes (3x3 and less)
- Small pooling sizes (2x2 and less)
- Small strides (stride = 1, ideally)
- Deep
- Conv Layers should pad with zeros to not reduce spatial size
- Pool Layers should reduce size once in a while
- Eventually Fully-Connected Layers take over
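The effect of zero padding and stride on spatial size follows the standard output-size relation (W - F + 2P) / S + 1, which is not spelled out on the slide and is used here as an assumption. A small sketch with a hypothetical helper name:

```python
def conv_output_size(w_in, f, stride=1, pad=0):
    """Spatial output size of a conv or pool layer: (W - F + 2P) / S + 1."""
    span = w_in + 2 * pad - f
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return span // stride + 1


# A 3x3 conv with stride 1 keeps the spatial size when padded with (3 - 1) / 2 = 1 zeros.
print(conv_output_size(224, 3, stride=1, pad=1))   # 224
# Without padding, every 3x3 conv shrinks the map by 2 pixels.
print(conv_output_size(224, 3, stride=1, pad=0))   # 222
# A 2x2 pool with stride 2 halves the spatial size.
print(conv_output_size(224, 2, stride=2, pad=0))   # 112
```

This is why the padded conv layers leave the size alone and only the occasional pool layer reduces it.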
(not counting biases)
INPUT:     [224x224x3]    memory: 224*224*3=150K     params: 0
CONV3-64:  [224x224x64]   memory: 224*224*64=3.2M    params: (3*3*3)*64 = 1,728
CONV3-64:  [224x224x64]   memory: 224*224*64=3.2M    params: (3*3*64)*64 = 36,864
POOL2:     [112x112x64]   memory: 112*112*64=800K    params: 0
CONV3-128: [112x112x128]  memory: 112*112*128=1.6M   params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128]  memory: 112*112*128=1.6M   params: (3*3*128)*128 = 147,456
POOL2:     [56x56x128]    memory: 56*56*128=400K     params: 0
CONV3-256: [56x56x256]    memory: 56*56*256=800K     params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256]    memory: 56*56*256=800K     params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256]    memory: 56*56*256=800K     params: (3*3*256)*256 = 589,824
POOL2:     [28x28x256]    memory: 28*28*256=200K     params: 0
CONV3-512: [28x28x512]    memory: 28*28*512=400K     params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512]    memory: 28*28*512=400K     params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512]    memory: 28*28*512=400K     params: (3*3*512)*512 = 2,359,296
POOL2:     [14x14x512]    memory: 14*14*512=100K     params: 0
CONV3-512: [14x14x512]    memory: 14*14*512=100K     params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]    memory: 14*14*512=100K     params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]    memory: 14*14*512=100K     params: (3*3*512)*512 = 2,359,296
POOL2:     [7x7x512]      memory: 7*7*512=25K        params: 0
FC:        [1x1x4096]     memory: 4096               params: 7*7*512*4096 = 102,760,448
FC:        [1x1x4096]     memory: 4096               params: 4096*4096 = 16,777,216
FC:        [1x1x1000]     memory: 1000               params: 4096*1000 = 4,096,000
TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
Note: Most memory is in early CONV. Most params are in late FC.
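The parameter column can be checked with a short tally. The sketch below encodes the table in a hypothetical list format (integers are CONV3 output depths, "P" is a POOL2 layer) and, like the table, ignores biases.

```python
# Ints are CONV3 output depths, "P" is a POOL2 layer (hypothetical encoding of the table).
CFG = [64, 64, "P", 128, 128, "P", 256, 256, 256, "P",
       512, 512, 512, "P", 512, 512, 512, "P"]
FC_UNITS = [4096, 4096, 1000]


def count_params(cfg=CFG, fc_units=FC_UNITS, in_size=224, in_depth=3):
    """Return (conv_params, fc_params), not counting biases."""
    size, depth = in_size, in_depth
    conv_params = 0
    for layer in cfg:
        if layer == "P":
            size //= 2                            # POOL2 halves height and width, no params
        else:
            conv_params += 3 * 3 * depth * layer  # 3x3 filters spanning the full input depth
            depth = layer
    fc_params = 0
    fan_in = size * size * depth                  # 7*7*512 after the last POOL2
    for units in fc_units:
        fc_params += fan_in * units
        fan_in = units
    return conv_params, fc_params


conv_params, fc_params = count_params()
total = conv_params + fc_params
print(f"CONV params: {conv_params:,}")
print(f"FC params:   {fc_params:,}")
print(f"TOTAL:       {total:,}")
```

The tally comes to 138,344,128 parameters, of which 123,633,664 sit in the three FC layers, consistent with the note that most parameters are in the late FC layers.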
[Simonyan et al. 2014]
Q: What are the properties of the learned CNN representation?
A CNN transforms the image to 4096 numbers (the “CNN code”) that are then linearly classified.
...
POOL2:     [14x14x512]    memory: 14*14*512=100K     params: 0
CONV3-512: [14x14x512]    memory: 14*14*512=100K     params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]    memory: 14*14*512=100K     params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]    memory: 14*14*512=100K     params: (3*3*512)*512 = 2,359,296
POOL2:     [7x7x512]      memory: 7*7*512=25K        params: 0
FC:        [1x1x4096]     memory: 4096               params: 7*7*512*4096 = 102,760,448
FC:        [1x1x4096]     memory: 4096               params: 4096*4096 = 16,777,216
FC:        [1x1x1000]     memory: 1000               params: 4096*1000 = 4,096,000
TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
Method 3: Visualizing the CNN code representation
(“CNN code” = 4096-D vector before classifier)
[Figure: query image and its nearest neighbors in the “code” space]
(But we’d like a more global way to visualize the distances)
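A nearest-neighbor lookup in code space could be sketched as follows. The slide does not state which distance is used, so Euclidean distance is an assumption here, and the random array stands in for real 4096-D CNN codes.

```python
import numpy as np


def nearest_neighbors(codes, query_index, k=5):
    """Return indices and distances of the k codes closest to codes[query_index].
    Euclidean distance is an assumption; the slide does not name a metric."""
    diffs = codes - codes[query_index]
    dists = np.sqrt(np.sum(diffs ** 2, axis=1))
    order = np.argsort(dists)
    order = order[order != query_index]   # drop the query itself
    top = order[:k]
    return top, dists[top]


rng = np.random.default_rng(0)
codes = rng.standard_normal((100, 4096))   # stand-in for CNN codes of 100 images
codes[7] = codes[3] + 0.001                # make image 7 a near-duplicate of image 3
neighbors, distances = nearest_neighbors(codes, query_index=3, k=5)
print(neighbors)
print(distances)
```

In this toy setup the first neighbor of image 3 is image 7, since its code was constructed to be almost identical.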
t-SNE visualization [van der Maaten & Hinton]
Embed high-dimensional points so that, locally, pairwise distances are preserved, i.e. similar things end up in similar places; dissimilar things end up wherever.
Right: example embedding of MNIST digits (0-9) in 2D.
t-SNE visualization: two images are placed nearby if their CNN codes are close.
See more: http://cs.stanford.edu/people/karpathy/cnnembed/
Q: What images maximize the score of some class in a ConvNet?
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014
1. Find images that maximize some class score.
Remember: this is the score for class c (before Softmax).
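The cited paper poses this as finding the image I that maximizes S_c(I) - lambda * ||I||_2^2, starting from a zero image and following the gradient with respect to the image. The sketch below shows that loop under a strong simplification: a hypothetical linear scorer stands in for the ConvNet so that the gradient is available in closed form. With a real network, `score_gradient` would be a backward pass to the input.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, NUM_CLASSES = 8, 8, 3, 10
# Hypothetical linear scorer standing in for a trained ConvNet.
weights = rng.standard_normal((NUM_CLASSES, H, W, C))
biases = np.zeros(NUM_CLASSES)


def class_score(image, c):
    """Unnormalized score S_c(I) for class c (before Softmax)."""
    return np.sum(weights[c] * image) + biases[c]


def score_gradient(image, c):
    """dS_c/dI. For the linear stand-in this is just the weights of class c."""
    return weights[c]


def maximize_class_score(c, lam=0.1, lr=0.5, steps=200):
    """Gradient ascent on S_c(I) - lam * ||I||^2, starting from a zero image."""
    image = np.zeros((H, W, C))
    for _ in range(steps):
        grad = score_gradient(image, c) - 2.0 * lam * image
        image += lr * grad
    return image


best = maximize_class_score(c=3)
print(class_score(np.zeros((H, W, C)), 3), class_score(best, 3))
```

For the linear stand-in the optimum is known to be `weights[c] / (2 * lam)`, which gives a way to check that the loop converges; a ConvNet has no such closed form.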
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, 2014
2. Visualize the Data gradient: M = ?
Note that the gradient on the data has three channels. Here they visualize M such that, at each pixel, we take the absolute value and then the max over channels.
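The reduction from a three-channel data gradient to a single-channel map M is a one-liner. In this sketch the gradient array is made up for illustration; in practice it would come from backpropagating a class score to the input image.

```python
import numpy as np


def saliency_map(data_gradient):
    """data_gradient: (H, W, 3) gradient of a class score w.r.t. the image.
    At each pixel, take the absolute value and then the max over channels."""
    return np.max(np.abs(data_gradient), axis=2)


# Made-up 2x2 gradient with three channels per pixel.
grad = np.array([[[0.1, -0.5, 0.2], [0.0, 0.3, -0.1]],
                 [[-0.9, 0.4, 0.4], [0.2, 0.2, -0.2]]])
M = saliency_map(grad)
print(M)  # rows: [0.5 0.3] and [0.9 0.2]
```

Taking the absolute value first means that strongly negative gradients count as salient too, as with the -0.9 entry above.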