A Shallow Introduction to Deep Learning for Computer Vision Ramprasaath
Lecture Outline
• Computer Vision
  – Before the (Image/Alex)Net era (Summer 1956-2012)
  – After the (Image/Alex)Net era (2012 - present)
• Neural Networks (Brief Introduction)
• Need for CNNs
• Visualizing, Understanding and Analyzing ConvNets
• Transfer Learning
• Going beyond Classification:
  – Localization
  – Detection
  – Segmentation
  – Depth Estimation
  – Video Classification
  – Image Ranking and Retrieval
  – Image Captioning
  – Visual Question Answering
Where are we • Computer Vision • Before (Image/Alex)Net era – (Summer 1956-2012)
1956 Dartmouth AI Project: "We propose that a 2 month, 10 man study of artificial intelligence be carried out during the summer of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer." http://www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html
1956 Dartmouth AI Project: five of the attendees of the 1956 Dartmouth Summer Research Project on AI reunited in 2006: Trenchard More, John McCarthy, Marvin Minsky, Oliver Selfridge, and Ray Solomonoff. Missing were: Arthur Samuel, Herbert Simon, Allen Newell, Nathaniel Rochester and Claude Shannon.
The beginning of Computer Vision • During the summer of 1966, the late Marvin Minsky, then a professor at MIT, asked a student to attach a camera to a computer and write an algorithm that would allow the computer to describe what it sees.
Example: Color (Hue) Histogram. Each pixel votes +1 into one of the hue bins; the resulting counts form the image descriptor. (Slide Credit: Fei-Fei Li & Andrej Karpathy)
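A minimal sketch of the idea, assuming OpenCV and NumPy are available; the image path and bin count are illustrative, not from the slide:

```python
import cv2
import numpy as np

def hue_histogram(path, n_bins=16):
    """Count how many pixels fall into each hue bin."""
    img = cv2.imread(path)                      # BGR image, H x W x 3
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)  # 8-bit hue lives in [0, 180)
    hue = hsv[:, :, 0].ravel()
    hist, _ = np.histogram(hue, bins=n_bins, range=(0, 180))
    return hist / hist.sum()                    # normalized color descriptor

# hue_histogram("cat.jpg")  # "cat.jpg" is a placeholder path
```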
Example: HOG features. Take an 8x8 pixel region and quantize the edge orientation into 9 bins (images from vlfeat.org). (Slide Credit: Fei-Fei Li & Andrej Karpathy)
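The same descriptor can be computed with scikit-image's `hog` (a library choice of ours, not the slide's); the settings below mirror the numbers above:

```python
from skimage import color, io
from skimage.feature import hog

img = color.rgb2gray(io.imread("cat.jpg"))  # placeholder path
features = hog(
    img,
    orientations=9,           # quantize edge orientation into 9 bins
    pixels_per_cell=(8, 8),   # 8x8 pixel regions
    cells_per_block=(2, 2),   # local block normalization
)
print(features.shape)
```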
Example: Bag of Words. 1. Resize each patch to a fixed size (e.g. 32x32 pixels). 2. Extract HOG on the patch (144 numbers: 4x4 cells of 8x8 pixels, 9 orientation bins each). Repeating this for each detected feature gives a matrix of size [number_of_features x 144]. (Slide Credit: Fei-Fei Li & Andrej Karpathy)
Example: Bag of Words (continued). Learn k-means centroids over the 144-d descriptors to build a "vocabulary" of visual words (e.g. 1000 centroids). Each image is then represented as a 1000-d histogram of visual words: assign every descriptor to its nearest centroid and count. (Slide Credit: Fei-Fei Li & Andrej Karpathy)
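Putting the two slides together, a hedged bag-of-words sketch with scikit-learn; the random matrix stands in for real HOG descriptors pooled over the training set:

```python
import numpy as np
from sklearn.cluster import KMeans

k = 1000                                   # e.g. 1000 centroids
descriptors = np.random.rand(50_000, 144)  # stand-in [number_of_features x 144]

vocab = KMeans(n_clusters=k, n_init=1).fit(descriptors)  # visual-word vocabulary

def bow_histogram(image_descriptors):
    """Assign each 144-d descriptor to its nearest centroid, then count."""
    words = vocab.predict(image_descriptors)
    hist = np.bincount(words, minlength=k)
    return hist / max(hist.sum(), 1)       # 1000-d histogram of visual words
```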
Traditional object recognition pipeline: fixed hand-crafted descriptors (SIFT, HoG) → unsupervised feature coding (K-means, Sparse Coding) → Pooling → supervised Classifier.
Most recognition systems are built on the same architecture. CNNs: end-to-end models. (Slide from Yann LeCun; Fei-Fei Li & Andrej Karpathy, Lecture 4)
Lecture Outline • Computer Vision • After (Image/Alex)Net era – (2012 – present)
ImageNet challenge winners over the years:
• Year 2010: NEC-UIUC. Dense grid descriptors (HOG, LBP); coding: local coordinate, super-vector; pooling, SPM; linear SVM [Lin CVPR 2011]
• Year 2012: SuperVision. Convolution, pooling, softmax [Krizhevsky NIPS 2012]
• Year 2014: GoogLeNet [Szegedy arXiv 2014] and VGG [Simonyan arXiv 2014]
• Year 2015: MSRA. "Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun [He arXiv 2015]
Lecture Outline • Brief Introduction to Neural Networks
Neural Networks: Architectures. Neurons organized into "fully-connected" layers. (Slide Credit: Fei-Fei Li & Andrej Karpathy)
Neural Networks: Architectures. A "3-layer Neural Net" (or "2-hidden-layer Neural Net") and a "2-layer Neural Net" (or "1-hidden-layer Neural Net"), both built from "fully-connected" layers. (Slide Credit: Fei-Fei Li & Andrej Karpathy)
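For concreteness, a forward pass of the 2-layer ("1-hidden-layer") net in NumPy; the layer sizes are arbitrary assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

D, H, C = 3072, 100, 10                  # input dim, hidden units, classes
W1, b1 = 0.01 * np.random.randn(D, H), np.zeros(H)
W2, b2 = 0.01 * np.random.randn(H, C), np.zeros(C)

x = np.random.randn(D)                   # one flattened input image
h = relu(x @ W1 + b1)                    # fully-connected hidden layer
scores = h @ W2 + b2                     # fully-connected output layer
```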
Where are we • Need for CNNs
Fully Connected Layer. Example: a 200x200 image with 40K hidden units needs ~2B parameters (200 × 200 inputs × 40,000 units = 1.6B weights)! Problems: spatial correlation is local, so this is a waste of resources, and we don't have enough training samples anyway. (Slide Credit: Marc'Aurelio Ranzato)
Locally Connected Layer. Example: 200x200 image, 40K hidden units, filter size 10x10: 4M parameters. Note: this parameterization is good when the input image is registered (e.g., face recognition). (Slide Credit: Marc'Aurelio Ranzato)
Locally Connected Layer: Stationarity? Image statistics are similar at different locations, so there is no need to learn a separate filter for every position. (Slide Credit: Marc'Aurelio Ranzato)
Convolutional Layer. Locality: nearby pixels are correlated. Share the same parameters across different locations (assuming the input is stationary): convolutions with learned kernels. (Slide Credit: Marc'Aurelio Ranzato)
Convolutional Layer. Learn multiple filters. E.g.: 200x200 image, 100 filters, filter size 10x10: 10K parameters. (Slide Credit: Marc'Aurelio Ranzato)
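The three parameter counts from the last few slides, as back-of-the-envelope arithmetic (biases omitted; the fully-connected figure is the same order as the slide's ~2B):

```python
fc    = (200 * 200) * 40_000  # every unit sees every pixel    -> 1.6B weights
local = 40_000 * (10 * 10)    # each unit sees one 10x10 patch -> 4M
conv  = 100 * (10 * 10)       # 100 shared 10x10 filters       -> 10K

print(f"fully connected:   {fc:,}")
print(f"locally connected: {local:,}")
print(f"convolutional:     {conv:,}")
```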
Pooling Layer. Let us assume the filter is an "eye" detector. Q: how can we make the detection robust to the exact location of the eye? (Slide Credit: Marc'Aurelio Ranzato)
Pooling Layer. By "pooling" (e.g., taking the max of) filter responses at different locations, we gain robustness to the exact spatial location of features. (Slide Credit: Marc'Aurelio Ranzato)
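A minimal 2x2 max-pooling sketch in NumPy: the max survives small shifts of a feature inside each window, which is exactly the robustness argued above.

```python
import numpy as np

def max_pool_2x2(x):
    """x: (H, W) map of filter responses, H and W even."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.random.rand(8, 8)      # stand-in "eye detector" response map
print(max_pool_2x2(fmap).shape)  # (4, 4)
```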
Hyperparameters to play with (a code sketch follows below):
• network architecture
• learning rate, its decay schedule, update type
• regularization (L2/L1/Maxnorm/Dropout)
• loss to use (e.g. SVM/Softmax)
• initialization
(Slide analogy: the neural networks practitioner at the mixing board, where the music is the loss function.) (Slide Credit: Fei-Fei Li & Andrej Karpathy)
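A hedged sketch of where these knobs live in a PyTorch training setup; the model and the specific values are illustrative, not from the slides:

```python
import torch

model = torch.nn.Sequential(          # network architecture
    torch.nn.Linear(3072, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
)
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-2,                          # learning rate
    weight_decay=1e-4,                # L2 regularization
)
scheduler = torch.optim.lr_scheduler.StepLR(  # decay schedule
    optimizer, step_size=10, gamma=0.5)
loss_fn = torch.nn.CrossEntropyLoss()         # Softmax loss
```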
A classical recognition architecture stacks two feature stages and a classifier: [Norm → Filter Bank → Non-Linearity → feature Pooling] → [Norm → Filter Bank → Non-Linearity → feature Pooling] → Classifier. Normalization: e.g. contrast normalization. Filter Bank: matrix multiplication. Non-Linearity: e.g. ReLU. Pooling: aggregation over space or feature type. (Slide: Y. LeCun, M.A. Ranzato)
Fast-forward to today. [From recent Yann LeCun slides]
SHALLOW → DEEP. The model landscape spans supervised and unsupervised methods, neural networks and probabilistic models. Roughly from shallow to deep: Perceptron, SVM, Boosting, Decision Tree, GMM, Sparse Coding, BayesNP and RBM on the shallow side; Neural Net, RNN, Conv. Net, AE, D-AE, DBN and DBM on the deep side. (Slide: Y. LeCun, M.A. Ranzato)
Demo: convnetjs example of training on CIFAR-10. (Slide Credit: Fei-Fei Li & Andrej Karpathy)
Convolutional Layer: just like a normal hidden layer, BUT: neurons connect to the input only within a local receptive field, and all neurons in a single depth slice share weights. (Slide Credit: Fei-Fei Li & Andrej Karpathy)
The weights of this neuron, visualized. (Slide Credit: Fei-Fei Li & Andrej Karpathy)
Convolving the first filter with the input gives the first depth slice of the output volume. (Slide Credit: Fei-Fei Li & Andrej Karpathy)
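A naive sketch of that operation in NumPy: slide one K x K x C filter over the input volume (stride 1, no padding) to produce a single depth slice; the shapes are assumptions:

```python
import numpy as np

def conv_slice(x, w, b=0.0):
    """x: (H, W, C) input volume; w: (K, K, C) one learned filter."""
    H, W, C = x.shape
    K = w.shape[0]
    out = np.zeros((H - K + 1, W - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + K, j:j + K, :] * w) + b
    return out                        # one depth slice of the output volume

x = np.random.rand(32, 32, 3)
w = np.random.rand(5, 5, 3)
print(conv_slice(x, w).shape)         # (28, 28)
```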
Visualizing Learned Filters. (Figure Credit: [Zeiler & Fergus ECCV14])
Q: What is the learned CNN representation? A CNN transforms the image into 4096 numbers (the "CNN code") that are then linearly classified. The final layers of a VGG-style network (a quick arithmetic check follows below):
• POOL2: [14x14x512], memory: 14*14*512 = 100K, params: 0
• CONV3-512: [14x14x512], memory: 14*14*512 = 100K, params: (3*3*512)*512 = 2,359,296 (three such layers)
• POOL2: [7x7x512], memory: 7*7*512 = 25K, params: 0
• FC: [1x1x4096], memory: 4096, params: 7*7*512*4096 = 102,760,448
• FC: [1x1x4096], memory: 4096, params: 4096*4096 = 16,777,216
• FC: [1x1x1000], memory: 1000, params: 4096*1000 = 4,096,000
TOTAL memory: 24M values * 4 bytes ≈ 93MB per image (forward only; roughly ×2 for backward). TOTAL params: 138M. (Slide Credit: Fei-Fei Li & Andrej Karpathy)
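The bookkeeping above is easy to verify; a quick check of a few of the rows (bias terms ignored, as on the slide):

```python
def conv3(h, w, c_in, c_out):
    """3x3 conv layer: (activation memory, parameter count)."""
    return h * w * c_out, (3 * 3 * c_in) * c_out

print(conv3(14, 14, 512, 512))  # (100352, 2359296) -- matches CONV3-512
print(7 * 7 * 512 * 4096)       # 102,760,448 -- first FC layer
print(4096 * 4096)              # 16,777,216  -- second FC layer
print(4096 * 1000)              # 4,096,000   -- classifier FC layer
```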
Visualizing the CNN code representation ("CNN code" = the 4096-d vector just before the classifier): take a query image and find its nearest neighbors in the "code" space. (But we'd like a more global way to visualize the distances.) (Slide Credit: Fei-Fei Li & Andrej Karpathy)
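A sketch of that retrieval, assuming `codes` holds one 4096-d CNN code per database image (random stand-ins here):

```python
import numpy as np

codes = np.random.rand(10_000, 4096)    # database of CNN codes
query = np.random.rand(4096)            # code of the query image

# cosine similarity: L2-normalize, then dot product
db = codes / np.linalg.norm(codes, axis=1, keepdims=True)
q = query / np.linalg.norm(query)
nearest = np.argsort(db @ q)[::-1][:5]  # indices of the 5 nearest neighbors
print(nearest)
```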