It’s an old paradigm

• The first learning machine: the Perceptron (built at Cornell in 1960)
• The Perceptron was a linear classifier on top of a simple feature extractor:

  y = sign( ∑_{i=1}^{N} W_i F_i(X) + b )

• The vast majority of practical applications of ML today use glorified linear classifiers or glorified template matching.
• Designing a feature extractor requires considerable effort by experts.

slide by Marc’Aurelio Ranzato, Yann LeCun
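As a hedged sketch of that decision rule (the feature values, weights, and names below are made up for illustration, not from the slide):

```python
import numpy as np

def perceptron_predict(W, features, b):
    """Linear classifier on top of fixed, hand-designed features:
    y = sign(sum_i W_i * F_i(X) + b)."""
    return np.sign(np.dot(W, features) + b)

# Hypothetical example: 3 hand-designed feature values extracted from some input X
features = np.array([0.2, -1.0, 0.7])   # F_1(X), ..., F_N(X)
W = np.array([0.5, 0.1, -0.3])          # learned weights
b = 0.1                                  # learned bias
print(perceptron_predict(W, features, b))  # -1.0 for this input
```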
Hierarchical Compositionality

VISION: pixels → edge → texton → motif → part → object
SPEECH: sample → spectral band → formant → motif → phone → word
NLP: character → word → NP/VP/… → clause → sentence → story

slide by Marc’Aurelio Ranzato, Yann LeCun
Building A Complicated Function

Given a library of simple functions, compose them into a complicated function.

slide by Marc’Aurelio Ranzato, Yann LeCun
Building A Complicated Function

Given a library of simple functions, compose them into a complicated function.

Idea 1: Linear Combinations
• Boosting
• Kernels
• …

slide by Marc’Aurelio Ranzato, Yann LeCun
Building A Complicated Function

Given a library of simple functions, compose them into a complicated function.

Idea 2: Compositions
• Deep Learning
• Grammar models
• Scattering transforms…

slide by Marc’Aurelio Ranzato, Yann LeCun
Deep Learning = Hierarchical Compositionality

“car”

Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier

Feature visualization of convolutional net trained on ImageNet, from [Zeiler & Fergus 2013]
slide by Marc’Aurelio Ranzato, Yann LeCun
Sparse DBNs [Lee et al. ICML ’09]
Figure courtesy: Quoc Le
slide by Dhruv Batra
Three key ideas

• (Hierarchical) Compositionality
  - Cascade of non-linear transformations
  - Multiple layers of representations
• End-to-End Learning
  - Learning (goal-driven) representations
  - Learning to feature extract
• Distributed Representations
  - No single neuron “encodes” everything
  - Groups of neurons work together

slide by Dhruv Batra
Traditional Machine Learning

VISION: hand-crafted features (SIFT/HOG, fixed) → your favorite classifier (learned) → “car”
SPEECH: hand-crafted features (MFCC, fixed) → your favorite classifier (learned) → \ˈdēp\
NLP: “This burrito place is yummy and fun!” → hand-crafted features (Bag-of-words, fixed) → your favorite classifier (learned) → “+”

slide by Marc’Aurelio Ranzato, Yann LeCun
Traditional Machine Learning (more accurately)

VISION: SIFT/HOG (fixed) → K-Means/pooling (“learned”, unsupervised) → classifier (supervised) → “car”
SPEECH: MFCC (fixed) → Mixture of Gaussians (“learned”, unsupervised) → classifier (supervised) → \ˈdēp\
NLP: “This burrito place is yummy and fun!” → n-grams (fixed) → Parse Tree Syntactic (“learned”, unsupervised) → classifier (supervised) → “+”

slide by Marc’Aurelio Ranzato, Yann LeCun
Deep Learning = End-to-End Learning

(the same VISION, SPEECH, and NLP pipelines as above)

slide by Marc’Aurelio Ranzato, Yann LeCun
Deep Learning = End-to-End Learning

• A hierarchy of trainable feature transforms
  - Each module transforms its input representation into a higher-level one.
  - High-level features are more global and more invariant.
  - Low-level features are shared among categories.

Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier

Learned Internal Representations

slide by Marc’Aurelio Ranzato, Yann LeCun
“Shallow” vs Deep Learning

• “Shallow” models: hand-crafted Feature Extractor (fixed) → “Simple” Trainable Classifier (learned)
• Deep models: Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier → Trainable Feature-Transform / Classifier (Learned Internal Representations)

slide by Marc’Aurelio Ranzato, Yann LeCun
Three key ideas

• (Hierarchical) Compositionality
  - Cascade of non-linear transformations
  - Multiple layers of representations
• End-to-End Learning
  - Learning (goal-driven) representations
  - Learning to feature extract
• Distributed Representations
  - No single neuron “encodes” everything
  - Groups of neurons work together

slide by Dhruv Batra
Localist representations

• The simplest way to represent things with neural networks is to dedicate one neuron to each thing.
  - Easy to understand.
  - Easy to code by hand.
  - Easy to learn.
  - Easy to associate with other representations or responses.
• Often used to represent inputs to a net.
• This is what mixture models do: each cluster corresponds to one neuron.
• But localist models are very inefficient whenever the data has componential structure.

slide by Geoff Hinton; image credit: Moontae Lee
Distributed Representations

• Each neuron must represent something, so this must be a local representation.
• Distributed representation means a many-to-many relationship between two types of representation (such as concepts and neurons).
  - Each concept is represented by many neurons.
  - Each neuron participates in the representation of many concepts.

Local vs Distributed
slide by Geoff Hinton; image credit: Moontae Lee
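A toy numeric contrast (entirely illustrative; binary codes stand in for neuron activity):

```python
import numpy as np

# Localist: one neuron per concept -> 8 concepts need 8 neurons (one-hot codes).
localist = np.eye(8, dtype=int)

# Distributed: each concept is a pattern over a few neurons, and each neuron
# takes part in many concepts -> 3 binary neurons can distinguish 2^3 = 8 concepts.
distributed = np.array([[(c >> i) & 1 for i in range(3)] for c in range(8)])

print(localist[5])     # [0 0 0 0 0 1 0 0]  one active neuron "encodes" concept 5
print(distributed[5])  # [1 0 1]            several neurons share the encoding
```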
Power of distributed representations!

Scene classification: bedroom, mountain, …

• Possible internal representations:
  - Objects
  - Scene attributes
  - Object parts
  - Textures

B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object Detectors Emerge in Deep Scene CNNs”, ICLR 2015
slide by Bolei Zhou
Deep Convolutional Neural Networks
Convolutions
slide by Yisong Yue

Convolution Filters
slide by Yisong Yue

Gabor Filters
slide by Yisong Yue

Gaussian Blur Filters
slide by Yisong Yue
Convolutional Neural Networks
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Convolution Layer

32x32x3 image (width 32, height 32, depth 3)

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products.”

Filters always extend the full depth of the input volume.

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Convolution Layer

32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product + bias)

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Convolution Layer

32x32x3 image, 5x5x3 filter → convolve (slide) over all spatial locations → 28x28x1 activation map

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Convolution Layer

Consider a second, green filter: 32x32x3 image, 5x5x3 filter → convolve (slide) over all spatial locations → a second 28x28x1 activation map

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Convolution Layer

For example, if we had six 5x5 filters, we’ll get 6 separate activation maps. We stack these up to get a “new image” of size 28x28x6!

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
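A minimal NumPy sketch of this layer (stride 1, no padding; the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def conv_layer(x, filters, biases):
    """Naive convolution layer: x is HxWxD, filters is KxFxFxD (K filters that
    extend the full input depth), biases has length K. Stride 1, no padding."""
    H, W, D = x.shape
    K, F, _, _ = filters.shape
    out = np.zeros((H - F + 1, W - F + 1, K))
    for k in range(K):                        # one activation map per filter
        for i in range(H - F + 1):
            for j in range(W - F + 1):
                chunk = x[i:i+F, j:j+F, :]    # a small FxFxD chunk of the image
                out[i, j, k] = np.sum(chunk * filters[k]) + biases[k]  # dot product + bias
    return out

x = np.random.randn(32, 32, 3)    # 32x32x3 image
w = np.random.randn(6, 5, 5, 3)   # six 5x5x3 filters
print(conv_layer(x, w, np.zeros(6)).shape)  # (28, 28, 6) -- the "new image"
```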
Preview: a ConvNet is a sequence of Convolution Layers, interspersed with activation functions.

32x32x3 → CONV, ReLU (e.g. six 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. ten 5x5x6 filters) → 24x24x10 → CONV, ReLU → …

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Preview [from recent Yann LeCun slides]
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
One filter => one activation map. Example: 5x5 filters (32 total).

We call the layer convolutional because it is related to convolution of two signals: elementwise multiplication and sum of a filter and the signal (image).

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Preview
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
A closer look at spatial dimensions:

32x32x3 image, 5x5x3 filter → convolve (slide) over all spatial locations → 28x28x1 activation map

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
A closer look at spatial dimensions:

7x7 input (spatially), assume 3x3 filter => 5x5 output

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
A closer look at spatial dimensions:

7x7 input (spatially), assume 3x3 filter applied with stride 2 => 3x3 output!

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
A closer look at spatial dimensions:

7x7 input (spatially), assume 3x3 filter applied with stride 3? Doesn’t fit! Cannot apply a 3x3 filter on a 7x7 input with stride 3.

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
• stride 1 => (7 - 3)/1 + 1 = 5
• stride 2 => (7 - 3)/2 + 1 = 3
• stride 3 => (7 - 3)/3 + 1 = 2.33 (doesn’t fit!)

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
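A tiny helper encoding this formula (illustrative, not from the slides):

```python
def conv_output_size(N, F, stride):
    """Spatial output size for an NxN input and FxF filter: (N - F) / stride + 1.
    Returns None when the filter placements don't tile the input evenly."""
    if (N - F) % stride != 0:
        return None                 # e.g. N=7, F=3, stride 3 -> 2.33: doesn't fit
    return (N - F) // stride + 1

print(conv_output_size(7, 3, 1))    # 5
print(conv_output_size(7, 3, 2))    # 3
print(conv_output_size(7, 3, 3))    # None
```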
In practice: common to zero pad the border.

e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?
(recall: (N - F) / stride + 1, where padding makes N = 7 + 2 = 9) => 7x7 output!

In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F - 1)/2, which preserves size spatially:
• F = 3 => zero pad with 1
• F = 5 => zero pad with 2
• F = 7 => zero pad with 3

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
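Extending the formula to padded inputs gives (N + 2P - F)/stride + 1; again an illustrative sketch:

```python
def conv_output_size_padded(N, F, stride, pad):
    """Spatial output size with zero padding: (N + 2*pad - F) / stride + 1."""
    return (N + 2 * pad - F) // stride + 1

# 7x7 input, 3x3 filter, stride 1, 1-pixel zero-padding -> size is preserved
print(conv_output_size_padded(7, 3, 1, pad=1))                 # 7

# the size-preserving rule for stride 1: pad = (F - 1) / 2
for F in (3, 5, 7):
    print(F, conv_output_size_padded(7, F, 1, (F - 1) // 2))   # always 7
```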
Remember back to… e.g. a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially (32 -> 28 -> 24 ...). Shrinking too fast is not good; it doesn’t work well.

32x32x3 → CONV, ReLU (e.g. six 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. ten 5x5x6 filters) → 24x24x10 → …

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
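A quick sketch of that shrinkage (assuming 5x5 filters, stride 1, no padding, as in the slide):

```python
# Repeatedly applying a 5x5 filter with stride 1 and no padding: N -> N - 5 + 1
size, sizes = 32, [32]
while size - 5 + 1 > 0:
    size = size - 5 + 1
    sizes.append(size)
print(sizes)  # [32, 28, 24, 20, 16, 12, 8, 4] -- the volume vanishes after a few layers
```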
Recap: Convolution Layer (no padding, no strides)

Convolving a 3×3 kernel over a 4×4 input using unit strides (i.e., i = 4, k = 3, s = 1 and p = 0).

Image credit: Vincent Dumoulin and Francesco Visin
Computing the output values of a 2D discrete convolution with i1 = i2 = 5, k1 = k2 = 3, s1 = s2 = 2, and p1 = p2 = 1.

Image credit: Vincent Dumoulin and Francesco Visin
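Plugging these numbers into the padded output-size formula confirms the result (a quick self-contained check):

```python
# i = 5, k = 3, s = 2, p = 1: (i + 2*p - k) // s + 1
print((5 + 2 * 1 - 3) // 2 + 1)  # 3 -> a 3x3 output map
```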
Examples time:

Input volume: 32x32x3; 10 5x5 filters with stride 1, pad 2.

Output volume size: (32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10.

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
Examples time:

Input volume: 32x32x3; 10 5x5 filters with stride 1, pad 2.

Number of parameters in this layer? Each filter has 5*5*3 + 1 = 76 params (+1 for the bias) => 76 * 10 = 760.

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
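The same arithmetic as a quick sketch (variable names follow the slides’ notation):

```python
N, D = 32, 3                 # 32x32x3 input volume
K, F, S, P = 10, 5, 1, 2     # ten 5x5 filters, stride 1, pad 2

out_spatial = (N + 2 * P - F) // S + 1
print(out_spatial, out_spatial, K)   # 32 32 10 -> 32x32x10 output volume

params_per_filter = F * F * D + 1    # 5*5*3 weights + 1 bias = 76
print(params_per_filter * K)         # 760 parameters in the layer
```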
Common settings:

• K (number of filters) = powers of 2, e.g. 32, 64, 128, 512
• F = 3, S = 1, P = 1
• F = 5, S = 1, P = 2
• F = 5, S = 2, P = ? (whatever fits)
• F = 1, S = 1, P = 0

(F = filter size, S = stride, P = zero padding)

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
(btw, 1x1 convolution layers make perfect sense)

56x56x64 → 1x1 CONV with 32 filters → 56x56x32
(each filter has size 1x1x64, and performs a 64-dimensional dot product)

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
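A minimal NumPy sketch of why a 1x1 convolution is just a per-location dot product (shapes follow the slide; the random data is illustrative):

```python
import numpy as np

x = np.random.randn(56, 56, 64)   # 56x56x64 input volume
w = np.random.randn(32, 64)       # 32 filters, each of size 1x1x64
b = np.zeros(32)

# A 1x1 convolution is a 64-dimensional dot product at every spatial position,
# i.e. the same small linear map applied to each of the 56*56 depth fibers.
out = np.einsum('hwd,kd->hwk', x, w) + b
print(out.shape)                   # (56, 56, 32)
```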
Example: CONV layer in Torch
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Example: CONV layer in Caffe
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Example: CONV layer in Lasagne
slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
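The original slides show framework screenshots that do not survive extraction. As a hedged illustration, a Lasagne layer matching the running example (ten 5x5 filters, stride 1, pad 2) might look like the sketch below; the exact code on the slide may differ:

```python
import lasagne
from lasagne.layers import InputLayer, Conv2DLayer

# 32x32x3 input; Lasagne uses (batch, channels, height, width) ordering
l_in = InputLayer(shape=(None, 3, 32, 32))

# Ten 5x5 filters, stride 1, zero-padding 2 -> 32x32 spatially, with ReLU
l_conv = Conv2DLayer(l_in, num_filters=10, filter_size=(5, 5),
                     stride=1, pad=2,
                     nonlinearity=lasagne.nonlinearities.rectify)

print(l_conv.output_shape)  # (None, 10, 32, 32)
```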
The brain/neuron view of CONV Layer

32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and this part of the image (i.e. a 5*5*3 = 75-dimensional dot product). It’s just a neuron with local connectivity...

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson