Deep Learning Basics Lecture 6: Convolutional NN
Princeton University COS 495
Instructor: Yingyu Liang
Review: convolutional layers
Convolution: two dimensional case
Input (3x4):
a b c d
e f g h
i j k l
Kernel/filter (2x2):
w x
y z
Sliding the kernel over the input produces the feature map; for example, the first two entries are wa + bx + ey + fz and wb + cx + fy + gz.
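A minimal NumPy sketch of this sliding-window computation (the "valid" cross-correlation used in deep learning); the array names are illustrative, not from the slides.

```python
import numpy as np

def conv2d_valid(x, k):
    """Slide kernel k over input x; output is an (H-kh+1) x (W-kw+1) feature map."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.arange(12, dtype=float).reshape(3, 4)   # plays the role of the 3x4 input above
k = np.array([[1.0, 2.0], [3.0, 4.0]])         # plays the role of [w x; y z]
print(conv2d_valid(x, k))                       # 2x3 feature map
```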
Convolutional layers: the same weights are shared for all output nodes; m output nodes, kernel size k, n input nodes. Figure from Deep Learning, by Goodfellow, Bengio, and Courville
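One way to see the weight sharing: a 1D convolutional layer is a linear map whose matrix repeats the same kernel weights on every row. A small NumPy sketch (names illustrative), with n = 5 inputs, kernel size k = 3, and m = n - k + 1 = 3 output nodes:

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])      # the shared kernel weights
n, k = 5, 3
m = n - k + 1                      # number of output nodes
W = np.zeros((m, n))
for i in range(m):
    W[i, i:i + k] = w              # same weights on every row, just shifted

x = np.arange(n, dtype=float)
print(W @ x)                       # sliding dot products of w with x
```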
Terminology Figure from Deep Learning, by Goodfellow, Bengio, and Courville
Case study: LeNet-5
LeNet-5
• Proposed in "Gradient-based learning applied to document recognition", by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, Proceedings of the IEEE, 1998
• Applies convolution to 2D images (MNIST) and trains with backpropagation
• Structure: 2 convolutional layers (with pooling) + 3 fully connected layers
• Input size: 32x32x1
• Convolution kernel size: 5x5
• Pooling: 2x2
LeNet-5 Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
LeNet-5 Filter: 5x5, stride: 1x1, #filters: 6 Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
LeNet-5 Pooling: 2x2, stride: 2 Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
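A minimal sketch of 2x2 pooling with stride 2 in NumPy; average pooling is shown, matching the subsampling flavor used in LeNet-5 (names illustrative).

```python
import numpy as np

def avg_pool_2x2(x):
    """x: (H, W) feature map with even H, W -> (H/2, W/2) pooled map."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(avg_pool_2x2(x))             # each output entry averages one 2x2 block
```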
LeNet-5 Filter: 5x5x6, stride: 1x1, #filters: 16 Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
LeNet-5 Pooling: 2x2, stride: 2 Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
LeNet-5 Weight matrix: 400x120 Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
LeNet-5 Weight matrix: 120x84; Weight matrix: 84x10 Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
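Putting the walkthrough together, here is a minimal sketch of the LeNet-5 layer shapes in Keras. This is an assumed modern stand-in, not the paper's implementation: the original used tanh-like squashing, trainable subsampling, and an RBF output layer.

```python
# A LeNet-5-style model; layer shapes match the slides' walkthrough.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(32, 32, 1)),                       # 32x32x1 input
    layers.Conv2D(6, kernel_size=5, activation="tanh"),   # -> 28x28x6
    layers.AveragePooling2D(pool_size=2, strides=2),      # -> 14x14x6
    layers.Conv2D(16, kernel_size=5, activation="tanh"),  # -> 10x10x16
    layers.AveragePooling2D(pool_size=2, strides=2),      # -> 5x5x16
    layers.Flatten(),                                     # -> 400
    layers.Dense(120, activation="tanh"),                 # weight matrix 400x120
    layers.Dense(84, activation="tanh"),                  # weight matrix 120x84
    layers.Dense(10, activation="softmax"),               # weight matrix 84x10
])
model.summary()
```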
Software platforms for CNN. Last updated in April 2016; check online for more recent ones.
Platform: Marvin (marvin.is)
LeNet in Marvin: convolutional layer
LeNet in Marvin: pooling layer
LeNet in Marvin: fully connected layer
Platform: Caffe (caffe.berkeleyvision.org)
LeNet in Caffe
Platform: Tensorflow (tensorflow.org)
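A minimal convolutional-layer call in TensorFlow, written in the modern TF 2.x style rather than the 2016-era API shown on the original slides; the shapes follow LeNet-5's first layer.

```python
import tensorflow as tf

x = tf.random.normal([1, 32, 32, 1])                 # batch of one 32x32 grayscale image
w = tf.random.normal([5, 5, 1, 6])                   # 5x5 kernels, 1 input channel, 6 filters
y = tf.nn.conv2d(x, w, strides=1, padding="VALID")   # -> shape [1, 28, 28, 6]
print(y.shape)
```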
Others
• Theano: CPU/GPU symbolic expression compiler in Python (from the MILA lab at the University of Montreal)
• Torch: provides a Matlab-like environment for state-of-the-art machine learning algorithms, in Lua
• Lasagne: a lightweight library to build and train neural networks in Theano
• See: http://deeplearning.net/software_links/
Optimization: momentum
Basic algorithms
• Minimize the (regularized) empirical loss
$\hat{L}(\theta) = \frac{1}{n} \sum_{t=1}^{n} l(\theta, x_t, y_t) + R(\theta)$
where the hypothesis is parametrized by $\theta$
• Gradient descent
$\theta_{t+1} = \theta_t - \eta_t \nabla \hat{L}(\theta_t)$
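A minimal sketch of this update rule on a toy one-dimensional loss $L(\theta) = \theta^2$ (all names illustrative):

```python
# Gradient descent: theta_{t+1} = theta_t - eta * L'(theta_t), with L(theta) = theta**2.
def grad(theta):
    return 2.0 * theta            # derivative of theta**2

theta, eta = 5.0, 0.1
for t in range(100):
    theta -= eta * grad(theta)
print(theta)                      # approaches the minimizer 0
```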
Mini-batch stochastic gradient descent
• Instead of one data point, work with a small batch of $m$ points
$(x_{tm+1}, y_{tm+1}), \ldots, (x_{tm+m}, y_{tm+m})$
• Update rule
$\theta_{t+1} = \theta_t - \eta_t \nabla \left[ \frac{1}{m} \sum_{1 \le i \le m} l(\theta_t, x_{tm+i}, y_{tm+i}) + R(\theta_t) \right]$
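A sketch of one pass of mini-batch SGD for least-squares regression in NumPy; the model, data, and hyperparameters are illustrative, with no regularizer ($R = 0$).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))              # features
y = X @ rng.normal(size=10)                  # targets from a hidden linear model
theta = np.zeros(10)
eta, m = 0.1, 50                             # learning rate, batch size

for t in range(len(X) // m):                 # one pass over the data
    Xb, yb = X[t * m:(t + 1) * m], y[t * m:(t + 1) * m]
    grad = Xb.T @ (Xb @ theta - yb) / m      # gradient of the mean squared error / 2
    theta -= eta * grad
```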
Momentum
• Drawback of SGD: progress can be slow when the gradient is small
• Observation: when the gradient is consistent across consecutive steps, we can take larger steps
• Metaphor: a marble ball rolling down a gentle slope gathers speed
Momentum. Figure from Deep Learning, by Goodfellow, Bengio, and Courville: contours show the loss function, the path shows SGD with momentum, and arrows show the stochastic gradients.
Momentum
• Work with a small batch of $m$ points $(x_{tm+1}, y_{tm+1}), \ldots, (x_{tm+m}, y_{tm+m})$
• Keep a momentum variable $v_t$, and set a decay rate $\beta$
• Update rule
$v_t = \beta v_{t-1} - \eta_t \nabla \left[ \frac{1}{m} \sum_{1 \le i \le m} l(\theta_t, x_{tm+i}, y_{tm+i}) + R(\theta_t) \right]$
$\theta_{t+1} = \theta_t + v_t$
• Practical guide: $\beta$ is set to 0.5 until the initial learning stabilizes, and then increased to 0.9 or higher
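The same least-squares sketch as above with the momentum update added; the symbols v and beta follow the slide, everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10)
theta, v = np.zeros(10), np.zeros(10)
eta, m, beta = 0.1, 50, 0.5                  # raise beta to 0.9 once learning stabilizes

for t in range(len(X) // m):
    Xb, yb = X[t * m:(t + 1) * m], y[t * m:(t + 1) * m]
    grad = Xb.T @ (Xb @ theta - yb) / m      # mini-batch gradient, as before
    v = beta * v - eta * grad                # decaying accumulation of past steps
    theta = theta + v
```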