

  1. Deep Learning Basics Lecture 6: Convolutional NN Princeton University COS 495 Instructor: Yingyu Liang

  2. Review: convolutional layers

  3. Convolution: two dimensional case
     Input:            Kernel/filter:
       a b c d           w x
       e f g h           y z
       i j k l
     Feature map:
       aw+bx+ey+fz   bw+cx+fy+gz   cw+dx+gy+hz
       ew+fx+iy+jz   fw+gx+jy+kz   gw+hx+ky+lz
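To make the sliding-window arithmetic concrete, here is a minimal NumPy sketch of the "valid" 2-D convolution above (as in the Deep Learning book, the kernel is applied without flipping, i.e. cross-correlation); the letters are replaced by numbers so the check is runnable:

```python
import numpy as np

def conv2d_valid(inp, kernel):
    """2-D 'valid' convolution (cross-correlation, no kernel flip)."""
    H, W = inp.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each feature-map entry is the dot product of the kernel
            # with the input patch under it.
            out[i, j] = np.sum(inp[i:i+kh, j:j+kw] * kernel)
    return out

# The slide's 3x4 input and 2x2 kernel, with a..l -> 0..11 and w,x,y,z -> 1,2,3,4.
inp = np.arange(12, dtype=float).reshape(3, 4)
kernel = np.array([[1., 2.],    # w x
                   [3., 4.]])   # y z
print(conv2d_valid(inp, kernel))  # 2x3 feature map; entry [0,0] = a*w + b*x + e*y + f*z
```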

  4. Convolutional layers: the same weights are shared across all output nodes. With m output nodes, kernel size k, and n input nodes, a convolutional layer stores only k distinct weights instead of the m x n of a fully connected layer. Figure from Deep Learning, by Goodfellow, Bengio, and Courville
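A quick sketch of the parameter-count difference (the sizes here are hypothetical, not from the slide):

```python
n, k = 1000, 5                  # hypothetical: 1000 input nodes, kernel size 5
m = n - k + 1                   # output nodes of a 1-D 'valid' convolution: 996
fully_connected_params = m * n  # every output node gets its own n weights: 996000
conv_params = k                 # all output nodes share the same k kernel weights: 5
print(fully_connected_params, conv_params)
```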

  5. Terminology Figure from Deep Learning, by Goodfellow, Bengio, and Courville

  6. Case study: LeNet-5

  7-9. LeNet-5
     • Proposed in "Gradient-based learning applied to document recognition", by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, in Proceedings of the IEEE, 1998
     • Applies convolution to 2D images (MNIST), trained with backpropagation
     • Structure: 2 convolutional layers (with pooling) + 3 fully connected layers
     • Input size: 32x32x1 • Convolution kernel size: 5x5 • Pooling: 2x2
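For reference, a rough sketch of this layer structure in modern tf.keras (not the course's code; the original 1998 network used tanh-style units, trainable subsampling, and an RBF output layer rather than ReLU, max-pooling, and softmax):

```python
import tensorflow as tf

# A simplified tf.keras sketch of the layer structure on the slide.
lenet5 = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, kernel_size=5, activation='relu',
                           input_shape=(32, 32, 1)),               # C1: 28x28x6
    tf.keras.layers.MaxPool2D(pool_size=2),                        # S2: 14x14x6
    tf.keras.layers.Conv2D(16, kernel_size=5, activation='relu'),  # C3: 10x10x16
    tf.keras.layers.MaxPool2D(pool_size=2),                        # S4: 5x5x16
    tf.keras.layers.Flatten(),                                     # 400
    tf.keras.layers.Dense(120, activation='relu'),                 # C5
    tf.keras.layers.Dense(84, activation='relu'),                  # F6
    tf.keras.layers.Dense(10, activation='softmax'),               # 10 digit classes
])
lenet5.summary()
```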

  10-11. LeNet-5 Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner

  12. LeNet-5 Filter: 5x5, stride: 1x1, #filters: 6 Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner

  13. LeNet-5 Pooling: 2x2, stride: 2 Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner

  14. LeNet-5 Filter: 5x5x6, stride: 1x1, #filters: 16 Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner

  15. LeNet-5 Pooling: 2x2, stride: 2 Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner

  16. LeNet-5 Weight matrix: 400x120 Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner

  17. LeNet-5 Weight matrix: 120x84; weight matrix: 84x10 Figure from Gradient-based learning applied to document recognition, by Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
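The sizes traced on slides 12-17 all follow from output = (input - kernel) / stride + 1 for "valid" convolution and pooling. A small sanity check:

```python
def out_size(n, k, stride=1):
    # 'valid' convolution/pooling: output = (n - k) // stride + 1
    return (n - k) // stride + 1

n = 32                        # input: 32x32x1
n = out_size(n, 5); c = 6     # C1: 5x5 conv, stride 1, 6 filters -> 28x28x6
n = out_size(n, 2, 2)         # S2: 2x2 pool, stride 2           -> 14x14x6
n = out_size(n, 5); c = 16    # C3: 5x5x6 conv, 16 filters       -> 10x10x16
n = out_size(n, 2, 2)         # S4: 2x2 pool, stride 2           -> 5x5x16
print(n * n * c)              # 400: the input side of the 400x120 weight matrix
```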

  18. Software platforms for CNN (list updated in April 2016; check online for more recent ones)

  19-20. Platform: Marvin (marvin.is)

  21. LeNet in Marvin: convolutional layer

  22. LeNet in Marvin: pooling layer

  23. LeNet in Marvin: fully connected layer

  24. Platform: Caffe (caffe.berkeleyvision.org)

  25. LeNet in Caffe

  26-28. Platform: TensorFlow (tensorflow.org)
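The TensorFlow slides are screenshots; as a stand-in, here is a minimal sketch of the core ops behind a LeNet-style convolution-plus-pooling stage using today's tf.nn API (the tensor shapes are illustrative, not taken from the slides):

```python
import tensorflow as tf

x = tf.random.normal([1, 32, 32, 1])   # batch of one 32x32 grayscale image
w = tf.random.normal([5, 5, 1, 6])     # 5x5 kernels, 1 input channel, 6 filters
conv = tf.nn.conv2d(x, w, strides=1, padding='VALID')       # -> [1, 28, 28, 6]
pool = tf.nn.max_pool2d(tf.nn.relu(conv),
                        ksize=2, strides=2, padding='VALID')  # -> [1, 14, 14, 6]
print(conv.shape, pool.shape)
```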

  29. Others
     • Theano – CPU/GPU symbolic expression compiler in Python (from the MILA lab at the University of Montreal)
     • Torch – provides a Matlab-like environment for state-of-the-art machine learning algorithms, in Lua
     • Lasagne – a lightweight library to build and train neural networks in Theano
     • See: http://deeplearning.net/software_links/

  30. Optimization: momentum

  31. Basic algorithms
     • Minimize the (regularized) empirical loss
       L̂_R(θ) = (1/n) ∑_{t=1}^{n} l(θ, x_t, y_t) + R(θ)
       where the hypothesis is parametrized by θ
     • Gradient descent: θ^{t+1} = θ^t - η_t ∇L̂_R(θ^t)

  32. Mini-batch stochastic gradient descent
     • Instead of one data point, work with a small batch of b points
       (x_{tb+1}, y_{tb+1}), …, (x_{tb+b}, y_{tb+b})
     • Update rule (one step is sketched in code below):
       θ^{t+1} = θ^t - η_t ∇[ (1/b) ∑_{1≤i≤b} l(θ^t, x_{tb+i}, y_{tb+i}) + R(θ^t) ]
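A minimal NumPy sketch of one such step; grad_loss and grad_reg are hypothetical gradient helpers standing in for ∇l and ∇R:

```python
import numpy as np

def sgd_step(theta, grad_loss, grad_reg, X, Y, t, b, lr):
    """theta^{t+1} = theta^t - eta_t * (mean batch loss gradient + regularizer gradient)."""
    Xb, Yb = X[t*b:(t+1)*b], Y[t*b:(t+1)*b]   # the batch (x_{tb+i}, y_{tb+i}), 1 <= i <= b
    g = np.mean([grad_loss(theta, x, y) for x, y in zip(Xb, Yb)], axis=0)
    return theta - lr * (g + grad_reg(theta))

# Example: least-squares loss l(theta, x, y) = (theta.x - y)^2 with an L2 regularizer.
grad_loss = lambda th, x, y: 2 * (th @ x - y) * x
grad_reg = lambda th: 0.02 * th                 # gradient of 0.01 * ||theta||^2
X, Y = np.random.randn(100, 3), np.random.randn(100)
theta = np.zeros(3)
for t in range(10):                             # ten successive mini-batches of b = 10
    theta = sgd_step(theta, grad_loss, grad_reg, X, Y, t, b=10, lr=0.1)
```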

  33. Momentum • Drawback of SGD: progress can be slow when the gradient is small • Observation: when the gradient is consistent across consecutive steps, we can take larger steps • Metaphor: a marble rolling down a gentle slope

  34. Momentum. Contour: loss function. Path: SGD with momentum. Arrow: stochastic gradient. Figure from Deep Learning, by Goodfellow, Bengio, and Courville

  35-36. Momentum
     • Work with a small batch of b points (x_{tb+1}, y_{tb+1}), …, (x_{tb+b}, y_{tb+b})
     • Keep a momentum variable v^t, and set a decay rate α
     • Update rule (a code sketch follows below):
       v^t = α v^{t-1} - η_t ∇[ (1/b) ∑_{1≤i≤b} l(θ^t, x_{tb+i}, y_{tb+i}) + R(θ^t) ]
       θ^{t+1} = θ^t + v^t
     • Practical guide: α is set to 0.5 until the initial learning stabilizes and then is increased to 0.9 or higher
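A minimal sketch of this update, including the practical schedule for α (the scalar toy objective is made up for illustration):

```python
def momentum_step(theta, v, batch_grad, lr, alpha):
    """v^t = alpha * v^{t-1} - eta_t * grad;  theta^{t+1} = theta^t + v^t."""
    v = alpha * v - lr * batch_grad   # decayed history plus the new gradient step
    return theta + v, v

# Toy objective f(theta) = theta^2, so the (batch) gradient is 2 * theta.
theta, v = 5.0, 0.0
for t in range(50):
    alpha = 0.5 if t < 10 else 0.9    # practical guide: start at 0.5, then raise to 0.9
    theta, v = momentum_step(theta, v, 2 * theta, lr=0.05, alpha=alpha)
print(theta)                          # approaches the minimum at 0
```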
