Deep learning in computer vision and natural language processing


  1. Introduction to Machine Learning
     Deep learning in computer vision and natural language processing
     Yifeng Tao, School of Computer Science, Carnegie Mellon University
     Slides adapted from Matt Gormley, Russ Salakhutdinov

  2. Review
     o Perceptron algorithm
     o Multilayer perceptron and activation functions
     o Backpropagation
     o Momentum-based mini-batch gradient descent methods

  3. Outline
     o Regularization in neural networks: methods to prevent overfitting
     o Widely used deep learning architectures in practice:
       o CNN
       o RNN

  4. Overfitting
     o The model learns the noise in the training samples too well, rather than the underlying pattern.
     [Slide from https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/]

  5. Model Selection
     [Slide from Russ Salakhutdinov et al.]

  6. Regularization in Machine Learning
     o Regularization penalizes the magnitude of the coefficients.
     o In deep learning, it penalizes the weight matrices of the nodes.
     [Slide from https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/]

  7. Regularization in Deep Learning
     o L2 & L1 regularization
     o Dropout
     o Data augmentation
     o Early stopping
     o Batch normalization
     [Slide from Russ Salakhutdinov et al.]
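
As a concrete illustration of the first item, here is a minimal NumPy sketch of how an L2 (weight decay) or L1 penalty on the weight matrices is added to the training loss; the function and variable names (`regularized_loss`, `lam`) are illustrative, not from the slides.

```python
import numpy as np

def regularized_loss(data_loss, weights, lam=1e-4, kind="l2"):
    """Add an L1 or L2 penalty on the weight matrices to the data loss.

    data_loss: scalar loss on the training batch (e.g. cross-entropy)
    weights:   list of weight matrices, one per layer
    lam:       regularization strength
    """
    if kind == "l2":
        penalty = sum(np.sum(W ** 2) for W in weights)     # sum of squared weights
    else:  # "l1"
        penalty = sum(np.sum(np.abs(W)) for W in weights)  # sum of absolute weights
    return data_loss + lam * penalty
```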

  8. Dropout
     o Produces very good results and is the most frequently used regularization technique in deep learning.
     o Can be thought of as an ensemble technique.
     [Slide from Russ Salakhutdinov et al.]

  9. Dropout at Test Time
     [Slide from Russ Salakhutdinov et al.]
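
A minimal NumPy sketch of (inverted) dropout, tying together the two slides above: during training each unit is kept with probability 1 - p_drop and the survivors are rescaled, so the layer can be used unchanged at test time. The function name and default drop probability are illustrative.

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout applied to a layer's activations h."""
    if not training or p_drop == 0.0:
        return h                              # test time: use all units as-is
    mask = rng.random(h.shape) >= p_drop      # keep each unit with prob 1 - p_drop
    return h * mask / (1.0 - p_drop)          # rescale so the expected activation is unchanged
```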

  10. Data Augmentation
     o Increases the effective size of the training data.
     o Can be considered a mandatory trick for improving predictions.
     [Slide from https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/]
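
A small sketch of typical image augmentations; torchvision is an assumption here (the slides do not name a library), and the specific transforms are just illustrative label-preserving variants of each training image.

```python
import torchvision.transforms as T

# Hypothetical augmentation pipeline for an image classification training set.
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),     # mirror the image half the time
    T.RandomCrop(32, padding=4),       # jitter the framing of a 32x32 image
    T.ColorJitter(brightness=0.2),     # small photometric perturbation
    T.ToTensor(),                      # convert the augmented PIL image to a tensor
])
```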

  11. Early Stopping
     o To select the number of epochs, stop training when the validation-set error starts to increase (with some look ahead).
     [Slide from https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/]
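
A minimal sketch of the early-stopping rule just described; `train_one_epoch` and `validation_error` are hypothetical callables supplied by the caller, and `patience` plays the role of the "look ahead".

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              max_epochs=100, patience=5):
    """Stop training once the validation error has not improved for `patience` epochs."""
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()                      # hypothetical: one pass over the training data
        err = validation_error()               # hypothetical: error on the held-out validation set
        if err < best_err:
            best_err, best_epoch = err, epoch  # new best so far: keep going
        elif epoch - best_epoch >= patience:   # no improvement for `patience` epochs
            break                              # look-ahead exhausted: stop early
    return best_epoch, best_err
```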

  12. Batch Normalization
     o Normalizing the inputs speeds up training (LeCun et al., 1998).
     o Could normalization also be useful at the level of the hidden layers?
     o Batch normalization is an attempt to do that (Ioffe and Szegedy, 2015):
       o each unit's pre-activation is normalized (mean subtraction, stddev division)
       o during training, the mean and stddev are computed for each minibatch
       o backpropagation takes the normalization into account
       o at test time, the global mean / stddev is used
     [Slide from Russ Salakhutdinov et al.]

  13. Batch Normalization
     [Slide from Russ Salakhutdinov et al.]

  14. Batch Normalization
     [Slide from Russ Salakhutdinov et al.]
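
A NumPy sketch of the training-time computation the batch-normalization slides describe: each unit's pre-activation is normalized with the minibatch mean and stddev, then scaled and shifted by the learned parameters gamma and beta. The running statistics used at test time are omitted for brevity; names are illustrative.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (batch_size, num_units) pre-activations for one minibatch."""
    mu = x.mean(axis=0)                      # per-unit minibatch mean
    var = x.var(axis=0)                      # per-unit minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize each unit
    return gamma * x_hat + beta              # learned scale and shift
```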

  15. Computer Vision: Image Classification
     o ImageNet LSVRC-2011 contest:
       o Dataset: 1.2 million labeled images, 1000 classes
       o Task: given a new image, label it with the correct class
     [Slide from Matt Gormley et al.]

  16. Computer Vision: Image Classification
     [Slide from Matt Gormley et al.]

  17. CNNs for Image Recognition
     o Convolutional Neural Networks (CNNs)
     [Slide from Matt Gormley et al.]

  18. Convolutional Neural Network (CNN)
     o Typical layers include:
       o Convolutional layer
       o Max-pooling layer
       o Fully-connected (linear) layer
       o ReLU layer (or some other nonlinear activation function)
       o Softmax
     o These can be arranged into arbitrarily deep topologies
     o Architecture #1: LeNet-5 (see the sketch below)
     [Slide from Matt Gormley et al.]
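
A hedged PyTorch sketch (the slides do not prescribe a framework) of a LeNet-5-style network for 32x32 grayscale inputs, showing how convolution, max-pooling, ReLU, and fully-connected layers are stacked; the layer sizes follow the classic LeNet-5 layout, but the code itself is illustrative and uses ReLU and max-pooling in place of the original activations.

```python
import torch
import torch.nn as nn

class LeNet5Like(nn.Module):
    """LeNet-5-style CNN: conv -> pool -> conv -> pool -> three linear layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 1x32x32 -> 6x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                   # -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),   # -> 16x10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                   # -> 16x5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),        # softmax is folded into the cross-entropy loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = LeNet5Like()(torch.randn(1, 1, 32, 32))   # one fake 32x32 grayscale image
```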

  19. What is a Convolution
     o Basic idea:
       o Pick a 3x3 matrix F of weights.
       o Slide this over an image and compute the "inner product" (similarity) of F and the corresponding field of the image; replace the pixel in the center of the field with the output of the inner product operation.
     o Key point:
       o Different convolutions extract different low-level "features" of an image.
       o All we need to vary to generate these different features is the weights of F.
     o A convolution matrix is used in image processing for tasks such as edge detection, blurring, sharpening, etc. (see the sketch below)
     [Slide from Matt Gormley et al.]
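
A minimal NumPy sketch of the sliding-window operation just described: a small filter F is swept over the image, and each output pixel is the inner product of F with the corresponding patch (strictly speaking this is cross-correlation, which is what deep-learning "convolution" layers typically compute). The Sobel-like edge filter is one illustrative choice of weights.

```python
import numpy as np

def convolve2d(image, F):
    """Slide the k x k filter F over the image; each output pixel is the inner
    product of F with the image patch centered there (valid region only)."""
    k = F.shape[0]
    H, W = image.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * F)
    return out

edge_filter = np.array([[-1., 0., 1.],    # horizontal-gradient (edge detection) weights
                        [-2., 0., 2.],
                        [-1., 0., 1.]])
response = convolve2d(np.random.rand(8, 8), edge_filter)   # 6x6 feature map
```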

  20. What is a Convolution
     [Slide from Matt Gormley et al.]

  21. What is a Convolution
     [Slide from Matt Gormley et al.]

  22. What is a Convolution
     [Slide from Matt Gormley et al.]

  23. Downsampling by Averaging
     o Suppose we use a convolution with stride 2.
     o Only 9 patches are visited in the input, so there are only 9 pixels in the output.
     [Slide from Matt Gormley et al.]

  24. Downsampling by Max-Pooling
     o Max-pooling is another (common) form of downsampling.
     o Instead of averaging, we take the max value within the same range as the equivalently-sized convolution.
     o The example below uses a stride of 2.
     [Slide from Matt Gormley et al.]
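
A NumPy sketch contrasting the two downsampling schemes from the last two slides, simplified to non-overlapping 2x2 windows with stride 2: averaging takes the mean of each window, max-pooling takes the maximum. The function name is illustrative.

```python
import numpy as np

def downsample(x, size=2, mode="max"):
    """Downsample a 2D array with non-overlapping size x size windows (stride = size)."""
    H, W = x.shape
    # group the array into (H//size, size, W//size, size) blocks, then reduce each block
    blocks = x[:H - H % size, :W - W % size].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6)
print(downsample(x, mode="max"))    # 3x3 output: maximum of each 2x2 patch
print(downsample(x, mode="mean"))   # 3x3 output: average of each 2x2 patch
```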

  25. CNN in protein-DNA binding
     o Feature extractor for motifs
     [Slide from Babak Alipanahi et al., 2015]

  26. Recurrent Neural Networks
     o Dataset for Supervised Part-of-Speech (POS) Tagging
     [Slide from Matt Gormley et al.]

  27. Recurrent Neural Networks
     o Dataset for Supervised Handwriting Recognition
     [Slide from Matt Gormley et al.]

  28. Time Series Data
     o Question 1: How could we apply the neural networks we've seen so far (which expect fixed-size input/output) to a prediction task with variable-length input/output?
     o Question 2: How could we incorporate context (e.g. words to the left/right, or tags to the left/right) into our solution?
     [Slide from Matt Gormley et al.]

  29. Recurrent Neural Networks (RNNs)
     [Slide from Matt Gormley et al.]

  30. Recurrent Neural Networks (RNNs)
     [Slide from Matt Gormley et al.]

  31. Recurrent Neural Networks (RNNs)
     [Slide from Matt Gormley et al.]
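
Since the RNN slides above are mostly figures, here is a minimal NumPy sketch of the recurrence they depict: the hidden state is updated as h_t = tanh(W_xh x_t + W_hh h_{t-1} + b), with the same weights reused at every step, which is what lets one network handle variable-length sequences (Question 1 on slide 28). The weight names are illustrative.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """xs: array of input vectors x_1..x_T; returns the hidden states h_1..h_T."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x in xs:                                  # same weights reused at every time step
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 8, 5
hs = rnn_forward(rng.normal(size=(T, d_in)),      # a toy length-5 input sequence
                 rng.normal(size=(d_h, d_in)),
                 rng.normal(size=(d_h, d_h)),
                 np.zeros(d_h))
```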

  32. Bidirectional RNN
     [Slide from Matt Gormley et al.]

  33. Deep Bidirectional RNNs
     o Notice that the upper-level hidden units have input from two previous layers (i.e. wider input).
     o Likewise for the output layer.
     [Slide from Matt Gormley et al.]
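
A small sketch of the bidirectional idea from the last two slides, reusing the `rnn_forward` helper from the sketch above (an illustrative assumption): the sequence is processed left-to-right and right-to-left, and each position's representation is the concatenation of the two hidden states, giving the next (upper) layer or the output layer the wider input the slide mentions.

```python
import numpy as np

def birnn_forward(xs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states at every time step.
    Assumes rnn_forward from the earlier sketch is in scope."""
    h_fwd = rnn_forward(xs, *fwd_params)               # left-to-right pass
    h_bwd = rnn_forward(xs[::-1], *bwd_params)[::-1]   # right-to-left pass, re-aligned in time
    return np.concatenate([h_fwd, h_bwd], axis=1)      # shape (T, 2 * d_h): wider input
```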

  34. Long Short-Term Memory (LSTM)
     o Motivation:
       o Vanishing gradient problem for standard RNNs
       o Figure shows sensitivity (darker = more sensitive) to the input at time t = 1
     [Slide from Matt Gormley et al.]

  35. Long Short-Term Memory (LSTM)
     o Motivation:
       o LSTM units have a rich internal structure.
       o The various "gates" determine the propagation of information and can choose to "remember" or "forget" information.
     [Slide from Matt Gormley et al.]

  36. Long Short-Term Memory (LSTM)
     [Slide from Matt Gormley et al.]

  37. Long Short-Term Memory (LSTM)
     o Input gate: masks out the standard RNN inputs
     o Forget gate: masks out the previous cell
     o Cell: stores the input/forget mixture
     o Output gate: masks out the values of the next hidden state
     [Slide from Matt Gormley et al.]
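
A NumPy sketch of one LSTM step matching the gate description above: sigmoid input, forget, and output gates mask the candidate input, the previous cell, and the new hidden state respectively. This is the standard formulation; the weight packing and names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W: (4*d_h, d_in + d_h) and b: (4*d_h,) pack the four gate blocks."""
    d_h = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[:d_h])                  # input gate: masks the candidate (standard RNN) input
    f = sigmoid(z[d_h:2 * d_h])           # forget gate: masks the previous cell
    o = sigmoid(z[2 * d_h:3 * d_h])       # output gate: masks the new hidden state
    g = np.tanh(z[3 * d_h:])              # candidate cell input
    c = f * c_prev + i * g                # cell: stores the input/forget mixture
    h = o * np.tanh(c)                    # next hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 3, 5
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h),
                 rng.normal(size=(4 * d_h, d_in + d_h)), np.zeros(4 * d_h))
```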

  38. Deep Bidirectional LSTM (DBLSTM)
     o How important is this particular architecture?
     o Jozefowicz et al. (2015) evaluated 10,000 different LSTM-like architectures and found several variants that worked just as well on several tasks.
     [Slide from Matt Gormley et al.]

  39. Take-home message
     o Methods to prevent overfitting in deep learning:
       o L2 & L1 regularization
       o Dropout
       o Data augmentation
       o Early stopping
       o Batch normalization
     o CNNs:
       o Are used for all aspects of computer vision
       o Learn interpretable features at different levels of abstraction
       o Typically consist of convolution layers, pooling layers, nonlinearities, and fully connected layers
     o RNNs:
       o Applicable to sequential tasks
       o Learn context features for time series data
       o Vanishing gradients are still a problem, but LSTM units can help

  40. References
     o Matt Gormley. 10601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html
     o Barnabás Póczos, Maria-Florina Balcan, Russ Salakhutdinov. 10715 Advanced Introduction to Machine Learning: https://sites.google.com/site/10715advancedmlintro2017f/lectures
