Introduction to Machine Learning
Deep Learning in Computer Vision and Natural Language Processing
Yifeng Tao
School of Computer Science, Carnegie Mellon University
Slides adapted from Matt Gormley and Russ Salakhutdinov
Review
o Perceptron algorithm
o Multilayer perceptron and activation functions
o Backpropagation
o Momentum-based mini-batch gradient descent methods
Outline
o Regularization in neural networks: methods to prevent overfitting
o Widely used deep learning architectures in practice
o CNN
o RNN
Overfitting
o The model fits the noise in the training samples too closely, rather than the underlying pattern
[Slide from https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/]
Model Selection
[Slide from Russ Salakhutdinov et al.]
Regularization in Machine Learning
o Regularization penalizes the coefficients.
o In deep learning, it penalizes the weight matrices of the nodes.
[Slide from https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/]
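As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of how an L2 (weight-decay) penalty is added to a loss and its gradients; the function name and arguments are hypothetical, and `lam` stands for the regularization strength.

```python
import numpy as np

def l2_regularized_loss_and_grads(weights, data_loss, data_grads, lam=1e-4):
    """Add an L2 (weight-decay) penalty to a data loss and its gradients.

    weights    : list of weight matrices of the network
    data_loss  : scalar loss on the current training batch
    data_grads : gradients of data_loss w.r.t. each weight matrix
    lam        : regularization strength (hyperparameter, assumed value)
    """
    penalty = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    # d/dW of 0.5 * lam * ||W||^2 is lam * W, added to each data gradient.
    total_grads = [g + lam * W for g, W in zip(data_grads, weights)]
    return data_loss + penalty, total_grads
```

An L1 penalty would instead add `lam * np.sum(np.abs(W))` to the loss and `lam * np.sign(W)` to each gradient, which pushes weights exactly to zero rather than merely shrinking them.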
Regularization in Deep Learning
o L2 & L1 regularization
o Dropout
o Data augmentation
o Early stopping
o Batch normalization
[Slide from Russ Salakhutdinov et al.]
Dropout
o Produces very good results and is the most frequently used regularization technique in deep learning.
o Can be thought of as an ensemble technique.
[Slide from Russ Salakhutdinov et al.]
Dropout at Test Time
[Slide from Russ Salakhutdinov et al.]
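To make the train/test distinction concrete, here is a minimal NumPy sketch of inverted dropout, one common formulation (not necessarily the exact one on the slides): units are dropped and the survivors rescaled during training, so the layer becomes an identity at test time.

```python
import numpy as np

def dropout(activations, p_drop=0.5, train=True, rng=np.random.default_rng(0)):
    """Inverted dropout on a layer's activations.

    During training, each unit is zeroed with probability p_drop and the
    survivors are scaled by 1 / (1 - p_drop), so the expected activation is
    unchanged. At test time no units are dropped and no rescaling is needed.
    """
    if not train:
        return activations  # test time: use the full network
    mask = (rng.random(activations.shape) >= p_drop) / (1.0 - p_drop)
    return activations * mask
```

Because a different mask is sampled on every forward pass, training effectively averages over many thinned sub-networks, which is why dropout can be viewed as an ensemble technique.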
Data Augmentation
o Increases the size of the training data by applying label-preserving transformations to existing samples
o It can be considered a practically mandatory trick for improving predictions
[Slide from https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/]
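As an illustration, a minimal NumPy sketch of two standard label-preserving transformations (random horizontal flip and random crop); the crop size and flip probability are illustrative assumptions, not values from the slides.

```python
import numpy as np

def augment(image, crop_size=28, rng=np.random.default_rng(0)):
    """Randomly flip and crop one image of shape (H, W, C)."""
    # Random horizontal flip with probability 0.5 (label-preserving).
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    # Random crop: take a crop_size x crop_size window at a random offset.
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size, :]
```

Applying such transformations on the fly yields a different variant of each training image every epoch, effectively enlarging the training set without collecting new data.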
Early Stopping
o To select the number of epochs, stop training when the validation-set error starts to increase (with some look-ahead)
[Slide from https://www.analyticsvidhya.com/blog/2018/04/fundamentals-deep-learning-regularization-techniques/]
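A minimal sketch of the look-ahead ("patience") logic; `train_one_epoch` and `validation_error` are hypothetical callbacks standing in for whatever training and evaluation routines are in use.

```python
def train_with_early_stopping(train_one_epoch, validation_error,
                              patience=5, max_epochs=200):
    """Generic early-stopping loop.

    train_one_epoch()  : runs one epoch of training (assumed callback)
    validation_error() : returns the current validation-set error
    Training stops once the validation error has not improved for
    `patience` consecutive epochs; the best epoch is reported.
    """
    best_error, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        train_one_epoch()
        error = validation_error()
        if error < best_error:
            best_error, best_epoch = error, epoch   # new best: reset the look-ahead
        elif epoch - best_epoch >= patience:
            break                                   # ran out of patience
    return best_epoch, best_error
```

In practice the model parameters from the best epoch are also checkpointed, so the final model is the one that generalized best rather than the last one trained.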
Batch Normalization
o Normalizing the inputs speeds up training (LeCun et al., 1998)
o Could normalization also be useful at the level of the hidden layers?
o Batch normalization is an attempt to do just that (Ioffe and Szegedy, 2015)
o Each unit's pre-activation is normalized (mean subtraction, stddev division)
o During training, the mean and stddev are computed for each minibatch
o Backpropagation takes the normalization into account
o At test time, the global mean / stddev is used
[Slide from Russ Salakhutdinov et al.]
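A minimal NumPy sketch of the training-time computation: each unit's pre-activation is normalized with the minibatch mean and stddev, then scaled and shifted by learned per-unit parameters gamma and beta. The running (global) statistics used at test time are assumed to be tracked separately.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch-normalize pre-activations x of shape (batch_size, num_units).

    gamma, beta : learned scale and shift, one value per unit.
    At test time, mu and var below are replaced by running estimates
    accumulated during training, so the output no longer depends on the batch.
    """
    mu = x.mean(axis=0)                    # per-unit minibatch mean
    var = x.var(axis=0)                    # per-unit minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized pre-activations
    return gamma * x_hat + beta            # learned scale and shift
```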
Batch Normalization
[Slide from Russ Salakhutdinov et al.]
Batch Normalization
[Slide from Russ Salakhutdinov et al.]
Computer Vision: Image Classification
o ImageNet LSVRC-2011 contest:
o Dataset: 1.2 million labeled images, 1000 classes
o Task: Given a new image, label it with the correct class
[Slide from Matt Gormley et al.]
Computer Vision: Image Classification
[Slide from Matt Gormley et al.]
CNNs for Image Recognition
o Convolutional Neural Networks (CNNs)
[Slide from Matt Gormley et al.]
Convolutional Neural Network (CNN)
o Typical layers include:
o Convolutional layer
o Max-pooling layer
o Fully-connected (linear) layer
o ReLU layer (or some other nonlinear activation function)
o Softmax
o These can be arranged into arbitrarily deep topologies
o Architecture #1: LeNet-5
[Slide from Matt Gormley et al.]
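As one concrete arrangement of the layers listed above, here is a LeNet-5-style stack expressed with PyTorch's nn.Sequential; the layer sizes follow the classic LeNet-5 description and are an assumption here, not necessarily the exact model shown on the slides.

```python
import torch.nn as nn

# LeNet-5-style CNN for 1-channel 32x32 inputs and 10 output classes.
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(),   # 32x32 -> 28x28, 6 feature maps
    nn.MaxPool2d(2),                             # 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),  # 14x14 -> 10x10, 16 feature maps
    nn.MaxPool2d(2),                             # 10x10 -> 5x5
    nn.Flatten(),                                # 16 * 5 * 5 = 400 features
    nn.Linear(400, 120), nn.ReLU(),              # fully-connected layers
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
    nn.Softmax(dim=1),                           # class probabilities
)
```

In practice the final softmax is often folded into the loss (e.g. cross-entropy on the raw logits), but it is kept here to mirror the layer list above.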
What is a Convolution
o Basic idea:
o Pick a 3x3 matrix F of weights
o Slide this over an image and compute the "inner product" (similarity) of F and the corresponding field of the image; replace the pixel in the center of the field with the output of the inner product operation
o Key point:
o Different convolutions extract different low-level "features" of an image
o All we need to vary to generate these different features is the weights of F
o A convolution matrix is used in image processing for tasks such as edge detection, blurring, sharpening, etc.
[Slide from Matt Gormley et al.]
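A minimal NumPy sketch of this sliding-window operation for a single-channel image and a 3x3 kernel (no padding, stride 1); this is the plain cross-correlation form commonly used in CNNs, shown as an illustration rather than the exact variant on the slides.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small kernel (the weight matrix F) over a 2D image.

    image  : 2D array of shape (H, W)
    kernel : 2D array, e.g. of shape (3, 3)
    Returns an array of shape (H - kh + 1, W - kw + 1) (no padding, stride 1).
    """
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # inner product of F and the patch
    return out

# Example: a Sobel-style kernel whose output highlights vertical edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
```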
What is a Convolution
[Slide from Matt Gormley et al.]
What is a Convolution
[Slide from Matt Gormley et al.]
What is a Convolution
[Slide from Matt Gormley et al.]
Downsampling by Averaging
o Suppose we use a convolution with stride 2
o Only 9 patches visited in input, so only 9 pixels in output
[Slide from Matt Gormley et al.]
Downsampling by Max-Pooling
o Max-pooling is another (common) form of downsampling
o Instead of averaging, we take the max value within the same range as the equivalently-sized convolution
o The example below uses a stride of 2
[Slide from Matt Gormley et al.]
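A minimal NumPy sketch of max-pooling with a 2x2 window and stride 2 (a common configuration, assumed here; the slide's example may use a different window size).

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Downsample a 2D feature map by taking the max over each window."""
    h, w = feature_map.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()  # keep only the strongest activation
    return out
```

Replacing `window.max()` with `window.mean()` gives the downsampling-by-averaging variant from the previous slide.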
CNN in Protein-DNA Binding
o Feature extractor for motifs
[Slide from Babak Alipanahi et al., 2015]
Recurrent Neural Networks
o Dataset for Supervised Part-of-Speech (POS) Tagging
[Slide from Matt Gormley et al.]
Recurrent Neural Networks
o Dataset for Supervised Handwriting Recognition
[Slide from Matt Gormley et al.]
Time Series Data
o Question 1: How could we apply the neural networks we've seen so far (which expect fixed-size input/output) to a prediction task with variable-length input/output?
o Question 2: How could we incorporate context (e.g. words to the left/right, or tags to the left/right) into our solution?
[Slide from Matt Gormley et al.]
Recurrent Neural Networks (RNNs)
[Slide from Matt Gormley et al.]
Recurrent Neural Networks (RNNs)
[Slide from Matt Gormley et al.]
Recurrent Neural Networks (RNNs)
[Slide from Matt Gormley et al.]
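As a concrete reference for the recurrence pictured on these slides, a minimal NumPy sketch of a vanilla (Elman) RNN, h_t = tanh(x_t W_xh + h_{t-1} W_hh + b); the parameter names are illustrative.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h, h0=None):
    """Run a vanilla (Elman) RNN over an input sequence.

    inputs : array of shape (T, input_dim), one row per time step
    W_xh   : (input_dim, hidden_dim) input-to-hidden weights
    W_hh   : (hidden_dim, hidden_dim) recurrent (hidden-to-hidden) weights
    b_h    : (hidden_dim,) bias
    Returns the (T, hidden_dim) sequence of hidden states.
    """
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim) if h0 is None else h0
    states = []
    for x_t in inputs:                            # the same weights are reused at every step
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)  # h_t depends on x_t and h_{t-1}
        states.append(h)
    return np.stack(states)
```

Because each hidden state summarizes everything seen so far, the same network handles variable-length inputs and carries left context forward, addressing both questions from the Time Series Data slide; a bidirectional RNN (next slide) adds right context by running a second RNN over the reversed sequence.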
Bidirectional RNN
[Slide from Matt Gormley et al.]
Deep Bidirectional RNNs
o Notice that the upper-level hidden units have input from two previous layers (i.e. wider input)
o Likewise for the output layer
[Slide from Matt Gormley et al.]
Long Short-Term Memory (LSTM)
o Motivation:
o Vanishing gradient problem for standard RNNs
o Figure shows sensitivity (darker = more sensitive) to the input at time t = 1
[Slide from Matt Gormley et al.]
Long Short-Term Memory (LSTM)
o Motivation:
o LSTM units have a rich internal structure
o The various "gates" determine the propagation of information and can choose to "remember" or "forget" information
[Slide from Matt Gormley et al.]
Long Short-Term Memory (LSTM)
[Slide from Matt Gormley et al.]
Long Short-Term Memory (LSTM)
o Input gate: masks out the standard RNN inputs
o Forget gate: masks out the previous cell
o Cell: stores the input/forget mixture
o Output gate: masks out the values of the next hidden state
[Slide from Matt Gormley et al.]
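A minimal NumPy sketch of one LSTM step in a common formulation (input, forget, and output gates plus a candidate cell update); gate equations vary slightly across papers, so treat this as one representative variant rather than the exact one on the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W, U, b stack the parameters of the four transforms (i, f, o, g):
    W : (4, input_dim, hidden_dim), U : (4, hidden_dim, hidden_dim), b : (4, hidden_dim)
    """
    i = sigmoid(x_t @ W[0] + h_prev @ U[0] + b[0])  # input gate: how much new input to admit
    f = sigmoid(x_t @ W[1] + h_prev @ U[1] + b[1])  # forget gate: how much of the old cell to keep
    o = sigmoid(x_t @ W[2] + h_prev @ U[2] + b[2])  # output gate: how much of the cell to expose
    g = np.tanh(x_t @ W[3] + h_prev @ U[3] + b[3])  # candidate cell update
    c = f * c_prev + i * g                          # cell: the input/forget mixture
    h = o * np.tanh(c)                              # next hidden state
    return h, c
```

Because the cell state c is updated additively (gated by f and i) rather than repeatedly squashed through a nonlinearity, gradients can flow across many time steps, which is how LSTMs mitigate the vanishing-gradient problem from the motivation slide.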
Deep Bidirectional LSTM (DBLSTM)
o How important is this particular architecture?
o Jozefowicz et al. (2015) evaluated 10,000 different LSTM-like architectures and found several variants that worked just as well on several tasks.
[Slide from Matt Gormley et al.]
Take-Home Message
o Methods to prevent overfitting in deep learning:
o L2 & L1 regularization
o Dropout
o Data augmentation
o Early stopping
o Batch normalization
o CNNs
o Are used for all aspects of computer vision
o Learn interpretable features at different levels of abstraction
o Typically consist of convolution layers, pooling layers, nonlinearities, and fully connected layers
o RNNs
o Applicable to sequential tasks
o Learn context features for time series data
o Vanishing gradients are still a problem, but LSTM units can help
References
o Matt Gormley. 10601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html
o Barnabás Póczos, Maria-Florina Balcan, Russ Salakhutdinov. 10715 Advanced Introduction to Machine Learning: https://sites.google.com/site/10715advancedmlintro2017f/lectures