Deep Learning Basics Lecture 4: Regularization II Princeton University COS 495 Instructor: Yingyu Liang
Review
Regularization as hard constraint
• Constrained optimization: $\min_\theta \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^n l(\theta, x_i, y_i)$, subject to $R(\theta) \le r$
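As an illustration (not from the slide): when $R(\theta) = \|\theta\|_2$, one common way to handle the hard constraint is projected gradient descent, which takes a gradient step on the empirical loss and then projects back onto the feasible set. A minimal NumPy sketch; the function names and the squared-loss gradient are illustrative choices:

```python
import numpy as np

def project_l2_ball(theta, r):
    """Project theta onto the constraint set {theta : ||theta||_2 <= r}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

def projected_gradient_step(theta, X, y, lr, r):
    """One projected gradient step for the empirical squared loss (1/n)||X theta - y||^2."""
    grad = 2.0 * X.T @ (X @ theta - y) / len(y)   # gradient of the data loss
    return project_l2_ball(theta - lr * grad, r)  # step, then enforce R(theta) <= r
```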
Regularization as soft constraint
• Unconstrained optimization: $\min_\theta \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^n l(\theta, x_i, y_i) + \lambda R(\theta)$, for some regularization parameter $\lambda > 0$
Regularization as Bayesian prior
• Bayes' rule: $p(\theta \mid \{x_i, y_i\}) = \dfrac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$
• Maximum A Posteriori (MAP): $\max_\theta \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \big[\log p(\theta) + \log p(\{x_i, y_i\} \mid \theta)\big]$, where $\log p(\theta)$ plays the role of the regularization and $\log p(\{x_i, y_i\} \mid \theta)$ is the MLE loss
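For concreteness (a standard consequence of the MAP view, added here rather than taken from the slide): a zero-mean Gaussian prior $p(\theta) = N(0, \sigma^2 I)$ gives $\log p(\theta) = -\frac{\|\theta\|_2^2}{2\sigma^2} + \text{const}$, so MAP estimation is equivalent to minimizing the negative log-likelihood plus an $\ell_2$ penalty with $\lambda = \frac{1}{2\sigma^2}$; a Laplace prior similarly yields the $\ell_1$ penalty.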
Classical regularizations
• Norm penalty
• $\ell_2$ regularization
• $\ell_1$ regularization
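As a concrete sketch of the norm penalties (not part of the slides; the function names and the squared data loss are illustrative), the penalty term is simply added to the data loss as in the soft-constraint form above:

```python
import numpy as np

def l2_penalty(w, lam):
    """lambda * ||w||_2^2, the classical weight-decay penalty."""
    return lam * np.sum(w ** 2)

def l1_penalty(w, lam):
    """lambda * ||w||_1, which encourages sparse weights."""
    return lam * np.sum(np.abs(w))

def regularized_loss(w, X, y, lam, penalty=l2_penalty):
    """Mean squared loss on (X, y) plus a norm penalty on the weights w."""
    residual = X @ w - y
    return np.mean(residual ** 2) + penalty(w, lam)
```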
More examples
Other types of regularizations β’ Robustness to noise β’ Noise to the input β’ Noise to the weights β’ Noise to the output β’ Data augmentation β’ Early stopping β’ Dropout
Multiple optimal solutions? [Figure: linearly separable data (Class +1 vs. Class −1) with three separating hyperplanes $w_1$, $w_2$, $w_3$; prefer $w_2$ (higher confidence)]
Add noise to the input [Figure: Class +1 and Class −1 points with noise added to the inputs, separated by $w_2$; prefer $w_2$ (higher confidence)]
Caution: not too much noise
• Too much noise leads to data points crossing the decision boundary [Figure: the same data with larger input noise around $w_2$; prefer $w_2$ (higher confidence)]
Equivalence to weight decay
• Suppose the hypothesis is $f(x) = w^T x$ and the noise is $\epsilon \sim N(0, \lambda I)$
• After adding noise to the input, the loss is
$L(f) = \mathbb{E}_{x,y,\epsilon}\,[f(x+\epsilon) - y]^2 = \mathbb{E}_{x,y,\epsilon}\,[f(x) + \epsilon^T w - y]^2$
$= \mathbb{E}_{x,y}\,[f(x) - y]^2 + 2\,\mathbb{E}_{x,y,\epsilon}\,[\epsilon^T w\,(f(x) - y)] + \mathbb{E}_{x,y,\epsilon}\,[\epsilon^T w]^2$
$= \mathbb{E}_{x,y}\,[f(x) - y]^2 + \lambda \|w\|^2$ (the cross term vanishes since $\mathbb{E}[\epsilon] = 0$)
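A quick Monte Carlo check of this equivalence for a linear model; the toy data and variable names below are illustrative, assuming $\epsilon \sim N(0, \lambda I)$ as on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear setup f(x) = w^T x with Gaussian inputs (all choices illustrative).
d, n = 5, 200_000
w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 0.3                                           # input-noise variance lambda
eps = rng.normal(scale=np.sqrt(lam), size=(n, d))   # eps ~ N(0, lam * I)

noisy_loss = np.mean(((X + eps) @ w - y) ** 2)      # E (f(x + eps) - y)^2
clean_plus_decay = np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)
print(noisy_loss, clean_plus_decay)                 # approximately equal
```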
Add noise to the weights
• For the loss on each data point, add a noise term to the weights before computing the prediction: $\epsilon \sim N(0, \eta I)$, $w' = w + \epsilon$
• Prediction: $f_{w'}(x)$ instead of $f_w(x)$
• The loss becomes $L(f) = \mathbb{E}_{x,y,\epsilon}\,[f_{w+\epsilon}(x) - y]^2$
Add noise to the weights
• The loss becomes $L(f) = \mathbb{E}_{x,y,\epsilon}\,[f_{w+\epsilon}(x) - y]^2$
• To simplify, use the Taylor expansion: $f_{w+\epsilon}(x) \approx f_w(x) + \epsilon^T \nabla f_w(x) + \tfrac{1}{2}\,\epsilon^T \nabla^2 f_w(x)\,\epsilon$
• Plug in: $L(f) \approx \mathbb{E}_{x,y}\,[f_w(x) - y]^2 + \eta\,\mathbb{E}_{x,y}\,[(f_w(x) - y)\,\nabla^2 f_w(x)]$ (small, so it can be ignored) $+\ \eta\,\mathbb{E}_{x,y}\,\|\nabla f_w(x)\|^2$ (the regularization term)
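A minimal sketch of the weight-noise loss itself, before the Taylor approximation; the `forward` model and the averaging over a few noise samples are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(w, x):
    """Hypothetical model: a single linear unit with tanh output; x has shape (n, d)."""
    return np.tanh(x @ w)

def noisy_weight_loss(w, x, y, eta, n_samples=10):
    """Average squared loss with Gaussian noise N(0, eta*I) added to the weights
    before each prediction, approximating E_{x,y,eps} [f_{w+eps}(x) - y]^2."""
    losses = []
    for _ in range(n_samples):
        eps = rng.normal(scale=np.sqrt(eta), size=w.shape)  # eps ~ N(0, eta * I)
        pred = forward(w + eps, x)
        losses.append(np.mean((pred - y) ** 2))
    return np.mean(losses)
```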
Data augmentation Figure from Image Classification with Pyramid Representation and Rotated Data Augmentation on Torch 7, by Keven Wang
Data augmentation
• Adding noise to the input: a special kind of augmentation
• Be careful about the transformations applied:
• Example: classifying "b" and "d" (a horizontal flip changes the label)
• Example: classifying "6" and "9" (a 180° rotation changes the label)
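A small augmentation sketch illustrating the caution above; the transformations (small shifts plus additive noise) and the image format are illustrative assumptions, not taken from the lecture:

```python
import numpy as np

def augment(image, rng):
    """Label-preserving augmentations for a 2D grayscale image of shape (H, W):
    small pixel shifts and additive noise. Horizontal flips are deliberately
    omitted (they would turn a 'b' into a 'd'), as are 180-degree rotations
    (which would turn a '6' into a '9')."""
    # Random shift by up to 2 pixels in each direction.
    dy, dx = rng.integers(-2, 3, size=2)
    shifted = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
    # Small additive Gaussian noise (noise-to-input as a special case of augmentation).
    return shifted + rng.normal(scale=0.05, size=image.shape)
```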
Early stopping
• Idea: don't train the network to too small a training error
• Recall overfitting: the larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the training data and the true distribution
• Prevent overfitting: do not push the hypothesis too hard; use the validation error to decide when to stop
Early stopping Figure from Deep Learning, Goodfellow, Bengio and Courville
Early stopping
• When training, also monitor the validation error
• Every time the validation error improves, store a copy of the weights
• When the validation error has not improved for some time, stop
• Return the stored copy of the weights
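A sketch of this procedure in code; the `model`, `train_step`, and `validation_error` callables and the `model.weights` attribute are hypothetical placeholders, not from the lecture:

```python
import copy

def train_with_early_stopping(model, train_step, validation_error,
                              max_steps, patience):
    """Keep the weights that achieved the best validation error so far;
    stop once there has been no improvement for `patience` steps."""
    best_error = float("inf")
    best_weights = copy.deepcopy(model.weights)
    steps_without_improvement = 0
    for _ in range(max_steps):
        train_step(model)                      # one update on the training data
        error = validation_error(model)        # also track the validation error
        if error < best_error:
            best_error = error
            best_weights = copy.deepcopy(model.weights)  # store a copy
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                break                          # not improved for some time: stop
    model.weights = best_weights               # return the stored copy
    return model
```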
Early stopping
• Hyperparameter selection: the number of training steps is the hyperparameter
• Advantages
• Efficient: runs along with training; only needs to store an extra copy of the weights
• Simple: no change to the model/algorithm
• Disadvantage: needs validation data
Early stopping
• Strategy to get rid of the disadvantage: after early stopping in the first run, train a second run that also uses the validation data
• How to reuse the validation data:
1. Start fresh, and train on both the training data and the validation data for the same number of epochs as the first run
2. Start from the weights of the first run, and train on both the training data and the validation data until the validation loss falls below the training loss observed at the early stopping point
Early stopping as a regularizer Figure from Deep Learning, Goodfellow, Bengio and Courville
Dropout
• Randomly select weights to update; more precisely, in each update step:
• Randomly sample a different binary mask over all the input and hidden units
• Multiply the mask bits with the units and do the update as usual
• Typical dropout probability: 0.2 for input units and 0.5 for hidden units
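A minimal sketch of a dropout forward pass; the two-layer ReLU network, the shapes, and the "inverted dropout" rescaling by the keep probability are illustrative choices, not taken from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_mask(shape, keep_prob, rng):
    """Sample a binary mask; each unit is kept with probability keep_prob."""
    return (rng.random(shape) < keep_prob).astype(float)

def forward_with_dropout(x, W1, W2, rng, p_input=0.8, p_hidden=0.5):
    """One forward pass with dropout on the input and hidden units (keep
    probabilities 0.8 and 0.5, i.e. dropout 0.2 and 0.5). Dividing by the
    keep probability ('inverted dropout') avoids rescaling at test time."""
    x = x * dropout_mask(x.shape, p_input, rng) / p_input
    h = np.maximum(0, x @ W1)                      # ReLU hidden layer
    h = h * dropout_mask(h.shape, p_hidden, rng) / p_hidden
    return h @ W2

# Example usage with random weights (shapes illustrative).
x = rng.normal(size=(32, 100))                     # batch of 32 inputs
W1 = rng.normal(size=(100, 50)) * 0.1
W2 = rng.normal(size=(50, 10)) * 0.1
out = forward_with_dropout(x, W1, W2, rng)
```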
Dropout Figure from Deep Learning, Goodfellow, Bengio and Courville
What regularizations are frequently used?
• $\ell_2$ regularization
• Early stopping
• Dropout
• Data augmentation, if the transformations are known/easy to implement