
Lecture 4: Regularization II (Princeton University COS 495)



  1. Deep Learning Basics, Lecture 4: Regularization II. Princeton University COS 495. Instructor: Yingyu Liang

  2. Review

  3. Regularization as hard constraint
     • Constrained optimization:
           min_θ  L̂(θ) = (1/n) Σ_{i=1}^n l(θ, x_i, y_i)
           subject to: R(θ) ≤ r

  4. Regularization as soft constraint
     • Unconstrained optimization:
           min_θ  L̂_R(θ) = (1/n) Σ_{i=1}^n l(θ, x_i, y_i) + λ R(θ)
       for some regularization parameter λ > 0
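
To make the soft-constraint objective concrete, here is a minimal NumPy sketch (not from the slides) for a linear model with squared loss and the ℓ2 regularizer R(θ) = ‖θ‖²; the toy data and λ = 0.1 are placeholder choices.

    import numpy as np

    def regularized_loss(theta, X, Y, lam):
        """Soft-constraint objective: (1/n) * sum_i l(theta, x_i, y_i) + lam * R(theta),
        with squared loss and R(theta) = ||theta||^2."""
        per_example = (X @ theta - Y) ** 2        # l(theta, x_i, y_i)
        return per_example.mean() + lam * np.sum(theta ** 2)

    # Toy data, for illustration only.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    Y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)
    print(regularized_loss(np.zeros(5), X, Y, lam=0.1))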

  5. Regularization as Bayesian prior
     • Bayes' rule:
           p(θ | {x_i, y_i}) = p(θ) p({x_i, y_i} | θ) / p({x_i, y_i})
     • Maximum a posteriori (MAP):
           max_θ log p(θ | {x_i, y_i}) = max_θ [ log p(θ) + log p({x_i, y_i} | θ) ]
       where log p(θ) plays the role of the regularization term and log p({x_i, y_i} | θ) is the MLE loss
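
A standard worked example of this correspondence (not spelled out on the slide): a Gaussian prior on θ turns the MAP objective into the ℓ2-penalized loss.

    \log p(\theta) = -\frac{\|\theta\|^2}{2\sigma^2} + \text{const}
    \quad\Rightarrow\quad
    \max_\theta \Big[\log p(\theta) + \log p(\{x_i, y_i\} \mid \theta)\Big]
    \;=\; \min_\theta \Big[-\log p(\{x_i, y_i\} \mid \theta) + \tfrac{1}{2\sigma^2}\|\theta\|^2\Big]

So MAP with a Gaussian prior is the MLE loss plus an ℓ2 penalty with λ = 1/(2σ²); a Laplace prior p(θ) ∝ exp(−‖θ‖₁ / b) gives an ℓ1 penalty in the same way.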

  6. Classical regularizations
     • Norm penalty
     • ℓ2 regularization
     • ℓ1 regularization
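
As an illustrative sketch (not from the slides) of how the two norm penalties act during training: an ℓ2 penalty adds 2λθ to the gradient ("weight decay"), while an ℓ1 penalty adds λ·sign(θ), which pushes small weights to exactly zero. The learning rate, λ, and the pretend data-loss gradient below are made-up values.

    import numpy as np

    def grad_step(theta, grad_loss, lr, lam, penalty="l2"):
        """One gradient step on (data loss + lam * penalty); illustrative only."""
        if penalty == "l2":
            # d/dtheta of lam * ||theta||^2 is 2 * lam * theta ("weight decay")
            return theta - lr * (grad_loss + 2 * lam * theta)
        elif penalty == "l1":
            # subgradient of lam * ||theta||_1 is lam * sign(theta): shrinks weights toward 0
            return theta - lr * (grad_loss + lam * np.sign(theta))
        return theta - lr * grad_loss

    theta = np.array([0.5, -1.0, 2.0])
    grad = np.array([0.1, 0.0, -0.2])          # pretend gradient of the data loss
    print(grad_step(theta, grad, lr=0.1, lam=0.01, penalty="l2"))
    print(grad_step(theta, grad, lr=0.1, lam=0.01, penalty="l1"))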

  7. More examples

  8. Other types of regularizations
     • Robustness to noise
        • Noise to the input
        • Noise to the weights
        • Noise to the output
     • Data augmentation
     • Early stopping
     • Dropout

  9. Multiple optimal solutions?
     [Figure: linearly separable data from Class +1 and Class -1, with three separating hyperplanes w1, w2, w3.] Prefer w2 (higher confidence)

  10. Add noise to the input
     [Figure: the same Class +1 / Class -1 data with noise added around each input point.] Prefer w2 (higher confidence)

  11. Caution: not too much noise
     Too much noise leads to data points crossing the boundary.
     [Figure: the same data where overly large input noise pushes points across the preferred separator w2.]
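
A minimal sketch of one training step with input noise, assuming a linear model with squared loss; the noise scale σ = 0.1 and the data point are placeholders. As the slide above cautions, σ has to stay small relative to the class separation.

    import numpy as np

    rng = np.random.default_rng(0)

    def sgd_step_with_input_noise(w, x, y, lr=0.01, sigma=0.1):
        """One SGD step on the squared loss, with Gaussian noise added to the input.
        sigma should stay small, otherwise noisy points cross the decision boundary
        and effectively carry the wrong label."""
        x_noisy = x + sigma * rng.normal(size=x.shape)
        pred = w @ x_noisy
        grad = 2 * (pred - y) * x_noisy            # gradient of (w^T x_noisy - y)^2
        return w - lr * grad

    w = np.zeros(3)
    w = sgd_step_with_input_noise(w, x=np.array([1.0, 2.0, -1.0]), y=1.0)
    print(w)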

  12. Equivalence to weight decay
     • Suppose the hypothesis is f(x) = w^T x, and the noise is ε ~ N(0, λI)
     • After adding noise to the input, the loss is
           L(f) = E_{x,y,ε} [f(x + ε) − y]² = E_{x,y,ε} [f(x) + w^T ε − y]²
                = E_{x,y} [f(x) − y]² + 2 E_{x,y,ε} [w^T ε (f(x) − y)] + E_{x,y,ε} [(w^T ε)²]
                = E_{x,y} [f(x) − y]² + λ ‖w‖²
       since ε has zero mean and is independent of (x, y), the cross term vanishes and E[(w^T ε)²] = λ‖w‖²
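
A quick numerical sanity check of the identity on this slide (a sketch; the dimension, λ, and data are arbitrary): sampling ε ~ N(0, λI) and averaging the noisy-input loss should come out close to the clean loss plus λ‖w‖².

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 20000, 4, 0.3
    w = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    # Monte Carlo estimate of the noisy-input loss E[(w^T(x + eps) - y)^2], eps ~ N(0, lam I).
    eps = np.sqrt(lam) * rng.normal(size=(n, d))
    noisy_loss = np.mean((X @ w + eps @ w - Y) ** 2)

    # Clean loss plus the weight-decay term lam * ||w||^2.
    decay_loss = np.mean((X @ w - Y) ** 2) + lam * np.sum(w ** 2)

    print(noisy_loss, decay_loss)   # the two values should be close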

  13. Add noise to the weights
     • For the loss on each data point, add a noise term to the weights before computing the prediction:
           ε ~ N(0, ηI),   w' = w + ε
     • Prediction: f_{w'}(x) instead of f_w(x)
     • The loss becomes
           L(f) = E_{x,y,ε} [f_{w+ε}(x) − y]²

  14. Add noise to the weights
     • The loss becomes
           L(f) = E_{x,y,ε} [f_{w+ε}(x) − y]²
     • To simplify, use a Taylor expansion in the weights:
           f_{w+ε}(x) ≈ f_w(x) + ε^T ∇_w f_w(x) + (1/2) ε^T ∇²_w f_w(x) ε
     • Plugging in:
           L(f) ≈ E [f_w(x) − y]² + η E[(f_w(x) − y) ∇²_w f_w(x)] + η E ‖∇_w f_w(x)‖²
       The middle term is small and can be ignored; the last term acts as the regularization term
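
For the linear case f_w(x) = w^T x we have ∇_w f_w(x) = x, so the regularization term above becomes η E‖x‖². A small numerical check of this special case (the dimensions, η, and data are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, eta = 20000, 4, 0.05
    w = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    # Loss with noise on the weights: E[(f_{w+eps}(x) - y)^2], eps ~ N(0, eta I).
    eps = np.sqrt(eta) * rng.normal(size=(n, d))
    noisy_w_loss = np.mean((np.sum((w + eps) * X, axis=1) - Y) ** 2)

    # For f_w(x) = w^T x, grad_w f = x, so the predicted regularizer is eta * E||x||^2.
    predicted = np.mean((X @ w - Y) ** 2) + eta * np.mean(np.sum(X ** 2, axis=1))

    print(noisy_w_loss, predicted)   # should roughly agree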

  15. Data augmentation Figure from Image Classification with Pyramid Representation and Rotated Data Augmentation on Torch 7, by Keven Wang

  16. Data augmentation
     • Adding noise to the input is a special kind of augmentation
     • Be careful about the transformations applied:
        • Example: when classifying 'b' vs. 'd', a horizontal flip changes the label
        • Example: when classifying '6' vs. '9', a 180° rotation changes the label
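
A small sketch of a label-preserving augmentation function (not from the slides); the shift range and flip rule are placeholder choices, and the comments record the caveat from the two examples above.

    import numpy as np

    rng = np.random.default_rng(0)

    def augment(image, allow_flip=False):
        """Return a randomly transformed copy of a 2-D image array.
        Horizontal flips are label-preserving for many tasks, but NOT for
        'b' vs. 'd' (a flipped 'b' looks like a 'd'); a 180-degree rotation
        would similarly confuse '6' and '9'."""
        out = np.roll(image, shift=rng.integers(-2, 3), axis=1)   # small horizontal shift
        if allow_flip and rng.random() < 0.5:
            out = np.fliplr(out)
        return out

    img = rng.random((28, 28))         # stand-in for an MNIST-sized image
    print(augment(img).shape)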

  17. Early stopping
     • Idea: don't train the network down to too small a training error
     • Recall overfitting: the larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two (i.e., fits the noise in the training data rather than the underlying pattern)
     • Prevent overfitting: do not push the fit too far; use the validation error to decide when to stop

  18. Early stopping Figure from Deep Learning, by Goodfellow, Bengio and Courville

  19. Early stopping
     • While training, also track the validation error
     • Every time the validation error improves, store a copy of the weights
     • When the validation error has not improved for some time, stop
     • Return the stored copy of the weights
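
A runnable sketch of this procedure on a toy regression problem (all sizes, the learning rate, and the patience of 20 steps are made-up choices): track the validation error, keep a copy of the best weights, and stop when there has been no improvement for a while.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy regression problem with a train/validation split.
    d = 10
    X = rng.normal(size=(200, d))
    Y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=200)
    X_tr, Y_tr, X_val, Y_val = X[:150], Y[:150], X[150:], Y[150:]

    def mse(w, X, Y):
        return np.mean((X @ w - Y) ** 2)

    w = np.zeros(d)
    best_w, best_val, since_improved = w.copy(), np.inf, 0
    patience, lr = 20, 0.01

    for step in range(10_000):
        grad = 2 * X_tr.T @ (X_tr @ w - Y_tr) / len(Y_tr)   # gradient of the training MSE
        w -= lr * grad
        val = mse(w, X_val, Y_val)
        if val < best_val:                     # validation improved: store a copy of the weights
            best_val, best_w, since_improved = val, w.copy(), 0
        else:
            since_improved += 1
            if since_improved >= patience:     # no improvement for `patience` steps: stop
                break

    print(step, best_val, mse(best_w, X_val, Y_val))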

  20. Early stopping
     • Hyperparameter selection: the number of training steps is the hyperparameter
     • Advantages:
        • Efficient: runs along with training; only needs to store an extra copy of the weights
        • Simple: no change to the model or algorithm
     • Disadvantage: needs validation data

  21. Early stopping
     • Strategy to get rid of the disadvantage: after early stopping in the first run, train a second run that reuses the validation data
     • How to reuse the validation data (a sketch of strategy 1 follows below):
        1. Start fresh, and train on both the training and validation data for the same number of epochs as the first run
        2. Start from the weights of the first run, and train on both the training and validation data until the validation loss falls below the training loss at the early stopping point
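
A minimal sketch of strategy 1, assuming the first run's early-stopping step count is already known (the hard-coded best_step = 120, the toy data, and the plain gradient-descent helper are placeholders, not from the slides).

    import numpy as np

    def train_steps(w, X, Y, n_steps, lr=0.01):
        """Plain gradient descent on MSE for n_steps; stands in for the real training loop."""
        for _ in range(n_steps):
            w = w - lr * 2 * X.T @ (X @ w - Y) / len(Y)
        return w

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    Y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)
    X_tr, Y_tr, X_val, Y_val = X[:150], Y[:150], X[150:], Y[150:]

    # Number of updates at which the first run (with early stopping on X_val) stopped.
    best_step = 120                                   # would come from the first run

    # Strategy 1: start fresh and train on training + validation data for the same number of steps.
    # (Strategy 2 would instead continue from the first run's weights on the combined data.)
    w2 = train_steps(np.zeros(10),
                     np.vstack([X_tr, X_val]),
                     np.concatenate([Y_tr, Y_val]),
                     best_step)
    print(np.mean((X_val @ w2 - Y_val) ** 2))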

  22. Early stopping as a regularizer Figure from Deep Learning, by Goodfellow, Bengio and Courville

  23. Dropout
     • Randomly select weights to update
     • More precisely, in each update step:
        • Randomly sample a different binary mask over all the input and hidden units
        • Multiply the mask bits with the units and do the update as usual
     • Typical dropout probability: 0.2 for input units and 0.5 for hidden units
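
A minimal NumPy sketch of a forward pass with dropout masks on the input and hidden units, using the drop probabilities from the slide (0.2 and 0.5). The "inverted dropout" rescaling by 1/(1 − p) is a common implementation detail, not something the slide specifies.

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_forward(x, W1, W2, p_in=0.2, p_hidden=0.5, train=True):
        """Forward pass of a one-hidden-layer net with dropout masks on the
        input and hidden units. Inverted-dropout scaling keeps the expected
        activations unchanged, so no rescaling is needed at test time."""
        if train:
            mask_in = (rng.random(x.shape) >= p_in) / (1 - p_in)
            x = x * mask_in
        h = np.maximum(0, x @ W1)                      # ReLU hidden layer
        if train:
            mask_h = (rng.random(h.shape) >= p_hidden) / (1 - p_hidden)
            h = h * mask_h
        return h @ W2

    x = rng.normal(size=(32, 100))                     # a batch of 32 inputs
    W1 = rng.normal(size=(100, 50)) * 0.1
    W2 = rng.normal(size=(50, 10)) * 0.1
    print(dropout_forward(x, W1, W2).shape)            # (32, 10)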

  24. Dropout Figure from Deep Learning, by Goodfellow, Bengio and Courville

  25. Dropout Figure from Deep Learning, by Goodfellow, Bengio and Courville

  26. Dropout Figure from Deep Learning, by Goodfellow, Bengio and Courville

  27. What regularizations are frequently used?
     • ℓ2 regularization
     • Early stopping
     • Dropout
     • Data augmentation, if the transformations are known and easy to implement
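
The slides do not tie these techniques to any particular framework; as one possible PyTorch sketch of how the pieces are commonly combined (layer sizes and hyperparameters are made up):

    import torch
    import torch.nn as nn

    # A small MLP with dropout (0.2 on the input, 0.5 on the hidden layer).
    model = nn.Sequential(
        nn.Dropout(p=0.2),
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(256, 10),
    )

    # l2 regularization via the optimizer's weight_decay term.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

    # Early stopping and data augmentation would live in the training loop and
    # data pipeline (see slides 19 and 16), not in the model definition.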
