  1. Deep Learning Basics. Lecture 3: Regularization I. Princeton University COS 495. Instructor: Yingyu Liang

  2. What is regularization?
  • In general: any method to prevent overfitting or help the optimization
  • Specifically: additional terms in the training optimization objective to prevent overfitting or help the optimization

  3. Review: overfitting

  4. Overfitting example: regression using polynomials, $t = \sin(2\pi x) + \epsilon$. Figure from Pattern Recognition and Machine Learning, Bishop

  5. Overfitting example: regression using polynomials. Figure from Pattern Recognition and Machine Learning, Bishop

  6. Overfitting
  • Empirical loss and expected loss are different
  • The smaller the data set, the larger the difference between the two
  • The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two
  • Thus the learned hypothesis has small training error but large test error (overfitting)

  7. Prevent overfitting
  • A larger data set helps
  • Throwing away useless hypotheses also helps
  • Classical regularization: some principled ways to constrain hypotheses
  • Other types of regularization: data augmentation, early stopping, etc.

  8. Different views of regularization

  9. Regularization as hard constraint
  • Training objective
  $$\min_f \hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i) \quad \text{subject to } f \in \mathcal{H}$$
  • When parametrized
  $$\min_\theta \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) \quad \text{subject to } \theta \in \Omega$$

  10. Regularization as hard constraint
  • When $\Omega$ is measured by some quantity $R$
  $$\min_\theta \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) \quad \text{subject to } R(\theta) \le r$$
  • Example: $\ell_2$ regularization
  $$\min_\theta \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) \quad \text{subject to } \|\theta\|_2^2 \le r^2$$

  11. Regularization as soft constraint
  • The hard-constraint optimization is equivalent to the soft-constraint one
  $$\min_\theta \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda^* R(\theta)$$
  for some regularization parameter $\lambda^* > 0$
  • Example: $\ell_2$ regularization
  $$\min_\theta \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$$
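As a concrete illustration (not from the slides), the following is a minimal NumPy sketch of the soft-constraint objective above, instantiated with squared loss and the $\ell_2$ penalty; the data X, y, the parameter vector theta, and the penalty strength lam are hypothetical placeholders.

```python
import numpy as np

def regularized_objective(theta, X, y, lam):
    """Soft-constraint objective: average loss plus lam * R(theta).

    The loss here is squared error and R(theta) = ||theta||_2^2,
    matching the l2 example on the slide."""
    n = X.shape[0]
    residuals = X @ theta - y
    data_loss = np.sum(residuals ** 2) / n    # (1/n) * sum_i l(theta, x_i, y_i)
    penalty = lam * np.dot(theta, theta)      # lam * ||theta||_2^2
    return data_loss + penalty

# Hypothetical usage on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
theta = rng.normal(size=5)
print(regularized_objective(theta, X, y, lam=0.1))
```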

  12. Regularization as soft constraint
  • Shown by the Lagrange multiplier method
  $$\mathcal{L}(\theta, \lambda) := \hat{L}(\theta) + \lambda [R(\theta) - r]$$
  • Suppose $\theta^*$ is the optimum of the hard-constraint optimization
  $$\theta^* = \arg\min_\theta \max_{\lambda \ge 0} \mathcal{L}(\theta, \lambda) = \arg\min_\theta \max_{\lambda \ge 0} \hat{L}(\theta) + \lambda [R(\theta) - r]$$
  • Suppose $\lambda^*$ is the corresponding optimal value of $\lambda$ in the max; then
  $$\theta^* = \arg\min_\theta \mathcal{L}(\theta, \lambda^*) = \arg\min_\theta \hat{L}(\theta) + \lambda^* [R(\theta) - r]$$

  13. Regularization as Bayesian prior
  • Bayesian view: everything is a distribution
  • Prior over the hypotheses: $p(\theta)$
  • Posterior over the hypotheses: $p(\theta \mid \{x_i, y_i\})$
  • Likelihood: $p(\{x_i, y_i\} \mid \theta)$
  • Bayes' rule:
  $$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$

  14. Regularization as Bayesian prior
  • Bayes' rule:
  $$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$
  • Maximum A Posteriori (MAP):
  $$\max_\theta \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \underbrace{\log p(\theta)}_{\text{regularization}} + \underbrace{\log p(\{x_i, y_i\} \mid \theta)}_{\text{MLE loss}}$$

  15. Regularization as Bayesian prior
  • Example: $\ell_2$ loss with $\ell_2$ regularization
  $$\min_\theta \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( f_\theta(x_i) - y_i \right)^2 + \lambda^* \|\theta\|_2^2$$
  • Corresponds to a normal likelihood $p(x, y \mid \theta)$ and a normal prior $p(\theta)$
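As a sanity check (not part of the slides), here is a small NumPy/SciPy sketch of the correspondence above under one set of assumptions: a Gaussian likelihood $y_i \sim \mathcal{N}(x_i^T\theta, \sigma^2)$ and a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$. With $\lambda = 1/(2\tau^2)$ and the $1/n$ factor absorbed into the scaling, the negative log-posterior and the $\ell_2$-regularized squared loss differ only by a constant, so their differences across parameter values agree. All data and constants are hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
sigma, tau = 1.0, 2.0                 # hypothetical likelihood / prior scales

def neg_log_posterior(theta):
    # -[log p({x_i, y_i} | theta) + log p(theta)], dropping log p({x_i, y_i})
    log_lik = norm.logpdf(y, loc=X @ theta, scale=sigma).sum()
    log_prior = norm.logpdf(theta, loc=0.0, scale=tau).sum()
    return -(log_lik + log_prior)

def regularized_loss(theta, lam):
    # squared loss (scaled by 1 / (2 sigma^2)) plus lam * ||theta||_2^2
    return np.sum((X @ theta - y) ** 2) / (2 * sigma**2) + lam * np.dot(theta, theta)

lam = 1.0 / (2 * tau**2)              # the lambda implied by the Gaussian prior
t1, t2 = rng.normal(size=3), rng.normal(size=3)
# The two objectives differ by a constant, so their differences agree:
print(np.isclose(neg_log_posterior(t1) - neg_log_posterior(t2),
                 regularized_loss(t1, lam) - regularized_loss(t2, lam)))  # True
```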

  16. Three views
  • Typical choice for optimization: soft-constraint
  $$\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \lambda R(\theta)$$
  • Hard constraint and Bayesian view: conceptual; or used for derivation

  17. Three views
  • Hard constraint preferred if:
    • the explicit bound $R(\theta) \le r$ is known
    • the soft constraint causes the optimization to get trapped in a local minimum with small $\theta$
    • projection back to the feasible set leads to stability (a sketch of this projection step follows below)
  • Bayesian view preferred if:
    • the prior distribution is known
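To make the hard-constraint (projection) option concrete, here is a minimal sketch of projected gradient descent onto the $\ell_2$ ball $\{\theta : \|\theta\|_2 \le r\}$. This is an illustration under assumed least-squares data, not a method prescribed by the slides; grad, theta0, r, lr, and steps are hypothetical names.

```python
import numpy as np

def project_l2_ball(theta, r):
    """Project theta back onto the feasible set {theta : ||theta||_2 <= r}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

def projected_gradient_descent(grad, theta0, r, lr=0.1, steps=200):
    """Hard-constraint training: plain gradient step, then projection."""
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta - lr * grad(theta)     # unconstrained gradient step
        theta = project_l2_ball(theta, r)    # enforce R(theta) <= r
    return theta

# Hypothetical usage: least-squares loss on random data
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
grad = lambda th: 2.0 / len(y) * X.T @ (X @ th - y)
theta_hat = projected_gradient_descent(grad, np.zeros(5), r=1.0)
print(np.linalg.norm(theta_hat))             # stays within the radius r = 1.0
```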

  18. Some examples

  19. Classical regularization
  • Norm penalty
    • $\ell_2$ regularization
    • $\ell_1$ regularization
  • Robustness to noise

  20. $\ell_2$ regularization
  $$\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \frac{\alpha}{2} \|\theta\|_2^2$$
  • Effect on (stochastic) gradient descent
  • Effect on the optimal solution

  21. Effect on gradient descent
  • Gradient of the regularized objective
  $$\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha \theta$$
  • Gradient descent update
  $$\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \alpha \theta = (1 - \eta \alpha)\, \theta - \eta \nabla \hat{L}(\theta)$$
  • Terminology: weight decay
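A quick numerical check (not from the slides) of the weight-decay identity above: one gradient step on the $\ell_2$-regularized objective is the same as first shrinking the weights by $(1 - \eta\alpha)$ and then taking a plain gradient step. The vectors theta and grad_L and the constants eta, alpha are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)
theta = rng.normal(size=4)
grad_L = rng.normal(size=4)     # stand-in for the gradient of L_hat at theta
eta, alpha = 0.1, 0.01

# Step on the regularized objective: its gradient is grad_L + alpha * theta
step_regularized = theta - eta * (grad_L + alpha * theta)

# Weight-decay form: shrink the weights, then take the plain gradient step
step_weight_decay = (1 - eta * alpha) * theta - eta * grad_L

print(np.allclose(step_regularized, step_weight_decay))   # True
```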

  22. Effect on the optimal solution
  • Consider a quadratic approximation around $\theta^*$ (with Hessian $H$)
  $$\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$$
  • Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so
  $$\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$$
  $$\nabla \hat{L}(\theta) \approx H (\theta - \theta^*)$$

  23. Effect on the optimal solution
  • Gradient of the regularized objective
  $$\nabla \hat{L}_R(\theta) \approx H(\theta - \theta^*) + \alpha \theta$$
  • At the regularized optimum $\theta_R^*$
  $$0 = \nabla \hat{L}_R(\theta_R^*) \approx H(\theta_R^* - \theta^*) + \alpha \theta_R^*$$
  $$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$$

  24. Effect on the optimal solution
  • The optimum
  $$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$$
  • Suppose $H$ has the eigendecomposition $H = Q \Lambda Q^T$
  $$\theta_R^* \approx (H + \alpha I)^{-1} H \theta^* = Q (\Lambda + \alpha I)^{-1} \Lambda Q^T \theta^*$$
  • Effect: rescale $\theta^*$ along the eigenvectors of $H$
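The rescaling claim can be checked numerically. The sketch below (an illustration, not from the slides) builds a random symmetric positive definite H, computes $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$ directly, and confirms it matches rescaling each eigen-component of $\theta^*$ by $\lambda_i / (\lambda_i + \alpha)$.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4))
H = A @ A.T + 0.1 * np.eye(4)          # a symmetric positive definite "Hessian"
theta_star = rng.normal(size=4)
alpha = 0.5

# Direct formula for the regularized optimum
theta_R = np.linalg.solve(H + alpha * np.eye(4), H @ theta_star)

# Via the eigendecomposition H = Q Lambda Q^T: each eigen-component of
# theta_star is rescaled by lambda_i / (lambda_i + alpha)
lam, Q = np.linalg.eigh(H)
coords = Q.T @ theta_star              # coordinates of theta* in the eigenbasis
theta_R_eig = Q @ (lam / (lam + alpha) * coords)

print(np.allclose(theta_R, theta_R_eig))   # True
```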

  25. Effect on the optimal solution
  Notations in the figure: $\theta^* = w^*$, $\theta_R^* = \tilde{w}$
  Figure from Deep Learning, Goodfellow, Bengio and Courville

  26. $\ell_1$ regularization
  $$\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \alpha \|\theta\|_1$$
  • Effect on (stochastic) gradient descent
  • Effect on the optimal solution

  27. Effect on gradient descent
  • Gradient of the regularized objective
  $$\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha\, \mathrm{sign}(\theta)$$
  where sign applies to each element of $\theta$
  • Gradient descent update
  $$\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \alpha\, \mathrm{sign}(\theta)$$
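For comparison with weight decay, here is a minimal sketch (not from the slides) of one $\ell_1$-regularized (sub)gradient step: the penalty pushes every nonzero coordinate toward zero by the constant amount $\eta\alpha$, rather than shrinking it proportionally. theta, grad_L, eta, and alpha are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(5)
theta = np.array([0.8, -0.05, 0.0, -1.3])
grad_L = rng.normal(size=4)     # stand-in for the gradient of L_hat at theta
eta, alpha = 0.1, 0.2

# l1-regularized (sub)gradient step: constant push of eta*alpha toward zero
theta_l1 = theta - eta * grad_L - eta * alpha * np.sign(theta)

# l2-regularized step for comparison: proportional shrinkage (weight decay)
theta_l2 = (1 - eta * alpha) * theta - eta * grad_L

print(theta_l1)
print(theta_l2)
```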

  28. Effect on the optimal solution
  • Consider a quadratic approximation around $\theta^*$ (with Hessian $H$)
  $$\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^T \nabla \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$$
  • Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so
  $$\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2} (\theta - \theta^*)^T H (\theta - \theta^*)$$

  29. Effect on the optimal solution
  • Further assume that $H$ is diagonal and positive ($H_{ii} > 0$ for all $i$)
    • not true in general, but assumed to get some intuition
  • The regularized objective is then (ignoring constants)
  $$\hat{L}_R(\theta) \approx \sum_i \left[ \frac{1}{2} H_{ii} \left( \theta_i - \theta_i^* \right)^2 + \alpha |\theta_i| \right]$$
  • The optimum $\theta_R^*$
  $$(\theta_R^*)_i \approx \begin{cases} \max\left\{ \theta_i^* - \frac{\alpha}{H_{ii}},\, 0 \right\} & \text{if } \theta_i^* \ge 0 \\ \min\left\{ \theta_i^* + \frac{\alpha}{H_{ii}},\, 0 \right\} & \text{if } \theta_i^* < 0 \end{cases}$$

  30. Effect on the optimal solution
  • Effect: induce sparsity
  [Figure: $(\theta_R^*)_i$ plotted against $(\theta^*)_i$, with thresholds at $\pm \alpha / H_{ii}$]

  31. Effect on the optimal solution
  • Further assume that $H$ is diagonal
  • Compact expression for the optimum $\theta_R^*$
  $$(\theta_R^*)_i \approx \mathrm{sign}(\theta_i^*) \max\left\{ |\theta_i^*| - \frac{\alpha}{H_{ii}},\, 0 \right\}$$
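The compact expression is exactly soft thresholding. A minimal sketch follows (an illustration under the diagonal-Hessian assumption, not code from the slides); the values of theta_star, H_diag, and alpha are hypothetical.

```python
import numpy as np

def soft_threshold(theta_star, alpha, H_diag):
    """Compact expression for the l1-regularized optimum,
    assuming a diagonal Hessian with positive entries H_diag."""
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - alpha / H_diag, 0.0)

# Hypothetical values
theta_star = np.array([1.5, 0.2, -0.1, -2.0])
H_diag = np.array([1.0, 1.0, 2.0, 0.5])
print(soft_threshold(theta_star, alpha=0.5, H_diag=H_diag))
# [ 1.  0. -0. -1.]  -- small coordinates are driven exactly to zero (sparsity)
```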

  32. Bayesian view
  • $\ell_1$ regularization corresponds to a Laplacian prior
  $$p(\theta) \propto \exp\left( -\alpha \sum_i |\theta_i| \right)$$
  $$\log p(\theta) = -\alpha \sum_i |\theta_i| + \text{constant} = -\alpha \|\theta\|_1 + \text{constant}$$
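As a quick check (not from the slides), the sketch below verifies that the log-density of a Laplace prior with scale $1/\alpha$ equals $-\alpha\|\theta\|_1$ plus a constant, i.e. the negated $\ell_1$ penalty; the values of alpha and theta are arbitrary.

```python
import numpy as np
from scipy.stats import laplace

alpha = 3.0
theta = np.array([0.3, -1.2, 0.0, 2.5])

# Laplace prior with scale 1/alpha: log p(theta_i) = log(alpha/2) - alpha * |theta_i|
log_prior = laplace.logpdf(theta, loc=0.0, scale=1.0 / alpha).sum()
l1_penalty = alpha * np.abs(theta).sum()
constant = len(theta) * np.log(alpha / 2.0)

print(np.isclose(log_prior, -l1_penalty + constant))   # True
```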
