Neural Network Part 2: Regularization (Yingyu Liang, Computer Sciences 760, Fall 2017)

  1. Neural Network Part 2: Regularization Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

  2. Goals for the lecture: you should understand the following concepts • regularization • different views of regularization • norm constraint • data augmentation • early stopping • dropout • batch normalization

  3. What is regularization? • In general: any method to prevent overfitting or help the optimization • Specifically: additional terms in the training optimization objective to prevent overfitting or help the optimization

  4. Overfitting example: regression using polynomials, y = sin(2πx) + ε. Figure from Pattern Recognition and Machine Learning, Bishop

  5. Overfitting example: regression using polynomials. Figure from Pattern Recognition and Machine Learning, Bishop

  6. Overfitting • Key: empirical loss and expected loss are different • The smaller the data set, the larger the difference between the two • The larger the hypothesis class, the easier it is to find a hypothesis that fits this difference • Such a hypothesis has small training error but large test error (overfitting) • A larger data set helps • Throwing away useless hypotheses also helps (regularization)
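A minimal sketch (not from the slides; the sampling setup and polynomial degrees are assumptions) of the train/test gap: fit polynomials of increasing degree to a small sample from y = sin(2πx) + noise, as in the Bishop example above, and compare the training MSE with the MSE on a large held-out set standing in for the expected loss.

```python
# Sketch: overfitting with polynomial regression on y = sin(2*pi*x) + noise.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = sample(10)      # small training set
x_test, y_test = sample(1000)      # large held-out set, proxy for the expected loss

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)                    # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

With only 10 training points, the degree-9 fit should drive the training error to nearly zero while the held-out error grows, which is the empirical-vs-expected-loss gap described above.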

  7. Different views of regularization

  8. Regularization as hard constraint • Training objective: min_f L̂(f) = (1/n) Σ_{i=1}^n l(f, x_i, y_i), subject to f ∈ 𝓗 • When parametrized: min_θ L̂(θ) = (1/n) Σ_{i=1}^n l(θ, x_i, y_i), subject to θ ∈ Ω

  9. Regularization as hard constraint • When Ω is measured by some quantity R: min_θ L̂(θ) = (1/n) Σ_{i=1}^n l(θ, x_i, y_i), subject to R(θ) ≤ r • Example: ℓ2 regularization: min_θ L̂(θ) = (1/n) Σ_{i=1}^n l(θ, x_i, y_i), subject to ‖θ‖₂² ≤ r²
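One common way to optimize the hard-constraint form is projected gradient descent; the sketch below (function names, step size, and step count are assumptions, not from the slides) takes a gradient step on the empirical loss and then projects θ back onto the feasible set {θ : ‖θ‖₂ ≤ r}.

```python
# Sketch: projected gradient descent enforcing the hard constraint ||theta||_2 <= r.
import numpy as np

def project_l2_ball(theta, r):
    """Project theta onto the ball {theta : ||theta||_2 <= r}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

def projected_gd(grad_loss, theta0, r, lr=0.1, steps=100):
    """Minimize the empirical loss subject to R(theta) <= r by projected GD."""
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta - lr * grad_loss(theta)   # unconstrained gradient step
        theta = project_l2_ball(theta, r)       # project back onto the feasible set
    return theta
```

Projecting after every step keeps the iterates feasible throughout training, which is the guarantee the hard-constraint view provides and the soft penalty does not.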

  10. Regularization as soft constraint • The hard-constraint optimization is equivalent to the soft-constraint version: min_θ L̂_R(θ) = (1/n) Σ_{i=1}^n l(θ, x_i, y_i) + λ* R(θ), for some regularization parameter λ* > 0 • Example: ℓ2 regularization: min_θ L̂_R(θ) = (1/n) Σ_{i=1}^n l(θ, x_i, y_i) + λ* ‖θ‖₂²
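A minimal sketch (the helper names and demo data are assumptions) of evaluating this soft-constraint objective for ℓ2 regularization: the average per-example loss plus λ·R(θ), with R(θ) = ‖θ‖₂².

```python
# Sketch: the penalized (soft-constraint) training objective.
import numpy as np

def penalized_objective(theta, X, y, lam, loss_fn):
    """(1/n) * sum_i loss_fn(theta, x_i, y_i) + lam * ||theta||_2^2."""
    per_example = [loss_fn(theta, x_i, y_i) for x_i, y_i in zip(X, y)]
    return np.mean(per_example) + lam * np.dot(theta, theta)

def squared_loss(theta, x_i, y_i):
    """Squared loss for a linear model f_theta(x) = theta^T x."""
    return (theta @ x_i - y_i) ** 2

# Tiny usage example with made-up data:
X_demo = [np.array([1.0, 2.0]), np.array([0.5, -1.0])]
y_demo = [1.0, 0.0]
print(penalized_objective(np.zeros(2), X_demo, y_demo, lam=0.1, loss_fn=squared_loss))
```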

  11. Regularization as soft constraint • Shown by the Lagrangian multiplier method: ℒ(θ, λ) := L̂(θ) + λ[R(θ) − r] • Suppose θ* is optimal for the hard-constraint optimization: θ* = argmin_θ max_{λ≥0} ℒ(θ, λ) = argmin_θ max_{λ≥0} L̂(θ) + λ[R(θ) − r] • Suppose λ* is the corresponding optimal λ for the max; then θ* = argmin_θ ℒ(θ, λ*) = argmin_θ L̂(θ) + λ*[R(θ) − r]

  12. Regularization as Bayesian prior • Bayesian view: everything is a distribution • Prior over the hypotheses: p(θ) • Posterior over the hypotheses: p(θ | {x_i, y_i}) • Likelihood: p({x_i, y_i} | θ) • Bayes rule: p(θ | {x_i, y_i}) = p(θ) p({x_i, y_i} | θ) / p({x_i, y_i})

  13. Regularization as Bayesian prior • Bayes rule: p(θ | {x_i, y_i}) = p(θ) p({x_i, y_i} | θ) / p({x_i, y_i}) • Maximum A Posteriori (MAP): max_θ log p(θ | {x_i, y_i}) = max_θ [log p(θ) + log p({x_i, y_i} | θ)], where log p(θ) plays the role of the regularization term and log p({x_i, y_i} | θ) is the MLE loss

  14. Regularization as Bayesian prior • Example: ℓ2 loss with ℓ2 regularization: min_θ L̂_R(θ) = (1/n) Σ_{i=1}^n (f_θ(x_i) − y_i)² + λ* ‖θ‖₂² • Corresponds to a normal likelihood p(x, y | θ) and a normal prior p(θ)
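A small numerical sketch (the data, noise scale, and prior scale are assumptions) of this correspondence: with Gaussian likelihood y_i ~ N(θᵀx_i, σ²) and Gaussian prior θ ~ N(0, τ²I), the MAP estimate equals the ridge solution with λ = σ²/τ² in the "sum of squared errors + λ‖θ‖₂²" form (dividing the loss by n simply rescales λ).

```python
# Sketch: MAP with a Gaussian prior is ridge regression.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
sigma, tau = 0.5, 1.0                       # likelihood and prior standard deviations
y = X @ theta_true + rng.normal(0, sigma, n)

lam = sigma ** 2 / tau ** 2                 # regularization strength implied by the prior
# Ridge / MAP closed form: (X^T X + lam * I)^{-1} X^T y
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(theta_map)
```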

  15. Three views • Typical choice for optimization: soft constraint, min_θ L̂_R(θ) = L̂(θ) + λ R(θ) • Hard-constraint and Bayesian views: conceptual, or used for derivation

  16. Three views • Hard constraint preferred if • the explicit bound R(θ) ≤ r is known • the soft constraint gets trapped in a poor local minimum, while projecting back onto the feasible set leads to stability • Bayesian view preferred if • domain knowledge is easy to represent as a prior

  17. Examples of Regularization

  18. Classical regularization • Norm penalty • ℓ2 regularization • ℓ1 regularization • Robustness to noise • Noise to the input • Noise to the weights

  19. ℓ2 regularization: min_θ L̂_R(θ) = L̂(θ) + (α/2) ‖θ‖₂² • Effect on (stochastic) gradient descent • Effect on the optimal solution

  20. Effect on gradient descent • Gradient of the regularized objective: ∇L̂_R(θ) = ∇L̂(θ) + αθ • Gradient descent update: θ ← θ − η ∇L̂_R(θ) = θ − η ∇L̂(θ) − ηαθ = (1 − ηα)θ − η ∇L̂(θ) • Terminology: weight decay
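A minimal sketch (function and argument names are assumptions) of the weight-decay update above: the parameters are shrunk by the factor (1 − ηα) before the ordinary gradient step is applied.

```python
# Sketch: one gradient descent step on L(theta) + (alpha/2) * ||theta||_2^2.
import numpy as np

def gd_step_l2(theta, grad_loss, eta, alpha):
    """Weight decay: shrink theta by (1 - eta*alpha), then take the usual gradient step."""
    return (1 - eta * alpha) * theta - eta * grad_loss(theta)
```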

  21. Effect on the optimal solution • Consider a quadratic approximation around θ*: L̂(θ) ≈ L̂(θ*) + (θ − θ*)ᵀ ∇L̂(θ*) + ½ (θ − θ*)ᵀ H (θ − θ*), where H is the Hessian of L̂ at θ* • Since θ* is optimal, ∇L̂(θ*) = 0, so L̂(θ) ≈ L̂(θ*) + ½ (θ − θ*)ᵀ H (θ − θ*) and ∇L̂(θ) ≈ H(θ − θ*)

  22. Effect on the optimal solution • Gradient of the regularized objective: ∇L̂_R(θ) ≈ H(θ − θ*) + αθ • At the regularized optimum θ*_R: 0 = ∇L̂_R(θ*_R) ≈ H(θ*_R − θ*) + αθ*_R, so θ*_R ≈ (H + αI)⁻¹ H θ*

  23. Effect on the optimal solution • The optimum: θ*_R ≈ (H + αI)⁻¹ H θ* • Suppose H has eigendecomposition H = QΛQᵀ; then θ*_R ≈ (H + αI)⁻¹ H θ* = Q (Λ + αI)⁻¹ Λ Qᵀ θ* • Effect: rescale θ* along the eigenvectors of H (the component along the i-th eigenvector is scaled by λ_i / (λ_i + α))
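A small numerical check (the random Hessian and the value of α are assumptions) of this rescaling formula: computing (H + αI)⁻¹Hθ* directly and via the eigendecomposition gives the same vector, with each eigen-component scaled by λ_i/(λ_i + α).

```python
# Sketch: verify theta_R ~= (H + alpha*I)^{-1} H theta* = Q (Lambda + alpha*I)^{-1} Lambda Q^T theta*.
import numpy as np

rng = np.random.default_rng(0)
d, alpha = 4, 0.5
A = rng.normal(size=(d, d))
H = A @ A.T + np.eye(d)                  # symmetric positive definite stand-in for the Hessian
theta_star = rng.normal(size=d)

lams, Q = np.linalg.eigh(H)              # H = Q diag(lams) Q^T
direct = np.linalg.solve(H + alpha * np.eye(d), H @ theta_star)
via_eig = Q @ (np.diag(lams / (lams + alpha)) @ (Q.T @ theta_star))
print(np.allclose(direct, via_eig))      # True: components rescaled by lam_i / (lam_i + alpha)
```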

  24. Effect on the optimal solution • Notation: θ* = w*, θ*_R = w̃ • Figure from Deep Learning, Goodfellow, Bengio and Courville

  25. ℓ1 regularization: min_θ L̂_R(θ) = L̂(θ) + α ‖θ‖₁ • Effect on (stochastic) gradient descent • Effect on the optimal solution

  26. Effect on gradient descent • Gradient of the regularized objective: ∇L̂_R(θ) = ∇L̂(θ) + α sign(θ), where sign applies to each element of θ • Gradient descent update: θ ← θ − η ∇L̂_R(θ) = θ − η ∇L̂(θ) − ηα sign(θ)
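A minimal sketch (names are assumptions) of this ℓ1 update: the penalty contributes an elementwise α·sign(θ) term to the step.

```python
# Sketch: one (sub)gradient descent step on L(theta) + alpha * ||theta||_1.
import numpy as np

def gd_step_l1(theta, grad_loss, eta, alpha):
    """Gradient step on the loss plus an elementwise alpha * sign(theta) penalty term."""
    return theta - eta * grad_loss(theta) - eta * alpha * np.sign(theta)
```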

  27. Effect on the optimal solution • Consider a quadratic approximation around θ*: L̂(θ) ≈ L̂(θ*) + (θ − θ*)ᵀ ∇L̂(θ*) + ½ (θ − θ*)ᵀ H (θ − θ*) • Since θ* is optimal, ∇L̂(θ*) = 0, so L̂(θ) ≈ L̂(θ*) + ½ (θ − θ*)ᵀ H (θ − θ*)

  28. Effect on the optimal solution • Further assume that H is diagonal and positive (H_ii > 0 for all i) • not true in general, but assumed to get some intuition • The regularized objective is (ignoring constants): L̂_R(θ) ≈ Σ_i [ ½ H_ii (θ_i − θ*_i)² + α |θ_i| ] • The optimum θ*_R: (θ*_R)_i ≈ max{θ*_i − α/H_ii, 0} if θ*_i ≥ 0, and (θ*_R)_i ≈ min{θ*_i + α/H_ii, 0} if θ*_i < 0

  29. Effect on the optimal solution • Effect: induce sparsity • [Figure: (θ*_R)_i plotted against (θ*)_i; the output is exactly zero for |(θ*)_i| ≤ α/H_ii and shifted toward zero by α/H_ii outside that range]

  30. Effect on the optimal solution • Further assume that H is diagonal • Compact expression for the optimum θ*_R: (θ*_R)_i ≈ sign(θ*_i) max{|θ*_i| − α/H_ii, 0}
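A minimal sketch (function name and example numbers are assumptions) of this soft-thresholding expression, applied elementwise with per-coordinate threshold α/H_ii; coordinates with |θ*_i| below the threshold are set exactly to zero, which is the sparsity effect from the previous slide.

```python
# Sketch: elementwise soft thresholding (theta_R)_i = sign(theta*_i) * max(|theta*_i| - alpha/H_ii, 0).
import numpy as np

def soft_threshold(theta_star, alpha, H_diag):
    """Shrink each coordinate toward zero by alpha / H_ii, clipping at zero."""
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - alpha / H_diag, 0.0)

print(soft_threshold(np.array([0.3, -1.2, 0.05]), alpha=0.2, H_diag=np.array([1.0, 1.0, 1.0])))
# -> [ 0.1 -1.   0. ]  (the small coordinate is driven exactly to zero)
```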

  31. Bayesian view • ℓ1 regularization corresponds to a Laplacian prior: p(θ) ∝ exp(−α Σ_i |θ_i|), so log p(θ) = −α Σ_i |θ_i| + constant = −α ‖θ‖₁ + constant
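A tiny check (the example values are assumptions) that the negative log of this prior is α‖θ‖₁ up to an additive constant, which is exactly the penalty MAP adds to the loss.

```python
# Sketch: negative log of the Laplacian prior equals alpha * ||theta||_1 up to a constant.
import numpy as np

alpha = 0.5
theta = np.array([0.3, -1.2, 0.05])
neg_log_prior = alpha * np.sum(np.abs(theta))            # additive normalization constant dropped
print(neg_log_prior, alpha * np.linalg.norm(theta, 1))   # identical values
```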

  32. Multiple optimal solutions? • [Figure: Class +1 and Class −1 separated by three candidate hyperplanes w1, w2, w3] • Prefer w2 (higher confidence)

  33. Add noise to the input • [Figure: the same two classes with noise added around each data point] • Prefer w2 (higher confidence)

  34. Caution: not too much noise • Too much noise leads to data points crossing the boundary • [Figure: the same two classes with larger noise; some noisy points cross w2] • Prefer w2 (higher confidence)

  35. Equivalence to weight decay • Suppose the hypothesis is f(x) = wᵀx and the input noise is ε ~ N(0, λI) • After adding noise, the loss is L(f) = 𝔼_{x,y,ε}[(f(x + ε) − y)²] = 𝔼_{x,y,ε}[(f(x) + wᵀε − y)²] = 𝔼_{x,y}[(f(x) − y)²] + 2 𝔼_{x,y,ε}[wᵀε (f(x) − y)] + 𝔼_{ε}[(wᵀε)²] = 𝔼_{x,y}[(f(x) − y)²] + λ ‖w‖₂² • The cross term vanishes because ε has zero mean and is independent of (x, y), and 𝔼[(wᵀε)²] = λ ‖w‖₂²
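A Monte Carlo sketch (the data, the candidate w, and the number of noise draws are assumptions) of this derivation: the average squared loss on noise-corrupted inputs approaches the clean loss plus λ‖w‖₂².

```python
# Sketch: input noise eps ~ N(0, lam*I) on a linear model acts like an l2 penalty on w.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(0, 0.2, n)
w = rng.normal(size=d)                       # some candidate weight vector

clean = np.mean((X @ w - y) ** 2)
noisy = np.mean([
    np.mean(((X + rng.normal(0, np.sqrt(lam), size=X.shape)) @ w - y) ** 2)
    for _ in range(2000)                     # average over many noise draws
])
print(noisy, clean + lam * np.sum(w ** 2))   # the two values should be close
```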
