Deep Learning Basics, Lecture 3: Regularization I. Princeton University COS 495. Instructor: Yingyu Liang
What is regularization? • In general: any method to prevent overfitting or help the optimization • Specifically: additional terms in the training optimization objective to prevent overfitting or help the optimization
Review: overfitting
Overfitting example: regression using polynomials, fitting data generated as $t = \sin(2\pi x) + \epsilon$. Figures from Pattern Recognition and Machine Learning, Bishop
Overfitting • Empirical loss and expected loss are different • The smaller the data set, the larger the difference between the two • The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two • Thus the learned hypothesis has small training error but large test error (overfitting)
Prevent overfitting • A larger data set helps • Throwing away useless hypotheses also helps • Classical regularization: some principled ways to constrain hypotheses • Other types of regularization: data augmentation, early stopping, etc.
Different views of regularization
Regularization as hard constraint • Training objective: $\min_f \hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f, x_i, y_i)$, subject to $f \in \mathcal{H}$ • When parametrized: $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i)$, subject to $\theta \in \Omega$
Regularization as hard constraint • When $\Omega$ is measured by some quantity $R$: $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i)$, subject to $R(\theta) \le r$ • Example: $\ell_2$ regularization: $\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i)$, subject to $\|\theta\|_2^2 \le r^2$
Regularization as soft constraint • The hard-constraint optimization is equivalent to the soft-constraint version $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda^* R(\theta)$ for some regularization parameter $\lambda^* > 0$ • Example: $\ell_2$ regularization: $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^{n} l(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$
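A minimal sketch of evaluating this soft-constraint objective for a linear model with squared loss; the model choice, the toy data, and the value of $\lambda$ are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def regularized_loss(theta, X, y, lam):
    """Soft-constraint objective: (1/n) sum_i l(theta, x_i, y_i) + lam * ||theta||_2^2,
    with squared loss l(theta, x, y) = (x @ theta - y)^2."""
    data_loss = np.mean((X @ theta - y) ** 2)   # empirical loss L_hat(theta)
    penalty = lam * np.dot(theta, theta)        # lam * ||theta||_2^2
    return data_loss + penalty

# Toy usage (illustrative data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=100)
print(regularized_loss(np.zeros(5), X, y, lam=0.1))
```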
Regularization as soft constraint • Shown via the Lagrange multiplier method: $L(\theta, \lambda) = \hat{L}(\theta) + \lambda[R(\theta) - r]$ • Suppose $\theta^*$ is the optimum of the hard-constraint optimization: $\theta^* = \arg\min_\theta \max_{\lambda \ge 0} L(\theta, \lambda) = \arg\min_\theta \max_{\lambda \ge 0} \hat{L}(\theta) + \lambda[R(\theta) - r]$ • Suppose $\lambda^*$ is the corresponding optimal value in the inner max; then $\theta^* = \arg\min_\theta L(\theta, \lambda^*) = \arg\min_\theta \hat{L}(\theta) + \lambda^*[R(\theta) - r]$, which matches the soft-constraint objective up to the constant $-\lambda^* r$
Regularization as Bayesian prior • Bayesian view: everything is a distribution • Prior over the hypotheses: $p(\theta)$ • Posterior over the hypotheses: $p(\theta \mid \{x_i, y_i\})$ • Likelihood: $p(\{x_i, y_i\} \mid \theta)$ • Bayes' rule: $p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$
Regularization as Bayesian prior • Bayes' rule: $p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$ • Maximum a posteriori (MAP): $\max_\theta \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \left[\log p(\theta) + \log p(\{x_i, y_i\} \mid \theta)\right]$, where $\log p(\theta)$ plays the role of the regularization term and $\log p(\{x_i, y_i\} \mid \theta)$ the MLE loss
Regularization as Bayesian prior • Example: $\ell_2$ loss with $\ell_2$ regularization, $\min_\theta \hat{L}_R(\theta) = \frac{1}{n}\sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2 + \lambda^* \|\theta\|_2^2$ • Corresponds to a normal likelihood $p(x, y \mid \theta)$ and a normal prior $p(\theta)$
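For this example with a linear model $f_\theta(x) = x^\top \theta$, the regularized minimizer has a closed form (ridge regression); a sketch, where the $1/n$ scaling follows the slide's objective and the toy data are assumed:

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Minimizer of (1/n) ||X theta - y||^2 + lam ||theta||_2^2.
    Setting the gradient to zero gives (X^T X / n + lam I) theta = X^T y / n."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Usage: as lam -> 0 this recovers ordinary least squares
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge_solution(X, y, lam=0.1))
```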
Three views • Typical choice for optimization: soft-constraint $\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \lambda R(\theta)$ • Hard-constraint and Bayesian views: conceptual; or used for derivation
Three views • Hard constraint preferred if: • the explicit bound $R(\theta) \le r$ is known • the soft constraint causes the optimization to get trapped in a local minimum with small $\theta$ • projection back to the feasible set leads to stability (see the sketch below) • Bayesian view preferred if: • the prior distribution is known
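A sketch of the projection idea for the $\ell_2$-ball constraint $\|\theta\|_2 \le r$: after each gradient step, project back onto the feasible set. The step size, radius, and generic grad function are illustrative assumptions.

```python
import numpy as np

def project_l2_ball(theta, r):
    """Project theta onto {theta : ||theta||_2 <= r} (closest feasible point)."""
    norm = np.linalg.norm(theta)
    return theta if norm <= r else theta * (r / norm)

def projected_gd(grad, theta0, r, eta=0.1, steps=100):
    """Projected gradient descent for the hard-constraint formulation."""
    theta = project_l2_ball(theta0, r)
    for _ in range(steps):
        theta = project_l2_ball(theta - eta * grad(theta), r)
    return theta

# Usage: minimize ||theta - c||^2 subject to ||theta|| <= 1 (toy objective)
c = np.array([2.0, 2.0])
print(projected_gd(lambda th: 2 * (th - c), np.zeros(2), r=1.0))  # ~ c / ||c||
```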
Some examples
Classical regularization • Norm penalty • $\ell_2$ regularization • $\ell_1$ regularization • Robustness to noise
$\ell_2$ regularization: $\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \frac{\alpha}{2}\|\theta\|_2^2$ • Effect on (stochastic) gradient descent • Effect on the optimal solution
Effect on gradient descent • Gradient of regularized objective: $\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha\theta$ • Gradient descent update: $\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta\alpha\theta = (1 - \eta\alpha)\theta - \eta \nabla \hat{L}(\theta)$ • Terminology: weight decay
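A one-line sketch of this update in the weight-decay form; eta, alpha, and grad_fn are placeholder assumptions:

```python
def weight_decay_step(theta, grad_fn, eta=0.01, alpha=0.1):
    """One gradient step on the l2-regularized objective:
    theta <- (1 - eta*alpha) * theta - eta * grad(L_hat)(theta).
    The penalty term simply shrinks (decays) the weights each step."""
    return (1 - eta * alpha) * theta - eta * grad_fn(theta)
```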
Effect on the optimal solution • Consider a quadratic approximation around $\theta^*$: $\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^\top \nabla \hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^\top H (\theta - \theta^*)$ • Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so $\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^\top H (\theta - \theta^*)$ and $\nabla \hat{L}(\theta) \approx H(\theta - \theta^*)$
Effect on the optimal solution • Gradient of regularized objective: $\nabla \hat{L}_R(\theta) \approx H(\theta - \theta^*) + \alpha\theta$ • At the regularized optimum $\theta_R^*$: $0 = \nabla \hat{L}_R(\theta_R^*) \approx H(\theta_R^* - \theta^*) + \alpha\theta_R^*$, so $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$
Effect on the optimal solution • The optimum: $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^*$ • Suppose $H$ has eigendecomposition $H = Q \Lambda Q^\top$; then $\theta_R^* \approx (H + \alpha I)^{-1} H \theta^* = Q(\Lambda + \alpha I)^{-1}\Lambda Q^\top \theta^*$ • Effect: rescale $\theta^*$ along the eigenvectors of $H$, shrinking the component along the eigenvector with eigenvalue $\lambda_i$ by the factor $\lambda_i / (\lambda_i + \alpha)$
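A quick numerical check (not from the slides) that the eigendecomposition form matches the direct formula, on a random symmetric positive semidefinite $H$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T                                  # symmetric PSD stand-in for the Hessian
theta_star, alpha = rng.normal(size=4), 0.5

# Direct formula: theta_R ~= (H + alpha I)^{-1} H theta*
direct = np.linalg.solve(H + alpha * np.eye(4), H @ theta_star)

# Via H = Q Lambda Q^T: component along eigenvector q_i shrinks by lambda_i / (lambda_i + alpha)
lam, Q = np.linalg.eigh(H)
via_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ theta_star))

print(np.allclose(direct, via_eig))          # True
```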
Effect on the optimal solution. Notation: $\tilde{w} = \theta_R^*$, $w^* = \theta^*$. Figure from Deep Learning, Goodfellow, Bengio and Courville
$\ell_1$ regularization: $\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \alpha\|\theta\|_1$ • Effect on (stochastic) gradient descent • Effect on the optimal solution
Effect on gradient descent • Gradient of regularized objective: $\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \alpha\,\mathrm{sign}(\theta)$, where sign applies elementwise to $\theta$ • Gradient descent update: $\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta\alpha\,\mathrm{sign}(\theta)$
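The matching sketch of one $\ell_1$ (sub)gradient step; np.sign returns 0 at 0, a valid subgradient of $|\cdot|$ there, and eta, alpha, grad_fn are assumed placeholders:

```python
import numpy as np

def l1_subgradient_step(theta, grad_fn, eta=0.01, alpha=0.1):
    """One update on the l1-regularized objective:
    theta <- theta - eta * grad(L_hat)(theta) - eta * alpha * sign(theta)."""
    return theta - eta * grad_fn(theta) - eta * alpha * np.sign(theta)
```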
Effect on the optimal solution • Consider a quadratic approximation around $\theta^*$: $\hat{L}(\theta) \approx \hat{L}(\theta^*) + (\theta - \theta^*)^\top \nabla \hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^\top H (\theta - \theta^*)$ • Since $\theta^*$ is optimal, $\nabla \hat{L}(\theta^*) = 0$, so $\hat{L}(\theta) \approx \hat{L}(\theta^*) + \frac{1}{2}(\theta - \theta^*)^\top H (\theta - \theta^*)$
Effect on the optimal solution • Further assume that $H$ is diagonal and positive ($H_{ii} > 0$ for all $i$); this is not true in general, but we assume it to get some intuition • The regularized objective is then (ignoring constants) $\hat{L}_R(\theta) \approx \sum_i \left[\frac{1}{2} H_{ii}(\theta_i - \theta_i^*)^2 + \alpha|\theta_i|\right]$ • The optimum, componentwise: $(\theta_R^*)_i = \max\{\theta_i^* - \frac{\alpha}{H_{ii}}, 0\}$ if $\theta_i^* \ge 0$, and $(\theta_R^*)_i = \min\{\theta_i^* + \frac{\alpha}{H_{ii}}, 0\}$ if $\theta_i^* < 0$
Effect on the optimal solution • Effect: induce sparsity. [Figure: $(\theta_R^*)_i$ as a function of $(\theta^*)_i$, equal to zero on the interval $[-\frac{\alpha}{H_{ii}}, \frac{\alpha}{H_{ii}}]$]
Effect on the optimal solution • Further assume that $H$ is diagonal • Compact expression for the optimum: $(\theta_R^*)_i = \mathrm{sign}(\theta_i^*)\max\{|\theta_i^*| - \frac{\alpha}{H_{ii}}, 0\}$
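This compact expression is elementwise soft-thresholding; a sketch with assumed example values (diagonal $H$ passed as a vector):

```python
import numpy as np

def soft_threshold(theta_star, alpha, H_diag):
    """(theta_R)_i = sign(theta*_i) * max(|theta*_i| - alpha / H_ii, 0).
    Components with |theta*_i| <= alpha / H_ii are zeroed out, inducing sparsity."""
    return np.sign(theta_star) * np.maximum(np.abs(theta_star) - alpha / H_diag, 0.0)

print(soft_threshold(np.array([0.3, -0.05, 1.2]), alpha=0.1,
                     H_diag=np.array([1.0, 1.0, 2.0])))
# -> [ 0.2  -0.    1.15]
```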
Bayesian view • $\ell_1$ regularization corresponds to a Laplacian prior: $p(\theta) \propto \exp(-\alpha \sum_i |\theta_i|)$ • $-\log p(\theta) = \alpha \sum_i |\theta_i| + \text{constant} = \alpha\|\theta\|_1 + \text{constant}$
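A quick check of this correspondence using scipy's Laplace density with scale $1/\alpha$; the specific values are assumptions:

```python
import numpy as np
from scipy.stats import laplace

alpha = 0.5
theta = np.array([0.3, -1.2, 0.0])

# Laplace prior with scale 1/alpha: log p(theta_i) = log(alpha/2) - alpha * |theta_i|
neg_log_prior = -laplace.logpdf(theta, scale=1 / alpha).sum()
penalty = alpha * np.abs(theta).sum()        # alpha * ||theta||_1

# The two differ only by the normalizing constant len(theta) * log(2/alpha)
print(np.isclose(neg_log_prior - penalty, len(theta) * np.log(2 / alpha)))  # True
```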