Applied Machine Learning
Regularization
Siamak Ravanbakhsh
COMP 551 (winter 2020)
Learning objectives
- basic idea of overfitting and underfitting
- regularization (L1 & L2)
- MLE vs MAP estimation
- bias and variance trade-off
- evaluation metrics & cross-validation
Previously...
Linear regression and logistic regression.
Is linear too simple? What if it is not a good fit?
How can we increase the model's expressiveness? Create new nonlinear features.
Is there a downside?
Recall: nonlinear basis functions
Replace the original features in f_w(x) = ∑_d w_d x_d with nonlinear bases: f_w(x) = ∑_d w_d ϕ_d(x).
Linear least squares solution: w* = (Φ^⊤ Φ)^{-1} Φ^⊤ y, with Φ replacing X as the (nonlinear) feature matrix.
Each row of Φ corresponds to one instance:
Φ = [ ϕ_1(x^(1)), ϕ_2(x^(1)), ⋯ , ϕ_D(x^(1)) ;
      ϕ_1(x^(2)), ϕ_2(x^(2)), ⋯ , ϕ_D(x^(2)) ;
      ⋮ ;
      ϕ_1(x^(N)), ϕ_2(x^(N)), ⋯ , ϕ_D(x^(N)) ]
Recall: nonlinear basis functions
Examples (the original input is a scalar, x ∈ ℝ):
- polynomial bases: ϕ_k(x) = x^k
- Gaussian bases: ϕ_k(x) = exp(-(x − μ_k)² / s²)
- sigmoid bases: ϕ_k(x) = 1 / (1 + exp(-(x − μ_k)/s))
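A minimal NumPy sketch of these three basis families; the centre mu and bandwidth s are arguments to be chosen, and the defaults below are illustrative rather than values fixed by the slides:

import numpy as np

def polynomial_basis(x, k):
    # phi_k(x) = x^k
    return x ** k

def gaussian_basis(x, mu, s=1.0):
    # phi_k(x) = exp(-(x - mu)^2 / s^2)
    return np.exp(-((x - mu) ** 2) / s ** 2)

def sigmoid_basis(x, mu, s=1.0):
    # phi_k(x) = 1 / (1 + exp(-(x - mu) / s))
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))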
Example: Gaussian bases
ϕ_k(x) = exp(-(x − μ_k)² / s²)
Data generated as y^(n) = sin(x^(n)) + cos(∣x^(n)∣) + ϵ^(n).
Prediction for a new instance x′: f(x′) = ϕ(x′)^⊤ w = ϕ(x′)^⊤ (Φ^⊤ Φ)^{-1} Φ^⊤ y
(the features are evaluated at the new point; w is found using linear least squares).
Figure: our fit to the data using 10 Gaussian bases.
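A minimal sketch of this prediction step, assuming x, y are the training arrays and phi, mu are the Gaussian basis function and centres defined in the code on the next slide; x_new is a hypothetical array of new inputs:

import numpy as np

Phi = phi(x[:, None], mu[None, :])              # training design matrix, N x D
w = np.linalg.lstsq(Phi, y, rcond=None)[0]      # w found using linear least squares

Phi_new = phi(x_new[:, None], mu[None, :])      # bases evaluated at the new points
y_pred = Phi_new @ w                            # f(x') = phi(x')^T w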
Example: Gaussian bases
ϕ_k(x) = exp(-(x − μ_k)² / s²)
Our fit to the data using 10 Gaussian bases. Why not more?

import numpy as np
import matplotlib.pyplot as plt

# x: N, y: N
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-(x - mu)**2)
mu = np.linspace(0, 10, 10)                 # 10 Gaussian bases
Phi = phi(x[:, None], mu[None, :])          # N x 10
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')
Example: Gaussian bases
ϕ_k(x) = exp(-(x − μ_k)² / s²)
Using 50 bases:

# x: N, y: N
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-(x - mu)**2)
mu = np.linspace(0, 10, 50)                 # 50 Gaussian bases
Phi = phi(x[:, None], mu[None, :])          # N x 50
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')
Example: Gaussian bases
ϕ_k(x) = exp(-(x − μ_k)² / s²)
Using 200 thinner bases (s = .1): the cost function J(w) is small and we have a "perfect" fit!

# x: N, y: N
plt.plot(x, y, 'b.')
phi = lambda x, mu: np.exp(-((x - mu) / .1)**2)
mu = np.linspace(0, 10, 200)                # 200 Gaussian bases
Phi = phi(x[:, None], mu[None, :])          # N x 200
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
yh = np.dot(Phi, w)
plt.plot(x, yh, 'g-')
Generalization
Figure: fits using D = 5, 10, 50 and 200 bases; more bases give lower training error.
Which one of these models performs better at test time?
Overfitting
Which one of these models performs better at test time?
Figure: predictions of the 4 models for the same input x′.
- D = 5: underfitting
- D = 10: lowest test error
- D = 50 and D = 200: overfitting
Model selection
How to pick the model with the lowest expected loss / test error?
- regularization: bound the test error by bounding training error + model complexity
- use a validation set for model selection (and a separate test set for the final model assessment), as in the sketch below
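A minimal sketch of model selection with a validation set, reusing the Gaussian-basis setup from the earlier slides; the 80/20 split and the candidate values of D are illustrative choices, not part of the slides:

import numpy as np

def design(x, D):
    mu = np.linspace(0, 10, D)                  # D Gaussian bases spread over [0, 10]
    return np.exp(-(x[:, None] - mu[None, :])**2)

idx = np.random.permutation(len(x))             # random split of the data
tr, va = idx[:int(.8 * len(x))], idx[int(.8 * len(x)):]

best_D, best_err = None, np.inf
for D in [5, 10, 50, 200]:                      # candidate model sizes
    w = np.linalg.lstsq(design(x[tr], D), y[tr], rcond=None)[0]
    err = np.mean((design(x[va], D) @ w - y[va])**2)    # validation error
    if err < best_err:
        best_D, best_err = D, err
# the final assessment of the chosen model should use a separate test set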
An observation
When overfitting, we often see large weights.
Figure: dashed lines are w_d ϕ_d(x) for each d, shown for D = 10, 15, 20.
Idea: penalize large parameter values.
Ridge regression
L2-regularized linear least squares regression:
J(w) = ½ ∣∣Xw − y∣∣²₂ + (λ/2) ∣∣w∣∣²₂
     = ½ ∑_n (y^(n) − w^⊤ x^(n))² + (λ/2) w^⊤ w
i.e., the sum of squared errors plus the (squared) L2 norm of w, where w^⊤ w = ∑_d w_d².
The regularization parameter λ > 0 controls the strength of regularization.
A good practice is to not penalize the intercept: use (λ/2)(∣∣w∣∣²₂ − w₀²).
Ridge regression
We can set the derivative to zero:
J(w) = ½ (Xw − y)^⊤ (Xw − y) + (λ/2) w^⊤ w
∇J(w) = X^⊤ (Xw − y) + λ w = 0
(X^⊤ X + λI) w = X^⊤ y
w = (X^⊤ X + λI)^{-1} X^⊤ y
The λI term is the only part that is different due to regularization.
When using gradient descent, this term reduces the weights at each step (weight decay).
λI also makes X^⊤ X + λI invertible: we can have linearly dependent features (e.g., D > N) and the solution will still be unique!
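A minimal NumPy sketch of this closed-form ridge solution (X, y and lam are placeholders for the design matrix, targets and regularization strength):

import numpy as np

def ridge_fit(X, y, lam):
    # w = (X^T X + lam * I)^{-1} X^T y
    D = X.shape[1]
    A = X.T @ X + lam * np.eye(D)
    return np.linalg.solve(A, X.T @ y)   # solving the linear system beats forming the inverse

To leave the intercept unpenalized, zero out the diagonal entry of lam * np.eye(D) that corresponds to the bias column.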
Example: polynomial bases
ϕ_k(x) = x^k
Without regularization: degree 2 (D = 3), degree 4 (D = 5), degree 9 (D = 10).
Using D = 10 we can perfectly fit the data (high test error).
Example: polynomial bases
ϕ_k(x) = x^k
With regularization: fixed D = 10, changing the amount of regularization: λ = 0, λ = .1, λ = 10.
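A minimal sketch of this experiment, assuming the ridge_fit helper from the previous sketch and 1-D training arrays x, y; the λ values follow the slide, everything else is illustrative:

import numpy as np

Phi = np.vander(x, 10, increasing=True)         # columns 1, x, x^2, ..., x^9 (D = 10)
for lam in [0.0, 0.1, 10.0]:
    w = ridge_fit(Phi, y, lam)
    print(lam, np.mean((Phi @ w - y)**2))       # training error grows as lambda increases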
Data normalization
What if we scale the input features using different factors: x̃_d^(n) = γ_d x_d^(n)  ∀ d, n?
If we have no regularization, everything remains the same, because setting w̃_d = w_d / γ_d ∀ d gives ∣∣X̃ w̃ − y∣∣²₂ = ∣∣Xw − y∣∣²₂.
With regularization, ∣∣w̃∣∣²₂ ≠ ∣∣w∣∣²₂, so the optimal w will be different: features of different mean and variance will be penalized differently.
Normalization:
μ_d = (1/N) ∑_n x_d^(n)
σ_d² = 1/(N−1) ∑_n (x_d^(n) − μ_d)²
x_d^(n) ← (x_d^(n) − μ_d) / σ_d
This makes sure all features have the same mean and variance.
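A minimal NumPy sketch of this standardization, where X is an N x D matrix of raw features; in practice the training-set μ_d and σ_d are reused to transform validation and test data:

mu = X.mean(axis=0)                  # per-feature mean
sigma = X.std(axis=0, ddof=1)        # per-feature std, using the 1/(N-1) convention
X_norm = (X - mu) / sigma            # every feature now has zero mean and unit variance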
Maximum likelihood
Previously: linear regression & logistic regression maximize the log-likelihood.
Linear regression:
w* = arg max_w p(y ∣ X, w) = arg max_w ∏_{n=1}^N N(y^(n); w^⊤ ϕ(x^(n)), σ²) ≡ arg min_w ∑_n L₂(y^(n), w^⊤ ϕ(x^(n)))
Logistic regression:
w* = arg max_w ∏_{n=1}^N Bernoulli(y^(n); σ(w^⊤ ϕ(x^(n)))) ≡ arg min_w ∑_n L_CE(y^(n), σ(w^⊤ ϕ(x^(n))))
Idea: maximize the posterior instead of the likelihood: p(w ∣ y) = p(w) p(y ∣ w) / p(y).
Maximum a Posteriori (MAP)
Use the Bayes rule and find the parameters with maximum posterior probability:
p(w ∣ y) = p(w) p(y ∣ w) / p(y)
The denominator p(y) is the same for all choices of w (ignore it).
MAP estimate: w* = arg max_w p(w) p(y ∣ w) ≡ arg max_w log p(y ∣ w) + log p(w)
(the likelihood term is the original objective; the second term is the prior).
Even better would be to estimate the posterior distribution p(w ∣ y); more on this later in the course!
Gaussian prior
Gaussian likelihood and Gaussian prior (assuming independent Gaussians, one per weight):
w* = arg max_w p(w) p(y ∣ w) ≡ arg max_w log p(y ∣ w) + log p(w)
   ≡ arg max_w log N(y ∣ w^⊤ x, σ²) + ∑_{d=1}^D log N(w_d; 0, τ²)
   ≡ arg max_w −(1/(2σ²)) (y − w^⊤ x)² − (1/(2τ²)) ∑_{d=1}^D w_d²
   ≡ arg min_w ½ (y − w^⊤ x)² + (λ/2) ∑_{d=1}^D w_d²,   with λ = σ²/τ²
For multiple data points:
w* ≡ arg min_w ½ ∑_n (y^(n) − w^⊤ x^(n))² + (λ/2) ∑_{d=1}^D w_d²   (L2 regularization)
So L2 regularization is assuming a Gaussian prior on the weights.
The same is true for logistic regression (or any other cost function).
Laplace prior
Another notable choice of prior is the Laplace distribution:
p(w_d; β) = (1/(2β)) exp(−∣w_d∣/β)
Notice the peak of this density around zero.
Minimizing the negative log-likelihood gives −log p(w) = (1/β) ∑_d ∣w_d∣ + const, which is proportional to the L1 norm ∣∣w∣∣₁.
L1 regularization: J(w) ← J(w) + λ ∣∣w∣∣₁, also called lasso (least absolute shrinkage and selection operator).
image: https://stats.stackexchange.com/questions/177210/why-is-laplace-prior-producing-sparse-solutions
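Lasso has no closed-form solution; the proximal-gradient (ISTA) loop below is one standard way to solve it and shows how the L1 penalty drives many weights to exactly zero. It is a sketch under that choice of solver, not a method prescribed by the slides; the step size and iteration count are illustrative:

import numpy as np

def lasso_ista(X, y, lam, n_iter=500):
    # minimize 1/2 ||Xw - y||^2 + lam * ||w||_1 by proximal gradient descent
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2)**2        # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = w - step * (X.T @ (X @ w - y))      # gradient step on the smooth part
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-thresholding
    return w

For a moderately large λ, many entries of the returned w are exactly zero, whereas ridge only shrinks them; this is the sparsity contrast shown on the next slide.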
L1 vs L2 regularization
The regularization path shows how the weights {w_d} change as we change λ.
Figure: regularization paths for Lasso and ridge regression (weights w_d against decreasing regularization coefficient λ); the red line is the optimal λ from cross-validation.
Lasso produces sparse weights (many are exactly zero, rather than just small).
L1 vs L2 regularization
For an appropriate choice of λ̃, min_w J(w) + λ ∣∣w∣∣_p is equivalent to min_w J(w) subject to ∣∣w∣∣_p ≤ λ̃, for any convex cost function J(w).
Figure: the constraint regions ∣∣w∣∣₁ ≤ λ̃ and ∣∣w∣∣²₂ ≤ λ̃ together with the isocontours of J(w), showing w_MLE and w_MAP.
The optimal solution with L1 regularization is more likely to have zero components.