
Applied Machine Learning: Regularization - PowerPoint PPT Presentation



  1. Applied Machine Learning: Regularization. Siamak Ravanbakhsh, COMP 551 (Winter 2020)

  2. Learning objectives: the basic idea of overfitting and underfitting; regularization (L1 & L2); MLE vs. MAP estimation; the bias-variance trade-off; evaluation metrics & cross-validation.

  3. Previously: linear regression and logistic regression. Is linear too simple? What if it is not a good fit? How do we increase the model's expressiveness? Create new nonlinear features. Is there a downside?

  4. Recall: nonlinear basis functions. Replace the original features in $f_w(x) = \sum_d w_d x_d$ with nonlinear bases: $f_w(x) = \sum_d w_d \phi_d(x)$. The linear least squares solution is $w^* = (\Phi^\top \Phi)^{-1} \Phi^\top y$, with $\Phi$ replacing $X$ as the (nonlinear) feature matrix; each row is one instance: $\Phi = \begin{bmatrix} \phi_1(x^{(1)}) & \phi_2(x^{(1)}) & \cdots & \phi_D(x^{(1)}) \\ \phi_1(x^{(2)}) & \phi_2(x^{(2)}) & \cdots & \phi_D(x^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x^{(N)}) & \phi_2(x^{(N)}) & \cdots & \phi_D(x^{(N)}) \end{bmatrix}$

  5. Recall: nonlinear basis functions. Examples for a scalar original input $x \in \mathbb{R}$: polynomial bases $\phi_k(x) = x^k$, Gaussian bases $\phi_k(x) = e^{-(x - \mu_k)^2 / s^2}$, and sigmoid bases $\phi_k(x) = \frac{1}{1 + e^{-(x - \mu_k)/s}}$.
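A minimal NumPy sketch of these three bases (the centres mu and widths s are free parameters, not values fixed by the slides):

      import numpy as np

      # one basis function from each family, for scalar or array-valued x
      poly = lambda x, k: x**k                                        # polynomial basis
      gauss = lambda x, mu, s: np.exp(-((x - mu)**2) / s**2)          # Gaussian basis
      sigmoid = lambda x, mu, s: 1.0 / (1.0 + np.exp(-(x - mu) / s))  # sigmoid basis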

  6. Example: Gaussian bases $\phi_k(x) = e^{-(x - \mu_k)^2 / s^2}$. The data are generated as $y^{(n)} = \sin(x^{(n)}) + \cos(|x^{(n)}|) + \epsilon$. Prediction for a new instance $x'$: $f(x') = \phi(x')^\top (\Phi^\top \Phi)^{-1} \Phi^\top y$, where $w$ is found using linear least squares and the features are evaluated at the new point. The figure shows our fit to the data using 10 Gaussian bases.
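A sketch of this fit-and-predict step under assumptions: the toy data below mimic the slide's $y = \sin(x) + \cos(|x|) + \epsilon$, and the basis width, number of centres, and query grid are illustrative choices.

      import numpy as np

      rng = np.random.default_rng(0)
      x = rng.uniform(0, 10, 100)                    # assumed toy inputs
      y = np.sin(x) + np.cos(np.abs(x)) + 0.1 * rng.normal(size=x.shape)

      phi = lambda x, mu: np.exp(-(x - mu)**2)       # Gaussian bases with s = 1
      mu = np.linspace(0, 10, 10)                    # 10 centres

      Phi = phi(x[:, None], mu[None, :])             # N x 10 design matrix
      w = np.linalg.lstsq(Phi, y, rcond=None)[0]     # linear least squares fit

      x_new = np.linspace(0, 10, 200)                # new instances
      f_new = phi(x_new[:, None], mu[None, :]) @ w   # f(x') = phi(x')^T w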

  7. Example: Gaussian bases $\phi_k(x) = e^{-(x - \mu_k)^2 / s^2}$. Our fit to the data using 10 Gaussian bases; why not more?

      import numpy as np
      import matplotlib.pyplot as plt

      # x: N   y: N
      plt.plot(x, y, 'b.')
      phi = lambda x, mu: np.exp(-(x - mu)**2)
      mu = np.linspace(0, 10, 10)                  # 10 Gaussian bases
      Phi = phi(x[:, None], mu[None, :])           # N x 10
      w = np.linalg.lstsq(Phi, y, rcond=None)[0]
      yh = np.dot(Phi, w)
      plt.plot(x, yh, 'g-')

  8. Example: Gaussian bases $\phi_k(x) = e^{-(x - \mu_k)^2 / s^2}$. Using 50 bases:

      # x: N   y: N
      plt.plot(x, y, 'b.')
      phi = lambda x, mu: np.exp(-(x - mu)**2)
      mu = np.linspace(0, 10, 50)                  # 50 Gaussian bases
      Phi = phi(x[:, None], mu[None, :])           # N x 50
      w = np.linalg.lstsq(Phi, y, rcond=None)[0]
      yh = np.dot(Phi, w)
      plt.plot(x, yh, 'g-')

  9. Example: Gaussian bases $\phi_k(x) = e^{-(x - \mu_k)^2 / s^2}$. Using 200 thinner bases (s = 0.1), the cost function $J(w)$ is small and we have a "perfect" fit!

      # x: N   y: N
      plt.plot(x, y, 'b.')
      phi = lambda x, mu: np.exp(-((x - mu) / .1)**2)
      mu = np.linspace(0, 10, 200)                 # 200 Gaussian bases
      Phi = phi(x[:, None], mu[None, :])           # N x 200
      w = np.linalg.lstsq(Phi, y, rcond=None)[0]
      yh = np.dot(Phi, w)
      plt.plot(x, yh, 'g-')

  10. Generalization: fits with D = 5, 10, 50, and 200 bases (lower training error with more bases). Which one of these models performs better at test time?

  11. Overfitting: predictions of the 4 models for the same input $x'$. D = 5 underfits, D = 10 has the lowest test error, and the larger models (D = 50, D = 200) overfit. So which model performs better at test time? D = 10.

  12. Model selection: how do we pick the model with the lowest expected loss (test error)? Two strategies: (1) regularization, which bounds the test error by bounding training error plus model complexity; (2) use a validation set for model selection and a separate test set for the final assessment. A minimal splitting sketch follows below.
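In the sketch below, fit and loss are hypothetical helpers (the slides do not define them), and the 60/20/20 split is an assumption; it only illustrates the validation/test roles described above.

      import numpy as np

      rng = np.random.default_rng(0)
      idx = rng.permutation(len(x))                        # shuffle instance indices
      n_train, n_val = int(.6 * len(x)), int(.2 * len(x))
      train, val, test = np.split(idx, [n_train, n_train + n_val])

      best = None
      for D in [5, 10, 50, 200]:                           # candidate model sizes
          w = fit(x[train], y[train], D)                   # hypothetical training routine
          err = loss(x[val], y[val], w)                    # validation error drives model selection
          if best is None or err < best[0]:
              best = (err, D, w)
      final_err = loss(x[test], y[test], best[2])          # test set used only for the final assessment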

  13. An observation: when overfitting, we often see large weights. (In the figures, the dashed lines are $w_d \phi_d(x)$ for each $d$, shown for D = 10, 15, and 20.) Idea: penalize large parameter values.

  14. Ridge regression: L2-regularized linear least squares regression: $J(w) = \frac{1}{2}||Xw - y||_2^2 + \frac{\lambda}{2}||w||_2^2$, where the first term is the sum of squared errors $\frac{1}{2}\sum_n (y^{(n)} - w^\top x^{(n)})^2$ and the second uses the squared L2 norm $||w||_2^2 = w^\top w = \sum_d w_d^2$. The regularization parameter $\lambda > 0$ controls the strength of regularization. A good practice is to not penalize the intercept, i.e., use $\lambda(||w||_2^2 - w_0^2)$; a small sketch follows below.
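A minimal sketch of this cost, assuming X already carries a leading column of ones so that w[0] is the intercept:

      import numpy as np

      def ridge_cost(w, X, y, lam):
          residual = X @ w - y
          penalty = np.sum(w[1:]**2)                 # exclude the intercept w_0 from the penalty
          return 0.5 * np.sum(residual**2) + 0.5 * lam * penalty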

  15. Ridge regression: we can set the derivative to zero: $J(w) = \frac{1}{2}(Xw - y)^\top (Xw - y) + \frac{\lambda}{2} w^\top w$, so $\nabla J(w) = X^\top (Xw - y) + \lambda w = 0$, giving $(X^\top X + \lambda I) w = X^\top y$ and $w = (X^\top X + \lambda I)^{-1} X^\top y$. The $\lambda I$ term is the only part that differs due to regularization, and it makes the matrix invertible: we can have linearly dependent features (e.g., D > N) and the solution will still be unique. When using gradient descent, this term reduces the weights at each step (weight decay). A closed-form sketch follows below.
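A minimal NumPy sketch of this closed form (np.linalg.solve is used instead of an explicit inverse, a standard numerical choice):

      import numpy as np

      def ridge_fit(X, y, lam):
          D = X.shape[1]
          # (X^T X + lam I) w = X^T y; lam > 0 keeps the system invertible even when D > N
          return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)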

  16. Example: polynomial bases $\phi_k(x) = x^k$. Without regularization, using D = 10 (degree 9) we can perfectly fit the data, giving high test error. Figures: degree 2 (D = 3), degree 4 (D = 5), degree 9 (D = 10).

  17. Example: polynomial bases $\phi_k(x) = x^k$. With regularization: fixed D = 10, changing the amount of regularization: $\lambda = 0$, $\lambda = 0.1$, $\lambda = 10$. A sketch of this experiment follows below.
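A sketch of this experiment, reusing the toy x, y from the earlier Gaussian-basis sketch and the ridge closed form above; the printout is illustrative (the slides show plots instead).

      import numpy as np

      x01 = x / 10.0                                  # rescale inputs to [0, 1] to keep the powers well-conditioned
      Phi = x01[:, None] ** np.arange(10)[None, :]    # degree-9 polynomial features, N x 10
      for lam in [0.0, 0.1, 10.0]:
          w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(10), Phi.T @ y)
          print(lam, np.round(w, 2))                  # weights shrink as lam grows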

  18. Data normalization: what if we scale the input features using different factors, $\tilde{x}_d^{(n)} = \gamma_d x_d^{(n)}\ \forall d, n$? With no regularization, $\tilde{w}_d = \frac{1}{\gamma_d} w_d\ \forall d$ and everything remains the same, because $||\tilde{X}\tilde{w} - y||_2^2 = ||Xw - y||_2^2$. With regularization, $||\tilde{w}||_2^2 \neq ||w||_2^2$, so the optimal $w$ will be different: features of different mean and variance are penalized differently. Normalization: $\mu_d = \frac{1}{N}\sum_n x_d^{(n)}$, $\sigma_d^2 = \frac{1}{N-1}\sum_n (x_d^{(n)} - \mu_d)^2$, and $x_d^{(n)} \leftarrow \frac{x_d^{(n)} - \mu_d}{\sigma_d}$ makes sure all features have the same mean and variance. A sketch follows below.
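A minimal standardization sketch, assuming X holds the raw feature columns (apply it before appending a bias column of ones):

      import numpy as np

      mu = X.mean(axis=0)                  # per-feature mean
      sigma = X.std(axis=0, ddof=1)        # per-feature std, with the N-1 denominator as on the slide
      X_norm = (X - mu) / sigma            # every column now has zero mean and unit variance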

  19. Maximum likelihood: previously, linear regression and logistic regression maximized the log-likelihood. Linear regression: $w^* = \arg\max_w p(y \mid w) = \arg\max_w \prod_{n=1}^N \mathcal{N}(y^{(n)}; w^\top \phi(x^{(n)}), \sigma^2) \equiv \arg\min_w \sum_n L_2(y^{(n)}, w^\top \phi(x^{(n)}))$. Logistic regression: $w^* = \arg\max_w p(y \mid x, w) = \arg\max_w \prod_{n=1}^N \mathrm{Bernoulli}(y^{(n)}; \sigma(w^\top \phi(x^{(n)}))) \equiv \arg\min_w \sum_n L_{CE}(y^{(n)}, \sigma(w^\top \phi(x^{(n)})))$. Idea: maximize the posterior instead of the likelihood, $p(w \mid y) = \frac{p(w)\, p(y \mid w)}{p(y)}$.

  20. Maximum a posteriori (MAP): use Bayes rule and find the parameters with maximum posterior probability, $p(w \mid y) = \frac{p(w)\, p(y \mid w)}{p(y)}$, where $p(y)$ is the same for all choices of $w$ (ignore it). MAP estimate: $w^* = \arg\max_w p(w)\, p(y \mid w) \equiv \arg\max_w \log p(y \mid w) + \log p(w)$, i.e., the log-likelihood (the original objective) plus the log-prior. Even better would be to estimate the posterior distribution $p(w \mid y)$ itself; more on this later in the course!

  21. Gaussian prior: with a Gaussian likelihood and a Gaussian prior, $w^* = \arg\max_w p(w)\, p(y \mid w) \equiv \arg\max_w \log p(y \mid w) + \log p(w) \equiv \arg\max_w \log \mathcal{N}(y; w^\top x, \sigma^2) + \sum_{d=1}^D \log \mathcal{N}(w_d; 0, \tau^2)$ (assuming an independent Gaussian prior, one per weight) $\equiv \arg\max_w -\frac{1}{2\sigma^2}(y - w^\top x)^2 - \frac{1}{2\tau^2}\sum_{d=1}^D w_d^2 \equiv \arg\min_w \frac{1}{2}(y - w^\top x)^2 + \frac{\sigma^2}{2\tau^2}\sum_{d=1}^D w_d^2$. With $\lambda = \frac{\sigma^2}{\tau^2}$ and multiple data points, this becomes $\arg\min_w \frac{1}{2}\sum_n (y^{(n)} - w^\top x^{(n)})^2 + \frac{\lambda}{2}\sum_{d=1}^D w_d^2$, i.e., L2 regularization. So L2 regularization amounts to assuming a Gaussian prior on the weights; the same is true for logistic regression (or any other cost function). A numerical check follows below.
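A numerical check of this equivalence under assumptions: the data, noise scale sigma, and prior scale tau below are made up, and scipy.optimize is used only to maximize the log-posterior directly.

      import numpy as np
      from scipy.optimize import minimize

      rng = np.random.default_rng(0)
      N, D = 40, 5
      X = rng.normal(size=(N, D))
      sigma, tau = 0.5, 1.0                            # assumed noise and prior standard deviations
      y = X @ rng.normal(size=D) + sigma * rng.normal(size=N)

      # negative log-posterior up to constants: 1/(2 sigma^2) ||Xw - y||^2 + 1/(2 tau^2) ||w||^2
      neg_log_post = lambda w: (np.sum((X @ w - y)**2) / (2 * sigma**2)
                                + np.sum(w**2) / (2 * tau**2))
      w_map = minimize(neg_log_post, np.zeros(D)).x

      lam = sigma**2 / tau**2                          # lambda = sigma^2 / tau^2 from the derivation
      w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
      print(np.allclose(w_map, w_ridge, atol=1e-4))    # the two estimates coincide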

  22. Laplace prior: another notable choice of prior is the Laplace distribution, $p(w_d; \beta) = \frac{1}{2\beta} e^{-\frac{|w_d|}{\beta}}$; notice the peak around zero. Minimizing the negative log-prior gives $-\log p(w) = \frac{1}{\beta}\sum_d |w_d| + \text{const} = \frac{1}{\beta}||w||_1$, the L1 norm of $w$. L1 regularization, $J(w) \leftarrow J(w) + \lambda ||w||_1$, is also called the lasso (least absolute shrinkage and selection operator). A solver sketch follows below. image: https://stats.stackexchange.com/questions/177210/why-is-laplace-prior-producing-sparse-solutions
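The slides do not give a solver for the lasso objective; one standard option is proximal gradient descent (ISTA), sketched here to show how the soft-threshold step drives many weights to exactly zero. The step size rule and iteration count are assumptions.

      import numpy as np

      def lasso_ista(X, y, lam, n_iter=500):
          # minimizes 0.5 * ||Xw - y||^2 + lam * ||w||_1
          w = np.zeros(X.shape[1])
          eta = 1.0 / np.linalg.norm(X, 2)**2          # step size: 1 / Lipschitz constant of the gradient
          for _ in range(n_iter):
              z = w - eta * X.T @ (X @ w - y)          # gradient step on the squared-error term
              w = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)   # soft-thresholding
          return w                                     # many entries end up exactly zero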

  23. L1 vs. L2 regularization: the regularization path shows how the $\{w_d\}$ change as we change $\lambda$. The lasso produces sparse weights (many are exactly zero, rather than merely small). In the figures (lasso vs. ridge regression, each $w_d$ plotted against a decreasing regularization coefficient $\lambda$), the red line marks the optimal $\lambda$ from cross-validation. A path-tracing sketch follows below.
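A sketch of how such paths can be traced: refit over a grid of lambda values and record the coefficients. The grid is illustrative, X and y are assumed to be a standardized design matrix and targets, and lasso_ista is the sketch from the previous slide.

      import numpy as np

      lambdas = np.logspace(-3, 3, 30)                  # grid of regularization strengths
      ridge_path = np.array([np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
                             for lam in lambdas])
      lasso_path = np.array([lasso_ista(X, y, lam) for lam in lambdas])
      # each row is one lambda; plotting the columns against lambda reproduces the paths on the slide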

  24. L1 vs. L2 regularization: $\min_w J(w) + \lambda ||w||_p^p$ is, for an appropriate choice of $\tilde{\lambda}$, equivalent to $\min_w J(w)$ subject to $||w||_p^p \leq \tilde{\lambda}$. The figures show the constraint regions ($||w||_1 \leq \tilde{\lambda}$ for L1, $||w||_2^2 \leq \tilde{\lambda}$ for L2) and the isocontours of $J(w)$ (any convex cost function), with $w_{MLE}$ and $w_{MAP}$ marked. The optimal solution with L1 regularization is more likely to have zero components.
