

  1. CS 337: Artificial Intelligence & Machine Learning. Instructor: Prof. Ganesh Ramakrishnan. Lecture 8: Regularization, Overfitting, Bias and Variance. August 2019.

  2. Recap: Regularization for Generalizability. Recall: complex models can lead to overfitting. How do we counter this? Regularization: the main idea is to modify the error function so that model complexity is also explicitly penalized: Loss_reg(w) = Loss_D(w) + λ · Reg(w). A squared penalty on the weights, i.e. Reg(w) = ||w||_2², is a popular penalty function and is known as L2 regularization.
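As a concrete illustration, here is a minimal NumPy sketch of this regularized objective for a linear model; the squared-error data loss and the function and variable names are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def regularized_loss(w, X, y, lam):
    """Loss_reg(w) = Loss_D(w) + lam * Reg(w), with a squared-error data loss and an L2 penalty."""
    data_loss = np.sum((X @ w - y) ** 2)   # Loss_D(w): sum of squared errors on the training data
    reg = np.sum(w ** 2)                   # Reg(w) = ||w||_2^2
    return data_loss + lam * reg
```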

  3. Recap: MAP objective and regularization. Bayesian view of regularization: regularization can be achieved using different types of priors on the parameters:
w_MAP = argmin_w (1/(2σ²)) Σ_j (y_j − wᵀx_j)² + (λ/2) ||w||_2²
We get an L2-regularized solution for the linear regression problem using a Gaussian prior on the weights. What happens when ||w||_2² is replaced with ||w||_1? Contrast their level curves!
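Because this MAP objective is quadratic in w, it has a closed-form minimizer. A small sketch, assuming a design matrix X whose rows are the x_jᵀ; the function name and the default noise variance sigma2 are illustrative:

```python
import numpy as np

def map_ridge(X, y, lam, sigma2=1.0):
    """Minimizer of (1/(2*sigma2)) * sum_j (y_j - w^T x_j)^2 + (lam/2) * ||w||_2^2.

    Setting the gradient to zero gives (X^T X + lam * sigma2 * I) w = X^T y.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * sigma2 * np.eye(d), X.T @ y)
```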

  4. Number of zero w's for different values of lambda: λ = 1e-15 → 0, 1e-10 → 0, 1e-08 → 0, 0.0001 → 0, 0.001 → 0, 0.01 → 0, 1 → 0, 5 → 0, 10 → 0, 20 → 0.

  5. Contrasting Level Curves

  6. Recap: Lasso Regularized Least Squares Regression. The general penalized (regularized) L.S. problem:
w_Reg = argmin_w ||Φw − y||_2² + λ Ω(w)
Ω(w) = ||w||_1 ⇒ Lasso. Lasso regression:
w_lasso = argmin_w ||Φw − y||_2² + λ ||w||_1
Lasso is the MAP estimate of linear regression subject to a Laplace prior on w ∼ Laplace(0, θ), where
Laplace(w_i | µ, b) = (1/(2b)) exp(−|w_i − µ| / b)
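A small sketch of the stated Lasso-Laplace connection, assuming µ = 0: the negative log of the Laplace(0, b) density is |w_i|/b up to an additive constant, which is exactly an L1 penalty. The helper names below are illustrative:

```python
import numpy as np

def laplace_pdf(w, mu=0.0, b=1.0):
    """Laplace(w | mu, b) = (1 / (2b)) * exp(-|w - mu| / b)."""
    return np.exp(-np.abs(w - mu) / b) / (2 * b)

def neg_log_laplace_prior(w, b=1.0):
    """Up to an additive constant, -log prior = sum_i |w_i| / b, i.e. an L1 penalty on w."""
    return np.sum(np.abs(w)) / b
```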

  7. Gaussian Hare vs. Laplacian Tortoise: the Gaussian prior is easier to estimate; the Laplacian prior yields more sparsity.

  8. Lasso: Iterative Soft Thresholding Algorithm (ISTA). The Lasso regularized L.S. problem:
w_Lasso = argmin_w E_Lasso(w) = argmin_w E_LS(w) + λ||w||_1, where E_LS(w) = ||Φw − y||_2²
while the relative drop in E_Lasso(w^t) across t = k and t = k+1 is significant, iterate:
  LS step: w_LS^(k+1) = w_Lasso^(k) − η ∇E_LS(w_Lasso^(k))
  Proximal step¹ (componentwise): w_Lasso,i^(k+1) = w_LS,i^(k+1) − λη if w_LS,i^(k+1) > λη; w_LS,i^(k+1) + λη if w_LS,i^(k+1) < −λη; 0 otherwise
¹ See slide 1 of https://www.cse.iitb.ac.in/~cs709/notes/enotes/24-23-10-2018-generalized-proximal-projected-gradientdescent-examples-geometry-convergence-accelerated-annotated.pdf
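A minimal NumPy sketch of ISTA as outlined above; the step size eta, the regularization weight lam, and the stopping tolerance are illustrative choices, not values from the lecture:

```python
import numpy as np

def soft_threshold(w, t):
    """Componentwise soft thresholding: shrink each entry toward zero by t, zero out entries in [-t, t]."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def ista(Phi, y, lam, eta, tol=1e-6, max_iter=10000):
    """Minimize ||Phi w - y||_2^2 + lam * ||w||_1 by alternating a gradient (LS) step and a proximal step."""
    w = np.zeros(Phi.shape[1])
    prev_obj = np.inf
    for _ in range(max_iter):
        grad = 2 * Phi.T @ (Phi @ w - y)         # gradient of the least-squares term E_LS
        w_ls = w - eta * grad                    # LS step
        w = soft_threshold(w_ls, lam * eta)      # proximal step for the L1 penalty
        obj = np.sum((Phi @ w - y) ** 2) + lam * np.sum(np.abs(w))
        if prev_obj - obj < tol * max(prev_obj, 1.0):   # stop when the relative drop is insignificant
            break
        prev_obj = obj
    return w
```

As a rule of thumb, the gradient step is stable when η is chosen no larger than the reciprocal of the largest eigenvalue of 2ΦᵀΦ.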

  9. Note how Lasso yields greater sparsity. Number of w's that are zero for different values of lambda: λ = 1e-15 → 0, 1e-10 → 0, 1e-08 → 0, 1e-05 → 8, 0.0001 → 10, 0.001 → 12, 0.01 → 13, 1 → 15, 5 → 15, 10 → 15.

  10. CS 337: Artificial Intelligence & Machine Learning. Instructor: Prof. Ganesh Ramakrishnan. Lecture: Understanding Generalization and Overfitting through Bias & Variance. August 2019.

  11. Evaluating model performance We saw in the last class how to estimate linear predictors by minimizing a squared loss objective function. How do we evaluate whether or not our estimated predictor is good? Measure 1: Training error

  12. Evaluating model performance We saw in the last class how to estimate linear predictors by minimizing a squared loss objective function. How do we evaluate whether or not our estimated predictor is good? Measure 1: Training error Measure 2: Test error

  13. Error vs. Model Complexity [Figure: prediction error plotted against model complexity]

  14. Sources of error. Three main sources of test error: 1. Bias, 2. Variance, 3. Noise.

  15. Example: function

  16. Fitting 50 lines after slight perturbation of points

  17. Variance after slight perturbation of points

  18. Bias (with respect to non-linear fit)

  19. Noise

  20. Overfitting. Overfitting: when the proposed hypothesis fits the training data too well.

  21. Underfitting. Underfitting: when the hypothesis is insufficient to fit the training data.

  22. Bias/Variance Decomposition for Regression

  23. Bias-Variance Analysis in Regression. Say the true underlying function is y = g(x) + ε, where ε is a r.v. with mean 0 and variance σ². Given a dataset of m samples, D = {x_i, y_i}, i = 1...m, we fit a linear hypothesis parameterized by w, f_D(x) = wᵀx, to minimize the sum of squared errors Σ_i (y_i − f_D(x_i))². Given a new test point x̂, whose corresponding ŷ = g(x̂) + ε̂, what is the expected test error for x̂, Err(x̂) = E_{D,ε̂}[(f_D(x̂) − ŷ)²]?

  24. Decomposing expected test error:
E[(f(x̂) − ŷ)²] = E[f(x̂)² + ŷ² − 2 f(x̂) ŷ]
= E[f(x̂)²] + E[ŷ²] − 2 E[f(x̂)] E[ŷ]   (f(x̂) and ŷ are independent)
= E[(f(x̂) − E[f(x̂)])²] + (E[f(x̂)])² + E[ŷ²] − 2 E[f(x̂)] E[ŷ]
= E[(f(x̂) − E[f(x̂)])²] + (E[f(x̂)])² + E[ŷ²] − 2 E[f(x̂)] g(x̂)   (1)
where we have used the fact that E[x²] = E[(x − E[x])²] + (E[x])².

  25. Decomposing expected test error. Applying the same trick used in Equation (1) to E[ŷ²], we get
E[(f(x̂) − ŷ)²] = E[(f(x̂) − E[f(x̂)])²] + (E[f(x̂)])² + E[(ŷ − g(x̂))²] + g(x̂)² − 2 E[f(x̂)] g(x̂)

  26. Bias-variance decomposition:
E[(f(x̂) − ŷ)²] = E[(f(x̂) − E[f(x̂)])²] + (E[f(x̂)] − g(x̂))² + E[(ŷ − g(x̂))²]
E[(f(x̂) − ŷ)²] = Variance(f(x̂)) + Bias(f(x̂))² + σ²

  27. Each error term. Bias: E[f(x̂)] − g(x̂), the average error of f(x̂). Variance: E[(f(x̂) − E[f(x̂)])²], the variance of f(x̂) across different training datasets. Noise: E[(ŷ − g(x̂))²] = E(ε²) = σ², the irreducible noise.
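A minimal Monte Carlo sketch that estimates these three terms empirically; the true function g(x) = sin(x), the sample sizes, and the noise level are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.sin                          # assumed true function (illustrative)
sigma, m, n_datasets = 0.3, 30, 500
x_hat = 1.5                         # fixed test point

preds = []
for _ in range(n_datasets):         # draw many training sets D and fit f_D on each
    x = rng.uniform(-3, 3, m)
    y = g(x) + rng.normal(0, sigma, m)
    Phi = np.column_stack([np.ones(m), x])        # linear hypothesis with an intercept
    w = np.linalg.lstsq(Phi, y, rcond=None)[0]    # least-squares fit
    preds.append(w[0] + w[1] * x_hat)             # f_D(x_hat)

preds = np.array(preds)
bias2 = (preds.mean() - g(x_hat)) ** 2            # (E[f(x_hat)] - g(x_hat))^2
variance = preds.var()                            # E[(f(x_hat) - E[f(x_hat)])^2]
noise = sigma ** 2                                # irreducible error
print(bias2, variance, noise, bias2 + variance + noise)
```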

  28. Illustrating bias and variance Image from http://scott.fortmann-roe.com/docs/BiasVariance.html

  29. Model Selection TO BE DISCUSSED IN NEXT LAB SESSION Given the bias-variance tradeoff, how do we choose the best predictor for the problem at hand? How do we set the model's parameters?

  30. Measuring bias/variance TO BE DISCUSSED IN NEXT LAB SESSION Bootstrap sampling: Repeatedly sample observations from a dataset with replacement For each bootstrap dataset D b , let V b refer to the left-out samples which will be used for validation. Train on D b to estimate f b and test on each sample in V b

  31. Measuring bias/variance TO BE DISCUSSED IN NEXT LAB SESSION Bootstrap sampling: Repeatedly sample observations from a dataset with replacement For each bootstrap dataset D b , let V b refer to the left-out samples which will be used for validation. Train on D b to estimate f b and test on each sample in V b Compute bias and variance
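A minimal sketch of this bootstrap procedure, assuming a plain least-squares fit as the learner; the function names and the number of resamples are illustrative:

```python
import numpy as np

def bootstrap_predictions(X, y, x_test, n_boot=200, seed=0):
    """Train on bootstrap resamples D_b and evaluate each fitted model f_b at x_test.

    The out-of-bag indices (V_b) could likewise be used as per-resample validation sets.
    """
    rng = np.random.default_rng(seed)
    m = len(y)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, m, m)                # sample indices with replacement -> D_b
        # oob = np.setdiff1d(np.arange(m), idx)    # left-out samples V_b for validation
        w = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]   # fit f_b on D_b
        preds.append(x_test @ w)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.var(axis=0)   # average prediction and variance across the f_b
```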

  32. Train-Validation-Test split TO BE DISCUSSED IN NEXT LAB SESSION Divide the available samples into three sets: 1. Train set: used to train the learning algorithm. 2. Validation/Development set: used for model selection and tuning hyperparameters. 3. Test/Evaluation set: used for final testing.
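A minimal sketch of such a three-way split in NumPy; the 60/20/20 proportions are an illustrative choice, not from the lecture:

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the data once, then carve out train / validation / test partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(test_frac * len(y))
    n_val = int(val_frac * len(y))
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```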

  33. Cross-Validation TO BE DISCUSSED IN NEXT LAB SESSION k-fold Cross-Validation.
Given: training set D of m examples, set of parameters Θ, learner F, number of folds k.
Split D into k folds, D_1, ..., D_k.
For each θ ∈ Θ:
  for i = 1...k: estimate f_{i,θ} = F_θ(D \ D_i)
  err_θ = (1/k) Σ_{i=1}^{k} Loss(f_{i,θ})
Output: θ* = argmin_θ err_θ and f_{θ*} = F_{θ*}(D)
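A minimal sketch of this k-fold procedure, assuming the parameter θ being tuned is a ridge regularization weight and the loss is squared error; the learner and helper names are illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_select(X, y, lambdas, k=5, seed=0):
    """k-fold cross-validation over a grid of lambda values; refit on all data with the best one."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = {}
    for lam in lambdas:
        fold_errs = []
        for i in range(k):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            w = ridge_fit(X[train], y[train], lam)            # f_{i,theta} = F_theta(D \ D_i)
            fold_errs.append(np.mean((X[val] @ w - y[val]) ** 2))
        errs[lam] = np.mean(fold_errs)                        # err_theta
    best = min(errs, key=errs.get)                            # theta* = argmin_theta err_theta
    return best, ridge_fit(X, y, best)                        # f_{theta*} = F_{theta*}(D)
```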
