

  1. Regression and generalization
     CE-717: Machine Learning
     Sharif University of Technology
     M. Soleymani, Fall 2019

  2. Topics
     - Beyond linear regression models
     - Evaluation & model selection
     - Regularization
     - Bias-Variance

  3. Recall: Linear regression (squared loss)
     - Linear regression functions:
       $f : \mathbb{R} \to \mathbb{R}$, $f(x; \mathbf{w}) = w_0 + w_1 x$
       $f : \mathbb{R}^d \to \mathbb{R}$, $f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_d x_d$
     - $\mathbf{w} = [w_0, w_1, \dots, w_d]$ are the parameters we need to set.
     - Minimizing the squared loss for linear regression:
       $J(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$
     - We obtain $\hat{\mathbf{w}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$
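A minimal NumPy sketch of this closed-form fit (the slides contain no code; names and the bias-column convention are my own):

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit: w_hat = (X^T X)^{-1} X^T y.
    X is an (N, d) data matrix; a column of ones is prepended for the bias w_0."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    # lstsq is numerically safer than forming the inverse explicitly
    w_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w_hat

def predict_linear(w, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return Xb @ w
```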

  4. Beyond linear regression
     - How can we extend linear regression to non-linear functions?
       - Transform the data using basis functions
       - Learn a linear regression on the new feature vectors (obtained by the basis functions)

  5. Beyond linear regression
     - $m$-th order polynomial regression (univariate $f : \mathbb{R} \to \mathbb{R}$):
       $f(x; \mathbf{w}) = w_0 + w_1 x + \dots + w_{m-1} x^{m-1} + w_m x^m$
     - Solution: $\hat{\mathbf{w}} = (\mathbf{X}'^\top \mathbf{X}')^{-1} \mathbf{X}'^\top \mathbf{y}$, where

       $\mathbf{X}' = \begin{bmatrix} 1 & x^{(1)} & (x^{(1)})^2 & \cdots & (x^{(1)})^m \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x^{(N)} & (x^{(N)})^2 & \cdots & (x^{(N)})^m \end{bmatrix}$, $\quad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix}$, $\quad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}$
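A sketch of this in NumPy, using `np.vander` to build the polynomial design matrix (an illustrative implementation, not from the slides):

```python
import numpy as np

def fit_polynomial(x, y, m):
    """Fit an m-th order polynomial by least squares.
    Columns of the design matrix are 1, x, x^2, ..., x^m."""
    X_poly = np.vander(x, m + 1, increasing=True)  # (N, m+1) design matrix
    w_hat, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
    return w_hat
```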

  6. Polynomial regression: example
     [Figure: fits of polynomials of degree m = 1, 3, 5, 7 to the same data set]

  7. Generalized linear
     - Linear combination of fixed non-linear functions of the input vector:
       $f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + \dots + w_m \phi_m(\mathbf{x})$
     - $\{\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})\}$: set of basis functions (or features)
     - $\phi_j(\mathbf{x}) : \mathbb{R}^d \to \mathbb{R}$

  8. Basis functions: examples
     - Linear
     - Polynomial (univariate)

  9. Basis functions: examples
     - Gaussian: $\phi_j(\mathbf{x}) = \exp\left(-\dfrac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right)$
     - Sigmoid: $\phi_j(x) = \sigma\left(\dfrac{x - \mu_j}{s_j}\right)$, where $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$
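A small sketch of these two basis functions in NumPy (the function names and the scalar/vector conventions are my own; the slides give only the formulas):

```python
import numpy as np

def gaussian_basis(x, mu, sigma):
    """phi_j(x) = exp(-||x - mu||^2 / (2 sigma^2)); x and mu are 1-D arrays."""
    return np.exp(-np.sum((x - mu) ** 2) / (2.0 * sigma ** 2))

def sigmoid_basis(x, mu, s):
    """phi_j(x) = 1 / (1 + exp(-(x - mu)/s)) for scalar x."""
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))
```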

  10. Radial basis functions: prototypes
     - Predictions based on similarity to "prototypes":
       $\phi_j(\mathbf{x}) = \exp\left(-\dfrac{1}{2\sigma_j^2}\|\mathbf{x} - \boldsymbol{\mu}_j\|^2\right)$
     - Measuring the similarity to the prototypes $\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_m$
     - $\sigma^2$ controls how quickly the basis function vanishes as a function of the distance to the prototype.
     - The training examples themselves could serve as prototypes.

  11. Generalized linear: optimization
     $J(\mathbf{w}) = \sum_{i=1}^{N}\left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2 = \sum_{i=1}^{N}\left(y^{(i)} - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}^{(i)})\right)^2$

     $\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix}$, $\quad \boldsymbol{\Phi} = \begin{bmatrix} 1 & \phi_1(\mathbf{x}^{(1)}) & \cdots & \phi_m(\mathbf{x}^{(1)}) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(\mathbf{x}^{(N)}) & \cdots & \phi_m(\mathbf{x}^{(N)}) \end{bmatrix}$, $\quad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}$

     $\hat{\mathbf{w}} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$
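A sketch tying slides 10 and 11 together: build the design matrix $\boldsymbol{\Phi}$ from Gaussian RBF features and solve the least-squares problem (an illustrative implementation; variable names and the choice of a shared $\sigma$ are assumptions):

```python
import numpy as np

def rbf_design_matrix(X, prototypes, sigma):
    """Phi[i, 0] = 1 (bias); Phi[i, j+1] = exp(-||x_i - mu_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - prototypes[None, :, :]) ** 2, axis=2)
    Phi = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])

def fit_generalized_linear(X, y, prototypes, sigma):
    Phi = rbf_design_matrix(X, prototypes, sigma)
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # solves Phi w ~= y
    return w_hat

# Usage: the training inputs themselves can serve as prototypes (slide 10)
# X_train: (N, d), y_train: (N,)
# w = fit_generalized_linear(X_train, y_train, prototypes=X_train, sigma=1.0)
```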

  12. Model complexity and overfitting
     - With limited training data, models may achieve zero training error but a large test error:
       Training (empirical) loss: $\dfrac{1}{N}\sum_{i=1}^{N}\left(y^{(i)} - f(\mathbf{x}^{(i)}; \boldsymbol{\theta})\right)^2 \approx 0$
       Expected (true) loss: $\mathbb{E}_{\mathbf{x},y}\left[\left(y - f(\mathbf{x}; \boldsymbol{\theta})\right)^2\right] \gg 0$
     - Over-fitting: when the training loss no longer bears any relation to the test (generalization) loss.
       - The model fails to generalize to unseen examples.

  13. Polynomial regression
     [Figure: fits of polynomials of degree m = 0, 1, 3, and 9 to the same data. From Bishop.]

  14. Polynomial regression: training and test error
     $\mathrm{RMSE} = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}\left(y^{(i)} - f(\mathbf{x}^{(i)}; \boldsymbol{\theta})\right)^2}$
     [Figure: training and test RMS error as a function of the polynomial degree m. From Bishop.]
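A sketch of how such a curve could be produced (the split into training and test sets is assumed; the slides only show the resulting plot):

```python
import numpy as np

def rmse(w, X_design, y):
    residuals = y - X_design @ w
    return np.sqrt(np.mean(residuals ** 2))

def train_test_rmse_by_degree(x_tr, y_tr, x_te, y_te, max_degree=9):
    """Return (train_rmse, test_rmse) lists for degrees m = 0..max_degree."""
    train_err, test_err = [], []
    for m in range(max_degree + 1):
        Phi_tr = np.vander(x_tr, m + 1, increasing=True)
        Phi_te = np.vander(x_te, m + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
        train_err.append(rmse(w, Phi_tr, y_tr))
        test_err.append(rmse(w, Phi_te, y_te))
    return train_err, test_err
```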

  15. Causes of over-fitting
     - Model complexity
       - e.g., a model with a large number of parameters (degrees of freedom)
     - Small number of training examples
       - small data size compared to the complexity of the model

  16. Model complexity
     - Example: polynomials with larger m become increasingly tuned to the random noise on the target values.
     [Figure: fits of degree m = 0, 1, 3, and 9, as on slide 13. From Bishop.]

  17. Number of training examples & overfitting
     - The over-fitting problem becomes less severe as the size of the training set increases.
     [Figure: degree m = 9 fits with N = 15 and N = 100 training points. From Bishop.]

  18. How to evaluate the learner's performance?
     - Generalization error: the true (or expected) error that we would like to optimize
     - Two ways to assess the generalization error:
       - Practical: use a separate data set to test the model
       - Theoretical: law of large numbers
         - statistical bounds on the difference between training and expected errors

  19. Avoiding over-fitting
     - Determine a suitable value for model complexity (model selection)
       - Simple hold-out method
       - Cross-validation
     - Regularization (Occam's razor)
       - Explicit preference towards simple models
       - Penalize model complexity in the objective function
       - Bayesian approach

  20. Avoiding over-fitting
     - Determine a suitable value for model complexity (model selection)
       - Simple hold-out method
       - Cross-validation
     - Regularization (Occam's razor)
       - Explicit preference towards simple models
       - Penalize model complexity in the objective function
       - Bayesian approach

  21. Evaluation and model selection
     - Evaluation:
       - We need to measure how well the learned function can predict the target for unseen examples.
     - Model selection:
       - Most of the time we need to select among a set of models
         - Example: polynomials with different degrees m
       - and thus we need to evaluate these models first.

  22. Model selection
     - The learning algorithm defines the data-driven search over the hypothesis space
       - i.e., the search for good parameters
     - Hyper-parameters are the tunable aspects of the model that the learning algorithm does not select.

     This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

  23. Model selection
     - Model selection is the process by which we choose the "best" model among a set of candidates.
       - Assume access to a function capable of measuring the quality of a model
       - Typically done "outside" the main training algorithm
     - Model selection / hyper-parameter optimization is just another form of learning.

     This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

  24. Simple hold-out: model selection
     - Steps:
       - Divide the training data into a training set and a validation set $V$
       - Use only the training set to train each candidate model
       - Evaluate each learned model on the validation set:
         $J_v(\mathbf{w}) = \dfrac{1}{|V|}\sum_{i \in V}\left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$
       - Choose the best model based on the validation-set error
     - Usually too wasteful of valuable training data:
       - Training data may be limited.
       - On the other hand, a small validation set gives a relatively noisy estimate of performance.
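A sketch of hold-out model selection over polynomial degrees (the split fraction, shuffling, and candidate set are illustrative choices, not from the slides):

```python
import numpy as np

def holdout_select_degree(x, y, degrees, val_fraction=0.2, seed=0):
    """Pick the polynomial degree with the lowest validation MSE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = int(val_fraction * len(x))
    val, tr = idx[:n_val], idx[n_val:]
    best_m, best_err = None, np.inf
    for m in degrees:
        Phi_tr = np.vander(x[tr], m + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, y[tr], rcond=None)
        Phi_val = np.vander(x[val], m + 1, increasing=True)
        err = np.mean((y[val] - Phi_val @ w) ** 2)  # J_v(w)
        if err < best_err:
            best_m, best_err = m, err
    return best_m, best_err
```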

  25. Simple hold-out: training, validation, and test sets
     - Simple hold-out chooses the model that minimizes the error on the validation set.
     - $J_v(\hat{\mathbf{w}})$ is likely to be an optimistic estimate of the generalization error:
       - an extra parameter (e.g., the degree of the polynomial) has been fit to this set.
     - Estimate the generalization error on the test set:
       - the performance of the selected model is finally evaluated on the test set.
     [Diagram: the data is split into training, validation, and test sets.]

  26. Cross-validation (CV): evaluation
     - k-fold cross-validation steps:
       - Shuffle the dataset and randomly partition the training data into k groups of approximately equal size
       - For j = 1 to k:
         - Choose the j-th group as the held-out validation group
         - Train the model on all but the j-th group of data
         - Evaluate the model on the held-out group
     - The performance scores of the model from the k runs are averaged.
       - The average error can be considered an estimate of the true performance of the model.
     [Diagram: k runs, each holding out a different fold as the validation group.]
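A minimal sketch of these steps for polynomial regression (function and variable names are my own):

```python
import numpy as np

def kfold_cv_mse(x, y, m, k=5, seed=0):
    """Average validation MSE of an m-th degree polynomial over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))          # shuffle
    folds = np.array_split(idx, k)          # partition into k groups
    errors = []
    for j in range(k):
        val = folds[j]                      # j-th group held out
        tr = np.concatenate([folds[i] for i in range(k) if i != j])
        Phi_tr = np.vander(x[tr], m + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, y[tr], rcond=None)
        Phi_val = np.vander(x[val], m + 1, increasing=True)
        errors.append(np.mean((y[val] - Phi_val @ w) ** 2))
    return np.mean(errors)                  # average over the k runs
```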

  27. Cross-validation (CV): model selection
     - For each model, first find the average error by CV.
     - The model with the best average performance is selected.

  28. Cross-validation: polynomial regression example
     - 5-fold CV, averaged over 100 runs
     [Figure: CV error for polynomials of different degrees —
      m = 3: CV MSE = 1.45; m = 1: CV MSE = 0.30;
      m = 5: CV MSE = 45.44; m = 7: CV MSE = 31759]

  29. Leave-one-out cross-validation (LOOCV)
     - When data is particularly scarce: cross-validation with k = N
     - Leave-one-out treats each training sample in turn as a test example, and all other samples as the training set.
     - Used for small datasets, when training data is valuable.
     - LOOCV can be computationally expensive, as N training steps are required.
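Since LOOCV is just k-fold CV with k = N, the earlier sketch can be reused directly (assuming the `kfold_cv_mse` sketch from slide 26):

```python
def loocv_mse(x, y, m):
    """Leave-one-out CV: k-fold CV with one sample per fold (k = N)."""
    return kfold_cv_mse(x, y, m, k=len(x))
```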

  30. Avoiding over-fitting
     - Determine a suitable value for model complexity (model selection)
       - Simple hold-out method
       - Cross-validation
     - Regularization (Occam's razor)
       - Explicit preference towards simple models
       - Penalize model complexity in the objective function
       - Bayesian approach

  31. Regularization
     - Add a penalty term to the cost function to discourage the coefficients from reaching large values.
     - Ridge regression (weight decay):
       $J(\mathbf{w}) = \sum_{i=1}^{N}\left(y^{(i)} - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}^{(i)})\right)^2 + \lambda \mathbf{w}^\top \mathbf{w}$
       $\hat{\mathbf{w}} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda \mathbf{I})^{-1} \boldsymbol{\Phi}^\top \mathbf{y}$
       with $\mathbf{y}$, $\boldsymbol{\Phi}$, and $\mathbf{w}$ defined as on slide 11.
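A sketch of the closed-form ridge solution (illustrative only; `lam` stands in for $\lambda$):

```python
import numpy as np

def fit_ridge(Phi, y, lam):
    """Closed-form ridge solution: w = (Phi^T Phi + lam * I)^{-1} Phi^T y."""
    m = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(m)
    return np.linalg.solve(A, Phi.T @ y)  # solve, rather than inverting A
```

In practice the bias term $w_0$ is often excluded from the penalty; this sketch follows the slide's formula and penalizes all coefficients.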

  32. Polynomial order
     - Polynomials with larger m become increasingly tuned to the random noise on the target values.
     - The magnitude of the coefficients typically grows as m increases. [Bishop]
