
Regression and generalization, CE-717: Machine Learning, Sharif University of Technology (PowerPoint PPT Presentation)



  1. Regression and generalization CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2018

  2. Topics } Beyond linear regression models } Evaluation & model selection } Regularization } Bias-Variance

  3. Recall: Linear regression (squared loss) } Linear regression functions: $g : \mathbb{R} \to \mathbb{R}$, $g(x; \mathbf{w}) = w_0 + w_1 x$, and $g : \mathbb{R}^d \to \mathbb{R}$, $g(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_d x_d$ } $\mathbf{w} = [w_0, w_1, \dots, w_d]^T$ are the parameters we need to set. } Minimizing the squared loss for linear regression: $J(\mathbf{w}) = \lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert^2$ } We obtain $\hat{\mathbf{w}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
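A minimal NumPy sketch of this closed-form solution (not from the slides; the synthetic data, seed, and true coefficients are assumptions for illustration). Solving the normal equations directly is fine at this scale, though `np.linalg.lstsq` is numerically safer in general:

```python
import numpy as np

# Sketch of least-squares linear regression via the normal equations,
# w_hat = (X^T X)^{-1} X^T y, on synthetic data with y = 2 + 3x + noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=30)

X = np.column_stack([np.ones_like(x), x])   # design matrix with bias column
w_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations
print(w_hat)                                # approximately [2, 3]
```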

  4. Beyond linear regression } How can we extend linear regression to non-linear functions? } Transform the data using basis functions } Learn a linear regression on the new feature vectors (obtained by the basis functions)

  5. Beyond linear regression } $m$-th order polynomial regression (univariate $g : \mathbb{R} \to \mathbb{R}$): $g(x; \mathbf{w}) = w_0 + w_1 x + \dots + w_{m-1} x^{m-1} + w_m x^m$ } Solution: $\hat{\mathbf{w}} = (\mathbf{X}'^T \mathbf{X}')^{-1} \mathbf{X}'^T \mathbf{y}$, where $\mathbf{y} = [y^{(1)}, \dots, y^{(n)}]^T$, $\mathbf{w} = [w_0, \dots, w_m]^T$, and $\mathbf{X}'$ is the $n \times (m+1)$ design matrix whose $i$-th row is $[1, x^{(i)}, (x^{(i)})^2, \dots, (x^{(i)})^m]$
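A sketch of this in NumPy (the sine-plus-noise data and the choice m = 3 are assumptions): `np.vander` builds exactly the design matrix described above, and the least-squares fit is then ordinary linear regression on those columns.

```python
import numpy as np

# Sketch: fit an m-th order polynomial by least squares on the expanded
# design matrix [1, x, x^2, ..., x^m].
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=20)

m = 3
X_poly = np.vander(x, N=m + 1, increasing=True)    # columns x^0 ... x^m
w_hat, *_ = np.linalg.lstsq(X_poly, y, rcond=None) # least-squares fit
print(w_hat)
```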

  6. Polynomial regression: example } [Figure: polynomial fits of degree $m = 1, 3, 5, 7$ to the same data set]

  7. Generalized linear } Linear combination of fixed non-linear functions of the input vector: $g(\mathbf{x}; \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + \dots + w_m \phi_m(\mathbf{x})$ } $\{\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})\}$: set of basis functions (or features), $\phi_j(\mathbf{x}) : \mathbb{R}^d \to \mathbb{R}$

  8. Basis functions: examples } Linear } Polynomial (univariate)

  9. Basis functions: examples } Gaussian: $\phi_j(\mathbf{x}) = \exp\left(-\frac{\lVert \mathbf{x} - \boldsymbol{\mu}_j \rVert^2}{2\sigma_j^2}\right)$ } Sigmoid: $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{\sigma_j}\right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$

  10. Radial Basis Functions: prototypes } Predictions based on similarity to "prototypes": $\phi_j(\mathbf{x}) = \exp\left(-\frac{1}{2\sigma_j^2} \lVert \mathbf{x} - \boldsymbol{\mu}_j \rVert^2\right)$ } Measuring the similarity to the prototypes $\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_m$ } $\sigma^2$ controls how quickly the basis function vanishes as a function of the distance to the prototype. } The training examples themselves could serve as prototypes.
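A small sketch of the RBF feature map (the helper name `rbf_features`, the shared bandwidth sigma = 0.3, and the data are all assumptions; the slide's suggestion of using the training points themselves as prototypes is followed):

```python
import numpy as np

# Sketch: Gaussian RBF feature map phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma^2)).
def rbf_features(X, prototypes, sigma=0.3):
    """Map each row of X to its similarities to the prototypes."""
    # squared distances between every input and every prototype
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

X = np.linspace(0, 1, 10).reshape(-1, 1)   # 10 one-dimensional inputs
Phi = rbf_features(X, prototypes=X)        # training points as prototypes
print(Phi.shape)                           # (10, 10): one feature per prototype
```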

  11. Generalized linear: optimization } $J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - g(\mathbf{x}^{(i)}; \mathbf{w})\right)^2 = \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(i)})\right)^2$, where $\mathbf{y} = [y^{(1)}, \dots, y^{(n)}]^T$, $\mathbf{w} = [w_0, w_1, \dots, w_m]^T$, and $\boldsymbol{\Phi}$ is the $n \times (m+1)$ matrix whose $i$-th row is $[1, \phi_1(\mathbf{x}^{(i)}), \dots, \phi_m(\mathbf{x}^{(i)})]$ } Solution: $\hat{\mathbf{w}} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{y}$
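Putting the two previous sketches together (again, `rbf_features`, the data, and the bandwidth are illustrative assumptions): the bias column of ones plays the role of $\phi_0(\mathbf{x}) = 1$, and the fit is the same least-squares problem as before, just on $\boldsymbol{\Phi}$ instead of $\mathbf{X}$.

```python
import numpy as np

# Sketch: generalized linear regression on basis-function features.
rng = np.random.default_rng(2)
X = np.linspace(0, 1, 25).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=25)

def rbf_features(X, prototypes, sigma=0.3):
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

# Phi has a bias column plus one RBF feature per prototype.
Phi = np.hstack([np.ones((len(X), 1)), rbf_features(X, X)])
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # minimizes ||y - Phi w||^2
print(w_hat.shape)                               # (26,): bias + 25 prototypes
```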

  12. Model complexity and overfitting } With limited training data, models may achieve zero training error but a large test error. } Training (empirical) loss: $\frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - g(\mathbf{x}^{(i)}; \boldsymbol{\theta})\right)^2 \approx 0$ } Expected (true) loss: $\mathbb{E}_{\mathbf{x},y}\left[\left(y - g(\mathbf{x}; \boldsymbol{\theta})\right)^2\right] \gg 0$ } Over-fitting: when the training loss no longer bears any relation to the test (generalization) loss } The model fails to generalize to unseen examples.

  13. Polynomial regression } [Figure from Bishop: polynomial fits of degree $m = 0, 1, 3, 9$; panels show $y$ versus $x$]

  14. Polynomial regression: training and test error } $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - g(\mathbf{x}^{(i)}; \boldsymbol{\theta})\right)^2}$ } [Figure from Bishop: training and test RMSE as a function of the polynomial degree $m$]
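The divergence of training and test error can be reproduced in a few lines. This is a sketch on synthetic sine data in the spirit of Bishop's example (the sample sizes, noise level, and seed are assumptions): training RMSE keeps falling as $m$ grows, while test RMSE eventually explodes.

```python
import numpy as np

# Sketch: training vs. test RMSE as the polynomial degree grows.
rng = np.random.default_rng(3)
def make_data(n):
    x = rng.uniform(0, 1, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

def rmse(y, y_pred):
    return np.sqrt(np.mean((y - y_pred) ** 2))

x_tr, y_tr = make_data(10)    # small training set
x_te, y_te = make_data(100)   # larger held-out test set
for m in [0, 1, 3, 9]:
    X_tr = np.vander(x_tr, N=m + 1, increasing=True)
    X_te = np.vander(x_te, N=m + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    print(m, rmse(y_tr, X_tr @ w), rmse(y_te, X_te @ w))
```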

  15. Over-fitting causes } Model complexity } E.g., a model with a large number of parameters (degrees of freedom) } Small number of training examples } i.e., a data size that is small compared to the complexity of the model

  16. Model complexity } Example: } Polynomials with larger $m$ become increasingly tuned to the random noise on the target values. } [Figure from Bishop: fits with $m = 0, 1, 3, 9$]

  17. Number of training data & overfitting } The over-fitting problem becomes less severe as the size of the training data increases. } [Figure from Bishop: $m = 9$ polynomial fit with $n = 15$ versus $n = 100$ training points]

  18. How to evaluate the learner's performance? } Generalization error: the true (or expected) error that we would like to optimize } Two ways to assess the generalization error: } Practical: use a separate data set to test the model } Theoretical: law of large numbers } statistical bounds on the difference between training and expected errors

  19. Avoiding over-fitting } Determine a suitable value for model complexity (model selection) } Simple hold-out method } Cross-validation } Regularization (Occam's razor) } Explicit preference towards simple models } Penalize model complexity in the objective function } Bayesian approach

  20. Evaluation and model selection } Evaluation: } We need to measure how well the learned function predicts the target for unseen examples } Model selection: } Most of the time we need to select among a set of models } Example: polynomials with different degree $m$ } and thus we need to evaluate these models first

  21. Model Selection } The learning algorithm defines the data-driven search over the hypothesis space (i.e., the search for good parameters) } Hyperparameters are the tunable aspects of the model that the learning algorithm does not select This slide has been adopted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

  22. Model Selection } Model selection is the process by which we choose the "best" model from among a set of candidates } assumes access to a function capable of measuring the quality of a model } typically done "outside" the main training algorithm } Model selection / hyperparameter optimization is just another form of learning This slide has been adopted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

  23. Simple hold-out: model selection } Steps: } Divide the training data into a training set and a validation set } Use only the training set to train each candidate model } Evaluate each learned model on the validation set: $J_v(\mathbf{w}) = \frac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} \left(y^{(i)} - g(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$ } Choose the best model based on the validation-set error } Usually too wasteful of valuable training data: } Training data may be limited } On the other hand, a small validation set gives a relatively noisy estimate of performance.
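A sketch of hold-out model selection over the polynomial degree (the 80/20 split, the candidate degrees, and the data are assumptions for illustration, not part of the slides):

```python
import numpy as np

# Sketch: simple hold-out model selection over polynomial degree m.
rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=50)

perm = rng.permutation(len(x))
tr, va = perm[:40], perm[40:]                 # 80% train, 20% validation

best_m, best_err = None, np.inf
for m in range(10):
    X = np.vander(x, N=m + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)  # train on train set
    err = np.mean((y[va] - X[va] @ w) ** 2)            # validation loss J_v
    if err < best_err:
        best_m, best_err = m, err
print(best_m, best_err)
```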

  24. Simple hold-out: training, validation, and test sets } Simple hold-out chooses the model that minimizes the error on the validation set. } $J_v(\hat{\mathbf{w}})$ is likely to be an optimistic estimate of the generalization error } an extra parameter (e.g., the degree of the polynomial) has been fit to this set } Estimate the generalization error on the test set } the performance of the selected model is finally evaluated on the test set [Diagram: data split into Training | Validation | Test]

  25. Cross-Validation (CV): Evaluation } $k$-fold cross-validation steps: } Shuffle the dataset and randomly partition the training data into $k$ groups of approximately equal size } for $j = 1$ to $k$: } Choose the $j$-th group as the held-out validation group } Train the model on all but the $j$-th group of data } Evaluate the model on the held-out group } The performance scores of the model from the $k$ runs are averaged. } The average error rate can be considered an estimate of the true performance. [Diagram: the held-out fold rotates across the first run, second run, ..., $k$-th run]
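A self-contained sketch of these steps (the helper name `cv_mse`, the choice k = 5, and the synthetic data are assumptions): the held-out fold rotates, and the k per-fold errors are averaged.

```python
import numpy as np

# Sketch: k-fold cross-validation for one model (a degree-m polynomial).
rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=50)

def cv_mse(x, y, m, k=5):
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)             # k groups of ~equal size
    X = np.vander(x, N=m + 1, increasing=True)
    errs = []
    for j in range(k):
        va = folds[j]                          # j-th group held out
        tr = np.concatenate([folds[i] for i in range(k) if i != j])
        w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        errs.append(np.mean((y[va] - X[va] @ w) ** 2))
    return np.mean(errs)                       # average over the k runs

print(cv_mse(x, y, m=3))
```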

  26. Cross-Validation (CV): Model Selection } For each model, we first find the average error given by CV. } The model with the best average performance is selected.

  27. Cross-validation: polynomial regression example } 5-fold CV, 100 runs, averaged } [Figure: average CV error of the polynomial fits: $m = 3$: MSE = 1.45, $m = 1$: MSE = 0.30, $m = 5$: MSE = 45.44, $m = 7$: MSE = 31759]

  28. Leave-One-Out Cross Validation (LOOCV) } When data is particularly scarce, use cross-validation with $k = n$ } Leave-one-out treats each training sample in turn as a test example, with all other samples as the training set } Used for small datasets, when training data is valuable } LOOCV can be time-expensive, as $n$ training steps are required.
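A sketch of LOOCV, i.e., the k-fold procedure above with one held-out example per run (the deliberately small n = 15 and the data are assumptions):

```python
import numpy as np

# Sketch: leave-one-out CV; each sample is held out once (k = n).
rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=15)                       # deliberately small n
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=15)

m = 3
X = np.vander(x, N=m + 1, increasing=True)
errs = []
for i in range(len(x)):                              # n training steps
    tr = np.delete(np.arange(len(x)), i)             # all but sample i
    w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
    errs.append((y[i] - X[i] @ w) ** 2)              # error on the held-out sample
print(np.mean(errs))                                 # LOOCV estimate
```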

  29. Regularization } Adding a penalty term in the cost function to discourage the coefficients from reaching large values. } Ridge regression (weight decay): $J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(i)})\right)^2 + \lambda \mathbf{w}^T \mathbf{w}$ } Solution: $\hat{\mathbf{w}} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{I})^{-1} \boldsymbol{\Phi}^T \mathbf{y}$
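A sketch of the ridge solution in closed form (the degree m = 9, the value ln lambda = -18 echoing Bishop's example, and the data are assumptions):

```python
import numpy as np

# Sketch: ridge regression, w_hat = (Phi^T Phi + lambda I)^{-1} Phi^T y.
rng = np.random.default_rng(7)
x = rng.uniform(0, 1, size=10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

m, lam = 9, np.exp(-18)                       # ln(lambda) = -18, as in Bishop's plot
Phi = np.vander(x, N=m + 1, increasing=True)  # polynomial basis as Phi
I = np.eye(m + 1)
w_hat = np.linalg.solve(Phi.T @ Phi + lam * I, Phi.T @ y)
print(np.abs(w_hat).max())  # coefficients stay far smaller than with lam = 0
```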

  30. Polynomial order } Polynomials with larger $m$ become increasingly tuned to the random noise on the target values. } The magnitude of the coefficients typically gets larger as $m$ increases. [Bishop]

  31. Regularization parameter } [Table from Bishop: coefficients $\hat{w}_0, \dots, \hat{w}_9$ of the $m = 9$ polynomial fit for $\ln\lambda = -\infty$ and $\ln\lambda = -18$; the coefficient magnitudes shrink dramatically once regularization is applied]

  32. Regularization parameter } Generalization } $\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting [Bishop]

  33. Choosing the regularization parameter } Train a set of models with different values of $\lambda$: } Find $\hat{\mathbf{w}}$ for each model based on the training data } Find $J_v(\hat{\mathbf{w}})$ (or $J_{cv}(\hat{\mathbf{w}})$) for each model: $J_v(\mathbf{w}) = \frac{1}{n_v} \sum_{i \in \mathcal{V}} \left(y^{(i)} - g(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$ } Select the model with the best $J_v(\hat{\mathbf{w}})$ (or $J_{cv}(\hat{\mathbf{w}})$)
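A sketch of this selection loop using a simple hold-out validation set (the 70/30 split, the grid of candidate lambda values, and the data are assumptions; cross-validated $J_{cv}$ would slot in the same way):

```python
import numpy as np

# Sketch: choose lambda by validation error for a fixed m = 9 polynomial.
rng = np.random.default_rng(8)
x = rng.uniform(0, 1, size=40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=40)
Phi = np.vander(x, N=10, increasing=True)          # fixed m = 9 features

perm = rng.permutation(len(x))
tr, va = perm[:28], perm[28:]                      # 70% train, 30% validation

best_lam, best_err = None, np.inf
for lam in np.exp(np.arange(-30, 1, 3.0)):         # candidate lambda grid
    I = np.eye(Phi.shape[1])
    w = np.linalg.solve(Phi[tr].T @ Phi[tr] + lam * I, Phi[tr].T @ y[tr])
    err = np.mean((y[va] - Phi[va] @ w) ** 2)      # validation loss J_v
    if err < best_err:
        best_lam, best_err = lam, err
print(np.log(best_lam), best_err)
```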

  34. The approximation-generalization trade-off } Small true error shows good approximation of $f$ out of sample } More complex $\mathcal{H}$ $\Rightarrow$ better chance of approximating $f$ } Less complex $\mathcal{H}$ $\Rightarrow$ better chance of generalizing out of sample
