

  1. Machine Learning Basics. Lecture 6: Overfitting. Princeton University COS 495. Instructor: Yingyu Liang

  2. Review: machine learning basics

  3. Math formulation
     • Given training data $(x_i, y_i)$, $1 \le i \le n$, i.i.d. from distribution $D$
     • Find $y = f(x) \in \mathcal{H}$ that minimizes $\hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f, x_i, y_i)$
     • s.t. the expected loss is small: $L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$
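
The definitions above are easy to make concrete in code. Below is a minimal sketch, assuming a squared loss $l(f, x, y) = (f(x) - y)^2$, a linear hypothesis $f$, and a synthetic distribution $D$ (all illustrative choices, not from the lecture); the unknowable expected loss is approximated by a large fresh sample.

```python
# A minimal sketch of the formulation above. Assumed: squared loss,
# a linear hypothesis f, and a synthetic distribution D.
import numpy as np

rng = np.random.default_rng(0)

def f(x, w0=0.0, w1=2.0):
    # One candidate hypothesis f in H (a line; parameters are assumed).
    return w0 + w1 * x

# Training data (x_i, y_i), 1 <= i <= n, drawn i.i.d. from D.
n = 50
x = rng.uniform(0, 1, n)
y = 2.0 * x + rng.normal(0, 0.1, n)   # assumed ground-truth distribution D

# Empirical loss: (1/n) * sum_i l(f, x_i, y_i).
empirical_loss = np.mean((f(x) - y) ** 2)

# The expected loss E_{(x,y)~D}[l(f, x, y)] is unknown in practice;
# here we approximate it with a large fresh sample from D.
x_big = rng.uniform(0, 1, 100_000)
y_big = 2.0 * x_big + rng.normal(0, 0.1, 100_000)
expected_loss = np.mean((f(x_big) - y_big) ** 2)

print(empirical_loss, expected_loss)
```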

  4. Machine learning 1-2-3
     • Collect data and extract features
     • Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$
     • Optimization: minimize the empirical loss

  5. Machine learning 1-2-3 (annotated)
     • Collect data and extract features (feature mapping)
     • Build model: choose hypothesis class $\mathcal{H}$ (Occam's razor) and loss function $l$ (maximum likelihood)
     • Optimization: minimize the empirical loss (gradient descent; convex optimization)

  6. Overfitting

  7. Linear vs nonlinear models
     Feature mapping: $\varphi(x) = (x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c)$
     Classifier: $y = \mathrm{sign}(w^T \varphi(x) + b)$
     Polynomial kernel
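
A small sketch of why this feature map corresponds to the polynomial kernel: for 2-dimensional inputs, the inner product $\varphi(x)^T \varphi(z)$ equals $(x^T z + c)^2$, so a classifier that only needs inner products never has to form $\varphi$ explicitly. The test vectors below are illustrative values, not from the slides.

```python
# Check that phi(x)^T phi(z) equals the degree-2 polynomial kernel
# (x^T z + c)^2 for 2-dimensional inputs.
import numpy as np

def phi(x, c=1.0):
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x = np.array([0.5, -1.0])   # arbitrary test vectors (assumed)
z = np.array([2.0, 0.3])
c = 1.0

lhs = phi(x, c) @ phi(z, c)   # inner product in feature space
rhs = (x @ z + c) ** 2        # kernel evaluated directly
print(np.isclose(lhs, rhs))   # True
```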

  8. Linear vs nonlinear models
     • Linear model: $f(x) = w_0 + w_1 x$
     • Nonlinear model: $f(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \dots + w_M x^M$
     • Linear model $\subseteq$ nonlinear model (we can always set $w_i = 0$ for $i > 1$)
     • So it looks like the nonlinear model can always achieve the same or smaller error (the sketch below confirms this on training data)
     • Why, then, use Occam's razor (choose a smaller hypothesis class)?
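
A minimal sketch of the containment argument, with assumed data from a noisy line: least-squares training error can never increase with the degree $M$, because the lower-degree model is the special case $w_i = 0$ for $i > 1$.

```python
# Fit polynomials of increasing degree M by least squares; the
# training error is non-increasing in M because the models are nested.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 20)
y = 0.5 + 1.5 * x + rng.normal(0, 0.2, 20)   # assumed noisy linear data

for M in [1, 3, 9]:
    w = np.polyfit(x, y, deg=M)                       # least-squares fit
    train_err = np.mean((np.polyval(w, x) - y) ** 2)  # empirical loss
    print(f"degree {M}: training error {train_err:.4f}")
```

The point of the slide is that this comparison is misleading: smaller training error at higher $M$ says nothing about the expected (test) error, which is where Occam's razor earns its keep.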

  9. Example: regression using polynomial curve
     $t = \sin(2\pi x) + \epsilon$
     Figure from Pattern Recognition and Machine Learning, Bishop

  10. Example: regression using polynomial curve
      $t = \sin(2\pi x) + \epsilon$; regression using a polynomial of degree $M$
      Figure from Pattern Recognition and Machine Learning, Bishop

  11. Example: regression using polynomial curve
      $t = \sin(2\pi x) + \epsilon$
      Figure from Pattern Recognition and Machine Learning, Bishop

  12. Example: regression using polynomial curve
      $t = \sin(2\pi x) + \epsilon$
      Figure from Pattern Recognition and Machine Learning, Bishop

  13. Example: regression using polynomial curve
      $t = \sin(2\pi x) + \epsilon$
      Figure from Pattern Recognition and Machine Learning, Bishop

  14. Example: regression using polynomial curve
      Figure from Pattern Recognition and Machine Learning, Bishop
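
The experiment in these figures is easy to reproduce. The sketch below follows the slides' setup $t = \sin(2\pi x) + \epsilon$; the sample size, noise level, and candidate degrees are assumptions chosen to mirror Bishop's figures, not values given in the deck.

```python
# Fit t = sin(2*pi*x) + eps with polynomials of varying degree M and
# compare training RMSE against RMSE on the noiseless target curve.
import numpy as np

rng = np.random.default_rng(2)
n = 10
x = np.linspace(0, 1, n)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)  # noise level assumed

x_test = np.linspace(0, 1, 200)
t_test = np.sin(2 * np.pi * x_test)                # noiseless target

for M in [0, 1, 3, 9]:
    w = np.polyfit(x, t, deg=M)
    train_rmse = np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))
    test_rmse = np.sqrt(np.mean((np.polyval(w, x_test) - t_test) ** 2))
    print(f"M={M}: train RMSE {train_rmse:.3f}, test RMSE {test_rmse:.3f}")
# M=9 drives the training error to ~0 while the test error blows up.
```

Driving the training RMSE to zero at $M = 9$ while the test RMSE grows is exactly the overfitting the next slides discuss.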

  15. Prevent overfitting
      • Empirical loss and expected loss are different
      • Also called training error and test/generalization error
      • The larger the data set, the smaller the difference between the two
      • The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two
      • Thus small training error but large test error (overfitting)
      • A larger data set helps!
      • Throwing away useless hypotheses also helps!

  16. Prevent overfitting
      • Empirical loss and expected loss are different
      • Also called training error and test error
      • The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two
      • Thus small training error but large test error (overfitting)
      • Throwing away useless hypotheses also helps! (use prior knowledge/models to prune hypotheses)
      • The larger the data set, the smaller the difference between the two (see the sketch below)
      • A larger data set helps! (use experience/data to prune hypotheses)
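
A minimal sketch of the data-size effect, reusing the $t = \sin(2\pi x) + \epsilon$ example with the capacity held fixed (the sample sizes, noise level, and degree-9 choice are assumptions): the gap between training and test loss shrinks as $n$ grows.

```python
# With fixed model capacity, the gap between empirical (training) and
# expected (test) loss shrinks as the data set grows.
import numpy as np

rng = np.random.default_rng(3)

def sample(n):
    # Helper (ours, not the lecture's): draw n points from D.
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_test, t_test = sample(10_000)          # large sample approximates D

for n in [10, 100, 1000]:
    x, t = sample(n)
    w = np.polyfit(x, t, deg=9)          # fixed, fairly large capacity
    gap = abs(np.mean((np.polyval(w, x) - t) ** 2)
              - np.mean((np.polyval(w, x_test) - t_test) ** 2))
    print(f"n={n}: |train - test| gap {gap:.3f}")
```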

  17. Prior vs. data

  18. Prior vs experience
      • Super strong prior knowledge: $\mathcal{H} = \{f^*\}$
      • No data is needed! ($f^*$: the best function)

  19. Prior vs experience
      • Super strong prior knowledge: $\mathcal{H} = \{f^*, f_1\}$
      • A few data points suffice to detect $f^*$ ($f^*$: the best function; see the sketch below)
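
A sketch of this two-function scenario, assuming for illustration $f^* = \sin$ and $f_1 = \cos$ (functions of our choosing, not named in the slides): comparing empirical losses on a handful of points already identifies $f^*$.

```python
# With H = {f*, f1}, a few samples from D are enough to tell which
# hypothesis has the lower loss.
import numpy as np

rng = np.random.default_rng(4)

f_star = np.sin                          # assumed best function
f_1 = np.cos                             # assumed competing hypothesis

x = rng.uniform(0, 2 * np.pi, 5)         # just 5 data points
y = f_star(x) + rng.normal(0, 0.1, 5)

losses = {f.__name__: np.mean((f(x) - y) ** 2) for f in (f_star, f_1)}
print(min(losses, key=losses.get))       # 'sin' almost surely
```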

  20. Prior vs experience
      • Super large data set: infinite data
      • The hypothesis class $\mathcal{H}$ can be all functions! ($f^*$: the best function)

  21. Prior vs experience
      • Practical scenarios: finite data, $\mathcal{H}$ of medium capacity, $f^*$ in or not in $\mathcal{H}$ (e.g., classes $\mathcal{H}_1$, $\mathcal{H}_2$; $f^*$: the best function)

  22. Prior vs experience
      • Practical scenarios lie between the two extreme cases: $\mathcal{H} = \{f^*\}$ ← practice → infinite data

  23. General phenomenon
      Figure from Deep Learning, Goodfellow, Bengio and Courville

  24. Cross validation

  25. Model selection
      • How to choose the optimal capacity?
      • E.g., choose the best degree for polynomial curve fitting
      • This cannot be done with the training data alone
      • Create held-out data to approximate the test error
      • Called the validation data set (see the sketch below)
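
A minimal sketch of held-out validation for the polynomial example (the split sizes and candidate degrees are illustrative assumptions): train every candidate degree on the training split, then keep the degree with the lowest validation error.

```python
# Model selection with a held-out validation set: pick the polynomial
# degree whose validation error is smallest.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 40)

x_tr, t_tr = x[:30], t[:30]              # training split (assumed 30/10)
x_val, t_val = x[30:], t[30:]            # held-out validation split

val_err = {}
for M in range(10):
    w = np.polyfit(x_tr, t_tr, deg=M)
    val_err[M] = np.mean((np.polyval(w, x_val) - t_val) ** 2)

best_M = min(val_err, key=val_err.get)
print("selected degree:", best_M)
```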

  26. Model selection: cross validation
      • Partition the training data into several groups
      • Each time, use one group as the validation set (see the sketch below)
      Figure from Pattern Recognition and Machine Learning, Bishop
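
A sketch of S-fold cross validation on the same polynomial example, with an assumed $S = 4$ (the helper cv_error is ours): each group serves as the validation set exactly once, and the fold-averaged error scores each candidate degree.

```python
# S-fold cross validation: partition the data into S groups, validate
# on each group once, and average the validation errors.
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 40)

def cv_error(M, S=4):
    folds = np.array_split(rng.permutation(len(x)), S)
    errs = []
    for i in range(S):
        val = folds[i]                               # i-th group validates
        tr = np.concatenate([folds[j] for j in range(S) if j != i])
        w = np.polyfit(x[tr], t[tr], deg=M)
        errs.append(np.mean((np.polyval(w, x[val]) - t[val]) ** 2))
    return np.mean(errs)                             # average over folds

scores = {M: cv_error(M) for M in range(8)}
print("selected degree:", min(scores, key=scores.get))
```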

  27. Model selection: cross validation
      • Also used to select other hyper-parameters of the model/algorithm
      • E.g., learning rate, stopping criterion of SGD, etc.
      • Pros: general, simple
      • Cons: computationally expensive; even worse with more hyper-parameters
