Machine Learning Basics Lecture 6: Overfitting Princeton University COS 495 Instructor: Yingyu Liang
Review: machine learning basics
Math formulation
• Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
• Find y = f(x) ∈ H that minimizes the empirical loss L̂(f) = (1/n) Σ_{i=1}^n l(f, x_i, y_i)
• s.t. the expected loss is small: L(f) = E_{(x,y)~D}[l(f, x, y)]
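As a concrete illustration, the empirical loss above can be minimized directly over a simple hypothesis class. A minimal sketch, with details assumed rather than taken from the slides: squared loss, hypotheses of the form f(x) = w·x, and synthetic data with true slope 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not from the slides): n pairs (x_i, y_i) drawn i.i.d.,
# with y = 2x + Gaussian noise; hypothesis class {f(x) = w*x}, squared loss.
n = 200
x = rng.uniform(-1, 1, size=n)
y = 2 * x + 0.1 * rng.normal(size=n)

def empirical_loss(w, x, y):
    """L_hat(f) = (1/n) * sum_i l(f, x_i, y_i), with squared loss and f(x) = w*x."""
    return np.mean((w * x - y) ** 2)

# Minimize the empirical loss over a grid of w; with enough data the
# minimizer lands near the true slope 2, so the expected loss is small too.
ws = np.linspace(0, 4, 401)
w_hat = ws[np.argmin([empirical_loss(w, x, y) for w in ws])]
```

The grid search stands in for the optimization step; any minimizer of the empirical loss over this class would do.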
Machine learning 1-2-3
• Collect data and extract features
• Build model: choose hypothesis class H and loss function l
• Optimization: minimize the empirical loss
Machine learning 1-2-3
• Collect data and extract features (feature mapping)
• Build model: choose hypothesis class H (Occam's razor) and loss function l (maximum likelihood)
• Optimization: minimize the empirical loss (gradient descent; convex optimization)
Overfitting
Linear vs nonlinear models
φ(x) = (x1^2, x2^2, √2·x1·x2, √(2c)·x1, √(2c)·x2, c)
y = sign(w^T φ(x) + b)
Polynomial kernel (degree 2)
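The feature map on this slide can be checked numerically: the inner product of two mapped points should equal the degree-2 polynomial kernel (x·z + c)^2 evaluated directly. A sketch assuming 2-dimensional inputs and c = 1:

```python
import numpy as np

def phi(x, c=1.0):
    """Explicit feature map for the degree-2 polynomial kernel (x.z + c)^2 in 2D."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

# Arbitrary test points (illustrative values only).
x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])
lhs = phi(x) @ phi(z)        # inner product in the explicit feature space
rhs = (x @ z + 1.0) ** 2     # polynomial kernel evaluated directly
```

The identity lhs = rhs is what lets kernel methods use this nonlinear feature space without ever materializing φ.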
Linear vs nonlinear models
• Linear model: f(x) = a0 + a1 x
• Nonlinear model: f(x) = a0 + a1 x + a2 x^2 + a3 x^3 + … + aM x^M
• Linear model ⊆ nonlinear model (since one can always set ai = 0 for i > 1)
• So it looks like the nonlinear model can always achieve the same or smaller error
• Why, then, use Occam's razor (choose a smaller hypothesis class)?
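The nesting claim can be verified numerically: a degree-M polynomial is a degree-(M+1) polynomial with the top coefficient set to 0, so the best achievable training error is non-increasing in M. A small sketch on synthetic data (data-generating details assumed, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=20)

def train_error(M):
    """Least-squares fit of a degree-M polynomial; returns mean squared training error."""
    coeffs = np.polyfit(x, y, M)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Because degree M is nested inside degree M+1, the minimum training
# error over each class can only go down as M grows.
errors = [train_error(M) for M in range(0, 10)]
```

This is exactly why training error alone cannot answer the Occam's razor question: the bigger class always looks at least as good on the training set.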
Example: regression using polynomial curve
t = sin(2πx) + ε
Figure from Pattern Recognition and Machine Learning, Bishop
Example: regression using polynomial curve
t = sin(2πx) + ε; regression using a polynomial of degree M
Figure from Pattern Recognition and Machine Learning, Bishop
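Bishop's experiment can be reproduced numerically. A hedged sketch with assumed details the slides do not specify: 10 training points, noise level 0.2, and a large held-out sample standing in for the test error. A moderate degree captures the sine; degree 9 with 10 points interpolates the noise.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    """Sample n points of t = sin(2*pi*x) + eps (noise level assumed)."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)

x_tr, t_tr = make_data(10)      # small training set, as in the Bishop figures
x_te, t_te = make_data(1000)    # large held-out set approximating the test error

def errors(M):
    """Training and held-out mean squared error of a degree-M least-squares fit."""
    w = np.polyfit(x_tr, t_tr, M)
    tr = np.mean((np.polyval(w, x_tr) - t_tr) ** 2)
    te = np.mean((np.polyval(w, x_te) - t_te) ** 2)
    return tr, te

tr3, te3 = errors(3)   # moderate degree: roughly follows the sine
tr9, te9 = errors(9)   # degree 9 through 10 points: near-zero training error
```

The pattern to expect is the one in the figures: training error keeps falling with M while the held-out error eventually blows up.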
Prevent overfitting
• Empirical loss and expected loss are different
• Also called training error and test/generalization error
• The larger the data set, the smaller the difference between the two
• The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two
• Such a hypothesis has small training error but large test error (overfitting)
• A larger data set helps!
• Throwing away useless hypotheses also helps!
Prevent overfitting
• Empirical loss and expected loss are different
• Also called training error and test error
• The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two, and thus has small training error but large test error (overfitting)
• Throwing away useless hypotheses helps: use prior knowledge/model to prune hypotheses
• The larger the data set, the smaller the difference between the two
• A larger data set helps: use experience/data to prune hypotheses
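The claim that more data shrinks the gap between training and test error can be simulated. A minimal sketch under assumptions not in the slides: target sin(2πx) with Gaussian noise, degree-9 polynomial fits, squared loss, and the gap averaged over random draws.

```python
import numpy as np

rng = np.random.default_rng(3)

def avg_gap(n, M=9, trials=30):
    """Average |training error - held-out error| for degree-M fits on n samples."""
    gaps = []
    for _ in range(trials):
        x = rng.uniform(0, 1, n)
        t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)
        xt = rng.uniform(0, 1, 500)
        tt = np.sin(2 * np.pi * xt) + 0.2 * rng.normal(size=500)
        w = np.polyfit(x, t, M)
        tr = np.mean((np.polyval(w, x) - t) ** 2)
        te = np.mean((np.polyval(w, xt) - tt) ** 2)
        gaps.append(abs(te - tr))
    return float(np.mean(gaps))

gap_small = avg_gap(15)    # few samples: large train/test gap
gap_large = avg_gap(200)   # many samples: the two errors agree closely
```

The same hypothesis class overfits badly at n = 15 and barely at n = 200, which is the "larger data set helps" bullet in numbers.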
Prior vs. data
Prior vs experience
• Super strong prior knowledge: H = {f*}
• No data is needed!
(f*: the best function)
Prior vs experience
• Super strong prior knowledge: H = {f*, g1}
• A few data points suffice to detect f*
(f*: the best function)
Prior vs experience
• Super large data set: infinite data
• The hypothesis class H can be all functions!
(f*: the best function)
Prior vs experience
• Practical scenarios: finite data, H of medium capacity, f* may or may not be in H
(g1, g2: other hypotheses; f*: the best function)
Prior vs experience
• Practical scenarios lie between the two extreme cases: H = {f*} ← practice → infinite data
General Phenomenon Figure from Deep Learning , Goodfellow, Bengio and Courville
Cross validation
Model selection
• How to choose the optimal capacity?
• E.g., choose the best degree M for polynomial curve fitting
• Cannot be done with the training data alone
• Create held-out data to approximate the test error
• Called the validation data set
Model selection: cross validation
• Partition the training data into several groups
• Each time, use one group as the validation set
Figure from Pattern Recognition and Machine Learning, Bishop
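The partition-and-rotate procedure can be sketched as k-fold cross validation for choosing the polynomial degree M. Assumed setup, illustrative rather than from the slides: 30 noisy samples of sin(2πx), 5 folds, squared loss.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=30)

def cv_error(M, k=5):
    """k-fold cross-validation error of a degree-M polynomial fit."""
    folds = np.array_split(np.arange(len(x)), k)
    errs = []
    for i in range(k):
        val = folds[i]                                           # held-out group
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.polyfit(x[tr], t[tr], M)
        errs.append(np.mean((np.polyval(w, x[val]) - t[val]) ** 2))
    return float(np.mean(errs))

# Select the degree whose validation error, averaged over the folds, is smallest.
best_M = min(range(1, 10), key=cv_error)
```

Each data point is used for validation exactly once, so no separate held-out set is sacrificed, at the cost of fitting the model k times per candidate M.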
Model selection: cross validation
• Also used to select other hyper-parameters of the model/algorithm
• E.g., learning rate, stopping criterion of SGD, etc.
• Pros: general and simple
• Cons: computationally expensive; even worse when there are more hyper-parameters