Regression and Generalization
CE-717: Machine Learning
Sharif University of Technology
M. Soleymani, Fall 2018
Topics
} Beyond linear regression models
} Evaluation & model selection
} Regularization
} Bias-Variance
Recall: Linear regression (squared loss)
} Linear regression functions:
  $f: \mathbb{R} \to \mathbb{R}$, $f(x; \boldsymbol{w}) = w_0 + w_1 x$
  $f: \mathbb{R}^d \to \mathbb{R}$, $f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 x_1 + \dots + w_d x_d$
  $\boldsymbol{w} = [w_0, w_1, \dots, w_d]$ are the parameters we need to set.
} Minimizing the squared loss for linear regression:
  $J(\boldsymbol{w}) = \|\boldsymbol{y} - X\boldsymbol{w}\|_2^2$
} We obtain $\hat{\boldsymbol{w}} = (X^T X)^{-1} X^T \boldsymbol{y}$
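A minimal NumPy sketch of this closed-form fit (the toy data, seed, and function names are illustrative, not from the slides):

```python
import numpy as np

def fit_linear(X, y):
    # Normal equations: w = (X^T X)^{-1} X^T y.
    # A linear solve is preferred over forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: y = 1 + 2x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)

X = np.column_stack([np.ones_like(x), x])  # prepend a column of ones for w_0
w_hat = fit_linear(X, y)
print(w_hat)  # close to [1, 2]
```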
Beyond linear regression
} How can we extend linear regression to non-linear functions?
} Transform the data using basis functions
} Learn a linear regression on the new feature vectors (obtained from the basis functions)
Beyond linear regression
} $m$-th order polynomial regression (univariate, $f: \mathbb{R} \to \mathbb{R}$):
  $f(x; \boldsymbol{w}) = w_0 + w_1 x + \dots + w_{m-1} x^{m-1} + w_m x^m$
} Solution: $\hat{\boldsymbol{w}}' = (X'^T X')^{-1} X'^T \boldsymbol{y}$
  $X' = \begin{bmatrix} 1 & x^{(1)} & \cdots & (x^{(1)})^m \\ 1 & x^{(2)} & \cdots & (x^{(2)})^m \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x^{(n)} & \cdots & (x^{(n)})^m \end{bmatrix}$, $\boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}$, $\hat{\boldsymbol{w}} = \begin{bmatrix} \hat{w}_0 \\ \hat{w}_1 \\ \vdots \\ \hat{w}_m \end{bmatrix}$
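A sketch of this fit via the design matrix $X'$, again in NumPy; `np.vander` builds exactly the $[1, x, \dots, x^m]$ rows (the function names are our own):

```python
import numpy as np

def poly_features(x, m):
    # Rows [1, x, x^2, ..., x^m] -- the matrix X' above.
    return np.vander(x, m + 1, increasing=True)

def fit_poly(x, y, m):
    # Least-squares solve of X' w' = y; lstsq is numerically safer
    # than inverting X'^T X' for higher degrees.
    w, *_ = np.linalg.lstsq(poly_features(x, m), y, rcond=None)
    return w
```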
Polynomial regression: example
[Figure: fitted polynomials of degree $m = 1, 3, 5, 7$]
Generalized linear
} Linear combination of fixed non-linear functions of the input vector:
  $f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 \phi_1(\boldsymbol{x}) + \dots + w_m \phi_m(\boldsymbol{x})$
  $\{\phi_1(\boldsymbol{x}), \dots, \phi_m(\boldsymbol{x})\}$: a set of basis functions (or features), $\phi_i: \mathbb{R}^d \to \mathbb{R}$
Basis functions: examples
} Linear: $\phi_i(\boldsymbol{x}) = x_i$
} Polynomial (univariate): $\phi_i(x) = x^i$
Basis functions: examples J ๐;๐ Y } Gaussian: ๐ U ๐ = ๐๐ฆ๐ โ J 8Z Y ๐;๐ Y / } Sigmoid: ๐ U ๐ = ๐ ๐ ๐ = /]^_` (;a) Z Y 9
Radial Basis Functions: prototypes
} Predictions based on similarity to "prototypes":
  $\phi_j(\boldsymbol{x}) = \exp\left(-\frac{1}{2\sigma_j^2}\left\|\boldsymbol{x} - \boldsymbol{\mu}_j\right\|^2\right)$
} Measures the similarity to the prototypes $\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_m$
} $\sigma_j^2$ controls how quickly the feature vanishes as a function of the distance to the prototype.
} Training examples themselves could serve as prototypes.
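A possible NumPy implementation of this Gaussian RBF feature map (the names and shapes are assumptions for illustration):

```python
import numpy as np

def rbf_features(X, prototypes, sigma):
    # phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma^2))
    # X: (n, d) inputs, prototypes: (m, d) -> returns an (n, m) feature matrix.
    sq_dists = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Using the training examples themselves as prototypes:
# Phi = rbf_features(X_train, X_train, sigma=0.5)
```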
Generalized linear: optimization
$J(\boldsymbol{w}) = \sum_{i=1}^{n}\left(y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w})\right)^2 = \sum_{i=1}^{n}\left(y^{(i)} - \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}^{(i)})\right)^2$
$\Phi = \begin{bmatrix} 1 & \phi_1(\boldsymbol{x}^{(1)}) & \cdots & \phi_m(\boldsymbol{x}^{(1)}) \\ 1 & \phi_1(\boldsymbol{x}^{(2)}) & \cdots & \phi_m(\boldsymbol{x}^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(\boldsymbol{x}^{(n)}) & \cdots & \phi_m(\boldsymbol{x}^{(n)}) \end{bmatrix}$, $\boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}$, $\boldsymbol{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}$
$\hat{\boldsymbol{w}} = (\Phi^T \Phi)^{-1} \Phi^T \boldsymbol{y}$
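Assuming a feature matrix $\Phi$ has already been built (e.g., with the RBF sketch above), the normal-equation solve is one line; `fit_generalized_linear` is an illustrative name:

```python
import numpy as np

def fit_generalized_linear(Phi, y):
    # w = (Phi^T Phi)^{-1} Phi^T y, computed via a linear solve.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# e.g. RBF features from the previous sketch plus a bias column:
# Phi = np.column_stack([np.ones(len(X)), rbf_features(X, prototypes, 0.5)])
# w_hat = fit_generalized_linear(Phi, y)
```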
Model complexity and overfitting
} With limited training data, models may achieve zero training error but a large test error.
  Training (empirical) loss: $\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w})\right)^2 \to 0$
  Expected (true) loss: $E_{\boldsymbol{x},y}\left[\left(y - f(\boldsymbol{x}; \boldsymbol{w})\right)^2\right] \gg 0$
} Over-fitting: the training loss no longer bears any relation to the test (generalization) loss.
} The model fails to generalize to unseen examples.
Polynomial regression
[Figure: fits of degree $m = 0, 1, 3, 9$ to the same data] [Bishop]
Polynomial regression: training and test error
} Root-mean-square error: $E_{RMS} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - f(x^{(i)}; \boldsymbol{w})\right)^2}$
[Figure: training and test $E_{RMS}$ versus polynomial degree $m$] [Bishop]
Over-fitting causes
} Model complexity
} e.g., a model with a large number of parameters (degrees of freedom)
} Too few training examples
} the data size is small compared to the complexity of the model
Model complexity
} Example: polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
[Figure: fits of degree $m = 0, 1, 3, 9$] [Bishop]
Number of training examples & overfitting
} The over-fitting problem becomes less severe as the size of the training set increases.
[Figure: degree $m = 9$ fits with $n = 15$ and $n = 100$ training points] [Bishop]
How to evaluate the learner's performance?
} Generalization error: the true (or expected) error that we would like to optimize
} Two ways to assess the generalization error are:
} Practical: use a separate data set to test the model
} Theoretical: law of large numbers
} statistical bounds on the difference between training and expected errors
Avoiding over-fitting
} Determine a suitable value for model complexity (model selection)
} Simple hold-out method
} Cross-validation
} Regularization (Occam's Razor)
} Explicit preference towards simple models
} Penalize model complexity in the objective function
} Bayesian approach
Evaluation and model selection
} Evaluation:
} We need to measure how well the learned function predicts the target for unseen examples
} Model selection:
} Most of the time we need to select among a set of models
} Example: polynomials with different degrees $m$
} and thus we need to evaluate these models first
Model Selection
} The learning algorithm defines the data-driven search over the hypothesis space (i.e., the search for good parameters)
} Hyperparameters are the tunable aspects of the model that the learning algorithm does not select
This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/
Model Selection
} Model selection is the process by which we choose the "best" model from among a set of candidates
} We assume access to a function capable of measuring the quality of a model
} It is typically done "outside" the main training algorithm
} Model selection / hyperparameter optimization is just another form of learning
This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/
๏ฟฝ Simple hold-out: model selection } Steps: } Divide training data into training and validation set ๐ค_๐ก๐๐ข } Use only the training set to train a set of models } Evaluate each learned model on the validation set 8 ๐ง (S) โ ๐ ๐ (S) ; ๐ / ~_โขโฌ? โ } ๐พ ~ ๐ = Sโ~_โขโฌ? } Choose the best model based on the validation set error } Usually, too wasteful of valuable training data } Training data may be limited. } On the other hand, small validation set give a relatively noisy estimate of performance. 23
Simple hold-out: training, validation, and test sets
} Simple hold-out chooses the model that minimizes the error on the validation set.
} $J_v(\hat{\boldsymbol{w}})$ is likely to be an optimistic estimate of the generalization error:
} an extra parameter (e.g., the degree of the polynomial) is fit to this set.
} Estimate the generalization error using the test set:
} the performance of the selected model is finally evaluated on the test set
[Figure: data split into Training | Validation | Test]
Cross-Validation (CV): Evaluation
} $k$-fold cross-validation steps:
} Shuffle the dataset and randomly partition the training data into $k$ groups of approximately equal size
} For $i = 1$ to $k$:
} Choose the $i$-th group as the held-out validation group
} Train the model on all but the $i$-th group of the data
} Evaluate the model on the held-out group
} The performance scores of the model from the $k$ runs are averaged.
} The average error rate can be considered an estimate of the true performance.
[Figure: which fold is held out in each of the $k$ runs]
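A compact sketch of the $k$-fold procedure for a degree-$m$ polynomial, again reusing the earlier `fit_poly` / `poly_features` helpers:

```python
import numpy as np

def kfold_mse(x, y, m, k=5, seed=0):
    # Average validation MSE of a degree-m polynomial over k folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit_poly(x[tr], y[tr], m)
        pred = poly_features(x[val], m) @ w
        errs.append(np.mean((y[val] - pred) ** 2))
    return np.mean(errs)
```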
Cross-Validation (CV): Model Selection
} For each model, we first find the average error found by CV.
} The model with the best average performance is selected.
Cross-validation: polynomial regression example
} 5-fold CV
} 100 runs, averaged
[Figure: fitted polynomials with CV error per degree — $m = 1$: $MSE = 0.30$; $m = 3$: $MSE = 1.45$; $m = 5$: $MSE = 45.44$; $m = 7$: $MSE = 31759$]
Leave-One-Out Cross Validation (LOOCV)
} When data is particularly scarce: cross-validation with $k = n$
} Leave-one-out treats each training sample in turn as a test example, with all other samples as the training set.
} Used for small datasets, when training data is valuable
} LOOCV can be computationally expensive, as $n$ training runs are required.
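Given the `kfold_mse` sketch above, LOOCV needs no new code, only $k = n$ (assuming `x`, `y`, and a degree `m` are already defined):

```python
# LOOCV is k-fold CV with one sample per fold, i.e. k = n:
loocv_err = kfold_mse(x, y, m, k=len(x))
```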
Regularization
} Adding a penalty term to the cost function discourages the coefficients from reaching large values.
} Ridge regression (weight decay):
  $J(\boldsymbol{w}) = \sum_{i=1}^{n}\left(y^{(i)} - \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}^{(i)})\right)^2 + \lambda \boldsymbol{w}^T \boldsymbol{w}$
  $\hat{\boldsymbol{w}} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T \boldsymbol{y}$
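A sketch of this closed-form ridge solution; note the simplification flagged in the comment:

```python
import numpy as np

def fit_ridge(Phi, y, lam):
    # w = (Phi^T Phi + lambda * I)^{-1} Phi^T y.
    # Note: this penalizes the bias weight too; in practice w_0 is
    # often left out of the penalty term.
    m = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)
```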
Polynomial order
} Polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
} The magnitude of the coefficients typically grows as $m$ increases. [Bishop]
Regularization parameter ($m = 9$)
[Table: values of the coefficients $\hat{w}_0, \hat{w}_1, \dots, \hat{w}_9$ for $\ln\lambda = -\infty$ and $\ln\lambda = -18$] [Bishop]
Regularization parameter
} Generalization
} $\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting. [Bishop]
Choosing the regularization parameter
} Consider a set of models with different values of $\lambda$.
} Find $\hat{\boldsymbol{w}}$ for each model based on the training data
} Find $J_v(\hat{\boldsymbol{w}})$ (or $J_{cv}(\hat{\boldsymbol{w}})$) for each model:
  $J_v(\boldsymbol{w}) = \frac{1}{n_v} \sum_{i \in v\_set} \left(y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w})\right)^2$
} Select the model with the best $J_v(\hat{\boldsymbol{w}})$ (or $J_{cv}(\hat{\boldsymbol{w}})$)
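A sketch of this selection loop over candidate $\lambda$ values, reusing `fit_ridge` from the ridge sketch; the grid and names are illustrative:

```python
import numpy as np

def select_lambda(Phi_tr, y_tr, Phi_val, y_val, lambdas):
    # Fit one ridge model per candidate lambda on the training split,
    # score each on the validation split, keep the best.
    best = (None, np.inf, None)
    for lam in lambdas:
        w = fit_ridge(Phi_tr, y_tr, lam)           # ridge sketch above
        err = np.mean((y_val - Phi_val @ w) ** 2)  # J_v(w_hat)
        if err < best[1]:
            best = (lam, err, w)
    return best  # (best lambda, its validation MSE, its weights)

# A typical grid sweeps ln(lambda), e.g.:
# lambdas = np.exp(np.linspace(-20, 2, 12))
```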
The approximation-generalization trade-off
} A small true error shows good approximation of $f$ out of sample
} More complex $\mathcal{H}$ $\Rightarrow$ better chance of approximating $f$
} Less complex $\mathcal{H}$ $\Rightarrow$ better chance of generalizing out of sample