Regression and Generalization
CE-717: Machine Learning
Sharif University of Technology
M. Soleymani, Fall 2018
Topics
} Beyond linear regression models
} Evaluation & model selection
} Regularization
} Bias-Variance
Recall: Linear regression (squared loss)
} Linear regression functions:
  $f: \mathbb{R} \to \mathbb{R}$, $f(x; \boldsymbol{w}) = w_0 + w_1 x$
  $f: \mathbb{R}^d \to \mathbb{R}$, $f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 x_1 + \dots + w_d x_d$
  $\boldsymbol{w} = [w_0, w_1, \dots, w_d]$ are the parameters we need to set.
} Minimizing the squared loss for linear regression:
  $J(\boldsymbol{w}) = \|\boldsymbol{y} - X\boldsymbol{w}\|_2^2$
} We obtain $\hat{\boldsymbol{w}} = (X^T X)^{-1} X^T \boldsymbol{y}$
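A minimal NumPy sketch of this closed-form fit (the toy data, seed, and function names are illustrative, not from the slides):

```python
import numpy as np

def fit_linear(X, y):
    # Normal equations: w = (X^T X)^{-1} X^T y.
    # A linear solve is preferred over forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: y = 1 + 2x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)

X = np.column_stack([np.ones_like(x), x])  # prepend a column of ones for w_0
w_hat = fit_linear(X, y)
print(w_hat)  # close to [1, 2]
```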
Beyond linear regression
} How can we extend linear regression to non-linear functions?
} Transform the data using basis functions
} Learn a linear regression on the new feature vectors (obtained from the basis functions)
Beyond linear regression
} $m$-th order polynomial regression (univariate, $f: \mathbb{R} \to \mathbb{R}$):
  $f(x; \boldsymbol{w}) = w_0 + w_1 x + \dots + w_{m-1} x^{m-1} + w_m x^m$
} Solution: $\hat{\boldsymbol{w}}' = (X'^T X')^{-1} X'^T \boldsymbol{y}$
  $X' = \begin{bmatrix} 1 & x^{(1)} & \cdots & (x^{(1)})^m \\ 1 & x^{(2)} & \cdots & (x^{(2)})^m \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x^{(n)} & \cdots & (x^{(n)})^m \end{bmatrix}$, $\boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}$, $\hat{\boldsymbol{w}} = \begin{bmatrix} \hat{w}_0 \\ \hat{w}_1 \\ \vdots \\ \hat{w}_m \end{bmatrix}$
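A sketch of this fit via the design matrix $X'$, again in NumPy; `np.vander` builds exactly the $[1, x, \dots, x^m]$ rows (the function names are our own):

```python
import numpy as np

def poly_features(x, m):
    # Rows [1, x, x^2, ..., x^m] -- the matrix X' above.
    return np.vander(x, m + 1, increasing=True)

def fit_poly(x, y, m):
    # Least-squares solve of X' w' = y; lstsq is numerically safer
    # than inverting X'^T X' for higher degrees.
    w, *_ = np.linalg.lstsq(poly_features(x, m), y, rcond=None)
    return w
```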
Polynomial regression: example
[Figure: fitted polynomials of degree $m = 1, 3, 5, 7$]
Generalized linear
} Linear combination of fixed non-linear functions of the input vector:
  $f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 \phi_1(\boldsymbol{x}) + \dots + w_m \phi_m(\boldsymbol{x})$
  $\{\phi_1(\boldsymbol{x}), \dots, \phi_m(\boldsymbol{x})\}$: a set of basis functions (or features), $\phi_i: \mathbb{R}^d \to \mathbb{R}$
Basis functions: examples
} Linear: $\phi_i(\boldsymbol{x}) = x_i$
} Polynomial (univariate): $\phi_i(x) = x^i$
Basis functions: examples J ๐;๐ Y } Gaussian: ๐ U ๐ = ๐๐ฆ๐ โ J 8Z Y ๐;๐ Y / } Sigmoid: ๐ U ๐ = ๐ ๐ ๐ = /]^_` (;a) Z Y 9
Radial Basis Functions: prototypes
} Predictions based on similarity to "prototypes":
  $\phi_j(\boldsymbol{x}) = \exp\left(-\frac{1}{2\sigma_j^2}\left\|\boldsymbol{x} - \boldsymbol{\mu}_j\right\|^2\right)$
} Measures the similarity to the prototypes $\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_m$
} $\sigma_j^2$ controls how quickly the feature vanishes as a function of the distance to the prototype.
} Training examples themselves could serve as prototypes.
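A possible NumPy implementation of this Gaussian RBF feature map (the names and shapes are assumptions for illustration):

```python
import numpy as np

def rbf_features(X, prototypes, sigma):
    # phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma^2))
    # X: (n, d) inputs, prototypes: (m, d) -> returns an (n, m) feature matrix.
    sq_dists = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Using the training examples themselves as prototypes:
# Phi = rbf_features(X_train, X_train, sigma=0.5)
```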
Generalized linear: optimization
$J(\boldsymbol{w}) = \sum_{i=1}^{n}\left(y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w})\right)^2 = \sum_{i=1}^{n}\left(y^{(i)} - \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}^{(i)})\right)^2$
$\Phi = \begin{bmatrix} 1 & \phi_1(\boldsymbol{x}^{(1)}) & \cdots & \phi_m(\boldsymbol{x}^{(1)}) \\ 1 & \phi_1(\boldsymbol{x}^{(2)}) & \cdots & \phi_m(\boldsymbol{x}^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(\boldsymbol{x}^{(n)}) & \cdots & \phi_m(\boldsymbol{x}^{(n)}) \end{bmatrix}$, $\boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}$, $\boldsymbol{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}$
$\hat{\boldsymbol{w}} = (\Phi^T \Phi)^{-1} \Phi^T \boldsymbol{y}$
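Assuming a feature matrix $\Phi$ has already been built (e.g., with the RBF sketch above), the normal-equation solve is one line; `fit_generalized_linear` is an illustrative name:

```python
import numpy as np

def fit_generalized_linear(Phi, y):
    # w = (Phi^T Phi)^{-1} Phi^T y, computed via a linear solve.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# e.g. RBF features from the previous sketch plus a bias column:
# Phi = np.column_stack([np.ones(len(X)), rbf_features(X, prototypes, 0.5)])
# w_hat = fit_generalized_linear(Phi, y)
```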
Model complexity and overfitting
} With limited training data, models may achieve zero training error but a large test error.
  Training (empirical) loss: $\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w})\right)^2 \to 0$
  Expected (true) loss: $E_{\boldsymbol{x},y}\left[\left(y - f(\boldsymbol{x}; \boldsymbol{w})\right)^2\right] \gg 0$
} Over-fitting: the training loss no longer bears any relation to the test (generalization) loss.
} The model fails to generalize to unseen examples.
Polynomial regression
[Figure: fits of degree $m = 0, 1, 3, 9$ to the same data] [Bishop]
Polynomial regression: training and test error
} Root-mean-square error: $E_{RMS} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - f(x^{(i)}; \boldsymbol{w})\right)^2}$
[Figure: training and test $E_{RMS}$ versus polynomial degree $m$] [Bishop]
Over-fitting causes
} Model complexity
} e.g., a model with a large number of parameters (degrees of freedom)
} Too few training examples
} the data size is small compared to the complexity of the model
Model complexity
} Example: polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
[Figure: fits of degree $m = 0, 1, 3, 9$] [Bishop]
Number of training examples & overfitting
} The over-fitting problem becomes less severe as the size of the training set increases.
[Figure: degree $m = 9$ fits with $n = 15$ and $n = 100$ training points] [Bishop]
How to evaluate the learner's performance?
} Generalization error: the true (or expected) error that we would like to optimize
} Two ways to assess the generalization error are:
} Practical: use a separate data set to test the model
} Theoretical: law of large numbers
} statistical bounds on the difference between training and expected errors
Avoiding over-fitting
} Determine a suitable value for model complexity (model selection)
} Simple hold-out method
} Cross-validation
} Regularization (Occam's Razor)
} Explicit preference towards simple models
} Penalize model complexity in the objective function
} Bayesian approach
Evaluation and model selection
} Evaluation:
} We need to measure how well the learned function predicts the target for unseen examples
} Model selection:
} Most of the time we need to select among a set of models
} Example: polynomials with different degrees $m$
} and thus we need to evaluate these models first
Model Selection
} The learning algorithm defines the data-driven search over the hypothesis space (i.e., the search for good parameters)
} Hyperparameters are the tunable aspects of the model that the learning algorithm does not select
This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/
Model Selection
} Model selection is the process by which we choose the "best" model from among a set of candidates
} We assume access to a function capable of measuring the quality of a model
} It is typically done "outside" the main training algorithm
} Model selection / hyperparameter optimization is just another form of learning
This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/
๏ฟฝ Simple hold-out: model selection } Steps: } Divide training data into training and validation set ๐ค_๐ก๐๐ข } Use only the training set to train a set of models } Evaluate each learned model on the validation set 8 ๐ง (S) โ ๐ ๐ (S) ; ๐ / ~_โขโฌ? โ } ๐พ ~ ๐ = Sโ~_โขโฌ? } Choose the best model based on the validation set error } Usually, too wasteful of valuable training data } Training data may be limited. } On the other hand, small validation set give a relatively noisy estimate of performance. 23
Simple hold-out: training, validation, and test sets
} Simple hold-out chooses the model that minimizes the error on the validation set.
} $J_v(\hat{\boldsymbol{w}})$ is likely to be an optimistic estimate of the generalization error:
} an extra parameter (e.g., the degree of the polynomial) is fit to this set.
} Estimate the generalization error using the test set:
} the performance of the selected model is finally evaluated on the test set
[Figure: data split into Training | Validation | Test]
Cross-Validation (CV): Evaluation
} $k$-fold cross-validation steps:
} Shuffle the dataset and randomly partition the training data into $k$ groups of approximately equal size
} For $i = 1$ to $k$:
} Choose the $i$-th group as the held-out validation group
} Train the model on all but the $i$-th group of the data
} Evaluate the model on the held-out group
} The performance scores of the model from the $k$ runs are averaged.
} The average error rate can be considered an estimate of the true performance.
[Figure: which fold is held out in each of the $k$ runs]
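A compact sketch of the $k$-fold procedure for a degree-$m$ polynomial, again reusing the earlier `fit_poly` / `poly_features` helpers:

```python
import numpy as np

def kfold_mse(x, y, m, k=5, seed=0):
    # Average validation MSE of a degree-m polynomial over k folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit_poly(x[tr], y[tr], m)
        pred = poly_features(x[val], m) @ w
        errs.append(np.mean((y[val] - pred) ** 2))
    return np.mean(errs)
```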
Cross-Validation (CV): Model Selection
} For each model, we first find the average error found by CV.
} The model with the best average performance is selected.
Cross-validation: polynomial regression example
} 5-fold CV
} 100 runs, averaged
[Figure: fitted polynomials with CV error per degree — $m = 1$: $MSE = 0.30$; $m = 3$: $MSE = 1.45$; $m = 5$: $MSE = 45.44$; $m = 7$: $MSE = 31759$]
Leave-One-Out Cross Validation (LOOCV)
} When data is particularly scarce: cross-validation with $k = n$
} Leave-one-out treats each training sample in turn as a test example, with all other samples as the training set.
} Used for small datasets, when training data is valuable
} LOOCV can be computationally expensive, as $n$ training runs are required.
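Given the `kfold_mse` sketch above, LOOCV needs no new code, only $k = n$ (assuming `x`, `y`, and a degree `m` are already defined):

```python
# LOOCV is k-fold CV with one sample per fold, i.e. k = n:
loocv_err = kfold_mse(x, y, m, k=len(x))
```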
Regularization
} Adding a penalty term to the cost function discourages the coefficients from reaching large values.
} Ridge regression (weight decay):
  $J(\boldsymbol{w}) = \sum_{i=1}^{n}\left(y^{(i)} - \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}^{(i)})\right)^2 + \lambda \boldsymbol{w}^T \boldsymbol{w}$
  $\hat{\boldsymbol{w}} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T \boldsymbol{y}$
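A sketch of this closed-form ridge solution; note the simplification flagged in the comment:

```python
import numpy as np

def fit_ridge(Phi, y, lam):
    # w = (Phi^T Phi + lambda * I)^{-1} Phi^T y.
    # Note: this penalizes the bias weight too; in practice w_0 is
    # often left out of the penalty term.
    m = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)
```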
Polynomial order
} Polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
} The magnitude of the coefficients typically grows as $m$ increases. [Bishop]
Regularization parameter ($m = 9$)
[Table: values of the coefficients $\hat{w}_0, \hat{w}_1, \dots, \hat{w}_9$ for $\ln\lambda = -\infty$ and $\ln\lambda = -18$] [Bishop]
Regularization parameter
} Generalization
} $\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting. [Bishop]
Choosing the regularization parameter
} Consider a set of models with different values of $\lambda$.
} Find $\hat{\boldsymbol{w}}$ for each model based on the training data
} Find $J_v(\hat{\boldsymbol{w}})$ (or $J_{cv}(\hat{\boldsymbol{w}})$) for each model:
  $J_v(\boldsymbol{w}) = \frac{1}{n_v} \sum_{i \in v\_set} \left(y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w})\right)^2$
} Select the model with the best $J_v(\hat{\boldsymbol{w}})$ (or $J_{cv}(\hat{\boldsymbol{w}})$)
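A sketch of this selection loop over candidate $\lambda$ values, reusing `fit_ridge` from the ridge sketch; the grid and names are illustrative:

```python
import numpy as np

def select_lambda(Phi_tr, y_tr, Phi_val, y_val, lambdas):
    # Fit one ridge model per candidate lambda on the training split,
    # score each on the validation split, keep the best.
    best = (None, np.inf, None)
    for lam in lambdas:
        w = fit_ridge(Phi_tr, y_tr, lam)           # ridge sketch above
        err = np.mean((y_val - Phi_val @ w) ** 2)  # J_v(w_hat)
        if err < best[1]:
            best = (lam, err, w)
    return best  # (best lambda, its validation MSE, its weights)

# A typical grid sweeps ln(lambda), e.g.:
# lambdas = np.exp(np.linspace(-20, 2, 12))
```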
The approximation-generalization trade-off
} A small true error shows good approximation of $f$ out of sample
} More complex $\mathcal{H}$ $\Rightarrow$ better chance of approximating $f$
} Less complex $\mathcal{H}$ $\Rightarrow$ better chance of generalizing out of sample