Regression and generalization
CE-717: Machine Learning
Sharif University of Technology
M. Soleymani
Fall 2019
Topics
} Beyond linear regression models
} Evaluation & model selection
} Regularization
} Bias-Variance
Recall: Linear regression (squared loss)
} Linear regression functions:
  $f: \mathbb{R} \to \mathbb{R}$, with $f(x; \boldsymbol{w}) = w_0 + w_1 x$
  $f: \mathbb{R}^d \to \mathbb{R}$, with $f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 x_1 + \dots + w_d x_d$
  $\boldsymbol{w} = [w_0, w_1, \dots, w_d]$ are the parameters we need to set.
} Minimizing the squared loss for linear regression:
  $J(\boldsymbol{w}) = \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}\|_2^2$
} We obtain the closed-form solution $\widehat{\boldsymbol{w}} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$.
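As a concrete reference for the closed-form solution above, here is a minimal NumPy sketch; the synthetic data and all variable names are illustrative, not from the lecture.

```python
import numpy as np

# Minimal sketch of ordinary least squares: w_hat = (X^T X)^{-1} X^T y.
# Synthetic data for illustration only.
rng = np.random.default_rng(0)
n, d = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # bias column of ones
w_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Solve the normal equations directly; np.linalg.lstsq is the numerically
# safer alternative when X^T X is ill-conditioned.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # should be close to w_true
```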
Beyond linear regression
} How can we extend linear regression to non-linear functions?
} Transform the data using basis functions
} Learn a linear regression on the new feature vectors (obtained by the basis functions)
Beyond linear regression
} $m$-th order polynomial regression (univariate, $f: \mathbb{R} \to \mathbb{R}$):
  $f(x; \boldsymbol{w}) = w_0 + w_1 x + \dots + w_{m-1} x^{m-1} + w_m x^m$
} Solution: $\widehat{\boldsymbol{w}} = (\boldsymbol{X}'^\top \boldsymbol{X}')^{-1} \boldsymbol{X}'^\top \boldsymbol{y}$, where

$$\boldsymbol{X}' = \begin{bmatrix} 1 & x^{(1)} & (x^{(1)})^2 & \cdots & (x^{(1)})^m \\ 1 & x^{(2)} & (x^{(2)})^2 & \cdots & (x^{(2)})^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x^{(n)} & (x^{(n)})^2 & \cdots & (x^{(n)})^m \end{bmatrix}, \quad \boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \quad \widehat{\boldsymbol{w}} = \begin{bmatrix} \widehat{w}_0 \\ \vdots \\ \widehat{w}_m \end{bmatrix}$$
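A minimal sketch of this fit using NumPy's Vandermonde helper; the function names and the sine example are illustrative assumptions, not from the slides.

```python
import numpy as np

# Illustrative helpers: fit and evaluate an m-th order polynomial by
# ordinary least squares on the Vandermonde matrix X' above.
def fit_polynomial(x, y, m):
    Xp = np.vander(x, N=m + 1, increasing=True)  # columns: 1, x, x^2, ..., x^m
    w, *_ = np.linalg.lstsq(Xp, y, rcond=None)
    return w

def predict_polynomial(w, x):
    return np.vander(x, N=len(w), increasing=True) @ w

# Example: noisy samples of a sine, fit with m = 3
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=20)
w = fit_polynomial(x, y, m=3)
```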
Polynomial regression: example
[Figure: polynomial fits with $m = 1, 3, 5, 7$ to the same training data]
Generalized linear
} Linear combination of fixed non-linear functions of the input vector:
  $f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 \phi_1(\boldsymbol{x}) + \dots + w_m \phi_m(\boldsymbol{x})$
  $\{\phi_1(\boldsymbol{x}), \dots, \phi_m(\boldsymbol{x})\}$: set of basis functions (or features), $\phi_j : \mathbb{R}^d \to \mathbb{R}$
Basis functions: examples
} Linear
} Polynomial (univariate)
Basis functions: examples
} Gaussian: $\phi_j(\boldsymbol{x}) = \exp\left(-\dfrac{\|\boldsymbol{x} - \boldsymbol{\mu}_j\|^2}{2\sigma_j^2}\right)$
} Sigmoid: $\phi_j(\boldsymbol{x}) = \sigma\left(\dfrac{\|\boldsymbol{x} - \boldsymbol{\mu}_j\|}{\sigma_j}\right)$, where $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$
Radial Basis Functions: prototypes
} Predictions based on similarity to "prototypes":
  $\phi_j(\boldsymbol{x}) = \exp\left(-\dfrac{1}{2\sigma_j^2}\|\boldsymbol{x} - \boldsymbol{\mu}_j\|^2\right)$
} Each $\phi_j$ measures the similarity to one of the prototypes $\boldsymbol{\mu}_1, \dots, \boldsymbol{\mu}_m$
} $\sigma_j^2$ controls how quickly the similarity vanishes as a function of the distance to the prototype.
} The training examples themselves could serve as prototypes.
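A sketch of such a Gaussian RBF feature map in NumPy; the function name, the shared bandwidth, and the bias column are illustrative design choices, not prescribed by the slides.

```python
import numpy as np

# Gaussian RBF feature map: phi_j(x) = exp(-||x - mu_j||^2 / (2 sigma^2)).
# A single shared bandwidth sigma is assumed here for simplicity.
def rbf_features(X, prototypes, sigma):
    # X: (n, d) inputs; prototypes: (m, d) centers -> (n, m+1) features
    sq_dists = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-sq_dists / (2 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])  # prepend bias column
```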
Generalized linear: optimization
$J(\boldsymbol{w}) = \sum_{i=1}^{n} \left( y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2 = \sum_{i=1}^{n} \left( y^{(i)} - \boldsymbol{w}^\top \boldsymbol{\phi}(\boldsymbol{x}^{(i)}) \right)^2$

$$\boldsymbol{\Phi} = \begin{bmatrix} 1 & \phi_1(\boldsymbol{x}^{(1)}) & \cdots & \phi_m(\boldsymbol{x}^{(1)}) \\ 1 & \phi_1(\boldsymbol{x}^{(2)}) & \cdots & \phi_m(\boldsymbol{x}^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(\boldsymbol{x}^{(n)}) & \cdots & \phi_m(\boldsymbol{x}^{(n)}) \end{bmatrix}, \quad \boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \quad \boldsymbol{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}$$

$\widehat{\boldsymbol{w}} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \boldsymbol{y}$
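Continuing the sketch: the fit in the transformed feature space is just least squares on $\boldsymbol{\Phi}$, reusing the illustrative rbf_features helper from above (the data and bandwidth are made up for the example).

```python
import numpy as np

# Synthetic 1-D regression problem, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

# Training points as prototypes, then ordinary least squares on Phi
Phi = rbf_features(X, prototypes=X, sigma=1.0)
w_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]  # solves Phi w ~= y
y_fit = Phi @ w_hat
```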
Model complexity and overfitting
} With limited training data, a model may achieve (near-)zero training error yet a large test error:
  Training (empirical) loss: $\dfrac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2 \to 0$
  Expected (true) loss: $E_{\boldsymbol{x}, y}\left[ \left( y - f(\boldsymbol{x}; \boldsymbol{w}) \right)^2 \right] \gg 0$
} Over-fitting: when the training loss no longer bears any relation to the test (generalization) loss.
} The model fails to generalize to unseen examples.
Polynomial regression
[Figure from Bishop: polynomial fits with $M = 0, 1, 3, 9$ to the same data]
Polynomial regression: training and test error
$\mathrm{RMSE} = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2}$
[Figure from Bishop: training and test RMS error versus polynomial order]
Over-fitting causes
} Model complexity
  } e.g., a model with a large number of parameters (degrees of freedom)
} Too few training data
  } i.e., the data size is small compared to the complexity of the model
Model complexity
} Example: polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
[Figure from Bishop: fits with $M = 0, 1, 3, 9$]
Number of training data & overfitting
} The over-fitting problem becomes less severe as the size of the training data increases.
[Figure from Bishop: an $M = 9$ polynomial fit with $n = 15$ versus $n = 100$ training points]
How to evaluate the learner's performance?
} Generalization error: the true (or expected) error that we would like to optimize
} Two ways to assess the generalization error:
  } Practical: use a separate data set to test the model
  } Theoretical: Law of Large Numbers
    } statistical bounds on the difference between training and expected errors
Avoiding over-fitting
} Determine a suitable value for model complexity (model selection)
  } Simple hold-out method
  } Cross-validation
} Regularization (Occam's Razor)
  } Explicit preference towards simple models
  } Penalize model complexity in the objective function
} Bayesian approach
Evaluation and model selection
} Evaluation:
  } We need to measure how well the learned function predicts the target for unseen examples.
} Model selection:
  } Most of the time we need to select among a set of models
    } Example: polynomials with different degree $m$
  } and thus we need to evaluate these models first
Model Selection
} The learning algorithm defines the data-driven search over the hypothesis space
  } i.e., the search for good parameters
} Hyper-parameters are the tunable aspects of the model that the learning algorithm does not select

This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/
Model Selection
} Model selection is the process by which we choose the "best" model among a set of candidates
  } assumes access to a function capable of measuring the quality of a model
  } typically done "outside" the main training algorithm
} Model selection / hyper-parameter optimization is just another form of learning

This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/
Simple hold-out: model selection
} Steps:
  } Divide the training data into a training set and a validation set $v\_set$
  } Use only the training set to train each candidate model
  } Evaluate each learned model on the validation set:
    $J_v(\boldsymbol{w}) = \dfrac{1}{|v\_set|} \sum_{i \in v\_set} \left( y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2$
  } Choose the model with the lowest validation-set error
} Usually too wasteful of valuable training data:
  } training data may be limited,
  } while, on the other hand, a small validation set gives a relatively noisy estimate of performance.
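A minimal hold-out sketch, reusing the earlier illustrative fit_polynomial / predict_polynomial helpers; the split fraction and all names are assumptions for illustration.

```python
import numpy as np

# Hold-out model selection: train each candidate degree on the training
# part only, then pick the degree with the lowest validation MSE.
def holdout_select(x, y, degrees, val_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = int(val_fraction * len(x))
    val, tr = idx[:n_val], idx[n_val:]

    def val_mse(m):
        w = fit_polynomial(x[tr], y[tr], m)
        return np.mean((y[val] - predict_polynomial(w, x[val])) ** 2)

    return min(degrees, key=val_mse)  # e.g., degrees = [1, 3, 5, 7]
```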
Simple hold-out: training, validation, and test sets
} Simple hold-out chooses the model that minimizes error on the validation set.
} $J_v(\widehat{\boldsymbol{w}})$ is likely to be an optimistic estimate of the generalization error:
  } an extra parameter (e.g., the degree of the polynomial) has been fit to this set.
} Therefore, estimate the generalization error on a separate test set:
  } the performance of the selected model is finally evaluated on the test set.
[Diagram: data split into Training | Validation | Test]
Cross-Validation (CV): Evaluation
} $k$-fold cross-validation steps:
  } Shuffle the dataset and randomly partition the training data into $k$ groups of approximately equal size
  } for $i = 1$ to $k$:
    } choose the $i$-th group as the held-out validation group
    } train the model on all but the $i$-th group of data
    } evaluate the model on the held-out group
} The performance scores of the model from the $k$ runs are averaged.
} The average error rate can be considered an estimate of the true performance of the model.
[Diagram: in each of the $k$ runs, a different fold is held out for validation]
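A sketch of $k$-fold CV for a single polynomial degree, again reusing the illustrative helpers from above; note that setting $k$ equal to the number of examples gives the LOOCV variant discussed below.

```python
import numpy as np

# k-fold cross-validation estimate of the MSE of an m-th order polynomial.
# With k = len(x) this reduces to leave-one-out cross-validation (LOOCV).
def cv_mse(x, y, m, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        val = folds[i]                                               # held-out fold
        tr = np.concatenate([folds[j] for j in range(k) if j != i])  # the rest
        w = fit_polynomial(x[tr], y[tr], m)
        errors.append(np.mean((y[val] - predict_polynomial(w, x[val])) ** 2))
    return np.mean(errors)  # average validation error over the k runs
```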
Cross-Validation (CV): Model Selection
} For each model, we first find the average error by CV.
} The model with the best average performance is selected.
Cross-validation: polynomial regression example
} 5-fold CV, averaged over 100 runs
} $m = 1$: CV MSE = 1.45; $m = 3$: CV MSE = 0.30; $m = 5$: CV MSE = 45.44; $m = 7$: CV MSE = 31759
[Figure: the corresponding polynomial fits for $m = 1, 3, 5, 7$]
Leave-One-Out Cross-Validation (LOOCV)
} When data is particularly scarce, use cross-validation with $k = n$.
} Leave-one-out treats each training sample in turn as the sole validation example and all other samples as the training set.
} Useful for small datasets, when training data is valuable.
} LOOCV can be time-expensive, as $n$ training steps are required.
Avoiding over-fitting
} Determine a suitable value for model complexity (model selection)
  } Simple hold-out method
  } Cross-validation
} Regularization (Occam's Razor)
  } Explicit preference towards simple models
  } Penalize model complexity in the objective function
} Bayesian approach
Regularization
} Add a penalty term to the cost function to discourage the coefficients from reaching large values.
} Ridge regression (weight decay):
  $J(\boldsymbol{w}) = \sum_{i=1}^{n} \left( y^{(i)} - \boldsymbol{w}^\top \boldsymbol{\phi}(\boldsymbol{x}^{(i)}) \right)^2 + \lambda \boldsymbol{w}^\top \boldsymbol{w}$
} Closed-form solution:
  $\widehat{\boldsymbol{w}} = (\boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \lambda \boldsymbol{I})^{-1} \boldsymbol{\Phi}^\top \boldsymbol{y}$
  with $\boldsymbol{\Phi}$, $\boldsymbol{y}$, and $\boldsymbol{w}$ defined as before.
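A sketch of the ridge closed form above (the function name is illustrative). The slide's penalty includes the bias $w_0$; in practice the bias is often left unpenalized, which would only require zeroing the first diagonal entry of the identity term.

```python
import numpy as np

# Ridge regression closed form: w_hat = (Phi^T Phi + lambda I)^{-1} Phi^T y.
# Penalizes all coefficients, including the bias, as in the slide's formula.
def ridge_fit(Phi, y, lam):
    m = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)
```

The added $\lambda \boldsymbol{I}$ also makes the system well-conditioned even when $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ is singular, which is one practical reason ridge is preferred over the plain normal equations.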
Polynomial order
} Polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
} The magnitude of the coefficients typically grows as $m$ increases. [Bishop]