Machine Learning Basics Lecture 6: Overfitting Princeton University COS 495 Instructor: Yingyu Liang
Review: machine learning basics
Math formulation
• Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
• Find y = f(x) ∈ H that minimizes the empirical loss L̂(f) = (1/n) Σ_{i=1}^n l(f, x_i, y_i)
• s.t. the expected loss is small: L(f) = E_{(x,y)~D}[l(f, x, y)]
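As a concrete illustration, the empirical loss above can be minimized directly over a simple hypothesis class. A minimal sketch, with details assumed rather than taken from the slides: squared loss, hypotheses of the form f(x) = w·x, and synthetic data with true slope 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not from the slides): n pairs (x_i, y_i) drawn i.i.d.,
# with y = 2x + Gaussian noise; hypothesis class {f(x) = w*x}, squared loss.
n = 200
x = rng.uniform(-1, 1, size=n)
y = 2 * x + 0.1 * rng.normal(size=n)

def empirical_loss(w, x, y):
    """L_hat(f) = (1/n) * sum_i l(f, x_i, y_i), with squared loss and f(x) = w*x."""
    return np.mean((w * x - y) ** 2)

# Minimize the empirical loss over a grid of w; with enough data the
# minimizer lands near the true slope 2, so the expected loss is small too.
ws = np.linspace(0, 4, 401)
w_hat = ws[np.argmin([empirical_loss(w, x, y) for w in ws])]
```

The grid search stands in for the optimization step; any minimizer of the empirical loss over this class would do.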
Machine learning 1-2-3
• Collect data and extract features
• Build model: choose hypothesis class H and loss function l
• Optimization: minimize the empirical loss
Machine learning 1-2-3
• Collect data and extract features (feature mapping)
• Build model: choose hypothesis class H (Occam's razor) and loss function l (maximum likelihood)
• Optimization: minimize the empirical loss (gradient descent; convex optimization)
Overfitting
Linear vs nonlinear models
φ(x) = (x1^2, x2^2, √2·x1·x2, √(2c)·x1, √(2c)·x2, c)
y = sign(w^T φ(x) + b)
Polynomial kernel (degree 2)
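The feature map on this slide can be checked numerically: the inner product of two mapped points should equal the degree-2 polynomial kernel (x·z + c)^2 evaluated directly. A sketch assuming 2-dimensional inputs and c = 1:

```python
import numpy as np

def phi(x, c=1.0):
    """Explicit feature map for the degree-2 polynomial kernel (x.z + c)^2 in 2D."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

# Arbitrary test points (illustrative values only).
x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])
lhs = phi(x) @ phi(z)        # inner product in the explicit feature space
rhs = (x @ z + 1.0) ** 2     # polynomial kernel evaluated directly
```

The identity lhs = rhs is what lets kernel methods use this nonlinear feature space without ever materializing φ.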
Linear vs nonlinear models
• Linear model: f(x) = a0 + a1 x
• Nonlinear model: f(x) = a0 + a1 x + a2 x^2 + a3 x^3 + … + aM x^M
• Linear model ⊆ nonlinear model (since one can always set ai = 0 for i > 1)
• So it looks like the nonlinear model can always achieve the same or smaller error
• Why, then, use Occam's razor (choose a smaller hypothesis class)?
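The nesting claim can be verified numerically: a degree-M polynomial is a degree-(M+1) polynomial with the top coefficient set to 0, so the best achievable training error is non-increasing in M. A small sketch on synthetic data (data-generating details assumed, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=20)

def train_error(M):
    """Least-squares fit of a degree-M polynomial; returns mean squared training error."""
    coeffs = np.polyfit(x, y, M)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Because degree M is nested inside degree M+1, the minimum training
# error over each class can only go down as M grows.
errors = [train_error(M) for M in range(0, 10)]
```

This is exactly why training error alone cannot answer the Occam's razor question: the bigger class always looks at least as good on the training set.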
Example: regression using polynomial curve
t = sin(2πx) + ε
Figure from Pattern Recognition and Machine Learning, Bishop
Example: regression using polynomial curve
t = sin(2πx) + ε; regression using a polynomial of degree M
Figure from Pattern Recognition and Machine Learning, Bishop
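Bishop's experiment can be reproduced numerically. A hedged sketch with assumed details the slides do not specify: 10 training points, noise level 0.2, and a large held-out sample standing in for the test error. A moderate degree captures the sine; degree 9 with 10 points interpolates the noise.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    """Sample n points of t = sin(2*pi*x) + eps (noise level assumed)."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)

x_tr, t_tr = make_data(10)      # small training set, as in the Bishop figures
x_te, t_te = make_data(1000)    # large held-out set approximating the test error

def errors(M):
    """Training and held-out mean squared error of a degree-M least-squares fit."""
    w = np.polyfit(x_tr, t_tr, M)
    tr = np.mean((np.polyval(w, x_tr) - t_tr) ** 2)
    te = np.mean((np.polyval(w, x_te) - t_te) ** 2)
    return tr, te

tr3, te3 = errors(3)   # moderate degree: roughly follows the sine
tr9, te9 = errors(9)   # degree 9 through 10 points: near-zero training error
```

The pattern to expect is the one in the figures: training error keeps falling with M while the held-out error eventually blows up.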
Prevent overfitting
• Empirical loss and expected loss are different
• Also called training error and test/generalization error
• The larger the data set, the smaller the difference between the two
• The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two
• Such a hypothesis has small training error but large test error (overfitting)
• A larger data set helps!
• Throwing away useless hypotheses also helps!
Prevent overfitting
• Empirical loss and expected loss are different
• Also called training error and test error
• The larger the hypothesis class, the easier it is to find a hypothesis that fits the difference between the two, and thus has small training error but large test error (overfitting)
• Throwing away useless hypotheses helps: use prior knowledge/model to prune hypotheses
• The larger the data set, the smaller the difference between the two
• A larger data set helps: use experience/data to prune hypotheses
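The claim that more data shrinks the gap between training and test error can be simulated. A minimal sketch under assumptions not in the slides: target sin(2πx) with Gaussian noise, degree-9 polynomial fits, squared loss, and the gap averaged over random draws.

```python
import numpy as np

rng = np.random.default_rng(3)

def avg_gap(n, M=9, trials=30):
    """Average |training error - held-out error| for degree-M fits on n samples."""
    gaps = []
    for _ in range(trials):
        x = rng.uniform(0, 1, n)
        t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)
        xt = rng.uniform(0, 1, 500)
        tt = np.sin(2 * np.pi * xt) + 0.2 * rng.normal(size=500)
        w = np.polyfit(x, t, M)
        tr = np.mean((np.polyval(w, x) - t) ** 2)
        te = np.mean((np.polyval(w, xt) - tt) ** 2)
        gaps.append(abs(te - tr))
    return float(np.mean(gaps))

gap_small = avg_gap(15)    # few samples: large train/test gap
gap_large = avg_gap(200)   # many samples: the two errors agree closely
```

The same hypothesis class overfits badly at n = 15 and barely at n = 200, which is the "larger data set helps" bullet in numbers.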
Prior vs. data
Prior vs experience
• Super strong prior knowledge: H = {f*}
• No data is needed!
(f*: the best function)
Prior vs experience
• Super strong prior knowledge: H = {f*, g1}
• A few data points suffice to detect f*
(f*: the best function)
Prior vs experience
• Super large data set: infinite data
• The hypothesis class H can be all functions!
(f*: the best function)
Prior vs experience
• Practical scenarios: finite data, H of medium capacity, f* may or may not be in H
(g1, g2: other hypotheses; f*: the best function)
Prior vs experience
• Practical scenarios lie between the two extreme cases: H = {f*} ← practice → infinite data
General Phenomenon Figure from Deep Learning , Goodfellow, Bengio and Courville
Cross validation
Model selection
• How to choose the optimal capacity?
• E.g., choose the best degree M for polynomial curve fitting
• Cannot be done with the training data alone
• Create held-out data to approximate the test error
• Called the validation data set
Model selection: cross validation
• Partition the training data into several groups
• Each time, use one group as the validation set
Figure from Pattern Recognition and Machine Learning, Bishop
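The partition-and-rotate procedure can be sketched as k-fold cross validation for choosing the polynomial degree M. Assumed setup, illustrative rather than from the slides: 30 noisy samples of sin(2πx), 5 folds, squared loss.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=30)

def cv_error(M, k=5):
    """k-fold cross-validation error of a degree-M polynomial fit."""
    folds = np.array_split(np.arange(len(x)), k)
    errs = []
    for i in range(k):
        val = folds[i]                                           # held-out group
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.polyfit(x[tr], t[tr], M)
        errs.append(np.mean((np.polyval(w, x[val]) - t[val]) ** 2))
    return float(np.mean(errs))

# Select the degree whose validation error, averaged over the folds, is smallest.
best_M = min(range(1, 10), key=cv_error)
```

Each data point is used for validation exactly once, so no separate held-out set is sacrificed, at the cost of fitting the model k times per candidate M.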
Model selection: cross validation
• Also used to select other hyper-parameters of the model/algorithm
• E.g., learning rate, stopping criterion of SGD, etc.
• Pros: general and simple
• Cons: computationally expensive; even worse when there are more hyper-parameters