CS 6316 Machine Learning
Model Selection and Validation
Yangfeng Ji
Department of Computer Science
University of Virginia
Overview
Polynomials
[Figure: polynomial regression fits of the same data with degree (a) d = 1, (b) d = 3, and (c) d = 15]
Boosting
AdaBoost combines T weak classifiers to form a (strong) classifier
$h(x) = \operatorname{sign}\left(\sum_{t=1}^{T} w_t h_t(x)\right)$  (1)
where T controls the model complexity [Mohri et al., 2018, Page 147]
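A minimal sketch of the ensemble prediction in equation (1); the list of weak classifiers and the weight list are hypothetical placeholders (the slides do not prescribe how they are stored):

```python
import numpy as np

def adaboost_predict(x, weak_classifiers, weights):
    """Combine T weak classifiers into a strong classifier, as in equation (1).

    weak_classifiers: list of callables h_t, each returning -1 or +1
    weights: list of floats w_t, one per weak classifier
    """
    score = sum(w_t * h_t(x) for w_t, h_t in zip(weights, weak_classifiers))
    return np.sign(score)
```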
Structural Risk Minimization
Take linear regression with ℓ2 regularization as an example. Let H_λ denote the hypothesis space defined with the following objective function
$L_{S,\ell_2}(h_w) = \frac{1}{m} \sum_{i=1}^{m} (h_w(x_i) - y_i)^2 + \lambda \|w\|^2$  (2)
where λ is the regularization parameter
◮ The basic idea of SRM is to start from a small hypothesis space (e.g., H_λ with a large λ), then gradually relax the regularization (decrease λ) to obtain a larger H_λ
◮ Another example: Support Vector Machines (next lecture)
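A minimal sketch of how the learned weights shrink as λ grows, using ridge regression as in equation (2); the toy data, the λ grid, and the use of scikit-learn's Ridge are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

# Larger lambda penalizes ||w||^2 more heavily, shrinking the learned weights
for lam in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=lam).fit(X, y)
    train_mse = np.mean((model.predict(X) - y) ** 2)
    print(f"lambda={lam:5.2f}  ||w||={np.linalg.norm(model.coef_):.3f}  train MSE={train_mse:.4f}")
```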
Model Evaluation and Selection
Since we cannot compute the true error of any given hypothesis h ∈ H:
◮ How to evaluate the performance of a given model?
◮ How to select the best model among a few candidates?
Model Validation
Validation Set
The simplest way to estimate the true error of a predictor h:
◮ Independently sample an additional set of examples V with size m_v
$V = \{(x_1, y_1), \ldots, (x_{m_v}, y_{m_v})\}$  (3)
◮ Evaluate the predictor h on this validation set
$L_V(h) = \frac{|\{i \in [m_v] : h(x_i) \neq y_i\}|}{m_v}$  (4)
Usually, L_V(h) is a good approximation to L_D(h)
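A minimal sketch of computing the validation error in equation (4) under the 0-1 loss; the names predict, X_val, and y_val are illustrative placeholders:

```python
import numpy as np

def validation_error(predict, X_val, y_val):
    """0-1 validation error L_V(h): the fraction of validation examples
    on which the predictor disagrees with the true label."""
    predictions = np.array([predict(x) for x in X_val])
    return float(np.mean(predictions != np.asarray(y_val)))
```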
Theorem
Let h be some predictor and assume that the loss function is in [0, 1]. Then, for every δ ∈ (0, 1), with probability of at least 1 − δ over the choice of a validation set V of size m_v, we have
$|L_V(h) - L_D(h)| \leq \sqrt{\frac{\log(2/\delta)}{2 m_v}}$  (5)
where
◮ L_V(h): the validation error
◮ L_D(h): the true error
[Shalev-Shwartz and Ben-David, 2014, Theorem 11.1]
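A quick numeric illustration of the bound in equation (5); the choice δ = 0.05 and the validation-set sizes below are arbitrary examples:

```python
import numpy as np

def validation_gap_bound(m_v, delta=0.05):
    """Upper bound on |L_V(h) - L_D(h)| from equation (5)."""
    return np.sqrt(np.log(2 / delta) / (2 * m_v))

for m_v in [100, 1000, 10000]:
    print(f"m_v = {m_v:6d}  ->  bound = {validation_gap_bound(m_v):.4f}")
# Roughly 0.136 for 100 examples, 0.043 for 1,000, and 0.014 for 10,000
```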
Sample Complexity
◮ The fundamental theorem of learning
$L_D(h) \leq L_S(h) + C\sqrt{\frac{d + \log(1/\delta)}{m}}$  (6)
where d is the VC dimension of the corresponding hypothesis space
◮ On the other hand, from the previous theorem
$L_D(h) \leq L_V(h) + \sqrt{\frac{\log(2/\delta)}{2 m_v}}$  (7)
◮ The validation bound (7) does not involve the VC dimension, so the validation set does not need to be as large as the training set to give a reliable estimate of the true error
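A small numeric comparison of the estimation-error terms in (6) and (7); the constant C, the VC dimension d, and the sample sizes are arbitrary illustrative choices:

```python
import numpy as np

def vc_bound_term(m, d, delta=0.05, C=1.0):
    """Estimation-error term in equation (6)."""
    return C * np.sqrt((d + np.log(1 / delta)) / m)

def validation_bound_term(m_v, delta=0.05):
    """Estimation-error term in equation (7)."""
    return np.sqrt(np.log(2 / delta) / (2 * m_v))

# With d = 50, a validation set of 1,000 examples already gives a tighter
# guarantee than the VC-based bound on a training set of 10,000 examples.
print(vc_bound_term(m=10_000, d=50))      # ~0.073
print(validation_bound_term(m_v=1_000))   # ~0.043
```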
Model Selection
Model Selection Procedure
Given the training set S and the validation set V:
◮ For each model configuration c, find the best hypothesis h_c(x, S)
$h_c(x, S) = \operatorname*{argmin}_{h' \in H_c} L_S(h'(x, S))$  (8)
◮ With the collection of best models under different configurations, H' = {h_{c_1}(x, S), ..., h_{c_k}(x, S)}, find the overall best hypothesis
$h(x, S) = \operatorname*{argmin}_{h' \in H'} L_V(h'(x, S))$  (9)
◮ The second step is similar to learning with the finite hypothesis space H'
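A minimal sketch of this two-step procedure, assuming a generic train(config, S) routine that returns the best hypothesis in H_c and an error(h, data) function; both names are hypothetical stand-ins:

```python
def select_model(configs, S, V, train, error):
    """Two-step model selection following equations (8) and (9).

    configs: model configurations c_1, ..., c_k
    S, V:    training and validation sets
    train:   returns the best hypothesis h_c in H_c on the training set
    error:   empirical error of a hypothesis on a data set
    """
    # Step 1: fit the best hypothesis in each H_c on the training set (eq. 8)
    candidates = [train(c, S) for c in configs]
    # Step 2: pick the candidate with the smallest validation error (eq. 9)
    return min(candidates, key=lambda h: error(h, V))
```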
Model Configuration/Hyperparameters
Consider polynomial regression
$H_d = \{w_0 + w_1 x + \cdots + w_d x^d : w_0, w_1, \ldots, w_d \in \mathbb{R}\}$  (10)
◮ the degree of the polynomial d
◮ the regularization coefficient λ, as in $\lambda \cdot \|w\|_2^2$
◮ the bias term w_0
Additional factors during learning:
◮ Optimization methods
◮ Dimensionality of inputs, etc.
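A minimal sketch of searching over the degree d and the regularization coefficient λ for polynomial regression; the toy data, the candidate grids, and the use of scikit-learn's PolynomialFeatures and Ridge are illustrative assumptions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

# Toy 1-d regression data (illustrative only)
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=(80, 1))
y_train = np.sin(3 * x_train[:, 0]) + 0.1 * rng.normal(size=80)
x_val = rng.uniform(-1, 1, size=(40, 1))
y_val = np.sin(3 * x_val[:, 0]) + 0.1 * rng.normal(size=40)

best = None
for d in [1, 3, 15]:              # candidate polynomial degrees
    for lam in [0.01, 0.1, 1.0]:  # candidate regularization coefficients
        model = make_pipeline(PolynomialFeatures(degree=d), Ridge(alpha=lam))
        model.fit(x_train, y_train)
        val_mse = np.mean((model.predict(x_val) - y_val) ** 2)
        if best is None or val_mse < best[0]:
            best = (val_mse, d, lam)

print(f"selected configuration: d={best[1]}, lambda={best[2]} (val MSE {best[0]:.4f})")
```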
Limitation of Keeping a Validation Set
If the validation set is
◮ small, then it may be biased and fail to give a good approximation to the true error
◮ large, e.g., of the same order as the training set, then we waste information by not using those examples for training
k-Fold Cross Validation
The basic procedure of k-fold cross validation:
◮ Split the whole data set into k parts
◮ For each model configuration, run the learning procedure k times
◮ Each time, pick one part as validation set and the rest as training set
◮ Take the average of k validation errors as the model error
[Figure: the data set split into Fold 1 through Fold 5]
Cross-Validation Algorithm
1: Input: (1) training set S; (2) set of parameter values Θ; (3) learning algorithm A; and (4) integer k
2: Partition S into S_1, S_2, ..., S_k
3: for θ ∈ Θ do
4:   for i = 1, ..., k do
5:     h_{i,θ} = A(S \ S_i; θ)
6:   end for
7:   Err(θ) = (1/k) Σ_{i=1}^{k} L_{S_i}(h_{i,θ})
8: end for
9: Output: θ* = argmin_θ Err(θ) and the hypothesis h_{θ*} = A(S; θ*) trained on all of S
In practice, k is usually 5 or 10.
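A minimal Python rendering of the algorithm above, assuming a learning algorithm A(train_set, theta) that returns a hypothesis and an error(h, data_set) function; both are hypothetical placeholders:

```python
import numpy as np

def k_fold_cross_validation(S, thetas, A, error, k=5):
    """k-fold cross validation for model selection (see the algorithm above)."""
    folds = np.array_split(np.arange(len(S)), k)  # partition S into S_1, ..., S_k
    best_theta, best_err = None, float("inf")
    for theta in thetas:
        fold_errors = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            h = A([S[j] for j in train_idx], theta)                # train on S \ S_i
            fold_errors.append(error(h, [S[j] for j in val_idx]))  # validate on S_i
        avg_err = float(np.mean(fold_errors))                      # Err(theta)
        if avg_err < best_err:
            best_theta, best_err = theta, avg_err
    return A(S, best_theta)  # retrain on all of S with the selected theta
```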
Train-Validation-Test Split
◮ Training set: used for learning with a pre-selected hypothesis space, such as
  ◮ logistic regression for classification
  ◮ polynomial regression with d = 15 and λ = 0.1
◮ Validation set: used for selecting the best hypothesis across multiple hypothesis spaces
  ◮ Similar to learning with a finite hypothesis space H'
◮ Test set: only used for evaluating the overall best hypothesis
Typical splits on all available data:
[Figure: either a single Train / Val / Test split, or Folds 1-5 for cross validation plus a held-out Test set]
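A minimal sketch of a random train/validation/test split; the 60/20/20 proportions and the assumption that X and y are NumPy arrays are illustrative choices:

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly split (X, y) into training, validation, and test portions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    n_val = int(val_frac * len(X))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])
```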
Model Selection in Practice
What To Do If Learning Fails
There are many elements that can help fix the learning procedure:
◮ Get a larger sample
◮ Change the hypothesis class by
  ◮ Enlarging it
  ◮ Reducing it
  ◮ Completely changing it
  ◮ Changing the parameters you consider
◮ Change the feature representation of the data (usually domain dependent)
[Shalev-Shwartz and Ben-David, 2014, Page 151]