

  1. Validation and Testing - COMPSCI 371D Machine Learning

  2. Outline
     1 Training, Testing, and Model Selection
     2 A Generative Data Model
     3 Model Selection: Validation
     4 Model Selection: Cross-Validation
     5 Model Selection: The Bootstrap

  3. Training, Testing, and Model Selection - Training and Testing
  • Empirical risk is the average loss over the training set: $L_T(h) \stackrel{\text{def}}{=} \frac{1}{|T|} \sum_{(x,y)\in T} \ell(y, h(x))$
  • Training is Empirical Risk Minimization: $\mathrm{ERM}_T(\mathcal{H}) \in \arg\min_{h\in\mathcal{H}} L_T(h)$ (a fitting problem; see the sketch below)
  • Not enough for machine learning: must generalize
  • Small loss on "previously unseen data"
  • How do we know? Evaluate on a separate test set S
  • This is called testing the predictor
  • How do we know that S and T are "related"?
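
  To make the two definitions above concrete, here is a minimal Python sketch (not from the slides; the function names, the toy data, and the finite candidate set are illustrative assumptions): empirical_risk computes L_T(h) for a given loss, and a brute-force erm picks the lowest-risk hypothesis from a finite list of candidates.

    import numpy as np

    def empirical_risk(h, T, loss):
        # L_T(h): average loss of predictor h over the set T of (x, y) pairs
        return np.mean([loss(y, h(x)) for x, y in T])

    def erm(H, T, loss):
        # ERM_T(H): hypothesis with minimal training risk in a finite candidate set H
        return min(H, key=lambda h: empirical_risk(h, T, loss))

    # Toy example: pick the best constant predictor under the quadratic loss
    T = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
    quadratic = lambda y, y_hat: (y - y_hat) ** 2
    H = [lambda x, c=c: c for c in np.linspace(0.0, 6.0, 61)]  # constant predictors h(x) = c
    h_best = erm(H, T, quadratic)
    print(empirical_risk(h_best, T, quadratic))  # minimal training risk (attained at c = 3.0)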

  4. Training, Testing, and Model Selection - Model Selection
  • Hyper-parameters: degree k for polynomials, number k of neighbors in k-NN
  • How to choose? Why not just include them with the parameters, and train?
  • Difficulty 0: k-NN has no training! No big deal
  • Difficulty 1: k ∈ ℕ, while v ∈ ℝ^m for some predictors. Hybrid optimization. Medium deal, just a technical difficulty
  • Difficulty 2: the answer from training would be trivial!
  • Can always achieve zero risk on T (see the sketch below)
  • So k must be chosen separately from training. It tunes generalization
  • This is what makes it a hyper-parameter
  • Choosing hyper-parameters is called model selection
  • Evaluate choices on a separate validation set V
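
  A small illustration of Difficulty 2, not from the slides: if k for k-NN were chosen by minimizing training risk, k = 1 would always win, since 1-NN memorizes the training set. The sketch below assumes scikit-learn and synthetic data.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))                                # synthetic features
    y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)   # noisy labels

    for k in (1, 5, 15):
        h = KNeighborsClassifier(n_neighbors=k).fit(X, y)
        print(k, 1.0 - h.score(X, y))   # zero-one training risk; k = 1 gives 0.0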

  5. Training, Testing, and Model Selection - Model Selection, Training, Testing
  • "Model" = H
  • Given a parametric family of hypothesis spaces, model selection selects one particular member of the family
  • Given a specific hypothesis space, training selects one particular predictor out of it
  • Use V to select the model, T to train, S to test
  • V, T, S are mutually disjoint but "related"
  • What does "related" mean?
  • Train on cats and test on horses?

  6. A Generative Data Model
  • What does "related" mean?
  • Every sample (x, y) comes from a joint probability distribution p(x, y)
  • True for training, validation, and test data, and for data seen during deployment
  • For the latter, y is "out there" but unknown
  • The goal of machine learning:
  • Define the (statistical) risk $L_p(h) = \mathbb{E}_p[\ell(y, h(x))] = \int \ell(y, h(x))\, p(x, y)\, dx\, dy$ (see the sketch below)
  • Learning performs (Statistical) Risk Minimization: $\mathrm{RM}_p(\mathcal{H}) \in \arg\min_{h\in\mathcal{H}} L_p(h)$
  • Lowest risk on $\mathcal{H}$: $L_p(\mathcal{H}) \stackrel{\text{def}}{=} \min_{h\in\mathcal{H}} L_p(h)$
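
  Since L_p(h) is just an expectation, it can be estimated by averaging over samples whenever p can be sampled. A toy Monte Carlo sketch (the generative model, the predictor, and the quadratic loss are all illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_p(n):
        # A toy generative model p(x, y): y = sin(x) plus Gaussian noise
        x = rng.uniform(-np.pi, np.pi, size=n)
        y = np.sin(x) + 0.1 * rng.normal(size=n)
        return x, y

    h = lambda x: 0.9 * np.sin(x)       # some fixed predictor
    x, y = sample_p(100_000)            # many draws from p
    print(np.mean((y - h(x)) ** 2))     # Monte Carlo estimate of L_p(h) under the quadratic loss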

  7. A Generative Data Model - p is Unknown
  • So, we don't need training data anymore?
  • We typically do not know p(x, y)
  • x = image? Or sentence?
  • Can we not estimate p?
  • The curse of dimensionality, again
  • We typically cannot find RM_p(H) or L_p(H)
  • That's the goal all the same

  8. A Generative Data Model - So Why Talk About It?
  • Why talk about p(x, y) if we cannot know it?
  • L_p(h) is a mean, and we can estimate means
  • We can sandwich L_p(h) or L_p(H) between bounds over all possible choices of p
  • What else would we do anyway?
  • p is conceptually clean and simple
  • The unattainable holy grail
  • Think of p as an oracle that sells samples from X × Y
  • She knows p, we don't
  • Samples cost money and effort! [Example: MNIST Database]

  9. A Generative Data Model - Even More Importantly...
  • We know what "related" means: T, V, S are all drawn independently from p(x, y)
  • We know what "generalize" means: find $\mathrm{RM}_p(\mathcal{H}) \in \arg\min_{h\in\mathcal{H}} L_p(h)$
  • We know the goal of machine learning

  10. Model Selection: Validation - Validation
  • Parametric family of hypothesis spaces $\mathcal{H} = \bigcup_{\pi\in\Pi} \mathcal{H}_\pi$
  • Finding a good vector $\hat{\pi}$ of hyper-parameters is called model selection
  • A popular method is called validation
  • Use a validation set V separate from T
  • Pick the hyper-parameter vector for which the predictor trained on the training set minimizes the validation risk: $\hat{\pi} = \arg\min_{\pi\in\Pi} L_V(\mathrm{ERM}_T(\mathcal{H}_\pi))$
  • When the set Π of hyper-parameters is finite, try them all

  11. Model Selection: Validation - Validation Algorithm
  procedure VALIDATION(H, Π, T, V, ℓ)
      L̂ = ∞                              ⊲ Stores the best risk so far on V
      for π ∈ Π do
          h ∈ arg min_{h′∈H_π} L_T(h′)    ⊲ Use loss ℓ to compute the best predictor ERM_T(H_π) on T
          L = L_V(h)                      ⊲ Use loss ℓ to evaluate the predictor's risk on V
          if L < L̂ then
              (π̂, ĥ, L̂) = (π, h, L)       ⊲ Keep track of the best hyper-parameters, predictor, and risk
          end if
      end for
      return (π̂, ĥ, L̂)                   ⊲ Return best hyper-parameters, predictor, and risk estimate
  end procedure
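
  A direct Python transcription of the pseudocode above, as a sketch: train(π, T) is assumed to return ERM_T(H_π) and risk(h, S) to return L_S(h) under the loss ℓ; neither callable is defined in the slides.

    import math

    def validation(train, risk, Pi, T, V):
        L_hat, pi_hat, h_hat = math.inf, None, None   # best risk on V so far, and its argmin
        for pi in Pi:                                 # try every hyper-parameter vector
            h = train(pi, T)                          # ERM_T(H_pi)
            L = risk(h, V)                            # validation risk L_V(h)
            if L < L_hat:
                pi_hat, h_hat, L_hat = pi, h, L       # keep the best so far
        return pi_hat, h_hat, L_hat                   # best hyper-parameters, predictor, risk estimate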

  12. Model Selection: Validation - Validation for Infinite Sets
  • When Π is not finite, scan and find a local minimum
  • Example: polynomial degree [figure: training risk and validation risk as a function of the degree k, with example polynomial fits for k = 1, 2, 3, 6, 9]
  • When Π is not countable, scan a grid and find a local minimum (see the sketch below)
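
  For instance, scanning a degree grid with ordinary least-squares polynomial fitting might look like the following sketch (toy data and names are illustrative assumptions; np.polyfit plays the role of ERM_T(H_k) under the quadratic loss):

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(2 * np.pi * x)                      # unknown "true" function
    x_t, x_v = rng.uniform(0, 1, 30), rng.uniform(0, 1, 30)  # training and validation inputs
    y_t = f(x_t) + 0.1 * rng.normal(size=30)                 # noisy training targets
    y_v = f(x_v) + 0.1 * rng.normal(size=30)                 # noisy validation targets

    risks = {}
    for k in range(1, 10):                                   # grid over the polynomial degree
        coeffs = np.polyfit(x_t, y_t, k)                     # least-squares fit on T (ERM for H_k)
        risks[k] = np.mean((y_v - np.polyval(coeffs, x_v)) ** 2)  # validation risk L_V
    k_hat = min(risks, key=risks.get)
    print(k_hat, risks[k_hat])                               # degree with the lowest validation risk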

  13. Model Selection: Cross-Validation - Resampling Methods for Validation
  • Validation is good but expensive: it needs separate data
  • A pity not to use V as part of T!
  • Resampling methods split T into T_k and V_k for k = 1, ..., K
  • (Nothing to do with the number of classes or the polynomial degree!)
  • For each π, for each k: train on T_k, test on V_k to measure performance
  • The average performance over k is taken as the validation risk for π
  • Let π̂ be the best π
  • Train the predictor in H_π̂ on all of T
  • Cross-validation and the bootstrap differ in how the splits are made

  14. Model Selection: Cross-Validation - K-Fold Cross-Validation
  • V_1, ..., V_K are a partition of T into approximately equal-sized sets
  • T_k = T \ V_k
  • For π ∈ Π: for k = 1, ..., K, train on T_k and measure performance on V_k; the average performance over k is the validation risk for π
  • Pick π̂ as the π with the best average performance
  • Train the predictor in H_π̂ on all of T
  • Since performance is an average, we also get a variance!
  • We don't have that for standard validation

  15. Model Selection: Cross-Validation - Cross-Validation Algorithm
  procedure CROSSVALIDATION(H, Π, T, K, ℓ)
      {V_1, ..., V_K} = SPLIT(T, K)          ⊲ Split T into K approximately equal-sized sets at random
      L̂ = ∞                                  ⊲ Will hold the lowest risk over Π
      for π ∈ Π do
          s, s_2 = 0, 0                       ⊲ Will hold the sum of risks and of their squares, to compute risk mean and variance
          for k = 1, ..., K do
              T_k = T \ V_k                   ⊲ Use all of T except V_k as the training set
              h ∈ arg min_{h′∈H_π} L_{T_k}(h′)    ⊲ Use the loss ℓ to compute h = ERM_{T_k}(H_π)
              L = L_{V_k}(h)                  ⊲ Use the loss ℓ to compute the risk of h on V_k
              (s, s_2) = (s + L, s_2 + L²)    ⊲ Keep track of quantities to compute risk mean and variance
          end for
          L = s / K                           ⊲ Sample mean of the risk over the K folds
          if L < L̂ then
              σ² = (s_2 − s²/K) / (K − 1)     ⊲ Sample variance of the risk over the K folds
              (π̂, L̂, σ̂²) = (π, L, σ²)         ⊲ Keep track of the best hyper-parameters and their risk statistics
          end if
      end for
      ĥ = arg min_{h∈H_π̂} L_T(h)             ⊲ Train the predictor afresh on all of T with the best hyper-parameters
      return (π̂, ĥ, L̂, σ̂²)                   ⊲ Return best hyper-parameters, predictor, and risk statistics
  end procedure
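
  A Python transcription of the procedure above, again as a sketch with the same assumed callables train(π, T) = ERM_T(H_π) and risk(h, S) = L_S(h):

    import math
    import numpy as np

    def cross_validation(train, risk, Pi, T, K):
        rng = np.random.default_rng(0)
        idx = rng.permutation(len(T))
        folds = np.array_split(idx, K)                    # V_1, ..., V_K: a random partition of T
        pi_hat, L_hat, var_hat = None, math.inf, None     # lowest mean risk over Pi and its statistics
        for pi in Pi:
            s = s2 = 0.0                                  # running sum of fold risks and their squares
            for fold in folds:
                fold_set = set(fold)
                V_k = [T[i] for i in fold]                # held-out fold
                T_k = [T[i] for i in idx if i not in fold_set]   # T_k = T \ V_k
                L = risk(train(pi, T_k), V_k)             # risk of ERM_{T_k}(H_pi) on V_k
                s, s2 = s + L, s2 + L * L
            L_mean = s / K                                # sample mean of the risk over the K folds
            if L_mean < L_hat:
                var_hat = (s2 - s * s / K) / (K - 1)      # sample variance over the K folds
                pi_hat, L_hat = pi, L_mean
        h_hat = train(pi_hat, T)                          # retrain on all of T with the best hyper-parameters
        return pi_hat, h_hat, L_hat, var_hat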

  16. Model Selection: Cross-Validation - How Big Is K?
  • T_k has |T|(K − 1)/K samples, so the predictor in each fold is a bit worse than the final predictor
  • Smaller K: more pessimistic risk estimate (upward bias, because we train on a smaller T_k)
  • Bigger K decreases the bias of the risk estimate (training on a bigger T_k)
  • Why not K = N?
  • LOOCV (Leave-One-Out Cross-Validation): train on all but one data point, test on that data point, repeat
  • Any issue?
  • Nadeau and Bengio recommend K = 15

  17. Model Selection: The Bootstrap - The Bootstrap
  • Bag or multiset: a set that allows for multiple instances
  • {a, a, b, b, b, c} has cardinality 6
  • Multiplicities: 2 for a, 3 for b, and 1 for c
  • A set is also a bag: {a, b, c}
  • Bootstrap: same as cross-validation, except
  • T_k: N samples drawn uniformly at random from T, with replacement
  • V_k = T \ T_k
  • T_k is a bag, V_k is a set
  • Repetitions change the training risk to a weighted average: $L_{T_k}(h) = \frac{1}{N}\sum_{n=1}^{N} \ell(y_n, h(x_n)) = \frac{1}{N}\sum_{j=1}^{J} m_j\, \ell(y_j, h(x_j))$, where the second sum ranges over the J distinct samples in T_k with multiplicities m_j (see the sketch below)
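
  One bootstrap split, sketched in Python (names and the stand-in per-sample losses are illustrative assumptions): it draws the bag T_k with replacement, forms V_k from the never-drawn points, and checks numerically that the plain average over the bag equals the multiplicity-weighted average over distinct points.

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(0)
    N = 10
    draw = rng.integers(0, N, size=N)        # N indices into T, drawn with replacement
    multiplicity = Counter(draw)             # m_j for each distinct index j in the bag T_k
    V_k = [i for i in range(N) if i not in multiplicity]   # V_k = T \ T_k (never-drawn points)

    loss = rng.random(N)                     # stand-in values for loss(y_i, h(x_i))
    bag_risk = np.mean(loss[draw])                                    # (1/N) sum over the bag
    weighted = sum(m * loss[j] for j, m in multiplicity.items()) / N  # (1/N) sum_j m_j * loss_j
    print(bag_risk, weighted, V_k)           # the two risks agree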
