Machine Learning
Lecture 3
Chenhao Tan, University of Colorado Boulder
Slides adapted from Thorsten Joachims
Logistics
• Homework assignments
• Final project
Overview
Sample error and generalization error
Bias-variance tradeoff
Model selection
Outline: Sample error and generalization error
Supervised learning
• S_train → h
• Target function f : X → Y (f is unknown)
• Goal: h approximates f
Problem Setup
• Instances in a learning problem follow a probability distribution P(X, Y)
• A sample S = {(x_1, y_1), ..., (x_n, y_n)} is independently and identically distributed (i.i.d.) according to P(X, Y)
• Examples:
  ◦ training sample S_train
  ◦ test sample S_test
Sample Error vs. Generalization Error
• Generalization error of a hypothesis h for a learning task P(X, Y):
  Err_P(h) = E[Δ(h(x), y)] = Σ_{x ∈ X, y ∈ Y} Δ(h(x), y) · P(X = x, Y = y)
• Sample error of a hypothesis h for a sample S:
  Err_S(h) = (1/n) Σ_{i=1}^{n} Δ(h(x_i), y_i)
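Below is a minimal sketch (not from the slides; the data and hypothesis are made up) of computing the sample error with 0/1 loss as Δ on a held-out i.i.d. sample, which serves as an estimate of the generalization error:

```python
import numpy as np

def sample_error(h, X, y):
    """Sample error Err_S(h): average 0/1 loss Δ(h(x_i), y_i) over the sample."""
    predictions = np.array([h(x) for x in X])
    return float(np.mean(predictions != y))

# Toy illustration: a 1-d threshold hypothesis evaluated on an i.i.d. test sample.
rng = np.random.default_rng(0)
X_test = rng.uniform(-1, 1, size=1000)
y_test = (X_test > 0).astype(int)        # labels from the (here known) target f
h = lambda x: int(x > 0.1)               # some learned hypothesis
print(sample_error(h, X_test, y_test))   # estimates the generalization error Err_P(h)
```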
Training error vs. Test error
• S_train → h
• Training error: Err_{S_train}(h)
• Test error: Err_{S_test}(h)
A concrete hypothetical example
• Predict flu trends using search data
• X: search data, Y: fraction of population with flu
• S_train = all data before 2012
• S_test = all data in 2012
• What is the problem with estimating generalization error this way? [Lazer et al., 2014]
Overfitting
[Figure: Friedman et al., 2001]
Outline: Bias-variance tradeoff
Bias-Variance Tradeoff
Assume a simple model y = f(x) + ε, with E(ε) = 0 and Var(ε) = σ²_ε. Then
  Err(x_0) = E[(y − h(x_0))² | X = x_0]
           = σ²_ε + [E h(x_0) − f(x_0)]² + E[h(x_0) − E h(x_0)]²
           = σ²_ε + Bias²(h(x_0)) + Var(h(x_0))
           = Irreducible Error + Bias² + Variance
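A sketch of the standard derivation behind this decomposition (not spelled out on the slide); it assumes E[ε] = 0 and that the noise ε at x_0 is independent of h, which was fit on a separate training sample:

```latex
\begin{aligned}
\mathrm{Err}(x_0)
  &= E\big[(f(x_0) + \epsilon - h(x_0))^2\big] \\
  &= E[\epsilon^2] + 2\,E[\epsilon]\,E\big[f(x_0) - h(x_0)\big]
       + E\big[(f(x_0) - h(x_0))^2\big]
     && \epsilon \perp h,\ E[\epsilon] = 0 \\
  &= \sigma_\epsilon^2 + \big(f(x_0) - E\,h(x_0)\big)^2
       + E\big[(h(x_0) - E\,h(x_0))^2\big]
     && E\big[h(x_0) - E\,h(x_0)\big] = 0 \\
  &= \sigma_\epsilon^2 + \mathrm{Bias}^2\big(h(x_0)\big) + \mathrm{Var}\big(h(x_0)\big).
\end{aligned}
```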
Example
[Figure: polynomial fits of order 1, 5, and 9 to the same data; x from 1.0 to 2.0, y from −1.00 to 1.00]
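A sketch (data and target function made up) that reproduces the flavor of this figure: fit polynomials of order 1, 5, and 9 to the same noisy sample and compare how the low-order fit underfits while the high-order fit chases the noise.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(1.0, 2.0, size=15))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)   # assumed f plus noise

x_grid = np.linspace(1.0, 2.0, 200)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, order in zip(axes, [1, 5, 9]):
    coeffs = np.polyfit(x, y, deg=order)          # least-squares polynomial fit
    ax.scatter(x, y, s=15)
    ax.plot(x_grid, np.polyval(coeffs, x_grid))
    ax.set(title=f"order={order}", xlabel="x", ylabel="y", ylim=(-1, 1))
plt.tight_layout()
plt.show()
```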
Revisit Overfitting
http://scott.fortmann-roe.com/docs/BiasVariance.html
K-NN Example
  Err(x_0) = σ²_ε + [ f(x_0) − (1/k) Σ_{l=1}^{k} f(x_(l)) ]² + σ²_ε / k
where x_(l) denotes the l-th nearest neighbor of x_0 in the training sample.
In homework 1!
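A simulation sketch (setup assumed; not part of the homework derivation) that checks the formula empirically: fix the training inputs, redraw only the label noise many times, and compare the empirical bias and variance of the k-NN prediction at x_0 with the two terms above.

```python
import numpy as np

def knn_predict(x0, X, y, k):
    """k-NN regression: average y over the k training points closest to x0."""
    idx = np.argsort(np.abs(X - x0))[:k]
    return y[idx].mean()

f = lambda x: np.sin(2 * np.pi * x)      # assumed true function
sigma_eps, k, x0 = 0.3, 5, 0.5
rng = np.random.default_rng(0)

X = rng.uniform(0, 1, size=100)          # fixed design: the formula treats the x's as fixed
preds = []
for _ in range(5000):                    # redraw only the label noise each round
    y = f(X) + rng.normal(scale=sigma_eps, size=X.shape)
    preds.append(knn_predict(x0, X, y, k))
preds = np.array(preds)

bias_sq = (preds.mean() - f(x0)) ** 2    # should match [f(x0) - (1/k) Σ f(x_(l))]²
variance = preds.var()                   # should match σ²_ε / k
print(sigma_eps**2 + bias_sq + variance) # empirical estimate of Err(x0)
```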
Outline: Model selection
Model Selection
• Training: run the learning algorithm m times (e.g., parameter search), producing hypotheses ĥ_1, ..., ĥ_m
• Validation error: Err_{S_val}(ĥ_i) is an estimate of Err_P(ĥ_i)
• Selection: use the ĥ_i with minimum Err_{S_val}(ĥ_i) for prediction on test examples
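A minimal sketch of this train/validation selection loop, assuming a scikit-learn-style learner; the hyperparameter grid and data below are placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data standing in for S_train.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_err, best_h = np.inf, None
for k in [1, 3, 5, 9, 15]:                               # run the learner m times
    h = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    val_err = np.mean(h.predict(X_val) != y_val)         # Err_{S_val}(h_i)
    if val_err < best_err:
        best_err, best_h = val_err, h
# best_h is the selected hypothesis; its error is then estimated once on S_test.
```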
Train-val-test
[Figure: splitting data into training, validation, and test sets]
K-fold cross validation
An estimate using all instances:
• Input: a sample S and a learning algorithm A.
• Procedure:
  ◦ Randomly split S into K equally-sized folds S_1, ..., S_K
  ◦ For each S_i, apply A to S_{−i}, obtain ĥ_i, and compute Err_{S_i}(ĥ_i)
• Performance estimate: (1/K) Σ_{i=1}^{K} Err_{S_i}(ĥ_i)
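A sketch of the procedure above written directly with NumPy rather than a library helper; the learner and the 0/1 loss are placeholders:

```python
import numpy as np

def k_fold_error(X, y, train_fn, K=5, seed=0):
    """Estimate Err_P by averaging Err_{S_i}(h_i) over K held-out folds."""
    # train_fn: takes (X, y) and returns a hypothesis, i.e., a function from inputs to predictions.
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)                     # roughly equal-sized folds S_1, ..., S_K
    errors = []
    for i in range(K):
        held_out = folds[i]
        rest = np.concatenate([folds[j] for j in range(K) if j != i])
        h_i = train_fn(X[rest], y[rest])               # apply A to S_{-i}
        errors.append(np.mean(h_i(X[held_out]) != y[held_out]))
    return float(np.mean(errors))
```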
K-fold Cross Validation
Example use:
• Find good features F using S_train
• Split S_train into K folds
• For each fold, build a classifier on the remaining training data with features F, and estimate prediction error as the average error rate over the folds
K-fold Cross Validation
Example use (Wrong!):
The procedure on the previous slide is wrong: the features F were selected using all of S_train, including each held-out fold, so the cross-validation error estimate is optimistically biased. Feature selection must be repeated inside each fold, using only that fold's training portion, as sketched below.
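A sketch of the corrected pipeline, assuming scikit-learn and placeholder data: putting feature selection inside a Pipeline means it is re-fit on the training portion of every fold.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for S_train.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 50))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Correct: feature selection is part of the pipeline, so cross_val_score
# re-runs it on the training portion of every fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print("CV accuracy estimate:", scores.mean())
```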
K-fold cross validation
• Select the best model using the training data
• Use nested cross-validation for performance estimation
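A sketch of nested cross-validation, again assuming scikit-learn and placeholder data: the inner loop selects hyperparameters, the outer loop estimates the performance of the whole selection procedure.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                       # placeholder data standing in for S_train
y = (X[:, 0] > 0).astype(int)

# Inner loop: select k on each outer-training split; outer loop: estimate the
# performance of the whole selection procedure, not of one chosen model.
inner = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 9]}, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=5)
print("nested CV accuracy estimate:", outer_scores.mean())
```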
Evaluating learned hypotheses
• Goal: find h with small prediction error Err_P(h) over P(X, Y)
• Question: what is Err_P(ĥ) for the ĥ obtained from training data S_train?
• Training error and test error:
  ◦ Training error: Err_{S_train}(ĥ)
  ◦ Test error: Err_{S_test}(ĥ) is an estimate of Err_P(ĥ)
What is the True Error of a Hypothesis?
• Apply ĥ to S_test; for each (x, y) ∈ S_test, observe Δ(ĥ(x), y).
• Binomial distribution estimate: assume each toss is independent and the probability of heads is p; then the probability of observing x heads in a sample of n independent coin tosses is
  Pr(X = x | p, n) = [n! / (x!(n − x)!)] p^x (1 − p)^(n−x)
  ◦ Normal approximation
  ◦ p̂ = Err_{S_test}(ĥ) = (1/n) Σ_{i=1}^{n} Δ(ĥ(x_i), y_i)
  ◦ Confidence interval: p̂ ± z_α √(p̂(1 − p̂)/n)
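A small sketch of the normal-approximation interval above; the per-example losses are placeholders standing in for Δ(ĥ(x_i), y_i) on a real test set.

```python
import numpy as np
from scipy.stats import norm

def error_confidence_interval(losses, alpha=0.05):
    """Normal-approximation interval p̂ ± z_α · sqrt(p̂(1 − p̂)/n) for the true error."""
    n = len(losses)
    p_hat = float(np.mean(losses))            # sample error on S_test
    z = norm.ppf(1 - alpha / 2)               # ≈ 1.96 for a 95% interval
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

losses = np.zeros(200)                        # placeholder: 200 test examples
losses[:30] = 1                               # pretend ĥ makes 30 mistakes
print(error_confidence_interval(losses))      # roughly (0.10, 0.20)
```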
Is hypothesis ĥ_1 better than ĥ_2? Same test sample
• Apply ĥ_1 and ĥ_2 to S_test
• Decide whether Err_P(ĥ_1) ≠ Err_P(ĥ_2)
• Null hypothesis: Err_{S_test}(ĥ_1) and Err_{S_test}(ĥ_2) come from binomial distributions with the same p
• Binomial Sign Test (McNemar's Test)
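A sketch of the binomial sign test on the examples where the two hypotheses disagree, using only scipy's binomial distribution; the per-example correctness vectors below are placeholders.

```python
import numpy as np
from scipy.stats import binom

def sign_test(correct1, correct2):
    """Two-sided binomial sign test (McNemar-style) on the disagreement examples."""
    correct1, correct2 = np.asarray(correct1), np.asarray(correct2)
    b = np.sum(correct1 & ~correct2)     # h1 right, h2 wrong
    c = np.sum(~correct1 & correct2)     # h2 right, h1 wrong
    n, k = b + c, min(b, c)
    # Under the null hypothesis, disagreements split 50/50 between the two hypotheses.
    p_value = 2 * binom.cdf(k, n, 0.5)
    return min(float(p_value), 1.0)

rng = np.random.default_rng(0)
correct1 = rng.random(300) < 0.85        # placeholder: per-example correctness of ĥ_1
correct2 = rng.random(300) < 0.80        # placeholder: per-example correctness of ĥ_2
print("p-value:", sign_test(correct1, correct2))
```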
Is hypothesis ĥ_1 better than ĥ_2? Different test samples
• Apply ĥ_1 to S_test1 and ĥ_2 to S_test2