Summary

◮ Overfitting arises when we evaluate and train on the same data.
◮ We can bound the error of a fixed function with Hoeffding's inequality.
◮ Next lecture we'll get a version sensitive to function class size.
Part 3. . .
Overfitting, in pictures

With SVM, the model size scales with C. We had our best test error in the middle.

[Figure: misclassification rate (train and test) as C varies on a log scale, together with three decision-boundary contour plots:
C = 1.481101, λ = 0.006752; train 0.010000, test 0.060000
C = 0.040000, λ = 0.250000; train 0.050000, test 0.076000
C = 20.480000, λ = 0.000488; train 0.000000, test 0.076000]
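The exact kernel and dataset behind the figure are not reproduced here; as a rough sketch (assuming scikit-learn, with an illustrative synthetic dataset and RBF kernel), a sweep like the following typically shows the same qualitative picture, with training error falling as C grows while test error bottoms out in the middle:

    # Sweep the SVM cost parameter C and record train/test misclassification rates.
    # Assumes scikit-learn; the dataset and RBF kernel are illustrative choices.
    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=600, noise=0.3, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    for C in np.logspace(-2, 2, 9):
        clf = SVC(C=C, kernel="rbf").fit(X_tr, y_tr)
        train_err = 1.0 - clf.score(X_tr, y_tr)
        test_err = 1.0 - clf.score(X_te, y_te)
        print(f"C={C:8.3f}  train={train_err:.3f}  test={test_err:.3f}")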
Bernoulli walks

Let $Z_i$ be Bernoulli with $\mathbb{E} Z_i = 1/2$; consider $\sum_{i=1}^t (2 Z_i - 1)$.

[Figure: several sample paths of the walk, plotted for $t$ up to 1000.]

Fact: with probability $\geq 1 - 1/e$, the position is at most $\sqrt{2n}$.

Thus: with probability $\geq 1 - 1/e$, $R(h) - \widehat{R}(h) \leq \sqrt{1/(2n)}$.
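The Fact is easy to sanity-check by simulation; a minimal sketch assuming NumPy, with n and the number of trials chosen arbitrarily:

    # Monte Carlo check: with probability >= 1 - 1/e, the endpoint of the walk
    # sum_{i<=n} (2 Z_i - 1) is at most sqrt(2 n).
    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 1000, 10000
    steps = 2 * rng.integers(0, 2, size=(trials, n)) - 1   # each step is +/- 1
    endpoints = steps.sum(axis=1)
    frac = np.mean(endpoints <= np.sqrt(2 * n))
    print(f"empirical P(endpoint <= sqrt(2n)) = {frac:.3f}  (claimed >= {1 - 1/np.e:.3f})")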
Two ways to get that Bernoulli walk "Fact"

Theorem (via Chebyshev). Given IID $Z_i \in [a, b]$, with probability $\geq 1 - \delta$,
$$\left| \mathbb{E} Z_1 - \frac{1}{n} \sum_{i=1}^n Z_i \right| \leq (b - a) \sqrt{\frac{1/\delta}{4n}}.$$

Theorem (via Hoeffding). Given IID $Z_i \in [a, b]$, with probability $\geq 1 - \delta$,
$$\mathbb{E} Z_1 - \frac{1}{n} \sum_{i=1}^n Z_i \leq (b - a) \sqrt{\frac{\ln(1/\delta)}{2n}}.$$

Remarks.
◮ Defining $Z_i := \mathbf{1}[h(X_i) \neq Y_i]$ for a fixed $h$ chosen without seeing $((X_i, Y_i))_{i=1}^n$, the left hand side becomes $R(h) - \widehat{R}(h)$.
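The difference between the two bounds is the $1/\delta$ versus $\ln(1/\delta)$ dependence; a small sketch comparing the widths (assuming $[a,b] = [0,1]$, with arbitrary n and δ values):

    # Compare the deviation widths from Chebyshev and Hoeffding for [a, b] = [0, 1].
    import math

    def chebyshev_width(n, delta, width=1.0):
        return width * math.sqrt((1.0 / delta) / (4 * n))

    def hoeffding_width(n, delta, width=1.0):
        return width * math.sqrt(math.log(1.0 / delta) / (2 * n))

    n = 1000
    for delta in (0.1, 0.01, 0.001):
        print(f"delta={delta:6.3f}  Chebyshev={chebyshev_width(n, delta):.4f}"
              f"  Hoeffding={hoeffding_width(n, delta):.4f}")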
Overfitting

◮ These bounds require IID $(Z_i)_{i=1}^n$ where $Z_i := \mathbf{1}[h(X_i) \neq Y_i]$.
◮ If $h$ depends on $((X_i, Y_i))_{i=1}^n$, we can't guarantee independence of the $Z_i$.
◮ E.g., suppose $h$ memorizes the training data and outputs "bear" on new data; we can force $\widehat{R}(h) = 0$ and $R(h) = 0$, and also $\widehat{R}(h) = 0$ but $R(h) = 1$.
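A minimal sketch of the memorization failure (assuming NumPy; the constant fallback prediction and the data are illustrative, with labels drawn as fair coins so that nothing generalizes off the training set):

    # A classifier that memorizes its training set and guesses a constant elsewhere:
    # its empirical risk is forced to 0, while its true risk depends entirely on the
    # unseen points.  Toy data; labels here are independent coin flips.
    import numpy as np

    rng = np.random.default_rng(0)

    class Memorizer:
        def fit(self, X, y):
            self.table = {tuple(x): label for x, label in zip(X, y)}
            return self
        def predict(self, X):
            # answer from memory if seen before, otherwise always predict class 0
            return np.array([self.table.get(tuple(x), 0) for x in X])

    X_tr = rng.normal(size=(50, 2)); y_tr = rng.integers(0, 2, 50)
    X_te = rng.normal(size=(1000, 2)); y_te = rng.integers(0, 2, 1000)

    h = Memorizer().fit(X_tr, y_tr)
    print("train error:", np.mean(h.predict(X_tr) != y_tr))   # always 0
    print("test  error:", np.mean(h.predict(X_te) != y_te))   # roughly 0.5 here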
11. Finite classes
Controlling k predictors

Theorem. Let $Z_{i,j} \in [a, b]$ be given, where $(Z_{i,j})_{i=1}^n$ are independent for each $j$ (but nothing is said across $j$). With probability at least $1 - \delta$,
$$\max_{j \in \{1, \ldots, k\}} \left( \mathbb{E} Z_{1,j} - \frac{1}{n} \sum_{i=1}^n Z_{i,j} \right) \leq (b - a) \sqrt{\frac{\ln k + \ln(1/\delta)}{2n}}.$$

Theorem. Let predictors $(h_1, \ldots, h_k)$ be given. With probability $\geq 1 - \delta$ over an IID draw $((X_i, Y_i))_{i=1}^n$,
$$R(h_j) \leq \widehat{R}(h_j) + \sqrt{\frac{\ln k + \ln(1/\delta)}{2n}} \qquad \forall j \in \{1, \ldots, k\}.$$

Remarks.
◮ We pick $(h_1, \ldots, h_k)$ without seeing data!
◮ This is how all our generalization guarantees will go: we prove a guarantee on all possible things the algorithm can output, and thus avoid the issue of "$h$ depends on data". This is called "uniform deviations" or a "uniform law of large numbers".
◮ For this approach to work, we must build the tightest possible estimate of what the algorithm considers (on particular data).
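To get a feel for how cheap the union over k predictors is, here is a small sketch (arbitrary n and δ) evaluating the deviation width as k grows:

    # Evaluate the finite-class deviation term sqrt((ln k + ln(1/delta)) / (2 n)):
    # controlling k predictors simultaneously costs only an extra sqrt(ln k) factor.
    import math

    def finite_class_width(n, k, delta):
        return math.sqrt((math.log(k) + math.log(1.0 / delta)) / (2 * n))

    n, delta = 2000, 0.05
    for k in (1, 10, 1000, 10**6):
        print(f"k={k:>8}  width={finite_class_width(n, k, delta):.4f}")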
Proof of finite class bound

Proof.

Fix any $h_j$ and some confidence level $\delta_j > 0$. Define a failure event
$$F_j := \left[ R(h_j) > \widehat{R}(h_j) + \epsilon_j \right] \quad \text{where} \quad \epsilon_j := \sqrt{\frac{\ln(1/\delta_j)}{2n}}.$$

By Hoeffding's inequality (which requires independence!), $\Pr(F_j) \leq \delta_j$.

The events $(F_1, \ldots, F_k)$ are not independent, but it doesn't matter:
$$\Pr(\forall j \ \neg F_j) = 1 - \Pr(\exists j \ F_j) \geq 1 - \sum_{j=1}^k \Pr(F_j) \geq 1 - \sum_{j=1}^k \delta_j.$$

To finish the proof, set $\delta_j = \delta / k$. $\square$

We can also prove the first (abstract) Theorem, and then plug in the random variables $Z_{i,j} := \mathbf{1}[h_j(X_i) \neq Y_i]$.
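The union-bound step can also be checked by simulation; a sketch (assuming NumPy) where Bernoulli losses with arbitrary means stand in for the true risks and each predictor gets budget δ/k:

    # Monte Carlo check of the union-bound argument: with delta_j = delta / k, every
    # predictor gets the same epsilon_j = sqrt(ln(k/delta)/(2n)); the chance that ANY
    # of them deviates by more than epsilon_j should stay below delta.
    import numpy as np

    rng = np.random.default_rng(0)
    n, k, delta, trials = 200, 20, 0.05, 5000
    p = rng.uniform(0.1, 0.9, size=k)               # "true risks", one per predictor
    eps = np.sqrt(np.log(k / delta) / (2 * n))      # epsilon_j with delta_j = delta / k

    losses = rng.random(size=(trials, n, k)) < p    # Z_{i,j} ~ Bernoulli(p_j)
    emp = losses.mean(axis=1)                       # empirical risks, shape (trials, k)
    any_fail = np.any(p - emp > eps, axis=1)        # did any predictor's bound fail?
    print(f"empirical failure probability = {any_fail.mean():.4f}  (target <= {delta})")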
Picture behind proof

Each predictor $h_j$ had a failure event $F_j := [R(h_j) > \widehat{R}(h_j) + \epsilon_j]$.

Concretely, $F_j$ is a subset of all possible $((X_i, Y_i))_{i=1}^n$.

Some other $h_l$ might have a different failure event $F_l$! (But $F_l$ is still a subset of the same sample space!)

[Figure: failure events $F_1, \ldots, F_5$ drawn as small regions inside the sample space.]

Looking ahead: for infinitely many predictors, the picture still works if failure events overlap!
Finite class bound — summary

Theorem. Let predictors $(h_1, \ldots, h_k)$ be given. With probability $\geq 1 - \delta$ over an IID draw $((X_i, Y_i))_{i=1}^n$,
$$R(h_j) \leq \widehat{R}(h_j) + \sqrt{\frac{\ln k + \ln(1/\delta)}{2n}} \qquad \forall j.$$

Remarks.
◮ If we choose $(h_1, \ldots, h_k)$ before seeing $((X_i, Y_i))_{i=1}^n$, we can use this bound.
◮ Example: train $k$ classifiers, pick the best on a validation set! (See the sketch after this list.)
◮ This approach ("produce a bound for all possible algorithm outputs") may seem sloppy, but it's the best we have!
◮ Letting $\mathcal{F} = (h_1, \ldots, h_k)$ denote our set of predictors, the bound is: with probability $\geq 1 - \delta$, every $f \in \mathcal{F}$ satisfies
$$R(f) \leq \widehat{R}(f) + \sqrt{\frac{\ln|\mathcal{F}| + \ln(1/\delta)}{2n}}.$$
In the next sections, we'll handle $|\mathcal{F}| = \infty$ by replacing $\ln|\mathcal{F}|$ with $\mathrm{complexity}(\mathcal{F})$, whose meaning will vary.
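Here is how the validation-set example might look in code; a sketch assuming scikit-learn, with illustrative models (logistic regressions at a few C values) and synthetic data. The k candidates are fixed before the validation data is examined, so the finite-class bound applies to their validation errors:

    # Train k classifiers on one split, pick the best on a held-out validation set,
    # and attach the finite-class deviation term to its validation error.
    import math
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
    X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=1000, random_state=0)

    Cs = [0.01, 0.1, 1.0, 10.0, 100.0]               # k = 5 candidate models
    models = [LogisticRegression(C=C, max_iter=2000).fit(X_fit, y_fit) for C in Cs]

    n, k, delta = len(y_val), len(models), 0.05
    slack = math.sqrt((math.log(k) + math.log(1 / delta)) / (2 * n))
    val_errs = [np.mean(m.predict(X_val) != y_val) for m in models]
    best = int(np.argmin(val_errs))
    print(f"picked C={Cs[best]}, validation error {val_errs[best]:.3f}; "
          f"with probability >= {1 - delta}, true risk <= {val_errs[best] + slack:.3f}")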
12. VC Dimension