generalization bounds and stability

Generalization Bounds and Stability Lorenzo Rosasco Tomaso Poggio - PowerPoint PPT Presentation



  1. Generalization Bounds and Stability. Lorenzo Rosasco, Tomaso Poggio. 9.520 Class 6, February 23, 2011. L. Rosasco / T. Poggio, Generalization and Stability

  2. About this class. Goal: to recall the notion of generalization bounds and show how they can be derived from a stability argument.

  3. Plan: Generalization Bounds; Stability; Generalization Bounds Using Stability.

  4. Learning Algorithms. A learning algorithm $A$ is a map $S \mapsto f_S$, where $S = ((x_1, y_1), \dots, (x_n, y_n))$. We assume that $A$ is deterministic and that $A$ does not depend on the ordering of the points in the training set. How can we measure the quality of $f_S$?

  5. Error Risks. Recall that we've defined the expected risk: $$I[f_S] = \mathbb{E}_{(x,y)}\left[V(f_S(x), y)\right] = \int V(f_S(x), y)\, d\mu(x, y)$$ and the empirical risk: $$I_S[f_S] = \frac{1}{n} \sum_{i=1}^{n} V(f_S(x_i), y_i).$$ Note: we will denote the loss function as $V(f, z)$ or as $V(f(x), y)$, where $z = (x, y)$. For example: $\mathbb{E}_z\left[V(f_S, z)\right] = \mathbb{E}_{(x,y)}\left[V(f_S(x), y)\right]$.
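The two risks can be compared numerically. The following is a minimal sketch, not part of the original slides: the distribution in `draw_sample`, the algorithm `fit_slope`, and the use of the square loss are all assumptions chosen for illustration, and the expected risk is only approximated by a large fresh sample (Monte Carlo).

```python
import random


def square_loss(prediction, y):
    """V(f(x), y) for the square loss (an assumed choice of loss)."""
    return (prediction - y) ** 2


def fit_slope(sample):
    """Hypothetical algorithm A: least-squares line through the origin."""
    w = sum(x * y for x, y in sample) / sum(x * x for x, _ in sample)
    return lambda x: w * x


def empirical_risk(f, sample, loss=square_loss):
    """I_S[f]: average loss of f over the points in `sample`."""
    return sum(loss(f(x), y) for x, y in sample) / len(sample)


def draw_sample(n, rng):
    """Assumed distribution mu: x ~ U[0.5, 1.5], y = 2x + N(0, 0.1^2) noise."""
    sample = []
    for _ in range(n):
        x = rng.uniform(0.5, 1.5)
        sample.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))
    return sample


rng = random.Random(0)
S = draw_sample(20, rng)
f_S = fit_slope(S)

I_S = empirical_risk(f_S, S)
# Monte Carlo proxy for the expected risk I[f_S], using a large fresh sample.
I_mc = empirical_risk(f_S, draw_sample(100_000, rng))
print(f"I_S[f_S] = {I_S:.5f}, I[f_S] ~ {I_mc:.5f}, defect ~ {I_mc - I_S:.5f}")
```

In practice $\mu$ is unknown, so the Monte Carlo step is unavailable; that gap is what a generalization bound is meant to cover.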

  6. Generalization Bounds. Goal: choose $A$ so that $I[f_S]$ is small. The difficulty: $I[f_S]$ depends on the unknown probability distribution. Approach: we can measure $I_S[f_S]$. A generalization bound is a (probabilistic) bound on the defect (generalization error) $$D[f_S] = I[f_S] - I_S[f_S].$$ If we can bound the defect and we can observe that $I_S[f_S]$ is small, then $I[f_S]$ is likely to be small.

  7. Properties of Generalization Bounds. A probabilistic bound takes the form $$P\left(I[f_S] - I_S[f_S] \ge \epsilon\right) \le \delta$$ or, equivalently, with confidence $1 - \delta$, $$I[f_S] - I_S[f_S] \le \epsilon.$$

  8. Properties of Generalization Bounds (cont.). Complexity: a historical approach to generalization bounds is based on controlling the complexity of the hypothesis space (covering numbers, VC-dimension, Rademacher complexities).

  9. Necessary and Sufficient Conditions for Learning. [Diagram relating, for ERM: Consistency, Generalization, UGC, and Finite Complexity.] ERM = Empirical Risk Minimization; UGC = Uniform Glivenko-Cantelli.

  10. Generalization Bounds By Stability. Stability: as we saw in class 2, the basic idea of stability is that a good algorithm should not change its solution much if we modify the training set slightly.

  11. Necessary and Sufficient Conditions for Learning (cont.). [Diagram relating Consistency, ERM, Generalization, UGC, Finite Complexity, and Stability.] ERM = Empirical Risk Minimization; UGC = Uniform Glivenko-Cantelli.

  12. Regularization, Stability and Generalization. We explain this approach to generalization bounds here; the next class shows how to apply it to Tikhonov regularization. Note that we will consider a stronger notion of stability than the one discussed in class 2. Tikhonov regularization satisfies this stronger notion of stability.

  13. Uniform Stability. Notation: $S$ is the training set; $S^{i,z}$ is the training set obtained by replacing the $i$-th example in $S$ with a new point $z = (x, y)$. Definition: we say that an algorithm $A$ has uniform stability $\beta$ (is $\beta$-stable) if $$\forall (S, z) \in Z^{n+1},\ \forall i, \quad \sup_{z' \in Z} \left| V(f_S, z') - V(f_{S^{i,z}}, z') \right| \le \beta.$$
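The definition can be probed numerically for a concrete algorithm. This is a sketch under assumptions, not the lecture's method: `ridge_1d` is a hypothetical one-dimensional regularized least-squares algorithm, and the training set, replacement point, and probe grid are arbitrary. Because it checks one training set and a finite grid of $z'$, the value it returns is only a lower estimate of the true $\beta$, which takes a supremum over all $S$, $z$, and $z'$.

```python
def ridge_1d(S, lam=1.0):
    """Hypothetical algorithm A: minimize (1/n) sum (w*x_i - y_i)^2 + lam * w^2."""
    n = len(S)
    w = sum(x * y for x, y in S) / (sum(x * x for x, _ in S) + lam * n)
    return lambda x: w * x


def square_loss(f, z):
    """V(f, z) with z = (x, y), for the square loss."""
    x, y = z
    return (f(x) - y) ** 2


def replace_point(S, i, z):
    """S^{i,z}: a copy of S with the i-th example replaced by z."""
    return S[:i] + [z] + S[i + 1:]


def max_loss_change(algorithm, S, i, z, probes, loss=square_loss):
    """Largest |V(f_S, z') - V(f_{S^{i,z}}, z')| over the probe points z'."""
    f_S = algorithm(S)
    f_Siz = algorithm(replace_point(S, i, z))
    return max(abs(loss(f_S, zp) - loss(f_Siz, zp)) for zp in probes)


# Arbitrary illustrative data: points on the line y = 2x.
S = [(k / 10, 2 * k / 10) for k in range(1, 11)]
# Finite grid of probe points z' = (x, y) standing in for the sup over Z.
probes = [(x / 4, y / 4) for x in range(-4, 5) for y in range(-8, 9)]

change = max(
    max_loss_change(ridge_1d, S, i, (1.0, -2.0), probes) for i in range(len(S))
)
print(f"largest observed loss change: {change:.4f}")
```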

  14. Uniform Stability (cont.). Uniform stability is a strong requirement: a solution has to change very little even when a very unlikely ("bad") training set is drawn. The coefficient $\beta$ is a function of $n$, and should perhaps be written $\beta_n$.

  15. Stability and Concentration Inequalities. Given that an algorithm $A$ has stability $\beta$, how can we get bounds on its performance? $\Longrightarrow$ Concentration inequalities, in particular McDiarmid's Inequality. Concentration inequalities show how a random variable is concentrated around its mean.

  16. McDiarmid's Inequality. Let $V_1, \dots, V_n$ be independent random variables. If a function $F$ mapping $V_1, \dots, V_n$ to $\mathbb{R}$ satisfies $$\sup_{v_1, \dots, v_n, v_i'} \left| F(v_1, \dots, v_n) - F(v_1, \dots, v_{i-1}, v_i', v_{i+1}, \dots, v_n) \right| \le c_i,$$ then the following statement holds: $$P\left( \left| F(v_1, \dots, v_n) - \mathbb{E}\left(F(v_1, \dots, v_n)\right) \right| > \epsilon \right) \le 2 \exp\left( -\frac{2\epsilon^2}{\sum_{i=1}^{n} c_i^2} \right).$$

  18. Example: Hoeffding's Inequality. Suppose each $v_i \in [a, b]$, and we define $F(v_1, \dots, v_n) = \frac{1}{n} \sum_{i=1}^{n} v_i$, the average of the $v_i$. Then $c_i = \frac{1}{n}(b - a)$. Applying McDiarmid's Inequality, we have that $$P\left( |F(v) - \mathbb{E}(F(v))| > \epsilon \right) \le 2 \exp\left( -\frac{2\epsilon^2}{\sum_{i=1}^{n} c_i^2} \right) = 2 \exp\left( -\frac{2\epsilon^2}{\sum_{i=1}^{n} \left(\frac{1}{n}(b - a)\right)^2} \right) = 2 \exp\left( -\frac{2 n \epsilon^2}{(b - a)^2} \right).$$
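A quick simulation can sanity-check the Hoeffding bound. The sketch below assumes i.i.d. uniform draws on $[0, 1]$ (so $a = 0$, $b = 1$, and the mean is $1/2$); the values of `n`, `eps`, and the number of trials are arbitrary illustrative choices.

```python
import math
import random


def hoeffding_bound(n, eps, a=0.0, b=1.0):
    """Right-hand side 2 * exp(-2 n eps^2 / (b - a)^2)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2 / (b - a) ** 2)


def deviation_frequency(n, eps, trials, rng):
    """Fraction of trials where the mean of n U[0,1] draws is more than eps from 1/2."""
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            hits += 1
    return hits / trials


rng = random.Random(0)
n, eps = 50, 0.1
freq = deviation_frequency(n, eps, trials=20_000, rng=rng)
bound = hoeffding_bound(n, eps)
print(f"observed frequency = {freq:.4f}, Hoeffding bound = {bound:.4f}")
```

The observed frequency is typically far below the bound, which holds for every distribution supported on $[a, b]$ and is therefore loose for any particular one.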

  19. Generalization Bounds via McDiarmid's Inequality. We will use $\beta$-stability to apply McDiarmid's inequality to the defect $D[f_S] = I[f_S] - I_S[f_S]$. Two steps: (1) bound the expectation of the defect; (2) bound how much the defect can change when we replace an example.

  20. Bounding The Expectation of The Defect. Note that $\mathbb{E}_S = \mathbb{E}_{(z_1, \dots, z_n)}$. Then $$\mathbb{E}_S D[f_S] = \mathbb{E}_S\left[ I[f_S] - I_S[f_S] \right] = \mathbb{E}_{(S,z)}\left[ V(f_S, z) - \frac{1}{n} \sum_{i=1}^{n} V(f_S, z_i) \right] = \mathbb{E}_{(S,z)}\left[ V(f_S, z) - \frac{1}{n} \sum_{i=1}^{n} V(f_{S^{i,z}}, z) \right] \le \beta.$$ The last equality follows by the "symmetry" of the expectation: the expected loss, at a training point, of the function learned from a training set doesn't change when we "rename" the points, so $\mathbb{E}_{(S,z)} V(f_S, z_i) = \mathbb{E}_{(S,z)} V(f_{S^{i,z}}, z)$. The final inequality is $\beta$-stability applied to each term of the sum.

  21. Bounding The Deviation of The Defect. Assume that the loss is nonnegative and bounded above by $M$. Then $$\left| D[f_S] - D[f_{S^{i,z}}] \right| = \left| I[f_S] - I_S[f_S] - I[f_{S^{i,z}}] + I_{S^{i,z}}[f_{S^{i,z}}] \right| \le \left| I[f_S] - I[f_{S^{i,z}}] \right| + \left| I_S[f_S] - I_{S^{i,z}}[f_{S^{i,z}}] \right|$$ $$\le \beta + \frac{1}{n} \left| V(f_S, z_i) - V(f_{S^{i,z}}, z) \right| + \frac{1}{n} \sum_{j \ne i} \left| V(f_S, z_j) - V(f_{S^{i,z}}, z_j) \right| \le \beta + \frac{M}{n} + \beta = 2\beta + \frac{M}{n}.$$

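  22. Applying McDiarmid's Inequality. The deviation bound gives $c_i = 2\beta + \frac{M}{n} \le 2\left(\beta + \frac{M}{n}\right)$. By McDiarmid's Inequality, for any $\epsilon$, $$P\left( \left| D[f_S] - \mathbb{E} D[f_S] \right| > \epsilon \right) \le 2 \exp\left( -\frac{2\epsilon^2}{\sum_{i=1}^{n} \left( 2\left(\beta + \frac{M}{n}\right) \right)^2} \right) = 2 \exp\left( -\frac{\epsilon^2}{2n\left(\beta + \frac{M}{n}\right)^2} \right) = 2 \exp\left( -\frac{n \epsilon^2}{2 (n\beta + M)^2} \right).$$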

  23. A Different Form Of The Bound. Let $$\delta \equiv 2 \exp\left( -\frac{n \epsilon^2}{2 (n\beta + M)^2} \right).$$ Solving for $\epsilon$ in terms of $\delta$, we find that $$\epsilon = (n\beta + M) \sqrt{\frac{2 \ln(2/\delta)}{n}}.$$ We can say that with confidence $1 - \delta$, $$D[f_S] \le \mathbb{E} D[f_S] + (n\beta + M) \sqrt{\frac{2 \ln(2/\delta)}{n}}.$$ But $\mathbb{E} D[f_S] \le \beta$...
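For completeness, the algebra behind "solving for $\epsilon$" can be written out step by step; this is a worked version of the slide's own computation, with no new assumptions.

```latex
\begin{align*}
\delta &= 2 \exp\left( -\frac{n \epsilon^2}{2 (n\beta + M)^2} \right) \\
\ln\frac{2}{\delta} &= \frac{n \epsilon^2}{2 (n\beta + M)^2} \\
\epsilon^2 &= \frac{2 (n\beta + M)^2 \ln(2/\delta)}{n} \\
\epsilon &= (n\beta + M) \sqrt{\frac{2 \ln(2/\delta)}{n}}
\end{align*}
```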

  25. A Different Form Of The Bound (cont.). Finally, recalling the definition of the defect, we have with confidence $1 - \delta$, $$I[f_S] \le I_S[f_S] + \beta + (n\beta + M) \sqrt{\frac{2 \ln(2/\delta)}{n}}.$$

  26. Convergence. Note that if $\beta = \frac{k}{n}$ for some $k$, we can restate our bounds as $$P\left( \left| I[f_S] - I_S[f_S] \right| \ge \frac{k}{n} + \epsilon \right) \le 2 \exp\left( -\frac{n \epsilon^2}{2 (k + M)^2} \right),$$ and with probability $1 - \delta$, $$I[f_S] \le I_S[f_S] + \frac{k}{n} + (k + M) \sqrt{\frac{2 \ln(2/\delta)}{n}}.$$

  27. Fast Convergence. For the uniform stability approach we've described, $\beta = \frac{k}{n}$ (for some constant $k$) is "good enough". Obviously, the best possible stability would be $\beta = 0$: the function can't change at all when you change the training set. An algorithm that always picks the same function, regardless of its training set, is maximally stable and has $\beta = 0$. Using $\beta = 0$ in the last bound, with probability $1 - \delta$, $$I[f_S] \le I_S[f_S] + M \sqrt{\frac{2 \ln(2/\delta)}{n}}.$$ The convergence is still $O\left(\frac{1}{\sqrt{n}}\right)$. So once $\beta = O\left(\frac{1}{n}\right)$, further increases in stability don't change the rate of convergence.
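The rate claim can be checked numerically by evaluating the gap $\beta + (n\beta + M)\sqrt{2\ln(2/\delta)/n}$ from the bound above for $\beta = k/n$ and for $\beta = 0$. This is an illustrative sketch; the function name `stability_bound_gap` and the values of `M`, `delta`, and `k` are assumptions, not taken from the slides.

```python
import math


def stability_bound_gap(beta, M, n, delta):
    """beta + (n*beta + M) * sqrt(2 ln(2/delta) / n): bound on I[f_S] - I_S[f_S]."""
    return beta + (n * beta + M) * math.sqrt(2.0 * math.log(2.0 / delta) / n)


M, delta, k = 1.0, 0.05, 1.0  # arbitrary illustrative values
for n in (100, 10_000, 1_000_000):
    gap_k = stability_bound_gap(k / n, M, n, delta)  # beta = k / n
    gap_0 = stability_bound_gap(0.0, M, n, delta)    # beta = 0
    # Rescaling by sqrt(n) exposes the common O(1 / sqrt(n)) rate.
    print(n, round(gap_k * math.sqrt(n), 4), round(gap_0 * math.sqrt(n), 4))
```

After rescaling by $\sqrt{n}$, both columns level off at constants (proportional to $k + M$ and $M$ respectively), so the two cases differ in the constant but not in the rate.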
