Generalization Bounds and Stability
Lorenzo Rosasco, Tomaso Poggio
9.520 Class 6, February 23, 2011
L. Rosasco / T. Poggio, Generalization and Stability
About this class

Goal: to recall the notion of generalization bounds and show how they can be derived from a stability argument.
Plan

- Generalization Bounds
- Stability
- Generalization Bounds Using Stability
Learning Algorithms

A learning algorithm A is a map S \mapsto f_S, where S = (x_1, y_1), \dots, (x_n, y_n). We assume that:
- A is deterministic,
- A does not depend on the ordering of the points in the training set.

How can we measure the quality of f_S?
Risks

Recall that we've defined the expected risk:

  I[f_S] = E_{(x,y)}[V(f_S(x), y)] = \int V(f_S(x), y) \, d\mu(x, y)

and the empirical risk:

  I_S[f_S] = \frac{1}{n} \sum_{i=1}^{n} V(f_S(x_i), y_i).

Note: we will denote the loss function as V(f, z) or as V(f(x), y), where z = (x, y). For example, E_z[V(f, z)] = E_{(x,y)}[V(f(x), y)].
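As a concrete illustration (not from the slides), the empirical risk is just an average of losses over the training set. The predictor, data, and choice of square loss below are hypothetical stand-ins:

```python
# Empirical risk I_S[f_S] = (1/n) * sum_i V(f_S(x_i), y_i),
# illustrated with the square loss V(f(x), y) = (f(x) - y)^2.

def square_loss(prediction, y):
    return (prediction - y) ** 2

def empirical_risk(f, S, loss=square_loss):
    """Average loss of f over the training set S = [(x_1, y_1), ...]."""
    return sum(loss(f(x), y) for x, y in S) / len(S)

# A hypothetical predictor and training set.
f = lambda x: 2.0 * x
S = [(0.0, 0.0), (1.0, 2.0), (2.0, 3.0)]
print(empirical_risk(f, S))  # losses are 0, 0, 1, so the average is 1/3
```

The expected risk I[f_S], by contrast, integrates over the unknown distribution and cannot be computed from data alone.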
Generalization Bounds

Goal: choose A so that I[f_S] is small. Problem: I[f_S] depends on the unknown probability distribution.

Approach: we can measure I_S[f_S]. A generalization bound is a (probabilistic) bound on the defect (generalization error)

  D[f_S] = I[f_S] - I_S[f_S].

If we can bound the defect and we observe that I_S[f_S] is small, then I[f_S] is likely to be small.
Properties of Generalization Bounds

A probabilistic bound takes the form

  P(I[f_S] - I_S[f_S] \ge \epsilon) \le \delta,

or equivalently, with confidence 1 - \delta,

  I[f_S] - I_S[f_S] \le \epsilon.
Properties of Generalization Bounds (cont.)

Complexity: a historical approach to generalization bounds is based on controlling the complexity of the hypothesis space (covering numbers, VC dimension, Rademacher complexities).
Necessary and Sufficient Conditions for Learning

(Diagram.) For ERM (Empirical Risk Minimization), the following are equivalent: consistency, generalization, the Uniform Glivenko-Cantelli (UGC) property, and finite complexity.
Generalization Bounds by Stability

Stability: as we saw in class 2, the basic idea of stability is that a good algorithm should not change its solution much if we modify the training set slightly.
Necessary and Sufficient Conditions for Learning (cont.)

(Diagram.) For ERM (Empirical Risk Minimization), consistency, generalization, the Uniform Glivenko-Cantelli (UGC) property, finite complexity, and stability are all equivalent.
Regularization, Stability and Generalization

We explain this approach to generalization bounds, and show how to apply it to Tikhonov regularization, in the next class. Note that we will consider a stronger notion of stability than the one discussed in class 2. Tikhonov regularization satisfies this stronger notion of stability.
Uniform Stability

Notation: S is a training set; S^{i,z} is the training set obtained by replacing the i-th example in S with a new point z = (x, y).

Definition: an algorithm A has uniform stability \beta (is \beta-stable) if

  \forall (S, z) \in Z^{n+1}, \forall i, \quad \sup_{z' \in Z} |V(f_S, z') - V(f_{S^{i,z}}, z')| \le \beta.
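A rough numerical sketch of the definition (my own illustration, not from the slides): for the toy algorithm that outputs the constant predictor f_S(x) = mean of the labels, swapping one example changes the mean by at most |y_i - y'|/n, so with the absolute-value loss the loss change at any test point is bounded by roughly 1/n. The data, replacement point, and grid of test points below are all arbitrary:

```python
# Numerically probing uniform stability for a toy algorithm:
# A(S) returns the constant predictor f_S(x) = mean of the labels.
# With the absolute loss V(f, (x, y)) = |f(x) - y|, replacing one
# example changes every loss value by at most |y_i - y'| / n.

def train(S):
    m = sum(y for _, y in S) / len(S)
    return lambda x: m

def loss(f, z):
    x, y = z
    return abs(f(x) - y)

def replace(S, i, z_new):
    return S[:i] + [z_new] + S[i + 1:]

S = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (3.0, 1.0)]
z_new = (5.0, 0.0)                       # hypothetical replacement point
test_points = [(t / 10.0, 0.3) for t in range(-20, 21)]

beta_hat = 0.0
for i in range(len(S)):
    f_S, f_Si = train(S), train(replace(S, i, z_new))
    for zp in test_points:
        beta_hat = max(beta_hat, abs(loss(f_S, zp) - loss(f_Si, zp)))

# The largest change |y_i - 0| / n over this S is 1/4.
print(beta_hat)
```

This only probes the supremum over a finite grid, so it gives a lower estimate of the true \beta; the point is the 1/n scaling with the sample size.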
Uniform Stability (cont.)

Uniform stability is a strong requirement: the solution has to change very little even when a very unlikely ("bad") training set is drawn. The coefficient \beta is a function of n, and should perhaps be written \beta_n.
Stability and Concentration Inequalities

Given that an algorithm A has stability \beta, how can we get bounds on its performance? Via concentration inequalities, in particular McDiarmid's inequality. Concentration inequalities show how a random variable is concentrated around its mean.
McDiarmid's Inequality

Let V_1, \dots, V_n be random variables. If a function F mapping (V_1, \dots, V_n) to R satisfies

  \sup_{v_1, \dots, v_n, v'_i} |F(v_1, \dots, v_n) - F(v_1, \dots, v_{i-1}, v'_i, v_{i+1}, \dots, v_n)| \le c_i,

then the following statement holds:

  P(|F(v_1, \dots, v_n) - E[F(v_1, \dots, v_n)]| > \epsilon) \le 2 \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{n} c_i^2}\right).
Example: Hoeffding's Inequality

Suppose each v_i \in [a, b], and define F(v_1, \dots, v_n) = \frac{1}{n}\sum_{i=1}^{n} v_i, the average of the v_i. Then c_i = \frac{b - a}{n}. Applying McDiarmid's inequality, we have that

  P(|F(v) - E[F(v)]| > \epsilon) \le 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{n} c_i^2}\right) = 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{n} \left(\frac{b-a}{n}\right)^2}\right) = 2\exp\left(-\frac{2n\epsilon^2}{(b - a)^2}\right).
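A quick Monte Carlo sanity check (my own sketch, with arbitrary choices of n, \epsilon, and the Uniform[0, 1] distribution) confirms that the observed deviation frequency stays below Hoeffding's bound:

```python
import math
import random

# Monte Carlo check of Hoeffding's inequality for v_i ~ Uniform[0, 1]:
# P(|mean - E[mean]| > eps) <= 2 * exp(-2 * n * eps^2 / (b - a)^2).

random.seed(0)
n, eps, trials = 50, 0.15, 20000
a, b, true_mean = 0.0, 1.0, 0.5

deviations = 0
for _ in range(trials):
    avg = sum(random.uniform(a, b) for _ in range(n)) / n
    if abs(avg - true_mean) > eps:
        deviations += 1

empirical = deviations / trials
bound = 2 * math.exp(-2 * n * eps ** 2 / (b - a) ** 2)
print(empirical, bound)   # the observed frequency should be below the bound
```

For these values the bound is around 0.21, while the observed frequency is far smaller, illustrating that Hoeffding's inequality is valid but often loose.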
Generalization Bounds via McDiarmid's Inequality

We will use \beta-stability to apply McDiarmid's inequality to the defect D[f_S] = I[f_S] - I_S[f_S]. Two steps:
1. bound the expectation of the defect;
2. bound how much the defect can change when we replace an example.
Bounding the Expectation of the Defect

Note that E_S = E_{(z_1, \dots, z_n)}. Then

  E_S[D[f_S]] = E_S[I[f_S] - I_S[f_S]]
  = E_{(S,z)}\left[V(f_S, z) - \frac{1}{n}\sum_{i=1}^{n} V(f_S, z_i)\right]
  = E_{(S,z)}\left[\frac{1}{n}\sum_{i=1}^{n} \left(V(f_S, z) - V(f_{S^{i,z}}, z)\right)\right]
  \le \beta.

The second equality follows by the "symmetry" of the expectation: the expected loss of a training set on one of its training points does not change when we "rename" the points, so E_S[V(f_S, z_i)] = E_{(S,z)}[V(f_{S^{i,z}}, z)]. The last step applies \beta-stability to each term of the sum.
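A small simulation (my own sketch, not from the slides) of this expectation bound for a toy algorithm: the constant-mean predictor with absolute loss and labels in [0, 1] has uniform stability \beta = 1/n (replacing one label moves the mean, and hence every loss value, by at most 1/n), and the averaged defect indeed stays below \beta:

```python
import random

# Monte Carlo check that E_S[ I[f_S] - I_S[f_S] ] <= beta for a
# 1/n-stable toy algorithm: f_S = constant mean of the labels y_i,
# y ~ Uniform[0, 1], loss V(f, (x, y)) = |f - y|.

random.seed(1)
n, trials = 20, 20000

def true_risk(m):
    # E_y |m - y| for y ~ Uniform[0, 1], computed in closed form.
    return (m ** 2 + (1 - m) ** 2) / 2

total_defect = 0.0
for _ in range(trials):
    ys = [random.random() for _ in range(n)]
    m = sum(ys) / n
    emp_risk = sum(abs(m - y) for y in ys) / n
    total_defect += true_risk(m) - emp_risk

avg_defect = total_defect / trials
beta = 1.0 / n
print(avg_defect, beta)   # the averaged defect should not exceed beta
```

The averaged defect is positive and of order 1/n here, since the empirical risk is evaluated at a predictor fitted to the same sample.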
Bounding the Deviation of the Defect

Assume that there exists an upper bound M on the loss. Then

  |D[f_S] - D[f_{S^{i,z}}]|
  = |I[f_S] - I_S[f_S] - I[f_{S^{i,z}}] + I_{S^{i,z}}[f_{S^{i,z}}]|
  \le |I[f_S] - I[f_{S^{i,z}}]| + |I_S[f_S] - I_{S^{i,z}}[f_{S^{i,z}}]|
  \le \beta + \frac{1}{n}|V(f_S, z_i) - V(f_{S^{i,z}}, z)| + \frac{1}{n}\sum_{j \ne i} |V(f_S, z_j) - V(f_{S^{i,z}}, z_j)|
  \le \beta + \frac{M}{n} + \beta = 2\beta + \frac{M}{n}.
Applying McDiarmid's Inequality

By McDiarmid's inequality (using c_i = 2(\beta + M/n) \ge 2\beta + M/n), for any \epsilon,

  P(|D[f_S] - E[D[f_S]]| > \epsilon) \le 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{n} \left(2\left(\beta + \frac{M}{n}\right)\right)^2}\right) = 2\exp\left(-\frac{\epsilon^2}{2n\left(\beta + \frac{M}{n}\right)^2}\right) = 2\exp\left(-\frac{n\epsilon^2}{2(n\beta + M)^2}\right).
A Different Form of the Bound

Let

  \delta \equiv 2\exp\left(-\frac{n\epsilon^2}{2(n\beta + M)^2}\right).

Solving for \epsilon in terms of \delta, we find that

  \epsilon = (n\beta + M)\sqrt{\frac{2\ln(2/\delta)}{n}}.

We can say that with confidence 1 - \delta,

  D[f_S] \le E[D[f_S]] + (n\beta + M)\sqrt{\frac{2\ln(2/\delta)}{n}}.

But E[D[f_S]] \le \beta ...
A Different Form of the Bound (cont.)

Finally, recalling the definition of the defect, we have with confidence 1 - \delta,

  I[f_S] \le I_S[f_S] + \beta + (n\beta + M)\sqrt{\frac{2\ln(2/\delta)}{n}}.
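The bound is easy to evaluate numerically. The values of \beta, M, n, \delta, and the empirical risk below are arbitrary illustrations, not from the slides:

```python
import math

def stability_bound(emp_risk, beta, M, n, delta):
    """I[f_S] <= I_S[f_S] + beta + (n*beta + M) * sqrt(2*ln(2/delta)/n),
    holding with probability at least 1 - delta for a beta-stable algorithm
    with loss bounded by M."""
    return emp_risk + beta + (n * beta + M) * math.sqrt(2 * math.log(2 / delta) / n)

# Hypothetical values: beta = 1/n stability, loss bounded by M = 1,
# observed empirical risk 0.10, confidence 95%.
n, delta, M = 10000, 0.05, 1.0
beta = 1.0 / n
print(stability_bound(0.10, beta, M, n, delta))
```

Note that the bound is vacuous unless \beta decays faster than 1/\sqrt{n}, since the deviation term contains the factor n\beta.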
Convergence

Note that if \beta = k/n for some constant k, we can restate our bounds as

  P\left(|I[f_S] - I_S[f_S]| \ge \frac{k}{n} + \epsilon\right) \le 2\exp\left(-\frac{n\epsilon^2}{2(k + M)^2}\right),

and with probability 1 - \delta,

  I[f_S] \le I_S[f_S] + \frac{k}{n} + (k + M)\sqrt{\frac{2\ln(2/\delta)}{n}}.
Fast Convergence

For the uniform stability approach we've described, \beta = k/n (for some constant k) is "good enough". Obviously, the best possible stability would be \beta = 0: the function can't change at all when you change the training set. An algorithm that always picks the same function, regardless of its training set, is maximally stable and has \beta = 0. Using \beta = 0 in the last bound, with probability 1 - \delta,

  I[f_S] \le I_S[f_S] + M\sqrt{\frac{2\ln(2/\delta)}{n}}.

The convergence is still O(1/\sqrt{n}). So once \beta = O(1/n), further increases in stability don't change the rate of convergence.
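To see numerically that extra stability beyond \beta = O(1/n) does not change the rate (a sketch with arbitrary values of k, M, and \delta): both deviation terms shrink like 1/\sqrt{n}, differing only in the constant.

```python
import math

def excess(beta, M, n, delta):
    """Deviation term beta + (n*beta + M) * sqrt(2*ln(2/delta)/n)."""
    return beta + (n * beta + M) * math.sqrt(2 * math.log(2 / delta) / n)

k, M, delta = 1.0, 1.0, 0.05
for n in (100, 10000, 1000000):
    e_kn = excess(k / n, M, n, delta)    # beta = k/n
    e_0 = excess(0.0, M, n, delta)       # maximally stable, beta = 0
    # Both decay like 1/sqrt(n); only the constant differs, and the
    # ratio approaches (k + M) / M as n grows.
    print(n, e_kn, e_0, e_kn / e_0)
```

With k = M = 1, the ratio tends to 2: the maximally stable algorithm gains only a constant factor, not a faster rate.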