1. Class 5: Stability
Carlo Ciliberto
Department of Computer Science, UCL
November 8, 2017

2. Uniform Stability - Notation
Let $\mathcal{Z}$ be a set. For any $S = \{z_1, \ldots, z_n\} \in \mathcal{Z}^n$, any $z \in \mathcal{Z}$ and $i = 1, \ldots, n$, we denote by
$$S^{i,z} = \{z_1, \ldots, z_{i-1}, z, z_{i+1}, \ldots, z_n\} \in \mathcal{Z}^n$$
the set obtained by substituting the $i$-th element of $S$ with $z$.

3. Uniform Stability
We denote input-output pairs as $z = (x, y) \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ and, for any $f : \mathcal{X} \to \mathcal{Y}$, we write $\ell(f, z) = \ell(f(x), y)$. For an algorithm $A$ and any dataset $S = (z_i)_{i=1}^n$ we write $f_S = A(S)$.

Uniform $\beta$-Stability. An algorithm $A$ is $\beta(n)$-stable, with $n \in \mathbb{N}$ and $\beta(n) > 0$, if for all $S \in \mathcal{Z}^n$, $z \in \mathcal{Z}$ and $i = 1, \ldots, n$,
$$\sup_{\bar{z} \in \mathcal{Z}} \big|\ell(f_S, \bar{z}) - \ell(f_{S^{i,z}}, \bar{z})\big| \leq \beta(n).$$
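
As a concrete illustration (not from the slides), consider the toy algorithm that outputs a constant predictor, the regularized mean $f_S = \frac{1}{1+\lambda}\frac{1}{n}\sum_i z_i$, with square loss on $\mathcal{Z} = [0,1]$. The sketch below estimates $\sup_{\bar z}|\ell(f_S,\bar z) - \ell(f_{S^{i,z}},\bar z)|$ by brute force over a grid; all numerical choices are illustrative.

```python
import numpy as np

def A(S, lam=1.0):
    """Toy algorithm: the constant predictor minimizing
    (1/n) * sum_i (c - z_i)^2 + lam * c^2, i.e. the regularized mean."""
    return S.mean() / (1.0 + lam)

def loss(c, z):
    """Square loss of the constant predictor c on the point z."""
    return (c - z) ** 2

rng = np.random.default_rng(0)
n = 100
S = rng.uniform(0, 1, n)          # dataset S = (z_1, ..., z_n), Z = [0, 1]
f_S = A(S)

beta_hat = 0.0
zbar = np.linspace(0, 1, 1000)    # probe grid approximating the sup over Z
for i in range(n):
    for z in (0.0, 1.0):          # extreme replacement values
        S_iz = S.copy()
        S_iz[i] = z               # form S^{i,z} and refit
        f_Siz = A(S_iz)
        beta_hat = max(beta_hat, np.abs(loss(f_S, zbar) - loss(f_Siz, zbar)).max())

print(f"empirical stability estimate: {beta_hat:.5f}  (decays like 1/n)")
```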

4. Stability and Generalization Error
Theorem. Let $A$ be a uniformly $\beta(n)$-stable algorithm. For any dataset $S \in \mathcal{Z}^n$ denote $f_S = A(S)$. Then
$$\big|\mathbb{E}_{S \sim \rho^n}[\mathcal{E}(f_S) - \mathcal{E}_S(f_S)]\big| \leq \beta(n),$$
where $S \sim \rho^n$ denotes a random dataset of $n$ points sampled independently from $\rho$, $\mathcal{E}(f) = \mathbb{E}_{z \sim \rho}[\ell(f, z)]$ is the expected risk and $\mathcal{E}_S(f) = \frac{1}{n}\sum_{i=1}^n \ell(f, z_i)$ is the empirical risk on $S$.

The above result shows that the uniform stability of an algorithm allows us to directly control its generalization error. Note that this result relies only on properties of the learning algorithm and does not require any knowledge of the complexity of the hypothesis space (although the two are indirectly related).

5. Stability and Generalization Error (Continued)
We begin by providing alternative formulations for:

1) The expectation of the empirical risk $\mathbb{E}_S[\mathcal{E}_S(f_S)]$:
$$\mathbb{E}_S[\mathcal{E}_S(f_S)] = \mathbb{E}_S\Big[\frac{1}{n}\sum_{i=1}^n \ell(f_S, z_i)\Big] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_S[\ell(f_S, z_i)] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_S\,\mathbb{E}_{z_i'}[\ell(f_S, z_i)]$$
$$= \frac{1}{n}\sum_{i=1}^n \mathbb{E}_S\,\mathbb{E}_{z_i'}[\ell(f_{S^{i,z_i'}}, z_i')] = \mathbb{E}_S\,\mathbb{E}_{S'}\Big[\frac{1}{n}\sum_{i=1}^n \ell(f_{S^{i,z_i'}}, z_i')\Big],$$
where the key step exchanges the roles of $z_i$ and $z_i'$, which leaves the joint distribution unchanged.

2) The expected risk $\mathcal{E}(f_S)$:
$$\mathcal{E}(f_S) = \mathbb{E}_{z'}[\ell(f_S, z')] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{z_i'}[\ell(f_S, z_i')] = \mathbb{E}_{S'}\Big[\frac{1}{n}\sum_{i=1}^n \ell(f_S, z_i')\Big],$$
where $S' = (z_i')_{i=1}^n$ is a second dataset sampled independently from $\rho$.

6. Stability and Generalization Error (Continued)
Putting the two together:
$$\big|\mathbb{E}_S[\mathcal{E}(f_S) - \mathcal{E}_S(f_S)]\big| \leq \Big|\mathbb{E}_S\,\mathbb{E}_{S'}\Big[\frac{1}{n}\sum_{i=1}^n \ell(f_S, z_i') - \ell(f_{S^{i,z_i'}}, z_i')\Big]\Big| \leq \mathbb{E}_S\,\mathbb{E}_{S'}\Big[\frac{1}{n}\sum_{i=1}^n \big|\ell(f_{S^{i,z_i'}}, z_i') - \ell(f_S, z_i')\big|\Big] \leq \beta(n).$$
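
The theorem can also be checked by simulation. For the regularized-mean toy algorithm from the earlier sketch, the code below estimates $\mathbb{E}_S[\mathcal{E}(f_S) - \mathcal{E}_S(f_S)]$ by Monte Carlo; the choice $\rho = \mathrm{Uniform}[0,1]$ with square loss is an illustrative assumption.

```python
import numpy as np

def A(S, lam=1.0):
    """Regularized mean: argmin_c (1/n) sum_i (c - z_i)^2 + lam * c^2."""
    return S.mean() / (1.0 + lam)

rng = np.random.default_rng(0)
n, trials = 50, 20_000
gaps = []
for _ in range(trials):
    S = rng.uniform(0, 1, n)
    c = A(S)
    emp_risk = np.mean((c - S) ** 2)
    # Expected risk under rho = Uniform[0,1]: E[(c - z)^2] = c^2 - c + 1/3.
    exp_risk = c**2 - c + 1.0 / 3.0
    gaps.append(exp_risk - emp_risk)

print(f"E_S[E(f_S) - E_S(f_S)] ~= {np.mean(gaps):.5f}  (should be <= beta(n))")
```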

7. Stability of Tikhonov Regularization
In the following we focus on the Tikhonov regularization algorithm $A = A_\lambda$ with $\lambda > 0$. In particular, for any $S \in \mathcal{Z}^n$,
$$A(S) = f_S = \operatorname{argmin}_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \ell(f, z_i) + \lambda\|f\|_\mathcal{H}^2.$$
We will show that when $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS), Tikhonov regularization is $\beta(n)$-stable with
$$\beta(n) = O\Big(\frac{1}{n\lambda}\Big).$$
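
For the special case of the square loss (the slides treat general Lipschitz losses), the Tikhonov problem above has a well-known closed form via the representer theorem: $f_S(\cdot) = \sum_i \alpha_i k(\cdot, x_i)$ with $\alpha = (K + \lambda n I)^{-1} y$. A minimal sketch, where the Gaussian kernel and the data are illustrative choices:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)); note sup_x k(x, x) = 1."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def tikhonov(X, y, lam):
    """f_S = argmin (1/n) sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2.
    Representer theorem: f_S = sum_i alpha_i k(., x_i),
    with alpha = (K + lam * n * I)^(-1) y."""
    n = len(X)
    K = gaussian_kernel(X, X)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

rng = np.random.default_rng(0)
n, lam = 200, 1e-2
X = rng.uniform(-1, 1, (n, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)

alpha = tikhonov(X, y, lam)
X_test = np.linspace(-1, 1, 5)[:, None]
print("predictions:", gaussian_kernel(X_test, X) @ alpha)
```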

8. Error Decomposition for Tikhonov Regularization
Define $f_\lambda = \operatorname{argmin}_{f \in \mathcal{H}} \mathcal{E}(f) + \lambda\|f\|_\mathcal{H}^2$ and decompose the excess risk as
$$\mathcal{E}(f_S) - \mathcal{E}(f^*) = \mathcal{E}(f_S) \pm \mathcal{E}_S(f_S) \pm \mathcal{E}_S(f_\lambda) - \mathcal{E}(f^*) \pm \lambda\|f_\lambda\|_\mathcal{H}^2.$$
Now, since
- $\mathcal{E}(f_S) - \mathcal{E}(f^*) \leq \mathcal{E}(f_S) - \mathcal{E}(f^*) + \lambda\|f_S\|_\mathcal{H}^2$,
- $f_S$ is the minimizer of the regularized empirical risk, so $\mathcal{E}_S(f_S) + \lambda\|f_S\|_\mathcal{H}^2 - \mathcal{E}_S(f_\lambda) - \lambda\|f_\lambda\|_\mathcal{H}^2 \leq 0$,
- $\mathbb{E}_S[\mathcal{E}_S(f_\lambda)] = \mathcal{E}(f_\lambda)$,

we can conclude
$$\mathbb{E}_S[\mathcal{E}(f_S)] - \mathcal{E}(f^*) \leq \mathbb{E}_S[\mathcal{E}(f_S) - \mathcal{E}_S(f_S)] + \mathcal{E}(f_\lambda) - \mathcal{E}(f^*) + \lambda\|f_\lambda\|_\mathcal{H}^2.$$

9. Error Decomposition for Tikhonov Regularization (Continued)
$$\mathbb{E}_S[\mathcal{E}(f_S)] - \mathcal{E}(f^*) \leq \underbrace{\mathbb{E}_S[\mathcal{E}(f_S) - \mathcal{E}_S(f_S)]}_{\text{Generalization Error}} + \underbrace{\mathcal{E}(f_\lambda) - \mathcal{E}(f^*) + \lambda\|f_\lambda\|_\mathcal{H}^2}_{\text{(related to) Interpolation and Approximation Error}}$$
The stability of Tikhonov regularization bounds the first term by $O(1/(n\lambda))$; assuming the interpolation/approximation error is bounded by $\lambda^s$ with $s > 0$ leads to
$$\mathbb{E}_S[\mathcal{E}(f_S)] - \mathcal{E}(f^*) \leq O(1/(n\lambda)) + \lambda^s.$$
We can choose the optimal $\lambda(n)$ and the corresponding (expected) error rate $\epsilon(n)$ as
$$\lambda(n) = O\big(n^{-\frac{1}{s+1}}\big) \qquad \mathbb{E}_S[\mathcal{E}(f_S)] - \mathcal{E}(f^*) \leq O\big(n^{-\frac{s}{s+1}}\big).$$
Note. If $f^* \in \mathcal{H}$, it is easy to show that $s = 1$ and therefore that the expected excess risk goes to zero at least as fast as $O(n^{-1/2})$.
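
The choice of $\lambda(n)$ comes from balancing the two terms, i.e. minimizing $g(\lambda) = \frac{1}{n\lambda} + \lambda^s$ over $\lambda$. A quick symbolic check of this calculus step (a sketch, not from the slides):

```python
import sympy as sp

n, lam, s = sp.symbols("n lambda s", positive=True)
g = 1 / (n * lam) + lam**s

# Candidate from balancing the two terms: lambda* = (n*s)**(-1/(s+1)).
lam_star = (n * s) ** (-1 / (s + 1))
print(sp.simplify(sp.diff(g, lam).subs(lam, lam_star)))   # -> 0 (stationary point)

# Both terms then scale as n^(-s/(s+1)): the product below is free of n.
rate = g.subs(lam, lam_star)
print(sp.simplify(rate * n ** (s / (s + 1))))
```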

10. Stability of Tikhonov Regularization
Let $\mathcal{H}$ be an RKHS with associated kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. We want to show that for any $S \in \mathcal{Z}^n$, $z' \in \mathcal{Z}$ and $i = 1, \ldots, n$,
$$\sup_{z \in \mathcal{Z}} |\ell(f_S, z) - \ell(f_{S^{i,z'}}, z)| \leq \frac{2 L^2 \kappa^2}{n\lambda},$$
where $L > 0$ is the Lipschitz constant of $\ell(\cdot, y)$ (uniformly with respect to $y \in \mathcal{Y}$) and $\kappa^2 = \sup_{x \in \mathcal{X}} k(x, x)$ (written $\kappa$ to distinguish the constant from the kernel $k$).

11. Reproducing Property
Recall the reproducing property of an RKHS $\mathcal{H}$: for all $f \in \mathcal{H}$ and $x \in \mathcal{X}$,
$$f(x) = \langle f, k(\cdot, x)\rangle_\mathcal{H}.$$
In particular, $|f(x)| \leq \sqrt{k(x, x)}\,\|f\|_\mathcal{H}$. Therefore,
$$\sup_{z \in \mathcal{Z}} |\ell(f_S, z) - \ell(f_{S^{i,z'}}, z)| \leq \sup_{x \in \mathcal{X}, y \in \mathcal{Y}} |\ell(f_S(x), y) - \ell(f_{S^{i,z'}}(x), y)| \leq L \sup_{x \in \mathcal{X}} |f_S(x) - f_{S^{i,z'}}(x)| \leq L\kappa\, \|f_S - f_{S^{i,z'}}\|_\mathcal{H}.$$
We need to control $\|f_S - f_{S^{i,z'}}\|_\mathcal{H}$. We will exploit the strong convexity of Tikhonov regularization.
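
A quick numerical sanity check of $|f(x)| \leq \sqrt{k(x, x)}\,\|f\|_\mathcal{H}$ for a function in the span of kernel sections, $f = \sum_i \alpha_i k(\cdot, x_i)$, so that $\|f\|_\mathcal{H}^2 = \alpha^\top K \alpha$ (a sketch; the Gaussian kernel and data are illustrative):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (30, 1))
alpha = rng.standard_normal(30)

K = gaussian_kernel(X, X)
norm_f = np.sqrt(alpha @ K @ alpha)      # ||f||_H for f = sum_i alpha_i k(., x_i)

Xt = np.linspace(-2, 2, 500)[:, None]
f_vals = gaussian_kernel(Xt, X) @ alpha  # f(x) = <f, k(., x)>_H
k_diag = 1.0                             # k(x, x) = 1 for the Gaussian kernel

assert np.all(np.abs(f_vals) <= np.sqrt(k_diag) * norm_f + 1e-12)
print(np.abs(f_vals).max(), "<=", np.sqrt(k_diag) * norm_f)
```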

12. Strong Convexity of $\|\cdot\|_\mathcal{H}^2$
Technical observation. For any $f, g \in \mathcal{H}$ and $\theta \in [0, 1]$ we have
$$\|\theta f + (1-\theta) g\|_\mathcal{H}^2 = \theta^2 \|f\|_\mathcal{H}^2 + (1-\theta)^2 \|g\|_\mathcal{H}^2 + 2\theta(1-\theta)\langle f, g\rangle_\mathcal{H}$$
$$= \theta\big(1 - (1-\theta)\big)\|f\|_\mathcal{H}^2 + (1-\theta)(1-\theta)\|g\|_\mathcal{H}^2 + 2\theta(1-\theta)\langle f, g\rangle_\mathcal{H}$$
$$= \theta\|f\|_\mathcal{H}^2 + (1-\theta)\|g\|_\mathcal{H}^2 - \theta(1-\theta)\big(\|f\|_\mathcal{H}^2 + \|g\|_\mathcal{H}^2 - 2\langle f, g\rangle_\mathcal{H}\big)$$
$$= \theta\|f\|_\mathcal{H}^2 + (1-\theta)\|g\|_\mathcal{H}^2 - \theta(1-\theta)\|f - g\|_\mathcal{H}^2.$$
In particular, for any convex $F' : \mathcal{H} \to \mathbb{R}$, if we denote $F(\cdot) = F'(\cdot) + \lambda\|\cdot\|_\mathcal{H}^2$, we have
$$F(\theta f + (1-\theta) g) \leq \theta F(f) + (1-\theta) F(g) - \lambda\theta(1-\theta)\|f - g\|_\mathcal{H}^2.$$
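
The identity is easy to sanity-check numerically in any inner-product space, e.g. $\mathbb{R}^d$ with the Euclidean inner product (a sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
f, g = rng.standard_normal(5), rng.standard_normal(5)

for theta in np.linspace(0, 1, 11):
    lhs = np.sum((theta * f + (1 - theta) * g) ** 2)
    rhs = (theta * np.sum(f**2) + (1 - theta) * np.sum(g**2)
           - theta * (1 - theta) * np.sum((f - g) ** 2))
    assert np.isclose(lhs, rhs)
print("identity verified on random vectors")
```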

13. Strong Convexity II
Let $\theta = 1/2$. Then we have
$$2 F\Big(\frac{f + g}{2}\Big) \leq F(f) + F(g) - \frac{\lambda}{2}\|f - g\|_\mathcal{H}^2.$$
By subtracting $2F(f)$ from both sides and adding $\frac{\lambda}{2}\|f - g\|_\mathcal{H}^2$ we have
$$\frac{\lambda}{2}\|f - g\|_\mathcal{H}^2 + 2\Big(F\Big(\frac{f + g}{2}\Big) - F(f)\Big) \leq F(g) - F(f).$$
Finally, note that if $f = \operatorname{argmin}_{f' \in \mathcal{H}} F(f')$ we have $F\big(\frac{f + g}{2}\big) - F(f) \geq 0$, and therefore
$$\frac{\lambda}{2}\|f - g\|_\mathcal{H}^2 \leq F(g) - F(f).$$

14. Strong Convexity of Tikhonov Regularization
Now define
- $F_1(\cdot) = \mathcal{E}_S(\cdot) + \lambda\|\cdot\|_\mathcal{H}^2$ and
- $F_2(\cdot) = \mathcal{E}_{S^{i,z'}}(\cdot) + \lambda\|\cdot\|_\mathcal{H}^2$.

Furthermore, to simplify the notation, denote $f_1 = f_S$ and $f_2 = f_{S^{i,z'}}$. Recall that by construction
$$f_1 = \operatorname{argmin}_{f \in \mathcal{H}} F_1(f) \qquad \text{and} \qquad f_2 = \operatorname{argmin}_{f \in \mathcal{H}} F_2(f).$$

15. Strong Convexity of Tikhonov Regularization II
By our previous observation on strong convexity,
$$\frac{\lambda}{2}\|f_1 - f_2\|_\mathcal{H}^2 \leq F_1(f_2) - F_1(f_1) \qquad \text{and} \qquad \frac{\lambda}{2}\|f_1 - f_2\|_\mathcal{H}^2 \leq F_2(f_1) - F_2(f_2).$$
Summing the two inequalities (and rearranging the terms),
$$\lambda\|f_1 - f_2\|_\mathcal{H}^2 \leq F_1(f_2) - F_2(f_2) + F_2(f_1) - F_1(f_1) = \mathcal{E}_S(f_2) - \mathcal{E}_{S^{i,z'}}(f_2) + \mathcal{E}_{S^{i,z'}}(f_1) - \mathcal{E}_S(f_1)$$
$$= \frac{1}{n}\big(\ell(f_2, z_i) - \ell(f_2, z') + \ell(f_1, z') - \ell(f_1, z_i)\big) = \frac{1}{n}\big(\ell(f_2, z_i) - \ell(f_1, z_i) + \ell(f_1, z') - \ell(f_2, z')\big) \leq \frac{2}{n}\sup_{z}|\ell(f_1, z) - \ell(f_2, z)|,$$
where we have used the definitions of $F_1$ and $F_2$ and the fact that the risks $\mathcal{E}_S$ and $\mathcal{E}_{S^{i,z'}}$ differ only in one point: for any function $f : \mathcal{X} \to \mathcal{Y}$ we have $\mathcal{E}_S(f) - \mathcal{E}_{S^{i,z'}}(f) = \frac{1}{n}\big(\ell(f, z_i) - \ell(f, z')\big)$.

16. Stability of Tikhonov Regularization (Continued)
Since $\sup_z |\ell(f_1, z) - \ell(f_2, z)| \leq L\kappa\|f_1 - f_2\|_\mathcal{H}$, we have
$$\lambda\|f_1 - f_2\|_\mathcal{H}^2 \leq \frac{2 L\kappa}{n}\|f_1 - f_2\|_\mathcal{H},$$
which implies
$$\|f_1 - f_2\|_\mathcal{H} \leq \frac{2 L\kappa}{n\lambda},$$
from which we can conclude that
$$\sup_{z \in \mathcal{Z}} |\ell(f_1, z) - \ell(f_2, z)| \leq \frac{2 L^2 \kappa^2}{n\lambda},$$
proving the $\beta(n) = \frac{2 L^2 \kappa^2}{n\lambda}$ uniform stability of Tikhonov regularization.
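
The $1/(n\lambda)$ decay of $\|f_S - f_{S^{i,z'}}\|_\mathcal{H}$ can be observed numerically for kernel ridge regression, computing the RKHS distance of the two expansions through the kernel matrix. A sketch with illustrative data (note that with the square loss, $L$ is only a local Lipschitz constant over bounded outputs, so this checks the decay rate rather than the exact constant):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def fit(X, y, lam):
    n = len(X)
    return np.linalg.solve(gaussian_kernel(X, X) + lam * n * np.eye(n), y)

def rkhs_dist(X1, a1, X2, a2):
    """||f1 - f2||_H for f_j = sum_i a_j[i] k(., X_j[i])."""
    Xc = np.vstack([X1, X2])
    c = np.concatenate([a1, -a2])
    return np.sqrt(c @ gaussian_kernel(Xc, Xc) @ c)

rng = np.random.default_rng(0)
lam = 0.1
for n in (50, 100, 200, 400):
    X = rng.uniform(-1, 1, (n, 1))
    y = np.sin(3 * X[:, 0])
    Xp, yp = X.copy(), y.copy()
    Xp[0], yp[0] = 0.5, -1.0      # form S^{i,z'} with i = 0
    d = rkhs_dist(X, fit(X, y, lam), Xp, fit(Xp, yp, lam))
    print(f"n = {n:4d}   ||f_S - f_(S^i,z')||_H = {d:.5f}   n*d = {n * d:.3f}")
```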

17. So Far...
In previous classes we studied the excess risk of an estimator (in particular its sample error) by controlling the complexity of the space of functions from which the estimator was chosen (e.g. by covering numbers). In this class we investigated an alternative approach that focuses exclusively on properties of the learning algorithm (rather than of the whole hypothesis space). In particular, we observed how the stability of an algorithm allows us to control its generalization error in expectation. We then showed that Tikhonov regularization is a stable algorithm, which allowed us to immediately derive excess risk bounds.

18. Stability and Generalization (in Probability)
OK, but... what about controlling the generalization error in probability rather than in expectation? We can exploit the following result.

McDiarmid's Inequality. Let $F : \mathcal{Z}^n \to \mathbb{R}$ be such that for any $i = 1, \ldots, n$ there exists $c_i > 0$ for which $\sup_{S \in \mathcal{Z}^n, z \in \mathcal{Z}} |F(S) - F(S^{i,z})| \leq c_i$. Then,
$$\mathbb{P}_{S \sim \rho^n}\big(|F(S) - \mathbb{E}_{S' \sim \rho^n} F(S')| \geq \epsilon\big) \leq 2\exp\Big(-\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2}\Big).$$
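
For instance, the empirical mean of $[0, 1]$-valued variables satisfies the bounded-difference condition with $c_i = 1/n$, so McDiarmid gives $\mathbb{P}(|F(S) - \mathbb{E}F| \geq \epsilon) \leq 2 e^{-2 n \epsilon^2}$ (recovering Hoeffding's inequality). A Monte Carlo sanity check of the bound (a sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, eps = 100, 50_000, 0.1

# F(S) = empirical mean of n i.i.d. Uniform[0, 1] variables; E[F] = 1/2.
# Replacing one point changes F by at most c_i = 1/n.
means = rng.uniform(0, 1, (trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - 0.5) >= eps)

sum_ci2 = n * (1 / n) ** 2                    # sum_i c_i^2 = 1/n
mcdiarmid = 2 * np.exp(-2 * eps**2 / sum_ci2)

print(f"empirical tail: {empirical:.5f}   McDiarmid bound: {mcdiarmid:.5f}")
```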
