Generalization theory
Daniel Hsu
Columbia TRIPODS Bootcamp
Motivation
Support vector machines

$\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{-1, +1\}$.

◮ Return solution $\hat{w} \in \mathbb{R}^d$ to the following optimization problem:
$$\min_{w \in \mathbb{R}^d} \ \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{n} \sum_{i=1}^n [1 - y_i w^\top x_i]_+ .$$
◮ The loss function is the hinge loss $\ell(\hat{y}, y) = [1 - y\hat{y}]_+ = \max\{1 - y\hat{y}, 0\}$. (Here, we are okay with a real-valued prediction.)
◮ The $\frac{\lambda}{2}\|w\|_2^2$ term is called Tikhonov regularization, which we'll discuss later.
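The slides contain no code; as an illustrative aside, here is a minimal NumPy sketch of this objective with a plain subgradient-descent solver. The synthetic data, the step-size schedule $\eta_t = 1/(\lambda t)$, and the iteration count are assumptions made for the demo, not part of the lecture.

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """(lam/2) ||w||_2^2 plus the average hinge loss [1 - y_i w.x_i]_+."""
    margins = 1.0 - y * (X @ w)
    return 0.5 * lam * np.dot(w, w) + np.mean(np.maximum(margins, 0.0))

def svm_subgradient_step(w, X, y, lam, eta):
    """One subgradient step; the hinge term contributes -y_i x_i exactly
    for the examples whose margin is below 1."""
    active = (1.0 - y * (X @ w)) > 0.0
    grad = lam * w - X[active].T @ y[active] / len(y)
    return w - eta * grad

# Synthetic linearly separable data (an assumption for the example).
rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))

w = np.zeros(d)
for t in range(1, 501):
    w = svm_subgradient_step(w, X, y, lam, eta=1.0 / (lam * t))
print(svm_objective(w, X, y, lam))  # objective decreases toward its minimum
```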
Basic statistical model for data

IID model of data
◮ Training data and test example are independent and identically distributed $(\mathcal{X} \times \mathcal{Y})$-valued random variables:
$$(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y) \sim_{\text{iid}} P.$$

SVM in the iid model
◮ Return solution $\hat{w}$ to the following optimization problem:
$$\min_{w \in \mathbb{R}^d} \ \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{n} \sum_{i=1}^n [1 - Y_i w^\top X_i]_+ .$$
◮ Therefore, $\hat{w}$ is a random variable, depending on $(X_1, Y_1), \ldots, (X_n, Y_n)$.
Convergence of empirical risk

For $w$ that does not depend on the training data, the empirical risk
$$R_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(w^\top X_i, Y_i)$$
is a sum of iid random variables.

The Law of Large Numbers gives an asymptotic result:
$$R_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(w^\top X_i, Y_i) \xrightarrow{p} \mathbb{E}[\ell(w^\top X, Y)] = R(w).$$
(This can be made non-asymptotic.)
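A quick simulation (an addition, not from the slides) illustrates this convergence. The distribution $P$ below, Gaussian features with noisy linear labels, and the fixed $w$ are assumptions chosen just for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, -1.0, 0.5])  # a fixed w, chosen before seeing any data

def sample(n):
    """n iid draws from a hypothetical P: Gaussian X, noisy linear labels."""
    X = rng.normal(size=(n, 3))
    Y = np.sign(X @ np.array([0.5, -0.5, 1.0]) + 0.3 * rng.normal(size=n))
    return X, Y

def hinge(yhat, y):
    return np.maximum(1.0 - y * yhat, 0.0)

for n in [10, 100, 10_000, 1_000_000]:
    X, Y = sample(n)
    print(n, hinge(X @ w, Y).mean())  # R_n(w) stabilizes around R(w) as n grows
```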
Uniform convergence of empirical risk

However, $\hat{w}$ does depend on the training data. The empirical risk of $\hat{w}$,
$$R_n(\hat{w}) = \frac{1}{n} \sum_{i=1}^n \ell(\hat{w}^\top X_i, Y_i),$$
is not a sum of iid random variables.

Idea: $\hat{w}$ could conceivably take any value $w$, but if
$$\sup_w |R_n(w) - R(w)| \xrightarrow{p} 0, \qquad (1)$$
then $R_n(\hat{w}) \xrightarrow{p} R(\hat{w})$ as well. (1) is called uniform convergence.
Detour: Concentration inequalities
Symmetric random walk

Rademacher random variables $\varepsilon_1, \ldots, \varepsilon_n$ iid with $P(\varepsilon_i = -1) = P(\varepsilon_i = +1) = 1/2$.

Symmetric random walk: position after $n$ steps is
$$S_n = \sum_{i=1}^n \varepsilon_i.$$

How far from the origin?
◮ By independence, $\mathrm{var}(S_n) = \sum_{i=1}^n \mathrm{var}(\varepsilon_i) = n$.
◮ So the expected distance from the origin is $\mathbb{E}|S_n| \le \sqrt{\mathrm{var}(S_n)} = \sqrt{n}$.

How many realizations are $\gg \sqrt{n}$ from the origin?
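As a sanity check (an addition to the slides), a short simulation of many independent walks; the walk length and trial count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10_000, 1_000
steps = rng.choice([-1, 1], size=(trials, n))  # Rademacher steps
S_n = steps.sum(axis=1)                        # endpoint of each walk

print(np.abs(S_n).mean())  # Monte Carlo estimate of E|S_n| (around 80 here)
print(np.sqrt(n))          # the bound sqrt(n) = 100
```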
Markov's inequality

For any random variable $X$ and any $t \ge 0$,
$$P(|X| \ge t) \le \frac{\mathbb{E}|X|}{t}.$$
◮ Proof: $t \cdot 1\{|X| \ge t\} \le |X|$; take expectations of both sides.

Application to symmetric random walk:
$$P(|S_n| \ge c\sqrt{n}) \le \frac{\mathbb{E}|S_n|}{c\sqrt{n}} \le \frac{1}{c}.$$
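A quick Monte Carlo check (an addition, with arbitrary sizes) shows how loose Markov's bound is for the random walk: the observed tail probabilities fall far below $1/c$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 2_500, 5_000
S_n = rng.choice([-1, 1], size=(trials, n)).sum(axis=1)

for c in [2, 4, 8]:
    observed = (np.abs(S_n) >= c * np.sqrt(n)).mean()
    print(c, observed, 1 / c)  # observed tail vs. Markov's 1/c bound
```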
Hoeffding's inequality

If $X_1, \ldots, X_n$ are independent random variables, with $X_i$ taking values in $[a_i, b_i]$, then for any $t \ge 0$,
$$P\left( \sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \ge t \right) \le \exp\left( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right).$$

E.g., Rademacher random variables have $[a_i, b_i] = [-1, +1]$, so
$$P(S_n \ge t) \le \exp(-2t^2/(4n)) = \exp(-t^2/(2n)).$$
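The Rademacher case is easy to check empirically; the following sketch (an addition, with arbitrary sizes) compares observed tails against $\exp(-t^2/(2n))$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 1_000, 20_000
S_n = rng.choice([-1, 1], size=(trials, n)).sum(axis=1)

for t in [20, 40, 60, 80]:
    # Observed tail frequency should sit below the Hoeffding bound.
    print(t, (S_n >= t).mean(), np.exp(-t**2 / (2 * n)))
```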
Applying Hoeffding's inequality to symmetric random walk

Union bound: For any events $A$ and $B$, $P(A \cup B) \le P(A) + P(B)$.

1. Apply Hoeffding to $\varepsilon_1, \ldots, \varepsilon_n$: $P(S_n \ge c\sqrt{n}) \le \exp(-c^2/2)$.
2. Apply Hoeffding to $-\varepsilon_1, \ldots, -\varepsilon_n$: $P(-S_n \ge c\sqrt{n}) \le \exp(-c^2/2)$.
3. Therefore, by the union bound,
$$P(|S_n| \ge c\sqrt{n}) \le 2\exp(-c^2/2).$$

(Compare to the bound from Markov's inequality: $1/c$.)
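The gap between the two bounds is dramatic even for small $c$; a three-line comparison (an addition to the slides):

```python
import numpy as np

# Tail bounds on P(|S_n| >= c sqrt(n)): Markov's 1/c vs. two-sided Hoeffding.
for c in [2.0, 4.0, 8.0]:
    print(c, 1 / c, 2 * np.exp(-c**2 / 2))
# c=2: 0.5 vs ~0.27;  c=4: 0.25 vs ~6.7e-4;  c=8: 0.125 vs ~2.5e-14
```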
Equivalent form of Hoeffding's inequality

Let $X_1, \ldots, X_n$ be independent random variables, with $X_i$ taking values in $[a_i, b_i]$, and let $S_n = \sum_{i=1}^n X_i$. For any $\delta \in (0, 1)$,
$$P\left( S_n - \mathbb{E}[S_n] < \sqrt{\frac{1}{2} \sum_{i=1}^n (b_i - a_i)^2 \ln(1/\delta)} \right) \ge 1 - \delta.$$
This is a "high probability" upper bound on $S_n - \mathbb{E}[S_n]$.
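This form is obtained by setting the tail bound equal to $\delta$ and solving for $t$. A small helper (an addition; the numbers in the usage line are arbitrary) computes the deviation width:

```python
import numpy as np

def hoeffding_deviation(ranges, delta):
    """Deviation t such that S_n - E[S_n] < t with probability >= 1 - delta,
    where ranges[i] = b_i - a_i."""
    return np.sqrt(0.5 * np.sum(np.square(ranges)) * np.log(1.0 / delta))

# Rademacher case: each range b_i - a_i = 2, so t = sqrt(2 n ln(1/delta)).
print(hoeffding_deviation(np.full(10_000, 2.0), delta=0.01))  # ~303.5
```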
Uniform convergence: Finite classes
Back to statistical learning

Cast of characters:
◮ feature and outcome spaces: $\mathcal{X}$, $\mathcal{Y}$
◮ function class: $\mathcal{F} \subset \mathcal{Y}^{\mathcal{X}}$
◮ loss function: $\ell \colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ (assume bounded by $1$)
◮ training and test data: $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y) \sim_{\text{iid}} P$

We let $\hat{f} \in \arg\min_{f \in \mathcal{F}} R_n(f)$ be a minimizer of the empirical risk
$$R_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i).$$

Our worry: over-fitting, i.e., $R(\hat{f}) \gg R_n(\hat{f})$.
Convergence of empirical risk for a fixed function

For any fixed function $f \in \mathcal{F}$,
$$\mathbb{E}[R_n(f)] = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i) \right] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[\ell(f(X_i), Y_i)] = R(f).$$

Since $R_n(f)$ is a sum of $n$ independent $[0, \frac{1}{n}]$-valued random variables,
$$P\left( |R_n(f) - R(f)| \ge t \right) \le 2\exp\left( -\frac{2t^2}{\sum_{i=1}^n (1/n)^2} \right) = 2\exp(-2nt^2)$$
for any $t > 0$, by Hoeffding's inequality and the union bound.

This argument does not apply to $\hat{f}$, because $\hat{f}$ depends on $(X_1, Y_1), \ldots, (X_n, Y_n)$.
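To see this concentration concretely (an addition to the slides), take the 0/1 loss, so that $n R_n(f)$ for a fixed $f$ is a Binomial count; the true risk $p$ and the sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials, t, p = 500, 100_000, 0.05, 0.3  # p plays the role of R(f)

# With 0/1 loss, n * R_n(f) ~ Binomial(n, p), so draw R_n(f) directly.
R_n = rng.binomial(n, p, size=trials) / n
print((np.abs(R_n - p) >= t).mean())  # observed deviation probability (~0.015)
print(2 * np.exp(-2 * n * t**2))      # Hoeffding bound 2 exp(-2 n t^2) (~0.164)
```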
Uniform convergence

We cannot directly apply Hoeffding's inequality to $\hat{f}$, since its empirical risk $R_n(\hat{f})$ is not an average of iid random variables.

One possible solution: ensure the empirical risk of every $f \in \mathcal{F}$ is close to its expected value. This is called uniform convergence.

◮ How much data is needed to ensure this?
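For a finite class, a union bound over all $|\mathcal{F}|$ functions gives a first answer: combining the per-function bound $2\exp(-2nt^2)$ over $f \in \mathcal{F}$ yields $P(\sup_{f \in \mathcal{F}} |R_n(f) - R(f)| \ge t) \le 2|\mathcal{F}|\exp(-2nt^2)$, so $n \ge \ln(2|\mathcal{F}|/\delta)/(2t^2)$ suffices. A sketch of this back-of-the-envelope calculation follows (an addition; the class size, accuracy, and confidence below are arbitrary):

```python
import numpy as np

def finite_class_sample_size(num_functions, t, delta):
    """Smallest n with 2|F| exp(-2 n t^2) <= delta, so that with probability
    >= 1 - delta every f in a finite class F has |R_n(f) - R(f)| < t."""
    return int(np.ceil(np.log(2 * num_functions / delta) / (2 * t**2)))

print(finite_class_sample_size(num_functions=10**6, t=0.05, delta=0.01))  # 3823
```

Note the dependence on $|\mathcal{F}|$ is only logarithmic: squaring the class size adds a factor of two inside the logarithm, not outside.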