Generalization theory
Daniel Hsu
Columbia TRIPODS Bootcamp
Motivation
Support vector machines

$\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{-1, +1\}$.

◮ Return solution $\hat{w} \in \mathbb{R}^d$ to the following optimization problem:
$$\min_{w \in \mathbb{R}^d} \ \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{n} \sum_{i=1}^n [1 - y_i w^\top x_i]_+ .$$
◮ The loss function is the hinge loss $\ell(\hat{y}, y) = [1 - y\hat{y}]_+ = \max\{1 - y\hat{y}, 0\}$. (Here, we are okay with a real-valued prediction.)
◮ The $\frac{\lambda}{2}\|w\|_2^2$ term is called Tikhonov regularization, which we'll discuss later.
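The slides contain no code; as an illustrative aside, here is a minimal NumPy sketch of this objective with a plain subgradient-descent solver. The synthetic data, the step-size schedule $\eta_t = 1/(\lambda t)$, and the iteration count are assumptions made for the demo, not part of the lecture.

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """(lam/2) ||w||_2^2 plus the average hinge loss [1 - y_i w.x_i]_+."""
    margins = 1.0 - y * (X @ w)
    return 0.5 * lam * np.dot(w, w) + np.mean(np.maximum(margins, 0.0))

def svm_subgradient_step(w, X, y, lam, eta):
    """One subgradient step; the hinge term contributes -y_i x_i exactly
    for the examples whose margin is below 1."""
    active = (1.0 - y * (X @ w)) > 0.0
    grad = lam * w - X[active].T @ y[active] / len(y)
    return w - eta * grad

# Synthetic linearly separable data (an assumption for the example).
rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))

w = np.zeros(d)
for t in range(1, 501):
    w = svm_subgradient_step(w, X, y, lam, eta=1.0 / (lam * t))
print(svm_objective(w, X, y, lam))  # objective decreases toward its minimum
```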
Basic statistical model for data

IID model of data
◮ Training data and test example are independent and identically distributed $(\mathcal{X} \times \mathcal{Y})$-valued random variables:
$$(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y) \sim_{\text{iid}} P.$$

SVM in the iid model
◮ Return solution $\hat{w}$ to the following optimization problem:
$$\min_{w \in \mathbb{R}^d} \ \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{n} \sum_{i=1}^n [1 - Y_i w^\top X_i]_+ .$$
◮ Therefore, $\hat{w}$ is a random variable, depending on $(X_1, Y_1), \ldots, (X_n, Y_n)$.
Convergence of empirical risk

For $w$ that does not depend on the training data, the empirical risk
$$R_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(w^\top X_i, Y_i)$$
is a sum of iid random variables.

The Law of Large Numbers gives an asymptotic result:
$$R_n(w) = \frac{1}{n} \sum_{i=1}^n \ell(w^\top X_i, Y_i) \xrightarrow{p} \mathbb{E}[\ell(w^\top X, Y)] = R(w).$$
(This can be made non-asymptotic.)
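A quick simulation (an addition, not from the slides) illustrates this convergence. The distribution $P$ below, Gaussian features with noisy linear labels, and the fixed $w$ are assumptions chosen just for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, -1.0, 0.5])  # a fixed w, chosen before seeing any data

def sample(n):
    """n iid draws from a hypothetical P: Gaussian X, noisy linear labels."""
    X = rng.normal(size=(n, 3))
    Y = np.sign(X @ np.array([0.5, -0.5, 1.0]) + 0.3 * rng.normal(size=n))
    return X, Y

def hinge(yhat, y):
    return np.maximum(1.0 - y * yhat, 0.0)

for n in [10, 100, 10_000, 1_000_000]:
    X, Y = sample(n)
    print(n, hinge(X @ w, Y).mean())  # R_n(w) stabilizes around R(w) as n grows
```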
Uniform convergence of empirical risk

However, $\hat{w}$ does depend on the training data. The empirical risk of $\hat{w}$,
$$R_n(\hat{w}) = \frac{1}{n} \sum_{i=1}^n \ell(\hat{w}^\top X_i, Y_i),$$
is not a sum of iid random variables.

Idea: $\hat{w}$ could conceivably take any value $w$, but if
$$\sup_w |R_n(w) - R(w)| \xrightarrow{p} 0, \qquad (1)$$
then $R_n(\hat{w}) \xrightarrow{p} R(\hat{w})$ as well. (1) is called uniform convergence.
Detour: Concentration inequalities
Symmetric random walk

Rademacher random variables $\varepsilon_1, \ldots, \varepsilon_n$ iid with $P(\varepsilon_i = -1) = P(\varepsilon_i = +1) = 1/2$.

Symmetric random walk: position after $n$ steps is
$$S_n = \sum_{i=1}^n \varepsilon_i.$$

How far from the origin?
◮ By independence, $\mathrm{var}(S_n) = \sum_{i=1}^n \mathrm{var}(\varepsilon_i) = n$.
◮ So the expected distance from the origin is $\mathbb{E}|S_n| \le \sqrt{\mathrm{var}(S_n)} = \sqrt{n}$.

How many realizations are $\gg \sqrt{n}$ from the origin?
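As a sanity check (an addition to the slides), a short simulation of many independent walks; the walk length and trial count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10_000, 1_000
steps = rng.choice([-1, 1], size=(trials, n))  # Rademacher steps
S_n = steps.sum(axis=1)                        # endpoint of each walk

print(np.abs(S_n).mean())  # Monte Carlo estimate of E|S_n| (around 80 here)
print(np.sqrt(n))          # the bound sqrt(n) = 100
```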
Markov's inequality

For any random variable $X$ and any $t \ge 0$,
$$P(|X| \ge t) \le \frac{\mathbb{E}|X|}{t}.$$
◮ Proof: $t \cdot 1\{|X| \ge t\} \le |X|$; take expectations of both sides.

Application to symmetric random walk:
$$P(|S_n| \ge c\sqrt{n}) \le \frac{\mathbb{E}|S_n|}{c\sqrt{n}} \le \frac{1}{c}.$$
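A quick Monte Carlo check (an addition, with arbitrary sizes) shows how loose Markov's bound is for the random walk: the observed tail probabilities fall far below $1/c$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 2_500, 5_000
S_n = rng.choice([-1, 1], size=(trials, n)).sum(axis=1)

for c in [2, 4, 8]:
    observed = (np.abs(S_n) >= c * np.sqrt(n)).mean()
    print(c, observed, 1 / c)  # observed tail vs. Markov's 1/c bound
```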
Hoeffding's inequality

If $X_1, \ldots, X_n$ are independent random variables, with $X_i$ taking values in $[a_i, b_i]$, then for any $t \ge 0$,
$$P\left( \sum_{i=1}^n (X_i - \mathbb{E}[X_i]) \ge t \right) \le \exp\left( -\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right).$$

E.g., Rademacher random variables have $[a_i, b_i] = [-1, +1]$, so
$$P(S_n \ge t) \le \exp(-2t^2/(4n)) = \exp(-t^2/(2n)).$$
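The Rademacher case is easy to check empirically; the following sketch (an addition, with arbitrary sizes) compares observed tails against $\exp(-t^2/(2n))$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 1_000, 20_000
S_n = rng.choice([-1, 1], size=(trials, n)).sum(axis=1)

for t in [20, 40, 60, 80]:
    # Observed tail frequency should sit below the Hoeffding bound.
    print(t, (S_n >= t).mean(), np.exp(-t**2 / (2 * n)))
```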
Applying Hoeffding's inequality to symmetric random walk

Union bound: For any events $A$ and $B$, $P(A \cup B) \le P(A) + P(B)$.

1. Apply Hoeffding to $\varepsilon_1, \ldots, \varepsilon_n$: $P(S_n \ge c\sqrt{n}) \le \exp(-c^2/2)$.
2. Apply Hoeffding to $-\varepsilon_1, \ldots, -\varepsilon_n$: $P(-S_n \ge c\sqrt{n}) \le \exp(-c^2/2)$.
3. Therefore, by the union bound,
$$P(|S_n| \ge c\sqrt{n}) \le 2\exp(-c^2/2).$$

(Compare to the bound from Markov's inequality: $1/c$.)
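The gap between the two bounds is dramatic even for small $c$; a three-line comparison (an addition to the slides):

```python
import numpy as np

# Tail bounds on P(|S_n| >= c sqrt(n)): Markov's 1/c vs. two-sided Hoeffding.
for c in [2.0, 4.0, 8.0]:
    print(c, 1 / c, 2 * np.exp(-c**2 / 2))
# c=2: 0.5 vs ~0.27;  c=4: 0.25 vs ~6.7e-4;  c=8: 0.125 vs ~2.5e-14
```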
Equivalent form of Hoeffding's inequality

Let $X_1, \ldots, X_n$ be independent random variables, with $X_i$ taking values in $[a_i, b_i]$, and let $S_n = \sum_{i=1}^n X_i$. For any $\delta \in (0, 1)$,
$$P\left( S_n - \mathbb{E}[S_n] < \sqrt{\frac{1}{2} \sum_{i=1}^n (b_i - a_i)^2 \ln(1/\delta)} \right) \ge 1 - \delta.$$
This is a "high probability" upper bound on $S_n - \mathbb{E}[S_n]$.
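This form is obtained by setting the tail bound equal to $\delta$ and solving for $t$. A small helper (an addition; the numbers in the usage line are arbitrary) computes the deviation width:

```python
import numpy as np

def hoeffding_deviation(ranges, delta):
    """Deviation t such that S_n - E[S_n] < t with probability >= 1 - delta,
    where ranges[i] = b_i - a_i."""
    return np.sqrt(0.5 * np.sum(np.square(ranges)) * np.log(1.0 / delta))

# Rademacher case: each range b_i - a_i = 2, so t = sqrt(2 n ln(1/delta)).
print(hoeffding_deviation(np.full(10_000, 2.0), delta=0.01))  # ~303.5
```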
Uniform convergence: Finite classes
Back to statistical learning

Cast of characters:
◮ feature and outcome spaces: $\mathcal{X}$, $\mathcal{Y}$
◮ function class: $\mathcal{F} \subset \mathcal{Y}^{\mathcal{X}}$
◮ loss function: $\ell \colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ (assume bounded by $1$)
◮ training and test data: $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y) \sim_{\text{iid}} P$

We let $\hat{f} \in \arg\min_{f \in \mathcal{F}} R_n(f)$ be a minimizer of the empirical risk
$$R_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i).$$

Our worry: over-fitting, i.e., $R(\hat{f}) \gg R_n(\hat{f})$.
Convergence of empirical risk for a fixed function

For any fixed function $f \in \mathcal{F}$,
$$\mathbb{E}[R_n(f)] = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i) \right] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[\ell(f(X_i), Y_i)] = R(f).$$

Since $R_n(f)$ is a sum of $n$ independent $[0, \frac{1}{n}]$-valued random variables,
$$P\left( |R_n(f) - R(f)| \ge t \right) \le 2\exp\left( -\frac{2t^2}{\sum_{i=1}^n (1/n)^2} \right) = 2\exp(-2nt^2)$$
for any $t > 0$, by Hoeffding's inequality and the union bound.

This argument does not apply to $\hat{f}$, because $\hat{f}$ depends on $(X_1, Y_1), \ldots, (X_n, Y_n)$.
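To see this concentration concretely (an addition to the slides), take the 0/1 loss, so that $n R_n(f)$ for a fixed $f$ is a Binomial count; the true risk $p$ and the sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials, t, p = 500, 100_000, 0.05, 0.3  # p plays the role of R(f)

# With 0/1 loss, n * R_n(f) ~ Binomial(n, p), so draw R_n(f) directly.
R_n = rng.binomial(n, p, size=trials) / n
print((np.abs(R_n - p) >= t).mean())  # observed deviation probability (~0.015)
print(2 * np.exp(-2 * n * t**2))      # Hoeffding bound 2 exp(-2 n t^2) (~0.164)
```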
Uniform convergence

We cannot directly apply Hoeffding's inequality to $\hat{f}$, since its empirical risk $R_n(\hat{f})$ is not an average of iid random variables.

One possible solution: ensure the empirical risk of every $f \in \mathcal{F}$ is close to its expected value. This is called uniform convergence.

◮ How much data is needed to ensure this?
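For a finite class, a union bound over all $|\mathcal{F}|$ functions gives a first answer: combining the per-function bound $2\exp(-2nt^2)$ over $f \in \mathcal{F}$ yields $P(\sup_{f \in \mathcal{F}} |R_n(f) - R(f)| \ge t) \le 2|\mathcal{F}|\exp(-2nt^2)$, so $n \ge \ln(2|\mathcal{F}|/\delta)/(2t^2)$ suffices. A sketch of this back-of-the-envelope calculation follows (an addition; the class size, accuracy, and confidence below are arbitrary):

```python
import numpy as np

def finite_class_sample_size(num_functions, t, delta):
    """Smallest n with 2|F| exp(-2 n t^2) <= delta, so that with probability
    >= 1 - delta every f in a finite class F has |R_n(f) - R(f)| < t."""
    return int(np.ceil(np.log(2 * num_functions / delta) / (2 * t**2)))

print(finite_class_sample_size(num_functions=10**6, t=0.05, delta=0.01))  # 3823
```

Note the dependence on $|\mathcal{F}|$ is only logarithmic: squaring the class size adds a factor of two inside the logarithm, not outside.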