Generalization Bounds and Stability
Lorenzo Rosasco, Tomaso Poggio
9.520 Class 6, February 23, 2011
L. Rosasco / T. Poggio, Generalization and Stability
About this class

Goal: to recall the notion of generalization bounds and show how they can be derived from a stability argument.
Plan

- Generalization Bounds
- Stability
- Generalization Bounds Using Stability
Learning Algorithms

A learning algorithm A is a map S \mapsto f_S, where S = (x_1, y_1), \dots, (x_n, y_n). We assume that:
- A is deterministic,
- A does not depend on the ordering of the points in the training set.

How can we measure the quality of f_S?
Risks

Recall that we've defined the expected risk:

  I[f_S] = E_{(x,y)}[V(f_S(x), y)] = \int V(f_S(x), y) \, d\mu(x, y)

and the empirical risk:

  I_S[f_S] = \frac{1}{n} \sum_{i=1}^{n} V(f_S(x_i), y_i).

Note: we will denote the loss function as V(f, z) or as V(f(x), y), where z = (x, y). For example, E_z[V(f, z)] = E_{(x,y)}[V(f(x), y)].
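As a concrete illustration (not from the slides), the empirical risk is just an average of losses over the training set. The predictor, data, and choice of square loss below are hypothetical stand-ins:

```python
# Empirical risk I_S[f_S] = (1/n) * sum_i V(f_S(x_i), y_i),
# illustrated with the square loss V(f(x), y) = (f(x) - y)^2.

def square_loss(prediction, y):
    return (prediction - y) ** 2

def empirical_risk(f, S, loss=square_loss):
    """Average loss of f over the training set S = [(x_1, y_1), ...]."""
    return sum(loss(f(x), y) for x, y in S) / len(S)

# A hypothetical predictor and training set.
f = lambda x: 2.0 * x
S = [(0.0, 0.0), (1.0, 2.0), (2.0, 3.0)]
print(empirical_risk(f, S))  # losses are 0, 0, 1, so the average is 1/3
```

The expected risk I[f_S], by contrast, integrates over the unknown distribution and cannot be computed from data alone.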
Generalization Bounds

Goal: choose A so that I[f_S] is small. Problem: I[f_S] depends on the unknown probability distribution.

Approach: we can measure I_S[f_S]. A generalization bound is a (probabilistic) bound on the defect (generalization error)

  D[f_S] = I[f_S] - I_S[f_S].

If we can bound the defect and we observe that I_S[f_S] is small, then I[f_S] is likely to be small.
Properties of Generalization Bounds

A probabilistic bound takes the form

  P(I[f_S] - I_S[f_S] \ge \epsilon) \le \delta,

or equivalently, with confidence 1 - \delta,

  I[f_S] - I_S[f_S] \le \epsilon.
Properties of Generalization Bounds (cont.)

Complexity: a historical approach to generalization bounds is based on controlling the complexity of the hypothesis space (covering numbers, VC dimension, Rademacher complexities).
Necessary and Sufficient Conditions for Learning

(Diagram.) For ERM (Empirical Risk Minimization), the following are equivalent: consistency, generalization, the Uniform Glivenko-Cantelli (UGC) property, and finite complexity.
Generalization Bounds by Stability

Stability: as we saw in class 2, the basic idea of stability is that a good algorithm should not change its solution much if we modify the training set slightly.
Necessary and Sufficient Conditions for Learning (cont.)

(Diagram.) For ERM (Empirical Risk Minimization), consistency, generalization, the Uniform Glivenko-Cantelli (UGC) property, finite complexity, and stability are all equivalent.
Regularization, Stability and Generalization

We explain this approach to generalization bounds, and show how to apply it to Tikhonov regularization, in the next class. Note that we will consider a stronger notion of stability than the one discussed in class 2. Tikhonov regularization satisfies this stronger notion of stability.
Uniform Stability

Notation: S is a training set; S^{i,z} is the training set obtained by replacing the i-th example in S with a new point z = (x, y).

Definition: an algorithm A has uniform stability \beta (is \beta-stable) if

  \forall (S, z) \in Z^{n+1}, \forall i, \quad \sup_{z' \in Z} |V(f_S, z') - V(f_{S^{i,z}}, z')| \le \beta.
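A rough numerical sketch of the definition (my own illustration, not from the slides): for the toy algorithm that outputs the constant predictor f_S(x) = mean of the labels, swapping one example changes the mean by at most |y_i - y'|/n, so with the absolute-value loss the loss change at any test point is bounded by roughly 1/n. The data, replacement point, and grid of test points below are all arbitrary:

```python
# Numerically probing uniform stability for a toy algorithm:
# A(S) returns the constant predictor f_S(x) = mean of the labels.
# With the absolute loss V(f, (x, y)) = |f(x) - y|, replacing one
# example changes every loss value by at most |y_i - y'| / n.

def train(S):
    m = sum(y for _, y in S) / len(S)
    return lambda x: m

def loss(f, z):
    x, y = z
    return abs(f(x) - y)

def replace(S, i, z_new):
    return S[:i] + [z_new] + S[i + 1:]

S = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (3.0, 1.0)]
z_new = (5.0, 0.0)                       # hypothetical replacement point
test_points = [(t / 10.0, 0.3) for t in range(-20, 21)]

beta_hat = 0.0
for i in range(len(S)):
    f_S, f_Si = train(S), train(replace(S, i, z_new))
    for zp in test_points:
        beta_hat = max(beta_hat, abs(loss(f_S, zp) - loss(f_Si, zp)))

# The largest change |y_i - 0| / n over this S is 1/4.
print(beta_hat)
```

This only probes the supremum over a finite grid, so it gives a lower estimate of the true \beta; the point is the 1/n scaling with the sample size.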
Uniform Stability (cont.)

Uniform stability is a strong requirement: the solution has to change very little even when a very unlikely ("bad") training set is drawn. The coefficient \beta is a function of n, and should perhaps be written \beta_n.
Stability and Concentration Inequalities

Given that an algorithm A has stability \beta, how can we get bounds on its performance? Via concentration inequalities, in particular McDiarmid's inequality. Concentration inequalities show how a random variable is concentrated around its mean.
McDiarmid's Inequality

Let V_1, \dots, V_n be random variables. If a function F mapping (V_1, \dots, V_n) to R satisfies

  \sup_{v_1, \dots, v_n, v'_i} |F(v_1, \dots, v_n) - F(v_1, \dots, v_{i-1}, v'_i, v_{i+1}, \dots, v_n)| \le c_i,

then the following statement holds:

  P(|F(v_1, \dots, v_n) - E[F(v_1, \dots, v_n)]| > \epsilon) \le 2 \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{n} c_i^2}\right).
Example: Hoeffding's Inequality

Suppose each v_i \in [a, b], and define F(v_1, \dots, v_n) = \frac{1}{n}\sum_{i=1}^{n} v_i, the average of the v_i. Then c_i = \frac{b - a}{n}. Applying McDiarmid's inequality, we have that

  P(|F(v) - E[F(v)]| > \epsilon) \le 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{n} c_i^2}\right) = 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{n} \left(\frac{b-a}{n}\right)^2}\right) = 2\exp\left(-\frac{2n\epsilon^2}{(b - a)^2}\right).
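A quick Monte Carlo sanity check (my own sketch, with arbitrary choices of n, \epsilon, and the Uniform[0, 1] distribution) confirms that the observed deviation frequency stays below Hoeffding's bound:

```python
import math
import random

# Monte Carlo check of Hoeffding's inequality for v_i ~ Uniform[0, 1]:
# P(|mean - E[mean]| > eps) <= 2 * exp(-2 * n * eps^2 / (b - a)^2).

random.seed(0)
n, eps, trials = 50, 0.15, 20000
a, b, true_mean = 0.0, 1.0, 0.5

deviations = 0
for _ in range(trials):
    avg = sum(random.uniform(a, b) for _ in range(n)) / n
    if abs(avg - true_mean) > eps:
        deviations += 1

empirical = deviations / trials
bound = 2 * math.exp(-2 * n * eps ** 2 / (b - a) ** 2)
print(empirical, bound)   # the observed frequency should be below the bound
```

For these values the bound is around 0.21, while the observed frequency is far smaller, illustrating that Hoeffding's inequality is valid but often loose.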
Generalization Bounds via McDiarmid's Inequality

We will use \beta-stability to apply McDiarmid's inequality to the defect D[f_S] = I[f_S] - I_S[f_S]. Two steps:
1. bound the expectation of the defect;
2. bound how much the defect can change when we replace an example.
Bounding the Expectation of the Defect

Note that E_S = E_{(z_1, \dots, z_n)}. Then

  E_S[D[f_S]] = E_S[I[f_S] - I_S[f_S]]
  = E_{(S,z)}\left[V(f_S, z) - \frac{1}{n}\sum_{i=1}^{n} V(f_S, z_i)\right]
  = E_{(S,z)}\left[\frac{1}{n}\sum_{i=1}^{n} \left(V(f_S, z) - V(f_{S^{i,z}}, z)\right)\right]
  \le \beta.

The second equality follows by the "symmetry" of the expectation: the expected loss of a training set on one of its training points does not change when we "rename" the points, so E_S[V(f_S, z_i)] = E_{(S,z)}[V(f_{S^{i,z}}, z)]. The last step applies \beta-stability to each term of the sum.
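A small simulation (my own sketch, not from the slides) of this expectation bound for a toy algorithm: the constant-mean predictor with absolute loss and labels in [0, 1] has uniform stability \beta = 1/n (replacing one label moves the mean, and hence every loss value, by at most 1/n), and the averaged defect indeed stays below \beta:

```python
import random

# Monte Carlo check that E_S[ I[f_S] - I_S[f_S] ] <= beta for a
# 1/n-stable toy algorithm: f_S = constant mean of the labels y_i,
# y ~ Uniform[0, 1], loss V(f, (x, y)) = |f - y|.

random.seed(1)
n, trials = 20, 20000

def true_risk(m):
    # E_y |m - y| for y ~ Uniform[0, 1], computed in closed form.
    return (m ** 2 + (1 - m) ** 2) / 2

total_defect = 0.0
for _ in range(trials):
    ys = [random.random() for _ in range(n)]
    m = sum(ys) / n
    emp_risk = sum(abs(m - y) for y in ys) / n
    total_defect += true_risk(m) - emp_risk

avg_defect = total_defect / trials
beta = 1.0 / n
print(avg_defect, beta)   # the averaged defect should not exceed beta
```

The averaged defect is positive and of order 1/n here, since the empirical risk is evaluated at a predictor fitted to the same sample.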
Bounding the Deviation of the Defect

Assume that there exists an upper bound M on the loss. Then

  |D[f_S] - D[f_{S^{i,z}}]|
  = |I[f_S] - I_S[f_S] - I[f_{S^{i,z}}] + I_{S^{i,z}}[f_{S^{i,z}}]|
  \le |I[f_S] - I[f_{S^{i,z}}]| + |I_S[f_S] - I_{S^{i,z}}[f_{S^{i,z}}]|
  \le \beta + \frac{1}{n}|V(f_S, z_i) - V(f_{S^{i,z}}, z)| + \frac{1}{n}\sum_{j \ne i} |V(f_S, z_j) - V(f_{S^{i,z}}, z_j)|
  \le \beta + \frac{M}{n} + \beta = 2\beta + \frac{M}{n}.
Applying McDiarmid's Inequality

By McDiarmid's inequality (using c_i = 2(\beta + M/n) \ge 2\beta + M/n), for any \epsilon,

  P(|D[f_S] - E[D[f_S]]| > \epsilon) \le 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{n} \left(2\left(\beta + \frac{M}{n}\right)\right)^2}\right) = 2\exp\left(-\frac{\epsilon^2}{2n\left(\beta + \frac{M}{n}\right)^2}\right) = 2\exp\left(-\frac{n\epsilon^2}{2(n\beta + M)^2}\right).
A Different Form of the Bound

Let

  \delta \equiv 2\exp\left(-\frac{n\epsilon^2}{2(n\beta + M)^2}\right).

Solving for \epsilon in terms of \delta, we find that

  \epsilon = (n\beta + M)\sqrt{\frac{2\ln(2/\delta)}{n}}.

We can say that with confidence 1 - \delta,

  D[f_S] \le E[D[f_S]] + (n\beta + M)\sqrt{\frac{2\ln(2/\delta)}{n}}.

But E[D[f_S]] \le \beta ...
A Different Form of the Bound (cont.)

Finally, recalling the definition of the defect, we have with confidence 1 - \delta,

  I[f_S] \le I_S[f_S] + \beta + (n\beta + M)\sqrt{\frac{2\ln(2/\delta)}{n}}.
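The bound is easy to evaluate numerically. The values of \beta, M, n, \delta, and the empirical risk below are arbitrary illustrations, not from the slides:

```python
import math

def stability_bound(emp_risk, beta, M, n, delta):
    """I[f_S] <= I_S[f_S] + beta + (n*beta + M) * sqrt(2*ln(2/delta)/n),
    holding with probability at least 1 - delta for a beta-stable algorithm
    with loss bounded by M."""
    return emp_risk + beta + (n * beta + M) * math.sqrt(2 * math.log(2 / delta) / n)

# Hypothetical values: beta = 1/n stability, loss bounded by M = 1,
# observed empirical risk 0.10, confidence 95%.
n, delta, M = 10000, 0.05, 1.0
beta = 1.0 / n
print(stability_bound(0.10, beta, M, n, delta))
```

Note that the bound is vacuous unless \beta decays faster than 1/\sqrt{n}, since the deviation term contains the factor n\beta.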
Convergence

Note that if \beta = k/n for some constant k, we can restate our bounds as

  P\left(|I[f_S] - I_S[f_S]| \ge \frac{k}{n} + \epsilon\right) \le 2\exp\left(-\frac{n\epsilon^2}{2(k + M)^2}\right),

and with probability 1 - \delta,

  I[f_S] \le I_S[f_S] + \frac{k}{n} + (k + M)\sqrt{\frac{2\ln(2/\delta)}{n}}.
Fast Convergence

For the uniform stability approach we've described, \beta = k/n (for some constant k) is "good enough". Obviously, the best possible stability would be \beta = 0: the function can't change at all when you change the training set. An algorithm that always picks the same function, regardless of its training set, is maximally stable and has \beta = 0. Using \beta = 0 in the last bound, with probability 1 - \delta,

  I[f_S] \le I_S[f_S] + M\sqrt{\frac{2\ln(2/\delta)}{n}}.

The convergence is still O(1/\sqrt{n}). So once \beta = O(1/n), further increases in stability don't change the rate of convergence.
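To see numerically that extra stability beyond \beta = O(1/n) does not change the rate (a sketch with arbitrary values of k, M, and \delta): both deviation terms shrink like 1/\sqrt{n}, differing only in the constant.

```python
import math

def excess(beta, M, n, delta):
    """Deviation term beta + (n*beta + M) * sqrt(2*ln(2/delta)/n)."""
    return beta + (n * beta + M) * math.sqrt(2 * math.log(2 / delta) / n)

k, M, delta = 1.0, 1.0, 0.05
for n in (100, 10000, 1000000):
    e_kn = excess(k / n, M, n, delta)    # beta = k/n
    e_0 = excess(0.0, M, n, delta)       # maximally stable, beta = 0
    # Both decay like 1/sqrt(n); only the constant differs, and the
    # ratio approaches (k + M) / M as n grows.
    print(n, e_kn, e_0, e_kn / e_0)
```

With k = M = 1, the ratio tends to 2: the maximally stable algorithm gains only a constant factor, not a faster rate.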