concentration inequalities

Concentration inequalities. Jean-Yves Audibert (1, 2); (1) Imagine - ENPC/CSTB - Université Paris Est; (2) Willow (INRIA/ENS/CNRS). ThRaSH'2010


  1. Concentration inequalities
     Jean-Yves Audibert (1, 2)
     (1) Imagine - ENPC/CSTB - Université Paris Est
     (2) Willow (INRIA/ENS/CNRS)
     ThRaSH'2010

  2. Problem
     Tight upper and lower bounds on $f(X_1, \dots, X_n)$ with
     • $X_1, \dots, X_n$ i.i.d. random variables taking their values in some (measurable) space $\mathcal{X}$, and
     • $f : \mathcal{X}^n \to \mathbb{R}$ a function whose value depends on all the variables but not too much on any of them.
     For example:
     $f(X_1, \dots, X_n) = \frac{X_1 + \cdots + X_n}{n}$
     or
     $f(X_1, \dots, X_n) = \sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n}$

  3. Outline
     • Asymptotic viewpoint
     • Nonasymptotic viewpoint
       – Gaussian approximation
       – Gaussian processes
       – Sum of i.i.d. r.v.
       – Functions with bounded differences
       – Self-bounding functions

  4. The asymptotic viewpoint
     • What is the limit of $f(X_1, \dots, X_n)$?
     • What is the limit of its centered and scaled version:
       $\frac{f(X_1, \dots, X_n) - \mathbb{E} f(X_1, \dots, X_n)}{\sqrt{\operatorname{Var} f(X_1, \dots, X_n)}}$ ?

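  5. Convergence of random variables
     • Convergence in distribution: $W_n \xrightarrow[n \to +\infty]{d} W$
       ⇔ $\forall t \in \mathbb{R}$ s.t. $F_W$ is continuous at $t$, $F_{W_n}(t) \xrightarrow[n \to +\infty]{} F_W(t)$
       ⇔ $\forall f : \mathbb{R} \to \mathbb{R}$ continuous and bounded, $\mathbb{E} f(W_n) \xrightarrow[n \to +\infty]{} \mathbb{E} f(W)$
       ⇔ $\forall t \in \mathbb{R}$, $\mathbb{E} e^{itW_n} \xrightarrow[n \to +\infty]{} \mathbb{E} e^{itW}$ (with $i^2 = -1$)
     • Convergence in probability: $W_n \xrightarrow[n \to +\infty]{P} W$
       ⇔ $\forall \varepsilon > 0$, $\mathbb{P}(|W_n - W| \ge \varepsilon) \xrightarrow[n \to +\infty]{} 0$
     • Almost sure convergence: $W_n \xrightarrow[n \to +\infty]{a.s.} W$
       ⇔ $\mathbb{P}\big(W_n \xrightarrow[n \to +\infty]{} W\big) = 1$
     Almost sure cvg ⇒ cvg in probability ⇒ cvg in distribution
     • If $\forall \varepsilon > 0$, $\sum_{n \ge 1} \mathbb{P}(|W_n - W| > \varepsilon) < +\infty$, then $W_n \xrightarrow[n \to +\infty]{a.s.} W$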

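  6. Convergence of the empirical mean
     $f(X_1, \dots, X_n) = \frac{X_1 + \cdots + X_n}{n}$
     • LLN (1713): If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $\mathbb{E}|X| < +\infty$, then
       $\bar{X} = \frac{\sum_{i=1}^n X_i}{n} \xrightarrow[n \to +\infty]{a.s.} \mathbb{E} X$
     • CLT (1733): If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $\mathbb{E} X^2 < +\infty$, then
       $\sqrt{n}\,(\bar{X} - \mathbb{E} X) \xrightarrow[n \to +\infty]{d} \mathcal{N}(0, \operatorname{Var} X)$,
       or equivalently: for any $t$,
       $\mathbb{P}\Big(\sqrt{\frac{n}{\operatorname{Var} X}}\,(\bar{X} - \mathbb{E} X) > t\Big) \xrightarrow[n \to +\infty]{} \int_t^{+\infty} \frac{e^{-u^2/2}}{\sqrt{2\pi}}\,du.$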
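
The two limits above can be observed numerically. The following is a minimal simulation sketch, not part of the original slides: the choice of Uniform(0, 1) variables, the sample sizes, the seed, and the helper names `gaussian_upper_tail`, `standardized_mean` and `simulate` are all assumptions made for illustration.

```python
import math
import random


def gaussian_upper_tail(t):
    """P(N(0, 1) > t), computed from the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def standardized_mean(sample, mean, variance):
    """sqrt(n / Var X) * (X_bar - E X) for one sample of size n."""
    n = len(sample)
    x_bar = sum(sample) / n
    return math.sqrt(n / variance) * (x_bar - mean)


def simulate(n=200, replications=2000, t=1.0, seed=0):
    """Return (overall empirical mean, frequency of standardized mean > t)."""
    rng = random.Random(seed)
    mean, variance = 0.5, 1.0 / 12.0  # moments of Uniform(0, 1) (assumed law)
    exceed = 0
    grand_total = 0.0
    for _ in range(replications):
        sample = [rng.random() for _ in range(n)]
        grand_total += sum(sample)
        if standardized_mean(sample, mean, variance) > t:
            exceed += 1
    return grand_total / (n * replications), exceed / replications


lln_estimate, clt_frequency = simulate()
print("empirical mean:", lln_estimate)
print("P(standardized mean > 1):", clt_frequency,
      "Gaussian tail:", gaussian_upper_tail(1.0))
```

The first number should be close to $\mathbb{E} X = 1/2$ (LLN), and the observed frequency should be close to the Gaussian tail (CLT), up to Monte Carlo error.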

  7. Slutsky's lemma (1925)
     Let $(V_n)$ and $(W_n)$ be two sequences of random vectors or variables.
     If $V_n \xrightarrow[n \to +\infty]{P} v$ and $W_n \xrightarrow[n \to +\infty]{d} W$, then
     1. $V_n + W_n \xrightarrow[n \to +\infty]{d} v + W$
     2. $V_n W_n \xrightarrow[n \to +\infty]{d} v W$
     3. $V_n^{-1} W_n \xrightarrow[n \to +\infty]{d} v^{-1} W$ if $v$ is invertible

  8. An example of complicated functional: the t-statistic
     Let
     $f(X_1, \dots, X_n) = \frac{\sqrt{n}\,(\bar{X} - \mathbb{E} X)}{S_n}$, with $S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2$.
     Since $S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mathbb{E} X)^2 - (\mathbb{E} X - \bar{X})^2$, from the LLN, we have
     $S_n^2 \xrightarrow[n \to +\infty]{a.s.} \operatorname{Var} X$.
     From the CLT,
     $\sqrt{n}\,(\bar{X} - \mathbb{E} X) \xrightarrow[n \to +\infty]{d} \mathcal{N}(0, \operatorname{Var} X)$.
     Thus, from Slutsky's lemma,
     $f(X_1, \dots, X_n) \xrightarrow[n \to +\infty]{d} \mathcal{N}(0, 1)$.
     Appropriate decompositions of complicated functionals make it possible to compute their asymptotic distribution.
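
As a small illustration of the t-statistic and of the decomposition of $S_n^2$ used above, here is a hedged sketch. The function names `empirical_variance`, `t_statistic` and `decomposed_variance` are hypothetical and the data are made up.

```python
import math


def empirical_variance(xs):
    """S_n^2 = (1/n) * sum_i (X_i - X_bar)^2 (divides by n, as on the slide)."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / n


def t_statistic(xs, true_mean):
    """sqrt(n) * (X_bar - E X) / S_n, with E X supplied by the caller."""
    n = len(xs)
    x_bar = sum(xs) / n
    return math.sqrt(n) * (x_bar - true_mean) / math.sqrt(empirical_variance(xs))


def decomposed_variance(xs, true_mean):
    """(1/n) * sum_i (X_i - E X)^2 - (E X - X_bar)^2, the form used with the LLN."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - true_mean) ** 2 for x in xs) / n - (true_mean - x_bar) ** 2


data = [2.0, 4.0, 9.0]  # made-up sample
print(empirical_variance(data), decomposed_variance(data, 3.0))
print(t_statistic(data, 3.0))
```

Both variance expressions agree for any value of `true_mean`, which is exactly the identity that lets the LLN be applied term by term.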

  9. Nonasymptotic bounds
     Motivations:
     • When the nonasymptotic regime plays a crucial role (for instance, multi-armed bandit problems, racing algorithms, stopping-time problems)
     • When asymptotic analysis is not achievable through standard arguments
     • To derive asymptotic results!

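  10. The Berry (1941) - Esseen (1942) theorem
     • $X, X_1, \dots, X_n$ i.i.d.
     • $\mathbb{E}|X|^3 < +\infty$ and $\sigma^2 = \operatorname{Var} X$
     • $\bar{X} = \frac{X_1 + \cdots + X_n}{n}$
     • $Z \sim \mathcal{N}(\mathbb{E} \bar{X}, \operatorname{Var} \bar{X})$
     $\sup_{x \in \mathbb{R}} \big| \mathbb{P}(\bar{X} > x) - \mathbb{P}(Z > x) \big| \le \frac{\mathbb{E}|X - \mathbb{E} X|^3}{2 \sigma^3} \cdot \frac{1}{\sqrt{n}}$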
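
For Bernoulli variables both sides of the Berry-Esseen bound can be computed exactly, which gives a concrete feel for how tight it is. The sketch below is an illustration under assumptions: the Bernoulli setting, the parameters, and the helper names are not from the slides. It uses the fact that $\mathbb{P}(\bar{X} > x)$ is piecewise constant, so the supremum is attained at the endpoints of the intervals $[k/n, (k+1)/n)$.

```python
import math


def gaussian_tail(x, mean, sd):
    """P(Z > x) for Z ~ N(mean, sd^2)."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))


def berry_esseen_bound(abs_third_central_moment, sigma, n):
    """Right-hand side of the slide: E|X - EX|^3 / (2 sigma^3 sqrt(n))."""
    return abs_third_central_moment / (2.0 * sigma ** 3 * math.sqrt(n))


def bernoulli_sup_gap(p, n):
    """sup_x |P(X_bar > x) - P(Z > x)| for X_i i.i.d. Bernoulli(p)."""
    sd = math.sqrt(p * (1.0 - p) / n)  # standard deviation of X_bar
    pmf = [math.comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(n + 1)]
    gap = 0.0
    tail = 1.0  # P(S > -1), where S = X_1 + ... + X_n
    for k in range(-1, n + 1):
        if k >= 0:
            tail -= pmf[k]  # now tail = P(S > k)
        # P(X_bar > x) equals `tail` on [k/n, (k+1)/n); check both endpoints.
        for x in (k / n, (k + 1) / n):
            gap = max(gap, abs(tail - gaussian_tail(x, p, sd)))
    return gap


p, n = 0.5, 30
third_moment = p * (1 - p) ** 3 + (1 - p) * p ** 3  # E|X - p|^3 for Bernoulli(p)
sigma = math.sqrt(p * (1 - p))
print(bernoulli_sup_gap(p, n), berry_esseen_bound(third_moment, sigma, n))
```

The printed gap should not exceed the printed bound; for symmetric Bernoulli variables the two are of the same order, both decaying like $1/\sqrt{n}$.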

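  11. Slud's theorem (1977)
     • $X_1, \dots, X_n$ i.i.d. $\sim \mathcal{B}(p)$ with $p \le \frac{1}{2}$
     • $Z \sim \mathcal{N}(\mathbb{E} \bar{X}, \operatorname{Var} \bar{X})$
     • for any $x \in [p, 1 - p]$ such that $nx$ is an integer,
       $\mathbb{P}(\bar{X} \ge x) \ge \mathbb{P}(Z \ge x)$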

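  12. The Paley-Zygmund inequality (1932)
     • $X_1, \dots, X_n$ i.i.d.
     • for any $0 \le \lambda < 1$,
       $\mathbb{P}\Big( \Big| \frac{\sqrt{n}\,(\bar{X} - \mathbb{E} X)}{\sqrt{\operatorname{Var} X}} \Big| > \lambda \Big) \ge (1 - \lambda^2)^2 \min\Big( \frac{1}{3}, \frac{(\operatorname{Var} X)^2}{\mathbb{E}(X - \mathbb{E} X)^4} \Big).$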

  13. Supremum of Gaussian processes (GP)
     • Gaussian process $(W(g))_{g \in \mathcal{G}}$: for any $g_1, \dots, g_d \in \mathcal{G}$, $\big(W(g_1), \dots, W(g_d)\big)$ is a Gaussian random vector
     • GP: a powerful flexible probabilistic model parametrized by $\mu(g) = \mathbb{E} W(g)$ and $K(g, g') = \operatorname{Cov}\big(W(g), W(g')\big)$
     • Good intuition on GP ⇒ good intuition on $\sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n}$:
       $\sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n} \approx \sup_{g \in \mathcal{G}} W(g)$
       with $\mu(g) = \mathbb{E} g(X)$ and $K(g, g') = \frac{1}{n} \operatorname{Cov}\big(g(X), g'(X)\big)$.

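  14. The Borell (1975) - Cirel'son et al. (1976) inequality
     • $Z = \sup_{g \in \mathcal{G}} \big( W(g) - \mathbb{E} W(g) \big)$
     • $\sigma^2 = \sup_{g \in \mathcal{G}} \operatorname{Var} W(g) = \sup_{g \in \mathcal{G}} K(g, g)$
     for any $\lambda \in \mathbb{R}$, $\log \mathbb{E} e^{\lambda (Z - \mathbb{E} Z)} \le \frac{\lambda^2 \sigma^2}{2}$
     for any $t > 0$, $\mathbb{P}(Z - \mathbb{E} Z \ge t) \le e^{-\frac{t^2}{2 \sigma^2}}$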

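  15. Dudley's integral (1967)
     • $d(g, g') = \sqrt{\mathbb{E}[W(g) - W(g')]^2}$
     • $N(\varepsilon)$ = $\varepsilon$-packing number of $(\mathcal{G}, d)$
     • $\sigma^2 = \sup_{g \in \mathcal{G}} \operatorname{Var} W(g) = \sup_{g \in \mathcal{G}} K(g, g)$
     $\mathbb{E} \sup_{g \in \mathcal{G}} \big( W(g) - \mathbb{E} W(g) \big) \le 12 \int_0^{\sigma} \sqrt{\log N(\varepsilon)}\, d\varepsilon$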

  16. Another Borell (1975) - Cirel'son et al. (1976) inequality
     • $X_1, \dots, X_n$ i.i.d. $\sim \mathcal{N}(0, 1)$
     • $f : \mathbb{R}^n \to \mathbb{R}$ $L$-Lipschitz for the Euclidean distance: for any $x, x'$ in $\mathbb{R}^n$, $|f(x) - f(x')| \le L \|x - x'\|$
     for any $t > 0$,
     $\mathbb{P}\big( f(X_1, \dots, X_n) - \mathbb{E} f(X_1, \dots, X_n) \ge t \big) \le e^{-\frac{t^2}{2 L^2}}.$
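
A quick way to see this inequality at work is to take $f(x) = \max_i x_i$, which is 1-Lipschitz for the Euclidean distance. The sketch below is only an illustration: the choice of $f$, the sample sizes, the seed, and the use of a Monte Carlo average in place of $\mathbb{E} f$ are all assumptions of this example.

```python
import math
import random


def gaussian_lipschitz_bound(t, lipschitz_constant):
    """Right-hand side of the slide: exp(-t^2 / (2 L^2))."""
    return math.exp(-t ** 2 / (2.0 * lipschitz_constant ** 2))


def simulate_max_deviation(n=10, replications=5000, t=1.0, seed=0):
    """Frequency of {max_i X_i - center >= t} for X_i i.i.d. N(0, 1).

    `center` is a Monte Carlo stand-in for E max_i X_i (an approximation).
    """
    rng = random.Random(seed)
    values = [max(rng.gauss(0.0, 1.0) for _ in range(n))
              for _ in range(replications)]
    center = sum(values) / replications
    return sum(1 for v in values if v - center >= t) / replications


frequency = simulate_max_deviation()
print(frequency, gaussian_lipschitz_bound(1.0, 1.0))
```

The observed frequency should sit well below the bound; note that the bound does not depend on the dimension $n$.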

  17. Some useful probabilistic inequalities
     • Markov's inequality: for any r.v. $X$ and $a > 0$, since $|X| \ge a \mathbf{1}_{|X| \ge a}$,
       $\mathbb{P}(|X| \ge a) \le \frac{1}{a} \mathbb{E}|X|$.
     • Jensen's inequality: for any integrable r.v. $X$ and $\varphi : \mathbb{R}^d \to \mathbb{R}$ convex,
       $\varphi(\mathbb{E} X) \le \mathbb{E} \varphi(X)$.
     • For any r.v. $X$, $\mathbb{E} X \le \int_0^{+\infty} \mathbb{P}(X \ge t)\, dt$ (with equality if $X \ge 0$)
     • Markov's inequality is at the basis of Chernoff's argument: $\forall s > 0$,
       $\mathbb{P}(X \ge t) = \mathbb{P}\big( e^{sX} \ge e^{st} \big) \le e^{-st}\, \mathbb{E} e^{sX}$.
     Control of the Laplace transform ⇒ control of the large deviations.
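
To compare Markov's inequality with Chernoff's argument on a case where everything is known in closed form, one can take $X \sim \text{Exponential}(1)$, for which $\mathbb{E} e^{sX} = 1/(1-s)$ when $s < 1$ and $\mathbb{P}(X \ge t) = e^{-t}$. This is a sketch under that assumed distribution; the helper names and the grid over $s$ are illustrative choices.

```python
import math


def markov_bound(expected_abs, a):
    """P(|X| >= a) <= E|X| / a."""
    return expected_abs / a


def chernoff_bound(t, laplace_transform, s_values):
    """min over the given s > 0 of exp(-s t) * E exp(s X)."""
    return min(math.exp(-s * t) * laplace_transform(s) for s in s_values)


def exponential_laplace(s):
    """E exp(s X) for X ~ Exponential(1); finite only for s < 1."""
    return 1.0 / (1.0 - s)


t = 5.0
s_grid = [i / 1000.0 for i in range(1, 1000)]  # s in (0, 1)
exact = math.exp(-t)                           # true tail P(X >= t)
markov = markov_bound(1.0, t)                  # E|X| = 1
chernoff = chernoff_bound(t, exponential_laplace, s_grid)
print(exact, chernoff, markov)
```

Optimizing over $s$ gives $t\,e^{1-t}$ (attained at $s = 1 - 1/t$), which decays exponentially in $t$, whereas Markov's bound only decays like $1/t$.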

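  18. Hoeffding's inequality (1963)
     If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $a \le X \le b$, then
     1. $\forall s \in \mathbb{R}$, $\mathbb{E} e^{s(X - \mathbb{E} X)} \le e^{\frac{s^2 (b-a)^2}{8}}$
     2. For any $t \ge 0$,
        $\mathbb{P}\big( \bar{X} - \mathbb{E} X \ge t \big) \le e^{-\frac{2 n t^2}{(b-a)^2}}$,
        or equivalently, for any $\varepsilon > 0$,
        $\mathbb{P}\Big( \bar{X} - \mathbb{E} X < (b-a) \sqrt{\frac{\log(\varepsilon^{-1})}{2n}} \Big) \ge 1 - \varepsilon$,
        i.e., "w.h.p." $\bar{X} - \mathbb{E} X < (b-a) \sqrt{\frac{\log(\varepsilon^{-1})}{2n}}$.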
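
The two equivalent forms of Hoeffding's inequality translate directly into code. The sketch below (function names and the Bernoulli(1/2) comparison are assumptions of this example) also compares the bound with an exact binomial tail.

```python
import math


def hoeffding_tail(n, t, a, b):
    """Upper bound exp(-2 n t^2 / (b - a)^2) on P(X_bar - E X >= t)."""
    return math.exp(-2.0 * n * t ** 2 / (b - a) ** 2)


def hoeffding_radius(n, eps, a, b):
    """(b - a) * sqrt(log(1/eps) / (2 n)): deviation exceeded w.p. at most eps."""
    return (b - a) * math.sqrt(math.log(1.0 / eps) / (2.0 * n))


def binomial_upper_tail(n, p, k):
    """Exact P(S >= k) for S ~ Binomial(n, p)."""
    return sum(math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)
               for j in range(k, n + 1))


n = 100
bound = hoeffding_tail(n, 0.1, 0.0, 1.0)   # bound on P(X_bar - 1/2 >= 0.1)
exact = binomial_upper_tail(n, 0.5, 60)    # same event for Bernoulli(1/2): S >= 60
radius = hoeffding_radius(n, 0.05, 0.0, 1.0)
print(exact, bound, radius)
```

Plugging `radius` back into `hoeffding_tail` returns `eps`, which is the equivalence between the two statements on the slide.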

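  19. Log-Laplace upper bound
     1. $\forall s \in \mathbb{R}$, $\mathbb{E} e^{s(X - \mathbb{E} X)} \le e^{\frac{s^2 (b-a)^2}{8}}$
     Define
     $\varphi(s) = \log \mathbb{E} e^{sX}$ and $P_s(d\omega) = \frac{e^{sX(\omega)}}{\mathbb{E} e^{sX}} \cdot P(d\omega)$.
     Then
     $\varphi'(s) = \mathbb{E}_{P_s} X$ and $\varphi''(s) = \operatorname{Var}_{P_s} X$, with
     $\operatorname{Var}_{P_s} X = \inf_{r \in \mathbb{R}} \mathbb{E}_{P_s} (X - r)^2 \le \mathbb{E}_{P_s} \Big( X - \frac{a+b}{2} \Big)^2 \le \frac{(b-a)^2}{4}$.
     $\varphi(s) = \varphi(0) + s \varphi'(0) + \int_0^s (s - t)\, \varphi''(t)\, dt$
     ⇒ $\log \mathbb{E} e^{sX} \le s\, \mathbb{E} X + \int_0^s \frac{(s - t)(b-a)^2}{4}\, dt \le s\, \mathbb{E} X + \frac{(b-a)^2 s^2}{8}$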

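  20. Chernoff's argument
     2. For any $t \ge 0$, $\mathbb{P}\big( \bar{X} - \mathbb{E} X > t \big) \le e^{-\frac{2 n t^2}{(b-a)^2}}$.
     For any $s > 0$,
     $\mathbb{P}(\bar{X} - \mathbb{E} X \ge t) = \mathbb{P}\big( e^{s(\bar{X} - \mathbb{E} X)} \ge e^{st} \big)$
     $\le e^{-st}\, \mathbb{E}\big[ e^{s(\bar{X} - \mathbb{E} X)} \big]$
     $= e^{-st}\, \mathbb{E}\, e^{\frac{s}{n} \sum_{i=1}^n (X_i - \mathbb{E} X)}$
     $= e^{-st} \Big( \mathbb{E}\, e^{\frac{s}{n} (X - \mathbb{E} X)} \Big)^n$
     $\le e^{-st + \frac{s^2 (b-a)^2}{8n}}$
     $= e^{-\frac{2 n t^2}{(b-a)^2}}$ by choosing $s = \frac{4nt}{(b-a)^2}$.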
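
The last step of Chernoff's argument is a one-dimensional minimization of the exponent $-st + \frac{s^2 (b-a)^2}{8n}$ over $s > 0$. The following sketch checks the stated minimizer numerically; the function names and the numeric values are illustrative assumptions.

```python
def chernoff_exponent(s, n, t, a, b):
    """Exponent -s t + s^2 (b - a)^2 / (8 n) obtained after the log-Laplace bound."""
    return -s * t + s ** 2 * (b - a) ** 2 / (8.0 * n)


def optimal_s(n, t, a, b):
    """Minimizer s = 4 n t / (b - a)^2 of the quadratic exponent."""
    return 4.0 * n * t / (b - a) ** 2


n, t, a, b = 50, 0.2, 0.0, 1.0
s_star = optimal_s(n, t, a, b)
best = chernoff_exponent(s_star, n, t, a, b)
print(s_star, best, -2.0 * n * t ** 2 / (b - a) ** 2)
```

Setting the derivative $-t + \frac{s (b-a)^2}{4n}$ to zero gives the same $s$, and the resulting exponent is $-\frac{2nt^2}{(b-a)^2}$.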

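  21. Union bound
     • $\mathbb{P}(A) \ge 1 - \varepsilon$ and $\mathbb{P}(B) \ge 1 - \varepsilon$ ⇒ $\mathbb{P}(A \cap B) \ge 1 - 2\varepsilon$
       (since $\mathbb{P}(A^c \cup B^c) \le \mathbb{P}(A^c) + \mathbb{P}(B^c)$)
       For instance: Hoeffding applied to $X$ + Hoeffding applied to $-X$ + union bound
       ⇒ with probability $\ge 1 - \varepsilon$, $|\bar{X} - \mathbb{E} X| < (b-a) \sqrt{\frac{\log(2 \varepsilon^{-1})}{2n}}$
       (leads to pessimistic but correct confidence intervals, unlike the CLT)
     • If $\mathbb{P}(A_1) \ge 1 - \varepsilon$, ..., $\mathbb{P}(A_m) \ge 1 - \varepsilon$, then
       $\mathbb{P}\big( A_1 \cap \cdots \cap A_m \big) \ge 1 - m\varepsilon$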
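
The two-sided interval obtained from Hoeffding plus the union bound can be sketched as follows. The function names `two_sided_hoeffding_interval` and `simultaneous_level` and the toy data are assumptions made for this example.

```python
import math


def two_sided_hoeffding_interval(xs, eps, a, b):
    """Interval containing E X with probability >= 1 - eps, for a <= X_i <= b.

    Each one-sided Hoeffding bound is used at level eps / 2, hence log(2 / eps).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    radius = (b - a) * math.sqrt(math.log(2.0 / eps) / (2.0 * n))
    return x_bar - radius, x_bar + radius


def simultaneous_level(eps_per_event, m):
    """Union bound: m events of probability >= 1 - eps hold jointly w.p. >= 1 - m eps."""
    return 1.0 - m * eps_per_event


low, high = two_sided_hoeffding_interval([0.0, 1.0, 1.0, 0.0], 0.1, 0.0, 1.0)
print(low, high)
print(simultaneous_level(0.01, 5))
```

With only four observations the interval is wider than $[0, 1]$ itself, which illustrates the "pessimistic but correct" remark on the slide.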

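  22. Bernstein's (1946) inequality
     Hoeffding's inequality vs CLT (with $Z \sim \mathcal{N}(0, 1)$):
     $e^{-2 \alpha^2 \frac{\operatorname{Var} X}{(b-a)^2}} \ge \mathbb{P}\Big( \sqrt{\frac{n}{\operatorname{Var} X}}\,(\bar{X} - \mathbb{E} X) > \alpha \Big) \xrightarrow[n \to +\infty]{} \mathbb{P}(Z > \alpha) \approx \frac{e^{-\alpha^2 / 2}}{\alpha \sqrt{2\pi}}$
     ⇒ Hoeffding's inequality is imprecise for r.v. having low variance
     Bernstein's inequality: If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $X - \mathbb{E} X \le c$, then
     • for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$,
       $\bar{X} \le \mathbb{E} X + \sqrt{\frac{2 \log(\varepsilon^{-1}) \operatorname{Var} X}{n}} + \frac{c \log(\varepsilon^{-1})}{3n}$
     • for any $t \ge 0$,
       $\mathbb{P}\big( \bar{X} - \mathbb{E} X > t \big) \le e^{-\frac{n t^2}{2 \operatorname{Var} X + 2ct/3}}$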
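
The gain of Bernstein over Hoeffding for low-variance variables is easy to quantify. The sketch below assumes Bernoulli($p$) variables with small $p$, so that $\operatorname{Var} X = p(1-p)$ and $X - \mathbb{E} X \le 1 - p$; the function names and parameter values are illustrative.

```python
import math


def bernstein_radius(n, eps, variance, c):
    """sqrt(2 log(1/eps) Var X / n) + c log(1/eps) / (3 n)."""
    log_term = math.log(1.0 / eps)
    return math.sqrt(2.0 * log_term * variance / n) + c * log_term / (3.0 * n)


def bernstein_tail(n, t, variance, c):
    """exp(-n t^2 / (2 Var X + 2 c t / 3))."""
    return math.exp(-n * t ** 2 / (2.0 * variance + 2.0 * c * t / 3.0))


def hoeffding_radius(n, eps, a, b):
    """(b - a) * sqrt(log(1/eps) / (2 n))."""
    return (b - a) * math.sqrt(math.log(1.0 / eps) / (2.0 * n))


def hoeffding_tail(n, t, a, b):
    """exp(-2 n t^2 / (b - a)^2)."""
    return math.exp(-2.0 * n * t ** 2 / (b - a) ** 2)


p, n, eps = 0.01, 1000, 0.05   # assumed low-variance Bernoulli setting
variance, c = p * (1.0 - p), 1.0 - p
print(bernstein_radius(n, eps, variance, c), hoeffding_radius(n, eps, 0.0, 1.0))
print(bernstein_tail(n, 0.02, variance, c), hoeffding_tail(n, 0.02, 0.0, 1.0))
```

In this setting the Bernstein deviation is several times smaller than the Hoeffding one, because its leading term scales with $\sqrt{\operatorname{Var} X}$ rather than with $b - a$.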

  23. Empirical Bernstein's inequality (A., Munos, Szepesvári, 2007; Maurer, Pontil, 2009)
     If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $a \le X \le b$, then for any $\varepsilon > 0$, with probability at least $1 - \varepsilon$,
     $\mathbb{E} X \le \bar{X} + \sqrt{\frac{2 \log(\varepsilon^{-1})\, \hat{\sigma}^2}{n}} + \frac{7 (b-a) \log(\varepsilon^{-1})}{3n}$
     with
     $\hat{\sigma}^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n - 1}$.
     To be compared with
     $\mathbb{E} X \le \bar{X} + \sqrt{\frac{2 \log(\varepsilon^{-1}) \operatorname{Var} X}{n}} + \frac{(b-a) \log(\varepsilon^{-1})}{3n}$
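
Because the empirical Bernstein bound only involves observable quantities, it can be computed directly from a sample. This is a minimal sketch of the formula as written on the slide; the function name `empirical_bernstein_upper` and the toy data are assumptions.

```python
import math
import statistics


def empirical_bernstein_upper(xs, eps, a, b):
    """Upper confidence bound on E X at level 1 - eps, for a <= X_i <= b."""
    n = len(xs)
    x_bar = sum(xs) / n
    sigma_hat_sq = statistics.variance(xs)  # unbiased estimate, divides by n - 1
    log_term = math.log(1.0 / eps)
    return (x_bar
            + math.sqrt(2.0 * log_term * sigma_hat_sq / n)
            + 7.0 * (b - a) * log_term / (3.0 * n))


sample = [0.0, 1.0, 0.0, 1.0]
print(empirical_bernstein_upper(sample, 0.05, 0.0, 1.0))
print(empirical_bernstein_upper([0.5] * 10, 0.05, 0.0, 1.0))
```

For a constant sample the variance term vanishes and only the $7 (b-a) \log(\varepsilon^{-1}) / (3n)$ term remains, which is the price paid for not knowing $\operatorname{Var} X$.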

  24. Hoeffding-Azuma inequalities (McDiarmid's version, 1989)
     If for some $c \ge 0$,
     $\sup_{i \in \{1, \dots, n\},\ (x_1, \dots, x_n) \in \mathcal{X}^n,\ x \in \mathcal{X}} \; f(x_1, \dots, x_n) - f(x_1, \dots, x_{i-1}, x, x_{i+1}, \dots, x_n) \le c$,
     then, for any $\lambda \in \mathbb{R}$, $W = f(X_1, \dots, X_n)$ satisfies
     $\mathbb{E} e^{\lambda (W - \mathbb{E} W)} \le e^{\frac{n \lambda^2 c^2}{8}}$
     and for any $t \ge 0$,
     $\mathbb{P}\big( W - \mathbb{E} W > t \big) \le e^{-\frac{2 t^2}{n c^2}}$
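
As a sanity check on the bounded-differences bound, one can apply it to the empirical mean of variables in $[a, b]$: changing one coordinate moves the mean by at most $(b-a)/n$, and the bound then reduces to Hoeffding's inequality. The sketch below uses hypothetical function names and illustrative values.

```python
import math


def mcdiarmid_tail(n, t, c):
    """exp(-2 t^2 / (n c^2)) for a function with bounded differences c."""
    return math.exp(-2.0 * t ** 2 / (n * c ** 2))


def bounded_difference_of_mean(n, a, b):
    """Changing one of n coordinates in [a, b] changes the mean by at most (b - a) / n."""
    return (b - a) / n


n, t = 100, 0.1
c = bounded_difference_of_mean(n, 0.0, 1.0)
print(mcdiarmid_tail(n, t, c), math.exp(-2.0 * n * t ** 2))
```

Both printed values agree (up to rounding), since $\frac{2t^2}{n c^2} = \frac{2 n t^2}{(b-a)^2}$ when $c = (b-a)/n$.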
