Concentration inequalities and the entropy method

Gábor Lugosi
ICREA and Pompeu Fabra University, Barcelona
what is concentration?

We are interested in bounding random fluctuations of functions of many independent random variables. $X_1, \ldots, X_n$ are independent random variables taking values in some set $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \ldots, X_n)$. How large are "typical" deviations of $Z$ from $\mathbf{E} Z$? In particular, we seek upper bounds for
$$\mathbf{P}\{Z > \mathbf{E} Z + t\} \quad \text{and} \quad \mathbf{P}\{Z < \mathbf{E} Z - t\}$$
for $t > 0$.
various approaches

- martingales (Yurinskii, 1974; Milman and Schechtman, 1986; Shamir and Spencer, 1987; McDiarmid, 1989, 1998);
- information-theoretic and transportation methods (Ahlswede, Gács, and Körner, 1976; Marton, 1986, 1996, 1997; Dembo, 1997);
- Talagrand's induction method, 1996;
- logarithmic Sobolev inequalities (Ledoux, 1996; Massart, 1998; Boucheron, Lugosi, Massart, 1999, 2001).
chernoff bounds

By Markov's inequality, if $\lambda > 0$,
$$\mathbf{P}\{Z - \mathbf{E} Z > t\} = \mathbf{P}\left\{ e^{\lambda(Z - \mathbf{E} Z)} > e^{\lambda t} \right\} \le \frac{\mathbf{E}\, e^{\lambda(Z - \mathbf{E} Z)}}{e^{\lambda t}}.$$
Next derive bounds for the moment generating function $\mathbf{E}\, e^{\lambda(Z - \mathbf{E} Z)}$ and optimize $\lambda$.

If $Z = \sum_{i=1}^n X_i$ is a sum of independent random variables,
$$\mathbf{E}\, e^{\lambda Z} = \mathbf{E} \prod_{i=1}^n e^{\lambda X_i} = \prod_{i=1}^n \mathbf{E}\, e^{\lambda X_i}$$
by independence. It suffices to find bounds for $\mathbf{E}\, e^{\lambda X_i}$.

[Photos: Serguei Bernstein (1880–1968); Herman Chernoff (1923– )]
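As a worked instance of the recipe (added here for concreteness, not on the original slide): if one can show $\mathbf{E}\, e^{\lambda(Z - \mathbf{E} Z)} \le e^{\lambda^2 v/2}$ for all $\lambda > 0$ and some $v > 0$, then the choice $\lambda = t/v$ is optimal:
$$\mathbf{P}\{Z - \mathbf{E} Z > t\} \le \inf_{\lambda > 0} e^{-\lambda t + \lambda^2 v/2} = e^{-t^2/(2v)}.$$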
hoeffding's inequality

If $X_1, \ldots, X_n \in [0, 1]$, then
$$\mathbf{E}\, e^{\lambda(X_i - \mathbf{E} X_i)} \le e^{\lambda^2/8}.$$
We obtain
$$\mathbf{P}\left\{ \left| \frac{1}{n} \sum_{i=1}^n X_i - \frac{1}{n} \sum_{i=1}^n \mathbf{E} X_i \right| > t \right\} \le 2 e^{-2nt^2}.$$

[Photo: Wassily Hoeffding (1914–1991)]
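A quick numerical sanity check (my illustration, not part of the slides), a minimal Python sketch comparing the empirical tail of the mean of i.i.d. uniform variables with the Hoeffding bound $2e^{-2nt^2}$:

```python
import numpy as np

# Monte Carlo sanity check of Hoeffding's inequality (an illustration,
# not from the slides): for i.i.d. uniform X_i on [0, 1], compare the
# empirical tail P{|mean - E mean| > t} with the bound 2 exp(-2 n t^2).
rng = np.random.default_rng(0)
n, trials, t = 100, 100_000, 0.1

X = rng.uniform(0.0, 1.0, size=(trials, n))
deviations = np.abs(X.mean(axis=1) - 0.5)   # E X_i = 1/2 for uniforms
empirical = (deviations > t).mean()
bound = 2 * np.exp(-2 * n * t**2)
print(f"empirical tail: {empirical:.5f}, Hoeffding bound: {bound:.5f}")
```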
bernstein's inequality

Hoeffding's inequality is distribution-free: it does not take variance information into account. Bernstein's inequality is an often useful variant. Let $X_1, \ldots, X_n$ be independent such that $X_i \le 1$, and let $v = \sum_{i=1}^n \mathbf{E}[X_i^2]$. Then
$$\mathbf{P}\left\{ \sum_{i=1}^n (X_i - \mathbf{E} X_i) \ge t \right\} \le \exp\left( - \frac{t^2}{2(v + t/3)} \right).$$
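To see what variance information buys, here is a small numerical comparison (my example, not from the slides) of the two bounds for a sum of $n$ Bernoulli$(p)$ variables with small $p$, where $v = np$ is far below the worst case:

```python
import numpy as np

# A sketch (illustration only) comparing the two tail bounds for a sum
# of n Bernoulli(p) variables with small p. Here
# v = sum_i E[X_i^2] = n * p, which is far smaller than the worst case,
# so Bernstein's bound beats the distribution-free Hoeffding tail
# P{sum (X_i - E X_i) >= t} <= exp(-2 t^2 / n) for [0,1]-valued terms.
n, p, t = 1000, 0.01, 20.0
v = n * p
bernstein = np.exp(-t**2 / (2 * (v + t / 3)))
hoeffding = np.exp(-2 * t**2 / n)
print(f"Bernstein: {bernstein:.2e}   Hoeffding: {hoeffding:.2e}")
```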
martingale representation

$X_1, \ldots, X_n$ are independent random variables taking values in some set $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \ldots, X_n)$. Denote $\mathbf{E}_i[\cdot] = \mathbf{E}[\cdot \mid X_1, \ldots, X_i]$. Thus, $\mathbf{E}_0 Z = \mathbf{E} Z$ and $\mathbf{E}_n Z = Z$.

Writing $\Delta_i = \mathbf{E}_i Z - \mathbf{E}_{i-1} Z$, we have
$$Z - \mathbf{E} Z = \sum_{i=1}^n \Delta_i.$$
This is the Doob martingale representation of $Z$.

[Photo: Joseph Leo Doob (1910–2004)]
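As an illustration of the representation (a sketch under the toy assumption $Z = \max_i X_i$ with uniform $X_i$, where $\mathbf{E}_i Z$ happens to have a closed form; the example is mine, not from the slides), one can verify numerically that the increments telescope to $Z - \mathbf{E} Z$:

```python
import numpy as np

# A sketch of the Doob martingale for the toy case Z = max(X_1,...,X_n)
# with i.i.d. uniform X_i on [0, 1]. With M_i = max(X_1,...,X_i) and
# k = n - i variables still unseen,
#   E_i Z = (M_i^(k+1) + k) / (k + 1),
# so the increments Delta_i = E_i Z - E_{i-1} Z telescope to Z - E Z.
rng = np.random.default_rng(1)
n = 10
x = rng.uniform(size=n)

def cond_exp(i):
    # E[Z | X_1, ..., X_i], in closed form for the maximum of uniforms.
    m = x[:i].max() if i > 0 else 0.0
    k = n - i
    return (m ** (k + 1) + k) / (k + 1)

deltas = [cond_exp(i) - cond_exp(i - 1) for i in range(1, n + 1)]
print(sum(deltas), x.max() - n / (n + 1))  # both equal Z - E Z
```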
martingale representation: the variance

$$\mathrm{Var}(Z) = \mathbf{E}\left( \sum_{i=1}^n \Delta_i \right)^2 = \sum_{i=1}^n \mathbf{E} \Delta_i^2 + 2 \sum_{j > i} \mathbf{E}\, \Delta_i \Delta_j.$$
Now if $j > i$, $\mathbf{E}_i \Delta_j = 0$, so
$$\mathbf{E}_i[\Delta_i \Delta_j] = \Delta_i\, \mathbf{E}_i \Delta_j = 0,$$
and hence $\mathbf{E}\, \Delta_i \Delta_j = 0$. We obtain
$$\mathrm{Var}(Z) = \mathbf{E}\left( \sum_{i=1}^n \Delta_i \right)^2 = \sum_{i=1}^n \mathbf{E} \Delta_i^2.$$
From this, using independence, it is easy to derive the Efron-Stein inequality.
efron-stein inequality (1981)

Let $X_1, \ldots, X_n$ be independent random variables taking values in $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \ldots, X_n)$. Then
$$\mathrm{Var}(Z) \le \mathbf{E} \sum_{i=1}^n \left( Z - \mathbf{E}^{(i)} Z \right)^2 = \mathbf{E} \sum_{i=1}^n \mathrm{Var}^{(i)}(Z),$$
where $\mathbf{E}^{(i)} Z$ is expectation with respect to the $i$-th variable $X_i$ only.

We obtain more useful forms by using that
$$\mathrm{Var}(X) = \tfrac{1}{2}\, \mathbf{E}(X - X')^2 \quad \text{and} \quad \mathrm{Var}(X) \le \mathbf{E}(X - a)^2$$
for an independent copy $X'$ of $X$ and any constant $a$.
efron-stein inequality (1981)

If $X'_1, \ldots, X'_n$ are independent copies of $X_1, \ldots, X_n$, and
$$Z'_i = f(X_1, \ldots, X_{i-1}, X'_i, X_{i+1}, \ldots, X_n),$$
then
$$\mathrm{Var}(Z) \le \frac{1}{2}\, \mathbf{E}\left[ \sum_{i=1}^n (Z - Z'_i)^2 \right].$$
$Z$ is concentrated if it doesn't depend too much on any of its variables. If $Z = \sum_{i=1}^n X_i$, then we have an equality. Sums are the "least concentrated" of all functions!
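A Monte Carlo sketch of this bound (again for the toy case $Z = \max_i X_i$ with uniform coordinates; the setup and helper names are mine):

```python
import numpy as np

# Monte Carlo sketch of the Efron-Stein bound (illustration only) for
# Z = max(X_1,...,X_n) with uniform X_i: compare Var(Z) with
# (1/2) E sum_i (Z - Z'_i)^2, where Z'_i replaces the i-th coordinate
# by an independent copy.
rng = np.random.default_rng(2)
n, trials = 20, 50_000

X = rng.uniform(size=(trials, n))
Xprime = rng.uniform(size=(trials, n))
Z = X.max(axis=1)

es_sum = 0.0
for i in range(n):
    Xi = X.copy()
    Xi[:, i] = Xprime[:, i]        # swap in the independent copy X'_i
    es_sum += ((Z - Xi.max(axis=1)) ** 2).mean()
print(f"Var(Z) = {Z.var():.5f}, Efron-Stein bound = {es_sum / 2:.5f}")
```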
efron-stein inequality (1981)

If, for some arbitrary functions $f_i$,
$$Z_i = f_i(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n),$$
then
$$\mathrm{Var}(Z) \le \mathbf{E}\left[ \sum_{i=1}^n (Z - Z_i)^2 \right].$$
efron, stein, and steele

[Photos: Mike Steele; Charles Stein; Bradley Efron]
weakly self-bounding functions

$f : \mathcal{X}^n \to [0, \infty)$ is weakly $(a, b)$-self-bounding if there exist $f_i : \mathcal{X}^{n-1} \to [0, \infty)$ such that for all $x \in \mathcal{X}^n$,
$$\sum_{i=1}^n \left( f(x) - f_i(x^{(i)}) \right)^2 \le a f(x) + b.$$
Then
$$\mathrm{Var}(f(X)) \le a\, \mathbf{E} f(X) + b.$$
self-bounding functions

If $0 \le f(x) - f_i(x^{(i)}) \le 1$ and
$$\sum_{i=1}^n \left( f(x) - f_i(x^{(i)}) \right) \le f(x),$$
then $f$ is self-bounding and $\mathrm{Var}(f(X)) \le \mathbf{E} f(X)$.

Rademacher averages, random VC dimension, random VC entropy, and the length of the longest increasing subsequence in a random permutation are all examples of self-bounding functions. Configuration functions are another general class of examples.
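For instance, for the longest increasing subsequence one can check $\mathrm{Var}(Z) \le \mathbf{E} Z$ by simulation; the sketch below (my illustration, not part of the slides) computes $Z$ by patience sorting:

```python
import numpy as np
from bisect import bisect_left

# Simulation sketch: the length Z of the longest increasing
# subsequence of a random permutation is self-bounding, so
# Var(Z) <= E Z. Z is computed by patience sorting.
rng = np.random.default_rng(3)

def lis_length(perm):
    piles = []
    for v in perm:
        j = bisect_left(piles, v)
        if j == len(piles):
            piles.append(v)   # start a new pile: LIS grows by one
        else:
            piles[j] = v      # keep pile tops minimal
    return len(piles)

n, trials = 500, 2000
Z = np.array([lis_length(rng.permutation(n)) for _ in range(trials)])
print(f"E Z ~ {Z.mean():.2f}, Var(Z) ~ {Z.var():.2f}")  # Var(Z) << E Z
```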
example: uniform deviations

Let $\mathcal{A}$ be a collection of subsets of $\mathcal{X}$, and let $X_1, \ldots, X_n$ be $n$ random points in $\mathcal{X}$ drawn i.i.d. Let
$$P(A) = \mathbf{P}\{X_1 \in A\} \quad \text{and} \quad P_n(A) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\{X_i \in A\}}.$$
If $Z = \sup_{A \in \mathcal{A}} |P(A) - P_n(A)|$, then
$$\mathrm{Var}(Z) \le \frac{1}{2n},$$
regardless of the distribution and the richness of $\mathcal{A}$.
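For a concrete check (my instance of the example): take $\mathcal{A}$ to be the half-lines $(-\infty, a]$, so that $Z$ is the Kolmogorov-Smirnov statistic, and compare its simulated variance with $1/(2n)$:

```python
import numpy as np

# Sketch: with A the class of half-lines (-inf, a],
# Z = sup_a |F(a) - F_n(a)| is the Kolmogorov-Smirnov statistic.
# Efron-Stein gives Var(Z) <= 1/(2n) for *any* class A; we check it
# by simulation with uniform samples.
rng = np.random.default_rng(4)
n, trials = 50, 20_000

X = np.sort(rng.uniform(size=(trials, n)), axis=1)
grid = np.arange(1, n + 1) / n
# For half-lines, the supremum is attained at the order statistics:
# D_n = max_i max(i/n - X_(i), X_(i) - (i-1)/n).
Z = np.maximum(np.abs(grid - X), np.abs(grid - 1.0 / n - X)).max(axis=1)
print(f"Var(Z) = {Z.var():.6f}, bound 1/(2n) = {1 / (2 * n):.6f}")
```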
beyond the variance

$X_1, \ldots, X_n$ are independent random variables taking values in some set $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \ldots, X_n)$. Recall the Doob martingale representation:
$$Z - \mathbf{E} Z = \sum_{i=1}^n \Delta_i, \quad \text{where } \Delta_i = \mathbf{E}_i Z - \mathbf{E}_{i-1} Z,$$
with $\mathbf{E}_i[\cdot] = \mathbf{E}[\cdot \mid X_1, \ldots, X_i]$. To get exponential inequalities, we bound the moment generating function $\mathbf{E}\, e^{\lambda(Z - \mathbf{E} Z)}$.
azuma's inequality

Suppose that the martingale differences are bounded: $|\Delta_i| \le c_i$. Then
$$\begin{aligned}
\mathbf{E}\, e^{\lambda(Z - \mathbf{E} Z)} &= \mathbf{E}\, e^{\lambda \sum_{i=1}^n \Delta_i}
= \mathbf{E}\left[ e^{\lambda \sum_{i=1}^{n-1} \Delta_i}\, \mathbf{E}_{n-1}\, e^{\lambda \Delta_n} \right] \\
&\le \mathbf{E}\left[ e^{\lambda \sum_{i=1}^{n-1} \Delta_i} \right] e^{\lambda^2 c_n^2 / 2} \quad \text{(by Hoeffding)} \\
&\le \cdots \le e^{\lambda^2 \left( \sum_{i=1}^n c_i^2 \right) / 2}.
\end{aligned}$$
This is the Azuma-Hoeffding inequality for sums of bounded martingale differences.
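Combined with the Chernoff recipe above, optimizing $\lambda = t / \sum_{i=1}^n c_i^2$ yields the usual tail form of Azuma-Hoeffding (stated here for completeness):
$$\mathbf{P}\{Z - \mathbf{E} Z > t\} \le \exp\left( - \frac{t^2}{2 \sum_{i=1}^n c_i^2} \right).$$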