Concentration inequalities
Gábor Lugosi, ICREA and Pompeu Fabra University, Barcelona

what is concentration?
We are interested in bounding random fluctuations of functions of many independent random variables.


example: kernel density estimation
Let $X_1, \dots, X_n$ be i.i.d. real samples drawn according to some density $\varphi$. The kernel density estimate is
$\varphi_n(x) = \frac{1}{nh} \sum_{i=1}^n K\left(\frac{x - X_i}{h}\right)$,
where $h > 0$ and $K$ is a nonnegative "kernel" with $\int K = 1$. The $L_1$ error is
$Z = f(X_1, \dots, X_n) = \int |\varphi(x) - \varphi_n(x)|\, dx$.
It is easy to see that
$|f(x_1, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)| \le \frac{1}{nh} \int \left| K\left(\frac{x - x_i}{h}\right) - K\left(\frac{x - x_i'}{h}\right) \right| dx \le \frac{2}{n}$,
so by the Efron-Stein inequality we get $\mathrm{Var}(Z) \le \frac{2}{n}$.
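
A minimal numerical sketch of this bound (illustrative, not from the talk): the $L_1$ error of a Gaussian-kernel estimate of a standard normal density is simulated and its empirical variance is compared with $2/n$. The sample size, bandwidth, integration grid, and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, reps = 200, 0.3, 500
grid = np.linspace(-6, 6, 1201)          # integration grid for the L1 norm
dx = grid[1] - grid[0]
true_density = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)   # phi = N(0,1)

def l1_error(sample):
    # Gaussian-kernel density estimate evaluated on the grid
    u = (grid[:, None] - sample[None, :]) / h
    kde = np.exp(-u**2 / 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    return np.sum(np.abs(true_density - kde)) * dx          # Riemann sum for the L1 error

Z = np.array([l1_error(rng.standard_normal(n)) for _ in range(reps)])
print(f"empirical Var(Z) = {Z.var():.2e}   Efron-Stein bound 2/n = {2/n:.2e}")
```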

example: uniform deviations
Let $\mathcal{A}$ be a collection of subsets of $\mathcal{X}$, and let $X_1, \dots, X_n$ be $n$ random points in $\mathcal{X}$ drawn i.i.d. Let
$P(A) = P\{X_1 \in A\}$ and $P_n(A) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{X_i \in A}$.
If $Z = \sup_{A \in \mathcal{A}} |P(A) - P_n(A)|$, then
$\mathrm{Var}(Z) \le \frac{1}{2n}$,
regardless of the distribution and the richness of $\mathcal{A}$.
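
A small simulation sketch (illustrative assumptions, not from the talk): $\mathcal{A}$ is taken to be the class of half-lines $(-\infty, a]$ and the data are uniform on $[0,1]$, so $Z$ is the Kolmogorov-Smirnov statistic, computable exactly from the order statistics. The empirical variance is compared with $1/(2n)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 2000

def ks_statistic(sample):
    # sup over half-lines (-inf, a] of |P_n(A) - P(A)| for U(0,1) data:
    # the supremum is attained at the order statistics, where F(x) = x.
    x = np.sort(sample)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

Z = np.array([ks_statistic(rng.random(n)) for _ in range(reps)])
print(f"empirical Var(Z) = {Z.var():.2e}   Efron-Stein bound 1/(2n) = {1/(2*n):.2e}")
```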

bounding the expectation
Let $X_1', \dots, X_n'$ be an independent copy of the sample, let $P_n'(A) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{X_i' \in A}$, and let $E'$ denote expectation only with respect to $X_1', \dots, X_n'$. Then
$E \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| = E \sup_{A \in \mathcal{A}} \left| E'\left[ P_n(A) - P_n'(A) \right] \right| \le E \sup_{A \in \mathcal{A}} |P_n(A) - P_n'(A)| = \frac{1}{n} E \sup_{A \in \mathcal{A}} \left| \sum_{i=1}^n \left( \mathbb{1}_{X_i \in A} - \mathbb{1}_{X_i' \in A} \right) \right|$.
Second symmetrization: if $\varepsilon_1, \dots, \varepsilon_n$ are independent Rademacher variables, then this equals
$\frac{1}{n} E \sup_{A \in \mathcal{A}} \left| \sum_{i=1}^n \varepsilon_i \left( \mathbb{1}_{X_i \in A} - \mathbb{1}_{X_i' \in A} \right) \right| \le \frac{2}{n} E \sup_{A \in \mathcal{A}} \left| \sum_{i=1}^n \varepsilon_i \mathbb{1}_{X_i \in A} \right|$.

conditional rademacher average
If
$R_n = E_\varepsilon \sup_{A \in \mathcal{A}} \left| \sum_{i=1}^n \varepsilon_i \mathbb{1}_{X_i \in A} \right|$,
then
$E \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| \le \frac{2}{n} E R_n$.
$R_n$ is a data-dependent quantity!
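
A sketch of how $R_n$ can be computed from the data (illustrative assumptions, not from the talk): the $X_i$ take values in a 10-letter alphabet and $\mathcal{A}$ consists of all subsets of the alphabet, so the supremum has a closed form; the conditional expectation over the signs is approximated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)
n, mc, alphabet = 100, 5000, 10
x = rng.integers(alphabet, size=n)          # the observed data, on a finite alphabet

# A = all subsets of the alphabet.  For fixed signs eps,
#   sup_A |sum_i eps_i 1{X_i in A}| = max( sum_v (T_v)_+ , sum_v (T_v)_- ),
# where T_v is the sum of the eps_i with X_i = v.
eps = rng.choice([-1.0, 1.0], size=(mc, n))
T = np.stack([eps[:, x == v].sum(axis=1) for v in range(alphabet)], axis=1)
sup_per_draw = np.maximum(np.clip(T, 0, None).sum(axis=1),
                          np.clip(-T, 0, None).sum(axis=1))
R_n = sup_per_draw.mean()                   # Monte Carlo estimate of the conditional Rademacher average

print(f"R_n ≈ {R_n:.2f};  data-dependent bound (2/n) R_n ≈ {2 * R_n / n:.3f}")
```

Note that $R_n$ depends on the sample only through the counts of each alphabet symbol, which is what makes it data-dependent here.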

concentration of conditional rademacher average
Define
$R_n^{(i)} = E_\varepsilon \sup_{A \in \mathcal{A}} \left| \sum_{j \ne i} \varepsilon_j \mathbb{1}_{X_j \in A} \right|$.
One can show easily that
$0 \le R_n - R_n^{(i)} \le 1$ and $\sum_{i=1}^n \left( R_n - R_n^{(i)} \right) \le R_n$.
By the Efron-Stein inequality,
$\mathrm{Var}(R_n) \le E \sum_{i=1}^n \left( R_n - R_n^{(i)} \right)^2 \le E R_n$.
Standard deviation is at most $\sqrt{E R_n}$! Such functions are called self-bounding.

bounding the conditional rademacher average
If $S(X_1^n, \mathcal{A})$ is the number of different sets of the form
$\{X_1, \dots, X_n\} \cap A$, $A \in \mathcal{A}$,
then $R_n$ is the maximum of $S(X_1^n, \mathcal{A})$ sub-Gaussian random variables. By the maximal inequality,
$\frac{1}{n} R_n \le \sqrt{\frac{2 \log S(X_1^n, \mathcal{A})}{n}}$.
In particular,
$E \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| \le 2\, E \sqrt{\frac{2 \log S(X_1^n, \mathcal{A})}{n}}$.

random VC dimension
Let $V = V(x_1^n, \mathcal{A})$ be the size of the largest subset of $\{x_1, \dots, x_n\}$ shattered by $\mathcal{A}$. By Sauer's lemma,
$\log S(X_1^n, \mathcal{A}) \le V(X_1^n, \mathcal{A}) \log(n + 1)$.
$V$ is also self-bounding:
$\sum_{i=1}^n \left( V - V^{(i)} \right)^2 \le V$,
so by Efron-Stein, $\mathrm{Var}(V) \le E V$.

vapnik and chervonenkis
Alexey Chervonenkis and Vladimir Vapnik.

beyond the variance
$X_1, \dots, X_n$ are independent random variables taking values in some set $\mathcal{X}$. Let $f: \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \dots, X_n)$. Recall the Doob martingale representation:
$Z - EZ = \sum_{i=1}^n \Delta_i$, where $\Delta_i = E_i Z - E_{i-1} Z$,
with $E_i[\cdot] = E[\cdot \mid X_1, \dots, X_i]$. To get exponential inequalities, we bound the moment generating function $E e^{\lambda(Z - EZ)}$.

azuma's inequality
Suppose that the martingale differences are bounded: $|\Delta_i| \le c_i$. Then
$E e^{\lambda(Z - EZ)} = E e^{\lambda \sum_{i=1}^n \Delta_i} = E\left[ e^{\lambda \sum_{i=1}^{n-1} \Delta_i}\, E_{n-1} e^{\lambda \Delta_n} \right]$
$\le E\left[ e^{\lambda \sum_{i=1}^{n-1} \Delta_i} \right] e^{\lambda^2 c_n^2 / 2}$ (by Hoeffding)
$\le \cdots \le e^{\lambda^2 \left( \sum_{i=1}^n c_i^2 \right) / 2}$.
This is the Azuma-Hoeffding inequality for sums of bounded martingale differences.

bounded differences inequality
If $Z = f(X_1, \dots, X_n)$ and $f$ is such that
$|f(x_1, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)| \le c_i$,
then the martingale differences are bounded.
Bounded differences inequality: if $X_1, \dots, X_n$ are independent, then
$P\{|Z - EZ| > t\} \le 2 e^{-2t^2 / \sum_{i=1}^n c_i^2}$.
Also known as McDiarmid's inequality (Colin McDiarmid).
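
A quick empirical sketch of the inequality (illustrative choice of $f$, not from the talk): $Z$ is the number of empty bins when $n$ balls are thrown independently and uniformly into $n$ bins; changing one ball changes $Z$ by at most $1$, so $c_i = 1$ and the bound reads $2 e^{-2t^2/n}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100, 20000

def empty_bins(balls):
    # number of bins receiving no ball; changing one ball changes this by at most 1
    return n - np.unique(balls).size

Z = np.array([empty_bins(rng.integers(n, size=n)) for _ in range(reps)])
EZ = Z.mean()
for t in (5, 10, 15):
    empirical = np.mean(np.abs(Z - EZ) > t)
    bound = 2 * np.exp(-2 * t**2 / n)
    print(f"t={t:2d}  empirical P{{|Z-EZ|>t}} = {empirical:.4f}   McDiarmid bound = {bound:.4f}")
```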

hoeffding in a hilbert space
Let $X_1, \dots, X_n$ be independent zero-mean random variables in a separable Hilbert space such that $\|X_i\| \le c/2$, and denote $v = n c^2 / 4$. Then, for all $t \ge \sqrt{v}$,
$P\left\{ \left\| \sum_{i=1}^n X_i \right\| > t \right\} \le e^{-(t - \sqrt{v})^2 / (2v)}$.
Proof: By the triangle inequality, $\left\| \sum_{i=1}^n X_i \right\|$ has the bounded differences property with constants $c$, so
$P\left\{ \left\| \sum_{i=1}^n X_i \right\| > t \right\} = P\left\{ \left\| \sum_{i=1}^n X_i \right\| - E\left\| \sum_{i=1}^n X_i \right\| > t - E\left\| \sum_{i=1}^n X_i \right\| \right\} \le \exp\left( -\frac{\left( t - E\left\| \sum_{i=1}^n X_i \right\| \right)^2}{2v} \right)$.
Also,
$E\left\| \sum_{i=1}^n X_i \right\| \le \sqrt{ E\left\| \sum_{i=1}^n X_i \right\|^2 } = \sqrt{ \sum_{i=1}^n E \|X_i\|^2 } \le \sqrt{v}$.

bounded differences inequality
Easy to use. Distribution-free. Often close to optimal (e.g., for the $L_1$ error of the kernel density estimate). But it does not exploit "variance information" and is often too rigid: other methods are necessary.

shannon entropy
If $X, Y$ are random variables taking values in a set of size $N$,
$H(X) = -\sum_x p(x) \log p(x)$,
$H(X \mid Y) = H(X, Y) - H(Y) = -\sum_{x,y} p(x, y) \log p(x \mid y)$.
Then $H(X) \le \log N$ and $H(X \mid Y) \le H(X)$.
(Claude Shannon, 1916-2001)
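
A tiny numerical check of these definitions on an arbitrary $3 \times 3$ joint pmf (illustrative, not from the talk): it computes $H(X)$, $H(Y)$, $H(X,Y)$ and $H(X \mid Y) = H(X,Y) - H(Y)$, and verifies $H(X \mid Y) \le H(X) \le \log N$.

```python
import numpy as np

# an arbitrary 3x3 joint pmf p(x, y), chosen only for illustration
p = np.array([[0.10, 0.05, 0.05],
              [0.05, 0.30, 0.05],
              [0.05, 0.05, 0.30]])

def H(q):
    q = q[q > 0]
    return -np.sum(q * np.log(q))        # natural-log entropy

px, py = p.sum(axis=1), p.sum(axis=0)    # marginals of X and Y
H_X, H_Y, H_XY = H(px), H(py), H(p.ravel())
H_X_given_Y = H_XY - H_Y                 # H(X|Y) = H(X,Y) - H(Y)

print(f"H(X|Y) = {H_X_given_Y:.3f} <= H(X) = {H_X:.3f} <= log N = {np.log(3):.3f}")
```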

han's inequality
If $X = (X_1, \dots, X_n)$ and $X^{(i)} = (X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$, then
$\sum_{i=1}^n \left( H(X) - H(X^{(i)}) \right) \le H(X)$.
Proof: $H(X) = H(X^{(i)}) + H(X_i \mid X^{(i)}) \le H(X^{(i)}) + H(X_i \mid X_1, \dots, X_{i-1})$. Since $\sum_{i=1}^n H(X_i \mid X_1, \dots, X_{i-1}) = H(X)$, summing the inequality we get
$(n-1) H(X) \le \sum_{i=1}^n H(X^{(i)})$.
(Te Sun Han)
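
A numerical sanity check of Han's inequality (illustrative, not from the talk): for a randomly generated distribution of $X = (X_1, X_2, X_3)$ on $\{0,1\}^3$, the equivalent form $(n-1) H(X) \le \sum_i H(X^{(i)})$ is verified.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
p = rng.random(2 ** n)
p /= p.sum()                              # a random pmf on {0,1}^3
p = p.reshape((2,) * n)

def H(q):
    q = q[q > 0]
    return -np.sum(q * np.log(q))

H_X = H(p)
H_drop = [H(p.sum(axis=i)) for i in range(n)]   # entropies of X^(i): coordinate i marginalized out

print(f"(n-1) H(X) = {(n - 1) * H_X:.4f}  <=  sum_i H(X^(i)) = {sum(H_drop):.4f}")
```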

edge isoperimetric inequality on the hypercube
Let $A \subset \{-1, 1\}^n$. Let $E(A)$ be the collection of pairs $x, x' \in A$ such that $d_H(x, x') = 1$. Then
$|E(A)| \le \frac{|A|}{2} \log_2 |A|$.
Proof: Let $X = (X_1, \dots, X_n)$ be uniformly distributed over $A$. Then $p(x) = \mathbb{1}_{x \in A} / |A|$. Clearly, $H(X) = \log |A|$. Also,
$H(X) - H(X^{(i)}) = H(X_i \mid X^{(i)}) = -\sum_{x \in A} p(x) \log p(x_i \mid x^{(i)})$.
For $x \in A$,
$p(x_i \mid x^{(i)}) = 1/2$ if $\overline{x}^i \in A$, and $1$ otherwise,
where $\overline{x}^i = (x_1, \dots, x_{i-1}, -x_i, x_{i+1}, \dots, x_n)$ is $x$ with its $i$-th coordinate flipped.

Hence
$H(X) - H(X^{(i)}) = \frac{\log 2}{|A|} \sum_{x \in A} \mathbb{1}_{\overline{x}^i \in A}$,
and therefore
$\sum_{i=1}^n \left( H(X) - H(X^{(i)}) \right) = \frac{\log 2}{|A|} \sum_{x \in A} \sum_{i=1}^n \mathbb{1}_{\overline{x}^i \in A} = \frac{2 |E(A)| \log 2}{|A|}$.
Thus, by Han's inequality,
$\frac{2 |E(A)| \log 2}{|A|} \le H(X) = \log |A|$.

This is equivalent to the edge isoperimetric inequality on the hypercube: if
$\partial_E(A) = \left\{ (x, x') : x \in A,\ x' \in A^c,\ d_H(x, x') = 1 \right\}$
is the edge boundary of $A$, then
$|\partial_E(A)| \ge |A| \log_2 \frac{2^n}{|A|}$.
Equality is achieved for sub-cubes.
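
A brute-force check of both forms of the inequality (illustrative, not from the talk): $A$ is a random subset of $\{-1,1\}^8$, and the internal and boundary edges are counted directly.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n = 8
cube = [tuple(p) for p in itertools.product((-1, 1), repeat=n)]
A = set(p for p in cube if rng.random() < 0.2)      # a random subset of the hypercube

def flip(x, i):
    # flip the i-th coordinate of x
    return x[:i] + (-x[i],) + x[i + 1:]

internal = sum(flip(x, i) in A for x in A for i in range(n)) // 2   # each internal edge counted twice
boundary = sum(flip(x, i) not in A for x in A for i in range(n))
size = len(A)

print(f"|E(A)| = {internal}  <=  (|A|/2) log2|A| = {size / 2 * np.log2(size):.1f}")
print(f"|boundary(A)| = {boundary}  >=  |A| log2(2^n/|A|) = {size * np.log2(2**n / size):.1f}")
```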

VC entropy is self-bounding
Let $\mathcal{A}$ be a class of subsets of $\mathcal{X}$ and $x = (x_1, \dots, x_n) \in \mathcal{X}^n$. Recall that $S(x, \mathcal{A})$ is the number of different sets of the form
$\{x_1, \dots, x_n\} \cap A$, $A \in \mathcal{A}$.
Let $f_n(x) = \log_2 S(x, \mathcal{A})$ be the VC entropy. Then
$0 \le f_n(x) - f_{n-1}(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) \le 1$
and
$\sum_{i=1}^n \left( f_n(x) - f_{n-1}(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) \right) \le f_n(x)$.
Proof: Put the uniform distribution on the class of sets $\{x_1, \dots, x_n\} \cap A$ and use Han's inequality.
Corollary: if $X_1, \dots, X_n$ are independent, then
$\mathrm{Var}(\log_2 S(X, \mathcal{A})) \le E \log_2 S(X, \mathcal{A})$.

subadditivity of entropy
The entropy of a random variable $Z \ge 0$ is
$\mathrm{Ent}(Z) = E \Phi(Z) - \Phi(EZ)$, where $\Phi(x) = x \log x$.
By Jensen's inequality, $\mathrm{Ent}(Z) \ge 0$.
Han's inequality implies the following sub-additivity property. Let $X_1, \dots, X_n$ be independent and let $Z = f(X_1, \dots, X_n)$, where $f \ge 0$. Denote
$\mathrm{Ent}^{(i)}(Z) = E^{(i)} \Phi(Z) - \Phi(E^{(i)} Z)$,
where $E^{(i)}$ denotes expectation with respect to $X_i$ only, with the other variables held fixed. Then
$\mathrm{Ent}(Z) \le E \sum_{i=1}^n \mathrm{Ent}^{(i)}(Z)$.

a logarithmic sobolev inequality on the hypercube
Let $X = (X_1, \dots, X_n)$ be uniformly distributed over $\{-1, 1\}^n$. If $f: \{-1, 1\}^n \to \mathbb{R}$ and $Z = f(X)$, then
$\mathrm{Ent}(Z^2) \le \frac{1}{2} E \sum_{i=1}^n (Z - Z_i')^2$,
where $Z_i' = f(X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_n)$ and $X_i'$ is an independent copy of $X_i$. The proof uses subadditivity of the entropy and calculus for the case $n = 1$. Implies Efron-Stein.

herbst's argument: exponential concentration
If $f: \{-1, 1\}^n \to \mathbb{R}$, the log-Sobolev inequality may be used with $g(x) = e^{\lambda f(x)/2}$, where $\lambda \in \mathbb{R}$. If $F(\lambda) = E e^{\lambda Z}$ is the moment generating function of $Z = f(X)$,
$\mathrm{Ent}(g(X)^2) = \lambda E\left[ Z e^{\lambda Z} \right] - E\left[ e^{\lambda Z} \right] \log E\left[ e^{\lambda Z} \right] = \lambda F'(\lambda) - F(\lambda) \log F(\lambda)$.
Differential inequalities are obtained for $F(\lambda)$.

herbst's argument
As an example, suppose $f$ is such that $\sum_{i=1}^n (Z - Z_i')_+^2 \le v$. Then by the log-Sobolev inequality,
$\lambda F'(\lambda) - F(\lambda) \log F(\lambda) \le \frac{v \lambda^2}{4} F(\lambda)$.
If $G(\lambda) = \log F(\lambda)$, this becomes
$\left( \frac{G(\lambda)}{\lambda} \right)' \le \frac{v}{4}$.
This can be integrated: $G(\lambda) \le \lambda EZ + \lambda^2 v / 4$, so
$F(\lambda) \le e^{\lambda EZ + \lambda^2 v / 4}$.
This implies
$P\{Z > EZ + t\} \le e^{-t^2 / v}$.
Stronger than the bounded differences inequality!

gaussian log-sobolev inequality
Let $X = (X_1, \dots, X_n)$ be a vector of i.i.d. standard normal random variables. If $f: \mathbb{R}^n \to \mathbb{R}$ and $Z = f(X)$,
$\mathrm{Ent}(Z^2) \le 2 E\left[ \|\nabla f(X)\|^2 \right]$
(Gross, 1975).
Proof sketch: By the subadditivity of entropy, it suffices to prove it for $n = 1$. Approximate $Z = f(X)$ by
$f\left( \frac{1}{\sqrt{m}} \sum_{i=1}^m \varepsilon_i \right)$,
where the $\varepsilon_i$ are i.i.d. Rademacher random variables. Use the log-Sobolev inequality of the hypercube and the central limit theorem.

gaussian concentration inequality
Herbst's argument may now be repeated. Suppose $f$ is Lipschitz: for all $x, y \in \mathbb{R}^n$,
$|f(x) - f(y)| \le L \|x - y\|$.
Then, for all $t > 0$,
$P\{f(X) - E f(X) \ge t\} \le e^{-t^2 / (2L^2)}$
(Tsirelson, Ibragimov, and Sudakov, 1976).
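
An empirical sketch of this bound (illustrative choice of $f$, not from the talk): $f(x) = \|x\|$ is Lipschitz with $L = 1$, so the upper tail of $f(X) - E f(X)$ for a standard Gaussian vector should be dominated by $e^{-t^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 50, 100000
Z = np.linalg.norm(rng.standard_normal((reps, n)), axis=1)   # f(X) = ||X||, Lipschitz with L = 1
EZ = Z.mean()
for t in (0.5, 1.0, 1.5, 2.0):
    empirical = np.mean(Z - EZ >= t)
    print(f"t={t:.1f}  empirical P{{f(X)-Ef(X)>=t}} = {empirical:.5f}"
          f"   bound exp(-t^2/2) = {np.exp(-t**2 / 2):.5f}")
```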

an application: supremum of a gaussian process
Let $(X_t)_{t \in \mathcal{T}}$ be an almost surely continuous centered Gaussian process. Let $Z = \sup_{t \in \mathcal{T}} X_t$. If
$\sigma^2 = \sup_{t \in \mathcal{T}} E\left[ X_t^2 \right]$,
then
$P\{|Z - EZ| \ge u\} \le 2 e^{-u^2 / (2\sigma^2)}$.
Proof: We may assume $\mathcal{T} = \{1, \dots, n\}$. Let $\Gamma$ be the covariance matrix of $X = (X_1, \dots, X_n)$ and let $A = \Gamma^{1/2}$. If $Y$ is a standard normal vector, then
$\max_{i=1,\dots,n} (AY)_i \stackrel{\mathrm{distr.}}{=} \max_{i=1,\dots,n} X_i$.
By Cauchy-Schwarz,
$|(Au)_i - (Av)_i| = \left| \sum_j A_{i,j} (u_j - v_j) \right| \le \left( \sum_j A_{i,j}^2 \right)^{1/2} \|u - v\| \le \sigma \|u - v\|$.

beyond bernoulli and gaussian: the entropy method
For general distributions, logarithmic Sobolev inequalities are not available. Solution: modified logarithmic Sobolev inequalities. Suppose $X_1, \dots, X_n$ are independent. Let $Z = f(X_1, \dots, X_n)$ and $Z_i = f_i(X^{(i)}) = f_i(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$. Let $\phi(x) = e^x - x - 1$. Then for all $\lambda \in \mathbb{R}$,
$\lambda E\left[ Z e^{\lambda Z} \right] - E\left[ e^{\lambda Z} \right] \log E\left[ e^{\lambda Z} \right] \le \sum_{i=1}^n E\left[ e^{\lambda Z} \phi\left( -\lambda (Z - Z_i) \right) \right]$.
(Michel Ledoux)

the entropy method
Define $Z_i = \inf_{x_i'} f(X_1, \dots, x_i', \dots, X_n)$ and suppose
$\sum_{i=1}^n (Z - Z_i)^2 \le v$.
Then for all $t > 0$,
$P\{Z - EZ > t\} \le e^{-t^2 / (2v)}$.
This implies the bounded differences inequality and much more.

example: the largest eigenvalue of a symmetric matrix
Let $A = (X_{i,j})_{n \times n}$ be symmetric, with the $X_{i,j}$, $i \le j$, independent and $|X_{i,j}| \le 1$. Let
$Z = \lambda_1 = \sup_{u: \|u\| = 1} u^T A u$,
and suppose $v$ is such that $Z = v^T A v$. Let $A_{i,j}'$ be obtained by replacing $X_{i,j}$ by $x_{i,j}'$. Then
$(Z - Z_{i,j})_+ \le \left( v^T A v - v^T A_{i,j}' v \right) \mathbb{1}_{Z > Z_{i,j}} = \left( v^T (A - A_{i,j}') v \right) \mathbb{1}_{Z > Z_{i,j}} \le 2 \left( v_i v_j (X_{i,j} - X_{i,j}') \right)_+ \le 4 |v_i v_j|$.
Therefore,
$\sum_{1 \le i \le j \le n} (Z - Z_{i,j}')_+^2 \le \sum_{1 \le i \le j \le n} 16 |v_i v_j|^2 \le 16 \left( \sum_{i=1}^n v_i^2 \right)^2 = 16$.
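
A simulation sketch of this example (illustrative sizes, not from the talk): for symmetric matrices with independent random sign entries, the Efron-Stein bound implied by the computation above gives $\mathrm{Var}(\lambda_1) \le 16$ independently of $n$, while $\lambda_1$ itself grows like $2\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(7)
reps = 300
for n in (20, 50, 100):
    lam1 = np.empty(reps)
    for r in range(reps):
        M = rng.choice([-1.0, 1.0], size=(n, n))
        M = np.triu(M) + np.triu(M, 1).T           # symmetric, independent entries for i <= j
        lam1[r] = np.linalg.eigvalsh(M)[-1]        # largest eigenvalue
    print(f"n={n:3d}  E lambda_1 ≈ {lam1.mean():6.2f}   Var(lambda_1) ≈ {lam1.var():.2f}   (bound: 16)")
```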

example: convex lipschitz functions
Let $f: [0,1]^n \to \mathbb{R}$ be a convex function. Let $Z_i = \inf_{x_i'} f(X_1, \dots, x_i', \dots, X_n)$ and let $X_i'$ be the value of $x_i'$ for which the minimum is achieved. Then, writing $X^{(i)} = (X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_n)$,
$\sum_{i=1}^n (Z - Z_i)^2 = \sum_{i=1}^n \left( f(X) - f(X^{(i)}) \right)^2 \le \sum_{i=1}^n \left( \frac{\partial f}{\partial x_i}(X) \right)^2 (X_i - X_i')^2$ (by convexity)
$\le \sum_{i=1}^n \left( \frac{\partial f}{\partial x_i}(X) \right)^2 = \|\nabla f(X)\|^2 \le L^2$.

convex lipschitz functions
If $f: [0,1]^n \to \mathbb{R}$ is a convex Lipschitz function and $X_1, \dots, X_n$ are independent random variables taking values in $[0,1]$, then $Z = f(X_1, \dots, X_n)$ satisfies
$P\{Z > EZ + t\} \le e^{-t^2 / (2L^2)}$.
A similar lower tail bound also holds.

self-bounding functions
Suppose $Z$ satisfies
$0 \le Z - Z_i \le 1$ and $\sum_{i=1}^n (Z - Z_i) \le Z$.
Recall that $\mathrm{Var}(Z) \le EZ$. We have much more:
$P\{Z > EZ + t\} \le e^{-t^2 / (2EZ + 2t/3)}$ and $P\{Z < EZ - t\} \le e^{-t^2 / (2EZ)}$.
Rademacher averages, the random VC dimension, the random VC entropy, and the length of the longest increasing subsequence in a random permutation are all examples of self-bounding functions. So are configuration functions.
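
A quick empirical look at one of these examples (illustrative, not from the talk): the length $Z$ of the longest increasing subsequence of a random permutation is self-bounding, so $\mathrm{Var}(Z) \le EZ$; the simulation compares the two.

```python
import bisect
import numpy as np

rng = np.random.default_rng(8)
n, reps = 500, 2000

def lis_length(perm):
    # patience-sorting computation of the length of the longest increasing subsequence
    tails = []
    for v in perm:
        j = bisect.bisect_left(tails, v)
        if j == len(tails):
            tails.append(v)
        else:
            tails[j] = v
    return len(tails)

Z = np.array([lis_length(rng.permutation(n)) for _ in range(reps)])
print(f"E Z ≈ {Z.mean():.1f}   Var(Z) ≈ {Z.var():.1f}   (self-bounding bound: Var(Z) <= E Z)")
```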

exponential efron-stein inequality
Define
$V_+ = E'\left[ \sum_{i=1}^n (Z - Z_i')_+^2 \right]$ and $V_- = E'\left[ \sum_{i=1}^n (Z - Z_i')_-^2 \right]$.
By Efron-Stein,
$\mathrm{Var}(Z) \le E V_+$ and $\mathrm{Var}(Z) \le E V_-$.
The following exponential versions hold for all $\lambda, \theta > 0$ with $\lambda \theta < 1$:
$\log E e^{\lambda (Z - EZ)} \le \frac{\lambda \theta}{1 - \lambda \theta} \log E e^{\lambda V_+ / \theta}$.
If also $Z_i' - Z \le 1$ for every $i$, then for all $\lambda \in (0, 1/2)$,
$\log E e^{\lambda (Z - EZ)} \le \frac{2\lambda}{1 - 2\lambda} \log E e^{\lambda V_-}$.

weakly self-bounding functions
$f: \mathcal{X}^n \to [0, \infty)$ is weakly $(a, b)$-self-bounding if there exist $f_i: \mathcal{X}^{n-1} \to [0, \infty)$ such that for all $x \in \mathcal{X}^n$,
$\sum_{i=1}^n \left( f(x) - f_i(x^{(i)}) \right)^2 \le a f(x) + b$.
Then
$P\{Z \ge EZ + t\} \le \exp\left( -\frac{t^2}{2\left( a EZ + b + a t / 2 \right)} \right)$.
If, in addition, $f(x) - f_i(x^{(i)}) \le 1$, then for $0 < t \le EZ$,
$P\{Z \le EZ - t\} \le \exp\left( -\frac{t^2}{2\left( a EZ + b + c_- t \right)} \right)$,
where $c_- = \max(-c, 0)$ and $c = (3a - 1)/6$.

the isoperimetric view
Let $X = (X_1, \dots, X_n)$ have independent components, taking values in $\mathcal{X}^n$. Let $A \subset \mathcal{X}^n$. The Hamming distance of $X$ to $A$ is
$d(X, A) = \min_{y \in A} d(X, y) = \min_{y \in A} \sum_{i=1}^n \mathbb{1}_{X_i \ne y_i}$.
(Michel Talagrand)
$P\left\{ d(X, A) \ge t + \sqrt{\frac{n}{2} \log \frac{1}{P[A]}} \right\} \le e^{-2t^2 / n}$.
Concentration of measure!

the isoperimetric view
Proof: By the bounded differences inequality,
$P\{E d(X, A) - d(X, A) \ge t\} \le e^{-2t^2 / n}$.
Taking $t = E d(X, A)$, the left-hand side is at least $P\{d(X, A) \le 0\} = P\{A\}$, so we get
$E d(X, A) \le \sqrt{\frac{n}{2} \log \frac{1}{P\{A\}}}$.
By the bounded differences inequality again,
$P\left\{ d(X, A) \ge t + \sqrt{\frac{n}{2} \log \frac{1}{P\{A\}}} \right\} \le e^{-2t^2 / n}$.
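
A small simulation of the concentration-of-measure statement (illustrative choices, not from the talk): $X$ is uniform on $\{0,1\}^n$ and $A = \{x : \sum_i x_i \le n/2\}$, for which $d(X, A) = (\sum_i X_i - \lfloor n/2 \rfloor)_+$ has a simple closed form.

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps = 100, 200000
S = rng.binomial(n, 0.5, size=reps)          # sum of the coordinates of X

# A = {x : sum_i x_i <= n/2};  d(X, A) = (sum_i X_i - floor(n/2))_+
d = np.clip(S - n // 2, 0, None)
PA = np.mean(S <= n // 2)
shift = np.sqrt(n / 2 * np.log(1 / PA))

for t in (3, 5, 7):
    empirical = np.mean(d >= t + shift)
    print(f"t={t}  empirical P{{d(X,A) >= t + sqrt((n/2)log(1/P(A)))}} = {empirical:.5f}"
          f"   bound exp(-2t^2/n) = {np.exp(-2 * t**2 / n):.5f}")
```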

talagrand's convex distance
The weighted Hamming distance is
$d_\alpha(x, A) = \inf_{y \in A} d_\alpha(x, y) = \inf_{y \in A} \sum_{i: x_i \ne y_i} |\alpha_i|$,
where $\alpha = (\alpha_1, \dots, \alpha_n)$. The same argument as before gives
$P\left\{ d_\alpha(X, A) \ge t + \sqrt{\frac{\|\alpha\|^2}{2} \log \frac{1}{P\{A\}}} \right\} \le e^{-2t^2 / \|\alpha\|^2}$.
This implies
$\sup_{\alpha: \|\alpha\| = 1} \min\left( P\{A\},\ P\{d_\alpha(X, A) \ge t\} \right) \le e^{-t^2 / 2}$.

convex distance inequality
Convex distance:
$d_T(x, A) = \sup_{\alpha \in [0, \infty)^n: \|\alpha\| = 1} d_\alpha(x, A)$.
Talagrand's convex distance inequality:
$P\{A\}\, P\{d_T(X, A) \ge t\} \le e^{-t^2 / 4}$.
Follows from the fact that $d_T(X, A)^2$ is $(4, 0)$ weakly self-bounding (by a saddle point representation of $d_T$). Talagrand's original proof was different.

convex lipschitz functions
For $A \subset [0,1]^n$ and $x \in [0,1]^n$, define $D(x, A) = \inf_{y \in A} \|x - y\|$. If $A$ is convex, then $D(x, A) \le d_T(x, A)$.
Proof: Let $\mathcal{M}(A)$ denote the set of probability measures on $A$. Then
$D(x, A) = \inf_{\nu \in \mathcal{M}(A)} \|x - E_\nu Y\|$ (since $A$ is convex)
$\le \inf_{\nu \in \mathcal{M}(A)} \sqrt{ \sum_{j=1}^n \left( E_\nu \mathbb{1}_{x_j \ne Y_j} \right)^2 }$ (since $x_j, Y_j \in [0,1]$)
$= \inf_{\nu \in \mathcal{M}(A)} \sup_{\alpha: \|\alpha\| \le 1} \sum_{j=1}^n \alpha_j E_\nu \mathbb{1}_{x_j \ne Y_j}$ (by Cauchy-Schwarz)
$= d_T(x, A)$ (by a minimax theorem).

John von Neumann (1903–1957)

Sergei Lvovich Sobolev (1908–1989)
