Lectures on learning theory
Gábor Lugosi
ICREA and Pompeu Fabra University, Barcelona

what is learning theory? A mathematical theory to understand the behavior of learning algorithms and assist their design.


  1. johnson-lindenstrauss
  Suppose $A = \{a_1, \dots, a_n\} \subset \mathbb{R}^D$ is a finite set, where $D$ is large. We would like to embed $A$ in $\mathbb{R}^d$ where $d \ll D$. Is this possible? In what sense?
  Given $\varepsilon > 0$, a function $f : \mathbb{R}^D \to \mathbb{R}^d$ is an $\varepsilon$-isometry if for all $a, a' \in A$,
  $(1 - \varepsilon)\, \|a - a'\|^2 \le \|f(a) - f(a')\|^2 \le (1 + \varepsilon)\, \|a - a'\|^2 .$
  Johnson-Lindenstrauss lemma: if $d \ge (c/\varepsilon^2) \log n$, then there exists an $\varepsilon$-isometry $f : \mathbb{R}^D \to \mathbb{R}^d$. Independent of $D$!

  2. random projections We take f to be linear. How? At random!

  3. random projections
  We take $f$ to be linear. How? At random! Let $f = (W_{i,j})_{d \times D}$ with
  $W_{i,j} = \frac{1}{\sqrt{d}}\, X_{i,j},$
  where the $X_{i,j}$ are independent standard normal.

  4. random projections
  We take $f$ to be linear. How? At random! Let $f = (W_{i,j})_{d \times D}$ with $W_{i,j} = \frac{1}{\sqrt{d}} X_{i,j}$, where the $X_{i,j}$ are independent standard normal. For any $a = (\alpha_1, \dots, \alpha_D) \in \mathbb{R}^D$,
  $\mathbb{E}\, \|f(a)\|^2 = \frac{1}{d} \sum_{i=1}^d \sum_{j=1}^D \alpha_j^2\, \mathbb{E} X_{i,j}^2 = \|a\|^2 .$
  The expected squared distances are preserved!

  5. random projections
  We take $f$ to be linear. How? At random! Let $f = (W_{i,j})_{d \times D}$ with $W_{i,j} = \frac{1}{\sqrt{d}} X_{i,j}$, where the $X_{i,j}$ are independent standard normal. For any $a = (\alpha_1, \dots, \alpha_D) \in \mathbb{R}^D$,
  $\mathbb{E}\, \|f(a)\|^2 = \frac{1}{d} \sum_{i=1}^d \sum_{j=1}^D \alpha_j^2\, \mathbb{E} X_{i,j}^2 = \|a\|^2 .$
  The expected squared distances are preserved! $\|f(a)\|^2 / \|a\|^2$ is a weighted sum of squared normals.
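A minimal numerical sketch (not part of the slides) of the random projection just described: project a finite point set with an i.i.d. $N(0, 1/d)$ matrix and report the worst pairwise distortion of squared distances. The constant $c = 8$ and the choices of $n$, $D$, $\varepsilon$, $\delta$ are illustrative assumptions, since the slide leaves $c$ unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)

# Project n points from R^D to R^d with W_{ij} = X_{ij} / sqrt(d),
# X_{ij} i.i.d. standard normal, then check the epsilon-isometry property.
n, D, eps, delta = 50, 10_000, 0.5, 0.05
d = int(np.ceil(8 * np.log(n / delta) / eps**2))   # c = 8 is an illustrative constant

A = rng.normal(size=(n, D))                # the finite point set, rows are the a_i
W = rng.normal(size=(d, D)) / np.sqrt(d)   # the random linear map f
FA = A @ W.T                               # images f(a_i) in R^d

worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        orig = np.sum((A[i] - A[j]) ** 2)
        proj = np.sum((FA[i] - FA[j]) ** 2)
        worst = max(worst, abs(proj / orig - 1.0))

print(f"d = {d}, worst relative distortion of squared distances: {worst:.3f}")
```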

  6. random projections
  Let $b = a_i - a_j$ for some $a_i, a_j \in A$. Then
  $\mathbb{P}\Bigl\{ \exists b : \Bigl| \frac{\|f(b)\|^2}{\|b\|^2} - 1 \Bigr| > \sqrt{\tfrac{8 \log(n/\delta)}{d}} + \tfrac{8 \log(n/\delta)}{d} \Bigr\} \le \binom{n}{2}\, \mathbb{P}\Bigl\{ \Bigl| \frac{\|f(b)\|^2}{\|b\|^2} - 1 \Bigr| > \sqrt{\tfrac{8 \log(n/\delta)}{d}} + \tfrac{8 \log(n/\delta)}{d} \Bigr\} \le \delta$ (by a Bernstein-type inequality).
  If $d \ge (c/\varepsilon^2) \log(n/\delta)$, then
  $\sqrt{\tfrac{8 \log(n/\delta)}{d}} + \tfrac{8 \log(n/\delta)}{d} \le \varepsilon$
  and $f$ is an $\varepsilon$-isometry with probability $\ge 1 - \delta$.

  7. martingale representation
  $X_1, \dots, X_n$ are independent random variables taking values in some set $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \dots, X_n)$. Denote $\mathbb{E}_i[\cdot] = \mathbb{E}[\cdot \mid X_1, \dots, X_i]$. Thus, $\mathbb{E}_0 Z = \mathbb{E} Z$ and $\mathbb{E}_n Z = Z$.

  8. martingale representation
  $X_1, \dots, X_n$ are independent random variables taking values in some set $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \dots, X_n)$. Denote $\mathbb{E}_i[\cdot] = \mathbb{E}[\cdot \mid X_1, \dots, X_i]$. Thus, $\mathbb{E}_0 Z = \mathbb{E} Z$ and $\mathbb{E}_n Z = Z$. Writing $\Delta_i = \mathbb{E}_i Z - \mathbb{E}_{i-1} Z$, we have
  $Z - \mathbb{E} Z = \sum_{i=1}^n \Delta_i .$
  This is the Doob martingale representation of $Z$.

  9. martingale representation
  $X_1, \dots, X_n$ are independent random variables taking values in some set $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \dots, X_n)$. Denote $\mathbb{E}_i[\cdot] = \mathbb{E}[\cdot \mid X_1, \dots, X_i]$. Thus, $\mathbb{E}_0 Z = \mathbb{E} Z$ and $\mathbb{E}_n Z = Z$. Writing $\Delta_i = \mathbb{E}_i Z - \mathbb{E}_{i-1} Z$, we have
  $Z - \mathbb{E} Z = \sum_{i=1}^n \Delta_i .$
  This is the Doob martingale representation of $Z$. Joseph Leo Doob (1910–2004)

  10. martingale representation: the variance
  $\mathrm{Var}(Z) = \mathbb{E}\Bigl[ \Bigl( \sum_{i=1}^n \Delta_i \Bigr)^2 \Bigr] = \mathbb{E}\Bigl[ \sum_{i=1}^n \Delta_i^2 \Bigr] + 2 \sum_{i=1}^n \sum_{j > i} \mathbb{E}[\Delta_i \Delta_j] .$
  Now if $j > i$, $\mathbb{E}_i \Delta_j = 0$, so
  $\mathbb{E}_i[\Delta_j \Delta_i] = \Delta_i\, \mathbb{E}_i \Delta_j = 0 .$
  We obtain
  $\mathrm{Var}(Z) = \mathbb{E}\Bigl[ \Bigl( \sum_{i=1}^n \Delta_i \Bigr)^2 \Bigr] = \mathbb{E}\Bigl[ \sum_{i=1}^n \Delta_i^2 \Bigr] .$

  11. martingale representation: the variance
  $\mathrm{Var}(Z) = \mathbb{E}\Bigl[ \Bigl( \sum_{i=1}^n \Delta_i \Bigr)^2 \Bigr] = \mathbb{E}\Bigl[ \sum_{i=1}^n \Delta_i^2 \Bigr] + 2 \sum_{i=1}^n \sum_{j > i} \mathbb{E}[\Delta_i \Delta_j] .$
  Now if $j > i$, $\mathbb{E}_i \Delta_j = 0$, so
  $\mathbb{E}_i[\Delta_j \Delta_i] = \Delta_i\, \mathbb{E}_i \Delta_j = 0 .$
  We obtain
  $\mathrm{Var}(Z) = \mathbb{E}\Bigl[ \Bigl( \sum_{i=1}^n \Delta_i \Bigr)^2 \Bigr] = \mathbb{E}\Bigl[ \sum_{i=1}^n \Delta_i^2 \Bigr] .$
  From this, using independence, it is easy to derive the Efron-Stein inequality.

  12. efron-stein inequality (1981)
  Let $X_1, \dots, X_n$ be independent random variables taking values in $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \dots, X_n)$. Then
  $\mathrm{Var}(Z) \le \mathbb{E} \sum_{i=1}^n \bigl( Z - \mathbb{E}^{(i)} Z \bigr)^2 = \mathbb{E} \sum_{i=1}^n \mathrm{Var}^{(i)}(Z),$
  where $\mathbb{E}^{(i)} Z$ is expectation with respect to the $i$-th variable $X_i$ only.

  13. efron-stein inequality (1981)
  Let $X_1, \dots, X_n$ be independent random variables taking values in $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \dots, X_n)$. Then
  $\mathrm{Var}(Z) \le \mathbb{E} \sum_{i=1}^n \bigl( Z - \mathbb{E}^{(i)} Z \bigr)^2 = \mathbb{E} \sum_{i=1}^n \mathrm{Var}^{(i)}(Z),$
  where $\mathbb{E}^{(i)} Z$ is expectation with respect to the $i$-th variable $X_i$ only. We obtain more useful forms by using that
  $\mathrm{Var}(X) = \tfrac{1}{2} \mathbb{E}(X - X')^2$ and $\mathrm{Var}(X) \le \mathbb{E}(X - a)^2$ for any constant $a$.

  14. efron-stein inequality (1981)
  If $X_1', \dots, X_n'$ are independent copies of $X_1, \dots, X_n$, and $Z_i' = f(X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_n)$, then
  $\mathrm{Var}(Z) \le \frac{1}{2}\, \mathbb{E} \sum_{i=1}^n (Z - Z_i')^2 .$
  $Z$ is concentrated if it doesn't depend too much on any of its variables.

  15. efron-stein inequality (1981)
  If $X_1', \dots, X_n'$ are independent copies of $X_1, \dots, X_n$, and $Z_i' = f(X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_n)$, then
  $\mathrm{Var}(Z) \le \frac{1}{2}\, \mathbb{E} \sum_{i=1}^n (Z - Z_i')^2 .$
  $Z$ is concentrated if it doesn't depend too much on any of its variables. If $Z = \sum_{i=1}^n X_i$ then we have an equality. Sums are the "least concentrated" of all functions!
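A small Monte Carlo sketch (not part of the slides) comparing $\mathrm{Var}(Z)$ with the Efron-Stein bound $\frac{1}{2}\mathbb{E}\sum_i (Z - Z_i')^2$, using the illustrative choice $Z = \max(X_1, \dots, X_n)$ with $X_i$ i.i.d. Uniform$(0,1)$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Z = max of n uniforms; Z_i' replaces coordinate i by an independent copy.
n, reps = 20, 200_000
X = rng.random((reps, n))
Z = X.max(axis=1)

Xp = rng.random((reps, n))            # independent copies X_i'
es_sum = np.zeros(reps)
for i in range(n):
    Xi = X.copy()
    Xi[:, i] = Xp[:, i]               # replace coordinate i only
    Zi = Xi.max(axis=1)
    es_sum += (Z - Zi) ** 2

print("Var(Z)            ~", Z.var())
print("Efron-Stein bound ~", 0.5 * es_sum.mean())
```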

  16. efron-stein inequality (1981)
  If for some arbitrary functions $f_i$,
  $Z_i = f_i(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n),$
  then
  $\mathrm{Var}(Z) \le \mathbb{E} \sum_{i=1}^n (Z - Z_i)^2 .$

  17. efron, stein, and steele Mike Steele Charles Stein Bradley Efron

  18. example: uniform deviations
  Let $\mathcal{A}$ be a collection of subsets of $\mathcal{X}$, and let $X_1, \dots, X_n$ be $n$ random points in $\mathcal{X}$ drawn i.i.d. Let
  $P(A) = \mathbb{P}\{X_1 \in A\}$ and $P_n(A) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{X_i \in A} .$
  If $Z = \sup_{A \in \mathcal{A}} |P(A) - P_n(A)|$, then
  $\mathrm{Var}(Z) \le \frac{1}{2n} .$

  19. example: uniform deviations
  Let $\mathcal{A}$ be a collection of subsets of $\mathcal{X}$, and let $X_1, \dots, X_n$ be $n$ random points in $\mathcal{X}$ drawn i.i.d. Let
  $P(A) = \mathbb{P}\{X_1 \in A\}$ and $P_n(A) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{X_i \in A} .$
  If $Z = \sup_{A \in \mathcal{A}} |P(A) - P_n(A)|$, then
  $\mathrm{Var}(Z) \le \frac{1}{2n},$
  regardless of the distribution and the richness of $\mathcal{A}$.
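An illustrative sketch (not part of the slides): take $\mathcal{A}$ to be the half-lines $(-\infty, t]$ and $X_i \sim \mathrm{Uniform}(0,1)$, so $Z$ is the Kolmogorov-Smirnov statistic, and check $\mathrm{Var}(Z) \le 1/(2n)$ by simulation.

```python
import numpy as np

rng = np.random.default_rng(2)

n, reps = 100, 20_000

def ks_statistic(x):
    """sup_t |F_n(t) - t| for a Uniform(0,1) sample x."""
    xs = np.sort(x)
    i = np.arange(1, len(xs) + 1)
    return max(np.max(i / len(xs) - xs), np.max(xs - (i - 1) / len(xs)))

Z = np.array([ks_statistic(rng.random(n)) for _ in range(reps)])
print("Var(Z) ~", Z.var(), "  bound 1/(2n) =", 1 / (2 * n))
```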

  20. example: kernel density estimation
  Let $X_1, \dots, X_n$ be i.i.d. real samples drawn according to some density $\phi$. The kernel density estimate is
  $\phi_n(x) = \frac{1}{nh} \sum_{i=1}^n K\Bigl( \frac{x - X_i}{h} \Bigr),$
  where $h > 0$, and $K$ is a nonnegative "kernel" with $\int K = 1$. The $L_1$ error is
  $Z = f(X_1, \dots, X_n) = \int |\phi(x) - \phi_n(x)|\, dx .$

  21. example: kernel density estimation
  Let $X_1, \dots, X_n$ be i.i.d. real samples drawn according to some density $\phi$. The kernel density estimate is
  $\phi_n(x) = \frac{1}{nh} \sum_{i=1}^n K\Bigl( \frac{x - X_i}{h} \Bigr),$
  where $h > 0$, and $K$ is a nonnegative "kernel" with $\int K = 1$. The $L_1$ error is
  $Z = f(X_1, \dots, X_n) = \int |\phi(x) - \phi_n(x)|\, dx .$
  It is easy to see that
  $|f(x_1, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)| \le \frac{1}{nh} \int \Bigl| K\Bigl( \frac{x - x_i}{h} \Bigr) - K\Bigl( \frac{x - x_i'}{h} \Bigr) \Bigr|\, dx \le \frac{2}{n},$
  so we get $\mathrm{Var}(Z) \le \frac{2}{n}$.
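A rough numerical sketch (not part of the slides): the $L_1$ error of a Gaussian-kernel density estimate for standard normal data, with $\mathrm{Var}(Z)$ compared to the bound $2/n$. The bandwidth, the Gaussian kernel, and the grid approximation of the integral are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

n, h, reps = 100, 0.3, 2_000
grid = np.linspace(-5, 5, 1_001)
dx = grid[1] - grid[0]
true_density = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)

def l1_error(sample):
    # phi_n(x) = (1/(n h)) sum_i K((x - X_i)/h) with a Gaussian kernel K
    u = (grid[None, :] - sample[:, None]) / h
    phi_n = np.exp(-u**2 / 2).sum(axis=0) / (n * h * np.sqrt(2 * np.pi))
    return np.sum(np.abs(true_density - phi_n)) * dx

Z = np.array([l1_error(rng.normal(size=n)) for _ in range(reps)])
print("Var(Z) ~", Z.var(), "  bound 2/n =", 2 / n)
```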

  22. bounding the expectation
  Let $P_n'(A) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{X_i' \in A}$ and let $\mathbb{E}'$ denote expectation only with respect to $X_1', \dots, X_n'$. Then
  $\mathbb{E} \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| = \mathbb{E} \sup_{A \in \mathcal{A}} \bigl| \mathbb{E}'[P_n(A) - P_n'(A)] \bigr| \le \mathbb{E} \sup_{A \in \mathcal{A}} |P_n(A) - P_n'(A)| = \frac{1}{n}\, \mathbb{E} \sup_{A \in \mathcal{A}} \Bigl| \sum_{i=1}^n \bigl( \mathbb{1}_{X_i \in A} - \mathbb{1}_{X_i' \in A} \bigr) \Bigr| .$

  23. bounding the expectation
  Let $P_n'(A) = \frac{1}{n} \sum_{i=1}^n \mathbb{1}_{X_i' \in A}$ and let $\mathbb{E}'$ denote expectation only with respect to $X_1', \dots, X_n'$. Then
  $\mathbb{E} \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| = \mathbb{E} \sup_{A \in \mathcal{A}} \bigl| \mathbb{E}'[P_n(A) - P_n'(A)] \bigr| \le \mathbb{E} \sup_{A \in \mathcal{A}} |P_n(A) - P_n'(A)| = \frac{1}{n}\, \mathbb{E} \sup_{A \in \mathcal{A}} \Bigl| \sum_{i=1}^n \bigl( \mathbb{1}_{X_i \in A} - \mathbb{1}_{X_i' \in A} \bigr) \Bigr| .$
  Second symmetrization: if $\varepsilon_1, \dots, \varepsilon_n$ are independent Rademacher variables, then
  $\frac{1}{n}\, \mathbb{E} \sup_{A \in \mathcal{A}} \Bigl| \sum_{i=1}^n \bigl( \mathbb{1}_{X_i \in A} - \mathbb{1}_{X_i' \in A} \bigr) \Bigr| = \frac{1}{n}\, \mathbb{E} \sup_{A \in \mathcal{A}} \Bigl| \sum_{i=1}^n \varepsilon_i \bigl( \mathbb{1}_{X_i \in A} - \mathbb{1}_{X_i' \in A} \bigr) \Bigr| \le \frac{2}{n}\, \mathbb{E} \sup_{A \in \mathcal{A}} \Bigl| \sum_{i=1}^n \varepsilon_i\, \mathbb{1}_{X_i \in A} \Bigr| .$

  24. conditional rademacher average
  If
  $R_n = \mathbb{E}_\varepsilon \sup_{A \in \mathcal{A}} \Bigl| \sum_{i=1}^n \varepsilon_i\, \mathbb{1}_{X_i \in A} \Bigr|,$
  then
  $\mathbb{E} \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| \le \frac{2}{n}\, \mathbb{E} R_n .$

  25. conditional rademacher average
  If
  $R_n = \mathbb{E}_\varepsilon \sup_{A \in \mathcal{A}} \Bigl| \sum_{i=1}^n \varepsilon_i\, \mathbb{1}_{X_i \in A} \Bigr|,$
  then
  $\mathbb{E} \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| \le \frac{2}{n}\, \mathbb{E} R_n .$
  $R_n$ is a data-dependent quantity!
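A sketch (not part of the slides) estimating $R_n$ for the class of half-lines $(-\infty, t]$. For distinct data points the distinct intersections $\{X_1, \dots, X_n\} \cap A$ are exactly the prefixes of the sorted sample (plus the empty set), so $R_n$ reduces to a maximum over $n+1$ partial Rademacher sums; the sample size and Monte Carlo budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# R_n = E_eps max_{0<=k<=n} | eps_(1) + ... + eps_(k) | for half-lines.
n, mc = 200, 5_000

eps = rng.choice([-1.0, 1.0], size=(mc, n))
prefix = np.cumsum(eps, axis=1)
prefix = np.hstack([np.zeros((mc, 1)), prefix])    # include the empty set
R_n = np.abs(prefix).max(axis=1).mean()

print("estimated R_n        =", R_n)
print("(2/n) R_n            =", 2 * R_n / n)       # bound on E sup_A |P_n(A) - P(A)|
print("sqrt(2 log(n+1) / n) =", np.sqrt(2 * np.log(n + 1) / n))  # maximal-inequality bound on R_n / n
```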

  26. concentration of conditional rademacher average
  Define
  $R_n^{(i)} = \mathbb{E}_\varepsilon \sup_{A \in \mathcal{A}} \Bigl| \sum_{j \ne i} \varepsilon_j\, \mathbb{1}_{X_j \in A} \Bigr| .$
  One can show easily that
  $0 \le R_n - R_n^{(i)} \le 1$ and $\sum_{i=1}^n \bigl( R_n - R_n^{(i)} \bigr) \le R_n .$
  By the Efron-Stein inequality,
  $\mathrm{Var}(R_n) \le \mathbb{E} \sum_{i=1}^n \bigl( R_n - R_n^{(i)} \bigr)^2 \le \mathbb{E} R_n .$

  27. concentration of conditional rademacher average
  Define
  $R_n^{(i)} = \mathbb{E}_\varepsilon \sup_{A \in \mathcal{A}} \Bigl| \sum_{j \ne i} \varepsilon_j\, \mathbb{1}_{X_j \in A} \Bigr| .$
  One can show easily that
  $0 \le R_n - R_n^{(i)} \le 1$ and $\sum_{i=1}^n \bigl( R_n - R_n^{(i)} \bigr) \le R_n .$
  By the Efron-Stein inequality,
  $\mathrm{Var}(R_n) \le \mathbb{E} \sum_{i=1}^n \bigl( R_n - R_n^{(i)} \bigr)^2 \le \mathbb{E} R_n .$
  Standard deviation is at most $\sqrt{\mathbb{E} R_n}$!

  28. concentration of conditional rademacher average
  Define
  $R_n^{(i)} = \mathbb{E}_\varepsilon \sup_{A \in \mathcal{A}} \Bigl| \sum_{j \ne i} \varepsilon_j\, \mathbb{1}_{X_j \in A} \Bigr| .$
  One can show easily that
  $0 \le R_n - R_n^{(i)} \le 1$ and $\sum_{i=1}^n \bigl( R_n - R_n^{(i)} \bigr) \le R_n .$
  By the Efron-Stein inequality,
  $\mathrm{Var}(R_n) \le \mathbb{E} \sum_{i=1}^n \bigl( R_n - R_n^{(i)} \bigr)^2 \le \mathbb{E} R_n .$
  Standard deviation is at most $\sqrt{\mathbb{E} R_n}$! Such functions are called self-bounding.

  29. bounding the conditional rademacher average
  If $S(X_1^n, \mathcal{A})$ is the number of different sets of the form
  $\{X_1, \dots, X_n\} \cap A, \quad A \in \mathcal{A},$
  then $R_n$ is the maximum of $S(X_1^n, \mathcal{A})$ sub-Gaussian random variables. By the maximal inequality,
  $\frac{1}{n}\, R_n \le \sqrt{\frac{2 \log S(X_1^n, \mathcal{A})}{n}} .$

  30. bounding the conditional rademacher average
  If $S(X_1^n, \mathcal{A})$ is the number of different sets of the form
  $\{X_1, \dots, X_n\} \cap A, \quad A \in \mathcal{A},$
  then $R_n$ is the maximum of $S(X_1^n, \mathcal{A})$ sub-Gaussian random variables. By the maximal inequality,
  $\frac{1}{n}\, R_n \le \sqrt{\frac{2 \log S(X_1^n, \mathcal{A})}{n}} .$
  In particular,
  $\mathbb{E} \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| \le 2\, \mathbb{E} \sqrt{\frac{2 \log S(X_1^n, \mathcal{A})}{n}} .$

  31. random VC dimension
  Let $V = V(x_1^n, \mathcal{A})$ be the size of the largest subset of $\{x_1, \dots, x_n\}$ shattered by $\mathcal{A}$. By Sauer's lemma,
  $\log S(X_1^n, \mathcal{A}) \le V(X_1^n, \mathcal{A}) \log(n + 1) .$

  32. random VC dimension
  Let $V = V(x_1^n, \mathcal{A})$ be the size of the largest subset of $\{x_1, \dots, x_n\}$ shattered by $\mathcal{A}$. By Sauer's lemma,
  $\log S(X_1^n, \mathcal{A}) \le V(X_1^n, \mathcal{A}) \log(n + 1) .$
  $V$ is also self-bounding:
  $\sum_{i=1}^n \bigl( V - V^{(i)} \bigr)^2 \le V,$
  so by Efron-Stein, $\mathrm{Var}(V) \le \mathbb{E} V$.

  33. vapnik and chervonenkis Alexey Chervonenkis Vladimir Vapnik

  34. beyond the variance
  $X_1, \dots, X_n$ are independent random variables taking values in some set $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \dots, X_n)$. Recall the Doob martingale representation:
  $Z - \mathbb{E} Z = \sum_{i=1}^n \Delta_i \quad \text{where } \Delta_i = \mathbb{E}_i Z - \mathbb{E}_{i-1} Z,$
  with $\mathbb{E}_i[\cdot] = \mathbb{E}[\cdot \mid X_1, \dots, X_i]$. To get exponential inequalities, we bound the moment generating function $\mathbb{E}\, e^{\lambda (Z - \mathbb{E} Z)}$.

  35. azuma's inequality
  Suppose that the martingale differences are bounded: $|\Delta_i| \le c_i$. Then
  $\mathbb{E}\, e^{\lambda (Z - \mathbb{E} Z)} = \mathbb{E}\, e^{\lambda \sum_{i=1}^n \Delta_i} = \mathbb{E}\, \mathbb{E}_{n-1}\bigl[ e^{\lambda \sum_{i=1}^{n-1} \Delta_i + \lambda \Delta_n} \bigr] = \mathbb{E}\bigl[ e^{\lambda \sum_{i=1}^{n-1} \Delta_i}\, \mathbb{E}_{n-1} e^{\lambda \Delta_n} \bigr] \le \mathbb{E}\bigl[ e^{\lambda \sum_{i=1}^{n-1} \Delta_i} \bigr]\, e^{\lambda^2 c_n^2 / 2}$ (by Hoeffding)
  $\cdots \le e^{\lambda^2 \bigl( \sum_{i=1}^n c_i^2 \bigr) / 2} .$
  This is the Azuma-Hoeffding inequality for sums of bounded martingale differences.

  36. bounded differences inequality
  If $Z = f(X_1, \dots, X_n)$ and $f$ is such that
  $|f(x_1, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)| \le c_i,$
  then the martingale differences are bounded.

  37. bounded differences inequality
  If $Z = f(X_1, \dots, X_n)$ and $f$ is such that
  $|f(x_1, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)| \le c_i,$
  then the martingale differences are bounded. Bounded differences inequality: if $X_1, \dots, X_n$ are independent, then
  $\mathbb{P}\{ |Z - \mathbb{E} Z| > t \} \le 2 e^{-2t^2 / \sum_{i=1}^n c_i^2} .$

  38. bounded differences inequality
  If $Z = f(X_1, \dots, X_n)$ and $f$ is such that
  $|f(x_1, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)| \le c_i,$
  then the martingale differences are bounded. Bounded differences inequality: if $X_1, \dots, X_n$ are independent, then
  $\mathbb{P}\{ |Z - \mathbb{E} Z| > t \} \le 2 e^{-2t^2 / \sum_{i=1}^n c_i^2} .$
  Also known as McDiarmid's inequality. Colin McDiarmid
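A sketch (not part of the slides) illustrating the bounded differences inequality: the number of distinct values among $n$ i.i.d. draws from a finite set changes by at most $c_i = 1$ when one draw is changed, so $\mathbb{P}\{|Z - \mathbb{E} Z| > t\} \le 2 e^{-2t^2/n}$. The alphabet size and the values of $t$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Z = number of distinct values among n draws from {0, ..., k-1}.
n, k, reps = 100, 60, 100_000
Z = np.array([len(np.unique(rng.integers(0, k, size=n))) for _ in range(reps)])
EZ = Z.mean()

for t in (5, 10, 15):
    empirical = np.mean(np.abs(Z - EZ) > t)
    bound = 2 * np.exp(-2 * t**2 / n)
    print(f"t = {t:2d}: empirical {empirical:.2e}  bound {bound:.2e}")
```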

  39. hoeffding in a hilbert space
  Let $X_1, \dots, X_n$ be independent zero-mean random variables in a separable Hilbert space such that $\|X_i\| \le c/2$, and denote $v = n c^2 / 4$. Then, for all $t \ge \sqrt{v}$,
  $\mathbb{P}\Bigl\{ \Bigl\| \sum_{i=1}^n X_i \Bigr\| > t \Bigr\} \le e^{-(t - \sqrt{v})^2 / (2v)} .$

  40. hoeffding in a hilbert space
  Let $X_1, \dots, X_n$ be independent zero-mean random variables in a separable Hilbert space such that $\|X_i\| \le c/2$, and denote $v = n c^2 / 4$. Then, for all $t \ge \sqrt{v}$,
  $\mathbb{P}\Bigl\{ \Bigl\| \sum_{i=1}^n X_i \Bigr\| > t \Bigr\} \le e^{-(t - \sqrt{v})^2 / (2v)} .$
  Proof: By the triangle inequality, $\bigl\| \sum_{i=1}^n X_i \bigr\|$ has the bounded differences property with constants $c$, so
  $\mathbb{P}\Bigl\{ \Bigl\| \sum_{i=1}^n X_i \Bigr\| > t \Bigr\} = \mathbb{P}\Bigl\{ \Bigl\| \sum_{i=1}^n X_i \Bigr\| - \mathbb{E} \Bigl\| \sum_{i=1}^n X_i \Bigr\| > t - \mathbb{E} \Bigl\| \sum_{i=1}^n X_i \Bigr\| \Bigr\} \le \exp\Bigl( - \frac{\bigl( t - \mathbb{E} \bigl\| \sum_{i=1}^n X_i \bigr\| \bigr)^2}{2v} \Bigr) .$
  Also,
  $\mathbb{E} \Bigl\| \sum_{i=1}^n X_i \Bigr\| \le \sqrt{ \mathbb{E} \Bigl\| \sum_{i=1}^n X_i \Bigr\|^2 } = \sqrt{ \sum_{i=1}^n \mathbb{E} \|X_i\|^2 } \le \sqrt{v} .$

  41. bounded differences inequality
  Easy to use. Distribution free. Often close to optimal (e.g., the $L_1$ error of the kernel density estimate). Does not exploit "variance information." Often too rigid. Other methods are necessary.

  42. shannon entropy
  If $X, Y$ are random variables taking values in a set of size $N$,
  $H(X) = - \sum_x p(x) \log p(x),$
  $H(X \mid Y) = H(X, Y) - H(Y) = - \sum_{x, y} p(x, y) \log p(x \mid y) .$
  Claude Shannon (1916–2001)
  $H(X) \le \log N$ and $H(X \mid Y) \le H(X) .$

  43. han's inequality
  If $X = (X_1, \dots, X_n)$ and $X^{(i)} = (X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$, then
  $\sum_{i=1}^n \bigl( H(X) - H(X^{(i)}) \bigr) \le H(X) .$
  Proof:
  $H(X) = H(X^{(i)}) + H(X_i \mid X^{(i)}) \le H(X^{(i)}) + H(X_i \mid X_1, \dots, X_{i-1}) .$
  Since $\sum_{i=1}^n H(X_i \mid X_1, \dots, X_{i-1}) = H(X)$, summing the inequality we get
  $(n - 1) H(X) \le \sum_{i=1}^n H(X^{(i)}) .$
  Te Sun Han
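A small numerical check (not part of the slides) of Han's inequality on a randomly generated joint distribution of three binary variables; the choice $n = 3$ and the random pmf are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Check sum_i (H(X) - H(X^{(i)})) <= H(X) for a random joint pmf on {0,1}^3.
n = 3
p = rng.random(2 ** n)
p /= p.sum()
p = p.reshape((2,) * n)              # joint pmf, one axis per variable

def entropy(q):
    q = q[q > 0]
    return -np.sum(q * np.log(q))

H_full = entropy(p.ravel())
H_leave_one_out = [entropy(p.sum(axis=i).ravel()) for i in range(n)]  # H(X^{(i)})

lhs = sum(H_full - Hm for Hm in H_leave_one_out)
print("sum_i (H(X) - H(X^(i))) =", lhs)
print("H(X)                    =", H_full)
print("Han's inequality holds:", lhs <= H_full + 1e-12)
```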

  44. subadditivity of entropy
  The entropy of a random variable $Z \ge 0$ is
  $\mathrm{Ent}(Z) = \mathbb{E} \Phi(Z) - \Phi(\mathbb{E} Z),$
  where $\Phi(x) = x \log x$. By Jensen's inequality, $\mathrm{Ent}(Z) \ge 0$.

  45. subadditivity of entropy
  The entropy of a random variable $Z \ge 0$ is
  $\mathrm{Ent}(Z) = \mathbb{E} \Phi(Z) - \Phi(\mathbb{E} Z),$
  where $\Phi(x) = x \log x$. By Jensen's inequality, $\mathrm{Ent}(Z) \ge 0$. Han's inequality implies the following subadditivity property. Let $X_1, \dots, X_n$ be independent and let $Z = f(X_1, \dots, X_n)$, where $f \ge 0$. Denote
  $\mathrm{Ent}^{(i)}(Z) = \mathbb{E}^{(i)} \Phi(Z) - \Phi(\mathbb{E}^{(i)} Z) .$
  Then
  $\mathrm{Ent}(Z) \le \mathbb{E} \sum_{i=1}^n \mathrm{Ent}^{(i)}(Z) .$

  46. a logarithmic sobolev inequality on the hypercube
  Let $X = (X_1, \dots, X_n)$ be uniformly distributed over $\{-1, 1\}^n$. If $f : \{-1, 1\}^n \to \mathbb{R}$ and $Z = f(X)$,
  $\mathrm{Ent}(Z^2) \le \frac{1}{2}\, \mathbb{E} \sum_{i=1}^n (Z - Z_i')^2 .$
  The proof uses subadditivity of the entropy and calculus for the case $n = 1$. Implies Efron-Stein and the edge-isoperimetric inequality.
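An exhaustive numerical check (not part of the slides) of the hypercube log-Sobolev inequality for a randomly chosen $f$ on $\{-1,1\}^n$ with small $n$; here $Z_i'$ means $Z$ with coordinate $i$ resampled uniformly, so its conditional expectation is half the squared "flip" difference.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)

n = 4
cube = np.array(list(product([-1, 1], repeat=n)))    # all 2^n points, uniform
f = rng.normal(size=len(cube))                       # a random f: {-1,1}^n -> R
index = {tuple(x): k for k, x in enumerate(cube)}

def ent(w):  # Ent(W) = E[W log W] - E[W] log E[W] for W >= 0, uniform measure
    m = w.mean()
    return np.mean(w * np.log(w)) - m * np.log(m)

lhs = ent(f ** 2)

# (1/2) E sum_i (Z - Z_i')^2: resampling X_i flips it with probability 1/2
rhs = 0.0
for k, x in enumerate(cube):
    for i in range(n):
        y = x.copy(); y[i] = -y[i]
        rhs += 0.5 * (f[k] - f[index[tuple(y)]]) ** 2   # expectation over X_i'
rhs *= 0.5 / len(cube)                                   # the 1/2 factor and E over X

print("Ent(Z^2) =", lhs, "  bound =", rhs, "  holds:", lhs <= rhs + 1e-12)
```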

  47. herbst's argument: exponential concentration
  If $f : \{-1, 1\}^n \to \mathbb{R}$, the log-Sobolev inequality may be used with $g(x) = e^{\lambda f(x)/2}$, where $\lambda \in \mathbb{R}$. If $F(\lambda) = \mathbb{E}\, e^{\lambda Z}$ is the moment generating function of $Z = f(X)$,
  $\mathrm{Ent}\bigl( g(X)^2 \bigr) = \lambda\, \mathbb{E}\bigl[ Z e^{\lambda Z} \bigr] - \mathbb{E}\bigl[ e^{\lambda Z} \bigr] \log \mathbb{E}\bigl[ e^{\lambda Z} \bigr] = \lambda F'(\lambda) - F(\lambda) \log F(\lambda) .$
  Differential inequalities are obtained for $F(\lambda)$.

  48. herbst's argument
  As an example, suppose $f$ is such that $\sum_{i=1}^n (Z - Z_i')_+^2 \le v$. Then by the log-Sobolev inequality,
  $\lambda F'(\lambda) - F(\lambda) \log F(\lambda) \le \frac{v \lambda^2}{4} F(\lambda) .$
  If $G(\lambda) = \log F(\lambda)$, this becomes
  $\Bigl( \frac{G(\lambda)}{\lambda} \Bigr)' \le \frac{v}{4} .$
  This can be integrated: $G(\lambda) \le \lambda \mathbb{E} Z + \lambda^2 v / 4$, so
  $F(\lambda) \le e^{\lambda \mathbb{E} Z + \lambda^2 v / 4} .$
  This implies
  $\mathbb{P}\{ Z > \mathbb{E} Z + t \} \le e^{-t^2 / v} .$

  49. herbst's argument
  As an example, suppose $f$ is such that $\sum_{i=1}^n (Z - Z_i')_+^2 \le v$. Then by the log-Sobolev inequality,
  $\lambda F'(\lambda) - F(\lambda) \log F(\lambda) \le \frac{v \lambda^2}{4} F(\lambda) .$
  If $G(\lambda) = \log F(\lambda)$, this becomes
  $\Bigl( \frac{G(\lambda)}{\lambda} \Bigr)' \le \frac{v}{4} .$
  This can be integrated: $G(\lambda) \le \lambda \mathbb{E} Z + \lambda^2 v / 4$, so
  $F(\lambda) \le e^{\lambda \mathbb{E} Z + \lambda^2 v / 4} .$
  This implies
  $\mathbb{P}\{ Z > \mathbb{E} Z + t \} \le e^{-t^2 / v} .$
  Stronger than the bounded differences inequality!

  50. gaussian log-sobolev and concentration inequalities
  Let $X = (X_1, \dots, X_n)$ be a vector of i.i.d. standard normal random variables. If $f : \mathbb{R}^n \to \mathbb{R}$ and $Z = f(X)$,
  $\mathrm{Ent}(Z^2) \le 2\, \mathbb{E}\bigl[ \|\nabla f(X)\|^2 \bigr] .$
  This can be proved using the central limit theorem and the Bernoulli log-Sobolev inequality.

  51. gaussian log-sobolev and concentration inequalities
  Let $X = (X_1, \dots, X_n)$ be a vector of i.i.d. standard normal random variables. If $f : \mathbb{R}^n \to \mathbb{R}$ and $Z = f(X)$,
  $\mathrm{Ent}(Z^2) \le 2\, \mathbb{E}\bigl[ \|\nabla f(X)\|^2 \bigr] .$
  This can be proved using the central limit theorem and the Bernoulli log-Sobolev inequality. It implies the Gaussian concentration inequality: suppose $f$ is Lipschitz, that is, for all $x, y \in \mathbb{R}^n$,
  $|f(x) - f(y)| \le L \|x - y\| .$
  Then, for all $t > 0$,
  $\mathbb{P}\{ f(X) - \mathbb{E} f(X) \ge t \} \le e^{-t^2 / (2L^2)} .$

  52. an application: supremum of a gaussian process
  Let $(X_t)_{t \in \mathcal{T}}$ be an almost surely continuous centered Gaussian process. Let $Z = \sup_{t \in \mathcal{T}} X_t$. If
  $\sigma^2 = \sup_{t \in \mathcal{T}} \mathbb{E}\bigl[ X_t^2 \bigr],$
  then
  $\mathbb{P}\{ |Z - \mathbb{E} Z| \ge u \} \le 2 e^{-u^2 / (2\sigma^2)} .$

  53. an application: supremum of a gaussian process
  Let $(X_t)_{t \in \mathcal{T}}$ be an almost surely continuous centered Gaussian process. Let $Z = \sup_{t \in \mathcal{T}} X_t$. If
  $\sigma^2 = \sup_{t \in \mathcal{T}} \mathbb{E}\bigl[ X_t^2 \bigr],$
  then
  $\mathbb{P}\{ |Z - \mathbb{E} Z| \ge u \} \le 2 e^{-u^2 / (2\sigma^2)} .$
  Proof: We may assume $\mathcal{T} = \{1, \dots, n\}$. Let $\Gamma$ be the covariance matrix of $X = (X_1, \dots, X_n)$ and let $A = \Gamma^{1/2}$. If $Y$ is a standard normal vector, then
  $f(Y) = \max_{i = 1, \dots, n} (AY)_i \stackrel{\mathrm{distr.}}{=} \max_{i = 1, \dots, n} X_i .$
  By Cauchy-Schwarz,
  $|(Au)_i - (Av)_i| = \Bigl| \sum_j A_{i,j} (u_j - v_j) \Bigr| \le \Bigl( \sum_j A_{i,j}^2 \Bigr)^{1/2} \|u - v\| \le \sigma \|u - v\| .$
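A simulation sketch (not part of the slides): concentration of $Z = \max_i X_i$ for a centered Gaussian vector with a randomly generated covariance, compared with the bound $2 e^{-u^2/(2\sigma^2)}$; the dimension and the thresholds $u$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)

n, reps = 30, 200_000
B = rng.normal(size=(n, n))
Gamma = B @ B.T / n                   # a covariance matrix
L = np.linalg.cholesky(Gamma)         # X = L Y with Y standard normal
sigma2 = np.max(np.diag(Gamma))       # sigma^2 = max_i E X_i^2

Z = (rng.normal(size=(reps, n)) @ L.T).max(axis=1)
EZ = Z.mean()

for u in (0.5, 1.0, 1.5):
    empirical = np.mean(np.abs(Z - EZ) >= u)
    bound = 2 * np.exp(-u**2 / (2 * sigma2))
    print(f"u = {u}: empirical {empirical:.3e}  bound {bound:.3e}")
```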

  54. beyond bernoulli and gaussian: the entropy method
  For general distributions, logarithmic Sobolev inequalities are not available. Solution: modified logarithmic Sobolev inequalities. Suppose $X_1, \dots, X_n$ are independent. Let $Z = f(X_1, \dots, X_n)$ and
  $Z_i = f_i(X^{(i)}) = f_i(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) .$
  Let $\phi(x) = e^x - x - 1$. Then for all $\lambda \in \mathbb{R}$,
  $\lambda\, \mathbb{E}\bigl[ Z e^{\lambda Z} \bigr] - \mathbb{E}\bigl[ e^{\lambda Z} \bigr] \log \mathbb{E}\bigl[ e^{\lambda Z} \bigr] \le \sum_{i=1}^n \mathbb{E}\bigl[ e^{\lambda Z}\, \phi\bigl( -\lambda (Z - Z_i) \bigr) \bigr] .$
  Michel Ledoux

  55. the entropy method
  Define $Z_i = \inf_{x_i'} f(X_1, \dots, x_i', \dots, X_n)$ and suppose
  $\sum_{i=1}^n (Z - Z_i)^2 \le v .$
  Then for all $t > 0$,
  $\mathbb{P}\{ Z - \mathbb{E} Z > t \} \le e^{-t^2 / (2v)} .$

  56. the entropy method
  Define $Z_i = \inf_{x_i'} f(X_1, \dots, x_i', \dots, X_n)$ and suppose
  $\sum_{i=1}^n (Z - Z_i)^2 \le v .$
  Then for all $t > 0$,
  $\mathbb{P}\{ Z - \mathbb{E} Z > t \} \le e^{-t^2 / (2v)} .$
  This implies the bounded differences inequality and much more.

  57. example: the largest eigenvalue of a symmetric matrix
  Let $A = (X_{i,j})_{n \times n}$ be symmetric, with the $X_{i,j}$, $i \le j$, independent and $|X_{i,j}| \le 1$. Let
  $Z = \lambda_1 = \sup_{u : \|u\| = 1} u^T A u,$
  and suppose $v$ is such that $Z = v^T A v$. $A_{i,j}'$ is obtained by replacing $X_{i,j}$ by $x_{i,j}'$. Then
  $(Z - Z_{i,j})_+ \le \bigl( v^T A v - v^T A_{i,j}' v \bigr) \mathbb{1}_{Z > Z_{i,j}} = \bigl( v^T (A - A_{i,j}') v \bigr) \mathbb{1}_{Z > Z_{i,j}} \le 2 \bigl( v_i v_j (X_{i,j} - X_{i,j}') \bigr)_+ \le 4 |v_i v_j| .$
  Therefore,
  $\sum_{1 \le i \le j \le n} (Z - Z_{i,j}')_+^2 \le \sum_{1 \le i \le j \le n} 16 |v_i v_j|^2 \le 16 \Bigl( \sum_{i=1}^n v_i^2 \Bigr)^2 = 16 .$
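A simulation sketch (not part of the slides): the largest eigenvalue of a symmetric matrix with independent entries uniform in $[-1, 1]$ above the diagonal, compared with the bound $e^{-t^2/32}$ that follows from the computation above with $v = 16$; the matrix size and thresholds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)

n, reps = 50, 20_000

def lambda_max():
    U = rng.uniform(-1, 1, size=(n, n))
    A = np.triu(U) + np.triu(U, 1).T        # symmetric, entries in [-1, 1]
    return np.linalg.eigvalsh(A)[-1]        # largest eigenvalue

Z = np.array([lambda_max() for _ in range(reps)])
EZ = Z.mean()
print("E Z ~", EZ, "  std(Z) ~", Z.std())
for t in (1.0, 2.0):
    print(f"t = {t}: empirical {np.mean(Z > EZ + t):.3e}  bound {np.exp(-t**2 / 32):.3e}")
```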

  58. example: convex lipschitz functions
  Let $f : [0, 1]^n \to \mathbb{R}$ be a convex function. Let $Z_i = \inf_{x_i'} f(X_1, \dots, x_i', \dots, X_n)$ and let $X_i'$ be the value of $x_i'$ for which the minimum is achieved. Then, writing $X^{(i)} = (X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_n)$,
  $\sum_{i=1}^n (Z - Z_i)^2 = \sum_{i=1}^n \bigl( f(X) - f(X^{(i)}) \bigr)^2 \le \sum_{i=1}^n \Bigl( \frac{\partial f}{\partial x_i}(X) \Bigr)^2 (X_i - X_i')^2$ (by convexity)
  $\le \sum_{i=1}^n \Bigl( \frac{\partial f}{\partial x_i}(X) \Bigr)^2 = \|\nabla f(X)\|^2 \le L^2,$
  if $f$ is Lipschitz with constant $L$.

  59. self-bounding functions
  Suppose $Z$ satisfies
  $0 \le Z - Z_i \le 1$ and $\sum_{i=1}^n (Z - Z_i) \le Z .$
  Recall that $\mathrm{Var}(Z) \le \mathbb{E} Z$. We have much more:
  $\mathbb{P}\{ Z > \mathbb{E} Z + t \} \le e^{-t^2 / (2 \mathbb{E} Z + 2t/3)}$ and $\mathbb{P}\{ Z < \mathbb{E} Z - t \} \le e^{-t^2 / (2 \mathbb{E} Z)} .$

  60. self-bounding functions
  Suppose $Z$ satisfies
  $0 \le Z - Z_i \le 1$ and $\sum_{i=1}^n (Z - Z_i) \le Z .$
  Recall that $\mathrm{Var}(Z) \le \mathbb{E} Z$. We have much more:
  $\mathbb{P}\{ Z > \mathbb{E} Z + t \} \le e^{-t^2 / (2 \mathbb{E} Z + 2t/3)}$ and $\mathbb{P}\{ Z < \mathbb{E} Z - t \} \le e^{-t^2 / (2 \mathbb{E} Z)} .$
  Rademacher averages and the random VC dimension are self-bounding.

  61. self-bounding functions
  Suppose $Z$ satisfies
  $0 \le Z - Z_i \le 1$ and $\sum_{i=1}^n (Z - Z_i) \le Z .$
  Recall that $\mathrm{Var}(Z) \le \mathbb{E} Z$. We have much more:
  $\mathbb{P}\{ Z > \mathbb{E} Z + t \} \le e^{-t^2 / (2 \mathbb{E} Z + 2t/3)}$ and $\mathbb{P}\{ Z < \mathbb{E} Z - t \} \le e^{-t^2 / (2 \mathbb{E} Z)} .$
  Rademacher averages and the random VC dimension are self-bounding. Configuration functions.

  62. weakly self-bounding functions
  $f : \mathcal{X}^n \to [0, \infty)$ is weakly $(a, b)$-self-bounding if there exist $f_i : \mathcal{X}^{n-1} \to [0, \infty)$ such that for all $x \in \mathcal{X}^n$,
  $\sum_{i=1}^n \bigl( f(x) - f_i(x^{(i)}) \bigr)^2 \le a f(x) + b .$

  63. weakly self-bounding functions
  $f : \mathcal{X}^n \to [0, \infty)$ is weakly $(a, b)$-self-bounding if there exist $f_i : \mathcal{X}^{n-1} \to [0, \infty)$ such that for all $x \in \mathcal{X}^n$,
  $\sum_{i=1}^n \bigl( f(x) - f_i(x^{(i)}) \bigr)^2 \le a f(x) + b .$
  Then
  $\mathbb{P}\{ Z \ge \mathbb{E} Z + t \} \le \exp\Bigl( - \frac{t^2}{2 (a \mathbb{E} Z + b + a t / 2)} \Bigr) .$

  64. weakly self-bounding functions
  $f : \mathcal{X}^n \to [0, \infty)$ is weakly $(a, b)$-self-bounding if there exist $f_i : \mathcal{X}^{n-1} \to [0, \infty)$ such that for all $x \in \mathcal{X}^n$,
  $\sum_{i=1}^n \bigl( f(x) - f_i(x^{(i)}) \bigr)^2 \le a f(x) + b .$
  Then
  $\mathbb{P}\{ Z \ge \mathbb{E} Z + t \} \le \exp\Bigl( - \frac{t^2}{2 (a \mathbb{E} Z + b + a t / 2)} \Bigr) .$
  If, in addition, $f(x) - f_i(x^{(i)}) \le 1$, then for $0 < t \le \mathbb{E} Z$,
  $\mathbb{P}\{ Z \le \mathbb{E} Z - t \} \le \exp\Bigl( - \frac{t^2}{2 (a \mathbb{E} Z + b + c_- t)} \Bigr),$
  where $c_-$ denotes the negative part of $c = (3a - 1)/6$.

  65. the isoperimetric view
  Let $X = (X_1, \dots, X_n)$ have independent components, taking values in $\mathcal{X}^n$. Let $A \subset \mathcal{X}^n$. The Hamming distance of $X$ to $A$ is
  $d(X, A) = \min_{y \in A} d(X, y) = \min_{y \in A} \sum_{i=1}^n \mathbb{1}_{X_i \ne y_i} .$
  Michel Talagrand

  66. the isoperimetric view
  Let $X = (X_1, \dots, X_n)$ have independent components, taking values in $\mathcal{X}^n$. Let $A \subset \mathcal{X}^n$. The Hamming distance of $X$ to $A$ is
  $d(X, A) = \min_{y \in A} d(X, y) = \min_{y \in A} \sum_{i=1}^n \mathbb{1}_{X_i \ne y_i} .$
  Michel Talagrand
  $\mathbb{P}\Bigl\{ d(X, A) \ge t + \sqrt{\tfrac{n}{2} \log \tfrac{1}{\mathbb{P}[A]}} \Bigr\} \le e^{-2t^2/n} .$

  67. the isoperimetric view
  Let $X = (X_1, \dots, X_n)$ have independent components, taking values in $\mathcal{X}^n$. Let $A \subset \mathcal{X}^n$. The Hamming distance of $X$ to $A$ is
  $d(X, A) = \min_{y \in A} d(X, y) = \min_{y \in A} \sum_{i=1}^n \mathbb{1}_{X_i \ne y_i} .$
  Michel Talagrand
  $\mathbb{P}\Bigl\{ d(X, A) \ge t + \sqrt{\tfrac{n}{2} \log \tfrac{1}{\mathbb{P}[A]}} \Bigr\} \le e^{-2t^2/n} .$
  Concentration of measure!

  68. the isoperimetric view
  Proof: By the bounded differences inequality,
  $\mathbb{P}\{ \mathbb{E}\, d(X, A) - d(X, A) \ge t \} \le e^{-2t^2/n} .$
  Taking $t = \mathbb{E}\, d(X, A)$, and noting that the left-hand side is then at least $\mathbb{P}\{A\}$ (since $d(X, A) = 0$ on $A$), we get
  $\mathbb{E}\, d(X, A) \le \sqrt{\frac{n}{2} \log \frac{1}{\mathbb{P}\{A\}}} .$
  By the bounded differences inequality again,
  $\mathbb{P}\Bigl\{ d(X, A) \ge t + \sqrt{\tfrac{n}{2} \log \tfrac{1}{\mathbb{P}\{A\}}} \Bigr\} \le e^{-2t^2/n} .$
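A simulation sketch (not part of the slides): for $X$ uniform on $\{0,1\}^n$ and the illustrative set $A = \{x : \sum_i x_i \le n/2\}$, the Hamming distance $d(X, A)$ is the number of ones that must be flipped, and the bound above can be checked empirically.

```python
import numpy as np

rng = np.random.default_rng(10)

n, reps = 100, 500_000
S = rng.integers(0, 2, size=(reps, n)).sum(axis=1)
d = np.maximum(0, S - n // 2)          # Hamming distance to A = {sum(x) <= n/2}

PA = np.mean(S <= n // 2)
shift = np.sqrt(n / 2 * np.log(1 / PA))
for t in (3, 6, 9):
    empirical = np.mean(d >= t + shift)
    bound = np.exp(-2 * t**2 / n)
    print(f"t = {t}: empirical {empirical:.3e}  bound {bound:.3e}")
```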

  69. talagrand's convex distance
  The weighted Hamming distance is
  $d_\alpha(x, A) = \inf_{y \in A} d_\alpha(x, y) = \inf_{y \in A} \sum_{i : x_i \ne y_i} |\alpha_i|,$
  where $\alpha = (\alpha_1, \dots, \alpha_n)$. The same argument as before gives
  $\mathbb{P}\Bigl\{ d_\alpha(X, A) \ge t + \sqrt{\tfrac{\|\alpha\|^2}{2} \log \tfrac{1}{\mathbb{P}\{A\}}} \Bigr\} \le e^{-2t^2 / \|\alpha\|^2} .$
  This implies
  $\sup_{\alpha : \|\alpha\| = 1} \min\bigl( \mathbb{P}\{A\},\, \mathbb{P}\{ d_\alpha(X, A) \ge t \} \bigr) \le e^{-t^2/2} .$
