
Foundations of Machine Learning: Learning with Infinite Hypothesis Sets



  1. Foundations of Machine Learning: Learning with Infinite Hypothesis Sets

  2. Motivation
  With an infinite hypothesis set $H$, the error bounds of the previous lecture are not informative.
  • Is efficient learning from a finite sample possible when $H$ is infinite? Our example of axis-aligned rectangles shows that it is possible.
  • Can we reduce the infinite case to a finite set? Project over finite samples?
  • Are there useful measures of complexity for infinite hypothesis sets?

  3. This lecture
  • Rademacher complexity
  • Growth function
  • VC dimension
  • Lower bound

  4. Empirical Rademacher Complexity
  Definition:
  • $G$: family of functions mapping from a set $Z$ to $[a, b]$.
  • Sample $S = (z_1, \ldots, z_m)$.
  • Rademacher variables: the $\sigma_i$'s are independent uniform random variables taking values in $\{-1, +1\}$.
  The empirical Rademacher complexity of $G$ with respect to $S$ is
  $$\widehat{R}_S(G) = \mathbb{E}_{\sigma}\left[\sup_{g \in G} \frac{1}{m}\begin{pmatrix}\sigma_1\\ \vdots\\ \sigma_m\end{pmatrix} \cdot \begin{pmatrix}g(z_1)\\ \vdots\\ g(z_m)\end{pmatrix}\right] = \mathbb{E}_{\sigma}\left[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\right],$$
  i.e., the correlation of $G$ with random noise.
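
  As a rough illustration (not part of the lecture), the following sketch estimates $\widehat{R}_S(G)$ by Monte Carlo sampling of the Rademacher variables, for a small finite family $G$ represented by the matrix of its values $g(z_i)$ on a fixed sample; the helper name and the hypothesis values are illustrative assumptions.

  ```python
  # Minimal sketch: Monte Carlo estimate of the empirical Rademacher complexity
  # of a finite family G, given as an |G| x m matrix of values g(z_i) on a fixed
  # sample S. All numbers below are illustrative, not from the lecture.
  import numpy as np

  def empirical_rademacher(values, n_trials=100_000, seed=0):
      """values[k, i] = g_k(z_i); returns a Monte Carlo estimate of R_S(G)."""
      rng = np.random.default_rng(seed)
      _, m = values.shape
      # Draw Rademacher vectors sigma in {-1, +1}^m, one per trial.
      sigma = rng.choice([-1.0, 1.0], size=(n_trials, m))
      # For each sigma, take the sup over g of (1/m) sum_i sigma_i g(z_i).
      sups = np.max(sigma @ values.T, axis=1) / m
      return sups.mean()

  # Two toy "hypotheses" evaluated on a sample of size 4.
  G = np.array([[1.0, -1.0, 1.0, -1.0],
                [1.0,  1.0, 1.0,  1.0]])
  print(empirical_rademacher(G))
  ```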

  5. Rademacher Complexity
  Definitions: let $G$ be a family of functions mapping from $Z$ to $[a, b]$.
  • Empirical Rademacher complexity of $G$:
  $$\widehat{R}_S(G) = \mathbb{E}_{\sigma}\left[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\right],$$
  where the $\sigma_i$'s are independent uniform random variables taking values in $\{-1, +1\}$ and $S = (z_1, \ldots, z_m)$.
  • Rademacher complexity of $G$:
  $$R_m(G) = \mathbb{E}_{S \sim D^m}\big[\widehat{R}_S(G)\big].$$

  6. Rademacher Complexity Bound
  (Koltchinskii and Panchenko, 2002)
  Theorem: Let $G$ be a family of functions mapping from $Z$ to $[0, 1]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $g \in G$:
  $$\mathbb{E}[g(z)] \le \frac{1}{m}\sum_{i=1}^m g(z_i) + 2 R_m(G) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
  $$\mathbb{E}[g(z)] \le \frac{1}{m}\sum_{i=1}^m g(z_i) + 2 \widehat{R}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
  Proof: Apply McDiarmid's inequality to $\Phi(S) = \sup_{g \in G}\big(\mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]\big)$.
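
  As a quick numerical illustration (the values of the empirical average, empirical Rademacher complexity, $m$ and $\delta$ below are assumed for illustration only), the second, data-dependent bound can be evaluated directly:

  ```python
  # Illustrative only: value of the data-dependent bound above for assumed
  # empirical average, empirical Rademacher complexity, sample size and delta.
  import math

  def rademacher_bound(emp_mean, emp_rad, m, delta):
      return emp_mean + 2.0 * emp_rad + 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * m))

  # e.g. empirical average 0.05, empirical Rademacher complexity 0.02, m = 1000, delta = 0.05
  print(rademacher_bound(0.05, 0.02, 1000, 0.05))   # roughly 0.05 + 0.04 + 0.13
  ```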

  7. • Changing one point of $S$ changes $\Phi(S)$ by at most $\frac{1}{m}$. If $S'$ differs from $S$ only by its last point ($z_m$ replaced by $z'_m$), then
  $$\Phi(S') - \Phi(S) = \sup_{g \in G}\{\mathbb{E}[g] - \widehat{\mathbb{E}}_{S'}[g]\} - \sup_{g \in G}\{\mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]\}
  \le \sup_{g \in G}\big\{(\mathbb{E}[g] - \widehat{\mathbb{E}}_{S'}[g]) - (\mathbb{E}[g] - \widehat{\mathbb{E}}_S[g])\big\}
  = \sup_{g \in G}\{\widehat{\mathbb{E}}_S[g] - \widehat{\mathbb{E}}_{S'}[g]\} = \sup_{g \in G}\frac{g(z_m) - g(z'_m)}{m} \le \frac{1}{m}.$$
  • Thus, by McDiarmid's inequality, with probability at least $1 - \frac{\delta}{2}$,
  $$\Phi(S) \le \mathbb{E}_S[\Phi(S)] + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
  • We are left with bounding the expectation.

  8. • Series of observations:
  $$\begin{aligned}
  \mathbb{E}_S[\Phi(S)] &= \mathbb{E}_S\Big[\sup_{g \in G}\big(\mathbb{E}[g] - \widehat{\mathbb{E}}_S(g)\big)\Big]
  = \mathbb{E}_S\Big[\sup_{g \in G} \mathbb{E}_{S'}\big[\widehat{\mathbb{E}}_{S'}(g) - \widehat{\mathbb{E}}_S(g)\big]\Big]\\
  &\le \mathbb{E}_{S, S'}\Big[\sup_{g \in G}\big(\widehat{\mathbb{E}}_{S'}(g) - \widehat{\mathbb{E}}_S(g)\big)\Big] &&\text{(sub-add. of sup)}\\
  &= \mathbb{E}_{S, S'}\Big[\sup_{g \in G}\frac{1}{m}\sum_{i=1}^m \big(g(z'_i) - g(z_i)\big)\Big]\\
  &= \mathbb{E}_{\sigma, S, S'}\Big[\sup_{g \in G}\frac{1}{m}\sum_{i=1}^m \sigma_i\big(g(z'_i) - g(z_i)\big)\Big] &&\text{(swap $z_i$ and $z'_i$)}\\
  &\le \mathbb{E}_{\sigma, S'}\Big[\sup_{g \in G}\frac{1}{m}\sum_{i=1}^m \sigma_i g(z'_i)\Big] + \mathbb{E}_{\sigma, S}\Big[\sup_{g \in G}\frac{1}{m}\sum_{i=1}^m -\sigma_i g(z_i)\Big] &&\text{(sub-additiv. of sup)}\\
  &= 2\,\mathbb{E}_{\sigma, S}\Big[\sup_{g \in G}\frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\Big] = 2 R_m(G).
  \end{aligned}$$

  9. • Now, changing one point of $S$ makes $\widehat{R}_S(G)$ vary by at most $\frac{1}{m}$. Thus, again by McDiarmid's inequality, with probability at least $1 - \frac{\delta}{2}$,
  $$R_m(G) \le \widehat{R}_S(G) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
  • Thus, by the union bound, with probability at least $1 - \delta$,
  $$\Phi(S) \le 2\widehat{R}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

  10. Loss Functions - Hypothesis Set
  Proposition: Let $H$ be a family of functions taking values in $\{-1, +1\}$ and let $G$ be the family of zero-one loss functions of $H$:
  $$G = \big\{(x, y) \mapsto 1_{h(x) \ne y} \colon h \in H\big\}.$$
  Then,
  $$R_m(G) = \tfrac{1}{2} R_m(H).$$
  Proof: using $1_{h(x_i) \ne y_i} = \frac{1 - y_i h(x_i)}{2}$ and the fact that, for fixed $y_i \in \{-1, +1\}$, $-y_i \sigma_i$ has the same distribution as $\sigma_i$:
  $$\begin{aligned}
  R_m(G) &= \mathbb{E}_{S, \sigma}\Big[\sup_{h \in H}\frac{1}{m}\sum_{i=1}^m \sigma_i 1_{h(x_i) \ne y_i}\Big]
  = \mathbb{E}_{S, \sigma}\Big[\sup_{h \in H}\frac{1}{m}\sum_{i=1}^m \sigma_i \frac{1 - y_i h(x_i)}{2}\Big]\\
  &= \frac{1}{2}\,\mathbb{E}_{S, \sigma}\Big[\sup_{h \in H}\frac{1}{m}\sum_{i=1}^m -\sigma_i y_i h(x_i)\Big]
  = \frac{1}{2}\,\mathbb{E}_{S, \sigma}\Big[\sup_{h \in H}\frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\Big] = \tfrac{1}{2} R_m(H).
  \end{aligned}$$
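
  The same computation goes through for a fixed sample, so the identity also holds for the empirical Rademacher complexities. The following toy numerical check (hypothesis values and labels are made up for illustration, not from the lecture) illustrates this:

  ```python
  # Toy numerical check (values assumed): on a fixed sample, the empirical
  # Rademacher complexity of the zero-one losses is half that of H.
  import numpy as np

  rng = np.random.default_rng(0)
  H = np.array([[ 1, -1,  1, -1],
                [ 1,  1, -1, -1],
                [-1, -1, -1,  1]], dtype=float)   # h(x_i) for 3 hypotheses, m = 4
  y = np.array([1, -1, -1, 1], dtype=float)       # arbitrary labels
  G = (H != y).astype(float)                      # zero-one losses 1_{h(x_i) != y_i}

  def emp_rad(values, n_trials=200_000):
      m = values.shape[1]
      sigma = rng.choice([-1.0, 1.0], size=(n_trials, m))
      return np.mean(np.max(sigma @ values.T, axis=1)) / m

  print(emp_rad(G), 0.5 * emp_rad(H))   # approximately equal
  ```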

  11. Generalization Bounds - Rademacher
  Corollary: Let $H$ be a family of functions taking values in $\{-1, +1\}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H$,
  $$R(h) \le \widehat{R}(h) + R_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
  $$R(h) \le \widehat{R}(h) + \widehat{R}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

  12. Remarks
  • The first bound is distribution-dependent, the second is a data-dependent bound, which makes them attractive.
  • But how do we compute the empirical Rademacher complexity? Computing $\mathbb{E}_{\sigma}\big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\big]$ requires solving ERM problems, which is typically computationally hard (see the reformulation below).
  • Is there a relation with combinatorial measures that are easier to compute?
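
  To see why this is an ERM problem, here is a short derivation that is not on the slide: since $h(x_i), \sigma_i \in \{-1, +1\}$, we have $\sigma_i h(x_i) = 1 - 2 \cdot 1_{h(x_i) \ne \sigma_i}$, so treating the $\sigma_i$'s as random labels,
  $$\mathbb{E}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\Big]
  = \mathbb{E}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \big(1 - 2 \cdot 1_{h(x_i) \ne \sigma_i}\big)\Big]
  = 1 - 2\,\mathbb{E}_{\sigma}\Big[\min_{h \in H} \frac{1}{m}\sum_{i=1}^m 1_{h(x_i) \ne \sigma_i}\Big].$$
  Thus, for every draw of $\sigma$, one must minimize the empirical error of $H$ with respect to the random labels $\sigma_i$, i.e., solve an ERM problem.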

  13. This lecture
  • Rademacher complexity
  • Growth function
  • VC dimension
  • Lower bound

  14. Growth Function
  Definition: the growth function $\Pi_H \colon \mathbb{N} \to \mathbb{N}$ for a hypothesis set $H$ is defined by
  $$\forall m \in \mathbb{N},\quad \Pi_H(m) = \max_{\{x_1, \ldots, x_m\} \subseteq X} \Big|\big\{\big(h(x_1), \ldots, h(x_m)\big) \colon h \in H\big\}\Big|.$$
  Thus, $\Pi_H(m)$ is the maximum number of ways $m$ points can be classified using $H$.
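
  As a concrete toy illustration (not part of the lecture), the sketch below enumerates the labelings induced by intervals of the real line (positive inside the interval, negative outside) on $m$ distinct points; the helper name interval_labelings and the point sets are illustrative choices. For this class, any set of $m$ distinct points realizes the maximum, and one obtains $\Pi_H(m) = 1 + \frac{m(m+1)}{2}$.

  ```python
  # Brute-force the growth function of intervals on the real line, Pi_H(m),
  # by enumerating the distinct labelings induced on m distinct points.
  from itertools import combinations

  def interval_labelings(points):
      """Set of dichotomies realized by h_{[a,b]}(x) = +1 iff a <= x <= b."""
      pts = sorted(points)
      labelings = {tuple(-1 for _ in pts)}   # interval containing no point
      # Every other realizable labeling marks one contiguous block of points positive.
      for i, j in combinations(range(len(pts) + 1), 2):
          labelings.add(tuple(+1 if i <= k < j else -1 for k in range(len(pts))))
      return labelings

  for m in range(1, 6):
      pts = list(range(m))                       # any m distinct points give the max here
      print(m, len(interval_labelings(pts)))     # Pi_H(m) = 1 + m(m+1)/2
  ```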

  15. Massart's Lemma
  (Massart, 2000)
  Theorem: Let $A \subseteq \mathbb{R}^m$ be a finite set, with $R = \max_{x \in A} \|x\|_2$. Then the following holds:
  $$\mathbb{E}_{\sigma}\Big[\frac{1}{m}\sup_{x \in A}\sum_{i=1}^m \sigma_i x_i\Big] \le \frac{R\sqrt{2\log|A|}}{m}.$$
  Proof: for any $t > 0$,
  $$\begin{aligned}
  \exp\Big(t\,\mathbb{E}_{\sigma}\Big[\sup_{x \in A}\sum_{i=1}^m \sigma_i x_i\Big]\Big)
  &\le \mathbb{E}_{\sigma}\Big[\exp\Big(t\sup_{x \in A}\sum_{i=1}^m \sigma_i x_i\Big)\Big] &&\text{(Jensen's ineq.)}\\
  &= \mathbb{E}_{\sigma}\Big[\sup_{x \in A}\exp\Big(t\sum_{i=1}^m \sigma_i x_i\Big)\Big]\\
  &\le \sum_{x \in A}\mathbb{E}_{\sigma}\Big[\exp\Big(t\sum_{i=1}^m \sigma_i x_i\Big)\Big]
   = \sum_{x \in A}\prod_{i=1}^m \mathbb{E}_{\sigma_i}\big[\exp(t\sigma_i x_i)\big]\\
  &\le \sum_{x \in A}\exp\Big(\sum_{i=1}^m \frac{t^2(2|x_i|)^2}{8}\Big) \le |A|\,e^{\frac{t^2 R^2}{2}}. &&\text{(Hoeffding's ineq.)}
  \end{aligned}$$

  16. • Taking the log yields:
  $$\mathbb{E}_{\sigma}\Big[\sup_{x \in A}\sum_{i=1}^m \sigma_i x_i\Big] \le \frac{\log|A|}{t} + \frac{t R^2}{2}.$$
  • Minimizing the bound by choosing $t = \frac{\sqrt{2\log|A|}}{R}$ gives
  $$\mathbb{E}_{\sigma}\Big[\sup_{x \in A}\sum_{i=1}^m \sigma_i x_i\Big] \le R\sqrt{2\log|A|}.$$

  17. Growth Function Bound on Rademacher Complexity
  Corollary: Let $G$ be a family of functions taking values in $\{-1, +1\}$. Then the following holds:
  $$R_m(G) \le \sqrt{\frac{2\log\Pi_G(m)}{m}}.$$
  Proof: since each vector $(g(z_1), \ldots, g(z_m))$ has norm at most $\sqrt{m}$,
  $$\begin{aligned}
  \widehat{R}_S(G) &= \mathbb{E}_{\sigma}\left[\sup_{g \in G}\frac{1}{m}\begin{pmatrix}\sigma_1\\ \vdots\\ \sigma_m\end{pmatrix}\cdot\begin{pmatrix}g(z_1)\\ \vdots\\ g(z_m)\end{pmatrix}\right]\\
  &\le \frac{\sqrt{m}\sqrt{2\log\big|\{(g(z_1), \ldots, g(z_m)) \colon g \in G\}\big|}}{m} &&\text{(Massart's lemma)}\\
  &\le \frac{\sqrt{m}\sqrt{2\log\Pi_G(m)}}{m} = \sqrt{\frac{2\log\Pi_G(m)}{m}}.
  \end{aligned}$$
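
  For a sense of scale (illustrative only), one can plug the intervals growth function $\Pi_H(m) = 1 + \frac{m(m+1)}{2}$ from the sketch above into this bound and watch it decrease with $m$:

  ```python
  # Illustrative only: the growth-function bound on the Rademacher complexity
  # of intervals on the real line, using Pi_H(m) = 1 + m(m+1)/2.
  import math

  for m in (10, 100, 1000, 10000):
      pi_m = 1 + m * (m + 1) // 2
      print(m, math.sqrt(2 * math.log(pi_m) / m))
  ```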

  18. Generalization Bound - Growth Function
  Corollary: Let $H$ be a family of functions taking values in $\{-1, +1\}$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H$,
  $$R(h) \le \widehat{R}(h) + \sqrt{\frac{2\log\Pi_H(m)}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
  But how do we compute the growth function? Relationship with the VC-dimension (Vapnik-Chervonenkis dimension).

  19. This lecture
  • Rademacher complexity
  • Growth function
  • VC dimension
  • Lower bound

  20. VC Dimension
  (Vapnik & Chervonenkis, 1968-1971; Vapnik, 1982, 1995, 1998)
  Definition: the VC-dimension of a hypothesis set $H$ is defined by
  $$\mathrm{VCdim}(H) = \max\{m \colon \Pi_H(m) = 2^m\}.$$
  Thus, the VC-dimension is the size of the largest set that can be fully shattered by $H$.
  Purely combinatorial notion.
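
  As a toy illustration (not part of the lecture), the brute-force check below tests whether a point set is shattered by an explicitly listed set of hypotheses; the helper name is_shattered and the threshold class $h_t(x) = +1$ iff $x \ge t$ are illustrative choices (for the two points used here, the three listed thresholds already realize every labeling that any threshold can realize).

  ```python
  # Brute-force shattering check for a small, explicitly listed hypothesis set.
  from itertools import product

  def is_shattered(points, hypotheses):
      """hypotheses: callables x -> {-1, +1}; True iff every dichotomy of points is realized."""
      realized = {tuple(h(x) for x in points) for h in hypotheses}
      return all(lab in realized for lab in product((-1, 1), repeat=len(points)))

  # Toy threshold classifiers h_t(x) = +1 iff x >= t, with a few representative cut-points.
  thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in (-0.5, 0.5, 1.5)]
  print(is_shattered([0.0], thresholds))        # True: one point is shattered
  print(is_shattered([0.0, 1.0], thresholds))   # False: the dichotomy (+1, -1) is never realized
  ```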

  21. Examples
  In the following, we determine the VC dimension for several hypothesis sets.
  • To give a lower bound $d$ on $\mathrm{VCdim}(H)$, it suffices to show that some set $S$ of cardinality $d$ can be shattered by $H$.
  • To give an upper bound, we need to prove that no set $S$ of cardinality $d + 1$ can be shattered by $H$, which is typically more difficult.

  22. Intervals of the Real Line
  Observations:
  • Any set of two points can be shattered by four intervals, realizing the dichotomies "- -", "+ -", "- +", and "+ +".
  • No set of three points can be shattered, since the dichotomy "+ - +" is not realizable: by definition of intervals, any interval containing the two outer points also contains the middle one.
  • Thus, $\mathrm{VCdim}(\text{intervals in } \mathbb{R}) = 2$.
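
  As a small sanity check of the second observation (not from the lecture; the helper interval_label and the endpoint grid are illustrative choices), the sketch below verifies by brute force that no interval realizes "+ - +" on three increasing points:

  ```python
  # Sanity check: no interval [a, b] realizes the dichotomy "+ - +" on three
  # increasing points, since containing both outer points forces containing the middle one.
  def interval_label(a, b, x):
      return 1 if a <= x <= b else -1

  points, target = [0.0, 1.0, 2.0], (1, -1, 1)
  # Endpoint values covering every distinct position relative to the three points.
  grid = (-1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0)
  realizable = any(tuple(interval_label(a, b, x) for x in points) == target
                   for a in grid for b in grid if a <= b)
  print(realizable)   # False
  ```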

  23. Hyperplanes
  Observations:
  • Any three non-collinear points can be shattered.
  • Some dichotomies of four points are not realizable, for example the XOR-type labeling in which two diagonally opposite points are positive and the other two are negative (a short argument follows below).
  • Thus, $\mathrm{VCdim}(\text{hyperplanes in } \mathbb{R}^d) = d + 1$.
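
  For the four-point case, here is a short argument that is not spelled out on the slide, for the specific points $(0,0)$, $(1,1)$ labeled positive and $(0,1)$, $(1,0)$ labeled negative in $\mathbb{R}^2$: suppose a hyperplane $x \mapsto \mathrm{sign}(w \cdot x + b)$ realized this dichotomy. Then
  $$\underbrace{\big(w \cdot (0,0) + b\big) + \big(w \cdot (1,1) + b\big)}_{> 0} \;=\; w_1 + w_2 + 2b \;=\; \underbrace{\big(w \cdot (0,1) + b\big) + \big(w \cdot (1,0) + b\big)}_{< 0},$$
  a contradiction, so no hyperplane realizes this dichotomy.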

  24. Axis-Aligned Rectangles in the Plane
  Observations:
  • There exist four points (e.g., arranged in a diamond, one per side of a rectangle) that can be shattered: all 16 dichotomies are realizable by axis-aligned rectangles.
  • No set of five points can be shattered: label positively the leftmost, rightmost, topmost, and bottommost points, and negatively a remaining point (one that is not extreme in any direction); any rectangle containing the positive points must also contain the negative one.
  • Thus, $\mathrm{VCdim}(\text{axis-aligned rectangles}) = 4$.
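
  The following randomized sanity check (not a proof, and not from the lecture; the point sets are random draws used purely for illustration) illustrates the five-point argument: whichever five points are drawn, the tightest rectangle containing the extreme points also contains every remaining point, so the corresponding dichotomy cannot be realized.

  ```python
  # Randomized sanity check of the five-point argument: the bounding box of the
  # extreme points always contains the remaining (negatively labeled) point(s).
  import random

  def bounding_box(pts):
      xs, ys = zip(*pts)
      return min(xs), max(xs), min(ys), max(ys)

  random.seed(0)
  for _ in range(1000):
      pts = [(random.random(), random.random()) for _ in range(5)]
      xmin, xmax, ymin, ymax = bounding_box(pts)
      extreme = [p for p in pts if p[0] in (xmin, xmax) or p[1] in (ymin, ymax)]
      inner = [p for p in pts if p not in extreme]   # non-empty with probability 1
      exmin, exmax, eymin, eymax = bounding_box(extreme)
      assert all(exmin <= x <= exmax and eymin <= y <= eymax for (x, y) in inner)
  print("no counterexample found among 1000 random 5-point sets")
  ```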
