Foundations of Machine Learning: Learning with Infinite Hypothesis Sets
Motivation

With an infinite hypothesis set H, the error bounds of the previous lecture are not informative.
• Is efficient learning from a finite sample possible when H is infinite?
• Our example of axis-aligned rectangles shows that it is possible.
• Can we reduce the infinite case to a finite set? Project over finite samples?
• Are there useful measures of complexity for infinite hypothesis sets?
This lecture
• Rademacher complexity
• Growth function
• VC dimension
• Lower bound
Empirical Rademacher Complexity

Definition:
• G: family of functions mapping from a set Z to [a, b].
• Sample S = (z_1, ..., z_m).
• σ_i's (Rademacher variables): independent uniform random variables taking values in {−1, +1}.

$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\left[\sup_{g \in G} \frac{1}{m} \begin{pmatrix} g(z_1) \\ \vdots \\ g(z_m) \end{pmatrix} \cdot \begin{pmatrix} \sigma_1 \\ \vdots \\ \sigma_m \end{pmatrix}\right] = \mathbb{E}_{\sigma}\left[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m \sigma_i\, g(z_i)\right].$$

(correlation with random noise)
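As an added illustration (not part of the original slides), this definition can be estimated numerically for a small finite class by Monte Carlo sampling of the Rademacher vector; the threshold class and the sample below are illustrative assumptions.

```python
import numpy as np

def empirical_rademacher(G_values, n_draws=10000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    G_values: array of shape (|G|, m) whose rows are (g(z_1), ..., g(z_m))
    for each function g in a finite class G.
    """
    rng = np.random.default_rng(seed)
    n_funcs, m = G_values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        total += np.max(G_values @ sigma) / m     # sup over g of (1/m) sum_i sigma_i g(z_i)
    return total / n_draws

# Illustrative setup: threshold functions g_t(z) = sign(z - t) on 20 points in [0, 1].
z = np.linspace(0.0, 1.0, 20)
thresholds = np.linspace(-0.05, 1.05, 12)
G_values = np.array([np.where(z > t, 1.0, -1.0) for t in thresholds])
print(empirical_rademacher(G_values))
```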
Rademacher Complexity

Definitions: let G be a family of functions mapping from Z to [a, b].
• Empirical Rademacher complexity of G:
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\left[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m \sigma_i\, g(z_i)\right],$$
where the σ_i's are independent uniform random variables taking values in {−1, +1} and S = (z_1, ..., z_m).
• Rademacher complexity of G:
$$\mathfrak{R}_m(G) = \mathbb{E}_{S \sim D^m}\left[\widehat{\mathfrak{R}}_S(G)\right].$$
Rademacher Complexity Bound (Koltchinskii and Panchenko, 2002)

Theorem: Let G be a family of functions mapping from Z to [0, 1]. Then, for any δ > 0, with probability at least 1 − δ, the following holds for all g ∈ G:
$$\mathbb{E}[g(z)] \le \frac{1}{m} \sum_{i=1}^m g(z_i) + 2\,\mathfrak{R}_m(G) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
$$\mathbb{E}[g(z)] \le \frac{1}{m} \sum_{i=1}^m g(z_i) + 2\,\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
Proof: Apply McDiarmid's inequality to $\Phi(S) = \sup_{g \in G}\big(\mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]\big)$.
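For scale (an added note, not from the slides), the two confidence terms in the theorem can be evaluated directly; the values of m and δ below are arbitrary choices.

```python
import math

def slack_first_bound(m, delta):
    # sqrt(log(1/delta) / (2m)): confidence term of the first bound
    return math.sqrt(math.log(1.0 / delta) / (2.0 * m))

def slack_second_bound(m, delta):
    # 3 * sqrt(log(2/delta) / (2m)): confidence term of the second bound
    return 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * m))

for m in (100, 1000, 10000):
    print(m, slack_first_bound(m, 0.05), slack_second_bound(m, 0.05))
```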
• Changing one point of S changes Φ(S) by at most 1/m:
$$\Phi(S') - \Phi(S) = \sup_{g \in G}\{\mathbb{E}[g] - \widehat{\mathbb{E}}_{S'}[g]\} - \sup_{g \in G}\{\mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]\}$$
$$\le \sup_{g \in G}\big\{\{\mathbb{E}[g] - \widehat{\mathbb{E}}_{S'}[g]\} - \{\mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]\}\big\}$$
$$= \sup_{g \in G}\{\widehat{\mathbb{E}}_S[g] - \widehat{\mathbb{E}}_{S'}[g]\} = \sup_{g \in G} \frac{1}{m}\big(g(z_m) - g(z'_m)\big) \le \frac{1}{m}.$$
• Thus, by McDiarmid's inequality, with probability at least 1 − δ/2,
$$\Phi(S) \le \mathbb{E}_S[\Phi(S)] + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
• We are left with bounding the expectation.
• Series of observations:
$$\mathbb{E}_S[\Phi(S)] = \mathbb{E}_S\Big[\sup_{g \in G}\big(\mathbb{E}[g] - \widehat{\mathbb{E}}_S(g)\big)\Big]$$
$$= \mathbb{E}_S\Big[\sup_{g \in G} \mathbb{E}_{S'}\big[\widehat{\mathbb{E}}_{S'}(g) - \widehat{\mathbb{E}}_S(g)\big]\Big]$$
$$\le \mathbb{E}_{S,S'}\Big[\sup_{g \in G}\big(\widehat{\mathbb{E}}_{S'}(g) - \widehat{\mathbb{E}}_S(g)\big)\Big] \qquad (\text{sub-additivity of sup})$$
$$= \mathbb{E}_{S,S'}\Big[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m \big(g(z'_i) - g(z_i)\big)\Big]$$
$$= \mathbb{E}_{\sigma,S,S'}\Big[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m \sigma_i\big(g(z'_i) - g(z_i)\big)\Big] \qquad (\text{swapping } z_i \text{ and } z'_i)$$
$$\le \mathbb{E}_{\sigma,S'}\Big[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m \sigma_i\, g(z'_i)\Big] + \mathbb{E}_{\sigma,S}\Big[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m -\sigma_i\, g(z_i)\Big] \qquad (\text{sub-additivity of sup})$$
$$= 2\,\mathbb{E}_{\sigma,S}\Big[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m \sigma_i\, g(z_i)\Big] = 2\,\mathfrak{R}_m(G).$$
• Now, changing one point of S makes $\widehat{\mathfrak{R}}_S(G)$ vary by at most 1/m. Thus, again by McDiarmid's inequality, with probability at least 1 − δ/2,
$$\mathfrak{R}_m(G) \le \widehat{\mathfrak{R}}_S(G) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
• Thus, by the union bound, with probability at least 1 − δ,
$$\Phi(S) \le 2\,\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
Loss Functions - Hypothesis Set

Proposition: Let H be a family of functions taking values in {−1, +1}, and let G be the family of zero-one loss functions of H:
$$G = \big\{(x, y) \mapsto 1_{h(x) \ne y} : h \in H\big\}.$$
Then,
$$\mathfrak{R}_m(G) = \tfrac{1}{2}\,\mathfrak{R}_m(H).$$
Proof:
$$\mathfrak{R}_m(G) = \mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^m \sigma_i\, 1_{h(x_i) \ne y_i}\Big]$$
$$= \mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^m \sigma_i\, \tfrac{1}{2}\big(1 - y_i h(x_i)\big)\Big]$$
$$= \frac{1}{2}\,\mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^m -\sigma_i\, y_i\, h(x_i)\Big] \qquad (\text{since } \mathbb{E}_\sigma[\sigma_i] = 0)$$
$$= \frac{1}{2}\,\mathbb{E}_{S,\sigma}\Big[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^m \sigma_i\, h(x_i)\Big] = \frac{1}{2}\,\mathfrak{R}_m(H). \qquad (-\sigma_i y_i \text{ distributed as } \sigma_i)$$
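The identity can also be checked numerically for the empirical quantities on a fixed labeled sample, for which the same argument applies. The sketch below is an added illustration; the labeled sample and the small threshold class are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 15
x = np.sort(rng.uniform(0.0, 1.0, m))
y = rng.choice([-1.0, 1.0], m)                     # arbitrary fixed labels

# Illustrative finite class H of threshold classifiers h_t(x) = sign(x - t).
H_values = np.array([np.where(x > t, 1.0, -1.0)
                     for t in np.linspace(-0.05, 1.05, 10)])
# Associated zero-one losses g_h(x, y) = 1_{h(x) != y}, with values in {0, 1}.
G_values = (H_values != y).astype(float)

def emp_rademacher(values, n_draws=20000):
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, values.shape[1]))
    # For each sigma draw, take the sup over rows (functions) of `values`.
    return np.mean(np.max(sigmas @ values.T, axis=1)) / values.shape[1]

print(emp_rademacher(G_values))          # approximately equal to the next value
print(0.5 * emp_rademacher(H_values))    # one half of the complexity of H
```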
Generalization Bounds - Rademacher

Corollary: Let H be a family of functions taking values in {−1, +1}. Then, for any δ > 0, with probability at least 1 − δ, for any h ∈ H,
$$R(h) \le \widehat{R}(h) + \mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
$$R(h) \le \widehat{R}(h) + \widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
Remarks

The first bound is distribution-dependent and the second is data-dependent, which makes them attractive. But how do we compute the empirical Rademacher complexity? Computing $\mathbb{E}_{\sigma}\big[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^m \sigma_i h(x_i)\big]$ requires solving ERM problems, which is typically computationally hard. Is there a relation with combinatorial measures that are easier to compute?
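To make the ERM remark concrete (an added sketch, not from the slides): for ±1-valued hypotheses, $\sum_i \sigma_i h(x_i) = m - 2 \cdot (\text{number of disagreements between } h \text{ and the labels } \sigma)$, so each inner sup is exactly one empirical-risk-minimization problem over H with the random σ's playing the role of labels. The `erm_oracle` below is a hypothetical interface standing in for such a solver.

```python
import numpy as np

def emp_rademacher_via_erm(points, erm_oracle, n_draws=1000, seed=0):
    """Estimate R_hat_S(H) given an oracle returning the minimal number of
    training errors achievable over H on (points, labels); hypothetical interface."""
    rng = np.random.default_rng(seed)
    m = len(points)
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1, 1], size=m)
        min_errors = erm_oracle(points, sigma)   # one (typically hard) ERM solve
        total += (m - 2 * min_errors) / m        # = sup_h (1/m) sum_i sigma_i h(x_i)
    return total / n_draws
```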
This lecture
• Rademacher complexity
• Growth function
• VC dimension
• Lower bound
Growth Function

Definition: the growth function $\Pi_H : \mathbb{N} \to \mathbb{N}$ for a hypothesis set H is defined by
$$\forall m \in \mathbb{N}, \quad \Pi_H(m) = \max_{\{x_1, \ldots, x_m\} \subseteq X} \Big|\big\{\big(h(x_1), \ldots, h(x_m)\big) : h \in H\big\}\Big|.$$
Thus, $\Pi_H(m)$ is the maximum number of ways m points can be classified using H.
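For simple classes the growth function can be computed by brute force. The sketch below (added, using interval classifiers on the real line as an illustrative class) counts the dichotomies induced on m collinear points and matches the closed form m(m+1)/2 + 1.

```python
import itertools
import numpy as np

def growth_function_intervals(m):
    """Number of dichotomies of m collinear points realizable by intervals [a, b]."""
    points = np.arange(m, dtype=float)          # any m distinct points on the line
    # Candidate endpoints: values just outside the points and midpoints between them.
    cuts = np.concatenate(([points[0] - 1.0],
                           (points[:-1] + points[1:]) / 2.0,
                           [points[-1] + 1.0]))
    dichotomies = set()
    for a, b in itertools.product(cuts, repeat=2):
        labels = tuple(np.where((points >= a) & (points <= b), 1, -1))
        dichotomies.add(labels)
    return len(dichotomies)

for m in range(1, 7):
    print(m, growth_function_intervals(m), m * (m + 1) // 2 + 1)
```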
Massart's Lemma (Massart, 2000)

Theorem: Let $A \subseteq \mathbb{R}^m$ be a finite set, with $R = \max_{x \in A} \|x\|_2$. Then the following holds:
$$\mathbb{E}_{\sigma}\Big[\frac{1}{m} \sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\Big] \le \frac{R\sqrt{2\log|A|}}{m}.$$
Proof: for any t > 0,
$$\exp\Big(t\,\mathbb{E}_{\sigma}\Big[\sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\Big]\Big) \le \mathbb{E}_{\sigma}\Big[\exp\Big(t \sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\Big)\Big] \qquad (\text{Jensen's ineq.})$$
$$= \mathbb{E}_{\sigma}\Big[\sup_{x \in A} \exp\Big(t \sum_{i=1}^m \sigma_i x_i\Big)\Big]$$
$$\le \sum_{x \in A} \mathbb{E}_{\sigma}\Big[\exp\Big(t \sum_{i=1}^m \sigma_i x_i\Big)\Big] = \sum_{x \in A} \prod_{i=1}^m \mathbb{E}_{\sigma_i}\big[\exp(t\,\sigma_i x_i)\big]$$
$$\le \sum_{x \in A} \exp\Big(\frac{t^2 \sum_{i=1}^m (2|x_i|)^2}{8}\Big) \le |A|\, e^{\frac{t^2 R^2}{2}}. \qquad (\text{Hoeffding's ineq.})$$
• Taking the log yields:
$$\mathbb{E}_{\sigma}\Big[\sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\Big] \le \frac{\log|A|}{t} + \frac{tR^2}{2}.$$
• Minimizing the bound by choosing $t = \frac{\sqrt{2\log|A|}}{R}$ gives
$$\mathbb{E}_{\sigma}\Big[\sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\Big] \le R\sqrt{2\log|A|}.$$
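As an added sanity check, the lemma can be verified numerically on a randomly generated finite set A; the dimensions and the number of vectors below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_vectors = 30, 50
A = rng.normal(size=(n_vectors, m))               # a finite set of vectors in R^m
R = np.max(np.linalg.norm(A, axis=1))             # R = max_{x in A} ||x||_2

n_draws = 20000
sigmas = rng.choice([-1.0, 1.0], size=(n_draws, m))
lhs = np.mean(np.max(sigmas @ A.T, axis=1)) / m   # E_sigma[(1/m) sup_x sum_i sigma_i x_i]
rhs = R * np.sqrt(2.0 * np.log(n_vectors)) / m    # Massart's bound

print(lhs, rhs, lhs <= rhs)
```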
Growth Function Bound on Rademacher Complexity

Corollary: Let G be a family of functions taking values in {−1, +1}. Then the following holds:
$$\mathfrak{R}_m(G) \le \sqrt{\frac{2\log\Pi_G(m)}{m}}.$$
Proof:
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\left[\sup_{g \in G} \frac{1}{m} \begin{pmatrix} g(z_1) \\ \vdots \\ g(z_m) \end{pmatrix} \cdot \begin{pmatrix} \sigma_1 \\ \vdots \\ \sigma_m \end{pmatrix}\right]$$
$$\le \frac{\sqrt{m}\,\sqrt{2\log\big|\{(g(z_1), \ldots, g(z_m)) : g \in G\}\big|}}{m} \qquad (\text{Massart's Lemma, with } R = \sqrt{m})$$
$$\le \frac{\sqrt{m}\,\sqrt{2\log\Pi_G(m)}}{m} = \sqrt{\frac{2\log\Pi_G(m)}{m}}.$$
Generalization Bound - Growth Function

Corollary: Let H be a family of functions taking values in {−1, +1}. Then, for any δ > 0, with probability at least 1 − δ, for any h ∈ H,
$$R(h) \le \widehat{R}(h) + \sqrt{\frac{2\log\Pi_H(m)}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
But how do we compute the growth function? Relationship with the VC-dimension (Vapnik-Chervonenkis dimension).
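As an added illustration, plugging a polynomial growth function into the corollary shows how the complexity term vanishes as m grows; the interval class (with Π_H(m) = m(m+1)/2 + 1, as counted in the earlier sketch) and δ = 0.05 are illustrative choices.

```python
import math

def growth_bound_slack(m, delta, growth):
    # sqrt(2 log Pi_H(m) / m) + sqrt(log(1/delta) / (2m))
    return (math.sqrt(2.0 * math.log(growth(m)) / m)
            + math.sqrt(math.log(1.0 / delta) / (2.0 * m)))

intervals_growth = lambda m: m * (m + 1) // 2 + 1
for m in (100, 1000, 10000, 100000):
    print(m, growth_bound_slack(m, 0.05, intervals_growth))
```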
This lecture
• Rademacher complexity
• Growth function
• VC dimension
• Lower bound
VC Dimension (Vapnik & Chervonenkis, 1968-1971; Vapnik, 1982, 1995, 1998)

Definition: the VC-dimension of a hypothesis set H is defined by
$$\mathrm{VCdim}(H) = \max\{m : \Pi_H(m) = 2^m\}.$$
Thus, the VC-dimension is the size of the largest set that can be fully shattered by H. Purely combinatorial notion.
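For small finite settings, the definition can be checked by brute force. The helpers below are an added sketch, taking a hypothesis class as an iterable of ±1-valued callables; they test whether a point set is shattered and whether some subset of a given size is.

```python
from itertools import combinations

def is_shattered(points, hypotheses):
    """True if `hypotheses` (an iterable of +/-1-valued callables) realizes
    all 2^|points| dichotomies on `points`."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

def shatters_some_subset(all_points, hypotheses, d):
    """Witness for VCdim(H) >= d: does some subset of size d get shattered?"""
    return any(is_shattered(list(subset), hypotheses)
               for subset in combinations(all_points, d))
```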
Examples

In the following, we determine the VC dimension for several hypothesis sets.
• To give a lower bound d for VCdim(H), it suffices to show that a set S of cardinality d can be shattered by H.
• To give an upper bound, we need to prove that no set S of cardinality d + 1 can be shattered by H, which is typically more difficult.
Intervals of the Real Line

Observations:
• Any set of two points can be shattered by four intervals, realizing the dichotomies "- -", "+ -", "- +", and "+ +".
• No set of three points can be shattered, since the dichotomy "+ - +" is not realizable (by definition of intervals).
• Thus, VCdim(intervals in R) = 2.
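Using the `is_shattered` helper sketched above with a grid of interval classifiers (an illustrative construction, not from the slides), these observations can be confirmed computationally:

```python
import numpy as np
from itertools import product

# Interval classifiers h_{a,b}(x) = +1 if a <= x <= b else -1, over a grid of endpoints.
endpoints = np.linspace(-1.0, 4.0, 60)
intervals = [(lambda x, a=a, b=b: 1 if a <= x <= b else -1)
             for a, b in product(endpoints, repeat=2) if a <= b]

print(is_shattered([0.0, 1.0], intervals))        # True: two points are shattered
print(is_shattered([0.0, 1.0, 2.0], intervals))   # False: "+ - +" is unrealizable
```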
Hyperplanes

Observations:
• Any three non-collinear points in the plane can be shattered.
• For any four points, some dichotomy is unrealizable (e.g., the XOR-type labeling that assigns the same sign to diagonally opposite points in convex position).
• Thus, VCdim(hyperplanes in R^d) = d + 1.
Axis-Aligned Rectangles in the Plane

Observations:
• Four points in a diamond configuration can be shattered: each of the 16 dichotomies is realized by some axis-aligned rectangle.
• No set of five points can be shattered: label positively the points attaining the left-, right-, top-, and bottom-most coordinates and negatively a remaining point; any rectangle containing the positive points must also contain the negative one.
• Thus, VCdim(axis-aligned rectangles) = 4.
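Similarly (an added check reusing the `is_shattered` helper from above), the four-point diamond configuration can be verified to be shattered by axis-aligned rectangles drawn from a small grid of corner coordinates:

```python
import numpy as np
from itertools import product

# Four points in a diamond configuration.
points = [(0.0, 1.0), (0.0, -1.0), (1.0, 0.0), (-1.0, 0.0)]

grid = np.linspace(-1.5, 1.5, 7)
rectangles = [(lambda p, x1=x1, x2=x2, y1=y1, y2=y2:
               1 if x1 <= p[0] <= x2 and y1 <= p[1] <= y2 else -1)
              for x1, x2, y1, y2 in product(grid, repeat=4)
              if x1 <= x2 and y1 <= y2]

print(is_shattered(points, rectangles))           # True: witnesses VCdim >= 4
```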