Multi-class SVMs: From Tighter Data-Dependent Generalization Bounds to Novel Algorithms

Marius Kloft
Joint work with Yunwen Lei (CU Hong Kong), Urun Dogan (Microsoft Research), and Alexander Binder (Singapore)

Extreme Classification

Many modern applications involve a huge number of classes.
◮ E.g., image annotation (Deng, Dong, Socher, Li, Li, and Fei-Fei, 2009)
◮ Datasets are still growing

Need for theory and algorithms for extreme classification (multi-class classification with a huge number of classes).
Discrepancy of Theory and Algorithms in Extreme Classification

◮ Algorithms can handle huge class sizes
  ◮ (stochastic) dual coordinate ascent (Keerthi et al., 2008; Shalev-Shwartz and Zhang, to appear)
◮ Theory is not prepared for extreme classification
  ◮ Data-dependent bounds scale at least linearly with the number of classes (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Kuznetsov et al., 2014)

Questions
◮ Can we get bounds with a mild dependence on the number of classes?
◮ What would we learn from such bounds? ⇒ Novel algorithms?

Theory
Multi-class Classification

Given:
◮ Training data $z_1 = (x_1, y_1), \ldots, z_n = (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$, drawn i.i.d. from $P$
◮ $\mathcal{Y} := \{1, 2, \ldots, c\}$
◮ $c$ = number of classes

[Figure: example images for the 20 PASCAL VOC classes (aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor)]

Formal Problem Setting

Aim:
◮ Define a hypothesis class $H$ of functions $h = (h_1, \ldots, h_c)$
◮ Find an $h \in H$ that "predicts well" via $\hat{y} := \arg\max_{y \in \mathcal{Y}} h_y(x)$

Multi-class SVMs:
◮ $h_y(x) = \langle w_y, \varphi(x) \rangle$
◮ Introduce the notion of the (multi-class) margin
  $\rho_h(x, y) := h_y(x) - \max_{y': y' \neq y} h_{y'}(x)$
◮ the larger the margin, the better

Want: large expected margin $\mathbb{E}\, \rho_h(X, Y)$.
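To make these definitions concrete, here is a minimal sketch (Python/NumPy, not part of the talk) that computes the linear scores $h_y(x) = \langle w_y, x\rangle$ with $\varphi$ taken as the identity feature map, the argmax prediction, and the multi-class margin $\rho_h(x, y)$ on made-up toy data.

```python
import numpy as np

# Toy setup: n = 4 examples in d = 3 dimensions, c = 3 classes.
# phi is taken to be the identity feature map for simplicity.
X = np.array([[ 1.0,  0.2, -0.5],
              [-0.3,  1.1,  0.4],
              [ 0.8, -0.7,  0.9],
              [ 0.1,  0.5, -1.2]])
y = np.array([0, 1, 2, 1])                 # true labels in {0, ..., c-1}
W = np.random.RandomState(0).randn(3, 3)   # one weight vector w_y per class (rows)

# Scores h_y(x) = <w_y, x> for every example and class: shape (n, c).
scores = X @ W.T

# Prediction: y_hat = argmax_y h_y(x).
y_hat = scores.argmax(axis=1)

# Multi-class margin rho_h(x, y) = h_y(x) - max_{y' != y} h_{y'}(x).
true_scores = scores[np.arange(len(y)), y]
masked = scores.copy()
masked[np.arange(len(y)), y] = -np.inf     # exclude the true class from the max
margins = true_scores - masked.max(axis=1)

print("predictions:", y_hat)
print("margins:    ", margins)             # positive margin <=> correct prediction
```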
Types of Generalization Bounds for Multi-class Classification

Data-independent bounds
◮ based on covering numbers (Guermeur, 2002; Zhang, 2004a,b; Hill and Doucet, 2007)
  - conservative
  - unable to adapt to the data

Data-dependent bounds
◮ based on Rademacher complexity (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Cortes et al., 2013; Kuznetsov et al., 2014)
  + tighter
  + able to capture the actual data
  + computable from the data

Rademacher & Gaussian Complexity

Definition
◮ Let $\sigma_1, \ldots, \sigma_n$ be independent Rademacher variables (taking the values $\pm 1$ with equal probability).
◮ The Rademacher complexity (RC) is defined as
$$R(H) := \mathbb{E}_\sigma \sup_{h \in H} \frac{1}{n} \sum_{i=1}^n \sigma_i h(z_i)$$

Definition
◮ Let $g_1, \ldots, g_n \sim N(0, 1)$.
◮ The Gaussian complexity (GC) is defined as
$$G(H) := \mathbb{E}_g \sup_{h \in H} \frac{1}{n} \sum_{i=1}^n g_i h(z_i)$$

Interpretation: RC and GC reflect the ability of the hypothesis class to correlate with random noise.

Theorem (Ledoux and Talagrand, 1991)
Rademacher and Gaussian complexities are equivalent up to constant and logarithmic factors, so bounds stated in terms of one transfer to the other.
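Both complexities can be estimated numerically by Monte Carlo over the random signs or Gaussians. The sketch below (Python/NumPy, not from the talk) does this for the linear class $\{x \mapsto \langle w, x\rangle : \|w\|_2 \le 1\}$, for which the supremum has the closed form $\frac{1}{n}\|\sum_i \epsilon_i x_i\|_2$ by Cauchy-Schwarz. The data are made-up illustrative values.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 5)          # n = 50 points in d = 5 dimensions
n = X.shape[0]

def empirical_complexity(X, noise="rademacher", n_mc=2000, rng=rng):
    """Monte Carlo estimate of the empirical RC/GC of {x -> <w, x> : ||w||_2 <= 1}.

    For this class, sup_{||w||<=1} (1/n) sum_i eps_i <w, x_i> = (1/n) ||sum_i eps_i x_i||_2.
    """
    vals = []
    for _ in range(n_mc):
        if noise == "rademacher":
            eps = rng.choice([-1.0, 1.0], size=n)
        else:                              # Gaussian noise
            eps = rng.randn(n)
        vals.append(np.linalg.norm(eps @ X) / n)
    return np.mean(vals)

print("Rademacher complexity ~", empirical_complexity(X, "rademacher"))
print("Gaussian complexity   ~", empirical_complexity(X, "gaussian"))
```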
Existing Data-Dependent Analysis

The key step is estimating $R(\{\rho_h : h \in H\})$, the complexity induced by the margin operator $\rho_h$ and the class $H$.

Existing bounds build on the structural result:
$$R\big(\{\max\{h_1, \ldots, h_c\} : h_j \in H_j, \; j = 1, \ldots, c\}\big) \le \sum_{j=1}^c R(H_j) \qquad (1)$$

The correlation among class-wise components is ignored.

Best known dependence on the number of classes:
◮ quadratic dependence: Koltchinskii and Panchenko (2002); Mohri et al. (2012); Cortes et al. (2013)
◮ linear dependence: Kuznetsov et al. (2014)

Can we do better?

A New Structural Lemma on Gaussian Complexities

We consider Gaussian complexity.
◮ $H$ is a vector-valued function class, $g_{11}, \ldots, g_{nc} \sim N(0, 1)$
◮ We show:
$$G\big(\{\max\{h_1, \ldots, h_c\} : h = (h_1, \ldots, h_c) \in H\}\big) \le \frac{1}{n}\, \mathbb{E}_g \sup_{h = (h_1, \ldots, h_c) \in H} \sum_{i=1}^n \sum_{j=1}^c g_{ij}\, h_j(x_i). \qquad (2)$$

Core idea: a comparison inequality for Gaussian processes (Slepian, 1962). Define
$$X_h := \sum_{i=1}^n g_i \max\{h_1(x_i), \ldots, h_c(x_i)\}, \qquad Y_h := \sum_{i=1}^n \sum_{j=1}^c g_{ij}\, h_j(x_i), \qquad \forall h \in H.$$
Then
$$\mathbb{E}\big[(X_\theta - X_{\bar\theta})^2\big] \le \mathbb{E}\big[(Y_\theta - Y_{\bar\theta})^2\big] \;\;\Rightarrow\;\; \mathbb{E}\Big[\sup_{\theta \in \Theta} X_\theta\Big] \le \mathbb{E}\Big[\sup_{\theta \in \Theta} Y_\theta\Big].$$

Eq. (2) preserves the coupling among class-wise components!
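A minimal Monte Carlo sanity check of the structural lemma (Python/NumPy, not from the talk) on a small finite vector-valued class: it estimates the Gaussian complexity of the max-class (left side of (2)), the coupled bound (right side of (2)), and a decoupled sum of per-class complexities in the spirit of (1), here also with Gaussian noise. The class, data, and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
n, d, c, K = 30, 4, 5, 50           # examples, dims, classes, size of a finite hypothesis class
X = rng.randn(n, d)

# Finite vector-valued class: K candidate weight matrices W^(k) of shape (c, d).
Ws = rng.randn(K, c, d) / np.sqrt(d)
scores = np.einsum('kcd,nd->knc', Ws, X)   # scores[k, i, j] = <w_j^(k), x_i>
max_scores = scores.max(axis=2)            # max_j h_j(x_i), shape (K, n)

n_mc = 3000
lhs = coupled = decoupled = 0.0
for _ in range(n_mc):
    g1 = rng.randn(n)                      # one Gaussian per example
    G = rng.randn(n, c)                    # one Gaussian per (example, class) pair
    # Left-hand side of (2): Gaussian complexity of the "max" class.
    lhs += (max_scores @ g1).max() / n
    # Right-hand side of (2): coupled bound, a single sup over the joint class.
    coupled += np.einsum('knc,nc->k', scores, G).max() / n
    # Decoupled bound in the spirit of (1): sum of per-class complexities.
    decoupled += sum((scores[:, :, j] @ g1).max() for j in range(c)) / n

print("GC of max-class      ~", lhs / n_mc)
print("coupled bound (2)    ~", coupled / n_mc)
print("decoupled bound (1)  ~", decoupled / n_mc)   # typically the largest of the three
```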
Example on Comparison of the Structural Lemma

◮ Consider $H := \{(x_1, x_2) \mapsto (h_1, h_2)(x_1, x_2) = (w_1 x_1, w_2 x_2) : \|(w_1, w_2)\|_2 \le 1\}$
◮ For the function class $\{\max\{h_1, h_2\} : h = (h_1, h_2) \in H\}$, the classical result (1) bounds the complexity via the decoupled sum
$$\sup_{(h_1, h_2) \in H} \sum_{i=1}^n \sigma_i h_1(x_i) \;+\; \sup_{(h_1, h_2) \in H} \sum_{i=1}^n \sigma_i h_2(x_i),$$
whereas the new result (2) uses the coupled quantity
$$\sup_{(h_1, h_2) \in H} \sum_{i=1}^n \big[g_{i1} h_1(x_i) + g_{i2} h_2(x_i)\big].$$

Preserving the coupling means taking the supremum over a smaller space!

Estimating the Multi-class Gaussian Complexity

◮ Consider a vector-valued function class defined by
$$H := \{h_w = (\langle w_1, \varphi(x)\rangle, \ldots, \langle w_c, \varphi(x)\rangle) : f(w) \le \Lambda\},$$
where $f$ is $\beta$-strongly convex w.r.t. $\|\cdot\|$:
◮ $f(\alpha x + (1-\alpha) y) \le \alpha f(x) + (1-\alpha) f(y) - \frac{\beta}{2}\, \alpha (1-\alpha) \|x - y\|^2$.

Theorem
$$\frac{1}{n}\, \mathbb{E}_g \sup_{h_w \in H} \sum_{i=1}^n \sum_{j=1}^c g_{ij}\, h_w^j(x_i) \;\le\; \frac{1}{n} \sqrt{\frac{2\pi\Lambda}{\beta}}\; \mathbb{E}_g \Big\| \Big(\sum_{i=1}^n g_{i1}\varphi(x_i), \ldots, \sum_{i=1}^n g_{ic}\varphi(x_i)\Big) \Big\|_* \qquad (3)$$
where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$.
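As a sanity check of the theorem, in the reconstructed form (3) above, here is a Monte Carlo sketch (Python/NumPy, not from the talk) for the simplest case $f(w) = \frac{1}{2}\|w\|_2^2$ (so $\beta = 1$ and the dual norm is again the Euclidean norm). In this case the left-hand supremum has a closed form by Cauchy-Schwarz, so both sides can be estimated directly; the constant on the right follows the reconstruction, not a verified statement from the paper.

```python
import numpy as np

rng = np.random.RandomState(1)
n, d, c = 40, 6, 8
X = rng.randn(n, d)                 # phi taken as the identity feature map
Lam = 1.0                           # constraint f(w) = 0.5 * ||w||_2^2 <= Lambda  (beta = 1)
radius = np.sqrt(2 * Lam)           # equivalent to ||w||_2 <= sqrt(2 * Lambda)

n_mc = 3000
lhs = rhs = 0.0
for _ in range(n_mc):
    G = rng.randn(n, c)             # g_ij
    V = X.T @ G                     # V[:, j] = sum_i g_ij * x_i, shape (d, c)
    norm_V = np.linalg.norm(V)      # Euclidean norm of the whole tuple (dual norm of ||.||_2)
    # LHS of (3): for the l2-ball the sup is attained in closed form (Cauchy-Schwarz).
    lhs += radius * norm_V / n
    # RHS of (3), using the constant from the reconstruction above.
    rhs += np.sqrt(2 * np.pi * Lam) * norm_V / n

print("LHS (exact sup, MC over g) ~", lhs / n_mc)
print("RHS of (3)                 ~", rhs / n_mc)   # larger by a factor of sqrt(pi) here
```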
Features of the Complexity Bound

◮ Applies to a general function class defined through a strongly-convex regularizer $f$
◮ Class-wise components $h_1, \ldots, h_c$ are correlated through the term $\mathbb{E}_g \big\| \big(\sum_{i=1}^n g_{ij}\varphi(x_i)\big)_{j=1}^c \big\|_*$
◮ Consider the class $H_{p,\Lambda} := \{h_w : \|w\|_{2,p} \le \Lambda\}$, $(\frac{1}{p} + \frac{1}{p^*} = 1)$; then (see the sketch below):
$$\frac{1}{n}\, \mathbb{E}_g \sup_{h_w \in H_{p,\Lambda}} \sum_{i=1}^n \sum_{j=1}^c g_{ij}\, h_w^j(x_i) \;\le\; \frac{\Lambda}{n} \sqrt{\sum_{i=1}^n k(x_i, x_i)} \times \begin{cases} \sqrt{e}\,(4\log c)^{\frac{1}{2} + \frac{1}{2\log c}}, & \text{if } p^* \ge 2\log c, \\[4pt] (2p^*)^{\frac{1}{2} + \frac{1}{p^*}}\, c^{\frac{1}{p^*}}, & \text{otherwise}. \end{cases}$$

The dependence on the number of classes is sublinear for $1 \le p \le 2$, and even logarithmic as $p$ approaches $1$!
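To see the class-size dependence concretely, a small script (Python, not from the talk, and using the case formula as reconstructed above) evaluates the class-size factor for several values of $p$ and $c$:

```python
import numpy as np

def class_size_factor(p, c):
    """Class-size factor in the H_{p,Lambda} bound (as reconstructed above)."""
    p_star = p / (p - 1) if p > 1 else np.inf
    if p_star >= 2 * np.log(c):
        return np.sqrt(np.e) * (4 * np.log(c)) ** (0.5 + 1 / (2 * np.log(c)))
    return (2 * p_star) ** (0.5 + 1 / p_star) * c ** (1 / p_star)

for c in (10, 1000, 100000):
    row = {p: round(class_size_factor(p, c), 1) for p in (1.01, 1.2, 1.5, 2.0)}
    print(f"c = {c:>6}:", row)
# For p = 2 the factor grows roughly like sqrt(c); as p -> 1 it grows only logarithmically in c.
```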
Algorithms

ℓp-norm Multi-class SVM

Motivated by the mild dependence on $c$ as $p \to 1$, we consider the ($\ell_p$-norm) Multi-class SVM, $1 \le p \le 2$:
$$\min_w \; \frac{1}{2} \Big(\sum_{j=1}^c \|w_j\|_2^p\Big)^{\frac{2}{p}} + C \sum_{i=1}^n (1 - t_i)_+ \qquad \text{(P)}$$
$$\text{s.t.} \quad t_i = \langle w_{y_i}, \varphi(x_i)\rangle - \max_{y: y \neq y_i} \langle w_y, \varphi(x_i)\rangle$$

Dual Problem
$$\sup_{\alpha \in \mathbb{R}^{n \times c}} \; -\frac{1}{2} \Big(\sum_{j=1}^c \Big\|\sum_{i=1}^n \alpha_{ij}\, \varphi(x_i)\Big\|_2^{\frac{p}{p-1}}\Big)^{\frac{2(p-1)}{p}} + \sum_{i=1}^n \alpha_{i y_i} \qquad \text{(D)}$$
$$\text{s.t.} \quad \alpha_i \le e_{y_i} \cdot C \;\wedge\; \alpha_i \cdot \mathbf{1} = 0, \quad \forall i = 1, \ldots, n.$$

(D) is not quadratic if $p \neq 2$; how can we optimize it?

Equivalent Formulation

We introduce class weights $\beta_1, \ldots, \beta_c$ to obtain a quadratic dual:
$$\min_\beta \sum_{j=1}^c \frac{\|w_j\|_2^2}{\beta_j} + \lambda \|\beta\|_{\bar p} \quad \text{has its optimum at} \quad \beta_j \propto \|w_j\|_2^{\frac{2}{\bar p + 1}}.$$

Equivalent Problem
$$\min_{w, \beta} \; \sum_{j=1}^c \frac{\|w_j\|_2^2}{2\beta_j} + C \sum_{i=1}^n (1 - t_i)_+ \qquad \text{(E)}$$
$$\text{s.t.} \quad t_i \le \langle w_{y_i}, \varphi(x_i)\rangle - \langle w_y, \varphi(x_i)\rangle, \quad y \neq y_i, \; i = 1, \ldots, n,$$
$$\qquad \|\beta\|_{\bar p} \le 1, \quad \bar p = p\,(2 - p)^{-1}, \quad \beta_j \ge 0.$$

Alternating optimization w.r.t. $\beta$ and $w$ (see the sketch below).
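A minimal sketch of the alternating scheme (Python/NumPy), not the authors' implementation: the $\beta$-step uses the closed-form optimum from the identity above, and for the $w$-step a plain subgradient method on the weighted primal stands in for the dual coordinate ascent solver one would use in practice. All hyperparameters, helper names, and the toy data are made up for illustration.

```python
import numpy as np

def lp_mcsvm_alternating(X, y, p=1.5, C=1.0, outer_iters=10, inner_iters=200, lr=0.01, seed=0):
    """Sketch of alternating optimization for problem (E).

    beta-step: closed form beta_j ~ ||w_j||_2^(2-p), normalized so that ||beta||_{p/(2-p)} = 1.
    w-step:    plain subgradient descent on the weighted multi-class hinge objective
               (a stand-in for the dual coordinate ascent solver used in practice).
    """
    rng = np.random.RandomState(seed)
    n, d = X.shape
    c = int(y.max()) + 1
    W = 0.01 * rng.randn(c, d)
    beta = np.full(c, c ** (-(2 - p) / p))          # uniform, feasible start

    for _ in range(outer_iters):
        # ---- w-step: minimize sum_j ||w_j||^2 / (2 beta_j) + C * sum_i (1 - t_i)_+ ----
        for _ in range(inner_iters):
            scores = X @ W.T                                     # (n, c)
            margins = scores[np.arange(n), y][:, None] - scores  # margin vs. each competing class
            margins[np.arange(n), y] = np.inf
            viol = (1.0 - margins).max(axis=1)                   # multi-class hinge
            ybar = (1.0 - margins).argmax(axis=1)                # most violating class
            grad = W / beta[:, None]                             # gradient of the regularizer
            for i in np.where(viol > 0)[0]:                      # hinge subgradient
                grad[y[i]] -= C * X[i]
                grad[ybar[i]] += C * X[i]
            W -= lr * grad
        # ---- beta-step: closed-form update ----
        norms = np.linalg.norm(W, axis=1) + 1e-12
        beta = norms ** (2 - p) / (norms ** p).sum() ** ((2 - p) / p)
    return W, beta

# Toy usage on random blobs (made-up data).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 5) + mu for mu in (0, 3, -3)])
y = np.repeat([0, 1, 2], 30)
W, beta = lp_mcsvm_alternating(X, y, p=1.2)
print("class weights beta:", np.round(beta, 3))
print("train accuracy:", ((X @ W.T).argmax(axis=1) == y).mean())
```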