1/21 Multi-class to Binary Reduction of Large-scale Classification Problems
Bikash Joshi
Joint work with Massih-Reza Amini, Ioannis Partalas, Liva Ralaivola, Nicolas Usunier and Eric Gaussier
BigTargets ECML 2015 workshop, September 11th, 2015
2/21 Outline
❑ Motivation
❑ Learning objective and reduction strategy
❑ Experimental results
❑ Conclusion
3/21 Multiclass classification: emerging problems
❑ The number of classes, K, in new emerging multiclass problems, for example in text and image classification, may reach $10^5$ to $10^6$ categories.
❑ For example: the DMOZ and Wikipedia collections used in the experiments below.
4/21 Large-scale classification: power-law distribution of classes

Collection   K      d
DMOZ         7500   594158

[Histogram for DMOZ-7500: number of classes (y-axis, 0 to 4000) per number of documents per class (x-axis bins: 2-5, 6-10, 11-30, 31-100, 101-200, >200)]
5/21 Multiclass classification approaches
❑ Uncombined approaches, e.g. M-SVM or MLP: the number of parameters, M, is at least $O(K \times d)$.
❑ Combined approaches based on binary classification:
  ❑ One-vs-One: $M \ge O(K^2 \times d)$
  ❑ One-vs-Rest: $M \ge O(K \times d)$
❑ For $K \gg 1$ and $d \gg 1$, traditional approaches do not scale (see the sketch below).
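To make these orders of magnitude concrete, here is a small back-of-the-envelope sketch using the DMOZ-7500 figures reported on the experimental slides (the variable names are illustrative):

```python
# Parameter counts for the combined schemes above, using the
# DMOZ-7500 figures (K classes, d features) from the experiments.
K = 7_500      # number of classes
d = 594_158    # feature dimension (vocabulary size)

one_vs_rest = K * d                    # one linear classifier per class
one_vs_one = (K * (K - 1) // 2) * d    # one classifier per pair of classes

print(f"One-vs-Rest: {one_vs_rest:.2e} parameters")  # ~4.5e9
print(f"One-vs-One:  {one_vs_one:.2e} parameters")   # ~1.7e13
```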
6/21 Outline
❑ Motivation
❑ Learning objective and reduction strategy
❑ Experimental results
❑ Conclusion
7/21 Learning objective
❑ Large-scale multiclass classification.
❑ Hypothesis: observations $x^y = (x, y) \in \mathcal{X} \times \mathcal{Y}$ are i.i.d. with respect to a distribution $\mathcal{D}$.
❑ For a class of functions $\mathcal{H} = \{h : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}\}$, define the instantaneous ranking loss of $h \in \mathcal{H}$ over an example $x^y$ by:
$$e(h, x^y) = \frac{1}{K-1} \sum_{y' \in \mathcal{Y} \setminus \{y\}} \mathbb{1}_{h(x^y) \le h(x^{y'})}$$
❑ The aim is to find a function $h \in \mathcal{H}$ that minimizes the generalization error $L(h)$:
$$L(h) = \mathbb{E}_{x^y \sim \mathcal{D}}\left[e(h, x^y)\right]$$
❑ The empirical error of a function $h \in \mathcal{H}$ over a training set $S = \left(x_i^{y_i}\right)_{i=1}^m$ is:
$$\hat{L}_m(h, S) = \frac{1}{m} \sum_{i=1}^m e(h, x_i^{y_i})$$
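As a minimal sketch of this loss, assuming h is any function scoring (example, class) couples (the names below are illustrative, not from the paper):

```python
def ranking_loss(h, x, y, classes):
    """Instantaneous ranking loss e(h, x^y): the fraction of wrong
    classes y' that h ranks at least as high as the true class y."""
    true_score = h(x, y)
    errors = sum(1 for yp in classes if yp != y and true_score <= h(x, yp))
    return errors / (len(classes) - 1)

def empirical_error(h, S, classes):
    """Empirical error over a sample S = [(x_1, y_1), ..., (x_m, y_m)]."""
    return sum(ranking_loss(h, x, y, classes) for x, y in S) / len(S)
```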
8/21 Reduction strategy
❑ Consider the empirical loss:
$$\hat{L}_m(h, S) = \frac{1}{m(K-1)} \sum_{i=1}^m \sum_{y' \in \mathcal{Y} \setminus \{y_i\}} \mathbb{1}_{h(x_i^{y_i}) \le h(x_i^{y'})} = \underbrace{\frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\tilde{y}_i\, g(Z_i) \le 0}}_{L_n^T(g,\, T(S))}$$
where $n = m(K-1)$, $Z_i$ is a pair of couples constituted by the couple of an example and its true class and the couple of the same example and another class, $\tilde{y}_i = 1$ if the first couple in $Z_i$ is the true couple and $-1$ otherwise, and $g(x^y, x^{y'}) = h(x^y) - h(x^{y'})$.
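A sketch of the transformation T, under the assumption that the position of the true couple in each pair is randomized so both binary labels occur (the slide only specifies that $\tilde{y}_i = 1$ when the true couple comes first):

```python
import random

def transform(S, classes):
    """Map a multiclass sample S = [(x, y), ...] into n = m(K-1) binary
    examples. Each Z pairs the true couple (x, y) with a couple (x, y')
    for a wrong class y'; the label records which couple is the true one."""
    T = []
    for x, y in S:
        for yp in classes:
            if yp == y:
                continue
            if random.random() < 0.5:
                T.append((((x, y), (x, yp)), +1))  # true couple first
            else:
                T.append((((x, yp), (x, y)), -1))  # true couple second
    return T
```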
9/21 Reduction strategy for the class of linear functions
❑ For linear functions, $g_w(Z) = \langle w, \Phi(x^y) - \Phi(x^{y'}) \rangle$, where $\Phi : \mathcal{X} \times \mathcal{Y} \to \mathcal{H}$ is a joint feature mapping over (example, class) couples.
Problems:
❑ How to define $\Phi(x^y)$?
❑ Consistency of the ERM principle with interdependent data.
10/21 Consistency of the ERM principle with interdependent data
❑ Different statistical tools exist for extending concentration inequalities to the case of interdependent data;
❑ here, tools based on graph coloring proposed by (Janson, 2004)¹ are used.

Example with m = 3 examples and K = 3 classes:
$S = (x_1^1, x_2^2, x_3^3)$
$T(S) = \{(x_1^1, x_1^2), (x_1^1, x_1^3), (x_2^2, x_2^1), (x_2^2, x_2^3), (x_3^3, x_3^1), (x_3^3, x_3^2)\}$
Cover of T(S) by sets of independent pairs:
$(C_1, \alpha_1 = 1) = \{(x_1^1, x_1^2), (x_2^2, x_2^1), (x_3^3, x_3^1)\}$
$(C_2, \alpha_2 = 1) = \{(x_1^1, x_1^3), (x_2^2, x_2^3), (x_3^3, x_3^2)\}$

1. S. Janson. Large deviations for sums of partly dependent random variables. Random Structures and Algorithms, 24(3):234–248, 2004.
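A sketch of one way to realize such a cover (an illustrative construction, not necessarily the paper's exact one): give each example's k-th generated pair the color k, so every color class contains at most one pair per original example and its pairs are mutually independent.

```python
from collections import defaultdict

def fractional_cover(S, classes):
    """Split the transformed pairs into K-1 color classes C_1..C_{K-1},
    each holding at most one pair per original example, so the pairs
    inside a class are built from i.i.d. examples."""
    cover = defaultdict(list)
    for x, y in S:
        others = [yp for yp in classes if yp != y]
        for k, yp in enumerate(others):
            cover[k].append(((x, y), (x, yp)))  # k-th pair of this example
    return cover
```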
11/21 Theorem
Let $S = (x_i^{y_i})_{i=1}^m \in (\mathcal{X} \times \mathcal{Y})^m$ be a training set constituted of m examples generated i.i.d. with respect to a probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, and $T(S) = ((Z_i, \tilde{y}_i))_{i=1}^n \in (\mathcal{Z} \times \{-1, 1\})^n$ the transformed set obtained with the application T. Let $\kappa : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ be a PSD kernel, and $\Phi : \mathcal{X} \times \mathcal{Y} \to \mathcal{H}$ the associated mapping function. For all $1 > \delta > 0$ and all $g_w \in \mathcal{G}_B = \{x \mapsto \langle w, \Phi(x) \rangle \mid \|w\| \le B\}$, with probability at least $(1 - \delta)$ over $T(S)$ we then have:
$$L^T(g_w) \le \epsilon_n^T(g_w, T(S)) + \frac{2B\,\mathfrak{G}(T(S))}{n} + 3\sqrt{\frac{(K-1)\ln(\frac{2}{\delta})}{2n}} \qquad (1)$$
where $\epsilon_n^T(g_w, T(S)) = \frac{1}{n}\sum_{i=1}^n \mathcal{L}(\tilde{y}_i\, g_w(Z_i))$ with the surrogate hinge loss $\mathcal{L} : t \mapsto \min(1, \max(1-t, 0))$, $L^T(g_w) = \mathbb{E}_{T(S)}[L_n^T(g_w, T(S))]$, and $\mathfrak{G}(T(S)) = \sqrt{\sum_{i=1}^n d_\kappa(Z_i)}$ with $d_\kappa(x^y, x^{y'}) = \kappa(x^y, x^y) + \kappa(x^{y'}, x^{y'}) - 2\kappa(x^y, x^{y'})$.
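A small numerical sketch of the quantities in the bound, for a linear kernel where $d_\kappa(Z_i)$ reduces to the squared distance between the two couples' feature vectors (all names are illustrative):

```python
import numpy as np

def surrogate_hinge(t):
    """Surrogate loss from the theorem: L(t) = min(1, max(1 - t, 0))."""
    return np.minimum(1.0, np.maximum(1.0 - t, 0.0))

def empirical_surrogate_risk(w, phi_first, phi_second, y_tilde):
    """eps^T_n(g_w, T(S)) with g_w(Z_i) = <w, Phi(first_i) - Phi(second_i)>.
    phi_first, phi_second: (n, p) arrays of joint feature vectors."""
    margins = y_tilde * ((phi_first - phi_second) @ w)
    return surrogate_hinge(margins).mean()

def capacity_term(phi_first, phi_second):
    """G(T(S)) = sqrt(sum_i d_kappa(Z_i)); for a linear kernel,
    d_kappa(Z_i) = ||Phi(first_i) - Phi(second_i)||^2."""
    diffs = phi_first - phi_second
    return np.sqrt((diffs * diffs).sum())
```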
12/21 Key features of the algorithm
❑ Data-dependent bound: if the dimension of the feature representation of (x, y) pairs is independent of the original dimension, then:
$$\mathfrak{G}(T(S)) \le \sqrt{n} \times \text{Constant} \approx \sqrt{m \times (K-1)} \times \text{Constant}$$
❑ Non-trivial joint feature representation (example-class pairs)
❑ Same representation for any number of classes
❑ Same parameter vector for all classes
13/21 Outline
❑ Motivation
❑ Learning objective and reduction strategy
❑ Experimental results
❑ Conclusion
14/21 Feature representation $\Phi(x^y)$

Features:
1. $\sum_{t \in y \cap x} \ln(1 + y_t)$
2. $\sum_{t \in y \cap x} \ln(1 + \frac{l_S}{S_t})$
3. $\sum_{t \in y \cap x} \ln(1 + \frac{y_t}{|y|})$
4. $\sum_{t \in y \cap x} I_t$
5. $\sum_{t \in y \cap x} \ln(1 + \frac{y_t}{|y|} \cdot \frac{l_S}{S_t})$
6. $\sum_{t \in y \cap x} \ln(1 + \frac{y_t}{|y|} \cdot I_t)$
7. $\sum_{t \in y \cap x} 1$
8. $\sum_{t \in y \cap x} \frac{y_t}{|y|} \cdot I_t$
9. $d_1(x^y)$
10. $d_2(x^y)$

❑ $x_t$: number of occurrences of term t in document x,
❑ $\mathcal{V}$: number of distinct terms in S,
❑ $y_t = \sum_{x \in y} x_t$, $|y| = \sum_{t \in \mathcal{V}} y_t$, $S_t = \sum_{x \in \mathcal{S}} x_t$, $l_S = \sum_{t \in \mathcal{V}} S_t$,
❑ $I_t$: idf of term t.
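A sketch of computing this representation (features 1-8 only; the distance features $d_1$, $d_2$ are not defined on this slide and are omitted — the dictionaries and names below are illustrative):

```python
import math

def joint_features(x_counts, y_counts, S_counts, idf, l_S, y_len):
    """Sketch of the joint representation Phi(x^y), features 1-8 from the
    table. x_counts, y_counts, S_counts map term -> frequency in the
    document, the class, and the whole collection; idf maps term -> idf."""
    common = set(x_counts) & set(y_counts)   # terms t in y ∩ x
    f = [0.0] * 8
    for t in common:
        yt, St = y_counts[t], S_counts[t]
        f[0] += math.log(1 + yt)                          # feature 1
        f[1] += math.log(1 + l_S / St)                    # feature 2
        f[2] += math.log(1 + yt / y_len)                  # feature 3
        f[3] += idf[t]                                    # feature 4
        f[4] += math.log(1 + (yt / y_len) * (l_S / St))   # feature 5
        f[5] += math.log(1 + (yt / y_len) * idf[t])       # feature 6
        f[6] += 1.0                                       # feature 7
        f[7] += (yt / y_len) * idf[t]                     # feature 8
    return f
```

Note that the dimension of this representation does not depend on d or K, which is what makes the data-dependent bound on $\mathfrak{G}(T(S))$ effective.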
15/21 Experimental results on text classification

Collection   K      d        m        Test size
DMOZ         7500   594158   394756   104263
WIKIPEDIA    7500   346299   456886   81262

$K \times d = O(10^9)$
❑ Random samples of 100, 500, 1000, 3000, 5000 and 7500 classes
16/21 Experimental setup
Implementation and comparison:
❑ SVM with linear kernel as the binary classification algorithm
❑ Value of C chosen by cross-validation
❑ Comparison with OVA, OVO, M-SVM, LogT
Performance evaluation:
❑ Accuracy: proportion of correctly classified examples in the test set
❑ Macro F-measure: harmonic mean of macro precision and recall
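Putting the pieces together, a minimal end-to-end sketch under the assumptions of the earlier snippets (scikit-learn's LinearSVC stands in for the linear-kernel SVM; `transform` is the illustrative helper above, and `phi` wraps `joint_features` to return a NumPy array for a couple (x, y)):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Each binary example is the feature difference between the two couples
# of a transformed pair Z = (first, second), labelled +1/-1.
pairs = transform(S_train, classes)
X_bin = np.array([phi(first) - phi(second) for (first, second), _ in pairs])
y_bin = np.array([label for _, label in pairs])

clf = LinearSVC(C=1.0)  # C would be selected by cross-validation
clf.fit(X_bin, y_bin)
w = clf.coef_[0]

def predict(x, classes):
    """Predict the class whose couple (x, y) scores highest under w."""
    return max(classes, key=lambda y: w @ phi((x, y)))
```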
17/21 Experimental results
Results for 7500 classes:
❑ OVO and M-SVM did not scale to 7500 classes
❑ $N_c$: proportion of classes for which at least one true-positive (TP) document is found
❑ mRb covers 6-9.5% more classes than OVA (500-700 classes)
18/21 # of classes vs. Macro F-measure
[Figure: Macro F-measure as a function of the number of classes]
19/21 # of classes vs. Macro F-measure
[Figure: Macro F-measure as a function of the number of classes]
20/21 Conclusion
❑ A new method for large-scale multiclass classification, based on the reduction of multiclass classification to binary classification.
❑ The efficiency of the derived algorithm is comparable to or better than state-of-the-art multiclass classification approaches.