Multi-class to Binary Reduction of Large-scale Classification Problems

  1. 1/21 Multi-class to Binary Reduction of Large-scale Classification Problems. Bikash Joshi, joint work with Massih-Reza Amini, Ioannis Partalas, Liva Ralaivola, Nicolas Usunier and Eric Gaussier. BigTargets ECML 2015 workshop, September 11th, 2015

  2. 2/21 Outline ❑ Motivation ❑ Learning objective and reduction strategy ❑ Experimental results ❑ Conclusion

  4. 3/21 Multiclass classification: emerging problems ❑ The number of classes, K, in new emerging multiclass problems, for example in text and image classification, may reach 10^5 to 10^6 categories. ❑ For example, the DMOZ and Wikipedia collections studied below.

  5. 4/21 Large-scale classification: power law distribution of classes. Collection: DMOZ, K = 7500, d = 594158. [Bar chart "DMOZ-7500": # Classes (y-axis) per # Documents bin (x-axis: 2-5, 6-10, 11-30, 31-100, 101-200, >200); the vast majority of classes contain only a few documents.]

  6. 5/21 Multiclass classification approaches ❑ Uncombined approaches, e.g. MSVM or MLP: the number of parameters, M, is at least O(K × d). ❑ Combined approaches based on binary classification: ❑ One-Vs-One: M ≥ O(K^2 × d) ❑ One-Vs-Rest: M ≥ O(K × d) ❑ For K >> 1 and d >> 1, traditional approaches do not scale.
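A quick back-of-the-envelope check (not part of the original slides) makes the parameter counts above concrete, using the DMOZ figures reported later in the deck (K = 7500, d = 594158):

```python
# Rough parameter counts for linear models in the DMOZ setting (K = 7500, d = 594158).
K, d = 7500, 594158

ovr = K * d                    # One-Vs-Rest: one d-dimensional weight vector per class
ovo = K * (K - 1) // 2 * d     # One-Vs-One: one classifier per pair of classes

print(f"One-Vs-Rest: {ovr:.2e} parameters")   # on the order of 10^9
print(f"One-Vs-One:  {ovo:.2e} parameters")   # on the order of 10^13
```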

  7. 6/21 Outline ❑ Motivation ❑ Learning objective and reduction strategy ❑ Experimental results ❑ Conclusion

  8. 7/21 Learning objective ❑ Large-scale multiclass classification. ❑ Hypothesis: observations $x^y = (x, y) \in \mathcal{X} \times \mathcal{Y}$ are i.i.d. with respect to a distribution $\mathcal{D}$. ❑ For a class of functions $\mathcal{H} = \{h : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}\}$, define the instantaneous ranking loss of $h \in \mathcal{H}$ over an example $x^y$ by:
$$e(h, x^y) = \frac{1}{K-1} \sum_{y' \in \mathcal{Y} \setminus \{y\}} \mathbb{1}_{h(x^y) \le h(x^{y'})}$$
❑ The aim is to find a function $h \in \mathcal{H}$ that minimizes the generalization error $L(h)$:
$$L(h) = \mathbb{E}_{x^y \sim \mathcal{D}}\left[e(h, x^y)\right]$$
❑ The empirical error of a function $h \in \mathcal{H}$ over a training set $S = (x_i^{y_i})_{i=1}^m$ is:
$$\hat{L}_m(h, S) = \frac{1}{m} \sum_{i=1}^m e(h, x_i^{y_i})$$
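The instantaneous ranking loss above can be sketched in a few lines of Python; the scorer h and the joint feature map phi below are placeholders introduced for illustration, not the representation used in the slides:

```python
import numpy as np

def ranking_loss(h, x, y, classes):
    """e(h, x^y): fraction of wrong classes y' scored at least as high as the true class y."""
    K = len(classes)
    wrong = sum(1 for y_p in classes if y_p != y and h(x, y) <= h(x, y_p))
    return wrong / (K - 1)

# Toy usage with a hypothetical linear scorer over a joint feature map phi(x, y).
rng = np.random.default_rng(0)
classes = list(range(5))
phi = lambda x, y: np.concatenate([x * (y + 1) / len(classes), x])   # placeholder joint features
w = rng.normal(size=8)
h = lambda x, y: w @ phi(x, y)

x = rng.normal(size=4)
print(ranking_loss(h, x, y=2, classes=classes))   # value in [0, 1]
```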

  9. 8/21 Reduction strategy ❑ Consider the empirical loss
$$\hat{L}_m(h, S) = \frac{1}{m(K-1)} \sum_{i=1}^m \sum_{y' \in \mathcal{Y} \setminus \{y_i\}} \mathbb{1}_{h(x_i^{y_i}) \le h(x_i^{y'})} = \underbrace{\frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\tilde{y}_i g(Z_i) \le 0}}_{\hat{L}_n^T(g, T(S))}$$
where $n = m(K-1)$, $Z_i$ is a pair of couples constituted by the couple of an example and its class and the couple of the same example and another class, $\tilde{y}_i = 1$ if the first couple in $Z_i$ is the true couple and $-1$ otherwise, and $g(x^y, x^{y'}) = h(x^y) - h(x^{y'})$.
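The following minimal sketch spells out the transformation T for a generic joint feature map phi; the function name and data layout are illustrative, not taken from the authors' implementation:

```python
def transform(S, classes, phi):
    """Map a multiclass sample S = [(x, y), ...] to the binary sample T(S).
    Each example is paired with its K-1 wrong classes; Z_i is the pair of couples
    and the binary label is +1 when the true couple comes first, -1 otherwise."""
    T = []
    for i, (x, y) in enumerate(S):
        for j, y_prime in enumerate(c for c in classes if c != y):
            if (i + j) % 2 == 0:           # alternate the order so both labels appear
                T.append(((phi(x, y), phi(x, y_prime)), +1))
            else:
                T.append(((phi(x, y_prime), phi(x, y)), -1))
    return T

# For a linear h, g(Z) = <w, Phi_first> - <w, Phi_second>, so the induced binary
# classifier operates on the difference of the two joint feature vectors.
```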

  10. 9/21 Reduction strategy for the class of linear functions

  11. 9/21 Reduction strategy for the class of linear functions. Problems: ❑ How to define Φ(x^y)? ❑ Consistency of the ERM principle with interdependent data.

  12. 10/21 Consistency of the ERM principle with interdependent data ❑ Different statistical tools exist for extending concentration inequalities to the case of interdependent data; ❑ here, tools based on colorable graphs proposed by (Janson, 2004)¹ are used.
Illustration for m = 3 examples and K = 3 classes:
S = (x_1^1, x_2^2, x_3^3)
T(S) = (x_1^1, x_1^2), (x_1^1, x_1^3), (x_2^2, x_2^1), (x_2^2, x_2^3), (x_3^3, x_3^1), (x_3^3, x_3^2)
Exact cover of T(S) into sets of independent pairs:
C_1 = {(x_1^1, x_1^2), (x_2^2, x_2^1), (x_3^3, x_3^1)}, α_1 = 1
C_2 = {(x_1^1, x_1^3), (x_2^2, x_2^3), (x_3^3, x_3^2)}, α_2 = 1
1. S. Janson. Large deviations for sums of partly dependent random variables. Random Structures and Algorithms, 24(3):234-248, 2004.
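A small sketch of the cover used in the argument, following the indexing of the illustration above: the n = m(K-1) transformed pairs are split into K-1 sets, each containing at most one pair per original example, so the pairs inside a set are mutually independent (names and layout are illustrative):

```python
def exact_cover(m, K):
    """Partition the indices of T(S) into K-1 sets C_1, ..., C_{K-1}; the j-th set
    takes, for every original example i, the pair built with its j-th wrong class,
    so no two pairs within a set share an example."""
    cover_sets = [[] for _ in range(K - 1)]
    for i in range(m):              # original example index
        for j in range(K - 1):      # index among the K-1 wrong classes
            cover_sets[j].append(i * (K - 1) + j)
    return cover_sets

# m = 3 examples, K = 3 classes: two sets of three independent pairs each,
# matching (C_1, alpha_1 = 1) and (C_2, alpha_2 = 1) in the illustration.
print(exact_cover(3, 3))   # [[0, 2, 4], [1, 3, 5]]
```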

  13. 11/21 Theorem. Let $S = (x_i^{y_i})_{i=1}^m \in (\mathcal{X} \times \mathcal{Y})^m$ be a training set of m examples generated i.i.d. with respect to a probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, and let $T(S) = ((Z_i, \tilde{y}_i))_{i=1}^n \in (\mathcal{Z} \times \{-1, 1\})^n$ be the transformed set obtained with the application T. Let $\kappa : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ be a PSD kernel and $\Phi : \mathcal{X} \times \mathcal{Y} \to \mathcal{H}$ the associated mapping function. For all $1 > \delta > 0$ and all $g_w \in \mathcal{G}_B = \{x \mapsto \langle w, \Phi(x) \rangle \mid \|w\| \le B\}$, with probability at least $(1 - \delta)$ over $T(S)$ we have:
$$L^T(g_w) \le \epsilon_n^T(g_w, T(S)) + \frac{2B\,G(T(S))}{m(K-1)} + 3\sqrt{\frac{\ln(2/\delta)}{2m}} \qquad (1)$$
where $\epsilon_n^T(g_w, T(S)) = \frac{1}{n} \sum_{i=1}^n L(\tilde{y}_i\, g_w(Z_i))$ with the surrogate hinge loss $L : t \mapsto \min(1, \max(1 - t, 0))$, $L^T(g_w) = \mathbb{E}_{T(S)}[\hat{L}_n^T(g_w, T(S))]$, and $G(T(S)) = \sqrt{\sum_{i=1}^n d_\kappa(Z_i)}$ with $d_\kappa(x^y, x^{y'}) = \kappa(x^y, x^y) + \kappa(x^{y'}, x^{y'}) - 2\kappa(x^y, x^{y'})$.
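The data-dependent term G(T(S)) in the bound is easy to compute; the sketch below assumes a linear kernel on the joint representation, in which case d_kappa is just a squared Euclidean distance (the random vectors are only there to make the snippet runnable):

```python
import numpy as np

def G(Z_pairs):
    """G(T(S)) = sqrt(sum_i d_kappa(Z_i)); for a linear kernel,
    d_kappa(x^y, x^y') = ||Phi(x^y) - Phi(x^y')||^2."""
    return np.sqrt(sum(np.sum((a - b) ** 2) for a, b in Z_pairs))

# Illustration: with a bounded joint representation, G(T(S)) grows like sqrt(n),
# so the second term of the bound vanishes as m (hence n = m(K-1)) grows.
rng = np.random.default_rng(0)
for n in (100, 1000, 10000):
    pairs = [(rng.normal(size=10), rng.normal(size=10)) for _ in range(n)]
    print(n, G(pairs) / np.sqrt(n))   # roughly constant across n
```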

  14. 12/21 Key Features of the Algorithm ❑ Data-dependent bound: if the dimension of the joint (x, y) feature representation is independent of the original dimension, then G(T(S)) ≤ √n × Constant ≈ √(m × (K−1)) × Constant ❑ Non-trivial joint feature representation (example-class pairs) ❑ The same representation for any number of classes ❑ The same parameter vector for all classes

  15. 13/21 Outline ❑ Motivation ❑ Learning objective and reduction strategy ❑ Experimental results ❑ Conclusion

  16. 14/21 Feature representation Φ(x^y). Features (all sums are over terms $t \in y \cap x$):
1. $\sum_{t \in y \cap x} \ln(1 + y_t)$
2. $\sum_{t \in y \cap x} \ln(1 + l_S / S_t)$
3. $\sum_{t \in y \cap x} \ln(1 + y_t / |y|)$
4. $\sum_{t \in y \cap x} I_t$
5. $\sum_{t \in y \cap x} \ln(1 + (y_t / |y|) \cdot I_t)$
6. $\sum_{t \in y \cap x} \ln(1 + (y_t / |y|) \cdot (l_S / S_t))$
7. $\sum_{t \in y \cap x} 1$
8. $\sum_{t \in y \cap x} (y_t / |y|) \cdot I_t$
9. $d_1(x^y)$
10. $d_2(x^y)$
where: ❑ $x_t$: number of occurrences of term t in document x; ❑ $\mathcal{V}$: number of distinct terms in S; ❑ $y_t = \sum_{x \in y} x_t$, $|y| = \sum_{t \in \mathcal{V}} y_t$, $S_t = \sum_{x \in S} x_t$, $l_S = \sum_{t \in \mathcal{V}} S_t$; ❑ $I_t$: idf of term t.
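A sketch of how such a joint representation can be computed from bag-of-words statistics; the dictionary-based inputs are assumptions, and features 9-10 (the distances d_1 and d_2) are omitted since they are not defined on the slide:

```python
import math

def joint_features(x_counts, y_counts, S_counts, idf):
    """phi(x^y) built from the terms t in y ∩ x.
    x_counts: term -> count in document x; y_counts: term -> total count in class y;
    S_counts: term -> total count in collection S; idf: term -> inverse document frequency."""
    y_len = sum(y_counts.values())               # |y|
    l_S = sum(S_counts.values())                 # collection length
    f = [0.0] * 8
    for t in x_counts:
        if t not in y_counts:
            continue                             # keep only t in y ∩ x
        y_t, S_t, I_t = y_counts[t], S_counts[t], idf[t]
        f[0] += math.log(1 + y_t)
        f[1] += math.log(1 + l_S / S_t)
        f[2] += math.log(1 + y_t / y_len)
        f[3] += I_t
        f[4] += math.log(1 + (y_t / y_len) * I_t)
        f[5] += math.log(1 + (y_t / y_len) * (l_S / S_t))
        f[6] += 1.0
        f[7] += (y_t / y_len) * I_t
    return f
```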

  17. 15/21 Experimental results on text classification
Collection | K | d | m | Test size
DMOZ | 7500 | 594158 | 394756 | 104263
WIKIPEDIA | 7500 | 346299 | 456886 | 81262
K × d = O(10^9)
❑ Random samples of 100, 500, 1000, 3000, 5000 and 7500 classes

  18. 16/21 Experimental Setup. Implementation and comparison: ❑ SVM with linear kernel as the binary classification algorithm ❑ Value of C chosen by cross-validation ❑ Comparison with OVA, OVO, M-SVM, LogT. Performance evaluation: ❑ Accuracy: proportion of correctly classified examples in the test set ❑ Macro F-Measure: per-class harmonic mean of precision and recall, averaged over classes
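A minimal sketch of the two evaluation measures, assuming plain lists of true and predicted labels (illustrative, not the evaluation code used for the experiments):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Proportion of test examples whose predicted class is correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Per-class F1 (harmonic mean of precision and recall), averaged over classes."""
    classes = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

print(accuracy([1, 2, 2, 3], [1, 2, 3, 3]))   # 0.75
print(macro_f1([1, 2, 2, 3], [1, 2, 3, 3]))   # average of per-class F1 over classes 1, 2, 3
```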

  19. 17/21 Experimental Results. Results for 7500 classes: ❑ OVO and M-SVM did not scale to 7500 classes ❑ N_c: proportion of classes for which at least one true-positive document is found ❑ mRb covers 6-9.5% more classes than OVA (500-700 classes)
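The coverage measure N_c above can be sketched as follows (assuming plain label lists; illustrative only):

```python
def class_coverage(y_true, y_pred):
    """N_c: proportion of classes with at least one true-positive prediction."""
    classes = set(y_true)
    covered = {t for t, p in zip(y_true, y_pred) if t == p}
    return len(covered) / len(classes)

print(class_coverage([1, 1, 2, 3], [1, 2, 2, 2]))   # classes 1 and 2 covered -> 2/3
```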

  20. 18/21 # of Classes Vs. Macro F-Measure

  21. 19/21 # of Classes Vs. Macro F-Measure

  22. 20/21 Conclusion ❑ A new method for large-scale multiclass classification based on the reduction of the multiclass problem to binary classification. ❑ The efficiency of the derived algorithm is comparable to or better than state-of-the-art multiclass classification approaches.
