Foundations of Machine Learning
Multi-Class Classification
Motivation

Real-world problems often have multiple classes: text, speech, image, biological sequences. The algorithms studied so far are designed for binary classification problems. How do we design multi-class classification algorithms?
• Can the algorithms used for binary classification be generalized to multi-class classification?
• Can we reduce multi-class classification to binary classification?
Multi-Class Classification Problem

Training data: sample drawn i.i.d. from $X$ according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m.$$
• mono-label case: $\mathrm{Card}(Y) = k$.
• multi-label case: $Y = \{-1, +1\}^k$.

Problem: find classifier $h\colon X \to Y$ in $H$ with small generalization error,
• mono-label case: $R(h) = \mathbb{E}_{x \sim D}\big[1_{h(x) \neq f(x)}\big]$.
• multi-label case: $R(h) = \mathbb{E}_{x \sim D}\big[\tfrac{1}{k} \sum_{l=1}^{k} 1_{[h(x)]_l \neq [f(x)]_l}\big]$.
Notes

• In most tasks considered, the number of classes is $k \le 100$.
• For $k$ large, the problem is often not treated as a multi-class classification problem (ranking or density estimation instead, e.g., automatic speech recognition).
• Computational efficiency issues arise for larger $k$'s.
• In general, classes are not balanced.
Multi-Class Classification - Margin

Hypothesis set $H$:
• functions $h\colon X \times Y \to \mathbb{R}$.
• label returned: $x \mapsto \operatorname{argmax}_{y \in Y} h(x, y)$.

Margin:
• $\rho_h(x, y) = h(x, y) - \max_{y' \neq y} h(x, y')$.
• error: $1_{\rho_h(x, y) \le 0} \le \Phi_\rho(\rho_h(x, y))$.
• empirical margin loss:
$$\widehat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^{m} \Phi_\rho(\rho_h(x_i, y_i)).$$
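As a concrete illustration, here is a minimal numpy sketch of the margin $\rho_h$ and of the empirical margin loss, using the $\rho$-margin (ramp) surrogate $\Phi_\rho(u) = \min(1, \max(0, 1 - u/\rho))$ as one particular choice of $\Phi_\rho$; the function and variable names are illustrative only.

```python
import numpy as np

def multiclass_margins(scores, y):
    """rho_h(x_i, y_i) = h(x_i, y_i) - max_{y' != y_i} h(x_i, y'),
    given an (m, k) matrix of scores h(x_i, y)."""
    m = scores.shape[0]
    correct = scores[np.arange(m), y]
    masked = scores.copy()
    masked[np.arange(m), y] = -np.inf          # exclude the true label
    return correct - masked.max(axis=1)

def empirical_margin_loss(scores, y, rho):
    """Empirical rho-margin loss with Phi_rho(u) = min(1, max(0, 1 - u/rho))."""
    margins = multiclass_margins(scores, y)
    return np.clip(1.0 - margins / rho, 0.0, 1.0).mean()

# toy usage: 4 examples, 3 classes
scores = np.array([[2.0, 0.5, -1.0],
                   [0.2, 0.1,  0.0],
                   [-1.0, 1.5, 1.4],
                   [0.0, 0.0,  3.0]])
y = np.array([0, 1, 1, 2])
print(empirical_margin_loss(scores, y, rho=1.0))
```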
Multi-Class Margin Bound (MM et al. 2012; Kuznetsov, MM, and Syed, 2014)

Theorem: let $H \subseteq \mathbb{R}^{X \times Y}$ with $Y = \{1, \ldots, k\}$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following multi-class classification bound holds for all $h \in H$:
$$R(h) \le \widehat{R}_\rho(h) + \frac{4k}{\rho} \mathfrak{R}_m(\Pi_1(H)) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
with $\Pi_1(H) = \{x \mapsto h(x, y)\colon y \in Y, h \in H\}$.
Kernel-Based Hypotheses

Hypothesis set $H_{K,p}$:
• $\Phi$: feature mapping associated to a PDS kernel $K$.
• functions $(x, y) \mapsto w_y \cdot \Phi(x)$, $y \in \{1, \ldots, k\}$.
• label returned: $x \mapsto \operatorname{argmax}_{y \in \{1, \ldots, k\}} w_y \cdot \Phi(x)$.
• for any $p \ge 1$,
$$H_{K,p} = \big\{(x, y) \in X \times [1, k] \mapsto w_y \cdot \Phi(x)\colon \mathbf{W} = (w_1, \ldots, w_k),\ \|\mathbf{W}\|_{\mathbb{H},p} \le \Lambda\big\}.$$
Multi-Class Margin Bound - Kernels (MM et al. 2012)

Theorem: let $K\colon X \times X \to \mathbb{R}$ be a PDS kernel and let $\Phi\colon X \to \mathbb{H}$ be a feature mapping associated to $K$. Fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following multi-class classification bound holds for all $h \in H_{K,p}$:
$$R(h) \le \widehat{R}_\rho(h) + 4k \sqrt{\frac{r^2 \Lambda^2}{\rho^2 m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
where $r^2 = \sup_{x \in X} K(x, x)$.
Approaches

Single classifier:
• Multi-class SVMs.
• AdaBoost.MH.
• Conditional Maxent.
• Decision trees.

Combination of binary classifiers:
• One-vs-all (sketched below).
• One-vs-one.
• Error-correcting codes.
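As an illustration of the reduction approaches, here is a minimal one-vs-all sketch; the use of scikit-learn's LogisticRegression as the underlying binary learner is an arbitrary choice for the example, not something prescribed by the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_all_fit(X, y, k):
    """Train one binary classifier per class l: label +1 if y_i == l, -1 otherwise.
    (Assumes every class appears at least once in y.)"""
    return [LogisticRegression().fit(X, np.where(y == l, 1, -1)) for l in range(k)]

def one_vs_all_predict(classifiers, X):
    """Assign x to the class whose binary classifier gives the largest real-valued score."""
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return scores.argmax(axis=1)
```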
Multi-Class SVMs (Weston and Watkins, 1999; Crammer and Singer, 2001)

Optimization problem:
$$\min_{\mathbf{w}, \boldsymbol{\xi}} \; \frac{1}{2} \sum_{l=1}^{k} \|w_l\|^2 + C \sum_{i=1}^{m} \xi_i$$
subject to: $w_{y_i} \cdot x_i + \delta_{y_i, l} \ge w_l \cdot x_i + 1 - \xi_i$, for all $(i, l) \in [1, m] \times Y$.

Decision function:
$$h\colon x \mapsto \operatorname*{argmax}_{l \in Y} \, (w_l \cdot x).$$
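To make the objective concrete, here is a small numpy sketch that evaluates the primal objective for a given weight matrix W of shape (k, d): at the optimum the slack satisfies $\xi_i = \max\big(0, \max_{l \neq y_i} (1 + w_l \cdot x_i - w_{y_i} \cdot x_i)\big)$, which is what the code computes. The names are illustrative and this is not a solver.

```python
import numpy as np

def multiclass_svm_objective(W, X, y, C):
    """Primal objective (1/2) sum_l ||w_l||^2 + C sum_i xi_i, with the optimal slack
    xi_i = max(0, max_{l != y_i} (1 + w_l.x_i - w_{y_i}.x_i))."""
    m = X.shape[0]
    scores = X @ W.T                      # (m, k): scores[i, l] = w_l . x_i
    correct = scores[np.arange(m), y]
    margins = 1.0 + scores - correct[:, None]
    margins[np.arange(m), y] = 0.0        # the delta_{y_i,l} term removes l = y_i
    xi = np.maximum(0.0, margins.max(axis=1))
    return 0.5 * (W ** 2).sum() + C * xi.sum()

def multiclass_svm_predict(W, X):
    """Decision function h(x) = argmax_l w_l . x."""
    return (X @ W.T).argmax(axis=1)
```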
Notes

• Directly based on generalization bounds.
• Comparison with (Weston and Watkins, 1999): single slack variable per point, maximum of the slack variables (penalty for the worst class): $\sum_{l=1}^{k} \xi_{il} \to \max_{l=1}^{k} \xi_{il}$.
• PDS kernel instead of inner product.
• Optimization: complex constraints, $mk$-size problem.
  • specific solution based on decomposition into $m$ disjoint sets of constraints (Crammer and Singer, 2001).
Dual Formulation

Optimization problem: $\boldsymbol{\alpha} \in \mathbb{R}^{m \times k}$, with $\boldsymbol{\alpha}_i$ the $i$-th row of the matrix $\boldsymbol{\alpha}$.
$$\max_{\boldsymbol{\alpha} = [\alpha_{ij}]} \; \sum_{i=1}^{m} \boldsymbol{\alpha}_i \cdot \mathbf{e}_{y_i} - \frac{1}{2} \sum_{i,j=1}^{m} (\boldsymbol{\alpha}_i \cdot \boldsymbol{\alpha}_j)(x_i \cdot x_j)$$
subject to: $\forall i \in [1, m]$, $(0 \le \alpha_{i y_i} \le C) \wedge (\forall j \neq y_i,\ \alpha_{ij} \le 0) \wedge (\boldsymbol{\alpha}_i \cdot \mathbf{1} = 0)$.

Decision function:
$$h(x) = \operatorname*{argmax}_{l \in [1, k]} \sum_{i=1}^{m} \alpha_{il} (x_i \cdot x).$$
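The decision function only involves inner products $x_i \cdot x$, so a PDS kernel can be substituted directly. A brief sketch, assuming a kernel function that returns the Gram matrix between two sets of points (all names are my own):

```python
import numpy as np

def dual_decision_function(alpha, X_train, X_test, kernel):
    """h(x) = argmax_l sum_i alpha_{il} K(x_i, x), with alpha of shape (m, k)."""
    K = kernel(X_test, X_train)            # (n_test, m) Gram matrix
    return (K @ alpha).argmax(axis=1)

# with the plain inner product x_i . x used on the slide:
linear_kernel = lambda A, B: A @ B.T
```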
AdaBoost (Schapire and Singer, 2000)

Training data (multi-label case): $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \{-1, +1\}^k$.

Reduction to binary classification:
• each example $(x_i, y_i)$ leads to $k$ binary examples:
$$(x_i, y_i) \to ((x_i, 1), y_i[1]), \ldots, ((x_i, k), y_i[k]), \quad i \in [1, m].$$
• apply AdaBoost to the resulting problem.
• choice of $\alpha_t$.

Computational cost: $mk$ distribution updates at each round.
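The reduction itself is mechanical; a minimal sketch follows, where encoding the pair $(x_i, l)$ as an index tuple $(i, l)$ is an arbitrary representation chosen for the example.

```python
import numpy as np

def multilabel_to_binary(X, Y):
    """Map each (x_i, y_i), y_i in {-1,+1}^k, to the k binary examples
    ((x_i, l), y_i[l]); a new example is encoded here as ((i, l), label)."""
    m, k = Y.shape
    return [((i, l), int(Y[i, l])) for i in range(m) for l in range(k)]

# two examples, three labels -> 2 * 3 = 6 binary examples
Y = np.array([[+1, -1, +1],
              [-1, -1, +1]])
X = np.zeros((2, 5))                 # features are untouched by the reduction
print(multilabel_to_binary(X, Y))
```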
AdaBoost.MH

$H \subseteq (\{-1, +1\})^{X \times Y}$.

AdaBoost.MH($S = ((x_1, y_1), \ldots, (x_m, y_m))$)
 1  for i ← 1 to m do
 2      for l ← 1 to k do
 3          D_1(i, l) ← 1/(mk)
 4  for t ← 1 to T do
 5      h_t ← base classifier in H with small error ε_t = Pr_{D_t}[h_t(x_i, l) ≠ y_i[l]]
 6      α_t ← choose α to minimize Z_t
 7      Z_t ← Σ_{i,l} D_t(i, l) exp(−α_t y_i[l] h_t(x_i, l))
 8      for i ← 1 to m do
 9          for l ← 1 to k do
10              D_{t+1}(i, l) ← D_t(i, l) exp(−α_t y_i[l] h_t(x_i, l)) / Z_t
11  f_T ← Σ_{t=1}^T α_t h_t
12  return h_T = sgn(f_T)
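For readers who prefer runnable code, here is a compact Python sketch of the pseudocode above. The base classifiers are single-feature threshold stumps combined with a per-label sign vector; this particular choice of H, and all function and variable names, are illustrative rather than prescribed by the slides.

```python
import numpy as np

def stump_predict(X, j, theta, s):
    """Base classifier h(x, l) = s[l] * sign(x[j] - theta), with values in {-1, +1}."""
    base = np.where(X[:, j] >= theta, 1.0, -1.0)               # shape (m,)
    return base[:, None] * s[None, :]                          # shape (m, k)

def fit_stump(X, Y, D):
    """Select (j, theta, s) with small weighted error eps = Pr_D[h(x_i, l) != y_i[l]]."""
    best = (0, 0.0, None, np.inf)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            base = np.where(X[:, j] >= theta, 1.0, -1.0)
            col = D.sum(axis=0)                                # weight mass per label
            err_pos = (D * (base[:, None] != Y)).sum(axis=0)   # error if s[l] = +1
            s = np.where(err_pos <= col - err_pos, 1.0, -1.0)  # flip the sign if better
            err = np.minimum(err_pos, col - err_pos).sum()
            if err < best[3]:
                best = (j, theta, s, err)
    return best

def adaboost_mh(X, Y, T):
    """AdaBoost.MH with threshold stumps; Y has entries in {-1, +1} and shape (m, k)."""
    m, k = Y.shape
    D = np.full((m, k), 1.0 / (m * k))                         # D_1(i, l) = 1/(mk)
    ensemble = []
    for _ in range(T):
        j, theta, s, eps = fit_stump(X, Y, D)
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))    # alpha_t for {-1,+1} outputs
        H = stump_predict(X, j, theta, s)
        D = D * np.exp(-alpha * Y * H)
        D = D / D.sum()                                        # division by Z_t
        ensemble.append((alpha, j, theta, s))
    return ensemble

def adaboost_mh_predict(ensemble, X):
    """h_T(x, l) = sgn(sum_t alpha_t h_t(x, l))."""
    F = sum(a * stump_predict(X, j, th, s) for a, j, th, s in ensemble)
    return np.sign(F)
```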
Bound on Empirical Error

Theorem: the empirical error of the classifier output by AdaBoost.MH verifies:
$$\widehat{R}(h) \le \prod_{t=1}^{T} Z_t.$$

Proof: similar to the proof for AdaBoost.

Choice of $\alpha_t$:
• for $H \subseteq (\{-1, +1\}^k)^{X \times Y}$, as for AdaBoost, $\alpha_t = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}$.
• for $H \subseteq ([-1, +1]^k)^{X \times Y}$, same choice: minimize upper bound.
• other cases: numerical/approximation method.
Notes

Objective function:
$$F(\boldsymbol{\alpha}) = \sum_{i=1}^{m} \sum_{l=1}^{k} e^{-y_i[l] f_n(x_i, l)} = \sum_{i=1}^{m} \sum_{l=1}^{k} e^{-y_i[l] \sum_{t=1}^{n} \alpha_t h_t(x_i, l)}.$$

All comments and analysis given for AdaBoost apply here.

Alternative: AdaBoost.MR, which coincides with a special case of RankBoost (ranking lecture).
Decision Trees

[Figure: a binary decision tree with splits $X_1 < a_1$, $X_1 < a_2$, $X_2 < a_3$, $X_2 < a_4$, and the corresponding partition of the $(X_1, X_2)$ plane into regions $R_1, \ldots, R_5$.]
Different Types of Questions

Decision trees:
• $X \in \{\text{blue}, \text{white}, \text{red}\}$: categorical questions.
• $X \le a$: continuous variables.

Binary space partition (BSP) trees:
• $\sum_{i=1}^{n} \alpha_i X_i \le a$: partitioning with convex polyhedral regions.

Sphere trees:
• $\|X - a_0\| \le a$: partitioning with pieces of spheres.
Hypotheses

In each region $R_t$:
• classification: majority vote, ties broken arbitrarily,
$$\widehat{y}_t = \operatorname*{argmax}_{y \in Y} \big|\{x_i \in R_t\colon i \in [1, m], y_i = y\}\big|.$$
• regression: average value,
$$\widehat{y}_t = \frac{1}{|S \cap R_t|} \sum_{x_i \in R_t,\, i \in [1, m]} y_i.$$

Form of hypotheses:
$$h\colon x \mapsto \sum_{t} \widehat{y}_t \, 1_{x \in R_t}.$$
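The two leaf-value rules are simple enough to state directly in code; a minimal sketch, where y_region stands for the labels of the training points that fall in $R_t$ (an assumed representation for the example):

```python
import numpy as np

def leaf_value_classification(y_region):
    """Majority vote: y_hat_t = argmax_y |{x_i in R_t : y_i = y}| (ties broken arbitrarily)."""
    labels, counts = np.unique(y_region, return_counts=True)
    return labels[counts.argmax()]

def leaf_value_regression(y_region):
    """Average value: y_hat_t = (1 / |S ∩ R_t|) * sum of the y_i falling in R_t."""
    return float(np.mean(y_region))
```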
Training

Problem: the general problem of determining the partition with minimum empirical error is NP-hard.

Heuristics: greedy algorithm.
• for all $j \in [1, N]$, $\theta \in \mathbb{R}$,
$$R^{+}(j, \theta) = \{x_i \in R\colon x_i[j] \ge \theta, i \in [1, m]\}, \qquad R^{-}(j, \theta) = \{x_i \in R\colon x_i[j] < \theta, i \in [1, m]\}.$$

Decision-Trees($S = ((x_1, y_1), \ldots, (x_m, y_m))$)
1  P ← {S}    (initial partition)
2  for each region R ∈ P such that Pred(R) do
3      (j, θ) ← argmin_{(j, θ)} error(R^−(j, θ)) + error(R^+(j, θ))
4      P ← (P − {R}) ∪ {R^−(j, θ), R^+(j, θ)}
5  return P
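Here is a small sketch of the greedy step, using the misclassification count as the error criterion; that criterion and all names are my own choices for the example, not fixed by the pseudocode.

```python
import numpy as np

def misclassification_error(y_region):
    """error(R): number of points in R whose label differs from the region's majority label."""
    if len(y_region) == 0:
        return 0
    _, counts = np.unique(y_region, return_counts=True)
    return len(y_region) - counts.max()

def best_split(X, y):
    """Greedy step: the (j, theta) minimizing error(R^-(j, theta)) + error(R^+(j, theta))."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            left = y[X[:, j] < theta]          # R^-(j, theta)
            right = y[X[:, j] >= theta]        # R^+(j, theta)
            err = misclassification_error(left) + misclassification_error(right)
            if err < best[2]:
                best = (j, theta, err)
    return best
```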
Splitting/Stopping Criteria

Problem: larger trees overfit the training sample.

Conservative splitting:
• split a node only if the loss is reduced by some fixed value $\eta > 0$.
• issue: a seemingly bad split may dominate useful splits.

Grow-then-prune technique (CART; see the sketch below):
• grow a very large tree, $\mathrm{Pred}(R)\colon |R| > n_0$.
• prune the tree based on $F(T) = \mathrm{Loss}(T) + \alpha |T|$, with $\alpha \ge 0$ a parameter determined by cross-validation.
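Scikit-learn exposes this style of cost-complexity pruning; a hedged sketch of selecting $\alpha$ by cross-validation, with the iris data used purely as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# grow a large tree and compute the candidate cost-complexity parameters alpha
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# pick the alpha whose pruned tree generalizes best under 5-fold cross-validation
scores = [
    (alpha, cross_val_score(DecisionTreeClassifier(ccp_alpha=alpha, random_state=0),
                            X, y, cv=5).mean())
    for alpha in path.ccp_alphas
]
best_alpha = max(scores, key=lambda t: t[1])[0]
pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```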