  1. Bandit Multiclass Linear Classification: Efficient Algorithms for the Separable Case

  Alina Beygelzimer (Yahoo!), Dávid Pál (Yahoo!), Balázs Szörényi (Yahoo!),
  Devanathan Thiruvenkatachari (NYU), Chen-Yu Wei (USC), Chicheng Zhang (Microsoft)

  2. Bandit multiclass classification

  For t = 1, 2, ..., T:
    1. An example (x_t, y_t) is chosen, where
       x_t ∈ R^d is the feature vector (shown to the learner),
       y_t ∈ [K] is the label (hidden from the learner).
    2. The learner predicts a class label ŷ_t ∈ [K].
    3. The learner observes the feedback z_t = 1[ŷ_t ≠ y_t] ∈ {0, 1}.

  Goal: minimize the total number of mistakes ∑_{t=1}^T z_t.
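  The protocol is easy to state in code. Below is a minimal sketch of the interaction loop under an assumed learner interface with `predict` and `update` methods (hypothetical names, not from the paper); the key point is that the learner never observes y_t itself, only the binary signal z_t.

```python
def run_protocol(learner, examples):
    """Simulate bandit multiclass classification and count mistakes.

    `examples` is a sequence of (x_t, y_t) pairs. The learner sees x_t,
    commits to a prediction y_hat, and observes only z_t = 1[y_hat != y_t];
    the true label y_t is never revealed.
    """
    mistakes = 0
    for x_t, y_t in examples:
        y_hat = learner.predict(x_t)      # prediction based on x_t alone
        z_t = int(y_hat != y_t)           # bandit feedback: 1 iff mistake
        learner.update(x_t, y_hat, z_t)   # update uses (x_t, y_hat, z_t) only
        mistakes += z_t
    return mistakes
```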

  3. Challenge: efficient algorithms in the separable setting

  Definition. A dataset is called γ-linearly separable if there exist w_1, ..., w_K
  with ∑_{i=1}^K ‖w_i‖² ≤ 1 such that, for every (x, y) in the dataset,
      ⟨w_y, x⟩ ≥ ⟨w_{y'}, x⟩ + γ   for all y' ≠ y.

  [Figure: three classes in the plane, separated by the decision boundaries
  ⟨w_1 − w_2, x⟩ = 0, ⟨w_1 − w_3, x⟩ = 0, and ⟨w_2 − w_3, x⟩ = 0.]
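  To make the definition concrete, here is a small helper (`multiclass_margin` is my name, not the paper's) that computes the multiclass margin of a candidate W; the dataset is γ-linearly separable exactly when some W with ∑_i ‖w_i‖² ≤ 1 achieves margin at least γ.

```python
import numpy as np

def multiclass_margin(W, X, y):
    """Smallest margin of candidate classifiers W on a labeled dataset.

    W: (K, d) array stacking w_1, ..., w_K.
    X: (n, d) array of feature vectors; y: (n,) labels in {0, ..., K-1}.
    Returns min_t ( <w_{y_t}, x_t> - max_{y' != y_t} <w_{y'}, x_t> ).
    """
    scores = X @ W.T                  # scores[t, i] = <w_i, x_t>
    rows = np.arange(len(y))
    correct = scores[rows, y]         # <w_{y_t}, x_t>
    scores[rows, y] = -np.inf         # mask the true class ...
    runner_up = scores.max(axis=1)    # ... to get max over y' != y_t
    return float((correct - runner_up).min())
```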

  4. Related work

  Algorithm                  Mistake bound                            Efficient?
  Minimax algorithm [DH13]   O(K/γ²)                                  No
  Banditron [KSST08]¹        O(√(TK/γ²))                              Yes
  This work                  2^Õ(min(K log²(1/γ), √(1/γ) log K))      Yes

  Contribution: the first efficient algorithm that breaks the √T barrier.

  ¹ See also [HK11, BOZ17, FKL+18, ...], which have similar guarantees.

  5. Algorithm (one-versus-rest approach)

  Maintain K binary classifiers, one per class; on each x_t, classifier i answers
  YES or NO to "is the label i?". Then:
    If ≥ 1 of them respond YES: ŷ_t ← any one of the YES labels.
    If all of them respond NO: ŷ_t ← uniform at random from {1, ..., K}.
  This reduction guarantees E[#mistakes(alg)] ≤ K ∑_i #mistakes(i); see the sketch below.
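  A minimal sketch of the prediction rule (the names `ovr_predict` and `binary_preds` are mine, not the paper's). The factor K in the mistake bound comes from the uniform fallback: when every classifier says NO, the random guess is correct only with probability 1/K.

```python
import numpy as np

def ovr_predict(binary_preds, rng):
    """One-versus-rest prediction rule.

    binary_preds: boolean array of length K; entry i is True iff the
    binary classifier for class i answered YES on x_t.
    """
    yes = np.flatnonzero(binary_preds)
    if yes.size > 0:
        return int(yes[0])            # any YES label is acceptable
    K = len(binary_preds)
    return int(rng.integers(K))       # all NO: guess uniformly at random

# Example: classifiers 1 and 3 (0-indexed) say YES; the first YES is chosen.
rng = np.random.default_rng(0)
print(ovr_predict(np.array([False, True, False, True]), rng))  # -> 1
```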

  6. Algorithm

  ◮ Each non-linear binary classifier learns the support of class i, which lies in an
    intersection of K − 1 halfspaces with a margin [KS04].

  [Figure: the three-class example again, with decision boundaries
  ⟨w_1 − w_2, x⟩ = 0, ⟨w_1 − w_3, x⟩ = 0, and ⟨w_2 − w_3, x⟩ = 0.]

  ◮ Choice: kernel Perceptron with the rational kernel [SSSS11]:
        K(x, x′) = 1 / (1 − ⟨x, x′⟩ / 2).

  ◮ Thu. Poster #158
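  For illustration, here is a sketch of a binary kernel Perceptron with the rational kernel, assuming ‖x‖ ≤ 1 so the kernel is well defined. Note this is the textbook full-information Perceptron update, not the paper's bandit-feedback algorithm, which must update using only the information revealed by z_t.

```python
import numpy as np

def rational_kernel(x, xp):
    """Rational kernel of [SSSS11]: k(x, x') = 1 / (1 - <x, x'>/2).

    Well defined when ||x||, ||x'|| <= 1, since then <x, x'>/2 <= 1/2 < 1.
    """
    return 1.0 / (1.0 - 0.5 * float(np.dot(x, xp)))

class KernelPerceptron:
    """Binary kernel Perceptron: a candidate one-versus-rest learner."""

    def __init__(self, kernel=rational_kernel):
        self.kernel = kernel
        self.support = []                       # mistake rounds: (x_s, label_s)

    def score(self, x):
        return sum(l * self.kernel(xs, x) for xs, l in self.support)

    def predict(self, x):
        return 1 if self.score(x) > 0 else -1   # +1 = YES, -1 = NO

    def update(self, x, label):
        # Perceptron step: store the example only when the prediction is wrong.
        if self.predict(x) != label:
            self.support.append((x, label))
```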
