
Dual-Decomposed Learning with Factorwise Oracles for Structured Prediction of Large Output Domain



  1. Dual-Decomposed Learning with Factorwise Oracles for Structured Prediction of Large Output Domain. Xiangru Huang∗, joint work [1] with Ian E.H. Yen†, Kai Zhong∗, Ruohan Zhang∗, Chia Dai†, Pradeep Ravikumar† and Inderjit Dhillon∗. ∗University of Texas at Austin. †Carnegie Mellon University. [1] Dual Decomposed Learning with Factorwise Oracle for Structural SVM of Large Output Domain. NIPS 2016.

  2. Outline ◮ Motivations ◮ Key Idea ◮ Methodology Sketch ◮ Experimental Results

  3. Problem Setting ◮ Classification: learn a function g : X → Y. ◮ Structured prediction: assuming structured dependencies on the output, g : X → Y_1 × Y_2 × · · · × Y_m.

  4. Example: Sequence Labeling ◮ Unigram factor: θ_u : Y_t × X_t → R. ◮ Bigram factor: Y_b = Y_{t−1} × Y_t, θ_b : Y_b → R. Figure: Sequence Labeling.
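For concreteness, a minimal sketch (illustrative names and shapes, not from the paper) of how the unigram and bigram factors combine to score one labeling of a sequence:

```python
import numpy as np

def sequence_score(unary, trans, labels):
    """Total score of one labeling: unigram factors theta_u(y_t, x_t)
    plus bigram factors theta_b(y_{t-1}, y_t).
    unary: (T, K) precomputed unigram scores; trans: (K, K) bigram scores."""
    score = sum(unary[t, y] for t, y in enumerate(labels))
    score += sum(trans[labels[t - 1], labels[t]] for t in range(1, len(labels)))
    return float(score)

# Toy usage: T = 4 positions, K = 3 labels.
rng = np.random.default_rng(0)
unary, trans = rng.normal(size=(4, 3)), rng.normal(size=(3, 3))
print(sequence_score(unary, trans, [0, 2, 2, 1]))
```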

  5. Example: Multi-Label Classification with Pairwise Interaction ◮ Unigram factor: θ_u : Y_k × X → R. ◮ Bigram factor: Y_b = Y_k × Y_{k′}, θ_b : Y_b → R. Figure: Multi-Label with Pairwise Interaction.

  6. Motivations ◮ g : X → Y_1 × Y_2 × · · · × Y_m. ◮ Learning requires inference in every iteration. ◮ Exact inference is slow: each iteration takes O(|Y_i|^n) per n-gram factor, where |Y_i| ≥ 3,000. ◮ Approximate inference degrades prediction performance.
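To make the cost concrete with the deck's own numbers (the bigram case, n = 2):

$$|Y_i|^n \ge 3{,}000^2 = 9{,}000{,}000,$$

i.e. one exact pass over a single bigram factor already scores nine million label pairs, and this price is paid at every training iteration.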

  7. Key Idea: Dual-Decomposed Learning ◮ The structural oracle (joint inference) is too expensive. ◮ Reduce the Structural SVM to independent Multiclass SVMs via soft enforcement of consistency between factors. ◮ Ingredients: (cheap) active sets + factorwise oracles + message passing between factors.

  8. Key Idea: Factorwise Oracles
◮ Inner-product (unigram) factor: θ_w(x, y) = ⟨w_y, x⟩.
◮ Reduces to a primal and dual sparse Extreme Multiclass SVM [2].
◮ Cost drops from $O(\underbrace{D}_{\text{feat. dim.}} \cdot |Y_i|)$ to $O(\underbrace{|F_u|}_{\#\text{uni. fac.}} \cdot |A_i|)$, with |A_i| the active-set size (details in [2]).
◮ Indicator (bigram) factor: θ(y_1, y_2) = v_{y_1, y_2}.
◮ Maintain a priority queue on the scores v_{y_1, y_2}.
◮ Cost drops from $O(|Y_1| \cdot |Y_2|)$ to $O(\underbrace{|A_1| \cdot |A_2|}_{\text{active-set sizes}})$.
[2] PD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification. ICML 2016.
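A minimal sketch of the priority-queue idea for the bigram factor (hypothetical interface; the paper's implementation maintains the queue incrementally as v changes): restricted maximization touches only |A_1| · |A_2| active pairs, while the heap exposes the globally best pair as a candidate for growing the active sets.

```python
import heapq
import numpy as np

class BigramFactor:
    """Sketch of an indicator factor theta(y1, y2) = v[y1, y2] with a
    priority queue for cheap maximization. Illustrative only."""

    def __init__(self, v):
        self.v = v
        # Max-heap over all entries via negated scores, built once up front.
        n1, n2 = v.shape
        self.heap = [(-v[i, j], i, j) for i in range(n1) for j in range(n2)]
        heapq.heapify(self.heap)

    def argmax_over_active(self, A1, A2):
        """O(|A1|*|A2|) instead of O(|Y1|*|Y2|): scan only active labels."""
        return max(((i, j) for i in A1 for j in A2), key=lambda p: self.v[p])

    def global_top(self):
        """Peek at the overall best pair, a candidate for the active sets."""
        s, i, j = self.heap[0]
        return i, j, -s
```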

  9. Methodology Sketch
◮ Original problem:
$$\min_{w} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \underbrace{L(w; x_i, y_i)}_{\text{struct. hinge loss}}$$
◮ Dual-decomposed into independent problems:
$$\min_{\alpha_f \in \Delta^{|Y_f|}} \; G(\alpha) := \frac{1}{2} \sum_{f \in F} \left\| \phi(x_f, y_f)^T \alpha_f \right\|^2 - \sum_{j \in V} \delta_j^T \alpha_j$$
(independent Multiclass SVMs) with consistency constraints $M_{jf}\,\alpha_f = \alpha_j, \; \forall (j, f) \in E$.
◮ The standard approach [3] finds a feasible descent direction, which however requires joint inference.
[3] Lacoste-Julien et al. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs. ICML 2013.
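The slides leave the loss implicit; the standard structural SVM hinge loss (with task loss Δ and joint feature map φ) is

$$L(w; x_i, y_i) = \max_{y \in \mathcal{Y}} \Big( \Delta(y, y_i) + w^T \phi(x_i, y) - w^T \phi(x_i, y_i) \Big),$$

whose inner maximization is exactly the joint-inference (loss-augmented decoding) step that dual decomposition is designed to avoid.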

  10. Methodology Sketch
◮ Augmented Lagrangian method:
$$\mathcal{L}(\alpha, \lambda) := \underbrace{G_F(\alpha_F)}_{\text{indep. multiclass SVMs}} + \frac{\rho}{2} \underbrace{\sum_{(j, f) \in E} \left\| M_{jf}\,\alpha_f - \alpha_j + \lambda^t_{jf} \right\|^2}_{\text{messages between factors (sparse)}}$$
◮ with incrementally updated multipliers
$$\lambda^{t+1}_{jf} = \lambda^t_{jf} + \eta \left( M_{jf}\,\alpha^{t+1}_f - \alpha^{t+1}_j \right)$$
◮ Update α and λ alternately.
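A self-contained toy version of the alternation (dense, tiny sizes, no active sets; all names here are illustrative, not the paper's code): projected-gradient steps on the α-blocks of the augmented Lagrangian, followed by the multiplier update.

```python
# Toy sketch of the alternating updates: projected-gradient steps on
# alpha, then the dual (lambda) update. Purely illustrative.
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

rng = np.random.default_rng(0)
K1, K2, D = 3, 3, 5                    # two unigram nodes, one bigram factor
Phi = rng.normal(size=(K1 * K2, D))    # stand-in for phi(x_f, y_f)
delta = [rng.normal(size=K1), rng.normal(size=K2)]
# Marginalization maps M_jf: bigram distribution -> unigram marginals.
M = [np.kron(np.eye(K1), np.ones((1, K2))),   # sum over y2
     np.kron(np.ones((1, K1)), np.eye(K2))]   # sum over y1

a_f = np.full(K1 * K2, 1.0 / (K1 * K2))       # alpha_f in the simplex
a = [np.full(K1, 1.0 / K1), np.full(K2, 1.0 / K2)]
lam = [np.zeros(K1), np.zeros(K2)]
rho_pen, step, eta = 1.0, 0.1, 0.5

for t in range(200):
    # alpha_f step: gradient of 0.5*||Phi^T a_f||^2 plus penalty terms.
    g = Phi @ (Phi.T @ a_f)
    for j in range(2):
        g += rho_pen * M[j].T @ (M[j] @ a_f - a[j] + lam[j])
    a_f = project_simplex(a_f - step * g)
    # alpha_j steps: gradient of -delta_j^T a_j plus penalty term.
    for j in range(2):
        gj = -delta[j] - rho_pen * (M[j] @ a_f - a[j] + lam[j])
        a[j] = project_simplex(a[j] - step * gj)
    # Multiplier (message) update.
    for j in range(2):
        lam[j] += eta * (M[j] @ a_f - a[j])

print("consistency gap:", max(np.abs(M[j] @ a_f - a[j]).max() for j in range(2)))
```

In the paper's method each α-block update is instead served by the corresponding factorwise oracle over active sets, so no step enumerates a full |Y_f|.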

  11. Experiments: Sequence Labeling (ChineseOCR) ◮ ChineseOCR: N = 12,064, T = 14.4, D = 400, K = 3,039. ◮ |Y_b| = 3,039² = 9,235,521 (bigram language model). ◮ Decoding: Viterbi algorithm. Figure: Objective and test error vs. training time on ChineseOCR, comparing BCFW, SSG, Soft-BCFW (ρ = 1, 10), and GDMM-subFMO.
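Since the deck decodes with Viterbi, a minimal decoder for the unigram-plus-bigram model (illustrative; each position scans all K² transitions, i.e. the ~9.2M pairs quoted above, which is why exact inference inside every training iteration is the bottleneck):

```python
import numpy as np

def viterbi(unary, trans):
    """unary: (T, K) unigram scores; trans: (K, K) bigram scores.
    Returns the highest-scoring label sequence in O(T * K^2)."""
    T, K = unary.shape
    dp = unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = dp[:, None] + trans       # (prev K) x (cur K)
        back[t] = scores.argmax(axis=0)
        dp = scores.max(axis=0) + unary[t]
    path = [int(dp.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(5, 4)), rng.normal(size=(4, 4))))
```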

  12. Experiments: Multi-Label Classification (RCV1) ◮ RCV1-regions: N = 23,149, D = 47,236, K = 228. ◮ |F_b| = 228² = 51,984 (pairwise interaction). ◮ Decoding: linear program. Figure: Objective and test error vs. training time on RCV1-regions, comparing BCFW, SSG, Soft-BCFW (ρ = 1, 10), and GDMM-subFMO.
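A toy LP-relaxation decoder for the pairwise multi-label model, using the standard linearization z_{kl} ≈ y_k · y_l (the paper's exact LP formulation may differ; names and the rounding step are illustrative):

```python
# Maximize sum_k u[k]*y_k + sum_{k<l} p[k,l]*z_{kl} subject to the
# standard linearization: z <= y_k, z <= y_l, z >= y_k + y_l - 1, in [0, 1].
import numpy as np
from scipy.optimize import linprog

def lp_decode(u, p):
    K = len(u)
    pairs = [(k, l) for k in range(K) for l in range(k + 1, K)]
    n = K + len(pairs)                   # variables: y_1..y_K, then z per pair
    c = np.zeros(n)
    c[:K] = -u                           # linprog minimizes, so negate
    for m, (k, l) in enumerate(pairs):
        c[K + m] = -p[k, l]
    A, b = [], []
    for m, (k, l) in enumerate(pairs):
        for k_ in (k, l):                # z - y <= 0
            row = np.zeros(n); row[K + m] = 1; row[k_] = -1
            A.append(row); b.append(0.0)
        row = np.zeros(n)                # y_k + y_l - z <= 1
        row[k] = row[l] = 1; row[K + m] = -1
        A.append(row); b.append(1.0)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(0, 1)] * n, method="highs")
    return (res.x[:K] > 0.5).astype(int)  # round the fractional solution

rng = np.random.default_rng(0)
print(lp_decode(rng.normal(size=4), rng.normal(size=(4, 4))))
```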
