On the benefits of output sparsity for multi-label classification


  1. On the benefits of output sparsity for multi-label classification
     Evgenii Chzhen (http://echzhen.com), Université Paris-Est, Télécom Paristech
     Joint work with: Christoph Denis, Mohamed Hebiri, Joseph Salmon 1 / 13

  2. Outline: Introduction, Framework and notation, Motivation, Our approach, Add weights, Numerical results, Conclusion 2 / 13

  3. Outline: Introduction, Framework and notation, Motivation, Our approach, Add weights, Numerical results, Conclusion 3 / 13

  4. Framework and notation
     We have $N$ observations, and each observation belongs to a set of labels.
     - Observations: $X_i \in \mathbb{R}^D$,
     - Label vectors are binary vectors: $Y_i = (Y_i^1, \dots, Y_i^L)^\top \in \{0, 1\}^L$,
     - $N$, $L$ and $D$ are potentially huge,
     - $Y_i$ contains at most $K$ ones (active labels), with $K \ll L$. 4 / 13
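To make this setting concrete, here is a minimal NumPy sketch (not from the talk) that builds a toy dataset of this shape; the sizes N, D, L, K below are illustrative placeholders, not values used by the authors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (not the values used in the talk).
N, D, L, K = 1_000, 50, 200, 5

# Features: each observation X_i lives in R^D.
X = rng.normal(size=(N, D))

# Labels: each Y_i is a binary vector of length L with at most K active labels.
Y = np.zeros((N, L), dtype=int)
for i in range(N):
    k_i = rng.integers(1, K + 1)                     # number of active labels, at most K
    active = rng.choice(L, size=k_i, replace=False)  # which labels are active
    Y[i, active] = 1

assert Y.sum(axis=1).max() <= K  # K-sparse label vectors, with K << L
```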

  5. Outline: Introduction, Framework and notation, Motivation, Our approach, Add weights, Numerical results, Conclusion 5 / 13

  6. Motivation: 0-type error vs 1-type error
     - 0-type error: $\hat{Y}^l = 1$ when $Y^l = 0$,
     - 1-type error: $\hat{Y}^l = 0$ when $Y^l = 1$. 6 / 13

  7. Motivation: 0-type error vs 1-type error
     - 0-type error: $\hat{Y}^l = 1$ when $Y^l = 0$; 1-type error: $\hat{Y}^l = 0$ when $Y^l = 1$.
     Example:
     $Y = (\underbrace{1, \dots, 1}_{10}, \underbrace{0, \dots, 0}_{90})^\top$,
     $\hat{Y}_0 = (\underbrace{1, \dots, 1}_{10}, \underbrace{1, \dots, 1}_{5}, \underbrace{0, \dots, 0}_{85})^\top$,
     $\hat{Y}_1 = (\underbrace{1, \dots, 1}_{5}, \underbrace{0, \dots, 0}_{5}, \underbrace{0, \dots, 0}_{90})^\top$.
     - Same number of mistakes, but of different types,
     - Which one is better for a user? 6 / 13

  8. Motivation: 0-type error vs 1-type error
     - 0-type error: $\hat{Y}^l = 1$ when $Y^l = 0$; 1-type error: $\hat{Y}^l = 0$ when $Y^l = 1$.
     Hamming loss:
     $L_H(Y, \hat{Y}) = \sum_{l=1}^{L} \mathbf{1}\{Y^l \neq \hat{Y}^l\} = \sum_{l : Y^l = 0} \mathbf{1}\{\hat{Y}^l = 1\} + \sum_{l : Y^l = 1} \mathbf{1}\{\hat{Y}^l = 0\}$
     - For the Hamming loss, $\hat{Y}_0$ and $\hat{Y}_1$ are the same,
     - The Hamming loss does not know anything about the sparsity level $K$,
     - But the Hamming loss is separable, hence easy to optimize. 6 / 13
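As a quick sanity check of the example on slide 7, the following sketch (not from the talk) counts 0-type errors, 1-type errors, and the Hamming loss for the two predictions $\hat{Y}_0$ and $\hat{Y}_1$.

```python
import numpy as np

# True label vector from the example: 10 active labels out of L = 100.
Y      = np.array([1] * 10 + [0] * 90)
Y_hat0 = np.array([1] * 15 + [0] * 85)            # 5 spurious ones -> 0-type errors
Y_hat1 = np.array([1] * 5 + [0] * 5 + [0] * 90)   # 5 missed ones   -> 1-type errors

def error_counts(y, y_hat):
    """Return (#0-type errors, #1-type errors, Hamming loss)."""
    e0 = int(np.sum((y == 0) & (y_hat == 1)))   # predicted 1 where the truth is 0
    e1 = int(np.sum((y == 1) & (y_hat == 0)))   # predicted 0 where the truth is 1
    return e0, e1, e0 + e1

print(error_counts(Y, Y_hat0))  # (5, 0, 5)
print(error_counts(Y, Y_hat1))  # (0, 5, 5) -> same Hamming loss, different error types
```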

  9. Outline: Introduction, Framework and notation, Motivation, Our approach, Add weights, Numerical results, Conclusion 7 / 13

  10. Our approach: add weights
      Weighted Hamming loss:
      $L(Y, \hat{Y}) = p_0 \sum_{l : Y^l = 0} \mathbf{1}\{\hat{Y}^l = 1\} + p_1 \sum_{l : Y^l = 1} \mathbf{1}\{\hat{Y}^l = 0\}$,
      such that $p_0 + p_1 = 1$. 8 / 13

  11. Our approach: add weights
      Weighted Hamming loss:
      $L(Y, \hat{Y}) = p_0 \sum_{l : Y^l = 0} \mathbf{1}\{\hat{Y}^l = 1\} + p_1 \sum_{l : Y^l = 1} \mathbf{1}\{\hat{Y}^l = 0\}$,
      such that $p_0 + p_1 = 1$.
      Examples:
      - Hamming loss: $p_0 = p_1 = 0.5$,
      - [Jain et al., 2016]: $p_0 = 0$ and $p_1 = 1$,
      - Our choice: $p_0 = \frac{2K}{L}$ and $p_1 = 1 - p_0$. 8 / 13
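A minimal sketch (not from the talk) of this weighted Hamming loss, evaluated on the example of slide 7 (L = 100, K = 10) under the three weight choices listed above.

```python
import numpy as np

def weighted_hamming(y, y_hat, p0, p1):
    """Weighted Hamming loss: p0 * (#0-type errors) + p1 * (#1-type errors)."""
    e0 = np.sum((y == 0) & (y_hat == 1))   # 0-type: predicted active, truly inactive
    e1 = np.sum((y == 1) & (y_hat == 0))   # 1-type: predicted inactive, truly active
    return p0 * e0 + p1 * e1

L, K = 100, 10                                   # example of slide 7
Y      = np.array([1] * K + [0] * (L - K))
Y_hat0 = np.array([1] * 15 + [0] * 85)           # 5 mistakes of 0-type
Y_hat1 = np.array([1] * 5 + [0] * 95)            # 5 mistakes of 1-type

weight_choices = {
    "Hamming":            (0.5, 0.5),
    "Jain et al. (2016)": (0.0, 1.0),
    "Ours":               (2 * K / L, 1 - 2 * K / L),
}
for name, (p0, p1) in weight_choices.items():
    print(name, weighted_hamming(Y, Y_hat0, p0, p1), weighted_hamming(Y, Y_hat1, p0, p1))
# Hamming scores both predictions 2.5; Jain et al. scores them 0.0 and 5.0;
# our weights (p0 = 0.2, p1 = 0.8) score them 1.0 and 4.0, so missing an
# active label costs more than adding a spurious one.
```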

  12. Why our choice of weights?
      Consider the following situation:
      - $Y = (\underbrace{1, \dots, 1}_{K}, \underbrace{0, \dots, 0}_{L - K})^\top$,
      - $\hat{Y}_0 = (0, \dots, 0)^\top$: predicts all labels inactive,
      - $\hat{Y}_1 = (1, \dots, 1)^\top$: predicts all labels active,
      - $\hat{Y}_{2K} = (\underbrace{1, \dots, 1}_{2K}, \underbrace{0, \dots, 0}_{L - 2K})^\top$: makes $K$ mistakes of 0-type,
      - Do not forget that $K \ll L$. 9 / 13

  13. Why our choice of weights?
      Consider the following situation:
      - $Y = (\underbrace{1, \dots, 1}_{K}, \underbrace{0, \dots, 0}_{L - K})^\top$,
      - $\hat{Y}_0 = (0, \dots, 0)^\top$: predicts all labels inactive,
      - $\hat{Y}_1 = (1, \dots, 1)^\top$: predicts all labels active,
      - $\hat{Y}_{2K} = (\underbrace{1, \dots, 1}_{2K}, \underbrace{0, \dots, 0}_{L - 2K})^\top$: makes $K$ mistakes of 0-type,
      - Do not forget that $K \ll L$.
      Classical Hamming loss:
      - $\hat{Y}_1$ is almost the worst,
      - $\hat{Y}_0$ is the same as $\hat{Y}_{2K}$. 9 / 13

  14. Why our choice of weights?
      Consider the following situation:
      - $Y = (\underbrace{1, \dots, 1}_{K}, \underbrace{0, \dots, 0}_{L - K})^\top$,
      - $\hat{Y}_0 = (0, \dots, 0)^\top$: predicts all labels inactive,
      - $\hat{Y}_1 = (1, \dots, 1)^\top$: predicts all labels active,
      - $\hat{Y}_{2K} = (\underbrace{1, \dots, 1}_{2K}, \underbrace{0, \dots, 0}_{L - 2K})^\top$: makes $K$ mistakes of 0-type,
      - Do not forget that $K \ll L$.
      [Jain et al., 2016]:
      - $\hat{Y}_0$ is the worst,
      - $\hat{Y}_1$ is the same as $\hat{Y}_{2K}$. 9 / 13

  15. Why our choice of weights?
      Consider the following situation:
      - $Y = (\underbrace{1, \dots, 1}_{K}, \underbrace{0, \dots, 0}_{L - K})^\top$,
      - $\hat{Y}_0 = (0, \dots, 0)^\top$: predicts all labels inactive,
      - $\hat{Y}_1 = (1, \dots, 1)^\top$: predicts all labels active,
      - $\hat{Y}_{2K} = (\underbrace{1, \dots, 1}_{2K}, \underbrace{0, \dots, 0}_{L - 2K})^\top$: makes $K$ mistakes of 0-type,
      - Do not forget that $K \ll L$.
      Our choice:
      - $\hat{Y}_0$ and $\hat{Y}_1$ are almost the worst,
      - $\hat{Y}_{2K}$ is almost the best. 9 / 13
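The following sketch (illustrative K and L, not from the talk) checks these three rankings numerically, reusing the weighted Hamming loss from the sketch above.

```python
import numpy as np

def weighted_hamming(y, y_hat, p0, p1):
    e0 = np.sum((y == 0) & (y_hat == 1))   # 0-type errors
    e1 = np.sum((y == 1) & (y_hat == 0))   # 1-type errors
    return p0 * e0 + p1 * e1

L, K = 1_000, 10                                         # illustrative sizes, K << L
Y       = np.array([1] * K + [0] * (L - K))
Y_hat0  = np.zeros(L, dtype=int)                         # all labels inactive
Y_hat1  = np.ones(L, dtype=int)                          # all labels active
Y_hat2K = np.array([1] * (2 * K) + [0] * (L - 2 * K))    # K mistakes of 0-type

for name, (p0, p1) in {
    "Hamming":            (0.5, 0.5),
    "Jain et al. (2016)": (0.0, 1.0),
    "Ours":               (2 * K / L, 1 - 2 * K / L),
}.items():
    losses = [weighted_hamming(Y, yh, p0, p1) for yh in (Y_hat0, Y_hat1, Y_hat2K)]
    print(f"{name:20s} Y_hat0={losses[0]:7.2f}  Y_hat1={losses[1]:7.2f}  Y_hat2K={losses[2]:7.2f}")

# Hamming:            Y_hat0 = 5.0,  Y_hat1 = 495.0, Y_hat2K = 5.0  (Y_hat0 ties with Y_hat2K)
# Jain et al. (2016): Y_hat0 = 10.0, Y_hat1 = 0.0,   Y_hat2K = 0.0  (Y_hat1 ties with Y_hat2K)
# Ours:               Y_hat0 = 9.8,  Y_hat1 = 19.8,  Y_hat2K = 0.2  (only Y_hat2K is nearly optimal)
```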

  16. Outline: Introduction, Framework and notation, Motivation, Our approach, Add weights, Numerical results, Conclusion 10 / 13

  17. Numerical results
      Synthetic dataset with controlled sparsity: $N = 2D = 2L = 200$.

      Settings   Median output sparsity   Recall (micro)    Precision (micro)
                 Our      Std             Our      Std      Our      Std
      K = 2      2.47     0.04            1.0      0.02     0.80     1.0
      K = 6      6.83     0.43            1.0      0.07     0.88     1.0
      K = 10     9.85     1.81            0.90     0.18     0.91     1.0
      K = 14     10.90    4.11            0.72     0.29     0.93     0.99
      K = 18     10.98    6.61            0.58     0.36     0.95     0.99

      - When $K \ll L$ we output MORE active labels,
      - Hence, better recall and worse precision,
      - When $K > 10$, the assumptions of our setting are violated. 11 / 13
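For reference, the reported metrics can be computed as in the sketch below (this is not the evaluation code from the talk); Y_true and Y_pred stand for hypothetical (N, L) binary label matrices.

```python
import numpy as np

def micro_recall_precision(Y_true, Y_pred):
    """Micro-averaged recall and precision over (N, L) binary label matrices."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1))   # correctly predicted active labels
    fn = np.sum((Y_true == 1) & (Y_pred == 0))   # missed active labels   (1-type errors)
    fp = np.sum((Y_true == 0) & (Y_pred == 1))   # spurious active labels (0-type errors)
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    return recall, precision

def median_output_sparsity(Y_pred):
    """Median number of predicted active labels per observation."""
    return float(np.median(Y_pred.sum(axis=1)))
```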

  18. Conclusion
      - For sparse datasets, 0-type and 1-type errors are not equally costly for the user;
      - Use our framework if you agree with the previous point;
      - We do not introduce a new algorithm per se; we construct a new loss;
      - We provide a theoretical justification for our framework (generalization bounds and an analysis of convex surrogates). 12 / 13

  19. Thank you for your attention! 13 / 13
