  1. Multi-label Learning: Trees, Embeddings, and much more! Purushottam Kar, SIGML (Special Interest Group in Machine Learning), Department of CSE, IIT Kanpur

  2. Classification Paradigms
• Binary: pick one of {Label 1, Label 2}
• Multi-class: pick one of {Label 1, Label 2, Label 3, Label 4, …, Label L}
• Multi-label: pick all applicable from {Label 1, Label 2, Label 3, Label 4, …, Label L}

  3. Classification Paradigms: Binary (pick one), Multi-class (pick one), Multi-label (pick all applicable)

  4. Examples

  5. eXtreme Multi-label Classification: Which items will this user buy? [Figure: a bipartite graph linking users to items]

  6. eXtreme Multi-label Classification: Who is present in this selfie?

  7. eXtreme Multi-label Classification [Figure: a document tagged with the categories "Dances by name, Indian culture, Performing arts in India, South India, Tamil culture"]

  8. Challenges and Opportunities in Multi-label Learning
• Exploit label correlations: the problem is not as large as it seems
• Missing labels in the training and test sets: what are appropriate training and evaluation procedures?
• Novelty and diversity in the predicted set of labels: useful in recommendation and tagging tasks

  9. Evaluation Techniques: An Invitation to Optimization Connoisseurs

  10. Classification Metrics [Figure: the truth label set y and the predicted label set ŷ over L = 13 labels]

  11. Hamming Loss • (|y| + |ŷ| - 2|y ∩ ŷ|)/L = |y Δ ŷ|/L = 3/13 ≈ 0.23 • Normalized size of the symmetric difference • What if |y| >> |ŷ|?
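
To make the arithmetic concrete, here is a minimal Python sketch of the Hamming loss on this running example. The specific label identities (truth y = {2, 5, 10, 11}, prediction ŷ = {2, 4, 5}) are an assumption chosen to match the set sizes implied by these slides (|y| = 4, |ŷ| = 3, |y ∩ ŷ| = 2, L = 13).

    # Hamming loss on the running example; the concrete label sets are
    # assumptions chosen to reproduce the slide's set sizes.
    L = 13
    y = {2, 5, 10, 11}        # "on" labels in the truth, |y| = 4
    y_hat = {2, 4, 5}         # "on" labels in the prediction, |y_hat| = 3

    # |y Δ ŷ| / L, i.e. (|y| + |ŷ| - 2|y ∩ ŷ|) / L
    hamming = len(y ^ y_hat) / L
    print(hamming)            # 3/13 ≈ 0.23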

  12. Precision • |y ∩ ŷ| / |ŷ| = 2/3 ≈ 0.66

  13. Recall • |y ∩ ŷ| / |y| = 2/4 = 0.5 • What if |y| >> |ŷ|?

  14. F-measure • Harmonic mean of precision and recall • 2|y ∩ ŷ| / (|y| + |ŷ|) = 4/7 ≈ 0.57 • What if |y| >> |ŷ|?

  15. Jaccard Index • |y ∩ ŷ| / |y ∪ ŷ| = 2/5 = 0.4 (the Jaccard distance is 1 minus this) • What if |y| >> |ŷ|?
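
The four set-based metrics on slides 12-15 all use the same machinery; a sketch on the same assumed truth/prediction sets as above:

    # Set-based metrics from slides 12-15 on the same assumed sets.
    y = {2, 5, 10, 11}                    # truth, |y| = 4
    y_hat = {2, 4, 5}                     # prediction, |y_hat| = 3

    inter = len(y & y_hat)                # |y ∩ ŷ| = 2
    precision = inter / len(y_hat)        # 2/3 ≈ 0.66
    recall = inter / len(y)               # 2/4 = 0.5
    f_measure = 2 * inter / (len(y) + len(y_hat))  # 4/7 ≈ 0.57
    jaccard = inter / len(y | y_hat)      # 2/5 = 0.4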

  16. Classification Metrics • Of these, only precision seems (mildly) appropriate for settings with • an eXtremely large number of labels • smaller prediction budgets • missing labels in the truth

  17. Ranking Metrics [Figure: a ranking of L = 13 labels; predicted order ŷ = (2, 4, 5, 13, 8, 6, 11, 3, 10, 1, 7, 9, 12), with the truth labels y = {2, 5, 10, 11} highlighted]

  18. Precision@k • Precision@1 = 100% • Precision@2 = 50% • Precision@3 ≈ 66% • Very appropriate for budget-constrained prediction settings
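
A sketch of Precision@k on the ranking from slide 17 (the predicted order and truth set are read off that figure):

    # Precision@k: fraction of the top-k ranked labels that are "on".
    ranked = [2, 4, 5, 13, 8, 6, 11, 3, 10, 1, 7, 9, 12]   # predicted order
    relevant = {2, 5, 10, 11}                              # truth labels

    def precision_at_k(ranked, relevant, k):
        return len(set(ranked[:k]) & relevant) / k

    print(precision_at_k(ranked, relevant, 1))   # 1.0
    print(precision_at_k(ranked, relevant, 2))   # 0.5
    print(precision_at_k(ranked, relevant, 3))   # 0.666...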

  19. Mean Average Precision • Precision@1 = 100% • Precision@2 = 50% • … • Precision@13 = 4/13 ≈ 30.8% • MAP (average of Precision@k over k = 1…13) ≈ 46.56% • Usefulness for large L??
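
The 46.56% on the slide corresponds to averaging Precision@k over every cutoff k = 1…L (note this differs from the more common definition of average precision, which averages only at the ranks of relevant labels):

    # The slide's MAP: mean of Precision@k over all cutoffs k = 1..L.
    ranked = [2, 4, 5, 13, 8, 6, 11, 3, 10, 1, 7, 9, 12]
    relevant = {2, 5, 10, 11}
    L = len(ranked)

    ap = sum(len(set(ranked[:k]) & relevant) / k
             for k in range(1, L + 1)) / L
    print(ap)                                    # ≈ 0.4656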

  20. Area under the ROC curve • Count mis-orderings • For label 2: none • For label 5: 1 • For label 11: 4 • For label 10: 5 • Total violations: 10 • AUC = 1 - 10/(4×9) ≈ 0.72
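
A sketch of the mis-ordering count behind this number: for each "on" label, count the "off" labels ranked above it, then normalize by the number of positive-negative pairs.

    # AUC via pairwise violations: 1 - violations / (n_pos * n_neg).
    ranked = [2, 4, 5, 13, 8, 6, 11, 3, 10, 1, 7, 9, 12]
    relevant = {2, 5, 10, 11}

    rank = {label: i for i, label in enumerate(ranked)}
    pos = [rank[l] for l in relevant]                     # 4 positives
    neg = [rank[l] for l in ranked if l not in relevant]  # 9 negatives

    violations = sum(1 for p in pos for n in neg if n < p)
    auc = 1 - violations / (len(pos) * len(neg))
    print(violations, auc)                       # 10, ≈ 0.72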

  21. Mean Reciprocal Rank • Penalize rankings that rank “on” labels low • Rank of label 2 = 1 • Rank of label 5 = 3 • Rank of label 11 = 7 • Rank of label 10 = 9 • MRR = (1/4) × (1/1 + 1/3 + 1/7 + 1/9) ≈ 0.39 ≈ 1/2.52
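
A sketch of the slide's reciprocal-rank computation (note it averages 1/rank over all "on" labels, whereas classical MRR uses only the first relevant item per query):

    # Reciprocal-rank score as defined on the slide.
    ranked = [2, 4, 5, 13, 8, 6, 11, 3, 10, 1, 7, 9, 12]
    relevant = {2, 5, 10, 11}

    rank = {label: i + 1 for i, label in enumerate(ranked)}  # 1-indexed
    mrr = sum(1.0 / rank[l] for l in relevant) / len(relevant)
    print(mrr)                                   # ≈ 0.397 ≈ 1/2.52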

  22. Solution Strategies, a.k.a. how to compress a decade's worth of literature into an hour-long talk

  23. Notation and Formulation
• Abstract problem: we have “documents”, each of which is to be assigned a subset of L labels
• Representation: documents are vectors in D dimensions; labels are vectors in L dimensions (the Boolean hypercube)
• Training set: (x_1, y_1), (x_2, y_2), (x_3, y_3), …, (x_n, y_n), with x_i ∈ R^D and y_i ∈ {0,1}^L
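
In code, the training set is simply a real feature matrix paired with a binary label matrix; a minimal sketch, where the sizes n, D, L and the label density are illustrative assumptions:

    # Representation from slide 23 with assumed, illustrative sizes.
    import numpy as np

    n, D, L = 1000, 50, 13
    X = np.random.randn(n, D)                        # x_i in R^D
    Y = (np.random.rand(n, L) < 0.2).astype(int)     # y_i in {0,1}^L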

  24. The Three Pillars of Multi-label Learning • 1-vs-All or Binary Relevance Methods • Embedding or Dimensionality Reduction Methods • Tree or Ensemble Methods

  25. 1-vs-All Methods • Predict scores for each label separately • Threshold or rank scores to make predictions [Figure: a Wiki page is fed to separate per-label tests: Dance test, Sport test, Tech test, Math test]

  26. 1-vs-All Methods
• Questions: are the L classifiers trained separately or jointly? if jointly, what “joins” the classifiers?
• Benefits: extremely flexible model; in-depth theoretical analysis possible
• Considerations: training time, test time, and model size all grow with the number of labels L

  27. 1-vs-All Methods
• Binary Relevance methods: treat each label as a separate classification problem (formulation on board); also includes the so-called plug-in methods and submodular methods
• Margin methods: ensure scores of “on” labels are much larger than those of “off” labels (formulation on board)
• Structured loss minimization methods (formulation sketched on board)
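
A minimal Binary Relevance sketch (not the specific formulation done on the board): train L independent binary classifiers, one per label, and threshold their scores. scikit-learn's LogisticRegression is used purely for illustration; any binary learner would do.

    # Binary Relevance: one independent classifier per label.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_binary_relevance(X, Y):
        # Each label column is a separate binary problem (assumes every
        # label is both present and absent somewhere in the training set).
        return [LogisticRegression(max_iter=1000).fit(X, Y[:, l])
                for l in range(Y.shape[1])]

    def predict_binary_relevance(clfs, X, threshold=0.5):
        # Score every label, then threshold (or rank for Precision@k).
        scores = np.column_stack([c.predict_proba(X)[:, 1] for c in clfs])
        return (scores >= threshold).astype(int)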

  28. Embedding Methods
• Since L >>> 1 and the label space has redundancies, reduce L: dimensionality reduction!!
• Nice theory and results, but expensive in prediction and training
• Questions: how to embed labels (linear/non-linear)? how to predict in the embedding space? how to “pull back” to the label space? single or multiple embeddings?
• Methods: CS, BCS, PLST, CPLST, LEML, SLEEC

  29. Embedding Methods
• How to embed labels: random projections (CS), CCA, PCA, low local-distortion projections, learnt projections
• How to pull back: sparse recovery, nearest neighbor, learnt projections
• Considerations: training time, test time, model size
[Figure: a test point x is mapped to an embedding z ∈ R^l and pulled back to labels y ∈ R^L]
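
A sketch of one linear instantiation, loosely in the spirit of PLST but not a faithful reimplementation of any method named above: project labels onto their top principal directions, learn a linear regressor into that space, and pull back by rounding the decompressed scores.

    # Linear label embedding: compress {0,1}^L to R^l, regress, pull back.
    import numpy as np

    def fit_label_embedding(X, Y, l):
        mean = Y.mean(axis=0)
        _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
        V = Vt[:l].T                                  # L x l projection
        Z = (Y - mean) @ V                            # embedded labels z
        W = np.linalg.lstsq(X, Z, rcond=None)[0]      # regressor X -> z
        return W, V, mean

    def predict_label_embedding(X, W, V, mean, threshold=0.5):
        scores = X @ W @ V.T + mean                   # pull back to R^L
        return (scores >= threshold).astype(int)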

  30. Tree Methods [Figure: a tree partitioning “All of Wiki” into Arts (Music, Dance) and Tech (IT/SW, EE/HW)]

  31. Tree Methods
• Partition the space of documents into several bins
• To ease life, perform hierarchical partitioning as a tree
• At each leaf, perform some classification task to predict
• To increase efficiency, use several trees (a forest)
• Questions: partitioning criterion (clustering, ranking, classification); leaf action (constant labeling, use of another multi-labeler); ensemble size and aggregation method (single, multiple)
• Methods: LPSR, MLRF, FastXML
• Consideration: good accuracy and fast prediction, but huge models
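
A toy sketch of one such tree (loosely in the spirit of clustering-based partitioning as in LPSR, not a reimplementation of any method above): split documents by k-means at each node, and store the most frequent labels of each leaf as its constant prediction. A forest would build several of these and aggregate their leaf predictions.

    # Label-partitioning tree: k-means splits, constant labeling at leaves.
    import numpy as np
    from sklearn.cluster import KMeans

    def build_tree(X, Y, depth=0, max_depth=3, top_k=5):
        if depth == max_depth or len(X) < 10:
            # Leaf action: the top_k most frequent labels in this bin.
            return {"leaf": np.argsort(-Y.sum(axis=0))[:top_k]}
        km = KMeans(n_clusters=2, n_init=10).fit(X)
        return {"split": km,
                "children": [build_tree(X[km.labels_ == c], Y[km.labels_ == c],
                                        depth + 1, max_depth, top_k)
                             for c in (0, 1)]}

    def predict_tree(node, x):
        # Route the test point to a leaf, then emit its constant labels.
        while "leaf" not in node:
            c = node["split"].predict(x.reshape(1, -1))[0]
            node = node["children"][int(c)]
        return node["leaf"]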

  32. The Three Pillars of Multi-label Learning

Name      | “Accuracy”          | Scalability | Prediction Cost | Model Size          | Well Understood?
1-vs-All  | Now we are talking! | Yikes!      | Meh!            | Are you kidding me! | Excellent (did I not make myself clear?)
Embedding | Good                | Good/Best   | Good            | Good/Best           | Good
Tree      | Good/Best           | Good/Best   | Best            | Large               | Meh!
