
  1. Multiclass Multilabel Classification with More Classes than Examples. Ohad Shamir, Weizmann Institute of Science. Joint work with Ofer Dekel, MSR. NIPS 2015 Extreme Classification Workshop

  2. Extreme Multiclass Multilabel Problems Label set is a folksonomy (a.k.a. collaborative tagging or social tagging)

  3. Categories 1452 births / 1519 deaths / 15th century in science / ambassadors of the Republic of Florence / Ballistic experts / Fabulists / giftedness / mathematics and culture / Italian inventors / Members of the Guild of Saint Luke / Tuscan painters / people persecuted under anti-homosexuality laws...

  4. Problem Definition • Multiclass multilabel classification • m training examples, k categories • m, k → ∞ together – possibly even k > m • Goal: categorize unseen instances

  5. Extreme Multiclass • Supervised learning starts with binary classification (k = 2) and extends to multiclass learning – Theory: VC dimension → Natarajan dimension – Algorithms: binary → multiclass • Usually, assume k = O(1) • Some exceptions – Hierarchy with prior knowledge on label relationships – not always available – Additional assumptions (e.g. talk by Marius earlier)

  6. Application • Classify the web based on Wikipedia categories • Training set: all Wikipedia pages (m = 4.2 × 10^6) • Labels: all Wikipedia categories (k = 1.1 × 10^6)

  7. Challenges • Statistical problem: can't get a large (or even moderate) sample from each class • Computational problem: many classification algorithms will choke on millions of labels

  8. Propagating Labels on the Click-Graph • A bipartite graph (queries vs. web pages) derived from search engine logs: clicks encoded as weighted edges • Wikipedia pages are labeled web pages • Labels propagate along edges to other pages
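The propagation step can be sketched as a two-pass walk over the bipartite graph. This is a minimal illustration, not the talk's actual system: the function name, the edge-list format, and the query-then-page two-pass scheme are my own assumptions.

```python
from collections import defaultdict

def propagate_labels(edges, seed_labels):
    """One round of label propagation over a bipartite click-graph.

    edges: list of (query, page, weight) triples from search logs.
    seed_labels: dict mapping labeled pages to sets of category labels.
    Returns a dict mapping every clicked page to accumulated label weights.
    """
    # Pass 1: labels flow from labeled pages up to queries, weighted by clicks.
    query_scores = defaultdict(lambda: defaultdict(float))
    for query, page, w in edges:
        for label in seed_labels.get(page, ()):
            query_scores[query][label] += w

    # Pass 2: query label mass flows back down to all clicked pages.
    page_scores = defaultdict(lambda: defaultdict(float))
    for query, page, w in edges:
        for label, score in query_scores[query].items():
            page_scores[page][label] += w * score
    return page_scores
```

With a query "leonardo" clicked through to both the Wikipedia page and another site, the second site inherits the Wikipedia page's categories, weighted by the click counts on both edges.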

  9. Example • http://en.wikipedia.org/wiki/Leonardo_da_Vinci passes multiple labels to http://www.greatItalians.com • Among them – "Renaissance artists" – good – "1452 births" – bad • Observation: "1452 births" induces many false positives (FP): best to remove it altogether from classifier output – (FP ⇒ TN, TP ⇒ FN)

  10. Simple Label Pruning Approach 1. Split the dataset into a training set and a validation set 2. Use the training set to build an initial classifier h_pre (e.g. by propagating labels over the click-graph) 3. Apply h_pre to the validation set; count FP_j and TP_j for each label 4. For every j ∈ {1, …, k}, remove label j if FP_j > ((1 − γ)/γ) · TP_j • This defines a new "pruned" classifier h_post
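The pruning step itself is tiny. A sketch of the rule above, writing h_pre for the initial classifier, h_post for the pruned one, and γ for the false-positive weight of the loss (the dict-based counts and function names are my own framing):

```python
def prune_labels(fp, tp, gamma=0.5):
    """Return the set of labels to remove from the initial classifier's output.

    fp[j], tp[j]: false-positive / true-positive counts of label j on the
    validation set; gamma: the FP weight of the gamma-weighted loss.
    Label j is pruned when gamma*FP_j > (1-gamma)*TP_j, i.e. exactly when
    dropping it lowers the empirical gamma-weighted risk.
    """
    return {j for j in fp if gamma * fp[j] > (1 - gamma) * tp[j]}

def h_post(x, h_pre, pruned):
    """Pruned classifier: h_pre's predicted label set minus the pruned labels."""
    return h_pre(x) - pruned
```

At γ = 1/2 the condition reduces to FP_j > TP_j: a label like "1452 births" with many false positives and few true positives gets removed, while "Renaissance artists" survives.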

  11. Simple Label Pruning Approach Explicitly minimizes the empirical risk with respect to the γ-weighted loss: ℓ(h(x), y) = Σ_{j=1}^{k} [ γ · 𝕀{h_j(x) = 1, y_j = 0} + (1 − γ) · 𝕀{h_j(x) = 0, y_j = 1} ] – the first term counts false positives (FP), the second false negatives (FN)
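The γ-weighted loss for one example can be computed directly. A minimal sketch, representing the predicted and true label sets as Python sets of label indices (this representation and the function name are my own choices):

```python
def gamma_loss(pred, truth, gamma=0.5):
    """Gamma-weighted multilabel loss for a single example.

    pred, truth: sets of label indices predicted / actually present.
    Charges gamma per false positive and (1 - gamma) per false negative,
    matching the per-label sum of indicators in the slide's formula.
    """
    fp = len(pred - truth)   # predicted but not a true label
    fn = len(truth - pred)   # true label but not predicted
    return gamma * fp + (1 - gamma) * fn
```

Summing this over a validation set gives exactly the empirical risk that the pruning rule on the previous slide minimizes label-by-label.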

  12. Main Question Would this actually reduce the risk? Is 𝔼_{x,y}[ℓ(h_post(x), y)] < 𝔼_{x,y}[ℓ(h_pre(x), y)], i.e. is R(h_pre) − R(h_post) positive?

  13. Baseline Approach • Prove that, uniformly over all labels j, the empirical counts converge: FP̂_j → FP_j and TP̂_j → TP_j, where TP_j = Pr(label j present and predicted) and FP_j = Pr(label j predicted but not present) • Problem: m, k → ∞ together, so many classes only have a handful of examples

  14. Uniform Convergence Approach • The algorithm implicitly chooses a hypothesis from a certain hypothesis class – pruning rules on top of the fixed predictor h_pre • Prove uniform convergence by bounding the VC dimension / Rademacher complexity • Conclude that if the empirical risk decreases, the risk decreases as well

  15. Uniform Convergence Fails • Unfortunately, no uniform convergence... • ... and not even algorithm/data-dependent convergence: 𝔼[R(h_post)] − 𝔼[R̂(h_post)] ≥ Σ_{j=1}^{k} Pr(j pruned) · (TP_j − FP_j) = Σ_{j=1}^{k} Pr(FP̂_j > TP̂_j) · (TP_j − FP_j) • Weak correlation between the two factors in the m ≈ k regime

  16. A Less Obvious Approach • Prove directly that the risk decreases • Important (but mild) assumption: each example is labeled by at most c labels • Step 1: the risk of h_post is concentrated: for all ε > 0, Pr(|R(h_post) − 𝔼[R(h_post)]| ≥ ε) is small

  17. A Less Obvious Approach • Step 2: enough to prove R(h_pre) − 𝔼[R(h_post)] > 0 • Assuming γ = 1/2 for simplicity, it can be shown that R(h_pre) − 𝔼[R(h_post)] > pos − O( ‖(FP_j + TP_j)_j‖_{1/2} / m ), where ‖x‖_{1/2} = (Σ_j √x_j)²

  18. A Less Obvious Approach • Step 2 (cont.): enough to prove R(h_pre) − 𝔼[R(h_post)] > 0 • Assuming γ = 1/2 for simplicity: R(h_pre) − 𝔼[R(h_post)] > pos − O( ‖(FP_j + TP_j)_j‖_{1/2} / m ), where pos = Σ_{j: FP_j ≥ TP_j} (FP_j − TP_j) and ‖x‖_{1/2} = (Σ_j √x_j)² • For a probability vector, ‖x‖_{1/2} is always at most k, and is smaller the more non-uniform the distribution

  19. Wikipedia Power-Law: r = 1.6

  20. Wikipedia Power-Law: r = 1.6 Plugging the power law into the bound: R(h_pre) − 𝔼[R(h_post)] > pos − O( k^0.4 / m )
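The k^0.4 rate comes from plugging power-law label frequencies into the ‖·‖_{1/2} term. A quick numeric check of that behavior, under my own assumption that label frequencies are exactly p_j ∝ j^(−1.6) (the talk only reports the fitted exponent):

```python
def half_norm(x):
    """The 1/2-quasinorm ||x||_{1/2} = (sum_j sqrt(x_j)) ** 2."""
    return sum(v ** 0.5 for v in x) ** 2

def powerlaw_dist(k, r=1.6):
    """Normalized power-law label frequencies p_j proportional to j^(-r)."""
    w = [j ** -r for j in range(1, k + 1)]
    z = sum(w)
    return [v / z for v in w]

# For the uniform distribution over k labels, ||p||_{1/2} equals exactly k
# (the worst case); for a power law with r = 1.6 it grows far more slowly,
# which is what turns the O(||.||_{1/2} / m) term into O(k^0.4 / m).
```

Going from k = 1,000 to k = 16,000 labels multiplies ‖p‖_{1/2} by only a small factor (roughly k^0.4 scaling, up to finite-size effects), versus a factor of 16 for the uniform distribution.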

  21. Experiment Click graph on the entire web (based on search engine logs)

  22. Experiment Categories from Wikipedia pages propagated twice through graph

  23. Experiment Train/test split of Wikipedia pages How good are the categories propagated from the training set at predicting the categories of test-set pages?

  24. Experiment

  25. Another Less Obvious Approach R(h_pre) − 𝔼[R(h_post)] = Σ_{j=1}^{k} Pr(j pruned) · (FP_j − TP_j) = Σ_{j=1}^{k} Pr(FP̂_j > TP̂_j) · (FP_j − TP_j) • Weak but positive correlation between the two factors, even if there are only a few examples per label • For large k, the sum will tend to be positive
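The "weak but positive correlation" claim is easy to probe empirically. A Monte-Carlo sketch of my own (not from the talk), with γ = 1/2: each label has population rates (FP_j, TP_j), only a handful of validation examples, and is pruned when its noisy empirical counts satisfy fp̂ > tp̂; the realized risk change of pruning label j is (FP_j − TP_j)/2.

```python
import random

def expected_risk_reduction(rates, n=5, trials=2000, seed=0):
    """Monte-Carlo estimate of R(h_pre) - E[R(h_post)] at gamma = 1/2.

    rates: list of (FP_j, TP_j) population rates, one pair per label.
    n: validation examples per label (deliberately tiny).
    Each trial draws noisy counts, applies the pruning rule fp_hat > tp_hat,
    and accumulates the realized risk change (FP_j - TP_j)/2 of each pruned
    label; returns the average total change over trials.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        for fp, tp in rates:
            fp_hat = sum(rng.random() < fp for _ in range(n))
            tp_hat = sum(rng.random() < tp for _ in range(n))
            if fp_hat > tp_hat:           # noisy pruning decision
                total += (fp - tp) / 2    # realized risk change
    return total / trials
```

With only n = 5 examples per label, each individual pruning decision is unreliable, yet over many labels the bad-label prunes (FP > TP) outnumber the mistaken good-label prunes, so the summed risk reduction comes out positive, which is exactly the slide's point.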

  26. Different Application: Crowdsourcing (Dekel and S., 2009)


  30. Different Application: Crowdsourcing • How can we improve crowdsourced data? • Standard approach: repeated labeling, but it is expensive • A bootstrap approach: – Learn a predictor from the data of all workers – Throw away examples labeled by workers who disagree a lot with the predictor – Re-train on the remaining examples • Works! (under certain assumptions) • Challenge: workers often label only a handful of examples
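The bootstrap cleaning round in the bullets above can be sketched in a few lines. This is an illustration under my own assumptions (per-worker example lists, a fixed disagreement threshold, and these function names), not the procedure of Dekel and Shamir (2009) verbatim:

```python
def filter_workers(labeled, predictor, max_disagreement=0.5):
    """One bootstrap cleaning round for crowdsourced labels.

    labeled: dict mapping worker -> list of (x, y) labeled examples.
    predictor: model already trained on ALL workers' data.
    Drops every worker whose labels disagree with the predictor on more
    than max_disagreement of their examples; returns the surviving
    examples, ready for re-training.
    """
    kept = []
    for worker, examples in labeled.items():
        wrong = sum(predictor(x) != y for x, y in examples)
        if wrong / len(examples) <= max_disagreement:
            kept.extend(examples)
    return kept
```

The per-worker disagreement rate is estimated from very few examples per worker, which is precisely the small-sample-per-class difficulty the talk connects back to the extreme-classification analysis.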

  31. Different Application: Crowdsourcing # examples/worker might be small, but many workers...


  34. Conclusions • # classes → ∞ violates the assumptions of most multiclass analyses – often based on generalizations of binary classification • Possible approach – Avoid standard analysis – "Extreme X" can be a blessing rather than a curse • Other applications? More complex learning algorithms (e.g. substitution)?

  35. Thanks!
