Multiclass Multilabel Classification with More Classes than Examples Ohad Shamir Weizmann Institute of Science Joint work with Ofer Dekel, MSR NIPS 2015 Extreme Classification Workshop
Extreme Multiclass Multilabel Problems Label set is a folksonomy (a.k.a. collaborative tagging or social tagging)
Categories: 1452 births / 1519 deaths / 15th century in science / ambassadors of the republic of Florence / Ballistic experts / Fabulists / giftedness / mathematics and culture / Italian inventors / Members of the Guild of Saint Luke / Tuscan painters / people persecuted under anti-homosexuality laws...
Problem Definition
• Multiclass multilabel classification
• m training examples, k categories
• m, k → ∞ together
– Possibly even k > m
• Goal: Categorize unseen instances
Extreme Multiclass
• Supervised learning starts with binary classification (k = 2) and extends to multiclass learning
– Theory: VC dimension → Natarajan dimension
– Algorithms: binary → multiclass
• Usually, assume k = O(1)
• Some exceptions
– Hierarchy with prior knowledge on relationships – not always available
– Additional assumptions (e.g. talk by Marius earlier)
Application
• Classify the web based on Wikipedia categories
• Training set: All Wikipedia pages (m = 4.2 × 10^6)
• Labels: All Wikipedia categories (k = 1.1 × 10^6)
Challenges
• Statistical problem: Can't get a large (or even moderate) sample from each class
• Computational problem: Many classification algorithms will choke on millions of labels
Propagating Labels on the Click-Graph
• A bipartite graph of queries and web pages derived from search engine logs: clicks encoded as weighted edges
• Wikipedia pages are labeled web pages
• Labels propagate along edges to other pages
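The propagation step can be sketched as follows. This is a minimal illustration only, not the authors' system: the function name `propagate_labels`, the edge-list format, and the additive weighted scoring are all assumptions.

```python
from collections import defaultdict

def propagate_labels(edges, page_labels, rounds=2):
    """Propagate label scores across a bipartite query/page click graph.

    edges: list of (query, page, weight) triples from click logs.
    page_labels: dict mapping a labeled page -> set of labels
    (e.g. Wikipedia categories). Returns page -> {label: score}.
    """
    # Adjacency in both directions.
    q_to_p = defaultdict(list)
    p_to_q = defaultdict(list)
    for q, p, w in edges:
        q_to_p[q].append((p, w))
        p_to_q[p].append((q, w))

    # Initialize scores on the labeled (Wikipedia) pages.
    page_scores = {p: {lab: 1.0 for lab in labs} for p, labs in page_labels.items()}

    for _ in range(rounds):
        # Pages -> queries: a query accumulates labels of pages clicked for it.
        query_scores = defaultdict(lambda: defaultdict(float))
        for p, scores in page_scores.items():
            for q, w in p_to_q[p]:
                for lab, s in scores.items():
                    query_scores[q][lab] += w * s
        # Queries -> pages: labels flow on to other clicked pages.
        new_scores = defaultdict(lambda: defaultdict(float))
        for q, scores in query_scores.items():
            for p, w in q_to_p[q]:
                for lab, s in scores.items():
                    new_scores[p][lab] += w * s
        page_scores = {p: dict(s) for p, s in new_scores.items()}
    return page_scores
```

Thresholding the returned scores then yields the multilabel classifier referred to as the initial classifier in the following slides.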
Example
• http://en.wikipedia.org/wiki/Leonardo da Vinci passes multiple labels to http://www.greatItalians.com
• Among them:
– "Renaissance artists" – good
– "1452 births" – bad
• Observation: "1452 births" induces many false positives (FP): best to remove it altogether from the classifier's output (FP → TN, TP → FN)
Simple Label Pruning Approach
1. Split the dataset into a training set and a validation set
2. Use the training set to build an initial classifier h_old (e.g. by propagating labels over the click-graph)
3. Apply h_old to the validation set; count FP_j and TP_j for each label j
4. For all j ∈ {1, ..., k}, remove label j if FP_j > ((1 − δ)/δ) · TP_j
• This defines a new "pruned" classifier h_new
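Steps 3–4 can be sketched as below. The function name, the set-based label representation, and the keep-set return value are illustrative assumptions, not the talk's actual implementation.

```python
def prune_labels(predictions, truths, k, delta=0.5):
    """Return the set of labels to keep after validation-set pruning.

    predictions / truths: lists of label-sets, one per validation example
    (predictions come from the initial classifier h_old).
    A label j is removed when delta * FP_j > (1 - delta) * TP_j, i.e.
    exactly when pruning it lowers the delta-weighted empirical risk.
    """
    fp = [0] * k
    tp = [0] * k
    for pred, true in zip(predictions, truths):
        for j in pred:
            if j in true:
                tp[j] += 1
            else:
                fp[j] += 1
    return {j for j in range(k) if delta * fp[j] <= (1 - delta) * tp[j]}
```

The pruned classifier h_new then outputs only the kept labels from h_old's predictions.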
Simple Label Pruning Approach
• Explicitly minimizes empirical risk with respect to the δ-weighted loss:
ℓ(h(x), y) = Σ_{j=1}^k [ δ · 1{h_j(x) = 1, y_j = 0} + (1 − δ) · 1{h_j(x) = 0, y_j = 1} ]
(the first term counts false positives, the second false negatives)
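For concreteness, the δ-weighted loss on a single example can be written as follows (the set representation of labels is an assumption for illustration):

```python
def weighted_loss(pred, true, delta=0.5):
    """delta-weighted multilabel loss for one example:
    delta per false positive + (1 - delta) per false negative."""
    fp = len(pred - true)   # labels predicted but absent
    fn = len(true - pred)   # labels present but not predicted
    return delta * fp + (1 - delta) * fn
```

The empirical risk of a classifier is the average of this loss over the validation examples; the pruning rule above removes exactly those labels whose removal lowers that average.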
Main Question
Would this actually reduce the risk? I.e., is
E_{x,y}[ℓ(h_old(x), y)] − E_{x,y}[ℓ(h_new(x), y)]
positive?
Baseline Approach
• Prove that uniformly over all labels j,
FP_j / m ≈ fp_j = Pr(label j predicted but not present)
TP_j / m ≈ tp_j = Pr(label j predicted and present)
• Problem: m, k → ∞ together. Many classes only have a handful of examples
Uniform Convergence Approach
• The algorithm implicitly chooses a hypothesis from a certain hypothesis class
– Pruning rules on top of the fixed predictor h_old
• Prove uniform convergence by bounding the VC dimension / Rademacher complexity
• Conclude that if the empirical risk decreases, the risk decreases as well
Uniform Convergence Fails
• Unfortunately, no uniform convergence...
• ... and even no algorithm/data-dependent convergence:
E[R(h_new)] − E[R̂(h_new)] ≥ (1/2) Σ_{j=1}^k E[ 1{j pruned} · (tp_j − fp_j) ]
where, for δ = 1/2, label j is pruned exactly when FP_j > TP_j
(R = risk, R̂ = empirical risk)
• Only weak correlation between the pruning events and tp_j − fp_j in the k ≫ m regime
A Less Obvious Approach
• Prove directly that the risk decreases
• Important (but mild) assumption: each example is labeled by at most t labels
• Step 1: The risk of h_new is concentrated: with high probability over the sample, R(h_new) ≈ E[R(h_new)]
A Less Obvious Approach
• Step 2: Enough to prove R(h_old) − E[R(h_new)] > 0
• Assuming δ = 1/2 for simplicity, it can be shown that
R(h_old) − E[R(h_new)] > pos − O( sqrt( ‖p‖_{1/2} / m ) )
where p_j = fp_j + tp_j and ‖p‖_{1/2} = (Σ_{j=1}^k √p_j)²
A Less Obvious Approach (cont.)
• Here pos is the positive term (1/2) Σ_{j: fp_j ≥ tp_j} (fp_j − tp_j): the gain from pruning exactly the labels whose false positives outweigh their true positives
• ‖p‖_{1/2} is always at most k for a probability vector, and is smaller the more non-uniform the distribution
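A quick numeric check of these two properties of ‖p‖_{1/2} (the helper name and the k = 1000 setup are arbitrary choices for illustration):

```python
def half_norm(p):
    """(sum_j sqrt(p_j))**2 for a probability vector p: at most k,
    with equality for the uniform distribution."""
    return sum(x ** 0.5 for x in p) ** 2

k = 1000
uniform = [1.0 / k] * k

# Power-law weights p_j proportional to j**(-1.6), as for Wikipedia categories.
raw = [j ** -1.6 for j in range(1, k + 1)]
z = sum(raw)
powerlaw = [x / z for x in raw]

print(half_norm(uniform))   # equals k (the maximum) up to rounding
print(half_norm(powerlaw))  # far smaller than k
```

The non-uniform (power-law) case is what makes the O(sqrt(‖p‖_{1/2}/m)) error term manageable even when k is huge.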
Wikipedia Power-Law
• Wikipedia category frequencies follow a power law with exponent 1.6
• Plugging this in: ‖p‖_{1/2} = O(k^{0.4}), so
R(h_old) − E[R(h_new)] > pos − O( sqrt( k^{0.4} / m ) )
Experiment Click graph on the entire web (based on search engine logs)
Experiment Categories from Wikipedia pages propagated twice through graph
Experiment
• Train/test split of Wikipedia pages
• How well do categories propagated from the training set predict the categories of test-set pages?
Another less obvious approach
R(h_old) − E[R(h_new)] = (1/2) Σ_{j=1}^k E[ 1{FP_j > TP_j} · (fp_j − tp_j) ]
• Weak but positive correlation between 1{FP_j > TP_j} and fp_j − tp_j, even if there are only a few examples per label
• For large k, the sum will tend to be positive
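The "weak but positive correlation" claim can be sanity-checked with a toy simulation. Everything here (the rate model, the 5 validation samples per label, the function name) is invented for illustration, not taken from the talk:

```python
import random

def pruning_gain(k, n=5, seed=0):
    """Toy estimate of (1/k) sum_j 1{FP_j > TP_j} (fp_j - tp_j).

    Each label j gets true rates (tp_j, fp_j), and its empirical counts
    come from just n Bernoulli draws per rate -- far too few to estimate
    any single label reliably.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(k):
        tp_rate = rng.random() * 0.2
        fp_rate = rng.random() * 0.2
        TP = sum(rng.random() < tp_rate for _ in range(n))
        FP = sum(rng.random() < fp_rate for _ in range(n))
        if FP > TP:             # label would be pruned
            total += fp_rate - tp_rate
    return total / k            # per-label average gain

print(pruning_gain(k=100000))   # small but positive
```

Each individual term is extremely noisy, but summed over many labels the average comes out positive: exactly the phenomenon the slide describes.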
Different Application: Crowdsourcing (Dekel and S., 2009)
Different Application: Crowdsourcing
• How can we improve crowdsourced data?
• Standard approach: repeated labeling, but expensive
• A bootstrap approach:
– Learn a predictor from the data of all workers
– Throw away examples labeled by workers who disagree a lot with the predictor
– Re-train on the remaining examples
• Works! (Under certain assumptions)
• Challenge: workers often label only a handful of examples
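The bootstrap loop above can be sketched as follows. The interface (`train_fn`, the 0.3 disagreement threshold, the per-worker bookkeeping) is an assumed toy design, not the method's actual implementation:

```python
def filter_workers(examples, labels, workers, train_fn, threshold=0.3):
    """Bootstrap filtering of noisy crowd workers (sketch).

    examples[i], labels[i], workers[i]: the i-th example, its crowd label,
    and the worker who provided it. train_fn trains a predictor from
    (examples, labels) and returns a callable.
    """
    # 1. Learn a predictor from all workers' data.
    h = train_fn(examples, labels)

    # 2. Measure each worker's disagreement rate with the predictor.
    disagree, count = {}, {}
    for x, y, w in zip(examples, labels, workers):
        disagree[w] = disagree.get(w, 0) + (h(x) != y)
        count[w] = count.get(w, 0) + 1
    bad = {w for w in count if disagree[w] / count[w] > threshold}

    # 3. Re-train on examples from the remaining workers.
    kept = [(x, y) for x, y, w in zip(examples, labels, workers) if w not in bad]
    return train_fn([x for x, _ in kept], [y for _, y in kept])
```

As with label pruning, each per-worker disagreement estimate is based on very few samples, but the filtering decision aggregates over many workers.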
Different Application: Crowdsourcing # examples/worker might be small, but many workers...
Conclusions
• # classes → ∞ violates the assumptions of most multiclass analyses
– Often based on generalizations of binary classification
• Possible approach: avoid standard analysis
– "Extreme X" can be a blessing rather than a curse
• Other applications? More complex learning algorithms (e.g. substitution)?
Thanks!