Active Learning for Sparse Bayesian Multilabel Classification
Deepak Vasisht (MIT & IIT Delhi), Andreas Damianou (University of Sheffield), Manik Varma (MSR India), Ashish Kapoor (MSR Redmond)
Multilabel Classification
Given a set of datapoints, the goal is to annotate them with a set of labels.
$x_i \in \mathbb{R}^d$: feature vector, where $d$ is the dimension of the feature space
Example labels: Iraq, Sea, Sky, Flowers, Human, Brick, Sun
Training
WikiLSHTC has 325K labels. Good luck with that!!
Training Is Expensive
• Training data can also be very expensive to obtain, e.g., genomic or chemical data
• Getting each label incurs additional cost
Need to reduce the required training data.
Active Learning
Picture an N × L matrix: N datapoints (rows) against L labels (columns, e.g., Iraq, Flowers, Sun, Sky), with a few entries already annotated as 0 or 1.
• Which datapoints should I label?
• For a particular datapoint, which labels should I reveal?
• Can I choose datapoint-label pairs to annotate?
In this talk
An active learner for multilabel classification that:
• Answers all of the questions above
• Is computationally cheap
• Is non-myopic and near-optimal
• Incorporates label sparsity
• Achieves higher accuracy than the state of the art
Classification Model* (*Kapoor et al., NIPS 2012)
Each feature vector $x_i$ is mapped by a weight matrix $W$ to a compressed space $z_i = (z_i^1, \ldots, z_i^k)$; a compression matrix $\Phi$ links the label vector $y_i = (y_i^1, \ldots, y_i^L)$ to the same compressed space; each label $y_i^j$ has an associated sparsity parameter $\alpha_i^j$.
Classification Model: Potentials
$f_{x_i}(W, z_i) = e^{-\frac{\|W^T x_i - z_i\|^2}{2\sigma^2}}$
$g_\Phi(y_i, z_i) = e^{-\frac{\|\Phi y_i - z_i\|^2}{2\chi^2}}$
Classification Model: Priors
$y_i^j \sim \mathcal{N}\!\left(0, \tfrac{1}{\alpha_i^j}\right)$
$\alpha_i^j \sim \Gamma(\alpha_i^j;\, a_0, b_0)$
Sparsity Priors: $a_0 = 10^{-6}$, $b_0 = 10^{-6}$
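This reasoning is not spelled out on the slide, but it is the standard ARD/relevance-vector argument for why the Normal-Gamma pair induces sparsity: integrating the Gamma-distributed precision out of the Normal prior leaves a heavy-tailed marginal that is sharply peaked at zero.

$$p(y_i^j) = \int_0^\infty \mathcal{N}\!\left(y_i^j; 0, \tfrac{1}{\alpha}\right)\, \Gamma(\alpha; a_0, b_0)\, d\alpha \;\propto\; \left(b_0 + \tfrac{(y_i^j)^2}{2}\right)^{-(a_0 + 1/2)}$$

With $a_0 = b_0 = 10^{-6}$ this Student-t-like density drives most label scores to (near) zero while still allowing a few to remain large.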
Classification Model
Putting the potentials and priors together:
$f_{x_i}(W, z_i) = e^{-\frac{\|W^T x_i - z_i\|^2}{2\sigma^2}}$, $g_\Phi(y_i, z_i) = e^{-\frac{\|\Phi y_i - z_i\|^2}{2\chi^2}}$, $y_i^j \sim \mathcal{N}(0, 1/\alpha_i^j)$, $\alpha_i^j \sim \Gamma(\alpha_i^j; a_0, b_0)$
Problem: Exact inference is intractable.
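A minimal numpy sketch of the unnormalized log-joint implied by these slides, for a single datapoint; the variable names, shapes, and single-datapoint setup are illustrative assumptions, not the authors' code. It shows how the regression potential, compression potential, label prior, and precision prior combine, and why marginalizing all of them jointly is hard.

```python
import numpy as np
from scipy.stats import gamma

def log_joint(x, y, z, W, Phi, alpha, sigma, chi, a0=1e-6, b0=1e-6):
    """Unnormalized log p(z, y, alpha | x) for one datapoint under the
    slide's model: regression potential, compression potential,
    Normal prior on labels, Gamma prior on precisions."""
    log_f = -np.sum((W.T @ x - z) ** 2) / (2 * sigma ** 2)       # f_{x_i}(W, z_i)
    log_g = -np.sum((Phi @ y - z) ** 2) / (2 * chi ** 2)         # g_Phi(y_i, z_i)
    log_py = np.sum(0.5 * np.log(alpha) - 0.5 * alpha * y ** 2)  # y_i^j ~ N(0, 1/alpha_i^j), constants dropped
    log_pa = np.sum(gamma.logpdf(alpha, a=a0, scale=1.0 / b0))   # alpha_i^j ~ Gamma(a0, b0)
    return log_f + log_g + log_py + log_pa
```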
Inference: Variational Bayes
• Approximate the posterior over the compressed layer ($W$, $z_i$) by Gaussians
• Approximate the posterior over the labels $y_i$ by a Gaussian
• Approximate the posterior over the sparsity parameters $\alpha_i^j$ by a Gamma distribution
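One standard way to write the mean-field factorization these slides imply; the exact grouping of factors is my assumption, since the slide only names the distributional families.

$$p(W, Z, Y, \alpha \mid X) \;\approx\; q(W)\,\prod_i q(z_i)\, q(y_i)\, \prod_{i,j} q(\alpha_i^j)$$

with $q(W)$, $q(z_i)$, $q(y_i)$ Gaussian and $q(\alpha_i^j)$ Gamma, updated cyclically to maximize the evidence lower bound.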
Active Learning Criteria
• Entropy is a measure of uncertainty. For a random variable $X$:
$H(X) = -\sum_i P(x_i)\,\log P(x_i)$
• Picks points far apart from each other
• For a Gaussian process, $H = \frac{1}{2}\log|\Sigma| + \text{const}$
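A small numpy illustration of the Gaussian entropy formula above; the toy covariance matrix is made up for the example.

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy of N(mu, Sigma) for a D-dimensional Gaussian:
    H = 0.5 * log|Sigma| + (D/2) * log(2*pi*e)."""
    D = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)   # numerically stable log-determinant
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * logdet + 0.5 * D * np.log(2 * np.pi * np.e)

# Example: a toy 3x3 covariance
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
print(gaussian_entropy(Sigma))
```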
Active Learning Criteria
• Mutual information measures the reduction in uncertainty over the unlabeled space:
$MI(A, B) = H(A) - H(A \mid B)$
• Used successfully in past work for regression
Active Learning: Mutual Information
• We have already modeled the distribution over the labels $Y$ as a Gaussian process
• The goal is to select the subset of labels that offers the maximum reduction in entropy over the remaining space:
$\mathcal{A}^* = \arg\max_{\mathcal{A} \subseteq \mathcal{U}} H(Y_{\mathcal{U} \setminus \mathcal{A}}) - H(Y_{\mathcal{U} \setminus \mathcal{A}} \mid \mathcal{A})$
Problem: Variance is not preserved across the layers of the model.
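For a jointly Gaussian model, both entropies in the objective have closed forms, so the mutual information for a candidate set A can be read directly off the joint covariance matrix. A hedged numpy sketch; the index handling and jitter term are my own choices, not from the slides.

```python
import numpy as np

def gaussian_mi(Sigma, A, jitter=1e-8):
    """MI(Y_rest; Y_A) = H(Y_rest) - H(Y_rest | Y_A) under a joint Gaussian
    with covariance Sigma. Conditioning uses the Schur complement."""
    n = Sigma.shape[0]
    A = np.asarray(A)
    rest = np.setdiff1d(np.arange(n), A)
    S_rr = Sigma[np.ix_(rest, rest)]
    S_ra = Sigma[np.ix_(rest, A)]
    S_aa = Sigma[np.ix_(A, A)] + jitter * np.eye(len(A))
    # Conditional covariance of Y_rest given Y_A
    S_cond = S_rr - S_ra @ np.linalg.solve(S_aa, S_ra.T)
    # H = 0.5*log|Sigma| + const; the constants cancel in the difference
    return 0.5 * (np.linalg.slogdet(S_rr)[1] - np.linalg.slogdet(S_cond)[1])
```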
Idea: Collapsed Variational Bayes
• Because the potentials $f_{x_i}$, $g_\Phi$ and the prior $y_i^j \sim \mathcal{N}(0, 1/\alpha_i^j)$ are all Gaussian, the intermediate layers can be integrated out analytically, giving a Gaussian distribution over $Y$ (for fixed sparsity parameters)
• Use Variational Bayes only for the sparsity terms $\alpha_i^j \sim \Gamma(\alpha_i^j; a_0, b_0)$
Active Learning: Mutual Information
Returning to the objective $\mathcal{A}^* = \arg\max_{\mathcal{A} \subseteq \mathcal{U}} H(Y_{\mathcal{U} \setminus \mathcal{A}}) - H(Y_{\mathcal{U} \setminus \mathcal{A}} \mid \mathcal{A})$:
Problem: Computing the mutual information exactly still needs exponential time.
Solution: Approximate Mutual Information
• Approximate the final distribution over $Y$ by a Gaussian
• Use the Gaussian to estimate the mutual information $\widehat{MI}$
• Theorem 1: $\lim_{a_0 \to 0,\, b_0 \to 0} \widehat{MI} \to MI$
Active Learning: Mutual Information
Returning again to the objective $\mathcal{A}^* = \arg\max_{\mathcal{A} \subseteq \mathcal{U}} H(Y_{\mathcal{U} \setminus \mathcal{A}}) - H(Y_{\mathcal{U} \setminus \mathcal{A}} \mid \mathcal{A})$:
Problem: The subset selection problem is NP-complete.
Solution: Use Submodularity
• Under some weak conditions, the objective is submodular
• Submodularity ensures that the greedy solution is within a constant factor of the optimal solution (the familiar $1 - 1/e$ bound for monotone submodular maximization)
Algorithm
• Input: feature vectors for a set of unlabeled instances $\mathcal{U}$ and a budget $n$
• Iteratively add to the labeled set $\mathcal{A}$ the datapoint $x$ that leads to the maximum increase in mutual information:
$x \leftarrow \arg\max_{x \in \mathcal{U} \setminus \mathcal{A}} \widehat{MI}(\mathcal{A} \cup x) - \widehat{MI}(\mathcal{A})$
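A hedged sketch of the greedy loop, reusing the gaussian_mi helper defined earlier as a stand-in for the paper's $\widehat{MI}$; the actual criterion in the paper is computed on the collapsed sparse model, not a plain fixed Gaussian.

```python
import numpy as np

def greedy_select(Sigma, budget):
    """Greedily pick `budget` datapoints that maximize the marginal gain in
    (approximate) mutual information. Sigma is the joint covariance over
    the unlabeled pool; gaussian_mi is the helper defined earlier."""
    selected = []
    for _ in range(budget):
        candidates = [i for i in range(Sigma.shape[0]) if i not in selected]
        base = gaussian_mi(Sigma, selected) if selected else 0.0
        gains = [gaussian_mi(Sigma, selected + [i]) - base for i in candidates]
        selected.append(candidates[int(np.argmax(gains))])
    return selected
```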