Active Learning for Sparse Bayesian Multilabel Classification


  1. Active Learning for Sparse Bayesian Multilabel Classification Deepak Vasisht, MIT & IIT Delhi Andreas Damianou, University of Sheffield Manik Varma, MSR, India Ashish Kapoor, MSR, Redmond

  2. Multilabel Classification Given a set of datapoints, the goal is to annotate them with a set of labels.

  4. Multilabel Classification Given a set of datapoints, the goal is to annotate them with a set of labels. $x_i \in \mathbb{R}^d$: feature vector, d: dimension of the feature space

  5. Multilabel Classification Given a set of datapoints, the goal is to annotate them with a set of labels. $x_i \in \mathbb{R}^d$: feature vector, d: dimension of the feature space. Example labels: Iraq, Sea, Sky, Flowers, Human, Brick, Sun
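
To make the setup concrete, here is a minimal sketch of how such a dataset is typically represented; the sizes, names and random data are purely illustrative:

```python
import numpy as np

N, d, L = 500, 100, 7          # datapoints, feature dimension, number of labels
X = np.random.randn(N, d)      # row i is the feature vector x_i in R^d
label_names = ["Iraq", "Sea", "Sky", "Flowers", "Human", "Brick", "Sun"]
Y = (np.random.rand(N, L) < 0.1).astype(int)   # binary N x L matrix: Y[i, j] = 1 if label j applies to x_i
```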

  7. Training

  9. Training WikiLSHTC has 325k labels. Good luck with that!!

  10. Training Is Expensive • Training data can also be very expensive to obtain, e.g. genomic or chemical data • Getting each label incurs additional cost

  11. Training Is Expensive • Training data can also be very expensive to obtain, e.g. genomic or chemical data • Getting each label incurs additional cost Need to reduce the required training data.

  12. Active Learning [Diagram: an N × L matrix of datapoints (rows 1…N) against labels (columns 1…L: Iraq, Flowers, Sun, Sky, …)]

  13. Active Learning [Diagram: the same N × L datapoint-label matrix, now with a few 0/1 annotations filled in]

  15. Active Learning [Diagram: the N × L datapoint-label matrix] Which datapoints should I label?

  16. Active Learning [Diagram: the N × L datapoint-label matrix] For a particular datapoint, which labels should I reveal?

  17. Active Learning [Diagram: the N × L datapoint-label matrix] Can I choose datapoint-label pairs to annotate?

  18. In this talk • An active learner for multilabel classification that: • Answers all three questions above • Is computationally cheap • Is non-myopic and near-optimal • Incorporates label sparsity • Achieves higher accuracy than the state of the art

  19. Classification

  20. Classification Model* [Graphical model: feature vector $x_i$ connected to the labels $y_i^1, y_i^2, \ldots, y_i^L$] *Kapoor et al., NIPS 2012

  21. Classification Model* [Graphical model: $x_i$ connected to a compressed space $z_i^1, \ldots, z_i^k$, which is linked to the labels $y_i^1, \ldots, y_i^L$ through $\Phi$] *Kapoor et al., NIPS 2012

  22. Classification Model* [Graphical model: $x_i$ mapped to the compressed space $z_i^1, \ldots, z_i^k$ through $W$; the compressed space linked to the labels $y_i^1, \ldots, y_i^L$ through $\Phi$] *Kapoor et al., NIPS 2012

  23. Classification Model* [Graphical model: as before, with sparsity variables $\alpha_i^1, \ldots, \alpha_i^L$ attached to the labels $y_i^1, \ldots, y_i^L$] *Kapoor et al., NIPS 2012
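
A rough sketch of the pipeline these slides build up: W maps features into the k-dimensional compressed space and Φ compresses the L-dimensional label space. The linear back-projection used as a decoder here only illustrates the shapes involved; it is not the paper's Bayesian inference:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, L = 100, 20, 1000                      # hypothetical sizes: features, compressed dim, labels
W = rng.normal(size=(d, k))                  # stands in for the learned feature-to-z weights
Phi = rng.normal(size=(k, L)) / np.sqrt(k)   # random compression of the label space

x_new = rng.normal(size=d)
z_hat = W.T @ x_new                          # prediction in the compressed space
scores = Phi.T @ z_hat                       # naive back-projection to per-label scores
predicted = scores > np.quantile(scores, 0.99)   # placeholder thresholding, not the paper's decoder
```

The point of the compressed space is that k can be much smaller than L, so only k regressors need to be learned even when there are hundreds of thousands of labels.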

  24. Classification Model: Potentials $f_{x_i}(W, z_i) = e^{-\|W^T x_i - z_i\|^2 / (2\sigma^2)}$ [graphical model diagram as before]

  25. Classification Model: Potentials $f_{x_i}(W, z_i) = e^{-\|W^T x_i - z_i\|^2 / (2\sigma^2)}$, $g_\Phi(y_i, z_i) = e^{-\|\Phi y_i - z_i\|^2 / (2\chi^2)}$ [graphical model diagram as before]
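
A minimal NumPy sketch of these two Gaussian potentials, assuming W is d × k, Φ is k × L, and z_i is the k-dimensional compressed representation (function names are ours, not the paper's):

```python
import numpy as np

def feature_potential(W, x_i, z_i, sigma):
    # f_{x_i}(W, z_i): ties the compressed point z_i to the projection W^T x_i
    return np.exp(-np.sum((W.T @ x_i - z_i) ** 2) / (2.0 * sigma ** 2))

def label_potential(Phi, y_i, z_i, chi):
    # g_Phi(y_i, z_i): ties z_i to the compressed label vector Phi y_i
    return np.exp(-np.sum((Phi @ y_i - z_i) ** 2) / (2.0 * chi ** 2))
```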

  26. Classification Model: Priors $y_i^j \sim \mathcal{N}(0, 1/\alpha_i^j)$ [graphical model diagram as before]

  27. Classification Model: Priors $y_i^j \sim \mathcal{N}(0, 1/\alpha_i^j)$, $\alpha_i^j \sim \Gamma(\alpha_i^j; a_0, b_0)$ [graphical model diagram as before]

  28. Sparsity Priors $a_0 = 10^{-6}$, $b_0 = 10^{-6}$
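
Why such tiny hyperparameters encourage sparsity: integrating the Gamma prior over the precision gives a heavy-tailed marginal on each label score that is sharply peaked at zero (the standard ARD argument; this derivation is a sketch, not shown on the slides):

```latex
p(y_i^j) = \int_0^\infty \mathcal{N}\!\left(y_i^j;\, 0, \tfrac{1}{\alpha}\right)
           \Gamma(\alpha;\, a_0, b_0)\, d\alpha
         \;\propto\; \left(b_0 + \tfrac{(y_i^j)^2}{2}\right)^{-(a_0 + 1/2)}
```

As $a_0, b_0 \to 0$ this Student-t-like marginal concentrates mass near zero while keeping heavy tails, so most label scores are driven to zero and only a few stay large.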

  29. Classification Model [Full model: $f_{x_i}(W, z_i) = e^{-\|W^T x_i - z_i\|^2 / (2\sigma^2)}$, $g_\Phi(y_i, z_i) = e^{-\|\Phi y_i - z_i\|^2 / (2\chi^2)}$, $y_i^j \sim \mathcal{N}(0, 1/\alpha_i^j)$, $\alpha_i^j \sim \Gamma(\alpha_i^j; a_0, b_0)$]

  30. Classification Model [Full model as on the previous slide] Problem: Exact inference is intractable.

  31. Inference: Variational Bayes [graphical model diagram as before]

  32. Inference: Variational Bayes [diagram: the compressed-space variables $z_i$ get an approximate Gaussian posterior]

  33. Inference: Variational Bayes [diagram: the compressed-space variables $z_i$ and the labels $y_i$ both get approximate Gaussian posteriors]

  34. Inference: Variational Bayes [diagram: $z_i$ and $y_i$ get approximate Gaussian posteriors; the sparsity variables $\alpha_i$ get an approximate Gamma posterior]

  35. Inference: Variational Bayes [diagram: the labels $y_i$ get an approximate Gaussian posterior]

  36. Active Learning Criteria • Entropy is a measure of uncertainty. For a random variable X, the entropy is $H(X) = -\sum_i P(x_i) \log P(x_i)$ • Picks points far apart from each other • For a Gaussian process, $H = \frac{1}{2} \log |\Sigma| + \text{const}$
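
For a multivariate Gaussian this entropy is available in closed form from the covariance matrix; a small sketch (the helper name is ours):

```python
import numpy as np

def gaussian_entropy(Sigma):
    # Entropy of a k-dimensional Gaussian: 0.5*log|Sigma| + 0.5*k*log(2*pi*e)
    k = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * logdet + 0.5 * k * np.log(2.0 * np.pi * np.e)
```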

  37. Active Learning Criteria • Mutual information measures the reduction in uncertainty over the unlabeled space: $MI(A, B) = H(A) - H(A \mid B)$ • Used successfully in past work for regression
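
For jointly Gaussian variables both terms have closed forms, because conditioning replaces the covariance of A by its Schur complement; a sketch reusing gaussian_entropy from above (index arguments are illustrative):

```python
import numpy as np

def gaussian_mutual_info(Sigma, idx_a, idx_b):
    # MI(A, B) = H(A) - H(A | B) for a jointly Gaussian vector with covariance Sigma
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    S_cond = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)   # covariance of A given B
    return gaussian_entropy(S_aa) - gaussian_entropy(S_cond)
```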

  38. Active Learning: Mutual Information • We have already modeled the distribution over labels Y as a Gaussian process • The goal is to select the subset of labels that offers the maximum reduction in entropy over the remaining space: $A^* = \arg\max_{A \subseteq U} H(Y_{U \setminus A}) - H(Y_{U \setminus A} \mid A)$

  39. Active Learning: Mutual Information • We have already modeled the distribution over labels Y as a Gaussian process • The goal is to select the subset of labels that offers the maximum reduction in entropy over the remaining space: $A^* = \arg\max_{A \subseteq U} H(Y_{U \setminus A}) - H(Y_{U \setminus A} \mid A)$ Problem: Variance is not preserved across layers

  40. Idea: Collapsed Variational Bayes

  41. Idea: Collapsed Variational Bayes [Full graphical model with potentials $f_{x_i}(W, z_i) = e^{-\|W^T x_i - z_i\|^2 / (2\sigma^2)}$, $g_\Phi(y_i, z_i) = e^{-\|\Phi y_i - z_i\|^2 / (2\chi^2)}$ and priors $y_i^j \sim \mathcal{N}(0, 1/\alpha_i^j)$, $\alpha_i^j \sim \Gamma(\alpha_i^j; a_0, b_0)$]

  42. Idea: Collapsed Variational Bayes [Full graphical model as on the previous slide]

  43. Idea: Collapsed Variational Bayes Integrate to get a Gaussian distribution over Y [full graphical model as before]

  44. Idea: Collapsed Variational Bayes Integrate to get a Gaussian distribution over Y. Use Variational Bayes for sparsity. [full graphical model as before]

  45. Active Learning: Mutual Information • We have already modeled the distribution over labels Y as a Gaussian process • The goal is to select the subset of labels that offers the maximum reduction in entropy over the remaining space: $A^* = \arg\max_{A \subseteq U} H(Y_{U \setminus A}) - H(Y_{U \setminus A} \mid A)$

  46. Active Learning: Mutual Information • We have already modeled the distribution over labels Y as a Gaussian process • The goal is to select the subset of labels that offers the maximum reduction in entropy over the remaining space: $A^* = \arg\max_{A \subseteq U} H(Y_{U \setminus A}) - H(Y_{U \setminus A} \mid A)$ Problem: Computing mutual information still needs exponential time

  47. Solution: Approximate Mutual Information • Approximate the final distribution over Y by a Gaussian • Use the Gaussian to estimate the mutual information $\hat{MI}$ • Theorem 1: $\hat{MI} \to MI$ as $a_0 \to 0,\ b_0 \to 0$

  48. Active Learning: Mutual Information • We have already modeled the distribution over labels Y as a Gaussian process • The goal is to select the subset of labels that offers the maximum reduction in entropy over the remaining space: $A^* = \arg\max_{A \subseteq U} H(Y_{U \setminus A}) - H(Y_{U \setminus A} \mid A)$ Problem: The subset selection problem is NP-complete

  49. Solution: Use Submodularity • Under some weak conditions, the objective is submodular • Submodularity ensures that the greedy solution is within a constant factor of the optimal solution

  50. Algorithm • Input: feature vectors for a set of unlabeled instances U and a budget n • Iteratively add to the labeled set A the datapoint x that leads to the maximum increase in MI: $x \leftarrow \arg\max_{x \in U \setminus A} \hat{MI}(A \cup \{x\}) - \hat{MI}(A)$
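
A sketch of this greedy loop under the Gaussian approximation, reusing gaussian_mutual_info from the earlier sketch; it follows the slide's selection rule but is not the authors' implementation:

```python
def mi_hat(Sigma, selected, pool):
    # MI-hat(A) = H(Y_{U\A}) - H(Y_{U\A} | Y_A) under the Gaussian approximation
    rest = [i for i in pool if i not in selected]
    if not selected or not rest:
        return 0.0
    return gaussian_mutual_info(Sigma, rest, selected)

def greedy_select(Sigma, budget):
    # Iteratively add the datapoint whose inclusion yields the largest MI gain
    pool = list(range(Sigma.shape[0]))
    selected = []
    for _ in range(budget):
        gains = {x: mi_hat(Sigma, selected + [x], pool) - mi_hat(Sigma, selected, pool)
                 for x in pool if x not in selected}
        selected.append(max(gains, key=gains.get))
    return selected
```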
