Active Learning for Sparse Bayesian Multilabel Classification
Deepak Vasisht (MIT & IIT Delhi), Andreas Damianou (University of Sheffield), Manik Varma (MSR India), Ashish Kapoor (MSR Redmond)
Multilabel Classification
Given a set of datapoints, the goal is to annotate them with a set of labels.
$x_i \in \mathbb{R}^d$: feature vector, where $d$ is the dimension of the feature space
Example labels: Iraq, Sea, Sky, Flowers, Human, Brick, Sun
Training
WikiLSHTC has 325K labels. Good luck with that!!
Training Is Expensive
• Training data can also be very expensive to obtain, e.g., genomic or chemical data
• Getting each label incurs additional cost
Need to reduce the required training data.
Active Learning
Picture an N × L matrix: N datapoints (rows) against L labels (columns, e.g., Iraq, Flowers, Sun, Sky), with a few entries already annotated as 0 or 1.
• Which datapoints should I label?
• For a particular datapoint, which labels should I reveal?
• Can I choose datapoint-label pairs to annotate?
In this talk
An active learner for multilabel classification that:
• Answers all of the questions above
• Is computationally cheap
• Is non-myopic and near-optimal
• Incorporates label sparsity
• Achieves higher accuracy than the state of the art
Classification Model* (*Kapoor et al., NIPS 2012)
Each feature vector $x_i$ is mapped by a weight matrix $W$ to a compressed space $z_i = (z_i^1, \ldots, z_i^k)$; a compression matrix $\Phi$ links the label vector $y_i = (y_i^1, \ldots, y_i^L)$ to the same compressed space; each label $y_i^j$ has an associated sparsity parameter $\alpha_i^j$.
Classification Model: Potentials
$f_{x_i}(W, z_i) = e^{-\frac{\|W^T x_i - z_i\|^2}{2\sigma^2}}$
$g_\Phi(y_i, z_i) = e^{-\frac{\|\Phi y_i - z_i\|^2}{2\chi^2}}$
Classification Model: Priors
$y_i^j \sim \mathcal{N}\!\left(0, \tfrac{1}{\alpha_i^j}\right)$
$\alpha_i^j \sim \Gamma(\alpha_i^j;\, a_0, b_0)$
Sparsity Priors: $a_0 = 10^{-6}$, $b_0 = 10^{-6}$
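This reasoning is not spelled out on the slide, but it is the standard ARD/relevance-vector argument for why the Normal-Gamma pair induces sparsity: integrating the Gamma-distributed precision out of the Normal prior leaves a heavy-tailed marginal that is sharply peaked at zero.

$$p(y_i^j) = \int_0^\infty \mathcal{N}\!\left(y_i^j; 0, \tfrac{1}{\alpha}\right)\, \Gamma(\alpha; a_0, b_0)\, d\alpha \;\propto\; \left(b_0 + \tfrac{(y_i^j)^2}{2}\right)^{-(a_0 + 1/2)}$$

With $a_0 = b_0 = 10^{-6}$ this Student-t-like density drives most label scores to (near) zero while still allowing a few to remain large.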
Classification Model
Putting the potentials and priors together:
$f_{x_i}(W, z_i) = e^{-\frac{\|W^T x_i - z_i\|^2}{2\sigma^2}}$, $g_\Phi(y_i, z_i) = e^{-\frac{\|\Phi y_i - z_i\|^2}{2\chi^2}}$, $y_i^j \sim \mathcal{N}(0, 1/\alpha_i^j)$, $\alpha_i^j \sim \Gamma(\alpha_i^j; a_0, b_0)$
Problem: Exact inference is intractable.
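A minimal numpy sketch of the unnormalized log-joint implied by these slides, for a single datapoint; the variable names, shapes, and single-datapoint setup are illustrative assumptions, not the authors' code. It shows how the regression potential, compression potential, label prior, and precision prior combine, and why marginalizing all of them jointly is hard.

```python
import numpy as np
from scipy.stats import gamma

def log_joint(x, y, z, W, Phi, alpha, sigma, chi, a0=1e-6, b0=1e-6):
    """Unnormalized log p(z, y, alpha | x) for one datapoint under the
    slide's model: regression potential, compression potential,
    Normal prior on labels, Gamma prior on precisions."""
    log_f = -np.sum((W.T @ x - z) ** 2) / (2 * sigma ** 2)       # f_{x_i}(W, z_i)
    log_g = -np.sum((Phi @ y - z) ** 2) / (2 * chi ** 2)         # g_Phi(y_i, z_i)
    log_py = np.sum(0.5 * np.log(alpha) - 0.5 * alpha * y ** 2)  # y_i^j ~ N(0, 1/alpha_i^j), constants dropped
    log_pa = np.sum(gamma.logpdf(alpha, a=a0, scale=1.0 / b0))   # alpha_i^j ~ Gamma(a0, b0)
    return log_f + log_g + log_py + log_pa
```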
Inference: Variational Bayes
• Approximate the posterior over the compressed layer ($W$, $z_i$) by Gaussians
• Approximate the posterior over the labels $y_i$ by a Gaussian
• Approximate the posterior over the sparsity parameters $\alpha_i^j$ by a Gamma distribution
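One standard way to write the mean-field factorization these slides imply; the exact grouping of factors is my assumption, since the slide only names the distributional families.

$$p(W, Z, Y, \alpha \mid X) \;\approx\; q(W)\,\prod_i q(z_i)\, q(y_i)\, \prod_{i,j} q(\alpha_i^j)$$

with $q(W)$, $q(z_i)$, $q(y_i)$ Gaussian and $q(\alpha_i^j)$ Gamma, updated cyclically to maximize the evidence lower bound.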
Active Learning Criteria
• Entropy is a measure of uncertainty. For a random variable $X$:
$H(X) = -\sum_i P(x_i)\,\log P(x_i)$
• Picks points far apart from each other
• For a Gaussian process, $H = \frac{1}{2}\log|\Sigma| + \text{const}$
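A small numpy illustration of the Gaussian entropy formula above; the toy covariance matrix is made up for the example.

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy of N(mu, Sigma) for a D-dimensional Gaussian:
    H = 0.5 * log|Sigma| + (D/2) * log(2*pi*e)."""
    D = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)   # numerically stable log-determinant
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * logdet + 0.5 * D * np.log(2 * np.pi * np.e)

# Example: a toy 3x3 covariance
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
print(gaussian_entropy(Sigma))
```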
Active Learning Criteria
• Mutual information measures the reduction in uncertainty over the unlabeled space:
$MI(A, B) = H(A) - H(A \mid B)$
• Used successfully in past work for regression
Active Learning: Mutual Information
• We have already modeled the distribution over the labels $Y$ as a Gaussian process
• The goal is to select the subset of labels that offers the maximum reduction in entropy over the remaining space:
$\mathcal{A}^* = \arg\max_{\mathcal{A} \subseteq \mathcal{U}} H(Y_{\mathcal{U} \setminus \mathcal{A}}) - H(Y_{\mathcal{U} \setminus \mathcal{A}} \mid \mathcal{A})$
Problem: Variance is not preserved across the layers of the model.
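For a jointly Gaussian model, both entropies in the objective have closed forms, so the mutual information for a candidate set A can be read directly off the joint covariance matrix. A hedged numpy sketch; the index handling and jitter term are my own choices, not from the slides.

```python
import numpy as np

def gaussian_mi(Sigma, A, jitter=1e-8):
    """MI(Y_rest; Y_A) = H(Y_rest) - H(Y_rest | Y_A) under a joint Gaussian
    with covariance Sigma. Conditioning uses the Schur complement."""
    n = Sigma.shape[0]
    A = np.asarray(A)
    rest = np.setdiff1d(np.arange(n), A)
    S_rr = Sigma[np.ix_(rest, rest)]
    S_ra = Sigma[np.ix_(rest, A)]
    S_aa = Sigma[np.ix_(A, A)] + jitter * np.eye(len(A))
    # Conditional covariance of Y_rest given Y_A
    S_cond = S_rr - S_ra @ np.linalg.solve(S_aa, S_ra.T)
    # H = 0.5*log|Sigma| + const; the constants cancel in the difference
    return 0.5 * (np.linalg.slogdet(S_rr)[1] - np.linalg.slogdet(S_cond)[1])
```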
Idea: Collapsed Variational Bayes
• Because the potentials $f_{x_i}$, $g_\Phi$ and the prior $y_i^j \sim \mathcal{N}(0, 1/\alpha_i^j)$ are all Gaussian, the intermediate layers can be integrated out analytically, giving a Gaussian distribution over $Y$ (for fixed sparsity parameters)
• Use Variational Bayes only for the sparsity terms $\alpha_i^j \sim \Gamma(\alpha_i^j; a_0, b_0)$
Active Learning: Mutual Information
Returning to the objective $\mathcal{A}^* = \arg\max_{\mathcal{A} \subseteq \mathcal{U}} H(Y_{\mathcal{U} \setminus \mathcal{A}}) - H(Y_{\mathcal{U} \setminus \mathcal{A}} \mid \mathcal{A})$:
Problem: Computing the mutual information exactly still needs exponential time.
Solution: Approximate Mutual Information
• Approximate the final distribution over $Y$ by a Gaussian
• Use the Gaussian to estimate the mutual information $\widehat{MI}$
• Theorem 1: $\lim_{a_0 \to 0,\, b_0 \to 0} \widehat{MI} \to MI$
Active Learning: Mutual Information
Returning again to the objective $\mathcal{A}^* = \arg\max_{\mathcal{A} \subseteq \mathcal{U}} H(Y_{\mathcal{U} \setminus \mathcal{A}}) - H(Y_{\mathcal{U} \setminus \mathcal{A}} \mid \mathcal{A})$:
Problem: The subset selection problem is NP-complete.
Solution: Use Submodularity
• Under some weak conditions, the objective is submodular
• Submodularity ensures that the greedy solution is within a constant factor of the optimal solution (the familiar $1 - 1/e$ bound for monotone submodular maximization)
Algorithm
• Input: feature vectors for a set of unlabeled instances $\mathcal{U}$ and a budget $n$
• Iteratively add to the labeled set $\mathcal{A}$ the datapoint $x$ that leads to the maximum increase in mutual information:
$x \leftarrow \arg\max_{x \in \mathcal{U} \setminus \mathcal{A}} \widehat{MI}(\mathcal{A} \cup x) - \widehat{MI}(\mathcal{A})$
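A hedged sketch of the greedy loop, reusing the gaussian_mi helper defined earlier as a stand-in for the paper's $\widehat{MI}$; the actual criterion in the paper is computed on the collapsed sparse model, not a plain fixed Gaussian.

```python
import numpy as np

def greedy_select(Sigma, budget):
    """Greedily pick `budget` datapoints that maximize the marginal gain in
    (approximate) mutual information. Sigma is the joint covariance over
    the unlabeled pool; gaussian_mi is the helper defined earlier."""
    selected = []
    for _ in range(budget):
        candidates = [i for i in range(Sigma.shape[0]) if i not in selected]
        base = gaussian_mi(Sigma, selected) if selected else 0.0
        gains = [gaussian_mi(Sigma, selected + [i]) - base for i in candidates]
        selected.append(candidates[int(np.argmax(gains))])
    return selected
```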