

  1. Bayesian Batch Active Learning as Sparse Subset Approximation. Robert Pinsler, Jonathan Gordon, Eric Nalisnick, José Miguel Hernández-Lobato. October 2019 Research Talk

  2. Introduction • Acquiring labels for supervised learning can be costly and time-consuming • In such settings, active learning (AL) enables data-efficient model training by intelligently selecting points for which labels should be requested

  3. Introduction [Diagram: pool-based active learning (AL). The model is trained on the labeled training set, queries are selected from the unlabeled pool set, and an oracle provides the requested labels.]

  4. Introduction Sequential AL loop [Diagram: train model → select data point → query single data point → update model, then repeat.]

  5. Introduction Sequential AL loop [Diagram: as on the previous slide.] Batch AL approaches: - scale to large datasets and models - enable parallel data acquisition - (ideally) trade off diversity and representativeness How to construct such a batch?

  6. Bayesian Batch Active Learning Bayesian approach: Choose the set of points that maximally reduces uncertainty over the parameter posterior • NP-hard, but greedy approximations exist: MaxEnt, BALD • Naïve batch strategy: Select the b best points according to the acquisition function

  7. Bayesian Batch Active Learning Bayesian approach: Choose the set of points that maximally reduces uncertainty over the parameter posterior • NP-hard, but greedy approximations exist: MaxEnt, BALD • Naïve batch strategy: Select the b best points according to the acquisition function Budget is wasted on selecting nearby points [Figure: example batches selected by MaxEnt and BALD]

  8. Bayesian Batch Active Learning Bayesian approach: Choose the set of points that maximally reduces uncertainty over the parameter posterior • NP-hard, but greedy approximations exist: MaxEnt, BALD • Naïve batch strategy: Select the b best points according to the acquisition function Budget is wasted on selecting nearby points [Figure: example batches selected by MaxEnt and BALD] Idea: Re-cast batch construction as optimizing a sparse subset approximation to the complete data posterior

  9. Bayesian Batch Active Learning Bayesian approach: Choose the set of points that maximally reduces uncertainty over the parameter posterior • NP-hard, but greedy approximations exist: MaxEnt, BALD • Naïve batch strategy: Select the b best points according to the acquisition function [Figure: example batches selected by MaxEnt, BALD, and ours] Idea: Re-cast batch construction as optimizing a sparse subset approximation to the complete data posterior
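
For concreteness, the naïve batch strategy criticized on these slides can be sketched in a few lines. This is a generic illustration rather than code from the talk; `scores` stands in for any acquisition function (MaxEnt, BALD, ...) evaluated once per pool point.

```python
import numpy as np

def naive_batch(scores, b):
    """Naive batch strategy: take the b pool points with the highest
    acquisition scores, ignoring redundancy between them."""
    return np.argsort(scores)[::-1][:b]

# Hypothetical usage: one acquisition value per pool point.
scores = np.random.rand(1000)          # stand-in for MaxEnt/BALD scores
batch_idx = naive_batch(scores, b=10)  # indices of the queried batch
```

Because the top-b scores are typically attained by near-identical points, such a batch wastes budget, which is exactly the failure mode the figure illustrates.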

  10. Related Work Bayesian coresets Idea: Re-cast batch construction as optimizing a sparse subset approximation to the complete data posterior. We take inspiration from Bayesian coresets • Coreset: Summarize data by a sparse, weighted subset • Bayesian coreset: Approximate the posterior by a sparse, weighted subset

  11. Related Work Bayesian coresets Idea: Re-cast batch construction as optimizing a sparse subset approximation to the complete data posterior. We take inspiration from Bayesian coresets • Coreset: Summarize data by a sparse, weighted subset • Bayesian coreset: Approximate the posterior by a sparse, weighted subset • Batch AL with Bayesian coresets: Batch = Bayesian coreset
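
As a reminder of the underlying idea (this is the standard Bayesian-coresets formulation, not text from the deck): the full-data log-likelihood is approximated by a sparse, non-negatively weighted subset of its terms.

```latex
% Bayesian coreset: sparse, weighted approximation of the log-likelihood
\[
\sum_{n=1}^{N} \log p(y_n \mid x_n, \theta)
\;\approx\;
\sum_{n=1}^{N} w_n \, \log p(y_n \mid x_n, \theta),
\qquad w_n \ge 0, \quad \lVert w \rVert_0 \ll N .
\]
```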

  12. Batch Construction as Sparse Subset Approximation Choose the batch D' ⊆ D_pool such that the updated log posterior log p(θ | D_0 ∪ D') best approximates the complete-data log posterior log p(θ | D_0 ∪ D_pool)

  13. Batch Construction as Sparse Subset Approximation Choose the batch D' ⊆ D_pool such that the updated log posterior log p(θ | D_0 ∪ D') best approximates the complete-data log posterior log p(θ | D_0 ∪ D_pool) We don't know the labels of the points in the pool set before querying them

  14. Batch Construction as Sparse Subset Approximation Choose the batch D' ⊆ D_pool such that the updated log posterior log p(θ | D_0 ∪ D') best approximates the complete-data log posterior log p(θ | D_0 ∪ D_pool) We don't know the labels of the points in the pool set before querying them Take expectation w.r.t. the current predictive posterior distribution:

  15. Batch Construction as Sparse Subset Approximation Choose the batch D' ⊆ D_pool such that the updated log posterior log p(θ | D_0 ∪ D') best approximates the complete-data log posterior log p(θ | D_0 ∪ D_pool) We don't know the labels of the points in the pool set before querying them Take expectation w.r.t. the current predictive posterior distribution:
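
In symbols (a reconstruction of the slide's missing equation; the notation D_0 for the current labelled set is an assumption), the unknown label of each pool point is marginalized out under the current predictive posterior:

```latex
% Expected log-likelihood of pool point n under the current predictive posterior
\[
\mathcal{L}_n(\theta) \;=\;
  \mathbb{E}_{p(y_n \mid x_n, \mathcal{D}_0)}\!\big[\log p(y_n \mid x_n, \theta)\big],
\qquad
\mathcal{L}(\theta) \;=\; \sum_{n=1}^{N} \mathcal{L}_n(\theta).
\]
```

The batch is then chosen so that a weighted sum of the L_n terms approximates L, as the following slides make precise.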

  16. Batch Construction as Sparse Subset Approximation Hilbert coresets

  17. Batch Construction as Sparse Subset Approximation Hilbert coresets

  18. Batch Construction as Sparse Subset Approximation Hilbert coresets • Considers the directionality of the residual error → adaptively construct the batch while accounting for similarity between data points (induced by the norm) • Still intractable!
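
Putting the last few slides together, a hedged reconstruction of the resulting optimization problem: pick at most b pool points whose summed expected log-likelihood best matches that of the full pool, measured in a Hilbert norm (symbols as in the block above).

```latex
% Sparse subset approximation: at most b pool points whose combined
% expected log-likelihood best matches that of the full pool
\[
\min_{w \in \{0,1\}^{N}}\;
  \Big\lVert \mathcal{L} - \sum_{n=1}^{N} w_n \mathcal{L}_n \Big\rVert^{2}
\quad \text{s.t.} \quad \sum_{n=1}^{N} w_n \le b .
\]
```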

  19. Batch Construction as Sparse Subset Approximation Frank-Wolfe optimization

  20. Batch Construction as Sparse Subset Approximation Frank-Wolfe optimization 1. Relax constraints

  21. Batch Construction as Sparse Subset Approximation Frank-Wolfe optimization 1. Relax constraints 2. Apply the Frank-Wolfe algorithm • Geometrically motivated convex optimization algorithm • Iteratively selects the vector most aligned with the residual error • Corresponds to adding at most one data point to the batch in every iteration

  22. Batch Construction as Sparse Subset Approximation Frank-Wolfe optimization 1. Relax constraints 2. Apply the Frank-Wolfe algorithm • Geometrically motivated convex optimization algorithm • Iteratively selects the vector most aligned with the residual error • Corresponds to adding at most one data point to the batch in every iteration 3. Project the continuous weights back to the feasible space (i.e. binarize them)

  23. Batch Construction as Sparse Subset Approximation Frank-Wolfe optimization 1. Relax constraints 2. Apply the Frank-Wolfe algorithm • Geometrically motivated convex optimization algorithm • Iteratively selects the vector most aligned with the residual error • Corresponds to adding at most one data point to the batch in every iteration 3. Project the continuous weights back to the feasible space (i.e. binarize them) Which norm is appropriate?
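
A minimal numpy sketch of steps 1-3, assuming each pool point has already been mapped to a finite-dimensional vector L_n (for instance via the random projection described on a later slide). The vertex and line-search formulas follow the standard Frank-Wolfe construction for Hilbert coresets and should be checked against the paper before use.

```python
import numpy as np

def frank_wolfe_batch(L_vecs, b):
    """Frank-Wolfe batch construction (sketch).

    L_vecs : (N, J) array; row n is a vector representation of the
             expected log-likelihood term L_n. The target is their sum.
    b      : query budget = number of Frank-Wolfe iterations.
    Returns the indices of the selected batch (binarized weights).
    """
    N = L_vecs.shape[0]
    sigmas = np.maximum(np.linalg.norm(L_vecs, axis=1), 1e-12)
    sigma = sigmas.sum()
    target = L_vecs.sum(axis=0)              # vector for the full pool
    w = np.zeros(N)                          # relaxed (continuous) weights

    for _ in range(b):                       # each pass adds at most 1 point
        approx = w @ L_vecs                  # current approximation L(w)
        residual = target - approx
        # pick the vector most aligned with the residual error
        f = int(np.argmax(L_vecs @ residual / sigmas))
        vertex = np.zeros(N)
        vertex[f] = sigma / sigmas[f]        # Frank-Wolfe vertex for point f
        d = vertex @ L_vecs - approx
        # closed-form line search along the chosen vertex direction
        gamma = np.clip(d @ residual / max(d @ d, 1e-12), 0.0, 1.0)
        w = (1.0 - gamma) * w + gamma * vertex

    return np.flatnonzero(w > 0)             # project weights back to {0, 1}
```

Because each iterate is a convex combination of the previous iterate and a single vertex, at most one new pool point enters the batch per iteration, so the final support has at most b points.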

  24. Batch Construction as Sparse Subset Approximation Choice of Inner Products Norm is induced by inner product, e.g. 1. Weighted Fisher inner product + Leads to simple, interpretable expressions for linear models -- Requires taking gradients w.r.t. parameters -- Scales quadratically with pool set size

  25. Batch Construction as Sparse Subset Approximation Choice of Inner Products Norm is induced by inner product, e.g. 1. Weighted Fisher inner product + Leads to simple, interpretable expressions for linear models -- Requires taking gradients w.r.t. parameters -- Scales quadratically with pool set size Example: Linear regression

  26. Batch Construction as Sparse Subset Approximation Choice of Inner Products Norm is induced by inner product, e.g. 1. Weighted Fisher inner product + Leads to simple, interpretable expressions for linear models -- Requires taking gradients w.r.t. parameters -- Scales quadratically with pool set size Example: Linear regression • Connections to BALD, leverage scores and influence functions • Probit regression also yields an interpretable closed-form solution
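
As a point of reference, the weighted Fisher inner product between two expected log-likelihood terms is usually defined as below (expectation under the current approximate posterior; this is a reconstruction from the Hilbert-coresets literature rather than a quote from the slide).

```latex
% Weighted Fisher inner product (expectation under the current posterior)
\[
\langle \mathcal{L}_n, \mathcal{L}_m \rangle_{\hat{\pi}, F}
\;=\;
\mathbb{E}_{\hat{\pi}(\theta)}\!\left[
  \nabla_{\theta} \mathcal{L}_n(\theta)^{\top}\,
  \nabla_{\theta} \mathcal{L}_m(\theta)
\right].
\]
```

Evaluating this for every pair of pool points is what makes the Fisher choice scale quadratically in the pool size.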

  27. Batch Construction as Sparse Subset Approximation Choice of Inner Products Norm is induced by inner product, e.g. 1. Weighted Fisher inner product + Leads to simple, interpretable expressions for linear models -- Requires taking gradients w.r.t. parameters -- Scales quadratically with pool set size 2. Weighted Euclidean inner product + Only requires tractable likelihood computations + Scalable to large pool set sizes (linearly) and complex, non-linear models through random projections -- No gradient information utilized

  28. Batch Construction as Sparse Subset Approximation Choice of Inner Products Norm is induced by inner product, e.g. 1. Weighted Fisher inner product + Leads to simple, interpretable expressions for linear models -- Requires taking gradients w.r.t. parameters -- Scales quadratically with pool set size 2. Weighted Euclidean inner product + Only requires tractable likelihood computations + Scalable to large pool set sizes (linearly) and complex, non-linear models through random projections -- No gradient information utilized J-dimensional random projection in Euclidean space
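
A hedged sketch of the J-dimensional random projection used with the weighted Euclidean inner product: each pool point is represented by its expected log-likelihood evaluated at J posterior samples, so inner products reduce to ordinary dot products. The helper `expected_loglik` is hypothetical and would marginalize the unknown label under the current predictive posterior.

```python
import numpy as np

def project_pool(pool_x, theta_samples, expected_loglik):
    """Monte-Carlo projection of pool points into a J-dimensional
    Euclidean space (sketch).

    theta_samples            : J parameter samples from the current
                               (approximate) posterior.
    expected_loglik(x, theta): expected log-likelihood of pool point x
                               under theta, with the unknown label
                               marginalized out (hypothetical helper).
    Returns an (N, J) matrix usable as `L_vecs` in the Frank-Wolfe sketch.
    """
    J = len(theta_samples)
    feats = np.array([[expected_loglik(x, theta) for theta in theta_samples]
                      for x in pool_x])
    return feats / np.sqrt(J)   # 1/sqrt(J) so dot products are MC averages
```

With this representation, the Hilbert-norm objective and the Frank-Wolfe updates above reduce to ordinary Euclidean geometry in R^J.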

  29. Experimental Setup Experiments (i) Does our approach avoid correlated queries? (closed form) (ii) Is our method competitive in the small-data regime? (closed form, projections) (iii) Does our method scale to large datasets and models?

  30. Experimental Setup Experiments (i) Does our approach avoid correlated queries? (closed form) (ii) Is our method competitive in the small-data regime? (closed form, projections) (iii) Does our method scale to large datasets and models? Model: Neural Linear [Diagram: deterministic feature extractor (e.g. ConvNet) followed by a stochastic fully connected layer; exact inference for regression, mean-field VI for classification]
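
A minimal numpy sketch of the regression case of the neural linear model: features from a deterministic extractor (not shown; e.g. a ConvNet) feed a Bayesian linear output layer whose posterior is available in closed form. The prior and noise variances are illustrative assumptions, and the classification case on the slide instead uses mean-field VI, which this sketch does not cover.

```python
import numpy as np

def neural_linear_posterior(features, y, noise_var=0.1, prior_var=1.0):
    """Closed-form posterior over the weights of the Bayesian linear
    output layer, given features phi(x_n) from the deterministic
    feature extractor and regression targets y (sketch)."""
    D = features.shape[1]
    precision = features.T @ features / noise_var + np.eye(D) / prior_var
    cov = np.linalg.inv(precision)               # posterior covariance
    mean = cov @ (features.T @ y) / noise_var    # posterior mean
    return mean, cov
```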

  31. Experiments: Probit Regression Does our approach avoid correlated queries? [Figure: queries selected by BALD vs. ACS-FW]

  32. Experiments: Probit Regression Does our approach avoid correlated queries? [Figure: BALD shows no change after a point is selected]

  33. Experiments: Probit Regression Does our approach avoid correlated queries? [Figure: BALD shows no change; ACS-FW rotates in data space]

  34. Experiments: Probit Regression Does our approach avoid correlated queries? [Figure: BALD vs. ACS-FW; ACS-FW rotates again after the next selection]

  35. Experiments: Probit Regression Does our approach avoid correlated queries? [Figure: BALD vs. ACS-FW] ACS-FW queries a diverse batch of points

  36. Experiments: Regression Is our method competitive in the small-data regime?

  37. Experiments: Regression Is our method competitive in the small-data regime? Competitive on small data; even more beneficial for larger N

  38. Experiments: Regression Does our method scale to large datasets and models?
