  1. Scalable Posterior Approximation. Galen Reeves, Departments of ECE and Statistical Science, Duke University. August 2015.

  2. Collaborators at Duke: David B. Dunson, Willem van den Boom.

  3. Variable selection / support recovery
  • identify the locations / identities of agents that have significant effects on observed behaviors
    ‣ e.g., gene expression, face recognition, etc.
  • find relevant features for building a model
    ‣ machine learning
  • recover a sparse signal from noisy linear measurements
    ‣ compressed sensing
  • determine which entries of an unknown parameter vector are significant
    ‣ statistics

  4. High-dimensional inference: p unknown parameters β_1, …, β_p and n observations y_1, …, y_n (the data). [Diagram: bipartite graph with the parameters on one side and the observations on the other.]

  5. High-dimensional inference: p unknown parameters, n observations (the data). Types of questions:
  • the posterior distribution p(β | y), a high-dimensional distribution
  • the posterior mean E[β | y] (a p × 1 vector) and covariance Cov[β | y] (a p × p matrix)
  • the posterior marginal distribution p(β_1 | y), a one-dimensional distribution

  6. Edges mean dependencies. [Diagram: graph linking the parameters β_1, …, β_p to the observations y_1, …, y_n.]

  7. Inference is easy if the graph is sparse… [Diagram: sparse bipartite graph between parameters and observations.]

  8. … but dense graphs are challenging. [Diagram: dense bipartite graph between parameters and observations.]

  9. Statistical model for parameters: the entries of β are conditionally independent given hyperparameters θ,
       p(β | θ) = ∏_{j=1}^{p} p(β_j | θ),
  and the marginal prior p(β_j | θ) is a mixed discrete-continuous distribution: a probability mass at zero plus a prior distribution for the nonzero values.
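As a concrete illustration (not from the slides), here is a minimal sketch of drawing from such a mixed discrete-continuous prior, assuming a Bernoulli-Gaussian (spike & slab) form with hypothetical hyperparameters: an inclusion probability pi and a slab standard deviation tau.

```python
import numpy as np

def sample_spike_slab(p, pi=0.1, tau=1.0, rng=None):
    """Draw beta with independent Bernoulli-Gaussian (spike & slab) entries:
    beta_j = 0 with probability 1 - pi, otherwise beta_j ~ N(0, tau^2)."""
    rng = np.random.default_rng(rng)
    nonzero = rng.random(p) < pi                      # which entries are "active"
    beta = np.zeros(p)
    beta[nonzero] = tau * rng.standard_normal(nonzero.sum())
    return beta
```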

  10. Standard linear model:
       y = Xβ + ε,
  with X an n × p matrix, β the p unknown parameters, y the n observations (the data), and Gaussian errors ε ~ N(0, σ²I).
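A minimal sketch of simulating data from this model, with the spike & slab draw from the previous slide inlined so the snippet stands alone; the dimensions, sparsity level, and noise scale are placeholders rather than values from the talk.

```python
import numpy as np

def simulate_linear_model(n=100, p=500, pi=0.1, tau=1.0, sigma=1.0, rng=None):
    """Generate (y, X, beta) from y = X beta + eps with eps ~ N(0, sigma^2 I)
    and beta drawn i.i.d. from a Bernoulli-Gaussian prior (illustrative values)."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, p))                   # illustrative i.i.d. design
    beta = np.where(rng.random(p) < pi,               # spike & slab draw
                    tau * rng.standard_normal(p), 0.0)
    y = X @ beta + sigma * rng.standard_normal(n)
    return y, X, beta
```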

  11. Why challenging?
  • the number of feature subsets grows exponentially with p
    ‣ curse of dimensionality
  • exact inference requires computing high-dimensional integrals
  • brute-force integration is computationally infeasible
  • extensive research focuses on methods for approximate inference

  12. Tradeoffs for high-dimensional inference. [Diagram: methods arranged by scalability vs. accuracy, ranging from linear methods (least-squares), LASSO, AMP, and BCR to MCMC (unbounded time) and brute-force numerical integration; the focus of recent research is on methods that are both scalable and accurate.]

  13–16. Problems with existing methods (built up over several slides)
  • regularized least-squares (e.g. LASSO)
    ‣ lack measures of statistical significance
  • sampling methods like MCMC
    ‣ not clear when sufficiently converged / sampled
  • variational approximations
    ‣ difficulty with multimodal posteriors, hard to interpret

  17–19. Example: a one-dimensional problem with a spike & slab (Bernoulli-Gaussian) prior,
       y = β + ε.
  [Plots: the prior distribution has a probability mass at zero; the posterior distributions after a large observation and after a small observation each combine a probability mass at zero with a continuous component.]

  20. Example, continued: Gaussian approximation. [Plots: a Gaussian approximation overlaid on the prior distribution and on the posterior distributions for the large and small observations.]
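For this one-dimensional problem the posterior is available in closed form, which is what makes the comparison with a Gaussian approximation possible. A sketch of the exact computation, assuming a Bernoulli-Gaussian prior with hypothetical parameters pi, tau and noise level sigma:

```python
import numpy as np
from scipy import stats

def posterior_1d(y, pi=0.1, tau=1.0, sigma=1.0):
    """Exact posterior for y = beta + eps, eps ~ N(0, sigma^2), under the prior
    beta = 0 w.p. 1 - pi and beta ~ N(0, tau^2) w.p. pi.  The posterior is again
    a point mass at zero plus a Gaussian; returns the inclusion probability and
    the Gaussian component's mean and variance."""
    m0 = stats.norm.pdf(y, loc=0.0, scale=sigma)                        # spike evidence
    m1 = stats.norm.pdf(y, loc=0.0, scale=np.sqrt(sigma**2 + tau**2))   # slab evidence
    incl_prob = pi * m1 / (pi * m1 + (1 - pi) * m0)
    post_var = 1.0 / (1.0 / tau**2 + 1.0 / sigma**2)
    post_mean = post_var * y / sigma**2
    return incl_prob, post_mean, post_var
```

A single Gaussian matched to this posterior has no way to represent the point mass at zero, which appears to be the difficulty these slides illustrate.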

  21–22. Problems with existing methods (recap)
  • regularized least-squares (e.g. LASSO)
    ‣ lack measures of statistical significance
  • sampling methods like MCMC
    ‣ not clear when sufficiently converged / sampled
  • variational approximations
    ‣ difficulty with multimodal posteriors, hard to interpret
  • loopy belief propagation, approximate message passing (AMP)
    ‣ lack theoretical guarantees for general matrices

  23–24. High-dimensional variable selection:
       y = Xβ + ε,   ε ~ N(0, σ²I),
  with X an n × p matrix, Gaussian errors, n observations, and p unknown parameters drawn independently from a known distribution (e.g. spike & slab).
  Goal: compute the posterior marginal distribution of the first entry,
       p(β_1 | y) = ∫ p(β | y) dβ_2 ⋯ dβ_p.

  25. Overview of our approach
  • rotate the data to isolate the parameter of interest
  • introduce an auxiliary variable which summarizes the influence of the other parameters
  • use any means possible to compute / estimate the posterior mean and posterior variance of the auxiliary variable
  • apply a Gaussian approximation to the auxiliary variable and solve a one-dimensional integration problem to obtain the posterior approximation

  26. Step 1: reparameterize
  • Apply a rotation matrix to the data which zeros out all but one entry in the first column of the design matrix:
       ỹ = X̃β + ε̃,  where the first column of X̃ is (x̃_{1,1}, 0, …, 0)ᵀ:
       X̃ = ⎡ x̃_{1,1}  x̃_{1,2}  ⋯  x̃_{1,p} ⎤
           ⎢    0     x̃_{2,2}  ⋯  x̃_{2,p} ⎥
           ⎢    ⋮        ⋮            ⋮    ⎥
           ⎣    0     x̃_{n,2}  ⋯  x̃_{n,p} ⎦
  • Only the first rotated observation depends on the first entry:
       ỹ_1 = x̃_{1,1} β_1 + Σ_{j=2}^{p} x̃_{1,j} β_j + ε̃_1
  • The auxiliary variable φ(β_2^p) = Σ_{j=2}^{p} x̃_{1,j} β_j captures the influence of the other parameters.
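The slides do not spell out how the rotation is constructed; one standard choice is a Householder reflection, which is orthogonal, so the transformed noise is still N(0, σ²I). A sketch under that assumption:

```python
import numpy as np

def rotate_first_column(X, y):
    """Return (X_tilde, y_tilde) = (Q X, Q y) for an orthogonal Q chosen so that
    the first column of Q X has a single nonzero entry (its first one).
    Q here is a Householder reflector; any orthogonal transform with this
    property would do, and orthogonality keeps the noise N(0, sigma^2 I)."""
    x1 = X[:, 0].astype(float)
    e1 = np.zeros_like(x1)
    e1[0] = 1.0
    v = x1 - np.linalg.norm(x1) * e1                 # Householder vector
    if np.linalg.norm(v) < 1e-12:                    # first column already aligned with e_1
        return X.copy(), y.copy()
    v /= np.linalg.norm(v)
    Q = np.eye(len(x1)) - 2.0 * np.outer(v, v)       # orthogonal reflector
    return Q @ X, Q @ y
```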

  27–31. Step 1: reparameterize (illustration, built up over several slides). [Diagrams: the graph between the parameters β_1, …, β_p and the observations is redrawn after the rotation; β_1 is connected only to the first rotated observation ỹ_1, while β_2, …, β_p remain connected to all of ỹ_1, …, ỹ_n.]

  32. Step 1: reparameterize. The auxiliary variable
       φ(β_2^p) = Σ_{j=2}^{p} x̃_{1,j} β_j
  encapsulates the influence of the other parameters. [Diagram: φ(β_2^p) placed between β_2, …, β_p and the rotated observation ỹ_1.]

  33. Step 1: reparameterize. Marginalizing over the auxiliary variable,
       p(β_1, ỹ_1 | ỹ_2^n) = ∫ p(β_1, ỹ_1 | φ(β_2^p)) p(φ(β_2^p) | ỹ_2^n) dφ(β_2^p),
  where ỹ_2^n denotes the remaining observations ỹ_2, …, ỹ_n and φ(β_2^p) = Σ_{j=2}^{p} x̃_{1,j} β_j encapsulates the influence of the other parameters.

  34. Step 2: estimate / compute
  • compute the posterior mean and variance of the auxiliary variable,
       E[φ(β_2^p) | ỹ_2^n]  and  Var[φ(β_2^p) | ỹ_2^n]
  • can use a variety of methods
    ‣ AMP (if the iterations converge)
    ‣ LASSO
    ‣ Bayesian compressed regression (BCR)
    ‣ [your favorite method]
  • these quantities do not depend on the target parameter!
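A rough sketch of one way this step could be carried out with LASSO; this is not the specific estimator from the talk. It fits the rotated rows that do not involve β_1 and uses a bootstrap over the fit as a crude stand-in for the posterior mean and variance of the auxiliary variable; the regularization level and bootstrap size are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def estimate_phi_moments(X_tilde, y_tilde, n_boot=200, alpha=0.1, rng=None):
    """Approximate the mean and variance of phi = sum_{j>=2} x_tilde[0, j] * beta_j
    given the rotated observations y_tilde[1:], by refitting a LASSO on bootstrap
    resamples of the rows that do not involve beta_1."""
    rng = np.random.default_rng(rng)
    A, b = X_tilde[1:, 1:], y_tilde[1:]              # rows 2..n, columns 2..p
    w = X_tilde[0, 1:]                               # weights defining phi
    phis = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(b), size=len(b))   # bootstrap resample of rows
        fit = Lasso(alpha=alpha, max_iter=5000).fit(A[idx], b[idx])
        phis.append(w @ fit.coef_)
    phis = np.asarray(phis)
    return phis.mean(), phis.var()
```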

  35. Step 3: approximate
  • apply a Gaussian approximation to the auxiliary variable to compute the posterior approximation:
       p(β_1 | y) ∝ p(β_1) ∫ p(ỹ_1 | φ(β_2^p), β_1) p(φ(β_2^p) | ỹ_2^n) dφ(β_2^p),
    where p(β_1) is the prior distribution, p(ỹ_1 | φ(β_2^p), β_1) is Gaussian by the assumption on the noise, and p(φ(β_2^p) | ỹ_2^n) is replaced by a Gaussian with the mean and variance from the previous step
  • the approximation can be accurate even if the prior and posterior are highly non-Gaussian
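Once the auxiliary variable is replaced by a Gaussian N(m, v), the remaining one-dimensional integral is analytic when the prior on β_1 is Bernoulli-Gaussian, so the approximate posterior inclusion probability has a closed form. A sketch under those assumptions (pi, tau, and sigma are hypothetical prior and noise parameters):

```python
import numpy as np
from scipy import stats

def approx_inclusion_prob(y1_tilde, x11_tilde, phi_mean, phi_var,
                          pi=0.1, tau=1.0, sigma=1.0):
    """Approximate p(beta_1 != 0 | y) after replacing the auxiliary variable by
    N(phi_mean, phi_var).  Under the spike (beta_1 = 0) the rotated observation
    is N(phi_mean, sigma^2 + phi_var); under the slab (beta_1 ~ N(0, tau^2)) it
    is N(phi_mean, x11^2 tau^2 + sigma^2 + phi_var)."""
    s0 = np.sqrt(sigma**2 + phi_var)
    s1 = np.sqrt(x11_tilde**2 * tau**2 + sigma**2 + phi_var)
    m0 = stats.norm.pdf(y1_tilde, loc=phi_mean, scale=s0)
    m1 = stats.norm.pdf(y1_tilde, loc=phi_mean, scale=s1)
    return pi * m1 / (pi * m1 + (1 - pi) * m0)
```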

  36. Advantages of our framework
  • does not apply a Gaussian approximation directly to the posterior
  • has precise theoretical guarantees under the same assumptions as AMP
  • can leverage other methods (e.g. LASSO) to produce accurate approximations in settings where AMP fails

  37. Results: accuracy. Posterior inclusion probabilities p(β_1 ≠ 0 | y) for a small problem (p = 12), where the MSE with respect to the true posterior inclusion probability can be computed. [Plot: MSE (0.00 to 0.06) vs. correlation between the columns of the matrix (0.0 to 0.8), comparing approximate message passing (AMP) and Bayesian compressed regression (BCR).]

  38. Results: accuracy. Posterior inclusion probabilities p(β_1 ≠ 0 | y) for large problems, where the ground truth is intractable, so methods are compared using empirical ROC curves. [Plots: true positive rate vs. false positive rate for AMP and LASSO, for a matrix with incoherent columns (i.i.d. entries) and for a matrix with correlated columns.]

  39. Further directions: the framework extends to more general models,
       p(β, y) = ∫ p(β | θ) p(y | β, θ) dθ,
  where the entries of β are conditionally independent given θ,
       p(β | θ) = ∏_{j=1}^{p} p(β_j | θ),
  and y is conditionally Gaussian given (β, θ).
