Scalable Posterior Approximation
Galen Reeves
Departments of ECE and Statistical Science, Duke University
August 2015
Collaborators at Duke: David B. Dunson, Willem van den Boom
variable selection / support recovery
• identify the locations / identities of agents which have significant effects on observed behaviors
‣ e.g., gene expression, face recognition, etc.
• find relevant features for building a model
‣ machine learning
• recover a sparse signal from noisy linear measurements
‣ compressed sensing
• determine which entries of an unknown parameter vector are significant
‣ statistics
high-dimensional inference
[figure: bipartite graph of p unknown parameters β_1, …, β_p and n observations y_1, …, y_n (the data)]
Types of questions:
• posterior distribution p(β | y): a high-dimensional distribution
• posterior mean and covariance: E[β | y] (p x 1 vector) and Cov[β | y] (p x p matrix)
• posterior marginal distribution p(β_1 | y): a one-dimensional distribution
edges mean dependencies
[figure: the same bipartite graph; an edge between a parameter β_j and an observation y_i indicates a dependence]
inference is easy if graph is sparse…
[figure: the same graph with few edges]
… but dense graphs are challenging
[figure: the same graph with many edges]
statistical model for parameters
entries of β are conditionally independent given hyperparameters θ:
p(β | θ) = ∏_{j=1}^{p} p(β_j | θ)
the marginal prior distribution p(β_j | θ) is a mixed discrete-continuous distribution: a probability mass at zero plus a prior distribution for the nonzero values
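A minimal sketch of drawing from such a prior, using a Bernoulli-Gaussian spike & slab; the inclusion probability pi and slab scale tau are illustrative values, not parameters from the talk.

```python
import numpy as np

def sample_spike_slab(p, pi=0.1, tau=1.0, rng=None):
    """Draw beta with independent spike & slab (Bernoulli-Gaussian) entries:
    each beta_j equals 0 with probability 1 - pi and is N(0, tau^2) otherwise."""
    rng = np.random.default_rng(rng)
    included = rng.random(p) < pi          # which entries fall in the slab
    slab = rng.normal(0.0, tau, size=p)    # Gaussian slab values
    return np.where(included, slab, 0.0)

beta = sample_spike_slab(p=1000, pi=0.05, tau=2.0, rng=0)
print((beta != 0).sum(), "nonzero entries out of", beta.size)
```

The draw makes the mixed discrete-continuous nature tangible: most coordinates sit exactly at zero (the discrete spike) while the remainder follow the continuous slab.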
standard linear model
y = Xβ + ε
where X is an n x p matrix, β is the vector of p unknown parameters, y is the vector of n observations, and ε ∼ N(0, σ²I) is Gaussian error
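A small simulation sketch of this model; the iid Gaussian design, the column scaling, and the parameter values are assumptions for illustration, not choices from the talk.

```python
import numpy as np

def simulate_linear_model(n, p, pi=0.05, tau=2.0, sigma=0.5, rng=None):
    """Simulate y = X beta + eps with spike & slab beta, an iid Gaussian design,
    and Gaussian errors eps ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng(rng)
    beta = np.where(rng.random(p) < pi, rng.normal(0.0, tau, size=p), 0.0)
    X = rng.normal(size=(n, p)) / np.sqrt(n)        # roughly unit-norm columns (a common convention)
    y = X @ beta + rng.normal(0.0, sigma, size=n)   # Gaussian observation noise
    return X, y, beta

X, y, beta_true = simulate_linear_model(n=200, p=500, rng=1)
```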
why challenging?
• number of feature subsets grows exponentially with p
‣ curse of dimensionality (see the worked example below)
• exact inference requires computing high-dimensional integrals
• brute-force integration is computationally infeasible
• extensive research focuses on methods for approximate inference
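As a quick worked illustration of the first bullet (p = 100 is chosen only for illustration): the number of candidate support sets is 2^p, and already 2^100 ≈ 1.27 × 10^30, so any procedure that enumerates subsets is restricted to very small p.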
tradeoffs for high-dimensional inference
[figure: methods placed on a scalability vs. accuracy plane: linear methods (least-squares), LASSO, AMP, MCMC, BCR, YFA, MCMC (unbounded time), and brute-force numerical integration; the region combining scalability and accuracy is marked as the focus of recent research]
problems with existing methods
• regularized least-squares (e.g. LASSO)
‣ lack measures of statistical significance
• sampling methods like MCMC
‣ not clear when sufficiently converged / sampled
• variational approximations
‣ difficulty with multimodal posteriors, hard to interpret
Example: one-dimensional problem with spike & slab (Bernoulli-Gaussian) prior
y = β + ε
[figure: the prior distribution with its probability mass at zero; the posterior distribution after a large observation and the posterior distribution after a small observation, each annotated with its probability mass at zero]
Example: one-dimensional problem with spike & slab (Bernoulli-Gaussian) prior
y = β + ε
[figure: Gaussian approximations overlaid on the prior distribution and on the posterior distributions for the large and small observations]
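For this one-dimensional model the posterior is available in closed form, which makes the point of the figure concrete. The sketch below assumes a Bernoulli-Gaussian prior with illustrative values of the inclusion probability pi, slab scale tau, and noise level sigma (none taken from the talk): a small observation keeps substantial mass at zero, a large observation pushes nearly all mass into the slab, and neither posterior is well described by a single Gaussian.

```python
import numpy as np
from scipy.stats import norm

def spike_slab_posterior_1d(y, pi=0.5, tau=1.0, sigma=1.0):
    """Exact posterior for y = beta + eps, eps ~ N(0, sigma^2),
    beta ~ (1 - pi) * delta_0 + pi * N(0, tau^2).
    Returns P(beta != 0 | y) and the mean/variance of the slab component."""
    m_spike = norm.pdf(y, loc=0.0, scale=sigma)                      # marginal of y if beta = 0
    m_slab = norm.pdf(y, loc=0.0, scale=np.sqrt(sigma**2 + tau**2))  # marginal of y if beta in slab
    p_incl = pi * m_slab / (pi * m_slab + (1 - pi) * m_spike)
    post_var = 1.0 / (1.0 / tau**2 + 1.0 / sigma**2)   # conjugate Gaussian update
    post_mean = post_var * y / sigma**2
    return p_incl, post_mean, post_var

for y_obs in (0.2, 3.0):   # small vs. large observation
    p_incl, m, v = spike_slab_posterior_1d(y_obs, pi=0.5, tau=2.0, sigma=1.0)
    print(f"y = {y_obs}: P(beta != 0 | y) = {p_incl:.3f}, slab mean = {m:.3f}")
```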
problems with existing methods
• regularized least-squares (e.g. LASSO)
‣ lack measures of statistical significance
• sampling methods like MCMC
‣ not clear when sufficiently converged / sampled
• variational approximations
‣ difficulty with multimodal posteriors, hard to interpret
• loopy belief propagation, approximate message passing (AMP)
‣ lack theoretical guarantees for general matrices
high-dimensional variable selection
y = Xβ + ε
where X is an n x p matrix, ε ∼ N(0, σ²I) is Gaussian error, y is the vector of n observations, and the p unknown parameters are drawn independently with a known distribution (e.g. spike & slab)
Goal: compute the posterior marginal distribution of the first entry,
p(β_1 | y) = ∫ p(β | y) dβ_2 ⋯ dβ_p
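For very small p this marginal (and the posterior inclusion probability) can be computed exactly by enumerating support patterns, which is how the ground truth on the later results slide (p = 12) is obtained. Below is a brute-force sketch under a Bernoulli-Gaussian prior; the function name and the parameters pi, tau, sigma are illustrative assumptions, not the talk's implementation.

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

def exact_inclusion_probabilities(X, y, pi=0.1, tau=1.0, sigma=1.0):
    """Posterior inclusion probabilities P(beta_j != 0 | y) by enumerating all
    2^p support patterns under a Bernoulli-Gaussian prior; feasible only for small p."""
    n, p = X.shape
    log_weights, supports = [], []
    for pattern in itertools.product([0, 1], repeat=p):
        s = np.array(pattern, dtype=bool)
        # marginal likelihood of y given the support: N(0, sigma^2 I + tau^2 X_S X_S^T)
        cov = sigma**2 * np.eye(n) + tau**2 * X[:, s] @ X[:, s].T
        log_lik = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)
        log_prior = s.sum() * np.log(pi) + (p - s.sum()) * np.log(1 - pi)
        log_weights.append(log_lik + log_prior)
        supports.append(s)
    log_weights = np.array(log_weights)
    weights = np.exp(log_weights - log_weights.max())
    weights /= weights.sum()
    return weights @ np.array(supports, dtype=float)   # one probability per coordinate
```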
overview of our approach
• rotate the data to isolate the parameter of interest
• introduce an auxiliary variable which summarizes the influence of the other parameters
• use any means possible to compute / estimate the posterior mean and posterior variance of the auxiliary variable
• apply a Gaussian approximation to the auxiliary variable and solve a one-dimensional integration problem to obtain the posterior approximation
step 1: reparameterize
• Apply a rotation matrix to the data which zeros out all but one entry in the first column of the design matrix (one construction is sketched below):
ỹ = X̃β + ε̃, where
X̃ = [ x̃_{1,1}  x̃_{1,2}  ⋯  x̃_{1,p} ]
    [   0      x̃_{2,2}  ⋯  x̃_{2,p} ]
    [   ⋮        ⋮       ⋱     ⋮    ]
    [   0      x̃_{n,2}  ⋯  x̃_{n,p} ]
• Only the first rotated observation depends on the first entry:
ỹ_1 = x̃_{1,1} β_1 + ∑_{j=2}^{p} x̃_{1,j} β_j + ε̃_1
where the auxiliary variable φ(β_2^p) captures the influence of the other parameters
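The talk does not prescribe a particular construction of the rotation; a Householder reflection is one standard way to zero out every entry of the first column except the first, and because it is orthogonal the rotated errors remain N(0, σ²I). A sketch (function name and demo values are hypothetical):

```python
import numpy as np

def rotate_to_isolate(X, y, j=0):
    """Orthogonal (Householder) rotation Q such that column j of Q @ X has a
    single nonzero entry, in the first row. Returns the rotated design and data."""
    x = X[:, j].astype(float)
    n = x.size
    v = x.copy()
    v[0] += np.sign(x[0] if x[0] != 0 else 1.0) * np.linalg.norm(x)
    v /= np.linalg.norm(v)
    Q = np.eye(n) - 2.0 * np.outer(v, v)   # Householder reflection, Q orthogonal
    return Q @ X, Q @ y

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = rng.normal(size=6)
X_tilde, y_tilde = rotate_to_isolate(X, y, j=0)
print(np.round(X_tilde[:, 0], 8))   # only the first entry is nonzero
```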
step 1: reparameterize
[figure animation: after the rotation, β_1 is connected only to ỹ_1; the remaining parameters β_2, …, β_p are connected to all of ỹ_1, …, ỹ_n]
φ(β_2^p) = ∑_{j=2}^{p} x̃_{1,j} β_j
the auxiliary variable encapsulates the influence of the other parameters
step 1: reparameterize
conditioning on the auxiliary variable gives
p(β_1, ỹ_1 | ỹ_2^p) = ∫ p(β_1, ỹ_1 | φ(β_2^p)) p(φ(β_2^p) | ỹ_2^p) dφ(β_2^p)
step 2: estimate / compute
• compute the posterior mean and variance of the auxiliary variable:
E[φ(β_2^p) | ỹ_2^p] and Var[φ(β_2^p) | ỹ_2^p]
• can use a variety of methods (one stand-in is sketched below)
‣ AMP (if iterations converge)
‣ LASSO
‣ Bayesian Compressed Regression (BCR)
‣ [your favorite method]
• these quantities are independent of the target parameter!
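A sketch of this step using one possible stand-in for "[your favorite method]": a conjugate Gaussian (ridge-type) working prior on the remaining coefficients, for which the posterior mean and covariance of the auxiliary variable are available in closed form. AMP, LASSO, or BCR could be substituted; the prior scale tau and noise level sigma are illustrative assumptions.

```python
import numpy as np

def auxiliary_mean_var(X_tilde, y_tilde, sigma=1.0, tau=1.0):
    """Estimate the posterior mean and variance of phi = sum_{j>=2} x_tilde[0, j] * beta_j
    given the remaining rotated observations, using a Gaussian working prior
    beta_j ~ N(0, tau^2) on the other coefficients."""
    A = X_tilde[1:, 1:]     # remaining rotated rows, columns 2..p (beta_1 drops out)
    b = y_tilde[1:]
    w = X_tilde[0, 1:]      # weights defining the auxiliary variable
    cov = np.linalg.inv(A.T @ A / sigma**2 + np.eye(A.shape[1]) / tau**2)
    mean = cov @ A.T @ b / sigma**2
    return w @ mean, w @ cov @ w   # E[phi | rest], Var[phi | rest]
```

As the slide notes, these two numbers are all that the next step needs, so any estimator that supplies them can be plugged in here.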
step 3: approximate
• apply a Gaussian approximation to the auxiliary variable to compute the posterior approximation:
p(β_1 | y) ∝ p(β_1) ∫ p(ỹ_1 | φ(β_2^p), β_1) p(φ(β_2^p) | ỹ_2^p) dφ(β_2^p)
where p(β_1) is the prior distribution, p(ỹ_1 | φ(β_2^p), β_1) is Gaussian by the assumption on the noise, and p(φ(β_2^p) | ỹ_2^p) is replaced with a Gaussian using the mean and variance from the previous step
• the approximation can be accurate even if the prior and posterior are highly non-Gaussian
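With the Gaussian approximation N(m, v) for the auxiliary variable, the integral above collapses to a one-dimensional Gaussian convolution, and for a Bernoulli-Gaussian prior on β_1 the resulting marginal is closed form. A sketch (pi, tau, sigma are assumed prior/noise parameters, not values from the talk); chained with the rotation and step-2 sketches above, it produces the approximate posterior marginal of β_1.

```python
import numpy as np
from scipy.stats import norm

def posterior_beta1(y1_tilde, x11_tilde, phi_mean, phi_var, pi=0.1, tau=1.0, sigma=1.0):
    """Approximate p(beta_1 | y) for a Bernoulli-Gaussian prior. With phi
    approximated by N(phi_mean, phi_var), the rotated first observation obeys
    y1_tilde | beta_1 ~ N(x11_tilde * beta_1 + phi_mean, sigma^2 + phi_var)."""
    resid = y1_tilde - phi_mean
    var_spike = sigma**2 + phi_var                    # beta_1 = 0
    var_slab = var_spike + (x11_tilde * tau)**2       # beta_1 ~ N(0, tau^2)
    m_spike = norm.pdf(resid, scale=np.sqrt(var_spike))
    m_slab = norm.pdf(resid, scale=np.sqrt(var_slab))
    p_incl = pi * m_slab / (pi * m_slab + (1 - pi) * m_spike)
    # slab component of the posterior on beta_1 (conjugate Gaussian update)
    post_var = 1.0 / (1.0 / tau**2 + x11_tilde**2 / var_spike)
    post_mean = post_var * x11_tilde * resid / var_spike
    return p_incl, post_mean, post_var
```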
advantages of our framework
• does not apply Gaussian approximation directly to posterior
• has precise theoretical guarantees under the same assumptions as AMP
• can leverage other methods (e.g. LASSO) to produce accurate approximations in settings where AMP fails
results: accuracy
posterior inclusion probabilities P(β_1 ≠ 0 | y)
for a small problem (p = 12), the MSE with respect to the true posterior inclusion probability can be computed
[figure: MSE (0.00 to 0.06) versus correlation between columns of the matrix (0.0 to 0.8), comparing approximate message passing (AMP) and Bayesian compressed regression (BCR)]
results: accuracy
posterior inclusion probabilities P(β_1 ≠ 0 | y)
for large problems, the ground truth is intractable, so methods are compared using empirical ROC curves
[figure: two ROC panels (true positive rate versus false positive rate) comparing AMP and LASSO: one for a matrix with iid entries and incoherent columns, one for a matrix with correlated columns]
further directions
the framework extends to more general models:
p(β, y) = ∫ p(β | θ) p(y | β, θ) dθ
where the parameters are conditionally independent given θ,
p(β | θ) = ∏_{j=1}^{p} p(β_j | θ),
and the observations are conditionally Gaussian given (β, θ)