SLIDE 1

Bayesian Methods for Variable Selection with Applications to High-Dimensional Data

Part 3: Variable Selection for Mixture Models

Marina Vannucci

Rice University, USA

ABS13-Italy 06/17-21/2013

SLIDE 2

Part 3: Variable Selection for Mixture Models

  • Finite mixture models for sample clustering
  • Variable selection
  • Simulated data
  • Supervised case (discriminant analysis)
  • Applications to genomic data
  • Case study in imaging genetics

SLIDE 3

So far we have focused our attention on linear settings. Now: mixture models, which characterize data arising from a mixture of subpopulations. Mixture models are widely used in classification, clustering, and density estimation. Simple example:

xi ∼ w1 N(µ1, Σ1) + w2 N(µ2, Σ2),

so that any sample comes from one of two distributions:

xi ∼ N(µ1, Σ1) with probability w1
xi ∼ N(µ2, Σ2) with probability w2

We address the case of many variables.
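A minimal sampling sketch of this two-component example (the parameter values below are illustrative, not from the slides):

```python
# Draw from a two-component Gaussian mixture: pick a component with
# probability (w1, w2), then draw from the corresponding normal.
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.4, 0.6])                          # mixture weights w1, w2
mus = [np.zeros(2), np.array([3.0, 3.0])]         # component means
Sigmas = [np.eye(2), 0.5 * np.eye(2)]             # component covariances

def draw_mixture(n):
    comps = rng.choice(2, size=n, p=w)            # latent component labels
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in comps])
    return X, comps

X, labels = draw_mixture(100)                     # 100 samples in 2 dimensions
```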

SLIDE 4

Objective

Simultaneous variable selection and sample clustering. The cluster structure of the samples is confined to a small subset of variables; noisy variables mask the recovery of the clusters. Proposed methodology:

  • Use a multivariate normal mixture model with an unknown number of components to determine the cluster structure of the samples.
  • Use stochastic search techniques to examine the space of variable subsets and identify the most probable models.
  • Also, infinite mixture models via Dirichlet process priors.

Genomic data: identify disease subtypes and select the discriminating genes.

SLIDE 5

Finite Mixture Models

In the case of G components,

xi | w, θ ∼iid ∑_{k=1}^G wk f(xi | θk)
w ∼ Dir(α, G)
θk ∼ πk(θk)
G ∼ π(G)?

where the mixture weights follow a Dirichlet distribution with parameter α. We will consider f(xi | θk) multivariate normal with θk = (µk, Σk).

SLIDE 6

An alternative specification (from a missing-data perspective) uses latent variables y = (y1, . . . , yn)′, where yi = k if the ith observation comes from cluster k:

(xi | yi = k, w, θ) ∼ f(xi | θk), p(yi = k) = wk
w ∼ Dir(α, G), θk ∼ πk(θk), G ∼ π(G)?

This facilitates inference via the Gibbs sampler (McLachlan and Basford, 1988).

SLIDE 7

Posterior inference for fixed G

Gibbs sampling proceeds at each iteration, given G, by drawing from the full conditionals:

P(yi = k | ·) ∝ wk f(xi | θk)
w | · ∼ Dir(α1 + n1, . . . , αG + nG)
p(θk | ·) ∝ πk(θk) ∏_i f(xi | θk)^{yik}

where each right-hand side is evaluated at the current values of the other parameters, nk = #{i : yi = k}, and yik = 1 if yi = k and 0 otherwise.
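As an illustration of these conditionals, here is a sketch of one Gibbs sweep for a deliberately simplified case: spherical components with Σk = I fixed and a flat prior on the means, so the θ-update reduces to a normal draw centered at the cluster mean. The actual conjugate updates used in practice are richer than this.

```python
# One Gibbs sweep under the simplifying assumptions stated above.
import numpy as np
from scipy.stats import multivariate_normal, dirichlet

def gibbs_sweep(X, y, w, mus, alpha, rng):
    """X: n x p data, y: int labels, w: weights, mus: list of means,
    alpha: length-G array of Dirichlet parameters."""
    n, p = X.shape
    G = len(mus)
    # (1) y_i | .  proportional to  w_k f(x_i | theta_k)
    for i in range(n):
        logp = np.array([np.log(w[k]) +
                         multivariate_normal.logpdf(X[i], mus[k], np.eye(p))
                         for k in range(G)])
        probs = np.exp(logp - logp.max())
        y[i] = rng.choice(G, p=probs / probs.sum())
    # (2) w | .  ~  Dir(alpha_1 + n_1, ..., alpha_G + n_G)
    counts = np.bincount(y, minlength=G)
    w[:] = dirichlet.rvs(alpha + counts, random_state=rng)[0]
    # (3) mu_k | .  ~  N(xbar_k, I / n_k) under a flat prior; empty clusters kept
    for k in range(G):
        if counts[k] > 0:
            xbar = X[y == k].mean(axis=0)
            mus[k] = rng.multivariate_normal(xbar, np.eye(p) / counts[k])
    return y, w, mus
```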

SLIDE 8

Reversible Jump MCMC

What if G is unknown? Treat G as an unknown parameter; the model dimension changes. Use RJMCMC (Green, 1995, Biometrika), which allows moves between parameter spaces of different dimensions. Additional random variables are introduced to ensure dimension matching.

SLIDE 9

The general idea for reversible jump, given unknown G and θk, is: from a starting state (G, θG), propose a new model with probability JG,G∗ and generate an augmenting random variable u from a proposal J(u | G, G∗, θG). Determine the proposed model parameters as θG∗ = gG,G∗(θG, u), where g is a deterministic function that relates the parameters of model G to those of G∗. Accept the new model with probability min(r, 1), where

r = [p(x | θG∗) π(θG∗) πG∗ JG∗,G J(u∗ | G∗, G, θG∗)] / [p(x | θG) π(θG) πG JG,G∗ J(u | G, G∗, θG)] × |Jacobian|,

with Jacobian = ∂gG,G∗(θG, u) / ∂(θG, u).

SLIDE 10

Reversible jump can be thought of as a generalization of the MH sampler:

MH sampler: r = {likelihood × prior × proposal ratios}
RJ sampler: r = {likelihood × prior × proposal ratios × Jacobian}

Usually the implementation has three kinds of moves:

BIRTH: move to dimension k + 1
DEATH: move to dimension k − 1
MOVE: move within dimension k

The models need not be nested.
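A schematic of the RJ acceptance computation, keeping the four ingredients of r separate. All arguments are hypothetical placeholders that a concrete sampler would supply from the model at hand:

```python
def rj_log_accept(loglik_new, loglik_old,      # log p(x | theta)
                  logprior_new, logprior_old,  # log pi(theta) + log pi(G)
                  log_move_rev, log_move_fwd,  # log J_{G*,G}, log J_{G,G*}
                  log_u_rev, log_u_fwd,        # log proposal densities for u
                  log_abs_jacobian):           # log |d g(theta,u) / d(theta,u)|
    """log r; accept the jump when log(Uniform(0,1)) < min(0, log r)."""
    return ((loglik_new - loglik_old) + (logprior_new - logprior_old)
            + (log_move_rev - log_move_fwd) + (log_u_rev - log_u_fwd)
            + log_abs_jacobian)
```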

SLIDE 11

Variable Selection

Discriminating variables define a mixture of G distributions. Introduce a latent p-vector γ with binary entries: γj = 1 if variable j defines a mixture distribution, γj = 0 otherwise.

The likelihood function is given by

L(G, γ, w, µ, Σ, η, Ω | X, y) = ∏_{k=1}^G (2π)^{−pγ nk/2} |Σ(γ)k|^{−nk/2} wk^{nk} × exp{ −(1/2) ∑_{xi ∈ Ck} (x(γ)i − µ(γ)k)′ Σ(γ)k^{−1} (x(γ)i − µ(γ)k) } × φ(X(γc) | η(γc), Ω(γc)),

where Ck = {xi | yi = k} with cardinality nk, pγ is the number of selected variables, and φ(·) is a multivariate normal density (for the non-selected variables).

SLIDE 12

Prior Model

Assume the γj’s are independent Bernoulli variables. The number of components, G, can be assumed to follow a truncated Poisson or a discrete Uniform on [2, . . . , Gmax].

w | G ∼ Dirichlet(α, . . . , α)
µk(γ) | Σk(γ), G ∼ N(µ0(γ), hΣk(γ))
Σk(γ) | G ∼ IW(δ, Qγ)

where (γ) indicates the covariates with γj = 1. Conjugate priors on the parameters for the case γj = 0. We work with a marginalized likelihood.

SLIDE 13

Model Fitting

(1) Update γ by a Metropolis algorithm (add/delete and swap moves).
(2) Update w from its full conditional (Dirichlet draw).
(3) Update y from its full conditional (multinomial draw).
(4) Split one cluster into two, or merge two into one.
(5) Birth or death of an empty component.

Steps (4) and (5) are carried out via reversible jump MCMC extended to the multivariate setting.

SLIDE 14

Updating γ

Metropolis move to update γold to γnew:

(a) Add/delete: randomly choose a γj and change its value.
(b) Swap: randomly choose a 0 and a 1 in γold and switch their values.

The new candidate γnew is accepted with probability

min{1, f(γnew | X, G, w, y) / f(γold | X, G, w, y)}.
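A sketch of this update, assuming a callable log_post(γ) that returns log f(γ | X, G, w, y) up to a constant (in this model it comes from the marginalized likelihood):

```python
import numpy as np

def update_gamma(gamma, log_post, rng):
    prop = gamma.copy()
    ones, zeros = np.flatnonzero(gamma == 1), np.flatnonzero(gamma == 0)
    if rng.uniform() < 0.5 or len(ones) == 0 or len(zeros) == 0:
        j = rng.integers(len(gamma))       # (a) add/delete: flip one entry
        prop[j] = 1 - prop[j]
    else:                                  # (b) swap: exchange a 0 and a 1
        prop[rng.choice(ones)], prop[rng.choice(zeros)] = 0, 1
    # both moves are symmetric, so the MH ratio is just the posterior ratio
    if np.log(rng.uniform()) < log_post(prop) - log_post(gamma):
        return prop
    return gamma
```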

SLIDE 15

Updating w and y

w | G, γ, y, X ∼ Dirichlet(α + n1, . . . , α + nG).
y is updated one element at a time from f(yi = k | X, y(−i), γ, w, G).
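A sketch of both draws; log_f_yi is a hypothetical callable returning log f(yi = k | X, y(−i), γ, w, G) up to a constant:

```python
import numpy as np

def update_w(y, G, alpha, rng):
    """w | G, gamma, y, X  ~  Dirichlet(alpha + n_1, ..., alpha + n_G)."""
    return rng.dirichlet(alpha + np.bincount(y, minlength=G))

def update_y(y, G, log_f_yi, rng):
    """Draw each y_i in turn from its full conditional (a multinomial draw)."""
    for i in range(len(y)):
        logp = np.array([log_f_yi(i, k, y) for k in range(G)])
        probs = np.exp(logp - logp.max())
        y[i] = rng.choice(G, p=probs / probs.sum())
    return y
```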

SLIDE 16

Split/merge and birth/death moves

The proposal for these moves in the multivariate setting is intricate. It is necessary to integrate out µ and Σ. Deriving f(yi|X, G, w, γ) is computationally prohibitive. Defining adjacency in the multivariate setting is not straightforward.

SLIDE 17

Posterior Inference for y

The number of clusters, G, is estimated by the value most frequently visited by the MCMC sampler. Estimate the marginal posterior probabilities p(yi = k | X, G). The posterior allocation of sample i is then estimated as

ŷi = argmax_{1≤k≤G} p(yi = k | X, G).
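A sketch of these summaries from MCMC output, with y_samples an M × n array of sampled allocation vectors (name assumed for illustration):

```python
import numpy as np

def allocate(y_samples, G):
    """Marginal allocation probabilities and modal assignments from MCMC draws."""
    probs = np.stack([(y_samples == k).mean(axis=0) for k in range(G)], axis=1)
    return probs, probs.argmax(axis=1)   # n x G probability matrix and y_hat
```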

SLIDE 18

Posterior Inference for γ

Select variables with the largest marginal posterior probability p(γj = 1 | X, G), or select the variables in the “best” visited model

γ∗ = argmax_{1≤t≤M} p(γ(t) | X, G, w̄, ŷ),

with ŷ the estimated sample allocations and w̄ = (1/M) ∑_{t=1}^M w(t).

Tadesse, Sha and Vannucci (JASA, 2005)
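A sketch of both selection rules from MCMC output; gamma_samples is an M × p binary array of visited models and log_posts the corresponding length-M vector of log posterior values (names assumed for illustration):

```python
import numpy as np

def select_variables(gamma_samples, log_posts, threshold=0.5):
    marginal = gamma_samples.mean(axis=0)              # p(gamma_j = 1 | X, G)
    by_marginal = np.flatnonzero(marginal > threshold) # rule 1: high marginals
    best_model = gamma_samples[np.argmax(log_posts)]   # rule 2: "best" model gamma*
    return marginal, by_marginal, best_model
```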

SLIDE 19

Infinite Mixture Models via Dirichlet Process Priors

Integrating over w and taking G → ∞ we get

p(yi = k and yl = k for some l ≠ i | y−i) = n−i,k / (n − 1 + α)
p(yi ≠ yl for all l ≠ i | y−i) = α / (n − 1 + α)

MCMC updates γ via Metropolis and yi from the full conditionals

p(yi = k and yl = k for some l ≠ i | y−i, X, γ)
p(yi ≠ yl for all l ≠ i | y−i, X, γ)

Inference on y by MAP or by estimating p(yi = yj | X); inference on γ as before. This is a natural approach to clustering (samples from a DP have a positive probability of ties).

Kim, Tadesse and Vannucci (Biometrika, 2006)
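A sketch of the resulting Pólya-urn update for a single yi; log_marg is a hypothetical callable returning the log marginal likelihood of sample i under an existing cluster k, or under a new cluster when k is None:

```python
import numpy as np

def crp_update_yi(i, y, alpha, log_marg, rng):
    """Resample y[i] from the DP full conditional given all other labels."""
    y_minus = np.delete(y, i)
    clusters, counts = np.unique(y_minus, return_counts=True)
    # weights: n_{-i,k} for existing clusters, alpha for a new one
    # (the common denominator n - 1 + alpha cancels after normalization)
    logw = [np.log(c) + log_marg(i, k, y) for k, c in zip(clusters, counts)]
    logw.append(np.log(alpha) + log_marg(i, None, y))
    logw = np.array(logw)
    probs = np.exp(logw - logw.max())
    choice = rng.choice(len(probs), p=probs / probs.sum())
    y[i] = clusters[choice] if choice < len(clusters) else y.max() + 1
    return y
```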

SLIDE 20

Application to Simulated Data

15 samples, 20 variables, 4 multivariate normal densities:

xij ∼ I{1≤i≤4} N(µ1, σ²1) + I{5≤i≤7} N(µ2, σ²2) + I{8≤i≤13} N(µ3, σ²3) + I{14≤i≤15} N(µ4, σ²4),

i = 1, . . . , 15, j = 1, . . . , 20, with µk ∈ [−5, 5] and σ²k ∈ [0.1, 2].

Cluster sizes: 4-3-6-2. An additional set of 980 noisy variables is drawn from a standard normal density.
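A sketch of this design under one reading of it, with a separate mean and variance drawn per cluster and per variable from the stated ranges:

```python
import numpy as np

rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2, 3], [4, 3, 6, 2])            # cluster sizes 4-3-6-2
mu = rng.uniform(-5, 5, size=(4, 20))                     # mu_k in [-5, 5]
sigma2 = rng.uniform(0.1, 2.0, size=(4, 20))              # sigma2_k in [0.1, 2]

X_disc = rng.normal(mu[labels], np.sqrt(sigma2[labels]))  # 15 x 20 discriminating
X_noise = rng.normal(size=(15, 980))                      # standard-normal noise
X = np.hstack([X_disc, X_noise])                          # final 15 x 1000 matrix
```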

SLIDE 21

Weakly informative priors for model parameters (δ = 3, α = 1, h = 100, Q = kI). Truncated Poisson prior for G with Gmax = 10. MCMC with 100,000 iterations, starting from a model with one randomly selected γj set to 1.

SLIDE 22

Trace plot of number of clusters, G

[Figure: trace of G over the 100,000 MCMC iterations]

SLIDE 23

Trace plot for number of included variables, pγ

[Figure: trace of pγ over the 100,000 MCMC iterations]

SLIDE 24

Marginal posterior probabilities, p(γj = 1|X, G = 4)

[Figure: p(γj = 1 | X, G = 4) plotted against variable index j = 1, . . . , 1000]

SLIDE 25

Marginal posterior probabilities of sample allocations, p(yi = k|X, G = 4), i = 1, . . . , 15, k = 1, . . . , 4

[Figure: four panels showing p(yi = k | X, G = 4), k = 1, . . . , 4, across the 15 samples]

SLIDE 26

Results

G = 4 had the strongest support. All sample allocations corresponded to the true cluster structure. There were 16 variables with marginal probability > 0.7 (15 were correct). Very little sensitivity to model parameters, with the exception of the covariance hyperparameters.

SLIDE 27

Application to microarray data

Endometrial cancer: the most common gynecologic malignancy in the US. 10 tumor and 4 normal tissues were collected from hysterectomy specimens and examined with Affymetrix Hu6800 arrays. Probe sets with unreliable readings (< 20 or > 16,000) were removed ⇒ p = 762. Gene expressions were log-transformed and scaled by their range. Weakly informative priors were specified for the model parameters; truncated Poisson prior for G with Gmax = n; γj ∼ Bernoulli(ϕ = 10/p). Four MCMC chains were run with widely different starting points: (a) 1; (b) 10; (c) 25; (d) 50 randomly selected γj’s set to 1.

SLIDE 28

Posterior distribution of G (union of 4 chains) and p(γj = 1 | X, G = 3)

[Figure: left panel, p(G = k | X) for k = 1, . . . , 8; right panel, p(γj = 1 | X, G = 3) across the 762 genes]

SLIDE 29

We have identified 3 classes and a set of 31 genes that can distinguish subtypes of the disease.

[Figure: heatmap of the selected genes across samples, and three panels showing p(yi = k | X, G = 3), k = 1, 2, 3, across the 14 samples]

SLIDE 30

Supervised case (discriminant analysis)

Model-based approach to classification. Objective: assign objects to (known) groups based on a set of measurements on a training set. Data from group k are modeled as

Xk − 1nk µk′ ∼ N(I, Σk), k = 1, . . . , G,

where the vector µk and the matrix Σk are the mean and the covariance matrix of the k-th group, respectively. Group assignments: y = (y1, . . . , yn)′, where yi = k if the ith observation comes from group k.

Unsupervised setting (clustering): G, w, y unknown.
Supervised setting (discriminant analysis): G, y known (ŵk = nk/n). The aim is to classify new samples via a “classifier” (the predictive distribution).

SLIDE 31

Conjugate priors:

µk ∼ N(mk, hkΣk)
Σk ∼ IW(δk, Ωk),

where Ωk is a scale matrix and δk a shape parameter. The predictive distribution (a Student-t) is used to classify new samples:

πk(yf | X) = pk(xf) ŵk / ∑_{l=1}^G pl(xf) ŵl,

where pk(xf) denotes the predictive distribution of group k. A new observation is then assigned to the group with the highest posterior probability.
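A sketch of this classification rule; pred_logpdf is a hypothetical callable returning log pk(xf), the log predictive (Student-t) density of group k, and w_hat holds the estimated proportions nk/n:

```python
import numpy as np

def classify(x_f, w_hat, pred_logpdf):
    """Posterior group probabilities pi_k(y_f | X) for a new observation x_f."""
    G = len(w_hat)
    logp = np.array([np.log(w_hat[k]) + pred_logpdf(x_f, k) for k in range(G)])
    probs = np.exp(logp - logp.max())
    probs /= probs.sum()
    return probs.argmax(), probs   # assigned group and full probability vector
```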

SLIDE 32

Variable selection

Want to select discriminating variables. Introduce a latent p-vector γ with binary entries: γj = 1 if variable j defines a mixture distribution, γj = 0 otherwise.

The likelihood function is given by

L(X, y; ·) = ∏_{k=1}^G [ wk^{nk} ∏_{i ∈ Ck} pk(xi(γ)) ] × ∏_{i=1}^n p(xi(γc) | xi(γ)),

with conjugate priors for selected and non-selected variables, and Bernoulli priors or a Markov random field prior on γ.

SLIDE 33

Model Fitting - MCMC

Can marginalize over the model parameters. Update γ by a Metropolis algorithm. Classify new samples based on the selected variables.

SLIDE 34

A Benchmark Example - Leukemia data, Golub (1999, Science)

Supervised setting. Likelihood function defined as in Raftery & Dean (2006). 38 + 34 patients; 3,571 genes; KEGG graph used for the prior on γ. 29 genes selected; 33/34 samples correctly classified.

[Figure: posterior inclusion probabilities across the 3,571 genes, and posterior classification probabilities of the units for the ALL and AML groups]

Stingo and Vannucci (Bioinformatics, 2011)

SLIDE 35

Main References

Tadesse, M.G., Sha, N. and Vannucci, M. (2005). Bayesian variable selection in clustering high-dimensional data. Journal of the American Statistical Association, 100, 602-617.

Kim, S., Tadesse, M.G. and Vannucci, M. (2006). Variable selection in clustering via Dirichlet process mixture models. Biometrika, 93(4), 877-893.

Stingo, F.C. and Vannucci, M. (2011). Variable selection for discriminant analysis with Markov random field priors for the analysis of microarray data. Bioinformatics, 27(4), 495-501.

SLIDE 36

Case Study in Imaging Genetics

Slides for this case study are drawn from: Stingo, F.C., Guindani, M., Vannucci, M. and Calhoun, V. (2013). An integrative Bayesian modeling approach to imaging genetics. Journal of the American Statistical Association, accepted.
