Bayesian nonparametrics
Dr. Jarad Niemi
STAT 615 - Iowa State University
December 5, 2017
Bayesian nonparametrics

There are two main approaches to Bayesian nonparametric density estimation: the Dirichlet process and Polya trees. See Müller and Mitra (2013) for a general overview of Bayesian nonparametric problems, e.g. density estimation, clustering, regression, random effects distributions, etc.
Motivation

[Figure: estimated density of phi]
Goal

Let $Y_i$ come from an unknown probability measure $G$, i.e. $Y_i \sim G$. As a Bayesian, the natural approach is to put a prior on $G$. That is, we want to make statements like $P(Y_i \in A) = G(A)$ for any set $A$.
Dirichlet process

One approach is to use a Dirichlet process (Ferguson 1973). We write
$$G \sim DP(aG_0)$$
where $a > 0$ is the concentration (or total mass) parameter and $G_0$ is the base measure, i.e. a probability distribution defined on the support of $G$. For any partition $A_1, \ldots, A_K$ of the sample space $S$, the probability vector $[G(A_1), \ldots, G(A_K)]$ follows a Dirichlet distribution, i.e.
$$[G(A_1), \ldots, G(A_K)] \sim Dir([aG_0(A_1), \ldots, aG_0(A_K)]).$$
Thus
$$E[G(A_1)] = G_0(A_1) \quad \text{and} \quad Var[G(A_1)] = \frac{G_0(A_1)[1 - G_0(A_1)]}{1 + a}.$$
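A minimal sketch (not from the slides) of this finite-dimensional property in R, assuming $G_0 = N(0,1)$, $a = 2$, and the illustrative partition $(-\infty,0], (0,1], (1,\infty)$; the Monte Carlo mean and variance of the Dirichlet draws should approximately match the formulas above.

# Draw the partition probabilities implied by G ~ DP(a * G0)
set.seed(20171205)
a     <- 2
G0    <- c(pnorm(0), pnorm(1) - pnorm(0), 1 - pnorm(1))   # base-measure probabilities of the three sets
alpha <- a * G0

rdirichlet <- function(n, alpha) {            # Dirichlet draws via normalized gammas
  x <- matrix(rgamma(n * length(alpha), shape = alpha), nrow = n, byrow = TRUE)
  x / rowSums(x)
}

draws <- rdirichlet(1e5, alpha)
colMeans(draws)                # approximately G0(A_k)
apply(draws, 2, var)           # approximately G0(A_k) * (1 - G0(A_k)) / (1 + a)
G0 * (1 - G0) / (1 + a)        # theoretical variance for comparison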
Conjugacy of the Dirichlet process

Assume $Y_i \stackrel{ind}{\sim} G$ and $G \sim DP(aG_0)$. Then for any partition $\{A_1, \ldots, A_K\}$, we have
$$[G(A_1), \ldots, G(A_K)] \mid y \sim Dir\left(\left[aG_0(A_1) + \sum_{i=1}^n I(y_i \in A_1), \ldots, aG_0(A_K) + \sum_{i=1}^n I(y_i \in A_K)\right]\right)$$
and thus
$$G \mid y \sim DP\left(aG_0 + \sum_{i=1}^n \delta_{y_i}\right)$$
which has
$$E[G(A) \mid y] = \frac{a}{a+n}\, G_0(A) + \frac{n}{a+n}\left(\frac{1}{n} \sum_{i=1}^n I(y_i \in A)\right).$$
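A small sketch of the posterior mean formula, using made-up data $y$ and the set $A = (-\infty, 0]$ with $G_0 = N(0,1)$ (both choices are purely illustrative):

a <- 2
y <- c(-1.2, 0.4, 0.1, -0.3, 2.2)     # made-up data
n <- length(y)
post_mean_G <- function(u) {           # E[G((-Inf, u]) | y] from the conjugacy result above
  a / (a + n) * pnorm(u) + n / (a + n) * mean(y <= u)
}
post_mean_G(0)   # shrinks the empirical CDF at 0 toward G0((-Inf, 0]) = 0.5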
Stick-breaking representation

A constructive representation of the Dirichlet process is the stick-breaking representation. Assume $G \sim DP(aG_0)$. Then
$$G(\cdot) = \sum_{h=1}^\infty \pi_h \delta_{\theta_h}(\cdot)$$
where $\pi \sim \text{stick}(a)$ and $\theta_h \stackrel{ind}{\sim} G_0$. The stick distribution is the following:
$$\pi_h = \nu_h \prod_{\ell < h} (1 - \nu_\ell) \quad \text{and} \quad \nu_h \stackrel{ind}{\sim} Be(1, a).$$

[Figure: the unit interval broken into pieces of length $\pi_1, \pi_2, \pi_3, \pi_4, \pi_5, \ldots$]
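A rough sketch of the construction in R, truncated at a large $H$ for practicality; $G_0 = N(0,1)$ and $a = 1$ are illustrative choices, not from the slides.

stick_breaking_draw <- function(H = 100, a = 1) {
  nu    <- rbeta(H, 1, a)                   # stick-breaking proportions nu_h ~ Be(1, a)
  nu[H] <- 1                                # truncate so the weights sum to one
  pi    <- nu * cumprod(c(1, 1 - nu[-H]))   # pi_h = nu_h * prod_{l<h} (1 - nu_l)
  theta <- rnorm(H)                         # atoms theta_h ~ G0 = N(0, 1)
  list(pi = pi, theta = theta)
}

draw <- stick_breaking_draw(H = 100, a = 1)
sum(draw$pi)   # equals 1 (numerically)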
Realizations from a DP

[Figure: realizations from a DP with a standard normal base measure; realizations are across the columns and values of $a$ (0.01, 0.1, 1, 10, 100) are down the rows.]
DP mixture

If the distribution we are trying to approximate is absolutely continuous, then a DP, whose realizations are discrete, is not reasonable. Thus, we may want to use a DP mixture, i.e.
$$Y_i \stackrel{ind}{\sim} p(\cdot \mid \theta_i), \quad \theta_i \stackrel{ind}{\sim} G, \quad G \sim DP(aG_0)$$
for some parametric model $p(\cdot \mid \theta)$. Alternatively, if we use the stick-breaking construction, we have
$$Y_i \stackrel{ind}{\sim} p(\cdot \mid \theta_i), \quad \theta_i \stackrel{ind}{\sim} \sum_{h=1}^\infty \pi_h \delta_{\theta^*_h}$$
where $\theta^*_h \stackrel{ind}{\sim} G_0$ and $\pi \sim \text{stick}(a)$.
Finite approximation to the stick-breaking representation

For some $\epsilon > 0$, there exists an $H$ such that $\sum_{h=H}^\infty \pi_h < \epsilon$ and components $H$ and beyond can reasonably be ignored. The resulting model is
$$Y_i \stackrel{ind}{\sim} p(\cdot \mid \theta_i), \quad \theta_i \stackrel{ind}{\sim} \sum_{h=1}^H \pi_h \delta_{\theta^*_h}$$
where $\pi_h = \nu_h \prod_{\ell < h} (1 - \nu_\ell)$, $\nu_h \stackrel{ind}{\sim} Be(1, a)$ for $h < H$, and $\nu_H = 1$.
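Continuing the stick-breaking sketch above, a quick empirical check (for a single draw, with illustrative $a = 1$ and $\epsilon = 10^{-3}$) of how large the truncation level needs to be:

set.seed(615)
a    <- 1
eps  <- 1e-3
Hmax <- 1000                                   # generous upper bound on components
nu   <- rbeta(Hmax, 1, a)
pi   <- nu * cumprod(c(1, 1 - nu[-Hmax]))      # stick-breaking weights
H    <- which(cumsum(pi) > 1 - eps)[1]         # smallest H leaving mass < eps beyond it
H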
Normal example

A DP mixture model for the marginal distribution of $Y_i = \phi_i$ is
$$Y_i \stackrel{ind}{\sim} N(\mu_i, \sigma_i^2), \quad (\mu_i, \sigma_i^2) \sim \sum_{h=1}^H \pi_h \delta_{(\mu^*_h, \sigma^{2*}_h)}$$
where $\sum_{h=1}^H \pi_h = 1$. Alternatively, we can introduce a latent variable $\zeta_i = h$ if observation $i$ came from group $h$. Then
$$Y_i \mid \zeta_i = h \stackrel{ind}{\sim} N(\mu^*_h, \sigma^{2*}_h), \quad \zeta_i \stackrel{ind}{\sim} Cat(H, \pi)$$
where $\zeta \sim Cat(H, \pi)$ is a categorical random variable with $P(\zeta = h) = \pi_h$ for $h = 1, \ldots, H$ and $\pi = (\pi_1, \ldots, \pi_H)$.
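A minimal simulation sketch of this latent-variable representation; the weights and component parameters below are illustrative, not from the slides.

set.seed(615)
n     <- 500
pi    <- c(0.5, 0.3, 0.2)    # mixture weights, sum to 1
mu    <- c(-2, 0, 3)         # component means mu*_h
sigma <- c(0.5, 1.0, 0.8)    # component standard deviations
zeta  <- sample(seq_along(pi), n, replace = TRUE, prob = pi)   # zeta_i ~ Cat(H, pi)
y     <- rnorm(n, mean = mu[zeta], sd = sigma[zeta])           # y_i | zeta_i = h ~ N(mu*_h, sigma2*_h)
table(zeta) / n              # empirical component frequencies vs. pi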
Normal example

Let
$$Y_i \stackrel{ind}{\sim} N(\mu_i, \sigma_i^2), \quad (\mu_i, \sigma_i^2) \stackrel{ind}{\sim} \sum_{h=1}^H \pi_h \delta_{(\mu^*_h, \sigma^{2*}_h)}$$
where the base measure $G_0$ is
$$\mu^*_h \mid \sigma^{2*}_h \stackrel{ind}{\sim} N(m_h, v_h^2 \sigma^{2*}_h) \quad \text{and} \quad \sigma^{2*}_h \stackrel{ind}{\sim} IG(c_h, d_h).$$
But since each $(\mu_i, \sigma_i^2)$ must equal $(\mu^*_h, \sigma^{2*}_h)$ for some $h$, we can rewrite the model as
$$Y_i \stackrel{ind}{\sim} \sum_{h=1}^H \pi_h N(\mu^*_h, \sigma^{2*}_h)$$
with a prior that is equal to the base measure. Thus this model is equivalent to our finite mixture with the exception of the prior for $\pi$.
MCMC - Blocked Gibbs sampler

A Gibbs sampler with stationary distribution
$$p(\pi, \mu, \sigma^2, \zeta \mid y) \propto p(y \mid \zeta, \mu, \sigma^2)\, p(\zeta \mid \pi)\, p(\mu \mid \sigma^2)\, p(\sigma^2)\, p(\pi)$$
has steps

1. For $i = 1, \ldots, n$, independently sample $\zeta_i$ from its full conditional
$$P(\zeta_i = h \mid \ldots) \propto \pi_h\, N(y_i; \mu^*_h, \sigma^{2*}_h).$$

2. Jointly sample $\pi$ and $(\mu, \sigma^2)$ because they are conditionally independent:

   a. Sample $\nu_h \stackrel{ind}{\sim} Be(1 + Z_h, a + Z^+_h)$ for $h = 1, \ldots, H-1$, where $Z_h = \sum_{i=1}^n I(\zeta_i = h)$ and $Z^+_h = \sum_{h'=h+1}^H Z_{h'}$, and set $\nu_H = 1$. Then calculate $\pi_h = \nu_h \prod_{\ell < h} (1 - \nu_\ell)$ (see the sketch after this list).

   b. For $h = 1, \ldots, H$, sample $(\mu^*_h, \sigma^{2*}_h)$ from their full conditional
   $$\mu^*_h \mid \sigma^{2*}_h \stackrel{ind}{\sim} N(m'_h, v'^2_h \sigma^{2*}_h), \quad \sigma^{2*}_h \stackrel{ind}{\sim} IG(c'_h, d'_h)$$
   where $m'_h$, $v'^2_h$, $c'_h$, and $d'_h$ are exactly the same as in the normal finite mixture MCMC.
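A sketch of step 2a only; the function name `update_pi` and the example cluster indicators are hypothetical, chosen for illustration.

update_pi <- function(zeta, H, a) {
  Z     <- tabulate(zeta, nbins = H)           # Z_h = number of observations with zeta_i = h
  Zplus <- rev(cumsum(rev(Z))) - Z             # Z+_h = sum over h' > h of Z_h'
  nu    <- c(rbeta(H - 1, 1 + Z[-H], a + Zplus[-H]), 1)   # nu_H = 1
  nu * cumprod(c(1, 1 - nu[-H]))               # pi_h = nu_h * prod_{l<h} (1 - nu_l)
}

pi_new <- update_pi(zeta = sample(1:5, 100, replace = TRUE), H = 10, a = 1)
sum(pi_new)   # equals 1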
JAGS

library("rjags")

dp_normal_blocked = "
model {
  for (i in 1:n) {
    y[i]    ~ dnorm(mu[zeta[i]], tau[zeta[i]])   # likelihood given component membership
    zeta[i] ~ dcat(pi[])                          # latent component indicator
  }
  for (h in 1:H) {
    mu[h]    ~ dnorm(2, 1/3)      # prior for component means
    tau[h]   ~ dgamma(.1, .1)     # prior for component precisions
    sigma[h] <- 1/sqrt(tau[h])
  }

  # Stick breaking
  for (h in 1:(H-1)) { V[h] ~ dbeta(1, a) }
  V[H] <- 1
  pi[1] <- V[1]
  for (h in 2:H) {
    pi[h] <- V[h] * (1-V[h-1]) * pi[h-1] / V[h-1]   # pi_h = V_h * prod_{l<h} (1 - V_l)
  }
}
"

tmp = hat[sample(nrow(hat), 1000),]
dat = list(n=nrow(tmp), H=25, y=tmp$phi, a=1)

jm = jags.model(textConnection(dp_normal_blocked), data = dat, n.chains = 3)
r  = jags.samples(jm, c('mu','sigma','pi','zeta'), 1e3)
Monitor convergence of density

As previously discussed, the model as constructed has identifiability problems among the $\pi_h$, $\mu^*_h$, and $\sigma^{2*}_h$ due to label switching. What is identified in the model is the value of the density at any particular value. So rather than directly monitoring the parameters, we will monitor the estimated density, i.e. at iteration $m$ of the MCMC, the estimated density at location $x$ is
$$\sum_{h=1}^H \pi_h^{(m)}\, N\left(x; \mu_h^{*(m)}, \sigma_h^{2*(m)}\right).$$
Monitoring this quantity at a variety of locations $x$ provides an appropriate convergence assessment.
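One way to compute this quantity, assuming the `jags.samples` output `r` from the JAGS slide and that the monitored mcarray objects have dimensions [component, iteration, chain] (a sketch, not a verified post-processing script):

density_at <- function(x, m, c) {   # estimated density at location x, iteration m, chain c
  sum(r$pi[, m, c] * dnorm(x, mean = r$mu[, m, c], sd = r$sigma[, m, c]))
}
density_at(1.3, m = 1000, c = 1)    # monitor at several x over all iterations and chains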
Monitor convergence of density

[Figure: traceplots by chain of the estimated density at four monitoring locations (x ≈ -2.53, -2.12, 1.32, 1.73).]
Monitoring the number of utilized components

Since we are using a finite approximation to the DP, we should monitor the index of the maximum occupied component (or the number of occupied clusters). If the finite approximation is reasonable, then this number will be smaller than $H$. If not, then $H$ should be increased. Specifically, at iteration $m$, we monitor
$$\max\left\{\zeta_1^{(m)}, \ldots, \zeta_n^{(m)}\right\},$$
the index of the maximum occupied cluster, or
$$\sum_{h=1}^H I(Z_h > 0),$$
the number of occupied clusters.
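A sketch of both diagnostics computed from the monitored `zeta` samples, again assuming `r$zeta` has dimensions [observation, iteration, chain]:

max_component <- apply(r$zeta, c(2, 3), max)                             # index of the maximum occupied component
n_occupied    <- apply(r$zeta, c(2, 3), function(z) length(unique(z)))   # number of occupied clusters
summary(c(max_component))   # should stay comfortably below H (25 in the JAGS example)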