Lecture 14: Inference in Dirichlet Processes


  1. CS598JHM: Advanced NLP (Spring 2013)
     http://courses.engr.illinois.edu/cs598jhm/
     Lecture 14: Inference in Dirichlet Processes
     (Blei & Jordan, "Variational inference for Dirichlet Process Mixture models", Bayesian Analysis 2006)
     Julia Hockenmaier
     juliahmr@illinois.edu
     3324 Siebel Center
     Office hours: by appointment

  2. Dirichlet Process mixture models
     A mixture model with a DP as nonparametric prior:
     - 'Mixing weights' (prior): G | {α, G_0} ~ DP(α, G_0)
       The base distribution G_0 and G are distributions over the same probability space.
     - 'Cluster' parameters: η_n | G ~ G
       For each data point n = 1, ..., N, draw a distribution η_n (with value η_c*) over observations from G.
       (We can interpret this as clustering because G is discrete with probability 1; hence different η_n take on identical values η_c* with nonzero probability. The data points are partitioned into |C| clusters: c = c_1 ... c_N.)
     - Observed data: x_n | η_n ~ p(x_n | η_n)
       For each data point n = 1, ..., N, draw the observation x_n from η_n.

  3. Stick-breaking representation of DPMs
     [Figure: a unit-length stick is broken into pieces π_1 = v_1, π_2 = (1 − v_1)v_2, ...]
     The component parameters η*: η_i* ~ G_0
     The mixing proportions π_i(v) are defined by a stick-breaking process:
       V_i ~ Beta(1, α)
       π_i(v) = v_i ∏_{j=1...i−1} (1 − v_j)
     also written as π(v) ~ GEM(α) (Griffiths/Engen/McCloskey).
     Hence, if G ~ DP(α, G_0):
       G = ∑_{i=1...∞} π_i(v) δ_{η_i*}   with η_i* ~ G_0
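
The stick-breaking process is easy to simulate. Below is a minimal sketch (assuming NumPy; the concentration α and the number of weights drawn are illustrative choices, not values from the lecture) that draws the first T mixing proportions π_i(v).

```python
import numpy as np

def stick_breaking(alpha, T, rng=None):
    """Draw the first T mixing weights pi_i(v) = v_i * prod_{j<i} (1 - v_j)."""
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, alpha, size=T)                     # V_i ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining                                 # pi_i(v)

pi = stick_breaking(alpha=1.0, T=20, rng=0)
print(pi[:5], pi.sum())   # the weights sum to < 1; the rest of the stick stays unbroken
```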

  4. DP mixture models with DP(α, G_0)
     1. Define stick-breaking weights by drawing V_i | α ~ Beta(1, α)
     2. Draw cluster parameters η_i* | G_0 ~ G_0, i = {1, 2, ...}
     3. For the n-th data point:
        - Draw the cluster id Z_n | {v_1, v_2, ...} ~ Mult(π(v))
        - Draw the observation X_n | z_n ~ p(x | η_{z_n}*)
     p(x | η*) is from an exponential family of distributions; G_0 is from the corresponding conjugate prior,
     e.g. p(x | η*) multinomial, G_0 Dirichlet.

  5. Stick-breaking construction of DPMs
     [Plate diagram: α → V_k and λ → η_k* (k = 1...∞); V → Z_n → X_n ← η* (n = 1...N)]
     Stick lengths: V_i ~ Beta(1, α), yielding mixing weights π_i(v) = v_i ∏_{j<i} (1 − v_j)
     Component parameters: η_i* ~ G_0 (assume G_0 is a conjugate prior with hyperparameter λ)
     Assignment of data to components: Z_n | {v_1, ...} ~ Mult(π(v))
     Generating the observations: X_n | z_n ~ p(x_n | η_{z_n}*)
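
As a sketch of this generative process, the snippet below samples data from a truncated DPM for the multinomial/Dirichlet case mentioned on slide 4. The truncation level T, the vocabulary size, and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def sample_dpm(N, alpha=1.0, lam=0.5, vocab=6, T=50, rng=None):
    rng = np.random.default_rng(rng)
    v = rng.beta(1.0, alpha, size=T)                        # stick lengths V_i ~ Beta(1, alpha)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi /= pi.sum()                                          # renormalise the truncated weights
    eta = rng.dirichlet(np.full(vocab, lam), size=T)        # component parameters eta*_i ~ G_0
    z = rng.choice(T, size=N, p=pi)                         # cluster ids Z_n ~ Mult(pi(v))
    x = np.array([rng.choice(vocab, p=eta[k]) for k in z])  # observations X_n ~ p(x | eta*_{z_n})
    return x, z

x, z = sample_dpm(N=100, rng=0)
print(np.unique(z))                                         # indices of the clusters actually used
```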

  6. Inference for DP mixture models
     Given observed data x_1, ..., x_n, compute the predictive density:
       p(x | x_1, ..., x_n, α, G_0) = ∫ p(x | w) p(w | x_1, ..., x_n, α, G_0) dw
     Problem: the posterior over the latent variables, p(w | x_1, ..., x_n, α, G_0), cannot be computed in closed form.
     Approximate inference:
     - Gibbs sampling: sample from a Markov chain whose equilibrium distribution is p(W | x_1, ..., x_n, α, G_0)
     - Variational inference: construct a tractable variational approximation q of p with free variational parameters ν

  7. Gibbs sampling

  8. Gibbs sampling for DPMs
     Two variants, which differ in their definition of the Markov chain:
     - Collapsed Gibbs sampler: integrates out G and the distinct parameter values {η_1*, ..., η_|C|*} associated with the clusters.
     - Blocked Gibbs sampler: based on the stick-breaking construction; requires a truncated variant of the DP.

  9. Collapsed Gibbs sampler for DPMs
     Integrate out the random measure G and the distinct parameter values {η_1*, ..., η_|C|*} associated with each cluster.
     Given data x = x_1 ... x_N, each state of the Markov chain is a cluster assignment c = c_1 ... c_N of the data points.
     Each sample is therefore also a cluster assignment c = c_1 ... c_N.
     Given a cluster assignment c_b = c_1 ... c_N with C distinct clusters, the predictive density is
       p(x_{N+1} | c_b, x, α, λ) = ∑_{k ≤ C+1} p(c_{N+1} = k | c_b, α) p(x_{N+1} | c_b, c_{N+1} = k, λ)

  10. Collapsed Gibbs sampler for DPMs
      'Macro-sample' step: assign a new cluster to all data points.
      'Micro-sample' step: sample the assignment variable C_n for each data point, conditioned on the assignments of the remaining points, c_{-n}.
      C_n is either one of the values in c_{-n} or a new value:
        p(c_n = k | x, c_{-n}) ∝ p(x_n | x_{-n}, c_{-n}, c_n = k, λ) p(c_n = k | c_{-n}, α)
      with p(x_n | x_{-n}, c_{-n}, c_n = k, λ) = p(x | c_{-n}, c_n = k, λ) / p(x_{-n} | c_{-n}, c_n = k, λ)
      and p(c_n = k | c_{-n}, α) given by the Pólya urn (Blackwell/MacQueen).
      Inference: after burn-in, collect B sample assignments c_b and average over their predictive densities.
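
A minimal sketch of the micro-sample step for the multinomial/Dirichlet case (one categorical observation per data point) is given below. The Pólya-urn prior supplies the n_k and α factors, and under a symmetric Dirichlet(λ) base measure the posterior predictive p(x_n | x_{-n}, c_{-n}, c_n = k, λ) has the closed form (count_k(x_n) + λ)/(n_k + Vλ). The bookkeeping and all hyperparameter values are illustrative assumptions, not the lecture's exact setup.

```python
import numpy as np

def gibbs_sweep(x, c, alpha, lam, vocab, rng):
    """Resample each c_n given c_{-n}: Polya-urn prior times posterior predictive."""
    counts = {}                                          # cluster id -> count vector over the vocabulary
    for xi, ci in zip(x, c):
        counts.setdefault(ci, np.zeros(vocab))[xi] += 1
    for n, xi in enumerate(x):
        counts[c[n]][xi] -= 1                            # remove x_n from its current cluster
        if counts[c[n]].sum() == 0:
            del counts[c[n]]
        ids = list(counts)
        new_id = max(ids, default=0) + 1
        # existing cluster k:  n_{-n,k} * (count_k(x_n) + lam) / (n_{-n,k} + V*lam)
        probs = [counts[k].sum() * (counts[k][xi] + lam) / (counts[k].sum() + vocab * lam)
                 for k in ids]
        probs.append(alpha / vocab)                      # new cluster: alpha * prior predictive 1/V
        probs = np.array(probs) / sum(probs)
        j = rng.choice(len(probs), p=probs)
        c[n] = ids[j] if j < len(ids) else new_id
        counts.setdefault(c[n], np.zeros(vocab))[xi] += 1
    return c

rng = np.random.default_rng(0)
x = rng.choice(4, size=50)                               # toy categorical data, vocabulary size 4
c = np.zeros(50, dtype=int)                              # start with all points in one cluster
for _ in range(20):                                      # burn-in sweeps
    c = gibbs_sweep(x, c, alpha=1.0, lam=0.5, vocab=4, rng=rng)
print(len(set(c)), "clusters after burn-in")
```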

  11. Blocked Gibbs sampling
      Based on the stick-breaking construction. States of the Markov chain consist of (V, η*, Z).
      Problem: in the actual DPM model, V and η* are infinite.
      Instead, the blocked Gibbs sampler uses a truncated DP (TDP), which samples only a finite collection of T stick lengths (and hence clusters).
      Setting V_T = 1 gives π_i(v) = 0 for i > T, with π_i(v) = v_i ∏_{j<i} (1 − v_j) as before.
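
A small sketch of this truncation (assuming NumPy): fixing the last stick length to one makes the T weights sum to one, so no probability mass is left beyond component T.

```python
import numpy as np

def truncated_weights(v):
    """v: the T-1 draws V_1..V_{T-1}; returns the T weights pi_1..pi_T with V_T = 1."""
    v = np.append(v, 1.0)                                # fix the last stick length to 1
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

pi = truncated_weights(np.random.default_rng(0).beta(1.0, 1.0, size=9))
print(pi.sum())                                          # sums to 1 (up to floating point)
```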

  12. Blocked Gibbs sampling
      The states of the Markov chain consist of
      - the beta variables V = {V_1, ..., V_{T−1}},
      - the mixture component parameters η* = {η_1*, ..., η_T*},
      - the indicator variables Z = {Z_1, ..., Z_N}.
      Sampling:
      - For n = 1...N, sample Z_n from p(z_n = k | v, η*, x) ∝ π_k(v) p(x_n | η_k*)
      - For k = 1...T−1, sample V_k from Beta(γ_{k,1}, γ_{k,2}) with
          γ_{k,1} = 1 + n_k, where n_k is the number of data points in cluster k
          γ_{k,2} = α + n_{k+1...T}, where n_{k+1...T} is the number of data points in clusters k+1, ..., T
      - For k = 1...T, sample η_k* from its posterior p(η_k* | τ_k), with
          τ_k = (λ_1 + ∑_{n: z_n = k} x_n, λ_2 + n_k)
      Predictive density for each sample:
        p(x_{N+1} | x, z, α, λ) = ∑_k E[π_k(v) | γ_1, ..., γ_{T−1}] p(x_{N+1} | τ_k)
      A concrete sweep for the multinomial/Dirichlet case is sketched below.
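
Below is a minimal sketch of one such sweep for the truncated multinomial/Dirichlet DPM with single categorical observations. The concrete likelihood, the symmetric Dirichlet(λ) base measure, and all hyperparameter values are illustrative assumptions rather than the lecture's exact model.

```python
import numpy as np

def blocked_gibbs_sweep(x, z, v, eta, alpha, lam, T, vocab, rng):
    # 1. For n = 1..N, sample Z_n proportional to pi_k(v) * p(x_n | eta*_k)
    v_full = np.append(v, 1.0)                                   # V_T = 1 (truncation)
    pi = v_full * np.concatenate(([1.0], np.cumprod(1.0 - v_full)[:-1]))
    for n, xi in enumerate(x):
        p = pi * eta[:, xi]
        z[n] = rng.choice(T, p=p / p.sum())
    # 2. For k = 1..T-1, sample V_k ~ Beta(1 + n_k, alpha + n_{k+1..T})
    n_k = np.bincount(z, minlength=T)
    suffix = np.cumsum(n_k[::-1])[::-1]                          # suffix[k] = n_k + n_{k+1} + ...
    v = rng.beta(1.0 + n_k[:-1], alpha + suffix[1:])
    # 3. For k = 1..T, sample eta*_k from its Dirichlet posterior p(eta*_k | tau_k)
    for k in range(T):
        counts = np.bincount(x[z == k], minlength=vocab)
        eta[k] = rng.dirichlet(lam + counts)
    return z, v, eta

rng = np.random.default_rng(0)
x = rng.choice(4, size=50)                                       # toy categorical data
T, vocab = 10, 4
z = rng.choice(T, size=50)
v = rng.beta(1.0, 1.0, size=T - 1)
eta = rng.dirichlet(np.full(vocab, 0.5), size=T)
for _ in range(20):
    z, v, eta = blocked_gibbs_sweep(x, z, v, eta, alpha=1.0, lam=0.5, T=T, vocab=vocab, rng=rng)
print(np.bincount(z, minlength=T))                               # cluster occupancies
```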

  13. Variational inference (recap)

  14. Standard EM
      L(q, θ) = ln p(X | θ) − KL(q || p) is a lower bound on the incomplete log-likelihood ln p(X | θ).
      [Figure: decomposition of ln p(X | θ) into the lower bound L(q, θ) and KL(q || p)]
      E-step: with θ_old fixed, return the q_new that maximizes L(q, θ_old) w.r.t. q(Z). Now KL(q_new || p_old) = 0.
      M-step: with q_new fixed, return the θ_new that maximizes L(q_new, θ) w.r.t. θ.
      If L(q_new, θ_new) > L(q_new, θ_old), then ln p(X | θ_new) > ln p(X | θ_old), and hence KL(q_new || p_new) > 0.

  15. Variational inference
      Variational inference is applicable when you have to compute an intractable posterior over latent variables, p(W | X).
      Basic idea: replace the exact but intractable posterior p(W | X) with a tractable approximate posterior q(W | X, ν).
      q(W | X, ν) comes from a family of simpler distributions over the latent variables W, defined by a set of free variational parameters ν.
      Unlike in EM, KL(q || p) > 0 for any q, since q only approximates p.

  16. Variational EM
      Initialization: define the initial model θ_old and the variational distribution q(W | X, ν).
      E-step: find the variational parameters ν that maximize the variational lower bound, i.e. bring q(W | X, ν) as close as possible to the true posterior p(W | X, θ_old).
      M-step: find the model parameters θ_new that maximize the expectation of p(W, X | θ) under the variational posterior q(W | X, ν). Set θ_old := θ_new.

  17. Blei and Jordan's mean-field variational inference for DPs

  18. Variational inference
      Define a family of variational distributions q_ν(w) with variational parameters ν = ν_1, ..., ν_M that are specific to each observation x_i.
      Set ν to minimize the KL divergence between q_ν(w) and p(w | x, θ):
        D(q_ν(w) || p(w | x, θ)) = E_q[log q_ν(W)] − E_q[log p(W, x | θ)] + log p(x | θ)
      (Here, log p(x | θ) can be ignored when finding q.)
      This is equivalent to maximizing a lower bound on log p(x | θ):
        log p(x | θ) = E_q[log p(W, x | θ)] − E_q[log q_ν(W)] + D(q_ν(w) || p(w | x, θ))
        log p(x | θ) ≥ E_q[log p(W, x | θ)] − E_q[log q_ν(W)]

  19. q_ν(W) for DPMs
      Blei and Jordan again use the stick-breaking construction. Hence, the latent variables are W = (V, η*, Z):
      - V: the T − 1 truncated stick lengths
      - η*: the T component parameters
      - Z: the cluster assignments of the N data points

  20. Variational inference for DPMs
      In general: log p(x | θ) ≥ E_q[log p(W, x | θ)] − E_q[log q_ν(W)]
      For DPMs: θ = (α, λ); W = (V, η*, Z)
        log p(x | α, λ) ≥ E_q[log p(V | α)] + E_q[log p(η* | λ)]
                          + ∑_n ( E_q[log p(Z_n | V)] + E_q[log p(x_n | Z_n)] )
                          − E_q[log q_ν(V, η*, Z)]
      Problem: V = {V_1, V_2, ...} and η* = {η_1*, η_2*, ...} are infinite.
      Solution: use a truncated representation.

  21. Variational approximations q_ν(v, η*, z)
      [Plate diagram: γ_t → V_t, τ_t → η_t* (t = 1...T); φ_n → Z_n (n = 1...N)]
      The variational parameters are ν = (γ_{1...T−1}, τ_{1...T}, φ_{1...N}):
        q_ν(v, η*, z) = ∏_{t<T} q_{γ_t}(v_t) ∏_{t≤T} q_{τ_t}(η_t*) ∏_{n≤N} q_{φ_n}(z_n)
      - q_{γ_t}(v_t): Beta distributions with variational parameters γ_t
      - q_{τ_t}(η_t*): conjugate priors for η, with parameters τ_t
      - q_{φ_n}(z_n): multinomials with variational parameters φ_n
      A coordinate-ascent sketch for the multinomial/Dirichlet case follows below.
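
Blei and Jordan optimize these parameters by coordinate ascent on the variational lower bound. The sketch below shows one round of mean-field updates for the multinomial/Dirichlet instance, where E_q[log V_t], E_q[log(1 − V_t)], and E_q[log η_t*] are digamma expressions under the Beta and Dirichlet factors. The concrete likelihood, the initialization, and all hyperparameter values are illustrative assumptions, and the sketch omits the ELBO-based convergence check.

```python
import numpy as np
from scipy.special import digamma

def coordinate_ascent_step(x, gamma, tau, alpha, lam):
    """One round of mean-field updates for a truncated multinomial/Dirichlet DPM."""
    T, vocab = tau.shape
    # E_q[log V_t] and E_q[log(1 - V_t)] under the Beta factors; V_T is fixed to 1.
    Elog_v, Elog_1mv = np.zeros(T), np.zeros(T)
    Elog_v[:-1] = digamma(gamma[:, 0]) - digamma(gamma.sum(axis=1))
    Elog_1mv[:-1] = digamma(gamma[:, 1]) - digamma(gamma.sum(axis=1))
    # E_q[log V_t] + sum_{j<t} E_q[log(1 - V_j)]  (the stick part of the phi update)
    Elog_stick = Elog_v + np.concatenate(([0.0], np.cumsum(Elog_1mv[:-1])))
    # E_q[log eta*_{t,w}] under the Dirichlet factors q_{tau_t}
    Elog_eta = digamma(tau) - digamma(tau.sum(axis=1, keepdims=True))
    # Update phi_n: phi_{n,t} proportional to exp(stick term + E_q[log eta*_{t,x_n}])
    log_phi = Elog_stick[None, :] + Elog_eta[:, x].T              # shape (N, T)
    phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
    phi /= phi.sum(axis=1, keepdims=True)
    # Update gamma_t: Beta parameters of the T-1 free stick lengths
    S = phi.sum(axis=0)                                           # S_t = sum_n phi_{n,t}
    tail = np.cumsum(S[::-1])[::-1]                               # tail[t] = S_t + S_{t+1} + ...
    gamma[:, 0] = 1.0 + S[:-1]
    gamma[:, 1] = alpha + tail[1:]
    # Update tau_t: Dirichlet parameters of the component posteriors
    tau = lam + phi.T @ np.eye(vocab)[x]
    return phi, gamma, tau

rng = np.random.default_rng(0)
x = rng.choice(4, size=100)                                       # toy categorical data
T, vocab, alpha, lam = 10, 4, 1.0, 0.5
gamma = np.ones((T - 1, 2))
tau = np.ones((T, vocab)) + rng.random((T, vocab))                # random init breaks symmetry
for _ in range(50):
    phi, gamma, tau = coordinate_ascent_step(x, gamma, tau, alpha, lam)
print(phi.sum(axis=0).round(2))                                   # expected cluster occupancies
```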
