Memoized Online Variational Inference for Dirichlet Process Mixture Models (PowerPoint PPT Presentation)

1. Memoized Online Variational Inference for Dirichlet Process Mixture Models
Michael C. Hughes, Erik B. Sudderth
Department of Computer Science, Brown University
Advances in Neural Information Processing Systems (2013)
Presented by Kyle Ulrich, 26 June 2014

2. Review: Dirichlet process mixture models

A draw $G$ from a DP consists of an infinite collection of atoms:
$$G \sim \mathrm{DP}(\alpha_0 H), \qquad G = \sum_{k=1}^{\infty} w_k \,\delta_{\phi_k}. \tag{1}$$

The mixture weights $w_k$ are represented by the stick-breaking process and the data-generating parameters $\phi_k$ are drawn from the base measure $H$:
$$w_k = v_k \prod_{\ell=1}^{k-1} (1 - v_\ell), \qquad v_k \sim \mathrm{Beta}(1, \alpha_0), \qquad \phi_k \sim H(\lambda_0) \tag{2}$$

Each data point $n = 1, \ldots, N$ has cluster assignment $z_n$ and observation $x_n$ distributed according to
$$z_n \sim \mathrm{Cat}(w), \qquad x_n \sim F(\phi_{z_n}) \tag{3}$$

Often, $H$ and $F$ are assumed to belong to the exponential family.
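As a concrete illustration of this generative process, the following minimal NumPy sketch samples data from a truncated stick-breaking DP mixture. The 1-D Gaussian likelihood for F, the Normal(0, 3^2) base measure for H, and the generous simulation truncation `K_max` are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def sample_dpm_data(N=500, alpha0=1.0, K_max=50, seed=0):
    """Sample from a (truncated) stick-breaking DP mixture, Eqs. (1)-(3).

    Assumes a 1-D Gaussian likelihood F with unit variance and a
    Normal(0, 3^2) base measure H; both are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    # v_k ~ Beta(1, alpha0); w_k = v_k * prod_{l<k} (1 - v_l)
    v = rng.beta(1.0, alpha0, size=K_max)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    # phi_k ~ H(lambda_0)
    phi = rng.normal(0.0, 3.0, size=K_max)
    # z_n ~ Cat(w) (weights renormalized over the finite truncation), x_n ~ F(phi_{z_n})
    z = rng.choice(K_max, size=N, p=w / w.sum())
    x = rng.normal(phi[z], 1.0)
    return x, z, w, phi
```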

3. Overview of inference for DPM models

1. Variational inference is attractive for large-scale datasets.
   However, full-dataset variational inference scales poorly and often converges to poor local optima.
2. Stochastic online (SO) variational inference, alternatively, scales to large datasets.
   On the downside, it is sensitive to the learning rate decay schedule and the choice of batch size.
3. The proposed memoized online (MO) variational inference avoids these noisy gradients and learning rates.
   It requires multiple full passes through the data.
   Birth and merge moves naturally help MO escape local optima.

4. Mean-field variational inference for DP mixture models

With mean-field inference, we seek to obtain a variational distribution
$$q(z, v, \phi) = \prod_{n=1}^{N} q(z_n \mid \hat{r}_n) \prod_{k=1}^{K} q(v_k \mid \hat{\alpha}_{k1}, \hat{\alpha}_{k0}) \, q(\phi_k \mid \hat{\lambda}_k), \tag{4}$$
with the following distributions on the individual factors:
$$q(z_n) = \mathrm{Cat}(\hat{r}_{n1}, \ldots, \hat{r}_{nK}), \qquad q(v_k) = \mathrm{Beta}(\hat{\alpha}_{k1}, \hat{\alpha}_{k0}), \qquad q(\phi_k) = H(\hat{\lambda}_k).$$

The parameters of $q$ are optimized such that the KL divergence from the true posterior is minimized; this results in maximizing the ELBO,
$$\mathcal{L}(q) \triangleq \mathbb{E}_q\big[\log p(x, v, z, \phi \mid \alpha_0, \lambda_0) - \log q(v, z, \phi)\big] \tag{5}$$

Maximizing this ELBO, we iteratively update $\hat{r}_n$, $\hat{\alpha}_{k1}$, $\hat{\alpha}_{k0}$, and $\hat{\lambda}_k$. These batch updates are standard and presented in the paper.
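As a rough sketch of the local step implied by this factorization, the function below recomputes the responsibilities via $\hat{r}_{nk} \propto \exp(\mathbb{E}_q[\log w_k] + \mathbb{E}_q[\log F(x_n \mid \phi_k)])$, the standard mean-field update for stick-breaking DP mixtures. The `log_lik` argument stands in for the expected log-likelihood term, which depends on the chosen exponential-family model; the names are illustrative, not the paper's notation.

```python
import numpy as np
from scipy.special import digamma

def update_responsibilities(log_lik, a1, a0):
    """Local step: recompute r_hat (N x K) given q(v_k) = Beta(a1[k], a0[k]).

    log_lik[n, k] approximates E_q[log F(x_n | phi_k)] for the chosen
    likelihood; this sketch only handles the stick-breaking weight term.
    """
    # E_q[log v_k] and E_q[log(1 - v_k)] under the Beta factors
    e_log_v = digamma(a1) - digamma(a1 + a0)
    e_log_1mv = digamma(a0) - digamma(a1 + a0)
    # E_q[log w_k] from the stick-breaking construction
    e_log_w = e_log_v + np.concatenate(([0.0], np.cumsum(e_log_1mv[:-1])))
    # r_hat_{nk} proportional to exp(E_q[log w_k] + E_q[log F(x_n | phi_k)])
    log_r = e_log_w[None, :] + log_lik
    log_r -= log_r.max(axis=1, keepdims=True)  # numerical stabilization
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```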

5. Truncation strategy

There are many methods to set the truncation level K of the DP:
1. Place artificially large mass on the final component, i.e., $q(v_K = 1) = 1$
2. Set the stick-breaking 'tail' to the prior, i.e., $q(v_k) = p(v_k \mid \alpha_0)$ for $k > K$
3. Truncate the assignments to enforce $q(z_n = k) = 0$ for $k > K$

This work uses method 3 above, which has several advantages:
1. All data is explained by the first K components, which makes the data conditionally independent of all parameters with $k > K$.
2. Therefore, inference only needs to consider a finite set of K atoms.
3. This minimizes unnecessary computation while still approximating the infinite posterior.
4. Truncation is nested: any q with truncation K can be represented exactly under truncation K + 1 with zero mass on the final component.

6. Stochastic online (SO) variational inference

At each iteration t, SO processes only a subset of the data, $\mathcal{B}_t$, sampled uniformly at random from the large corpus.

SO first updates the local factors $q(z_n)$ for $n \in \mathcal{B}_t$. Then, with a noisy gradient step, it updates the global factors from the amplified batch sufficient statistics. For example, for $\hat{\lambda}_k$, compute
$$\hat{\lambda}^{*}_{k} = \lambda_0 + \frac{N}{|\mathcal{B}_t|} \sum_{n \in \mathcal{B}_t} \hat{r}_{nk} \, t(x_n)$$
and then update the global parameter as
$$\hat{\lambda}^{(t)}_{k} \leftarrow \rho_t \hat{\lambda}^{*}_{k} + (1 - \rho_t) \hat{\lambda}^{(t-1)}_{k}$$
where $\rho_t$ is the learning rate; convergence is guaranteed for appropriate decay schedules of $\rho_t$. (A small sketch of this step follows below.)

Performance:
- This has computational advantages and sometimes achieves better solutions than the full-dataset algorithm.
- However, it is sensitive to the learning rate decay schedule and the choice of batch size.

(Here $t(\cdot)$ denotes the sufficient statistics of the observation distribution.)
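A minimal sketch of this noisy global step, assuming `lam` holds the current K x D array of $\hat{\lambda}_k$ values, `t_batch` the precomputed sufficient statistics $t(x_n)$ for the minibatch, and `r_batch` the freshly updated responsibilities; the shapes and names are assumptions for illustration.

```python
import numpy as np

def so_global_step(lam, lam0, r_batch, t_batch, N, rho_t):
    """One stochastic-online update of lambda_hat (K x D).

    r_batch: (|B_t|, K) responsibilities for the minibatch.
    t_batch: (|B_t|, D) sufficient statistics t(x_n).
    """
    B_t = r_batch.shape[0]
    # Amplified batch estimate: lam*_k = lam0 + (N / |B_t|) * sum_n r_nk t(x_n)
    lam_star = lam0 + (N / B_t) * (r_batch.T @ t_batch)
    # Blend with the previous value using the learning rate rho_t
    return rho_t * lam_star + (1.0 - rho_t) * lam
```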

7. Memoized online variational inference

The data is divided into B fixed batches $\{\mathcal{B}_b\}_{b=1}^{B}$. For each component we maintain memoized per-batch sufficient statistics $S^b_k = [\hat{N}_k(\mathcal{B}_b), s_k(\mathcal{B}_b)]$ and track the full-dataset statistics $S^0_k = [\hat{N}_k, s_k(x)]$.

Visit each distinct batch once per full pass through the data:
1. Update local parameters for the current batch, i.e., $\hat{r}_n$ for $n \in \mathcal{B}_b$
2. Update the cached global sufficient statistics for each component (see the sketch below):
$$S^0_k \leftarrow S^0_k - S^b_k, \qquad S^b_k \leftarrow [\hat{N}_k(\mathcal{B}_b), s_k(\mathcal{B}_b)], \qquad S^0_k \leftarrow S^0_k + S^b_k \tag{6}$$
3. Update global parameters, i.e., $\hat{\alpha}_{k1}$, $\hat{\alpha}_{k0}$, and $\hat{\lambda}_k$

Advantages:
1. Unlike SO, MO is guaranteed to improve the ELBO at every step.
2. MO updates reduce to standard full-dataset updates.
3. MO is more scalable and converges faster than the full-dataset algorithm.
4. MO has the same computational complexity as SO, without the need for learning rates.

(Sufficient statistics are defined as $\hat{N}_k \triangleq \mathbb{E}_q[\sum_{n=1}^{N} z_{nk}]$ and $s_k(x) \triangleq \mathbb{E}_q[\sum_{n=1}^{N} z_{nk} \, t(x_n)]$.)
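The cached update in Eq. (6) is a subtract, recompute, add cycle over per-batch arrays. Below is a minimal sketch assuming the statistics are stored as NumPy arrays with the shapes noted in the docstring; the container layout is an assumption, not the paper's implementation.

```python
import numpy as np

def mo_batch_step(S0_N, S0_s, Sb_N, Sb_s, r_batch, t_batch, b):
    """Memoized update of the global sufficient statistics after visiting batch b.

    S0_N: (K,) full-dataset counts N_hat_k;      S0_s: (K, D) full-dataset s_k(x).
    Sb_N: (B, K) cached per-batch counts;        Sb_s: (B, K, D) cached per-batch s_k.
    r_batch: (n_b, K) updated responsibilities;  t_batch: (n_b, D) statistics t(x_n).
    """
    # Subtract the stale contribution of batch b
    S0_N -= Sb_N[b]
    S0_s -= Sb_s[b]
    # Recompute the batch statistics from the freshly updated local factors
    Sb_N[b] = r_batch.sum(axis=0)      # N_hat_k(B_b)
    Sb_s[b] = r_batch.T @ t_batch      # s_k(B_b)
    # Add the new contribution back
    S0_N += Sb_N[b]
    S0_s += Sb_s[b]
    return S0_N, S0_s
```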

8. Birth moves

To escape local optima, we may wish to propose birth moves. This is done in three steps:
1. Collection: during pass 1, subsample data assigned to a targeted component $k'$
2. Creation: before pass 2, fit a DPM with $K'$ new components to the subsampled data
3. Adoption: during pass 2, update parameters with all $K + K'$ components; future merge moves will eliminate redundant components

9. Merge moves

To reduce computational costs, we may wish to propose merge moves. A merge move has three steps:
1. Select components $k_a$ and $k_b$ to merge into $k_m$
2. Form the candidate configuration $q'$ by exploiting the additive property of the sufficient statistics and responsibilities (see the sketch below):
$$S^0_{k_m} = S^0_{k_a} + S^0_{k_b}, \qquad \hat{r}_{n k_m} = \hat{r}_{n k_a} + \hat{r}_{n k_b} \tag{7}$$
3. Accept $q'$ only if the ELBO improves

For each pass through the data, the authors' proposed algorithm performs:
- one birth move,
- memoized ascent steps for all batches,
- several merges after the final batch.
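A minimal sketch of forming and scoring one merge candidate, assuming $k_a < k_b$; `elbo_fn` is a hypothetical caller-supplied routine that evaluates the ELBO of a configuration and is not part of the paper's interface.

```python
import numpy as np

def propose_merge(S0_N, S0_s, r, ka, kb, elbo_fn):
    """Form candidate q' merging components ka and kb; accept only if the ELBO improves."""
    # Additive merge: the merged component absorbs both sets of statistics, Eq. (7)
    N_cand, s_cand, r_cand = S0_N.copy(), S0_s.copy(), r.copy()
    N_cand[ka] += N_cand[kb]
    s_cand[ka] += s_cand[kb]
    r_cand[:, ka] += r_cand[:, kb]
    # Drop component kb from the candidate configuration
    N_cand = np.delete(N_cand, kb)
    s_cand = np.delete(s_cand, kb, axis=0)
    r_cand = np.delete(r_cand, kb, axis=1)
    # Accept q' only if the ELBO does not decrease
    if elbo_fn(N_cand, s_cand, r_cand) >= elbo_fn(S0_N, S0_s, r):
        return N_cand, s_cand, r_cand, True
    return S0_N, S0_s, r, False
```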

10. Results – toy data

Data (N = 100000): synthetic image patches generated by a zero-mean GMM with 8 equally common components. Each component has a $25 \times 25$ covariance matrix producing $5 \times 5$ patches. We wish to recover these covariance matrices and the number of components.

11. Results – MNIST digit clustering

Clustering N = 60000 MNIST images of handwritten digits 0-9. As preprocessing, all images are projected to D = 50 dimensions via PCA.

12. Questions?
