

  1. Convergence of latent mixing measures in finite and infinite mixture models
     Long Nguyen, Department of Statistics, University of Michigan
     BNP Workshop, ICERM 2012

  2. Outline
     1. Identifiability and consistency in mixture model-based clustering
        - convergence of mixing measures
     2. Wasserstein metric
     3. Posterior concentration rates of mixing measures
        - finite mixture models
        - Dirichlet process mixture models
     4. Implications and proof ideas

  3. Clustering problem
     How do we subdivide $D = \{X_1, \ldots, X_n\}$ in $\mathbb{R}^d$ into clusters?

  4. Clustering problem
     How do we subdivide $D = \{X_1, \ldots, X_n\}$ in $\mathbb{R}^d$ into clusters?
     Assume that the data $X_1, \ldots, X_n$ are an iid sample from a mixture model
     \[ p_G(x) = \sum_{i=1}^{k} p_i \, f(x \mid \theta_i). \]
     How do we guarantee consistent estimates for the mixture components $\theta = (\theta_1, \ldots, \theta_k)$ and $p = (p_1, \ldots, p_k)$?
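Remark: a minimal simulation sketch of the model on this slide, using Gaussian kernels $f(x \mid \theta_i) = N(x \mid \theta_i, I)$ and made-up values for $k$, $p$, and $\theta$ (none of these choices come from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values only: k = 3 components in R^2.
p = np.array([0.5, 0.3, 0.2])                             # mixing probabilities p_i
theta = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])   # component means theta_i

def sample_mixture(n):
    """Draw n iid points from p_G(x) = sum_i p_i N(x | theta_i, I)."""
    z = rng.choice(len(p), size=n, p=p)                    # latent component labels
    return theta[z] + rng.standard_normal((n, theta.shape[1]))

X = sample_mixture(500)   # plays the role of the data set D = {X_1, ..., X_n}
```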

  5. Bayesian nonparametric approach
     Define the mixing distribution
     \[ G = \sum_{i=1}^{k} p_i \, \delta_{\theta_i} \]
     and endow $G$ with a prior distribution on the space of probability measures $\bar{\mathcal{G}}(\Theta)$:
     - for finite mixtures, $k$ is given: use parametric priors on the mixing probabilities $p$ and atoms $\theta$
     - for infinite mixtures, $k$ is unknown: use a nonparametric prior such as the Dirichlet process, $G \sim \mathrm{DP}(\gamma, H)$

  6. Bayesian nonparametric approach
     Define the mixing distribution
     \[ G = \sum_{i=1}^{k} p_i \, \delta_{\theta_i} \]
     and endow $G$ with a prior distribution on the space of probability measures $\bar{\mathcal{G}}(\Theta)$:
     - for finite mixtures, $k$ is given: use parametric priors on the mixing probabilities $p$ and atoms $\theta$
     - for infinite mixtures, $k$ is unknown: use a nonparametric prior such as the Dirichlet process, $G \sim \mathrm{DP}(\gamma, H)$
     Compute the posterior distribution of $G$ given the data, $\Pi(G \mid X_1, \ldots, X_n)$.
     We are interested in the concentration behavior of the posterior of $G$.
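Remark: a minimal sketch of drawing a (truncated) $G \sim \mathrm{DP}(\gamma, H)$ via Sethuraman's stick-breaking construction; the truncation level, $\gamma = 1$, and base measure $H = N(0, 1)$ are illustrative assumptions, not choices made in the talk:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_dp_stick_breaking(gamma, base_sampler, truncation=200):
    """Approximate draw of G = sum_i p_i delta_{theta_i} from DP(gamma, H),
    truncated at a fixed number of atoms (weights sum to slightly less than 1)."""
    v = rng.beta(1.0, gamma, size=truncation)                 # stick-breaking fractions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    p = v * remaining                                         # mixing weights p_i
    theta = base_sampler(truncation)                          # atoms theta_i ~ H
    return p, theta

# Illustrative: gamma = 1 and base measure H = N(0, 1) on the real line.
p, theta = draw_dp_stick_breaking(1.0, lambda m: rng.standard_normal(m))
```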

  7. Posterior concentration of the mixing measure $G$
     Let $X_1, \ldots, X_n$ be an iid sample from the mixture density
     \[ p_G(x) = \int f(x \mid \theta) \, G(d\theta), \]
     where the kernel $f$ is known, while $G = G_0$ is an unknown discrete mixing measure.
     Questions:
     - when does the posterior distribution $\Pi(G \mid X_1, \ldots, X_n)$ concentrate most of its mass around the "truth" $G_0$?
     - what is the rate of concentration (convergence)?

  8. Related work
     Significant advances in posterior asymptotics (i.e., posterior consistency and convergence rates):
     - general theory: Barron, Schervish & Wasserman (1999), Ghosal, Ghosh & van der Vaart (2000), Shen & Wasserman (2000), Walker (2004), Ghosal & van der Vaart (2007), Walker, Lijoi & Prünster (2007), ...; going back to the work of Schwartz (1965) and Le Cam (1973)
     - mixture models: Ghosal, Ghosh & Ramamoorthi (1999), Genovese & Wasserman (2000), Ishwaran & Zarepour (2002), Ghosal & van der Vaart (2007), ...
     These works focus mostly on the posterior concentration behavior of the data density $p_G$, not on the mixing measure $G$ per se.

  9. Related work on mixture models
     Convergence of the parameters $p$ and $\theta$ in certain finite mixture settings:
     - polynomial-time learnable settings: Kalai, Moitra & Valiant (2010), Belkin & Sinha (2010); going back to Dasgupta (2000)
     - overfitted setting: Rousseau & Mengersen (JRSS B, 2011)
     Convergence of the mixing measure $G$ in a univariate finite mixture:
     - settled by Jiahua Chen (AoS, 1995), who established the optimal rate $n^{-1/4}$
     - Bayesian asymptotics by Ishwaran, James & Sun (JASA, 2001)
     Literature on deconvolution in kernel density estimation in the 1980s and early 1990s (Hall, Carroll, Fan, Zhang, ...)
     The posterior concentration behavior of mixing measures in multivariate finite mixtures and in infinite mixtures remains unresolved.

  10. Outline
      1. Identifiability and consistency in mixture model-based clustering
      2. Wasserstein metric
      3. Posterior concentration rates of mixing measures
      4. Implications and proof ideas

  11. Optimal transportation problem (Monge/Kantorovich; cf. Villani, '03)
      How do we optimally transport goods from a collection of producers to a collection of consumers, all of which are located in some space?
      (Figure: squares mark the locations of producers; circles mark the locations of consumers.)
      The optimal cost of transportation defines a (Wasserstein) distance between the "production density" and the "consumption density".

  12. Wasserstein metric (cont.)
      Let $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$ and $G' = \sum_{j=1}^{k'} p'_j \delta_{\theta'_j}$. A coupling between $p$ and $p'$ is a joint distribution $q$ on $[1, \ldots, k] \times [1, \ldots, k']$ whose marginals are $p$ and $p'$. That is, for any $(i, j) \in [1, \ldots, k] \times [1, \ldots, k']$,
      \[ \sum_{i=1}^{k} q_{ij} = p'_j, \qquad \sum_{j=1}^{k'} q_{ij} = p_i. \]
      Definition. Let $\rho$ be a distance function on $\Theta$. The Wasserstein distance is defined by
      \[ d_\rho(G, G') = \inf_{q} \sum_{i,j} q_{ij} \, \rho(\theta_i, \theta'_j). \]
      When $\Theta \subset \mathbb{R}^d$ and $\rho$ is the Euclidean metric on $\mathbb{R}^d$, for $r \ge 1$ use $\rho^r$ as the distance function on $\mathbb{R}^d$ to obtain the $L_r$ Wasserstein metric:
      \[ W_r(G, G') := \Big( \inf_{q} \sum_{i,j} q_{ij} \, \|\theta_i - \theta'_j\|^r \Big)^{1/r}. \]
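Remark: a minimal sketch of how $W_r(G, G')$ between two discrete measures can be computed by solving the transport linear program above; the use of scipy.optimize.linprog and the particular weights and atoms below are illustrative assumptions, not part of the talk:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(p, theta, p2, theta2, r=2):
    """L_r Wasserstein distance between G = sum_i p[i] delta_{theta[i]} and
    G' = sum_j p2[j] delta_{theta2[j]}, via the coupling (transport) LP."""
    k, k2 = len(p), len(p2)
    # Cost matrix C_ij = ||theta_i - theta'_j||^r.
    C = np.linalg.norm(theta[:, None, :] - theta2[None, :, :], axis=-1) ** r
    # Marginal constraints: row sums of q equal p, column sums equal p2.
    A_eq = np.zeros((k + k2, k * k2))
    for i in range(k):
        A_eq[i, i * k2:(i + 1) * k2] = 1.0
    for j in range(k2):
        A_eq[k + j, j::k2] = 1.0
    b_eq = np.concatenate([p, p2])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun ** (1.0 / r)

# Illustrative measures on R^2 (made-up weights and atoms).
G_p, G_theta = np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 0.0]])
H_p, H_theta = np.array([0.3, 0.7]), np.array([[0.0, 0.5], [1.0, 0.5]])
print(wasserstein_discrete(G_p, G_theta, H_p, H_theta, r=2))
```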

  13. Examples and Facts
      The Wasserstein distance $W_r$ metrizes weak convergence in the space of probability measures on $\Theta$.

  14. Examples and Facts
      The Wasserstein distance $W_r$ metrizes weak convergence in the space of probability measures on $\Theta$.
      If $\Theta = \mathbb{R}$, then $W_1(G, G') = \| \mathrm{CDF}(G) - \mathrm{CDF}(G') \|_1$.

  15. Examples and Facts
      The Wasserstein distance $W_r$ metrizes weak convergence in the space of probability measures on $\Theta$.
      If $\Theta = \mathbb{R}$, then $W_1(G, G') = \| \mathrm{CDF}(G) - \mathrm{CDF}(G') \|_1$.
      If $G_0 = \delta_{\theta_0}$ and $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$, then
      \[ W_1(G_0, G) = \sum_{i=1}^{k} p_i \, \|\theta_0 - \theta_i\|. \]

  16. Examples and Facts
      The Wasserstein distance $W_r$ metrizes weak convergence in the space of probability measures on $\Theta$.
      If $\Theta = \mathbb{R}$, then $W_1(G, G') = \| \mathrm{CDF}(G) - \mathrm{CDF}(G') \|_1$.
      If $G_0 = \delta_{\theta_0}$ and $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$, then
      \[ W_1(G_0, G) = \sum_{i=1}^{k} p_i \, \|\theta_0 - \theta_i\|. \]
      If $G = \sum_{i=1}^{k} \frac{1}{k} \delta_{\theta_i}$ and $G' = \sum_{j=1}^{k} \frac{1}{k} \delta_{\theta'_j}$, then
      \[ W_1(G, G') = \inf_{\pi} \frac{1}{k} \sum_{i=1}^{k} \|\theta_i - \theta'_{\pi(i)}\|, \]
      where $\pi$ ranges over the set of permutations of $(1, \ldots, k)$.
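Remark: a small numerical check of the last two facts on the real line; scipy.stats.wasserstein_distance computes the 1D $W_1$ distance, and the weights and atoms below are made up for illustration:

```python
import numpy as np
from itertools import permutations
from scipy.stats import wasserstein_distance  # W_1 on the real line (via CDFs)

rng = np.random.default_rng(2)

# Check: W_1(delta_{theta_0}, G) = sum_i p_i |theta_0 - theta_i|.
theta0, p, theta = 0.0, np.array([0.2, 0.3, 0.5]), np.array([1.0, -2.0, 0.5])
lhs = wasserstein_distance([theta0], theta, u_weights=[1.0], v_weights=p)
rhs = np.sum(p * np.abs(theta0 - theta))
print(np.isclose(lhs, rhs))   # expect True

# Check: for equal weights 1/k, W_1 is a minimum over matchings (permutations).
k = 4
a, b = rng.standard_normal(k), rng.standard_normal(k)
w1 = wasserstein_distance(a, b)   # uniform weights by default
best = min(np.mean(np.abs(a - b[list(pi)])) for pi in permutations(range(k)))
print(np.isclose(w1, best))       # expect True
```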

  17. Relations between Wasserstein distances and divergences
      If $W_2(G, G') = 0$, then clearly $G = G'$, so that $V(p_G, p_{G'}) = h(p_G, p_{G'}) = K(p_G, p_{G'}) = 0$.
      It can be shown that an $f$-divergence (e.g., the variational distance $V$, the Hellinger distance $h$, the Kullback-Leibler divergence $K$) between $p_G$ and $p_{G'}$ is always bounded from above by a Wasserstein distance:
      - if $f(x \mid \theta)$ is Gaussian with mean parameter $\theta$, then $h(p_G, p_{G'}) \le W_2(G, G') / (2\sqrt{2})$.
      - if $f(x \mid \theta)$ is Gamma with location parameter $\theta$, then $K(p_G \,\|\, p_{G'}) = O(W_1(G, G'))$.

  18. Relations between Wasserstein distances and divergences
      If $W_2(G, G') = 0$, then clearly $G = G'$, so that $V(p_G, p_{G'}) = h(p_G, p_{G'}) = K(p_G, p_{G'}) = 0$.
      It can be shown that an $f$-divergence (e.g., the variational distance $V$, the Hellinger distance $h$, the Kullback-Leibler divergence $K$) between $p_G$ and $p_{G'}$ is always bounded from above by a Wasserstein distance:
      - if $f(x \mid \theta)$ is Gaussian with mean parameter $\theta$, then $h(p_G, p_{G'}) \le W_2(G, G') / (2\sqrt{2})$.
      - if $f(x \mid \theta)$ is Gamma with location parameter $\theta$, then $K(p_G \,\|\, p_{G'}) = O(W_1(G, G'))$.
      Conversely: if the distance between $p_G$ and $p_{G'}$ is small, can we ensure that $W_2(G, G')$ (or $W_1(G, G')$, etc.) is small?
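Remark: a sketch of the standard convexity argument behind the Gaussian bound above, assuming $f(\cdot \mid \theta) = N(\theta, I_d)$ (included as a reminder, not reproduced from the slides): for any coupling $q$ of $(p, p')$, joint convexity of the squared Hellinger distance gives
\[
  h^2(p_G, p_{G'}) \le \sum_{i,j} q_{ij}\, h^2\big( f(\cdot \mid \theta_i), f(\cdot \mid \theta'_j) \big)
  = \sum_{i,j} q_{ij} \Big( 1 - e^{-\|\theta_i - \theta'_j\|^2 / 8} \Big)
  \le \frac{1}{8} \sum_{i,j} q_{ij} \|\theta_i - \theta'_j\|^2 ,
\]
and taking the infimum over couplings $q$ yields $h(p_G, p_{G'}) \le W_2(G, G') / (2\sqrt{2})$.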

  19. Identifiability in mixture models
      The family $\{ f(\cdot \mid \theta), \theta \in \Theta \}$ is identifiable if for any $G, G' \in \mathcal{G}(\Theta)$, $|p_G(x) - p_{G'}(x)| = 0$ for almost all $x$ implies that $G = G'$.
      Here $\mathcal{G}(\Theta)$ is the space of discrete measures with finitely many support points on $\Theta$, and $\bar{\mathcal{G}}(\Theta)$ is the space of all discrete measures on $\Theta$.

  20. Identifiability in mixture models
      The family $\{ f(\cdot \mid \theta), \theta \in \Theta \}$ is identifiable if for any $G, G' \in \mathcal{G}(\Theta)$, $|p_G(x) - p_{G'}(x)| = 0$ for almost all $x$ implies that $G = G'$.
      Here $\mathcal{G}(\Theta)$ is the space of discrete measures with finitely many support points on $\Theta$, and $\bar{\mathcal{G}}(\Theta)$ is the space of all discrete measures on $\Theta$.
      A stronger notion of identifiability (due to Chen (1995) for the univariate case):
      Strong identifiability. Let $\Theta \subseteq \mathbb{R}^d$. The family $\{ f(\cdot \mid \theta), \theta \in \mathbb{R}^d \}$ is strongly identifiable if $f(x \mid \theta)$ is twice differentiable in $\theta$, and for any finite $k$ and distinct $\theta_1, \ldots, \theta_k$, the equality
      \[
        \sup_{x \in \mathcal{X}} \Big| \sum_{i=1}^{k} \big( \alpha_i f(x \mid \theta_i) + \beta_i^T Df(x \mid \theta_i) + \gamma_i^T D^2 f(x \mid \theta_i)\, \gamma_i \big) \Big| = 0 \tag{1}
      \]
      implies that $\alpha_i = 0$ and $\beta_i = \gamma_i = 0 \in \mathbb{R}^d$ for $i = 1, \ldots, k$. Here, $Df(x \mid \theta_i)$ and $D^2 f(x \mid \theta_i)$ denote the gradient and the Hessian at $\theta_i$ of $f(x \mid \cdot)$, respectively.
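Remark: a heuristic note, not taken from the slides, on why second derivatives enter condition (1): if $G$ has atoms $\theta_i \approx \theta_i^0$ and weights $p_i \approx p_i^0$ matching those of $G_0$, a Taylor expansion gives
\[
  p_G(x) - p_{G_0}(x) \approx \sum_{i=1}^{k} \Big[ (p_i - p_i^0) f(x \mid \theta_i^0)
  + p_i (\theta_i - \theta_i^0)^T Df(x \mid \theta_i^0)
  + \tfrac{p_i}{2} (\theta_i - \theta_i^0)^T D^2 f(x \mid \theta_i^0)(\theta_i - \theta_i^0) \Big],
\]
so if the left-hand side is uniformly small, strong identifiability forces the weight perturbations and the (squared) atom perturbations to be small as well, which is what allows distances between the mixture densities to control Wasserstein distances between the mixing measures.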
