bayesian nonparametrics
Bayesian Nonparametrics, Charlie Frogner, 9.520 Class 11, March 14, 2012

  1. Bayesian Nonparametrics Charlie Frogner 9.520 Class 11 March 14, 2012 C. Frogner Bayesian Nonparametrics

  2. About this class. Last time: Bayesian formulation of RLS, for regression. (Basically, a normal distribution.) This time: a more complicated probability model, the Dirichlet process, and its application to clustering. Also more Bayesian terminology.

  3. Plan. (1) Dirichlet distribution and other basics. (2) The Dirichlet process: abstract definition, stick breaking, Chinese restaurant process. (3) Clustering: Dirichlet process mixture model, hierarchical Dirichlet process mixture model.

  4. Gamma Function and Beta Distribution
     The Gamma function: $\Gamma(z) = \int_0^\infty x^{z-1} e^{-x}\,dx$. It extends the factorial function to $\mathbb{R}^+$: $\Gamma(z+1) = z\,\Gamma(z)$.
     Beta distribution: $P(x \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1}$ for $x \in [0,1]$, $\alpha > 0$, $\beta > 0$.
     (Mean: $\frac{\alpha}{\alpha+\beta}$; variance: $\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$.)
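
The identities on this slide can be checked numerically. The following is a minimal sketch using only Python's standard `math` module; the helper names `beta_pdf` and `beta_mean_var` are illustrative and not part of the lecture.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at a point x in (0, 1)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


def beta_mean_var(a, b):
    """Closed-form mean and variance of Beta(a, b)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


# Gamma extends the factorial: Gamma(n + 1) = n!, so Gamma(5) = 4! = 24.
gamma_5 = math.gamma(5)

# The recursion Gamma(z + 1) = z * Gamma(z) also holds at non-integer z.
z = 2.5
recursion_gap = abs(math.gamma(z + 1) - z * math.gamma(z))

# Beta(2, 2) is symmetric around 0.5.
mean, var = beta_mean_var(2.0, 2.0)
density_at_half = beta_pdf(0.5, 2.0, 2.0)
```

For Beta(2, 2) the normalizing constant is Γ(4)/(Γ(2)Γ(2)) = 6, so the density at 0.5 is 6 · 0.5 · 0.5 = 1.5.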

  5. Beta Distribution. [Figure: two panels of Beta densities on $[0, 1]$. Left panel: $(\alpha, \beta)$ = (1, 1), (2, 2), (5, 3), (4, 9). Right panel: $(\alpha, \beta)$ = (1.0, 1.0), (1.0, 3.0), (1.0, 0.3), (0.3, 0.3).] For large parameters the distribution is unimodal. For small parameters it favors biased binomial distributions.

  6. Dirichlet Distribution
     Generalizes the Beta distribution to the K-dimensional simplex $S_K = \{x \in \mathbb{R}^K : \sum_{i=1}^K x_i = 1,\ x_i \ge 0\ \forall i\}$.
     Dirichlet distribution: $P(x \mid \alpha) = P(x_1, \dots, x_K) = \frac{\Gamma\left(\sum_{i=1}^K \alpha_i\right)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K x_i^{\alpha_i - 1}$, where $\alpha = (\alpha_1, \dots, \alpha_K)$, $\alpha_i > 0\ \forall i$, $x \in S_K$.
     We write $x \sim \mathrm{Dir}(\alpha)$, i.e. $(x_1, \dots, x_K) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$.
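
As a sketch of how the Dirichlet density might be evaluated in practice, the function below works in log space with `math.lgamma` to avoid overflow in the Gamma terms. The name `dirichlet_logpdf` and the restriction to strictly positive coordinates (so that every logarithm is finite) are choices made for this example.

```python
import math


def dirichlet_logpdf(x, alpha):
    """Log density of Dir(alpha) at a point x in the interior of the simplex."""
    if len(x) != len(alpha):
        raise ValueError("x and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("all alpha_i must be positive")
    if any(xi <= 0 for xi in x) or abs(sum(x) - 1.0) > 1e-9:
        raise ValueError("x must have positive entries that sum to 1")
    # log of Gamma(sum alpha_i) / prod Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    # log of prod x_i^(alpha_i - 1)
    log_kernel = sum((a - 1) * math.log(xi) for a, xi in zip(alpha, x))
    return log_norm + log_kernel


# With alpha = (1, 1, 1) the density is constant and equal to Gamma(3) = 2.
flat_density = math.exp(dirichlet_logpdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))

# With K = 2 the Dirichlet reduces to the Beta: Beta(2, 2) at 0.5 is 1.5.
beta_case = math.exp(dirichlet_logpdf([0.5, 0.5], [2.0, 2.0]))
```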

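  8. Properties of the Dirichlet Distribution
     Mean: $E[x_i] = \frac{\alpha_i}{\sum_{j=1}^K \alpha_j}$.
     Variance: $\mathrm{Var}[x_i] = \frac{\alpha_i \left(\sum_{j \neq i} \alpha_j\right)}{\left(\sum_{j=1}^K \alpha_j\right)^2 \left(1 + \sum_{j=1}^K \alpha_j\right)}$.
     Covariance (for $i \neq j$): $\mathrm{Cov}(x_i, x_j) = -\frac{\alpha_i \alpha_j}{\left(\sum_{k=1}^K \alpha_k\right)^2 \left(1 + \sum_{k=1}^K \alpha_k\right)}$.
     Marginals: $x_i \sim \mathrm{Beta}\left(\alpha_i, \sum_{j \neq i} \alpha_j\right)$.
     Aggregation: $(x_1 + x_2, x_3, \dots, x_K) \sim \mathrm{Dir}(\alpha_1 + \alpha_2, \alpha_3, \dots, \alpha_K)$.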
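
A quick Monte Carlo check of these moments, offered as a sketch rather than a proof. It assumes NumPy and an arbitrary illustrative choice α = (2, 3, 5). It relies on the standard fact that distinct components of a Dirichlet vector are negatively correlated (they must sum to one), so the covariance for i ≠ j carries a minus sign.

```python
import numpy as np

alpha = np.array([2.0, 3.0, 5.0])   # illustrative parameters
alpha0 = alpha.sum()

# Closed-form moments.
mean = alpha / alpha0
var = alpha * (alpha0 - alpha) / (alpha0 ** 2 * (1 + alpha0))
cov_01 = -alpha[0] * alpha[1] / (alpha0 ** 2 * (1 + alpha0))

# Empirical moments from samples.
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=200_000)
emp_mean = samples.mean(axis=0)
emp_var = samples.var(axis=0)
emp_cov_01 = np.cov(samples[:, 0], samples[:, 1])[0, 1]

# Aggregation: x_1 + x_2 is the first coordinate of Dir(alpha_1 + alpha_2, alpha_3),
# so its mean should be (alpha_1 + alpha_2) / alpha0.
agg_mean = (samples[:, 0] + samples[:, 1]).mean()
```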

  9. Multinomial Distribution
     If you throw n balls into K bins, the distribution of balls into bins is given by the multinomial distribution.
     Let $p = (p_1, \dots, p_K)$ be probabilities over K categories and $C = (C_1, \dots, C_K)$ be category counts. $C_i$ is the number of samples in the i-th category, from n independent draws of a categorical variable with category probabilities p. Then
     $P(C \mid n, p) = \frac{n!}{\prod_{i=1}^K C_i!} \prod_{i=1}^K p_i^{C_i}$.
     For K = 2 this is the binomial distribution.
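
The multinomial probability can be computed directly from the formula. This is a small sketch in plain Python; `multinomial_pmf` is a name chosen for the example, and the probability vectors are made up.

```python
import math


def multinomial_pmf(counts, p):
    """P(C | n, p) for category counts C and category probabilities p."""
    if len(counts) != len(p):
        raise ValueError("counts and p must have the same length")
    n = sum(counts)
    # Multinomial coefficient n! / (C_1! * ... * C_K!), kept as an exact integer.
    coef = math.factorial(n)
    for c in counts:
        coef //= math.factorial(c)
    prob = float(coef)
    for c, p_i in zip(counts, p):
        prob *= p_i ** c
    return prob


# Three draws over three categories: 3!/(2! 1! 0!) * 0.5^2 * 0.3 = 0.225.
three_way = multinomial_pmf([2, 1, 0], [0.5, 0.3, 0.2])

# K = 2 recovers the binomial: C(3, 2) * 0.5^3 = 0.375.
two_way = multinomial_pmf([2, 1], [0.5, 0.5])
binomial = math.comb(3, 2) * 0.5 ** 3
```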

  11. An idea. Treat the Dirichlet distribution as a distribution on probabilities: each sample $\theta \sim \mathrm{Dir}(\alpha)$ defines a K-dimensional multinomial distribution: $x \sim \mathrm{Mult}(\theta)$, $\theta \sim \mathrm{Dir}(\alpha)$. Posterior on $\theta$: $\theta \mid x \sim \mathrm{Dir}(\alpha + x)$, where x is the vector of observed category counts.
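
The posterior update amounts to adding the observed counts to the prior parameters. A minimal sketch with NumPy follows; the "true" probabilities `theta_true`, the uniform prior, and the sample size are hypothetical values chosen only to illustrate the update.

```python
import numpy as np


def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters after observing multinomial counts."""
    alpha = np.asarray(alpha, dtype=float)
    counts = np.asarray(counts, dtype=float)
    if alpha.shape != counts.shape:
        raise ValueError("alpha and counts must have the same shape")
    return alpha + counts


rng = np.random.default_rng(0)
alpha = np.array([1.0, 1.0, 1.0])         # uniform prior on the simplex
theta_true = np.array([0.6, 0.3, 0.1])    # hypothetical data-generating probabilities
counts = rng.multinomial(1000, theta_true)

post = dirichlet_posterior(alpha, counts)
post_mean = post / post.sum()             # E[theta | x], using the Dirichlet mean formula
```

With 1000 observations the posterior mean should sit close to `theta_true`, since the counts dominate the prior pseudo-counts.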

  12. Conjugate Priors. Say $x \sim F(\theta)$ (the likelihood) and $\theta \sim G(\alpha)$ (the prior). G is a conjugate prior for F if the posterior $P(\theta \mid x, \alpha)$ is in the same family as G. (E.g. if G is Gaussian then $P(\theta \mid x, \alpha)$ should also be Gaussian.) So the Dirichlet distribution is a conjugate prior for the multinomial.

  14. Parametric vs. nonparametric. Parametric: a fixed number of parameters, independent of the data. Nonparametric: the effective number of parameters can grow with the data. E.g. density estimation: fitting a Gaussian vs. Parzen windows. E.g. kernel methods are nonparametric.

  15. Dirichlet Process. Want: distribution on all K-dimensional simplices (for all K). Informal description: X is a space, F is a probability distribution on X, and $\mathcal{F}(X)$ is the set of all possible distributions on X. A Dirichlet process gives a distribution over $\mathcal{F}(X)$. A sample path from a DP is an element $F \in \mathcal{F}(X)$; F can be seen as a (random) probability distribution on X.

  16. Dirichlet Process. Want: distribution on all K-dimensional simplices (for all K). Formal definition: let X be a space and H be the base measure on X. F is a sample from the Dirichlet process $\mathrm{DP}(\alpha, H)$ on X if its finite-dimensional marginals have the Dirichlet distribution:
     $(F(B_1), \dots, F(B_K)) \sim \mathrm{Dir}(\alpha H(B_1), \dots, \alpha H(B_K))$
     for all partitions $B_1, \dots, B_K$ of X (for any K).

  17. Stick Breaking Construction. Explicit construction of a DP. Let $\alpha > 0$ and define weights $(\pi_i)_{i=1}^\infty$ by
     $\pi_i = \beta_i \prod_{j=1}^{i-1} (1 - \beta_j) = \beta_i \left(1 - \sum_{j=1}^{i-1} \pi_j\right)$,
     where $\beta_i \sim \mathrm{Beta}(1, \alpha)$ for all i. Let H be a distribution on X and define
     $F = \sum_{i=1}^\infty \pi_i \delta_{\theta_i}$,
     where $\theta_i \sim H$ for all i.
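
The construction can be simulated by truncating the infinite sum after a finite number of sticks. The sketch below assumes NumPy; the truncation level, the function name `stick_breaking`, and the choice of a standard normal for the base distribution H are all assumptions made for illustration.

```python
import numpy as np


def stick_breaking(alpha, num_atoms, base_sampler, rng):
    """Truncated stick-breaking draw: returns weights and atom locations."""
    betas = rng.beta(1.0, alpha, size=num_atoms)
    # Stick length remaining before each break: prod_{j < i} (1 - beta_j).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining
    atoms = base_sampler(rng, num_atoms)   # theta_i ~ H, drawn independently
    return weights, atoms


def standard_normal_base(rng, n):
    """Illustrative base distribution H = N(0, 1)."""
    return rng.normal(0.0, 1.0, size=n)


rng = np.random.default_rng(0)
weights, atoms = stick_breaking(2.0, 200, standard_normal_base, rng)

# Mass not covered by the truncation; it shrinks geometrically with num_atoms.
leftover = 1.0 - weights.sum()
```

The pair `(weights, atoms)` represents an approximate draw F, a discrete distribution that puts mass `weights[i]` on `atoms[i]`.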

  18. Stick Breaking Construction: Interpretation. [Figure: a unit-length stick broken successively into weights $\pi_1, \dots, \pi_5$ using proportions $\beta_1, \dots, \beta_5$, with plots of sampled weights $\pi_k$ against k (k from 0 to 20) for $\alpha = 1$ and $\alpha = 5$.] The weights $\pi$ partition a unit-length stick into infinitely many pieces: the i-th weight is a random proportion $\beta_i$ of the stick remaining after sampling the first $i - 1$ weights.

  19. Stick Breaking Construction (cont.). It is possible to prove (Sethuraman ’94) that the previous construction returns a DP and, conversely, that a sample from a Dirichlet process is discrete almost surely.

  20. Chinese Restaurant Process. There is an infinite (countable) set of tables. The first customer sits at the first table. Customer i sits at occupied table j with probability $\frac{n_j}{\alpha + i - 1}$, where $n_j$ is the number of customers already at table j, and sits at the first open table with probability $\frac{\alpha}{\alpha + i - 1}$.
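
The seating rule can be simulated directly. The sketch below assumes NumPy and uses the standard normalization in which the i-th customer finds i − 1 customers already seated, so the seating probabilities are n_j / (α + i − 1) for an occupied table and α / (α + i − 1) for a new one. The function name is illustrative.

```python
import numpy as np


def chinese_restaurant_process(num_customers, alpha, rng):
    """Simulate table assignments; returns (assignments, table counts)."""
    counts = []        # counts[j] = number of customers at table j
    assignments = []   # assignments[i] = table index of customer i + 1
    for i in range(1, num_customers + 1):
        # i - 1 customers are seated, so the weights below sum to alpha + i - 1.
        probs = np.array(counts + [alpha], dtype=float) / (alpha + i - 1)
        table = int(rng.choice(len(probs), p=probs))
        if table == len(counts):
            counts.append(1)       # open a new table
        else:
            counts[table] += 1     # join an occupied table
        assignments.append(table)
    return assignments, counts


rng = np.random.default_rng(0)
assignments, counts = chinese_restaurant_process(500, alpha=2.0, rng=rng)
num_tables = len(counts)
```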

  21. The Role of the Strength Parameter. Note that $E[\beta_i] = 1/(1 + \alpha)$. For small $\alpha$, the first few components will have all the mass. For large $\alpha$, F approaches the distribution H, assigning uniform weights to the samples $\theta_i$.
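
To make the effect of the strength parameter concrete: the stick-breaking proportions are independent Beta(1, α) draws with E[1 − β] = α / (1 + α), so the expected stick length left after k breaks is (α / (1 + α))^k. The sketch below evaluates the expected mass captured by the first k components; the function name is illustrative.

```python
def expected_mass_first_k(alpha, k):
    """Expected total weight of the first k stick-breaking components."""
    # Expected remaining stick after k independent breaks: (alpha / (1 + alpha))^k.
    return 1.0 - (alpha / (1.0 + alpha)) ** k


small_alpha_mass = expected_mass_first_k(0.1, 3)    # nearly all mass in 3 components
large_alpha_mass = expected_mass_first_k(10.0, 3)   # only about a quarter of the mass
```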

  22. Number of Clusters and Strength Parameter. It is possible to prove (Antoniak ’74) that the number of components with positive count grows as $\alpha \log n$ as we increase the number of samples n.
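
The growth rate can be seen from the Chinese restaurant process view, assuming the standard seating rule in which customer i opens a new table with probability α / (α + i − 1). Summing these probabilities gives the expected number of occupied tables after n customers, which the sketch below compares against α log n. The function name is illustrative.

```python
import math


def expected_num_tables(n, alpha):
    """Expected number of occupied tables after n customers in a CRP."""
    return sum(alpha / (alpha + i - 1) for i in range(1, n + 1))


alpha = 2.0
n = 100_000
exact = expected_num_tables(n, alpha)
approx = alpha * math.log(n)
ratio = exact / approx   # tends to 1 slowly as n grows
```

For α = 1 the sum is the harmonic number H_n, which makes the logarithmic growth explicit.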

  23. Another idea. Clustering with the K-dimensional Dirichlet: take each sample $\theta \sim \mathrm{Dir}(\alpha)$ to define a K-dimensional categorical (instead of multinomial) distribution: $x \sim G(\phi)$, $\phi \sim \mathrm{Cat}(\theta)$, $\theta \sim \mathrm{Dir}(\alpha)$. (G is a distribution on observation space X, say, Gaussian.) $\theta_i$ is the probability of x coming from the i-th cluster.
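
The generative process on this slide can be sketched as follows, assuming NumPy and taking G to be a unit-variance Gaussian whose mean depends on the cluster index. The cluster means, the value of α, and the function name are hypothetical choices for the example.

```python
import numpy as np


def sample_finite_mixture(n, alpha, means, rng, noise_std=1.0):
    """Draw n points from a finite Dirichlet mixture of Gaussians."""
    alpha = np.asarray(alpha, dtype=float)
    means = np.asarray(means, dtype=float)
    theta = rng.dirichlet(alpha)                    # theta ~ Dir(alpha)
    phi = rng.choice(len(alpha), size=n, p=theta)   # phi ~ Cat(theta), one per point
    x = rng.normal(means[phi], noise_std)           # x ~ G(phi), Gaussian around the cluster mean
    return theta, phi, x


rng = np.random.default_rng(0)
theta, phi, x = sample_finite_mixture(
    n=1000,
    alpha=[1.0, 1.0, 1.0],
    means=[-5.0, 0.0, 5.0],   # hypothetical cluster centers
    rng=rng,
)
```

Here the number of clusters K is fixed in advance by the length of `alpha`, which is the limitation the Dirichlet process removes.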

  26. Another idea. Clustering with the Dirichlet process: take each sample $\theta \sim \mathrm{DP}(\alpha, H)$ to define a K-dimensional categorical (instead of multinomial) distribution: $x \sim G(\phi)$, $\phi \sim \mathrm{Cat}(\theta)$, $\theta \sim \mathrm{DP}(\alpha, H)$. (G is a distribution on observation space X, say, Gaussian. H can be uniform on $\{1, \dots, K\}$.)

  28. Another idea. Clustering with the Dirichlet process: $x \sim G(\phi)$, $\phi \sim \mathrm{Cat}(\theta)$, $\theta \sim \mathrm{DP}(\alpha, H)$. This is the Dirichlet process mixture model.
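
One way to draw data from a Dirichlet process mixture is to integrate out θ and assign points to clusters with the Chinese restaurant process, giving each new cluster a parameter drawn from H. The sketch below takes that route under assumptions not fixed by the slides: NumPy, a Gaussian base distribution H = N(0, `base_std`²) over cluster means, and Gaussian observations with standard deviation `noise_std`. It generates data only; it does not perform inference.

```python
import numpy as np


def sample_dp_mixture(n, alpha, rng, base_std=5.0, noise_std=1.0):
    """Draw n points from a DP mixture of Gaussians via the CRP representation."""
    counts = []   # counts[k] = number of points in cluster k
    params = []   # params[k] = mean of cluster k, drawn from H
    labels = []
    for i in range(1, n + 1):
        # Existing cluster k with prob n_k / (alpha + i - 1), new with alpha / (alpha + i - 1).
        probs = np.array(counts + [alpha], dtype=float) / (alpha + i - 1)
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)
            params.append(rng.normal(0.0, base_std))   # new cluster mean ~ H (assumed Gaussian)
        else:
            counts[k] += 1
        labels.append(k)
    labels = np.array(labels)
    params = np.array(params)
    x = rng.normal(params[labels], noise_std)          # x ~ G(cluster mean)
    return x, labels, params


rng = np.random.default_rng(0)
x, labels, params = sample_dp_mixture(n=500, alpha=1.0, rng=rng)
num_clusters = len(params)
```

Unlike the finite mixture, the number of clusters is not fixed in advance: it is determined by the draw and tends to grow with n.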
