


  1. Bayesian Nonparametrics. Lorenzo Rosasco, 9.520 Class 18, April 11, 2011. L. Rosasco, Bayesian Nonparametrics.

  2. About this class. Goal: to give an overview of some of the basic concepts in Bayesian Nonparametrics and, in particular, to discuss Dirichlet processes, their several characterizations, and their properties.

  3. Plan
  - Parametrics, nonparametrics and priors
  - A reminder on distributions
  - Dirichlet processes: definition, stick breaking, Pólya urn scheme and the Chinese restaurant process

  4. References and Acknowledgments
  This lecture heavily draws (sometimes literally) from the list of references below, which we suggest as further readings. Figures are taken either from Sudderth's PhD thesis or Teh's tutorial.
  Main references/sources:
  - Yee Whye Teh, tutorial at the Machine Learning Summer School, and his notes on Dirichlet processes.
  - Erik Sudderth, PhD thesis.
  - Ghosh and Ramamoorthi, Bayesian Nonparametrics (book).
  See also:
  - Zoubin Ghahramani, ICML tutorial.
  - Michael Jordan, NIPS tutorial.
  - Rasmussen and Williams, Gaussian Processes for Machine Learning (book).
  - Ferguson, paper in the Annals of Statistics.
  - Sethuraman, paper in Statistica Sinica.
  - Berlinet and Thomas-Agnan, RKHS in Probability and Statistics (book).
  Thanks to Dan, Rus and Charlie for various discussions.

  5. Parametrics vs Nonparametrics. We can illustrate the difference between the two approaches by considering the following prototype problems: (1) function estimation, (2) density estimation.

  6. (Parametric) Function Estimation
  Data: $S = (X, Y) = (x_i, y_i)_{i=1}^n$.
  Model: $y_i = f_\theta(x_i) + \epsilon_i$, e.g. $f_\theta(x) = \langle \theta, x \rangle$ and $\epsilon \sim N(0, \sigma^2)$, $\sigma > 0$.
  Prior: $\theta \sim P(\theta)$.
  Posterior: $P(\theta \mid X, Y) = \dfrac{P(\theta)\, P(Y \mid X, \theta)}{P(Y \mid X)}$.
  Prediction: $P(y^* \mid x^*, X, Y) = \int P(y^* \mid x^*, \theta)\, P(\theta \mid X, Y)\, d\theta$.
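As a numerical sketch of the slide above (not part of the original slides), here is the linear-Gaussian special case, where the posterior over $\theta$ is itself Gaussian with closed-form mean and covariance. The specific prior $\theta \sim N(0, \tau^2 I)$, the data, and all variable names are illustrative assumptions:

```python
import numpy as np

# Illustrative sketch: Bayesian linear regression with prior theta ~ N(0, tau^2 I)
# and noise eps ~ N(0, sigma^2). The posterior over theta is Gaussian.
rng = np.random.default_rng(0)
n, d = 50, 3
sigma, tau = 0.5, 1.0
theta_true = np.array([1.0, -2.0, 0.5])   # assumed ground-truth parameter
X = rng.normal(size=(n, d))
y = X @ theta_true + sigma * rng.normal(size=n)

# Standard conjugate-Gaussian posterior: precision = X^T X / sigma^2 + I / tau^2.
A = X.T @ X / sigma**2 + np.eye(d) / tau**2
post_cov = np.linalg.inv(A)
post_mean = post_cov @ X.T @ y / sigma**2
```

With enough data the posterior mean concentrates around the data-generating parameter, which is the point of the prediction integral on the slide.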

  7. (Parametric) Density Estimation
  Data: $S = (x_i)_{i=1}^n$.
  Model: $x_i \sim F_\theta$.
  Prior: $\theta \sim P(\theta)$.
  Posterior: $P(\theta \mid X) = \dfrac{P(\theta)\, P(X \mid \theta)}{P(X)}$.
  Prediction: $P(x^* \mid X) = \int P(x^* \mid \theta)\, P(\theta \mid X)\, d\theta$.

  8. Nonparametrics: a Working Definition. In the above models the number of parameters available for learning is fixed a priori. Ideally, the more data we have, the more parameters we would like to explore. This is, in essence, the idea underlying nonparametric models.

  9. The Right to a Prior
  A finite sequence of random variables is exchangeable if its distribution does not change under permutation of the indices. A sequence is infinitely exchangeable if any finite subsequence is exchangeable.
  De Finetti's Theorem. If the random variables $(x_i)_{i=1}^\infty$ are infinitely exchangeable, then there exists some space $\Theta$ and a corresponding distribution $P(\theta)$ such that the joint distribution of $n$ observations is given by
  $P(x_1, \dots, x_n) = \int_\Theta P(\theta) \prod_{i=1}^n P(x_i \mid \theta)\, d\theta.$

  10. Question. The previous classical result is often advocated as a justification for considering (possibly infinite dimensional) priors. Can we find computationally efficient nonparametric models? We already met one when we considered the Bayesian interpretation of regularization...

  11. Reminder: Stochastic Processes
  Stochastic process: a family $(X_t)_{t \in T}$ of random variables $X_t : (\Omega, P) \to \mathbb{R}$ over some index set $T$. Note that: $X_t(\omega)$, $\omega \in \Omega$, is a number; $X_t(\cdot)$ is a random variable; $X_{(\cdot)}(\omega) : T \to \mathbb{R}$ is a function, called a sample path.

  12. Gaussian Processes
  $GP(f_0, K)$: a Gaussian Process (GP) with mean $f_0$ and covariance function $K$ is a family $(G_x)_{x \in X}$ of random variables over $X$ such that, for any $x_1, \dots, x_n$ in $X$, $(G_{x_1}, \dots, G_{x_n})$ is a multivariate Gaussian. We can define the mean $f_0 : X \to \mathbb{R}$ of the GP from the means $f_0(x_1), \dots, f_0(x_n)$, and the covariance function $K : X \times X \to \mathbb{R}$ by setting $K(x_i, x_j)$ equal to the corresponding entries of the covariance matrix. Then $K$ is a symmetric, positive definite function. A sample path of the GP can be thought of as a random function $f \sim GP(f_0, K)$.
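To make the sample-path idea concrete, here is a small sketch (illustrative, with an assumed squared-exponential covariance, which the slides do not specify) that draws one sample path of a zero-mean GP by sampling the finite-dimensional Gaussian at a grid of inputs:

```python
import numpy as np

# Illustrative: squared-exponential (RBF) covariance; lengthscale is an assumption.
def rbf_kernel(xs, ell=0.3):
    d = xs[:, None] - xs[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(1)
xs = np.linspace(0.0, 1.0, 100)
# Small jitter on the diagonal keeps the covariance numerically positive definite.
K = rbf_kernel(xs) + 1e-8 * np.eye(len(xs))
# One sample path: a draw from the 100-dimensional Gaussian N(0, K).
f = rng.multivariate_normal(np.zeros(len(xs)), K)
```

Evaluating the path on a finer grid simply means sampling a larger multivariate Gaussian, which is exactly the consistency the definition requires.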

  13. (Nonparametric) Function Estimation
  Data: $S = (X, Y) = (x_i, y_i)_{i=1}^n$.
  Model: $y_i = f(x_i) + \epsilon_i$.
  Prior: $f \sim GP(f_0, K)$.
  Posterior: $P(f \mid X, Y) = \dfrac{P(f)\, P(Y \mid X, f)}{P(Y \mid X)}$.
  Prediction: $P(y^* \mid x^*, X, Y) = \int P(y^* \mid x^*, f)\, P(f \mid X, Y)\, df.$
  We have seen that the last equation can be computed in closed form.
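For Gaussian noise the closed form mentioned on the slide is the standard GP regression posterior. A sketch under an assumed RBF kernel and illustrative data (kernel, lengthscale, and noise level are all assumptions):

```python
import numpy as np

# Illustrative sketch of closed-form GP regression with Gaussian noise.
def rbf(a, b, ell=0.1):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

rng = np.random.default_rng(2)
x_train = np.linspace(0.0, 1.0, 20)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=20)
x_test = np.array([0.25, 0.75])

sigma2 = 0.01  # assumed noise variance
K = rbf(x_train, x_train) + sigma2 * np.eye(20)
k_star = rbf(x_test, x_train)

# Predictive mean k_*^T (K + sigma^2 I)^{-1} y and predictive covariance.
pred_mean = k_star @ np.linalg.solve(K, y_train)
pred_cov = rbf(x_test, x_test) - k_star @ np.linalg.solve(K, k_star.T)
```

The predictive mean tracks the underlying function near the training inputs, and the predictive covariance quantifies the remaining uncertainty.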

  14. (Nonparametric) Density Estimation
  Dirichlet Processes (DP) will give us a way to build nonparametric priors for density estimation.
  Data: $S = (x_i)_{i=1}^n$.
  Model: $x_i \sim F$.
  Prior: $F \sim DP(\alpha, H)$.
  Posterior: $P(F \mid X) = \dfrac{P(F)\, P(X \mid F)}{P(X)}$.
  Prediction: $P(x^* \mid X) = \int P(x^* \mid F)\, P(F \mid X)\, dF$.

  15. Plan
  - Parametrics, nonparametrics and priors
  - A reminder on distributions
  - Dirichlet processes: definition, stick breaking, Pólya urn scheme and the Chinese restaurant process

  16. Dirichlet Distribution
  It is a distribution over the $K$-dimensional simplex $S_K$, i.e. over $x \in \mathbb{R}^K$ such that $\sum_{i=1}^K x_i = 1$ and $x_i \geq 0$ for all $i$. The Dirichlet distribution is given by
  $P(x) = P(x_1, \dots, x_K) = \frac{\Gamma(\sum_{i=1}^K \alpha_i)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K x_i^{\alpha_i - 1},$
  where $\alpha = (\alpha_1, \dots, \alpha_K)$ is a parameter vector and $\Gamma$ is the Gamma function. We write $x \sim Dir(\alpha)$, i.e. $x_1, \dots, x_K \sim Dir(\alpha_1, \dots, \alpha_K)$.
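As a quick sanity check (illustrative, using NumPy's built-in Dirichlet sampler; the parameter vector is an arbitrary choice), samples from $Dir(\alpha)$ lie on the simplex and their empirical mean approaches $\alpha / \sum_i \alpha_i$:

```python
import numpy as np

# Illustrative: draw from Dir(alpha) and check the samples live on the simplex.
rng = np.random.default_rng(3)
alpha = np.array([2.0, 3.0, 5.0])
samples = rng.dirichlet(alpha, size=10000)  # shape (10000, 3)
emp_mean = samples.mean(axis=0)             # should approach alpha / alpha.sum()
```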

  17. Dirichlet Distribution (figure slide).

  18. Reminder: Gamma Function and Beta Distribution
  The Gamma function: $\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt$. It is possible to prove that $\Gamma(z + 1) = z\, \Gamma(z)$.
  Beta distribution: the special case of the Dirichlet distribution given by $K = 2$,
  $P(x \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\, \Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}.$
  Note that here $x \in [0, 1]$, whereas for the Dirichlet distribution we would have $x = (x_1, x_2)$ with $x_1, x_2 > 0$ and $x_1 + x_2 = 1$.
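The Gamma recurrence and the beta density can both be checked numerically with only the standard library; in this sketch the function name `beta_pdf` is illustrative, and the normalization is $\Gamma(\alpha + \beta) / (\Gamma(\alpha)\Gamma(\beta))$:

```python
import math

# Beta density written directly from its Gamma-function normalization.
def beta_pdf(x, a, b):
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x ** (a - 1) * (1 - x) ** (b - 1)

# Gamma recurrence: Gamma(z + 1) = z * Gamma(z).
lhs = math.gamma(4.5)
rhs = 3.5 * math.gamma(3.5)
```

For integer arguments the recurrence gives $\Gamma(n) = (n-1)!$, and for $\alpha = \beta = 1$ the beta density reduces to the uniform density on $[0, 1]$.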

  19. Beta Distribution [figure slide: beta densities $p(x \mid \alpha, \beta)$ on $[0, 1]$; one panel for $(\alpha, \beta) = (1, 1), (2, 2), (5, 3), (4, 9)$ and one for $(\alpha, \beta) = (1.0, 1.0), (1.0, 3.0), (1.0, 0.3), (0.3, 0.3)$.] For large parameters the distribution is unimodal. For small parameters it favors biased binomial distributions.

  20. Properties of the Dirichlet Distribution
  Note that the $K$-simplex $S_K$ can be seen as the space of probabilities of a discrete (categorical) random variable with $K$ possible values. Let $\alpha_0 = \sum_{i=1}^K \alpha_i$.
  Expectation: $E[x_i] = \dfrac{\alpha_i}{\alpha_0}$.
  Variance: $V[x_i] = \dfrac{\alpha_i (\alpha_0 - \alpha_i)}{\alpha_0^2 (\alpha_0 + 1)}$.
  Covariance: $Cov(x_i, x_j) = -\dfrac{\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)}$ for $i \neq j$ (negative: the coordinates must sum to one).
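The moment formulas above can be verified by Monte Carlo; this illustrative check uses an arbitrary parameter vector and sample size:

```python
import numpy as np

# Monte Carlo check of the Dirichlet mean and variance formulas.
rng = np.random.default_rng(4)
alpha = np.array([1.0, 2.0, 3.0])
a0 = alpha.sum()
samples = rng.dirichlet(alpha, size=200000)

emp_mean = samples.mean(axis=0)
emp_var = samples.var(axis=0)
th_mean = alpha / a0
th_var = alpha * (a0 - alpha) / (a0**2 * (a0 + 1))
```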

  21. Properties of the Dirichlet Distribution
  Aggregation: let $(x_1, \dots, x_K) \sim Dir(\alpha_1, \dots, \alpha_K)$; then $(x_1 + x_2, x_3, \dots, x_K) \sim Dir(\alpha_1 + \alpha_2, \alpha_3, \dots, \alpha_K)$. More generally, aggregation of any subset of the categories produces a Dirichlet distribution with the corresponding parameters summed as above. In particular, the marginal distribution of any single component of a Dirichlet distribution follows a beta distribution.
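The aggregation property can be checked empirically: summing the first two coordinates of $Dir(1, 2, 3)$ samples should yield samples distributed as $Beta(3, 3)$, i.e. $Dir(1+2, 3)$. An illustrative Monte Carlo sketch:

```python
import numpy as np

# Empirical check of the aggregation property of the Dirichlet distribution.
rng = np.random.default_rng(5)
s = rng.dirichlet([1.0, 2.0, 3.0], size=200000)
agg = s[:, 0] + s[:, 1]  # should be distributed as Beta(1 + 2, 3)

# Theoretical Beta(3, 3) moments for comparison.
a, b = 3.0, 3.0
beta_mean = a / (a + b)
beta_var = a * b / ((a + b) ** 2 * (a + b + 1))
```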

  22. Conjugate Priors
  Let $X \sim F$ and $F \sim P(\cdot \mid \alpha) = P_\alpha$, so that
  $P(F \mid X, \alpha) = \dfrac{P(F \mid \alpha)\, P(X \mid F, \alpha)}{P(X \mid \alpha)}.$
  We say that $P(F \mid \alpha)$ is a conjugate prior for the likelihood $P(X \mid F, \alpha)$ if, for any $X$ and $\alpha$, the posterior distribution $P(F \mid X, \alpha)$ is in the same family as the prior; in this case the prior and posterior are called conjugate distributions. The Dirichlet distribution is conjugate to the multinomial distribution.

  23. Multinomial Distribution
  Let $X$ take values in $\{1, \dots, K\}$. Given $\pi_1, \dots, \pi_K$, define the probability mass function
  $P(X \mid \pi_1, \dots, \pi_K) = \prod_{i=1}^K \pi_i^{\delta_i(X)}.$
  Multinomial distribution: given $n$ observations, the total probability of all possible sequences of length $n$ taking those values is
  $P(x_1, \dots, x_n \mid \pi_1, \dots, \pi_K) = \frac{n!}{\prod_{i=1}^K C_i!} \prod_{i=1}^K \pi_i^{C_i},$
  where $C_i = \sum_{j=1}^n \delta_i(x_j)$ ($C_i$ is the number of observations with value $i$). For $K = 2$ this is just the binomial distribution.
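The multinomial probability above can be computed directly from the counts; a standard-library sketch (the function name is illustrative):

```python
import math

# Multinomial probability of counts C_1, ..., C_K under probabilities pi_1, ..., pi_K.
def multinomial_pmf(counts, probs):
    n = sum(counts)
    coef = math.factorial(n)          # n!
    for c in counts:
        coef //= math.factorial(c)    # divide by each C_i!
    p = float(coef)
    for c, pi in zip(counts, probs):
        p *= pi ** c                  # multiply by pi_i^{C_i}
    return p
```

For $K = 2$ and counts $(k, n - k)$ the coefficient reduces to $\binom{n}{k}$, recovering the binomial pmf.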

  24. Conjugate Posteriors and Predictions
  Given $n$ observations $S = (x_1, \dots, x_n)$ from a multinomial distribution $P(\cdot \mid \theta)$ with a Dirichlet prior $P(\theta \mid \alpha)$, we have
  $P(\theta \mid S, \alpha) \propto P(\theta \mid \alpha)\, P(S \mid \theta) \propto \prod_{i=1}^K \theta_i^{\alpha_i + C_i - 1} \propto Dir(\alpha_1 + C_1, \dots, \alpha_K + C_K),$
  where $C_i$ is the number of observations with value $i$.
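The conjugate update amounts to adding the observed counts to $\alpha$; the posterior mean then gives the standard Dirichlet-multinomial predictive probabilities $(\alpha_i + C_i)/(\alpha_0 + n)$. A minimal sketch with illustrative prior and data:

```python
import numpy as np

# Conjugate Dirichlet-multinomial update: posterior parameters are alpha_i + C_i.
alpha = np.array([1.0, 1.0, 1.0])   # illustrative symmetric prior
observations = [0, 2, 2, 1, 2]      # illustrative data with values in {0, 1, 2}
counts = np.bincount(observations, minlength=3)

post_alpha = alpha + counts                  # Dir(alpha_1 + C_1, ..., alpha_K + C_K)
predictive = post_alpha / post_alpha.sum()   # posterior mean = predictive probabilities
```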

  25. Plan
  - Parametrics, nonparametrics and priors
  - A reminder on distributions
  - Dirichlet processes: definition, stick breaking, Pólya urn scheme and the Chinese restaurant process
