Bayesian Nonparametrics
Lorenzo Rosasco
9.520 Class 18, April 11, 2011
About this class

Goal: To give an overview of some of the basic concepts in Bayesian Nonparametrics. In particular, to discuss Dirichlet processes and their several characterizations and properties.
Plan

- Parametrics, nonparametrics and priors
- A reminder on distributions
- Dirichlet processes
  - Definition
  - Stick Breaking
  - Pólya Urn Scheme and the Chinese Restaurant Process
References and Acknowledgments

This lecture heavily draws (sometimes literally) from the list of references below, which we suggest as further readings. Figures are taken either from Sudderth's PhD thesis or Teh's tutorial.

Main references/sources:
- Yee Whye Teh, tutorial at the Machine Learning Summer School, and his notes on Dirichlet Processes.
- Erik Sudderth, PhD thesis.
- Ghosh and Ramamoorthi, Bayesian Nonparametrics (book).

See also:
- Zoubin Ghahramani, ICML tutorial.
- Michael Jordan, NIPS tutorial.
- Rasmussen and Williams, Gaussian Processes for Machine Learning (book).
- Ferguson, paper in Annals of Statistics.
- Sethuraman, paper in Statistica Sinica.
- Berlinet and Thomas-Agnan, RKHSs in Probability and Statistics (book).

Thanks to Dan, Rus and Charlie for various discussions.
Parametrics vs Nonparametrics

We can illustrate the difference between the two approaches by considering the following prototype problems:
1. function estimation
2. density estimation
(Parametric) Function Estimation

Data: $S = (X, Y) = (x_i, y_i)_{i=1}^n$
Model: $y_i = f_\theta(x_i) + \epsilon_i$, e.g. $f_\theta(x) = \langle \theta, x \rangle$ and $\epsilon \sim N(0, \sigma^2)$, $\sigma > 0$.
Prior: $\theta \sim P(\theta)$
Posterior: $P(\theta \mid X, Y) = \dfrac{P(\theta)\, P(Y \mid X, \theta)}{P(Y \mid X)}$
Prediction: $P(y^* \mid x^*, X, Y) = \int P(y^* \mid x^*, \theta)\, P(\theta \mid X, Y)\, d\theta$
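The linear-Gaussian model is one of the few cases where the posterior above is available in closed form. A minimal Python sketch, assuming a zero-mean Gaussian prior $\theta \sim N(0, \tau^2 I)$; the prior, noise level, and data are illustrative choices, not from the slides:

```python
import numpy as np

def blr_posterior(X, y, sigma2, tau2):
    """Closed-form posterior over theta for y_i = <theta, x_i> + eps_i,
    eps_i ~ N(0, sigma2), under the (assumed) prior theta ~ N(0, tau2 * I)."""
    d = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(d) / tau2
    cov = np.linalg.inv(precision)   # posterior covariance
    mean = cov @ X.T @ y / sigma2    # posterior mean
    return mean, cov

rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0])
X = rng.normal(size=(200, 2))
y = X @ theta_true + rng.normal(scale=0.3, size=200)  # noise std 0.3
mean, cov = blr_posterior(X, y, sigma2=0.09, tau2=1.0)
print(mean)  # concentrates near theta_true as n grows
```

With a Gaussian likelihood and Gaussian prior, the predictive integral is also Gaussian, which is why no sampling is needed here.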
(Parametric) Density Estimation

Data: $S = (x_i)_{i=1}^n$
Model: $x_i \sim F_\theta$
Prior: $\theta \sim P(\theta)$
Posterior: $P(\theta \mid X) = \dfrac{P(\theta)\, P(X \mid \theta)}{P(X)}$
Prediction: $P(x^* \mid X) = \int P(x^* \mid \theta)\, P(\theta \mid X)\, d\theta$
Nonparametrics: a Working Definition

In the above models the number of parameters available for learning is fixed a priori. Ideally, the more data we have, the more parameters we would like to explore. This is, in essence, the idea underlying nonparametric models.
The Right to a Prior

A finite sequence of random variables is exchangeable if its distribution does not change under permutation of the indices. A sequence is infinitely exchangeable if every finite subsequence is exchangeable.

De Finetti's Theorem: If the random variables $(x_i)_{i=1}^\infty$ are infinitely exchangeable, then there exists some space $\Theta$ and a corresponding distribution $P(\theta)$, such that the joint distribution of $n$ observations is given by
$$P(x_1, \ldots, x_n) = \int_\Theta P(\theta) \prod_{i=1}^n P(x_i \mid \theta)\, d\theta.$$
Question

The previous classical result is often advocated as a justification for considering (possibly infinite dimensional) priors. Can we find computationally efficient nonparametric models? We already met one when we considered the Bayesian interpretation of regularization...
Reminder: Stochastic Processes

Stochastic Process: a family $(X_t)_{t \in T}$ of random variables $X_t : (\Omega, P) \to \mathbb{R}$ over some index set $T$. Note that:
- $X_t(\omega)$, for $\omega \in \Omega$, is a number;
- $X_t(\cdot)$ is a random variable;
- $X_{(\cdot)}(\omega) : T \to \mathbb{R}$ is a function, called a sample path.
Gaussian Processes

$GP(f_0, K)$, a Gaussian Process (GP) with mean $f_0$ and covariance function $K$: a family $(G_x)_{x \in X}$ of random variables over $X$ such that, for any $x_1, \ldots, x_n$ in $X$, $(G_{x_1}, \ldots, G_{x_n})$ is a multivariate Gaussian. We can define the mean $f_0 : X \to \mathbb{R}$ of the GP from the means $f_0(x_1), \ldots, f_0(x_n)$, and the covariance function $K : X \times X \to \mathbb{R}$ by setting $K(x_i, x_j)$ equal to the corresponding entries of the covariance matrix. Then $K$ is a symmetric, positive definite function. A sample path of the GP can be thought of as a random function $f \sim GP(f_0, K)$.
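Since any finite collection $(G_{x_1}, \ldots, G_{x_n})$ is multivariate Gaussian, sampling a GP on a finite grid reduces to a single multivariate-normal draw. A sketch in Python, assuming a squared-exponential (RBF) kernel as an illustrative choice of symmetric positive definite $K$:

```python
import numpy as np

def sample_gp_paths(xs, mean_fn, kernel, n_paths=3, seed=0):
    """Draw sample paths f ~ GP(f0, K) restricted to a finite grid xs.
    The restriction is a multivariate Gaussian with mean (f0(x_1), ..., f0(x_n))
    and covariance matrix K(x_i, x_j)."""
    rng = np.random.default_rng(seed)
    mu = np.array([mean_fn(x) for x in xs])
    cov = np.array([[kernel(a, b) for b in xs] for a in xs])
    cov += 1e-10 * np.eye(len(xs))  # jitter for numerical stability
    return rng.multivariate_normal(mu, cov, size=n_paths)

# Illustrative RBF kernel with length scale 0.5.
rbf = lambda a, b: np.exp(-((a - b) ** 2) / (2 * 0.5 ** 2))
xs = np.linspace(0, 1, 50)
paths = sample_gp_paths(xs, lambda x: 0.0, rbf, n_paths=3)
print(paths.shape)  # (3, 50): three random functions evaluated on the grid
```

Each row of `paths` is one sample path, i.e. one random function drawn from the prior.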
(Nonparametric) Function Estimation

Data: $S = (X, Y) = (x_i, y_i)_{i=1}^n$
Model: $y_i = f(x_i) + \epsilon_i$
Prior: $f \sim GP(f_0, K)$
Posterior: $P(f \mid X, Y) = \dfrac{P(f)\, P(Y \mid X, f)}{P(Y \mid X)}$
Prediction: $P(y^* \mid x^*, X, Y) = \int P(y^* \mid x^*, f)\, P(f \mid X, Y)\, df$

We have seen that the last equation can be computed in closed form.
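For a Gaussian likelihood the predictive distribution is again Gaussian, with mean and covariance in closed form. A minimal sketch, assuming zero prior mean $f_0 = 0$ and an illustrative RBF kernel and noise level:

```python
import numpy as np

def gp_predict(X, y, Xstar, kernel, sigma2):
    """Closed-form GP regression predictive mean and covariance,
    assuming zero prior mean f0 = 0 and noise variance sigma2."""
    K = kernel(X[:, None], X[None, :])          # train/train covariances
    Ks = kernel(Xstar[:, None], X[None, :])     # test/train covariances
    Kss = kernel(Xstar[:, None], Xstar[None, :])
    A = K + sigma2 * np.eye(len(X))
    mean = Ks @ np.linalg.solve(A, y)
    cov = Kss - Ks @ np.linalg.solve(A, Ks.T)
    return mean, cov

# Illustrative kernel and noise level, not from the slides.
rbf = lambda a, b: np.exp(-((a - b) ** 2) / (2 * 0.3 ** 2))
X = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * X)
mean, cov = gp_predict(X, y, X, rbf, sigma2=0.01)
print(np.max(np.abs(mean - y)))  # small: the posterior mean passes near the data
```

Note the connection to regularization: the predictive mean has the same form as a kernel ridge regression solution with regularization parameter $\sigma^2$.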
(Nonparametric) Density Estimation

Dirichlet Processes (DPs) will give us a way to build nonparametric priors for density estimation.
Data: $S = (x_i)_{i=1}^n$
Model: $x_i \sim F$
Prior: $F \sim DP(\alpha, H)$
Posterior: $P(F \mid X) = \dfrac{P(F)\, P(X \mid F)}{P(X)}$
Prediction: $P(x^* \mid X) = \int P(x^* \mid F)\, P(F \mid X)\, dF$
Plan

- Parametrics, nonparametrics and priors
- A reminder on distributions
- Dirichlet processes
  - Definition
  - Stick Breaking
  - Pólya Urn Scheme and the Chinese Restaurant Process
Dirichlet Distribution

It is a distribution over the K-dimensional simplex $S_K$, i.e. over $x \in \mathbb{R}^K$ such that $\sum_{i=1}^K x_i = 1$ and $x_i \geq 0$ for all $i$. The Dirichlet distribution is given by
$$P(x) = P(x_1, \ldots, x_K) = \frac{\Gamma(\sum_{i=1}^K \alpha_i)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K x_i^{\alpha_i - 1},$$
where $\alpha = (\alpha_1, \ldots, \alpha_K)$ is a parameter vector and $\Gamma$ is the Gamma function. We write $x \sim Dir(\alpha)$, i.e. $x_1, \ldots, x_K \sim Dir(\alpha_1, \ldots, \alpha_K)$.
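The density formula can be evaluated directly, and draws from $Dir(\alpha)$ can be checked to land on the simplex. A small Python sketch (the parameter values are illustrative):

```python
import math
import numpy as np

def dirichlet_pdf(x, alpha):
    """Dirichlet density at a point x on the simplex:
    Gamma(sum a_i) / prod Gamma(a_i) * prod x_i^(a_i - 1)."""
    norm = math.gamma(sum(alpha)) / math.prod(math.gamma(a) for a in alpha)
    return norm * math.prod(xi ** (ai - 1) for xi, ai in zip(x, alpha))

rng = np.random.default_rng(0)
alpha = [2.0, 3.0, 5.0]
x = rng.dirichlet(alpha)  # one draw from Dir(2, 3, 5)
print(x, x.sum())         # nonnegative components summing to 1

# With alpha = (1, 1, 1) the density is uniform over the simplex,
# with constant value Gamma(3) = 2.
print(dirichlet_pdf([1/3, 1/3, 1/3], [1.0, 1.0, 1.0]))  # 2.0
```

The uniform case $\alpha = (1, \ldots, 1)$ is a useful sanity check: all exponents $\alpha_i - 1$ vanish and only the normalizing constant remains.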
Dirichlet Distribution

[figure]
Reminder: Gamma Function and Beta Distribution

The Gamma function:
$$\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt.$$
It is possible to prove that $\Gamma(z+1) = z\,\Gamma(z)$.

Beta Distribution: the special case of the Dirichlet distribution given by $K = 2$:
$$P(x \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}.$$
Note that here $x \in [0, 1]$, whereas for the Dirichlet distribution we would have $x = (x_1, x_2)$ with $x_1, x_2 > 0$ and $x_1 + x_2 = 1$.
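Both facts are easy to check numerically: the recursion $\Gamma(z+1) = z\,\Gamma(z)$, and the Beta density as the $K = 2$ Dirichlet. A sketch in Python:

```python
import math

# The recursion Gamma(z + 1) = z * Gamma(z), checked at an arbitrary point.
z = 3.7
print(math.gamma(z + 1), z * math.gamma(z))  # equal up to rounding

def beta_pdf(x, a, b):
    """Beta density: Gamma(a+b) / (Gamma(a) Gamma(b)) * x^(a-1) * (1-x)^(b-1)."""
    return (math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
            * x ** (a - 1) * (1 - x) ** (b - 1))

# Beta(a, b) at x is the K = 2 Dirichlet evaluated at (x, 1 - x).
# For (a, b) = (2, 5): normalizer Gamma(7)/(Gamma(2) Gamma(5)) = 720/24 = 30.
print(beta_pdf(0.3, 2.0, 5.0))
```

The product $\Gamma(\alpha)\,\Gamma(\beta)$ in the normalizer is exactly the $K = 2$ case of the $\prod_i \Gamma(\alpha_i)$ term in the Dirichlet density.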
Beta Distribution

[figure: Beta densities $p(x \mid \alpha, \beta)$ for various parameters, including $(\alpha, \beta) = (1, 1), (2, 2), (5, 3), (4, 9)$ and $(1.0, 1.0), (1.0, 3.0), (1.0, 0.3), (0.3, 0.3)$]

For large parameters the distribution is unimodal. For small parameters it favors biased binomial distributions.
Properties of the Dirichlet Distribution

Note that the $K$-simplex $S_K$ can be seen as the space of probabilities of a discrete (categorical) random variable with $K$ possible values. Let $\alpha_0 = \sum_{i=1}^K \alpha_i$.

Expectation: $\mathbb{E}[x_i] = \dfrac{\alpha_i}{\alpha_0}$.
Variance: $\mathbb{V}[x_i] = \dfrac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)}$.
Covariance: $\mathrm{Cov}(x_i, x_j) = -\dfrac{\alpha_i \alpha_j}{\alpha_0^2(\alpha_0 + 1)}$ for $i \neq j$.
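These moment formulas can be verified against Monte Carlo samples. A sketch (the parameter values are illustrative):

```python
import numpy as np

alpha = np.array([2.0, 3.0, 5.0])
a0 = alpha.sum()

# Closed-form moments of Dir(alpha).
mean = alpha / a0
var = alpha * (a0 - alpha) / (a0 ** 2 * (a0 + 1))
cov01 = -alpha[0] * alpha[1] / (a0 ** 2 * (a0 + 1))

# Monte Carlo check against empirical moments.
rng = np.random.default_rng(0)
x = rng.dirichlet(alpha, size=100_000)
print(mean, x.mean(axis=0))                   # should agree to a few decimals
print(var, x.var(axis=0))
print(cov01, np.cov(x[:, 0], x[:, 1])[0, 1])  # negative: components compete
```

The negative covariance reflects the simplex constraint: since the components sum to 1, a larger $x_i$ forces the others to be smaller.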
Properties of the Dirichlet Distribution

Aggregation: let $(x_1, \ldots, x_K) \sim Dir(\alpha_1, \ldots, \alpha_K)$; then
$$(x_1 + x_2, x_3, \ldots, x_K) \sim Dir(\alpha_1 + \alpha_2, \alpha_3, \ldots, \alpha_K).$$
More generally, aggregation of any subset of the categories produces a Dirichlet distribution with the corresponding parameters summed as above. In particular, the marginal distribution of any single component follows a Beta distribution, $x_i \sim Beta(\alpha_i, \alpha_0 - \alpha_i)$.
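Both properties can be checked by simulation: summing two coordinates of a $Dir(1, 2, 3, 4)$ draw should match a $Dir(3, 3, 4)$ draw in distribution, and a single coordinate should match the corresponding Beta marginal. A sketch comparing empirical means:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = [1.0, 2.0, 3.0, 4.0]
x = rng.dirichlet(alpha, size=100_000)

# Aggregation: (x1 + x2, x3, x4) should be distributed as Dir(1 + 2, 3, 4).
agg = np.column_stack([x[:, 0] + x[:, 1], x[:, 2], x[:, 3]])
ref = rng.dirichlet([3.0, 3.0, 4.0], size=100_000)
print(agg.mean(axis=0), ref.mean(axis=0))  # both near (0.3, 0.3, 0.4)

# Marginal: x1 should follow Beta(alpha_1, alpha_0 - alpha_1) = Beta(1, 9).
beta = rng.beta(1.0, 9.0, size=100_000)
print(x[:, 0].mean(), beta.mean())         # both near 1/10
```

Matching means is only a partial check of equality in distribution, but higher moments (or a histogram comparison) can be compared the same way.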
Conjugate Priors

Let $X \sim F$ and $F \sim P(\cdot \mid \alpha) = P_\alpha$, so that
$$P(F \mid X, \alpha) = \frac{P(F \mid \alpha)\, P(X \mid F, \alpha)}{P(X \mid \alpha)}.$$
We say that $P(F \mid \alpha)$ is a conjugate prior for the likelihood $P(X \mid F, \alpha)$ if, for any $X$ and $\alpha$, the posterior distribution $P(F \mid X, \alpha)$ is in the same family as the prior. In this case the prior and the posterior distributions are called conjugate distributions. The Dirichlet distribution is conjugate to the multinomial distribution.
Multinomial Distribution

Let $X$ take values in $\{1, \ldots, K\}$. Given $\pi_1, \ldots, \pi_K$, define the probability mass function
$$P(X \mid \pi_1, \ldots, \pi_K) = \prod_{i=1}^K \pi_i^{\delta_i(X)}.$$

Multinomial distribution: given $n$ observations, the total probability of all possible sequences of length $n$ taking those values is
$$P(x_1, \ldots, x_n \mid \pi_1, \ldots, \pi_K) = \frac{n!}{\prod_{i=1}^K C_i!} \prod_{i=1}^K \pi_i^{C_i},$$
where $C_i = \sum_{j=1}^n \delta_i(x_j)$ ($C_i$ is the number of observations with value $i$). For $K = 2$ this is just the binomial distribution.
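The count-based formula can be implemented directly; for $K = 2$ it reduces to the binomial, which gives an exact sanity check. A sketch in Python (the encoding of categories as $1, \ldots, K$ is an implementation choice):

```python
import math
from collections import Counter

def multinomial_pmf(xs, pi):
    """Total probability of all length-n sequences with the same counts
    as xs, under category probabilities pi over categories 1..K:
    n! / (prod C_i!) * prod pi_i^(C_i)."""
    n = len(xs)
    counts = Counter(xs)
    coef = math.factorial(n)
    prob = 1.0
    for i, p in enumerate(pi, start=1):
        c = counts.get(i, 0)
        coef //= math.factorial(c)  # exact integer multinomial coefficient
        prob *= p ** c
    return coef * prob

# K = 2 reduces to the binomial: 3 heads in 5 fair flips,
# C(5, 3) / 2^5 = 10 / 32 = 0.3125.
print(multinomial_pmf([1, 1, 1, 2, 2], [0.5, 0.5]))  # 0.3125
```

Note the pmf depends on the observations only through the counts $C_i$, which is what makes the Dirichlet conjugacy work.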
Conjugate Posteriors and Predictions

Given $n$ observations $S = (x_1, \ldots, x_n)$ from a multinomial distribution $P(\cdot \mid \theta)$ with a Dirichlet prior $P(\theta \mid \alpha)$, we have
$$P(\theta \mid S, \alpha) \propto P(\theta \mid \alpha)\, P(S \mid \theta) \propto \prod_{i=1}^K \theta_i^{\alpha_i + C_i - 1} \propto Dir(\alpha_1 + C_1, \ldots, \alpha_K + C_K),$$
where $C_i$ is the number of observations with value $i$.
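The posterior update is just adding observation counts to the prior parameters, and normalizing the updated parameters gives the posterior predictive probabilities. A sketch (the prior and data are illustrative):

```python
import numpy as np

def dirichlet_posterior(alpha, data, K):
    """Posterior parameters for a Dir(alpha) prior and multinomial data:
    Dir(alpha_1 + C_1, ..., alpha_K + C_K), where C_i counts value i
    among observations encoded as integers in 1..K."""
    counts = np.bincount(np.asarray(data) - 1, minlength=K)
    return np.asarray(alpha, dtype=float) + counts

alpha = [1.0, 1.0, 1.0]    # uniform prior over the simplex
data = [1, 1, 2, 3, 3, 3]  # observations in {1, 2, 3}
post = dirichlet_posterior(alpha, data, K=3)
print(post)                # [3. 2. 4.]

# Posterior predictive: P(x* = i | S) = (alpha_i + C_i) / (alpha_0 + n).
print(post / post.sum())   # (1/3, 2/9, 4/9)
```

The predictive formula follows from the Dirichlet expectation $\mathbb{E}[\theta_i] = \alpha_i / \alpha_0$ applied to the updated parameters.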
Plan

- Parametrics, nonparametrics and priors
- A reminder on distributions
- Dirichlet processes
  - Definition
  - Stick Breaking
  - Pólya Urn Scheme and the Chinese Restaurant Process