Bayesian Nonparametrics
Lorenzo Rosasco
9.520 Class 18, April 11, 2011
About this class

Goal: To give an overview of some of the basic concepts in Bayesian Nonparametrics. In particular, to discuss Dirichlet processes and their several characterizations and properties.
Plan

- Parametrics, nonparametrics and priors
- A reminder on distributions
- Dirichlet processes
  - Definition
  - Stick Breaking
  - Pólya Urn Scheme and the Chinese Restaurant Process
References and Acknowledgments

This lecture heavily draws (sometimes literally) from the list of references below, which we suggest as further readings. Figures are taken either from Sudderth's PhD thesis or Teh's tutorial.

Main references/sources:
- Yee Whye Teh, tutorial at the Machine Learning Summer School, and his notes on Dirichlet Processes.
- Erik Sudderth, PhD thesis.
- Ghosh and Ramamoorthi, Bayesian Nonparametrics (book).

See also:
- Zoubin Ghahramani, ICML tutorial.
- Michael Jordan, NIPS tutorial.
- Rasmussen and Williams, Gaussian Processes for Machine Learning (book).
- Ferguson, paper in Annals of Statistics.
- Sethuraman, paper in Statistica Sinica.
- Berlinet and Thomas-Agnan, RKHSs in Probability and Statistics (book).

Thanks to Dan, Rus and Charlie for various discussions.
Parametrics vs Nonparametrics

We can illustrate the difference between the two approaches by considering the following prototype problems:
1. function estimation
2. density estimation
(Parametric) Function Estimation

Data: $S = (X, Y) = (x_i, y_i)_{i=1}^n$
Model: $y_i = f_\theta(x_i) + \epsilon_i$, e.g. $f_\theta(x) = \langle \theta, x \rangle$ and $\epsilon \sim N(0, \sigma^2)$, $\sigma > 0$.
Prior: $\theta \sim P(\theta)$
Posterior: $P(\theta \mid X, Y) = \dfrac{P(\theta)\, P(Y \mid X, \theta)}{P(Y \mid X)}$
Prediction: $P(y^* \mid x^*, X, Y) = \int P(y^* \mid x^*, \theta)\, P(\theta \mid X, Y)\, d\theta$
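The linear-Gaussian model is one of the few cases where the posterior above is available in closed form. A minimal Python sketch, assuming a zero-mean Gaussian prior $\theta \sim N(0, \tau^2 I)$; the prior, noise level, and data are illustrative choices, not from the slides:

```python
import numpy as np

def blr_posterior(X, y, sigma2, tau2):
    """Closed-form posterior over theta for y_i = <theta, x_i> + eps_i,
    eps_i ~ N(0, sigma2), under the (assumed) prior theta ~ N(0, tau2 * I)."""
    d = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(d) / tau2
    cov = np.linalg.inv(precision)   # posterior covariance
    mean = cov @ X.T @ y / sigma2    # posterior mean
    return mean, cov

rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0])
X = rng.normal(size=(200, 2))
y = X @ theta_true + rng.normal(scale=0.3, size=200)  # noise std 0.3
mean, cov = blr_posterior(X, y, sigma2=0.09, tau2=1.0)
print(mean)  # concentrates near theta_true as n grows
```

With a Gaussian likelihood and Gaussian prior, the predictive integral is also Gaussian, which is why no sampling is needed here.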
(Parametric) Density Estimation

Data: $S = (x_i)_{i=1}^n$
Model: $x_i \sim F_\theta$
Prior: $\theta \sim P(\theta)$
Posterior: $P(\theta \mid X) = \dfrac{P(\theta)\, P(X \mid \theta)}{P(X)}$
Prediction: $P(x^* \mid X) = \int P(x^* \mid \theta)\, P(\theta \mid X)\, d\theta$
Nonparametrics: a Working Definition

In the above models the number of parameters available for learning is fixed a priori. Ideally, the more data we have, the more parameters we would like to explore. This is, in essence, the idea underlying nonparametric models.
The Right to a Prior

A finite sequence of random variables is exchangeable if its distribution does not change under permutation of the indices. A sequence is infinitely exchangeable if every finite subsequence is exchangeable.

De Finetti's Theorem: If the random variables $(x_i)_{i=1}^\infty$ are infinitely exchangeable, then there exists some space $\Theta$ and a corresponding distribution $P(\theta)$, such that the joint distribution of $n$ observations is given by
$$P(x_1, \ldots, x_n) = \int_\Theta P(\theta) \prod_{i=1}^n P(x_i \mid \theta)\, d\theta.$$
Question

The previous classical result is often advocated as a justification for considering (possibly infinite dimensional) priors. Can we find computationally efficient nonparametric models? We already met one when we considered the Bayesian interpretation of regularization...
Reminder: Stochastic Processes

Stochastic Process: a family $(X_t)_{t \in T}$ of random variables $X_t : (\Omega, P) \to \mathbb{R}$ over some index set $T$. Note that:
- $X_t(\omega)$, for $\omega \in \Omega$, is a number;
- $X_t(\cdot)$ is a random variable;
- $X_{(\cdot)}(\omega) : T \to \mathbb{R}$ is a function, called a sample path.
Gaussian Processes

$GP(f_0, K)$, a Gaussian Process (GP) with mean $f_0$ and covariance function $K$: a family $(G_x)_{x \in X}$ of random variables over $X$ such that, for any $x_1, \ldots, x_n$ in $X$, $(G_{x_1}, \ldots, G_{x_n})$ is a multivariate Gaussian. We can define the mean $f_0 : X \to \mathbb{R}$ of the GP from the means $f_0(x_1), \ldots, f_0(x_n)$, and the covariance function $K : X \times X \to \mathbb{R}$ by setting $K(x_i, x_j)$ equal to the corresponding entries of the covariance matrix. Then $K$ is a symmetric, positive definite function. A sample path of the GP can be thought of as a random function $f \sim GP(f_0, K)$.
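Since any finite collection $(G_{x_1}, \ldots, G_{x_n})$ is multivariate Gaussian, sampling a GP on a finite grid reduces to a single multivariate-normal draw. A sketch in Python, assuming a squared-exponential (RBF) kernel as an illustrative choice of symmetric positive definite $K$:

```python
import numpy as np

def sample_gp_paths(xs, mean_fn, kernel, n_paths=3, seed=0):
    """Draw sample paths f ~ GP(f0, K) restricted to a finite grid xs.
    The restriction is a multivariate Gaussian with mean (f0(x_1), ..., f0(x_n))
    and covariance matrix K(x_i, x_j)."""
    rng = np.random.default_rng(seed)
    mu = np.array([mean_fn(x) for x in xs])
    cov = np.array([[kernel(a, b) for b in xs] for a in xs])
    cov += 1e-10 * np.eye(len(xs))  # jitter for numerical stability
    return rng.multivariate_normal(mu, cov, size=n_paths)

# Illustrative RBF kernel with length scale 0.5.
rbf = lambda a, b: np.exp(-((a - b) ** 2) / (2 * 0.5 ** 2))
xs = np.linspace(0, 1, 50)
paths = sample_gp_paths(xs, lambda x: 0.0, rbf, n_paths=3)
print(paths.shape)  # (3, 50): three random functions evaluated on the grid
```

Each row of `paths` is one sample path, i.e. one random function drawn from the prior.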
(Nonparametric) Function Estimation

Data: $S = (X, Y) = (x_i, y_i)_{i=1}^n$
Model: $y_i = f(x_i) + \epsilon_i$
Prior: $f \sim GP(f_0, K)$
Posterior: $P(f \mid X, Y) = \dfrac{P(f)\, P(Y \mid X, f)}{P(Y \mid X)}$
Prediction: $P(y^* \mid x^*, X, Y) = \int P(y^* \mid x^*, f)\, P(f \mid X, Y)\, df$

We have seen that the last equation can be computed in closed form.
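For a Gaussian likelihood the predictive distribution is again Gaussian, with mean and covariance in closed form. A minimal sketch, assuming zero prior mean $f_0 = 0$ and an illustrative RBF kernel and noise level:

```python
import numpy as np

def gp_predict(X, y, Xstar, kernel, sigma2):
    """Closed-form GP regression predictive mean and covariance,
    assuming zero prior mean f0 = 0 and noise variance sigma2."""
    K = kernel(X[:, None], X[None, :])          # train/train covariances
    Ks = kernel(Xstar[:, None], X[None, :])     # test/train covariances
    Kss = kernel(Xstar[:, None], Xstar[None, :])
    A = K + sigma2 * np.eye(len(X))
    mean = Ks @ np.linalg.solve(A, y)
    cov = Kss - Ks @ np.linalg.solve(A, Ks.T)
    return mean, cov

# Illustrative kernel and noise level, not from the slides.
rbf = lambda a, b: np.exp(-((a - b) ** 2) / (2 * 0.3 ** 2))
X = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * X)
mean, cov = gp_predict(X, y, X, rbf, sigma2=0.01)
print(np.max(np.abs(mean - y)))  # small: the posterior mean passes near the data
```

Note the connection to regularization: the predictive mean has the same form as a kernel ridge regression solution with regularization parameter $\sigma^2$.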
(Nonparametric) Density Estimation

Dirichlet Processes (DPs) will give us a way to build nonparametric priors for density estimation.
Data: $S = (x_i)_{i=1}^n$
Model: $x_i \sim F$
Prior: $F \sim DP(\alpha, H)$
Posterior: $P(F \mid X) = \dfrac{P(F)\, P(X \mid F)}{P(X)}$
Prediction: $P(x^* \mid X) = \int P(x^* \mid F)\, P(F \mid X)\, dF$
Plan

- Parametrics, nonparametrics and priors
- A reminder on distributions
- Dirichlet processes
  - Definition
  - Stick Breaking
  - Pólya Urn Scheme and the Chinese Restaurant Process
Dirichlet Distribution

It is a distribution over the K-dimensional simplex $S_K$, i.e. over $x \in \mathbb{R}^K$ such that $\sum_{i=1}^K x_i = 1$ and $x_i \geq 0$ for all $i$. The Dirichlet distribution is given by
$$P(x) = P(x_1, \ldots, x_K) = \frac{\Gamma(\sum_{i=1}^K \alpha_i)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K x_i^{\alpha_i - 1},$$
where $\alpha = (\alpha_1, \ldots, \alpha_K)$ is a parameter vector and $\Gamma$ is the Gamma function. We write $x \sim Dir(\alpha)$, i.e. $x_1, \ldots, x_K \sim Dir(\alpha_1, \ldots, \alpha_K)$.
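The density formula can be evaluated directly, and draws from $Dir(\alpha)$ can be checked to land on the simplex. A small Python sketch (the parameter values are illustrative):

```python
import math
import numpy as np

def dirichlet_pdf(x, alpha):
    """Dirichlet density at a point x on the simplex:
    Gamma(sum a_i) / prod Gamma(a_i) * prod x_i^(a_i - 1)."""
    norm = math.gamma(sum(alpha)) / math.prod(math.gamma(a) for a in alpha)
    return norm * math.prod(xi ** (ai - 1) for xi, ai in zip(x, alpha))

rng = np.random.default_rng(0)
alpha = [2.0, 3.0, 5.0]
x = rng.dirichlet(alpha)  # one draw from Dir(2, 3, 5)
print(x, x.sum())         # nonnegative components summing to 1

# With alpha = (1, 1, 1) the density is uniform over the simplex,
# with constant value Gamma(3) = 2.
print(dirichlet_pdf([1/3, 1/3, 1/3], [1.0, 1.0, 1.0]))  # 2.0
```

The uniform case $\alpha = (1, \ldots, 1)$ is a useful sanity check: all exponents $\alpha_i - 1$ vanish and only the normalizing constant remains.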
Dirichlet Distribution

[figure]
Reminder: Gamma Function and Beta Distribution

The Gamma function:
$$\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt.$$
It is possible to prove that $\Gamma(z+1) = z\,\Gamma(z)$.

Beta Distribution: the special case of the Dirichlet distribution given by $K = 2$:
$$P(x \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}.$$
Note that here $x \in [0, 1]$, whereas for the Dirichlet distribution we would have $x = (x_1, x_2)$ with $x_1, x_2 > 0$ and $x_1 + x_2 = 1$.
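Both facts are easy to check numerically: the recursion $\Gamma(z+1) = z\,\Gamma(z)$, and the Beta density as the $K = 2$ Dirichlet. A sketch in Python:

```python
import math

# The recursion Gamma(z + 1) = z * Gamma(z), checked at an arbitrary point.
z = 3.7
print(math.gamma(z + 1), z * math.gamma(z))  # equal up to rounding

def beta_pdf(x, a, b):
    """Beta density: Gamma(a+b) / (Gamma(a) Gamma(b)) * x^(a-1) * (1-x)^(b-1)."""
    return (math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
            * x ** (a - 1) * (1 - x) ** (b - 1))

# Beta(a, b) at x is the K = 2 Dirichlet evaluated at (x, 1 - x).
# For (a, b) = (2, 5): normalizer Gamma(7)/(Gamma(2) Gamma(5)) = 720/24 = 30.
print(beta_pdf(0.3, 2.0, 5.0))
```

The product $\Gamma(\alpha)\,\Gamma(\beta)$ in the normalizer is exactly the $K = 2$ case of the $\prod_i \Gamma(\alpha_i)$ term in the Dirichlet density.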
Beta Distribution

[figure: Beta densities $p(x \mid \alpha, \beta)$ for various parameters, including $(\alpha, \beta) = (1, 1), (2, 2), (5, 3), (4, 9)$ and $(1.0, 1.0), (1.0, 3.0), (1.0, 0.3), (0.3, 0.3)$]

For large parameters the distribution is unimodal. For small parameters it favors biased binomial distributions.
Properties of the Dirichlet Distribution

Note that the $K$-simplex $S_K$ can be seen as the space of probabilities of a discrete (categorical) random variable with $K$ possible values. Let $\alpha_0 = \sum_{i=1}^K \alpha_i$.

Expectation: $\mathbb{E}[x_i] = \dfrac{\alpha_i}{\alpha_0}$.
Variance: $\mathbb{V}[x_i] = \dfrac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)}$.
Covariance: $\mathrm{Cov}(x_i, x_j) = -\dfrac{\alpha_i \alpha_j}{\alpha_0^2(\alpha_0 + 1)}$ for $i \neq j$.
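These moment formulas can be verified against Monte Carlo samples. A sketch (the parameter values are illustrative):

```python
import numpy as np

alpha = np.array([2.0, 3.0, 5.0])
a0 = alpha.sum()

# Closed-form moments of Dir(alpha).
mean = alpha / a0
var = alpha * (a0 - alpha) / (a0 ** 2 * (a0 + 1))
cov01 = -alpha[0] * alpha[1] / (a0 ** 2 * (a0 + 1))

# Monte Carlo check against empirical moments.
rng = np.random.default_rng(0)
x = rng.dirichlet(alpha, size=100_000)
print(mean, x.mean(axis=0))                   # should agree to a few decimals
print(var, x.var(axis=0))
print(cov01, np.cov(x[:, 0], x[:, 1])[0, 1])  # negative: components compete
```

The negative covariance reflects the simplex constraint: since the components sum to 1, a larger $x_i$ forces the others to be smaller.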
Properties of the Dirichlet Distribution

Aggregation: let $(x_1, \ldots, x_K) \sim Dir(\alpha_1, \ldots, \alpha_K)$; then
$$(x_1 + x_2, x_3, \ldots, x_K) \sim Dir(\alpha_1 + \alpha_2, \alpha_3, \ldots, \alpha_K).$$
More generally, aggregation of any subset of the categories produces a Dirichlet distribution with the corresponding parameters summed as above. In particular, the marginal distribution of any single component follows a Beta distribution, $x_i \sim Beta(\alpha_i, \alpha_0 - \alpha_i)$.
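Both properties can be checked by simulation: summing two coordinates of a $Dir(1, 2, 3, 4)$ draw should match a $Dir(3, 3, 4)$ draw in distribution, and a single coordinate should match the corresponding Beta marginal. A sketch comparing empirical means:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = [1.0, 2.0, 3.0, 4.0]
x = rng.dirichlet(alpha, size=100_000)

# Aggregation: (x1 + x2, x3, x4) should be distributed as Dir(1 + 2, 3, 4).
agg = np.column_stack([x[:, 0] + x[:, 1], x[:, 2], x[:, 3]])
ref = rng.dirichlet([3.0, 3.0, 4.0], size=100_000)
print(agg.mean(axis=0), ref.mean(axis=0))  # both near (0.3, 0.3, 0.4)

# Marginal: x1 should follow Beta(alpha_1, alpha_0 - alpha_1) = Beta(1, 9).
beta = rng.beta(1.0, 9.0, size=100_000)
print(x[:, 0].mean(), beta.mean())         # both near 1/10
```

Matching means is only a partial check of equality in distribution, but higher moments (or a histogram comparison) can be compared the same way.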
Conjugate Priors

Let $X \sim F$ and $F \sim P(\cdot \mid \alpha) = P_\alpha$, so that
$$P(F \mid X, \alpha) = \frac{P(F \mid \alpha)\, P(X \mid F, \alpha)}{P(X \mid \alpha)}.$$
We say that $P(F \mid \alpha)$ is a conjugate prior for the likelihood $P(X \mid F, \alpha)$ if, for any $X$ and $\alpha$, the posterior distribution $P(F \mid X, \alpha)$ is in the same family as the prior. In this case the prior and the posterior distributions are called conjugate distributions. The Dirichlet distribution is conjugate to the multinomial distribution.
Multinomial Distribution

Let $X$ take values in $\{1, \ldots, K\}$. Given $\pi_1, \ldots, \pi_K$, define the probability mass function
$$P(X \mid \pi_1, \ldots, \pi_K) = \prod_{i=1}^K \pi_i^{\delta_i(X)}.$$

Multinomial distribution: given $n$ observations, the total probability of all possible sequences of length $n$ taking those values is
$$P(x_1, \ldots, x_n \mid \pi_1, \ldots, \pi_K) = \frac{n!}{\prod_{i=1}^K C_i!} \prod_{i=1}^K \pi_i^{C_i},$$
where $C_i = \sum_{j=1}^n \delta_i(x_j)$ ($C_i$ is the number of observations with value $i$). For $K = 2$ this is just the binomial distribution.
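The count-based formula can be implemented directly; for $K = 2$ it reduces to the binomial, which gives an exact sanity check. A sketch in Python (the encoding of categories as $1, \ldots, K$ is an implementation choice):

```python
import math
from collections import Counter

def multinomial_pmf(xs, pi):
    """Total probability of all length-n sequences with the same counts
    as xs, under category probabilities pi over categories 1..K:
    n! / (prod C_i!) * prod pi_i^(C_i)."""
    n = len(xs)
    counts = Counter(xs)
    coef = math.factorial(n)
    prob = 1.0
    for i, p in enumerate(pi, start=1):
        c = counts.get(i, 0)
        coef //= math.factorial(c)  # exact integer multinomial coefficient
        prob *= p ** c
    return coef * prob

# K = 2 reduces to the binomial: 3 heads in 5 fair flips,
# C(5, 3) / 2^5 = 10 / 32 = 0.3125.
print(multinomial_pmf([1, 1, 1, 2, 2], [0.5, 0.5]))  # 0.3125
```

Note the pmf depends on the observations only through the counts $C_i$, which is what makes the Dirichlet conjugacy work.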
Conjugate Posteriors and Predictions

Given $n$ observations $S = (x_1, \ldots, x_n)$ from a multinomial distribution $P(\cdot \mid \theta)$ with a Dirichlet prior $P(\theta \mid \alpha)$, we have
$$P(\theta \mid S, \alpha) \propto P(\theta \mid \alpha)\, P(S \mid \theta) \propto \prod_{i=1}^K \theta_i^{\alpha_i + C_i - 1} \propto Dir(\alpha_1 + C_1, \ldots, \alpha_K + C_K),$$
where $C_i$ is the number of observations with value $i$.
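The posterior update is just adding observation counts to the prior parameters, and normalizing the updated parameters gives the posterior predictive probabilities. A sketch (the prior and data are illustrative):

```python
import numpy as np

def dirichlet_posterior(alpha, data, K):
    """Posterior parameters for a Dir(alpha) prior and multinomial data:
    Dir(alpha_1 + C_1, ..., alpha_K + C_K), where C_i counts value i
    among observations encoded as integers in 1..K."""
    counts = np.bincount(np.asarray(data) - 1, minlength=K)
    return np.asarray(alpha, dtype=float) + counts

alpha = [1.0, 1.0, 1.0]    # uniform prior over the simplex
data = [1, 1, 2, 3, 3, 3]  # observations in {1, 2, 3}
post = dirichlet_posterior(alpha, data, K=3)
print(post)                # [3. 2. 4.]

# Posterior predictive: P(x* = i | S) = (alpha_i + C_i) / (alpha_0 + n).
print(post / post.sum())   # (1/3, 2/9, 4/9)
```

The predictive formula follows from the Dirichlet expectation $\mathbb{E}[\theta_i] = \alpha_i / \alpha_0$ applied to the updated parameters.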
Plan

- Parametrics, nonparametrics and priors
- A reminder on distributions
- Dirichlet processes
  - Definition
  - Stick Breaking
  - Pólya Urn Scheme and the Chinese Restaurant Process