Bayesian Nonparametrics: Models Based on the Dirichlet Process
Alessandro Panella
Department of Computer Science, University of Illinois at Chicago
Machine Learning Seminar Series, February 18, 2013
Sources and Inspirations

Tutorials (slides):
P. Orbanz and Y.W. Teh, Modern Bayesian Nonparametrics. NIPS 2011.
M. Jordan, Dirichlet Process, Chinese Restaurant Process, and All That. NIPS 2005.

Articles etc.:
E.B. Sudderth, chapter in PhD thesis, 2006.
E. Fox, chapter in PhD thesis, 2008.
Y.W. Teh, Dirichlet Processes. Encyclopedia of Machine Learning, Springer, 2010.
Outline

1 Introduction and background: Bayesian learning; Nonparametric models
2 Finite mixture models: Bayesian models; Clustering with FMMs; Inference
3 Dirichlet process mixture models: Going nonparametric!; The Dirichlet process; DP mixture models; Inference
4 A little more theory...: De Finetti's REDUX; Dirichlet process REDUX
5 The hierarchical Dirichlet process
Introduction and background: Bayesian learning

The meaning of it all: BAYESIAN NONPARAMETRICS
Bayesian statistics

Estimate a parameter θ ∈ Θ after observing data x.

Frequentist (Maximum Likelihood, ML):
\hat{\theta}_{MLE} = \arg\max_\theta p(x \mid \theta) = \arg\max_\theta L(\theta : x)

Bayesian (Bayes' rule):
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}

Bayesian prediction (using the whole posterior, not just one estimator):
p(x_{\mathrm{new}} \mid x) = \int_\Theta p(x_{\mathrm{new}} \mid \theta)\, p(\theta \mid x)\, d\theta

Maximum A Posteriori (MAP):
\hat{\theta}_{MAP} = \arg\max_\theta p(x \mid \theta)\, p(\theta)
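As a small numerical illustration of these quantities (not from the slides), the sketch below approximates the posterior over a Gaussian mean on a grid and contrasts the MLE, the MAP estimate, and the fully Bayesian posterior predictive. The prior, the data-generating values, and the grid resolution are arbitrary choices made for the example.

```python
import numpy as np

# Hypothetical data: draws from a Gaussian with unknown mean and known variance.
rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(loc=2.0, scale=sigma, size=10)

# Grid over the unknown mean theta, with a (hypothetical) N(0, 3^2) prior.
theta = np.linspace(-5, 8, 2001)
prior = np.exp(-theta**2 / (2 * 3.0**2))

# Likelihood of the data at each grid value of theta (rescaled for stability).
loglik = -0.5 * ((x[:, None] - theta[None, :]) ** 2 / sigma**2).sum(axis=0)
lik = np.exp(loglik - loglik.max())

# Bayes' rule on the grid: unnormalized posterior, then normalization.
post = lik * prior
post /= np.trapz(post, theta)

theta_mle = theta[np.argmax(lik)]    # maximizes p(x | theta)
theta_map = theta[np.argmax(post)]   # maximizes p(x | theta) p(theta)

# Posterior predictive density at a new point: integrate over theta,
# p(x_new | x) = \int p(x_new | theta) p(theta | x) d(theta).
x_new = 2.5
pred = np.trapz(
    np.exp(-0.5 * (x_new - theta) ** 2 / sigma**2)
    / np.sqrt(2 * np.pi * sigma**2) * post,
    theta,
)
print(theta_mle, theta_map, pred)
```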
De Finetti's theorem

A premise:

Definition. An infinite sequence of random variables (x_1, x_2, ...) is said to be (infinitely) exchangeable if, for every N and every permutation π of (1, ..., N),
p(x_1, x_2, \dots, x_N) = p(x_{\pi(1)}, x_{\pi(2)}, \dots, x_{\pi(N)})

Note: exchangeable does not mean i.i.d.!

Example (Polya urn). An urn contains some red balls and some black balls; an infinite sequence of colors is drawn recursively as follows: draw a ball, mark down its color, then put the ball back in the urn along with an additional ball of the same color. The resulting color sequence is exchangeable but not i.i.d. (see the simulation sketch below).
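A minimal Monte Carlo sketch of the Polya urn (not from the slides; the starting composition of one red and one black ball is an arbitrary choice). It checks exchangeability empirically: sequences that are permutations of one another, such as RRB, RBR, and BRR, occur with (approximately) the same probability, even though successive draws are clearly not independent.

```python
import random
from collections import Counter

def polya_sequence(n_draws, n_red=1, n_black=1):
    """Draw a color sequence from a Polya urn, reinforcing the drawn color."""
    urn = ["R"] * n_red + ["B"] * n_black
    seq = []
    for _ in range(n_draws):
        ball = random.choice(urn)
        seq.append(ball)
        urn.append(ball)  # put the ball back plus one more of the same color
    return "".join(seq)

counts = Counter(polya_sequence(3) for _ in range(200_000))
total = sum(counts.values())

# All permutations of "two reds, one black" should have roughly equal frequency
# (the exact probability is 1/12 for each with this starting urn).
for pattern in ["RRB", "RBR", "BRR"]:
    print(pattern, counts[pattern] / total)
```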
De Finetti's theorem (cont'd)

Theorem (De Finetti, 1935. Aka Representation Theorem). A sequence of random variables (x_1, x_2, ...) is infinitely exchangeable if and only if, for all N, there exists a random variable θ and a probability measure p on it such that
p(x_1, x_2, \dots, x_N) = \int_\Theta p(\theta) \prod_{i=1}^{N} p(x_i \mid \theta)\, d\theta

i.e., there exists a parameter space and a measure on it that make the variables conditionally i.i.d. given θ!

The representation theorem motivates (and encourages!) the use of Bayesian statistics.
Bayesian learning

Hypothesis space H. Given data D, compute
p(h \mid D) = \frac{p(D \mid h)\, p(h)}{p(D)}

Then, we typically want to predict some future data D', by either:
averaging over H, i.e. p(D' \mid D) = \int_H p(D' \mid h)\, p(h \mid D)\, dh
choosing the MAP h (or computing it directly), i.e. p(D' \mid D) = p(D' \mid h_{MAP})
sampling from the posterior
...

H can be anything: Bayesian learning is a general learning framework (a toy sketch follows below). We will consider the case in which h is a probabilistic model itself, i.e. a parameter vector θ.
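A toy sketch (not from the slides) with a small discrete hypothesis space: three candidate coin biases play the role of H, and prediction by posterior averaging is compared with prediction from the MAP hypothesis alone. The candidate biases, the prior, and the observed data are made up for illustration.

```python
# Hypothetical discrete hypothesis space: three candidate coin biases.
hypotheses = [0.3, 0.5, 0.9]      # h = P(heads) under each hypothesis
prior = [1 / 3, 1 / 3, 1 / 3]     # uniform prior p(h)

# Observed data D: 8 heads and 2 tails.
n_heads, n_tails = 8, 2

# Posterior p(h | D) ∝ p(D | h) p(h), via Bayes' rule.
unnorm = [h**n_heads * (1 - h)**n_tails * p for h, p in zip(hypotheses, prior)]
posterior = [u / sum(unnorm) for u in unnorm]

# Predict P(next toss is heads | D):
# (a) averaging over the whole posterior,
p_avg = sum(h * p for h, p in zip(hypotheses, posterior))
# (b) using only the MAP hypothesis.
h_map = hypotheses[max(range(len(hypotheses)), key=lambda i: posterior[i])]

print("posterior:", posterior)
print("averaged prediction:", p_avg, " MAP prediction:", h_map)
```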
A simple example

Infer the bias θ ∈ [0, 1] of a coin after observing N tosses.

H = 1, T = 0, p(H) = θ; the hypothesis is h = θ, hence H = [0, 1].

Sequence of Bernoulli trials (graphical model: θ generates x_1, ..., x_N):
p(x_1, \dots, x_N \mid \theta) = \theta^{n_H} (1 - \theta)^{N - n_H}
where n_H = number of heads.

Unknown θ:
p(x_1, \dots, x_N) = \int_0^1 \theta^{n_H} (1 - \theta)^{N - n_H} p(\theta)\, d\theta

Need to find a "good" prior p(θ)... Beta distribution!
A simple example (cont'd)

Beta distribution: θ ∼ Beta(a, b)
p(\theta \mid a, b) = \frac{1}{B(a, b)} \theta^{a-1} (1 - \theta)^{b-1}

Bayesian learning: p(h | D) ∝ p(D | h) p(h); for us:
p(\theta \mid x_1, \dots, x_N) \propto p(x_1, \dots, x_N \mid \theta)\, p(\theta)
= \theta^{n_H} (1 - \theta)^{n_T} \cdot \frac{1}{B(a, b)} \theta^{a-1} (1 - \theta)^{b-1}
\propto \theta^{n_H + a - 1} (1 - \theta)^{n_T + b - 1}

i.e. θ | x_1, ..., x_N ∼ Beta(a + n_H, b + n_T).

We're lucky! The Beta distribution is a conjugate prior to the binomial distribution.

[Plots on the slide: Beta(0.1, 0.1), Beta(1, 1), Beta(2, 3), and Beta(10, 10) densities.]
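A minimal sketch of this conjugate update (not from the slides; the prior parameters and the tosses are made up): the posterior is obtained by adding the observed head/tail counts to the Beta parameters, and the posterior predictive probability of heads is simply the posterior mean, (a + n_H) / (a + b + N).

```python
def beta_bernoulli_update(tosses, a=1.0, b=1.0):
    """Conjugate update for a coin: Beta(a, b) prior, Bernoulli likelihood.

    tosses is a sequence of 0/1 outcomes (1 = heads). Returns the posterior
    Beta parameters and the posterior predictive probability of heads.
    """
    n_heads = sum(tosses)
    n_tails = len(tosses) - n_heads
    a_post, b_post = a + n_heads, b + n_tails
    p_next_heads = a_post / (a_post + b_post)  # mean of the posterior Beta
    return a_post, b_post, p_next_heads

# Example: 7 heads and 3 tails under a Beta(2, 3) prior.
print(beta_bernoulli_update([1, 1, 1, 1, 1, 1, 1, 0, 0, 0], a=2.0, b=3.0))
# -> (9.0, 6.0, 0.6)
```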