
Lecture 13: Dirichlet Processes (Julia Hockenmaier)



  1. CS598JHM: Advanced NLP (Spring 2013)
     http://courses.engr.illinois.edu/cs598jhm/
     Lecture 13: Dirichlet Processes
     Julia Hockenmaier, juliahmr@illinois.edu
     3324 Siebel Center, office hours: by appointment

  2. Finite mixture model
     - Mixing proportions: the prior probability of each component (assuming uniform α):
       π | α ~ Dirichlet(α/K, ..., α/K)
     - Mixture components: the distribution over observations for each component k:
       θ*_k | H ~ H   (H is typically a Dirichlet distribution)
     - Indicator variables: which component is observation i drawn from?
       z_i | π ~ Multinomial(π)
     - The observations: the probability of observation i under component z_i:
       x_i | z_i, {θ*_k} ~ F(θ*_{z_i})   (F is typically a categorical distribution)
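
     A minimal Python sketch of this finite-mixture generative story. The Gaussian choices for H and F and all numeric values are illustrative assumptions, not from the slide (the slide's typical case is a Dirichlet H with a categorical F):

        import numpy as np

        rng = np.random.default_rng(0)
        K, alpha, n = 3, 1.0, 10

        pi = rng.dirichlet([alpha / K] * K)          # mixing proportions  pi | alpha
        theta_star = rng.normal(0.0, 5.0, size=K)    # component parameters theta*_k ~ H, with H = N(0, 25) here
        z = rng.choice(K, size=n, p=pi)              # indicator variables  z_i | pi
        x = rng.normal(theta_star[z], 1.0)           # observations x_i | z_i, {theta*_k}, with F = N(theta*, 1) here
        print(pi, z, x)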

  3. Dirichlet Process DP(α, H)
     - The Dirichlet process DP(α, H) defines a distribution over distributions over a probability space Θ.
     - Draws G ~ DP(α, H) from this DP are random distributions over Θ.
     - DP(α, H) has two parameters:
       Base distribution H: a distribution over the probability space Θ
       Concentration parameter α: a positive real number
     - If G ~ DP(α, H), then for any finite measurable partition A_1...A_r of Θ:
       (G(A_1), ..., G(A_r)) ~ Dirichlet(αH(A_1), ..., αH(A_r))
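
     A small numerical illustration of the defining property (my own sketch; the three-cell partition and the values of α and H(A_k) are arbitrary assumptions). Over a fixed finite partition, the masses that a draw G assigns are a single Dirichlet draw:

        import numpy as np

        rng = np.random.default_rng(0)
        alpha = 10.0
        H = np.array([0.5, 0.3, 0.2])         # H(A_1), H(A_2), H(A_3) for some partition of Theta

        G_masses = rng.dirichlet(alpha * H)   # (G(A_1), G(A_2), G(A_3)) ~ Dirichlet(alpha H(A_1), ...)
        print(G_masses, G_masses.sum())       # the masses sum to 1: G restricted to the partition is categorical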

  4. The base distribution H
     [Figure: the probability space Θ partitioned into regions A_1, A_2, A_3]
     - Since A_1, A_2, A_3 partition Θ, we can use the base distribution H to define a categorical distribution over A_1, A_2, A_3:
       H(A_1) + H(A_2) + H(A_3) = 1
     - Note that we can use H to define a categorical distribution over any finite partition A_1...A_r of Θ, even if H is smooth.

  5. Draws from the DP: G ~ DP(α, H)
     [Figure: the probability space Θ partitioned into regions A_1, A_2, A_3]
     - Every individual draw G from DP(α, H) is also a distribution over Θ.
     - G also defines a categorical distribution over any partition of Θ.
     - For any finite partition A_1...A_r of Θ, this categorical distribution is drawn from a Dirichlet prior defined by α and H:
       (G(A_1), G(A_2), G(A_3)) ~ Dirichlet(αH(A_1), αH(A_2), αH(A_3))

  6. The role of H and α
     - The base distribution H defines the mean (expectation) of G:
       For any measurable set A ⊆ Θ, E[G(A)] = H(A)
     - The concentration parameter α is inversely related to the variance of G:
       V[G(A)] = H(A)(1 − H(A))/(α + 1)
       α specifies how much mass is around the mean; the larger α, the smaller the variance.
     - α is also called the strength parameter: if we use DP(α, H) as a prior, α tells us how much we can deviate from the prior:
       As α → ∞, G(A) → H(A)
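
     A quick numerical check of these two formulas (my own sketch; the values of α and H(A) are arbitrary). For the two-cell partition {A, Θ∖A}, the defining property gives G(A) ~ Beta(αH(A), α(1 − H(A))), so the mean and variance can be verified by sampling:

        import numpy as np

        rng = np.random.default_rng(0)
        alpha, H_A = 5.0, 0.3

        # G(A) ~ Beta(alpha H(A), alpha (1 - H(A))), from the partition {A, complement of A}
        samples = rng.beta(alpha * H_A, alpha * (1 - H_A), size=200_000)

        print(samples.mean(), H_A)                            # E[G(A)] = H(A)
        print(samples.var(), H_A * (1 - H_A) / (alpha + 1))   # V[G(A)] = H(A)(1 - H(A))/(alpha + 1)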

  7. The posterior of G: G | θ_1, ..., θ_n
     - Assume the distribution G is drawn from a DP: G ~ DP(α, H)
     - The prior of G: (G(A_1), ..., G(A_K)) ~ Dirichlet(αH(A_1), ..., αH(A_K))
     - Given a sequence of observations θ_1...θ_n from Θ that are drawn from this G: θ_i | G ~ G
     - What is the posterior of G given the observed θ_1...θ_n?
     - For any finite partition A_1...A_K of Θ, define the number of observations in A_k: n_k = #{i: θ_i ∈ A_k}
     - The posterior of G given observations θ_1...θ_n:
       (G(A_1), ..., G(A_K)) | θ_1, ..., θ_n ~ Dirichlet(αH(A_1) + n_1, ..., αH(A_K) + n_K)
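
     A sketch of the counting step behind this update (my own; it assumes Θ = [0, 1], H = Uniform(0, 1), and uses arbitrary points for θ_1...θ_n rather than true draws from a G):

        import numpy as np

        rng = np.random.default_rng(0)
        alpha = 2.0
        edges = np.array([0.0, 0.25, 0.5, 1.0])     # partition of Theta = [0, 1] into A_1, A_2, A_3
        H_masses = np.diff(edges)                   # H = Uniform(0, 1), so H(A_k) is the length of A_k

        theta = rng.uniform(0, 1, size=20)          # stand-ins for the observations theta_1..theta_n
        n_k, _ = np.histogram(theta, bins=edges)    # n_k = #{i: theta_i in A_k}

        prior_params = alpha * H_masses             # Dirichlet(alpha H(A_1), ..., alpha H(A_K))
        posterior_params = alpha * H_masses + n_k   # Dirichlet(alpha H(A_k) + n_k)
        print(prior_params, posterior_params)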

  8. The posterior of G: G | θ_1, ..., θ_n
     - The observations θ_1...θ_n define an empirical distribution over Θ:
       (1/n) Σ_{i=1..n} δ_{θ_i}   ← this is just a fancy way of saying P(A_k) = n_k/n
     - The posterior of G given observations θ_1...θ_n:
       (G(A_1), ..., G(A_K)) | θ_1, ..., θ_n ~ Dirichlet(αH(A_1) + n_1, ..., αH(A_K) + n_K)
     - The posterior is a DP with:
       - concentration parameter α + n
       - a base distribution that is a weighted average of H and the empirical distribution:
         G | θ_1, ..., θ_n ~ DP( α + n,  α/(α + n) · H  +  n/(α + n) · (1/n) Σ_{i=1..n} δ_{θ_i} )
     - The weight of the empirical distribution is proportional to the amount of data; the weight of H is proportional to α.

  9. The Blackwell-MacQueen urn
     - Assume each value in Θ has a unique color; θ_1...θ_n is a sequence of colored balls.
     - With probability α/(α + n), the (n+1)th ball is drawn from H.
     - With probability n/(α + n), the (n+1)th ball is drawn from an urn that contains all previously drawn balls.
     - Note that this implies that G is a discrete distribution, even if H is not.
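
     A minimal simulation of the urn scheme (my own sketch; the standard-normal H and the values of n and α are arbitrary assumptions). With G marginalized out, each new ball is either a fresh draw from H or a copy of a previously drawn ball:

        import numpy as np

        rng = np.random.default_rng(0)

        def blackwell_macqueen(n, alpha, draw_from_H):
            """Draw theta_1..theta_n from the urn scheme (G integrated out)."""
            balls = []
            for i in range(n):
                if rng.random() < alpha / (alpha + len(balls)):
                    balls.append(draw_from_H())                     # new ball, drawn from the base H
                else:
                    balls.append(balls[rng.integers(len(balls))])   # copy of a previously drawn ball
            return balls

        draws = blackwell_macqueen(20, alpha=1.0, draw_from_H=lambda: rng.normal())
        print(len(set(draws)), "distinct values among", len(draws), "draws")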

  10. The clustering property of DPs
      - θ_1...θ_n induces a partition of the set {1, ..., n} into clusters, one per unique value; let k be the number of clusters.
      - This means that the DP defines a distribution over such partitions.
      - The expected number of clusters k increases with α but grows only logarithmically in n:
        E[k | n] ≃ α log(1 + n/α)
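
      The approximation can be checked against an exact sum (my own derivation sketch, not spelled out on the slide): under the urn scheme, the (i+1)th draw starts a new cluster with probability α/(α + i), so E[k | n] = Σ_{i=0..n−1} α/(α + i):

        import numpy as np

        alpha, n = 5.0, 1000
        exact = sum(alpha / (alpha + i) for i in range(n))   # E[k | n] = sum_i alpha/(alpha + i)
        approx = alpha * np.log(1 + n / alpha)               # the logarithmic approximation
        print(exact, approx)                                 # the two agree closely for large n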

  11. NLP 101: language modeling
      - Task: given a stream of words w_1...w_n, predict the next word w_{n+1} with a unigram model P(w).
      - Answer: if w_{n+1} is a word w we've seen before: P(w_{n+1} = w) ∝ Freq(w)
      - But what if w_{n+1} has never been seen before? We need to reserve some mass for new events:
        P(w_{n+1} is a new word) ∝ α
      - P(w_{n+1} = w) = Freq(w)/(n + α)   if Freq(w) > 0
                       = α/(n + α)         if Freq(w) = 0
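
      A minimal sketch of this estimator (my own; the toy word stream and the value of α are arbitrary assumptions):

        from collections import Counter

        def unigram_probs(stream, alpha):
            """P(next word = w) for seen words, plus the mass reserved for unseen words."""
            counts = Counter(stream)
            n = len(stream)
            p_seen = {w: c / (n + alpha) for w, c in counts.items()}   # Freq(w)/(n + alpha)
            p_new = alpha / (n + alpha)                                # mass for any new word
            return p_seen, p_new

        p_seen, p_new = unigram_probs("the cat sat on the mat".split(), alpha=0.5)
        print(p_seen["the"], p_new)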

  12. The Chinese restaurant process
      The (i+1)th customer c_{i+1} sits:
      - at an existing table t_k that already has n_k customers, with probability n_k/(i + α)
      - at a new table, with probability α/(i + α)
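
      A short sketch of the seating process (my own; n and α are arbitrary). It returns the table index of every customer, which corresponds to the indicator variables z_i on slide 15:

        import numpy as np

        rng = np.random.default_rng(0)

        def chinese_restaurant_process(n, alpha):
            """Return a table index for each of n customers."""
            tables, z = [], []                              # tables[k] = n_k, the customers at table k
            for i in range(n):
                probs = np.array(tables + [alpha]) / (i + alpha)
                k = rng.choice(len(probs), p=probs)         # existing tables, plus one slot for a new table
                if k == len(tables):
                    tables.append(0)                        # open a new table
                tables[k] += 1
                z.append(k)
            return z

        print(chinese_restaurant_process(15, alpha=1.0))    # e.g. [0, 0, 1, 0, 2, ...]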

  13. The predictive distribution θ_{n+1} | θ_1, ..., θ_n
      - The predictive distribution of θ_{n+1} given a sequence of i.i.d. draws θ_1, ..., θ_n ~ G, with G ~ DP(α, H) and G marginalized out, is given by the posterior base distribution given θ_1, ..., θ_n:
        P(θ_{n+1} ∈ A) = E[G(A) | θ_1, ..., θ_n]
                       = α/(α + n) · H(A) + 1/(α + n) · Σ_{i=1..n} δ_{θ_i}(A)
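
      A sketch of this predictive rule for the atoms seen so far (my own; it assumes a non-atomic H, so the α/(α + n) mass, spread over Θ according to H, never lands exactly on a previously seen value):

        from collections import Counter

        def predictive(theta_history, alpha):
            """P(theta_{n+1} = v) for each seen value v, plus the total mass for a fresh draw from H."""
            n = len(theta_history)
            counts = Counter(theta_history)
            p_seen = {v: c / (alpha + n) for v, c in counts.items()}   # sum_i delta_{theta_i}(v) / (alpha + n)
            p_from_H = alpha / (alpha + n)                             # alpha/(alpha + n), distributed as H
            return p_seen, p_from_H

        print(predictive(["a", "b", "a", "a"], alpha=1.0))   # ({'a': 0.6, 'b': 0.2}, 0.2)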

  14. The stick-breaking representation
      [Figure: a unit stick broken into pieces π_1 = β_1, π_2 = β_2(1 − β_1), π_3, ..., with 1 − β_1 remaining after the first break]
      G ~ DP(α, H) if:
      - The component parameters are drawn from the base distribution: θ*_k ~ H
      - The weights of each cluster are defined by a stick-breaking process:
        β_k ~ Beta(1, α)
        π_k = β_k ∏_{l=1..k−1} (1 − β_l)
        also written as π ~ GEM(α) (Griffiths/Engen/McCloskey)
      - G = Σ_{k=1..∞} π_k δ_{θ*_k}
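
      A truncated version of this construction (my own sketch; the truncation level, the Gaussian H, and α are arbitrary assumptions):

        import numpy as np

        rng = np.random.default_rng(0)

        def stick_breaking(alpha, draw_from_H, truncation=1000):
            """Truncated stick-breaking construction of G = sum_k pi_k delta_{theta*_k}."""
            betas = rng.beta(1.0, alpha, size=truncation)                    # beta_k ~ Beta(1, alpha)
            remaining = np.concatenate([[1.0], np.cumprod(1 - betas)[:-1]])  # prod_{l<k} (1 - beta_l)
            pi = betas * remaining                                           # pi_k = beta_k prod_{l<k} (1 - beta_l)
            atoms = np.array([draw_from_H() for _ in range(truncation)])     # theta*_k ~ H
            return pi, atoms

        pi, atoms = stick_breaking(alpha=2.0, draw_from_H=lambda: rng.normal())
        print(pi[:5], pi.sum())   # weights decay quickly; the sum approaches 1 as the truncation grows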

  15. Dirichlet Process Mixture Models
      - Each observation x_i is associated with a latent parameter θ_i.
        Each θ_i is drawn i.i.d. from G; each x_i is drawn from F(θ_i):
        G | α, H ~ DP(α, H)
        θ_i | G ~ G
        x_i | θ_i ~ F(θ_i)
      - Since G is discrete, θ_i can be equal to θ_j:
        All x_i, x_j with θ_i = θ_j belong to the same mixture component.
        There is a countably infinite number of mixture components.
      - Stick-breaking representation:
        Mixing proportions: π | α ~ GEM(α)
        Indicator variables: z_i | π ~ Mult(π)
        Component parameters: θ*_k | H ~ H
        Observations: x_i | z_i, {θ*_k} ~ F(θ*_{z_i})
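
      A sketch of sampling data from a DP mixture via the truncated stick-breaking representation (my own; the Gaussian H and F and all numeric values are illustrative assumptions):

        import numpy as np

        rng = np.random.default_rng(1)
        alpha, n, truncation = 1.0, 100, 500

        # Mixing proportions pi ~ GEM(alpha), via truncated stick-breaking
        betas = rng.beta(1.0, alpha, size=truncation)
        pi = betas * np.concatenate([[1.0], np.cumprod(1 - betas)[:-1]])
        pi /= pi.sum()                                        # renormalize the truncated weights

        theta_star = rng.normal(0.0, 5.0, size=truncation)    # theta*_k ~ H, with H = N(0, 25) here
        z = rng.choice(truncation, size=n, p=pi)              # z_i | pi ~ Mult(pi)
        x = rng.normal(theta_star[z], 1.0)                    # x_i ~ F(theta*_{z_i}) = N(theta*_{z_i}, 1)
        print(len(np.unique(z)), "components used for", n, "observations")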

  16. Hierarchical Dirichlet Processes
      - Since both H and G are distributions over the same space Θ, the base distribution of a DP can be a draw from another DP.
      - This allows us to specify hierarchical Dirichlet processes, where each group of data is generated by its own DP:
        Assume a global measure G_0 drawn from a DP: G_0 ~ DP(γ, H)
        For each group j, define another DP G_j with base measure G_0: G_j ~ DP(α_0, G_0)
        (or G_j ~ DP(α_j, G_0), but it is common to assume all α_j are the same)
      - α_0 specifies the amount of variability around the prior G_0.
      - Since all groups share the same base G_0, all G_j use the same atoms (balls of the same colors).
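
      A finite-truncation sketch of the atom sharing (my own; all numeric values are arbitrary). It relies on the fact that when the base measure G_0 is discrete with finitely many atoms, a draw G_j ~ DP(α_0, G_0) simply re-weights those atoms, with weights Dirichlet(α_0 · weights of G_0):

        import numpy as np

        rng = np.random.default_rng(0)
        gamma, alpha0, K, J = 1.0, 1.0, 100, 3      # K = truncation level, J = number of groups

        # Global measure G_0 ~ DP(gamma, H): truncated stick-breaking weights over shared atoms
        b = rng.beta(1.0, gamma, size=K)
        beta = b * np.concatenate([[1.0], np.cumprod(1 - b)[:-1]])
        beta /= beta.sum()                          # renormalize the truncation
        atoms = rng.normal(0.0, 5.0, size=K)        # shared atoms drawn from H = N(0, 25) here

        # Group-level measures G_j ~ DP(alpha0, G_0): each group re-weights the same K atoms
        group_weights = rng.dirichlet(alpha0 * beta, size=J)
        print(group_weights.shape)                  # (J, K): all groups share the atoms, with different weights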
