

  1. Correlated Topic Models. Authors: Blei and Lafferty, 2006. Reviewer: Casey Hanson

  2. Recap: Latent Dirichlet Allocation
β€’ $D$ ≑ set of documents.
β€’ $K$ ≑ set of topics.
β€’ $V$ ≑ set of all words (the vocabulary); $N$ words in each document.
β€’ $\theta_d$ ≑ multinomial over topics for a document $d \in D$; $\theta_d \sim \mathrm{Dir}(\alpha)$
β€’ $\beta_k$ ≑ multinomial over words in a topic $k \in K$; $\beta_k \sim \mathrm{Dir}(\eta)$
β€’ $z_{d,n}$ ≑ topic selected for word $n$ in document $d$; $z_{d,n} \sim \mathrm{Multi}(\theta_d)$
β€’ $w_{d,n}$ ≑ $n$th word in document $d$; $w_{d,n} \sim \mathrm{Multi}(\beta_{z_{d,n}})$
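As a refresher, here is a minimal sketch of this generative process in NumPy (my own illustration, not the presenter's code; the corpus sizes and hyperparameters are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 8, 20                      # topics, vocabulary size, words per doc (toy values)
alpha, eta = np.ones(K), np.ones(V)     # symmetric Dirichlet hyperparameters
beta = rng.dirichlet(eta, size=K)       # beta_k ~ Dir(eta): one word distribution per topic

def generate_lda_document():
    theta = rng.dirichlet(alpha)        # theta_d ~ Dir(alpha): topic proportions
    z = rng.choice(K, size=N, p=theta)  # z_{d,n} ~ Multi(theta_d)
    return [rng.choice(V, p=beta[k]) for k in z]  # w_{d,n} ~ Multi(beta_{z_{d,n}})

print(generate_lda_document())  # 20 word ids drawn from the sampled topics
```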

  3. Latent Dirichlet Allocation
β€’ Need to calculate the posterior: $p(\theta_{1:D}, z_{1:D,1:N}, \beta_{1:K} \mid w_{1:D,1:N}, \alpha, \eta)$
β€’ $\propto p(\theta_{1:D}, z_{1:D,1:N}, \beta_{1:K}, w_{1:D,1:N} \mid \alpha, \eta)$
β€’ The normalization factor, $\int_\theta \int_\beta \sum_z p(\cdot)$, is intractable.
β€’ Need to use approximate inference, e.g. Gibbs sampling.
β€’ Drawback: no intuitive relationship between topics.
β€’ Challenge: develop a method similar to LDA that models relationships between topics.

  4. Normal or Gaussian Distribution
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
β€’ Continuous distribution.
β€’ Symmetrical and defined for $-\infty < x < \infty$.
β€’ Parameters: $\mathcal{N}(\mu, \sigma^2)$
β€’ $\mu$ ≑ mean; $\sigma^2$ ≑ variance; $\sigma$ ≑ standard deviation
β€’ Estimation from data $X = x_1 \ldots x_n$:
β€’ $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$
β€’ $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$
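A quick sketch of these estimators (my own, assuming NumPy; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # samples from N(2, 1.5^2)

mu_hat = x.mean()                       # (1/n) * sum(x_i)
var_hat = ((x - mu_hat) ** 2).mean()    # (1/n) * sum((x_i - mu_hat)^2)

print(mu_hat, var_hat)  # should be close to 2.0 and 2.25
```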

  5. Multivariate Gaussian Distribution: $k$ dimensions
$f(\mathbf{x}) = f(x_1 \ldots x_k) = \frac{1}{(2\pi)^{k/2}\sqrt{\det\Sigma}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})}$
β€’ $\mathbf{x} = [x_1 \ldots x_k]^T \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$
β€’ $\boldsymbol{\mu}$ ≑ $k \times 1$ vector of means for each dimension.
β€’ $\Sigma$ ≑ $k \times k$ covariance matrix.
β€’ Example: 2D case
β€’ $\boldsymbol{\mu} = E[\mathbf{x}] = \begin{bmatrix} E[x_1] \\ E[x_2] \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$
β€’ $\Sigma = \begin{bmatrix} E[(x_1-\mu_1)^2] & E[(x_1-\mu_1)(x_2-\mu_2)] \\ E[(x_1-\mu_1)(x_2-\mu_2)] & E[(x_2-\mu_2)^2] \end{bmatrix}$
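To make the density and sampling concrete, a small SciPy illustration (mine, not part of the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 1.0]])  # the off-diagonal 0.4 couples the two dimensions

mvn = multivariate_normal(mean=mu, cov=Sigma)
print(mvn.pdf([0.5, -0.5]))             # evaluate f(x) at a point
print(mvn.rvs(size=3, random_state=0))  # three correlated 2D draws
```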

  6. 2D Multivariate Gaussian
β€’ $\Sigma = \begin{bmatrix} \sigma_{x_1}^2 & \rho_{x_1,x_2}\sigma_{x_1}\sigma_{x_2} \\ \rho_{x_1,x_2}\sigma_{x_1}\sigma_{x_2} & \sigma_{x_2}^2 \end{bmatrix}$
β€’ Topic correlations appear on the off-diagonal.
β€’ $\rho_{x_1,x_2}\sigma_{x_1}\sigma_{x_2} = E[(x_1-\mu_1)(x_2-\mu_2)] = \frac{1}{n}\sum_{i=1}^{n}(x_{i,1}-\mu_1)(x_{i,2}-\mu_2)$
β€’ With zero correlation, the covariance matrix is diagonal!
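A numerical check (again my own sketch) that the empirical covariance recovers the off-diagonal term:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.4],
                  [0.4, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=50_000)

# (1/n) * sum of outer products of deviations from the mean
dev = X - X.mean(axis=0)
print(dev.T @ dev / len(X))  # off-diagonal entries should be close to 0.4
```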

  7. Matlab Demo

  8. …Back to Topic Models
β€’ How can we adapt LDA to have correlations between topics?
β€’ In LDA, we assume two things:
β€’ Assumption 1: Topics in a document are independent. $\theta_d \sim \mathrm{Dir}(\alpha)$
β€’ Assumption 2: The distribution of words in a topic is stationary. $\beta_k \sim \mathrm{Dir}(\eta)$
β€’ To sample topic distributions for topics that are correlated, we need to relax Assumption 1.

  9. Exponential Family of Distributions
β€’ Family of distributions that can be placed in the following form: $f(x \mid \theta) = h(x)\, e^{\eta(\theta) \cdot T(x) - A(\theta)}$
β€’ Ex: binomial distribution, $\theta = p$: $f(x \mid \theta) = \binom{n}{x} p^x (1-p)^{n-x}, \; x \in \{0, 1, 2, \ldots, n\}$
β€’ Natural parameterization: $\eta(\theta) = \log\frac{p}{1-p}$, $h(x) = \binom{n}{x}$, $A(\theta) = -n\log(1-p)$, $T(x) = x$
β€’ $f(x) = \binom{n}{x}\, e^{x \cdot \log\frac{p}{1-p} + n \cdot \log(1-p)}$
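A hedged numerical check (mine, assuming SciPy) that the exponential-family form reproduces the binomial pmf:

```python
import numpy as np
from scipy.stats import binom
from scipy.special import comb

n, p, x = 10, 0.3, 4

eta = np.log(p / (1 - p))   # natural parameter eta(theta)
A = -n * np.log(1 - p)      # log-partition function A(theta)
pmf_expfam = comb(n, x) * np.exp(eta * x - A)  # h(x) * exp(eta * T(x) - A)

print(pmf_expfam, binom.pmf(x, n, p))  # the two values agree
```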

  10. Categorical Distribution
β€’ Multinomial with $n = 1$:
β€’ $f(x_1) = \theta_1$; $f(\mathbf{z}_1) = \theta^T \mathbf{z}_1$
β€’ where $\mathbf{z}_1 = [1\ 0\ 0 \ldots 0]^T$ (Iverson bracket or indicator vector); $z_i = 1$ for the chosen category.
β€’ Parameters: $\theta = [p_1\ p_2\ p_3]$, where $\sum_i p_i = 1$
β€’ $\theta' = [\frac{p_1}{p_k}\ \frac{p_2}{p_k} \ldots \frac{p_k}{p_k}]$
β€’ $\log\theta' = [\log\frac{p_1}{p_k}\ \log\frac{p_2}{p_k} \ldots \log\frac{p_k}{p_k} = 0]$

  11. Exponential Family: Multinomial with $N = 1$
β€’ Recall: $f(\mathbf{z}_i \mid \theta) = \theta^T \mathbf{z}_i$
β€’ We want: $f(x \mid \theta) = h(x)\, e^{\eta(\theta) \cdot T(x) - A(\theta)}$
β€’ $f(\mathbf{z}_i \mid \eta) = \frac{e^{\eta^T \mathbf{z}_i}}{\sum_{j=1}^{k} e^{\eta_j}} = e^{\eta^T \mathbf{z}_i - \log\sum_{j=1}^{k} e^{\eta_j}}$
β€’ Note: $k-1$ independent dimensions in the multinomial.
β€’ $\eta' = [\log\frac{p_1}{p_k}\ \log\frac{p_2}{p_k} \ldots 0]$, $\eta'_i = \log\frac{p_i}{p_k}$
β€’ $f(\mathbf{z}_i \mid \eta') = \frac{e^{\eta'^T \mathbf{z}_i}}{1 + \sum_{j=1}^{k-1} e^{\eta'_j}}$
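A small NumPy sketch (my own) showing that the softmax of $\eta'$ recovers the original probabilities, with the last category as the reference:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # categorical parameters, sum to 1
eta = np.log(p / p[-1])         # eta'_i = log(p_i / p_k); last entry is 0

# f(z_i | eta') = exp(eta'_i) / sum_j exp(eta'_j)
p_back = np.exp(eta) / np.exp(eta).sum()
print(eta)     # [0.916..., 0.405..., 0.0]
print(p_back)  # recovers [0.5, 0.3, 0.2]
```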

  12. Verify: Classroom Participation
β€’ Given: $\eta = [\log\frac{p_1}{p_k}\ \log\frac{p_2}{p_k} \ldots 0]$
β€’ Show: $f(\mathbf{z}_i \mid \theta) = \theta^T \mathbf{z}_i = e^{\eta^T \mathbf{z}_i - \log\sum_{j=1}^{k} e^{\eta_j}}$

  13. Intuition and Demo
β€’ Can sample $\eta$ from any number of distributions.
β€’ Choose the normal: it allows for correlation between topic dimensions.
β€’ Get a topic distribution for each document by sampling $\eta \sim \mathcal{N}_{k-1}(\mu, \Sigma)$.
β€’ What is the $\mu$? The expected deviation from the last topic: $\mu_i = E[\log\frac{p_i}{p_k}]$
β€’ Negative means push density towards the last topic ($\eta_i < 0 \Rightarrow p_k > p_i$).
β€’ What about the covariance? It shows the variability, between topics, in the deviation from the last topic.
β€’ Demo figure: $\mu = [0\ 0]^T$, $\Sigma = [1\ 0;\ 0\ 1]$
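To make the intuition concrete, a minimal logistic-normal sampler (my own sketch; the ΞΌ and Ξ£ values echo the figures on the next slides). It draws Ξ· from a 2D Gaussian and maps it to a 3-topic distribution with topic 3 as the reference:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([-0.9, -0.9])      # negative means push density toward topic 3
Sigma = np.array([[1.0, 0.4],
                  [0.4, 1.0]])   # positive off-diagonal: topics 1 and 2 co-vary

def sample_topic_dist():
    eta = rng.multivariate_normal(mu, Sigma)  # eta ~ N_{k-1}(mu, Sigma)
    eta = np.append(eta, 0.0)                 # reference topic gets eta_k = 0
    return np.exp(eta) / np.exp(eta).sum()    # logistic (softmax) map to the simplex

print(sample_topic_dist())  # a point on the 3-simplex, typically favoring topic 3
```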

  14. Favoring Topic 3. Figure panels: $\mu = [-0.9, -0.9]$, $\Sigma = [1\ {-0.9};\ {-0.9}\ 1]$ vs. $\mu = [-0.9, -0.9]$, $\Sigma = [1\ 0;\ 0\ 1]$

  15. Favoring Topic 3: $\mu = [-0.9, -0.9]$, $\Sigma = [1\ 0.4;\ 0.4\ 1]$

  16. Exercises

  17. Correlated Topic Model
β€’ Algorithm:
β€’ $\forall d \in D$:
β€’ Draw $\eta_d \mid \mu, \Sigma \sim \mathcal{N}(\mu, \Sigma)$
β€’ $\forall n \in [1 \ldots N]$:
β€’ Draw topic assignment: $z_{d,n} \mid \eta_d \sim \mathrm{Categorical}(f(\eta_d))$
β€’ Draw word: $w_{d,n} \mid z_{d,n}, \beta_{1:K} \sim \mathrm{Categorical}(\beta_{z_{d,n}})$
β€’ Parameter estimation:
β€’ Intractable; use variational inference (later).
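Putting the pieces together, a compact sketch of this generative process (my own illustration; sizes and the topic-word matrix `beta` are toy placeholders). Note the only change from the LDA sketch above: a Gaussian draw of $\eta_d$ replaces the Dirichlet draw of $\theta_d$.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 8, 20                          # topics, vocabulary size, words per doc
mu, Sigma = np.zeros(K - 1), np.eye(K - 1)  # logistic-normal parameters
beta = rng.dirichlet(np.ones(V), size=K)    # placeholder topic-word distributions

def generate_ctm_document():
    eta = rng.multivariate_normal(mu, Sigma)   # eta_d ~ N(mu, Sigma)
    eta = np.append(eta, 0.0)
    theta = np.exp(eta) / np.exp(eta).sum()    # f(eta_d): logistic map to the simplex
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)             # z_{d,n} ~ Categorical(f(eta_d))
        words.append(rng.choice(V, p=beta[z])) # w_{d,n} ~ Categorical(beta_z)
    return words

print(generate_ctm_document())
```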

  18. Evaluation I: CTM on Test Data

  19. Evaluation II: 10-Fold Cross-Validation, LDA vs. CTM
β€’ ~1500 documents in the corpus.
β€’ ~5600 unique words after pruning.
β€’ Methodology:
β€’ Partition the data into 10 sets (10-fold cross-validation).
β€’ Calculate the log likelihood of each held-out set, having trained on the other 9 sets, for both LDA and CTM.
β€’ CTM shows a much higher log likelihood as the number of topics increases.
β€’ Figure panels (left and right): $L(\mathrm{CTM}) - L(\mathrm{LDA})$

  20. Evaluation II: Predictive Perplexity
β€’ Perplexity ≑ the expected number of equally likely words.
β€’ Lower perplexity means higher word resolution.
β€’ Suppose you see a percentage of the words in a document: how likely are the rest of the words in the document according to your model?
β€’ CTM does better with lower numbers of observed words.
β€’ It is able to infer certain words given topic probabilities.
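For reference, a hedged sketch of how predictive perplexity is typically computed from per-word log likelihoods (my formulation; the paper's exact estimator may differ):

```python
import numpy as np

def perplexity(word_log_likelihoods):
    """Perplexity = exp(-(1/N) * sum of per-word log likelihoods)."""
    return np.exp(-np.mean(word_log_likelihoods))

# Toy example: log p(w) for four held-out words under some fitted model
print(perplexity(np.log([0.1, 0.05, 0.2, 0.1])))  # 10.0 "equally likely words"
```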

  21. Conclusions
β€’ CTM changes the distribution from which the per-document topic proportions are drawn, from a Dirichlet to a logistic normal.
β€’ Very similar to LDA.
β€’ Able to model correlations between topics.
β€’ For larger numbers of topics, CTM performs better than LDA.
β€’ With known topics, CTM is able to infer word associations better than LDA.
