 
Correlated Topic Models
Authors: Blei and Lafferty, 2006
Reviewer: Casey Hanson
Recap: Latent Dirichlet Allocation
• $D$ ≡ set of documents.
• $K$ ≡ set of topics.
• $V$ ≡ set of all words. $N$ words in each document.
• $\theta_d$ ≡ multinomial over topics for a document $d \in D$. $\theta_d \sim \mathrm{Dir}(\alpha)$
• $\beta_k$ ≡ multinomial over words in a topic $k \in K$. $\beta_k \sim \mathrm{Dir}(\eta)$
• $z_{d,n}$ ≡ topic selected for word $n$ in document $d$. $z_{d,n} \sim \mathrm{Multi}(\theta_d)$
• $w_{d,n}$ ≡ $n^{th}$ word in document $d$. $w_{d,n} \sim \mathrm{Multi}(\beta_{z_{d,n}})$
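The generative story above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the corpus sizes, and the symmetric hyperparameter values (`alpha=0.5`, `eta=0.5`) are assumptions chosen for the example, and the Dirichlet draw uses the standard normalized-Gamma construction.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def generate_lda_corpus(num_docs, num_topics, vocab_size, doc_len,
                        alpha=0.5, eta=0.5, seed=0):
    """Generate (theta_d, [(z, w), ...]) per document under the LDA model."""
    rng = random.Random(seed)
    # beta_k ~ Dir(eta): one distribution over words per topic.
    beta = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # theta_d ~ Dir(alpha): topic proportions for this document.
        theta = sample_dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(doc_len):
            # z_{d,n} ~ Multi(theta_d), then w_{d,n} ~ Multi(beta_{z_{d,n}}).
            z = rng.choices(range(num_topics), weights=theta)[0]
            w = rng.choices(range(vocab_size), weights=beta[z])[0]
            doc.append((z, w))
        corpus.append((theta, doc))
    return beta, corpus
```

Words are represented as integer indices into the vocabulary, which keeps the sketch independent of any particular corpus.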
Latent Dirichlet Allocation
• Need to calculate the posterior: $p(\theta_{1:D}, z_{1:D,1:N}, \beta_{1:K} \mid w_{1:D,1:N}, \alpha, \eta)$
• $\propto p(\theta_{1:D}, z_{1:D,1:N}, \beta_{1:K}, w_{1:D,1:N} \mid \alpha, \eta)$
• Normalization factor, $\int_{\beta} \int_{\theta} \sum_{z} p(\ldots)$, is intractable
• Need to use approximate inference.
  • Gibbs sampling
• Drawback
  • No intuitive relationship between topics.
• Challenge
  • Develop a method similar to LDA with relationships between topics.
Normal or Gaussian Distribution
• $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
• Continuous distribution
• Symmetrical and defined for $-\infty < x < \infty$
• Parameters: $\mathcal{N}(\mu, \sigma^2)$
  • $\mu$ ≡ mean
  • $\sigma^2$ ≡ variance
  • $\sigma$ ≡ standard deviation
• Estimation from data $X = [x_1 \ldots x_n]$:
  • $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  • $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$
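As a small worked example of the density and the two estimators on this slide, the following sketch evaluates the normal pdf and computes the mean and the (1/n) variance from a list of observations. The function names are illustrative choices, not part of the original slides.

```python
import math


def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def estimate_mean_variance(xs):
    """Return (mean, variance) using the 1/n estimators from the slide."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat
```

Note that the variance estimator divides by n, matching the slide, rather than by n - 1 as the unbiased sample variance would.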
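Multivariate Gaussian Distribution: $k$ dimensions
• $f(X) = f_X(x_1 \ldots x_k) = \frac{1}{(2\pi)^{k/2} \det(\Sigma)^{1/2}} \, e^{-\frac{1}{2}(X-\mu)^T \Sigma^{-1} (X-\mu)}$
• $X = [x_1 \ldots x_k]^T \sim \mathcal{N}(\mu, \Sigma)$
• $\mu$ ≡ $k \times 1$ vector of means for each dimension
• $\Sigma$ ≡ $k \times k$ covariance matrix
Example: 2D case
• $\mu = E[X] = \begin{bmatrix} E[x_1] \\ E[x_2] \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$
• $\Sigma = \begin{bmatrix} E[(x_1-\mu_1)^2] & E[(x_1-\mu_1)(x_2-\mu_2)] \\ E[(x_1-\mu_1)(x_2-\mu_2)] & E[(x_2-\mu_2)^2] \end{bmatrix}$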
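2D Multivariate Gaussian:
• $\Sigma = \begin{bmatrix} \sigma_{x_1}^2 & \rho_{x_1,x_2}\sigma_{x_1}\sigma_{x_2} \\ \rho_{x_1,x_2}\sigma_{x_1}\sigma_{x_2} & \sigma_{x_2}^2 \end{bmatrix}$
• Topic correlations are on the off-diagonal
• $\rho_{x_1,x_2}\sigma_{x_1}\sigma_{x_2} = E[(x_1-\mu_1)(x_2-\mu_2)] \approx \frac{1}{n}\sum_{i=1}^{n}(x_{i,1}-\mu_1)(x_{i,2}-\mu_2)$
• Covariance matrix is symmetric!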
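To make the role of the off-diagonal term concrete, here is a hedged sketch that draws correlated 2D Gaussian samples and then re-estimates the covariance matrix from them. It builds the second coordinate from two independent standard normals, which is the 2x2 Cholesky construction. The function names and parameter values are assumptions for illustration only.

```python
import math
import random


def sample_bivariate_normal(mu, sigma1, sigma2, rho, rng):
    """Draw (x1, x2) with std devs sigma1, sigma2 and correlation rho."""
    u1 = rng.gauss(0.0, 1.0)
    u2 = rng.gauss(0.0, 1.0)
    x1 = mu[0] + sigma1 * u1
    # Mixing u1 into x2 is what induces the correlation rho.
    x2 = mu[1] + sigma2 * (rho * u1 + math.sqrt(1.0 - rho ** 2) * u2)
    return x1, x2


def sample_covariance(points):
    """Estimate the 2x2 covariance matrix with the 1/n estimator."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    c11 = sum((p[0] - m1) ** 2 for p in points) / n
    c22 = sum((p[1] - m2) ** 2 for p in points) / n
    c12 = sum((p[0] - m1) * (p[1] - m2) for p in points) / n
    # The off-diagonal entry appears twice: the matrix is symmetric.
    return [[c11, c12], [c12, c22]]
```

With `sigma1 = 1`, `sigma2 = 2`, and `rho = 0.8`, the off-diagonal entry should come out close to `rho * sigma1 * sigma2 = 1.6` for a large enough sample.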
Matlab Demo
…Back to Topic Models
• How can we adapt LDA to have correlations between topics?
• In LDA, we assume two things:
  • Assumption 1: Topics in a document are independent. $\theta_d \sim \mathrm{Dir}(\alpha)$
  • Assumption 2: Distribution of words in a topic is stationary. $\beta_k \sim \mathrm{Dir}(\eta)$
• To sample topic distributions for topics that are correlated, we need to relax Assumption 1.
Exponential Family of Distributions
• Family of distributions that can be placed in the following form:
  $p(x \mid \eta) = h(x) \exp\{\eta^T T(x) - A(\eta)\}$
• Ex: Binomial distribution, $\theta = p$:
  $p(x \mid p) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x \in \{0, 1, 2, \ldots, n\}$
• Natural parameterization: $\eta(p) = \log\frac{p}{1-p}$, $h(x) = \binom{n}{x}$, $A(\eta) = -n \log(1-p)$, $T(x) = x$
  $p(x) = \binom{n}{x} \exp\{x \log\frac{p}{1-p} + n \log(1-p)\}$
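A quick numerical check can confirm that the binomial really fits the exponential-family form. The sketch below evaluates the pmf both ways, taking the log-partition function as $A(\eta) = n \log(1 + e^{\eta})$, which equals $-n \log(1-p)$ so that the density reads $h(x)\exp\{\eta x - A(\eta)\}$. Function names are illustrative assumptions.

```python
import math


def binomial_pmf(x, n, p):
    """Standard binomial pmf: C(n, x) p^x (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def binomial_expfam(x, n, p):
    """Same pmf written as h(x) * exp(eta * T(x) - A(eta))."""
    eta = math.log(p / (1.0 - p))                 # natural parameter (log-odds)
    h = math.comb(n, x)                           # base measure h(x)
    log_partition = n * math.log(1.0 + math.exp(eta))  # A(eta) = -n log(1 - p)
    return h * math.exp(eta * x - log_partition)  # T(x) = x
```

Both functions should agree for every `x` in `0..n`, up to floating-point error.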
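Categorical Distribution
• Multinomial with $n = 1$:
  • $p(x_1) = \theta_1$; in vector form, $p(z_1) = \theta^T z_1$
  • where $z_1 = [1\ 0\ 0\ \ldots\ 0]^T$ (Iverson bracket or indicator vector)
  • $\sum_i z_i = 1$
• Parameters: $\theta$
  • $\theta = [\theta_1\ \theta_2\ \theta_3]$, where $\sum_i \theta_i = 1$
  • $\theta' = \left[\frac{\theta_1}{\theta_3}\ \ \frac{\theta_2}{\theta_3}\ \ 1\right]$
  • $\log \theta' = \left[\log\frac{\theta_1}{\theta_3}\ \ \log\frac{\theta_2}{\theta_3}\ \ 0\right]$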
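Exponential Family: Multinomial with N = 1
• Given: $p(z \mid \theta) = \theta^T z$
• We want: $p(x \mid \eta) = h(x) \exp\{\eta^T T(x) - A(\eta)\}$
• $p(z \mid \eta) = \exp\{\eta^T z - \log\sum_{i=1}^{k} e^{\eta_i}\} = \frac{e^{\eta^T z}}{\sum_{i=1}^{k} e^{\eta_i}}$
• Note: $k-1$ independent dimensions in the multinomial
• $\eta' = \left[\log\frac{\theta_1}{\theta_k}\ \ \log\frac{\theta_2}{\theta_k}\ \ \ldots\ \ 0\right]$, $\eta'_i = \log\frac{\theta_i}{\theta_k}$
• $p(z \mid \eta') = \frac{e^{\eta'^T z}}{1 + \sum_{i=1}^{k-1} e^{\eta'_i}}$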
Verify: Classroom participation
• Given: $\eta = \left[\log\frac{\theta_1}{\theta_k}\ \ \log\frac{\theta_2}{\theta_k}\ \ \ldots\ \ 0\right]$
• Show: $p(z \mid \theta) = \theta^T z = \exp\{\eta^T z - \log\sum_{i=1}^{k} e^{\eta_i}\}$
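The identity in this exercise can also be checked numerically. The sketch below maps a probability vector to natural parameters relative to the last category and then maps back with the normalized exponential (softmax). The round trip should recover the original probabilities, since $\sum_i e^{\eta_i} = 1/\theta_k$. The function names and the example vector are assumptions for illustration.

```python
import math


def natural_params(theta):
    """eta_i = log(theta_i / theta_k), with the last entry equal to 0."""
    last = theta[-1]
    return [math.log(t / last) for t in theta]


def softmax(eta):
    """theta_i = exp(eta_i) / sum_j exp(eta_j), computed stably."""
    m = max(eta)
    exps = [math.exp(e - m) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_prob(eta, z):
    """p(z | eta) = exp(eta^T z - log sum_i exp(eta_i)) for indicator vector z."""
    log_norm = math.log(sum(math.exp(e) for e in eta))
    return math.exp(sum(e * zi for e, zi in zip(eta, z)) - log_norm)
```

For example, with `theta = [0.2, 0.3, 0.5]` and `z = [0, 1, 0]`, `categorical_prob(natural_params(theta), z)` should agree with `theta[1]`.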
Intuition and Demo
• Can sample $\eta$ from any number of places.
• Choose normal (allows for correlation between topic dimensions)
• Get a topic distribution for each document by sampling: $\eta \sim \mathcal{N}_{k-1}(\mu, \Sigma)$
• What is $\mu$?
  • Expected deviation from last topic: $\log\frac{\theta_i}{\theta_k}$
  • Negative means push density towards last topic ($\eta_i < 0$, $\theta_k > \theta_i$)
• What about the covariance?
  • Shows variability in deviation from last topic between topics.
• Demo setting: $\mu = [0\ 0]^T$, $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
Favoring Topic 3
• $\mu = [-0.9, -0.9]$, $\Sigma = \begin{bmatrix} 1 & -0.9 \\ -0.9 & 1 \end{bmatrix}$
• $\mu = [-0.9, -0.9]$, $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
Favoring Topic 3:
• $\mu = [-0.9, -0.9]$, $\Sigma = \begin{bmatrix} 1 & 0.4 \\ 0.4 & 1 \end{bmatrix}$
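The demo settings on these slides can be reproduced approximately with a short simulation. The sketch below is a stand-in for the Matlab demo, not a copy of it. It assumes three topics, samples a 2D $\eta$, appends a fixed 0 for the last topic, and applies the normalized exponential to obtain topic proportions. The function names, sample size, and seed are assumptions.

```python
import math
import random


def sample_theta(mu, cov, rng):
    """eta ~ N_2(mu, cov); theta = softmax([eta_1, eta_2, 0])."""
    # Lower-triangular Cholesky factor of the 2x2 covariance matrix.
    a = math.sqrt(cov[0][0])
    b = cov[1][0] / a
    c = math.sqrt(cov[1][1] - b * b)
    u1, u2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    eta = [mu[0] + a * u1, mu[1] + b * u1 + c * u2, 0.0]
    exps = [math.exp(e) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(mu, cov, n=5000, seed=0):
    """Return mean topic proportions and the correlation of theta_1, theta_2."""
    rng = random.Random(seed)
    samples = [sample_theta(mu, cov, rng) for _ in range(n)]
    means = [sum(s[i] for s in samples) / n for i in range(3)]
    c12 = sum((s[0] - means[0]) * (s[1] - means[1]) for s in samples) / n
    v1 = sum((s[0] - means[0]) ** 2 for s in samples) / n
    v2 = sum((s[1] - means[1]) ** 2 for s in samples) / n
    return means, c12 / math.sqrt(v1 * v2)
```

Calling `summarize([-0.9, -0.9], cov)` for each covariance matrix on the slides shows two effects: the negative means shift mass toward topic 3, and the off-diagonal entry of the covariance controls how strongly the proportions of topics 1 and 2 move against each other.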
Exercises
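Correlated Topic Model
• Algorithm:
  • $\forall d \in D$:
    • Draw $\eta_d \mid \mu, \Sigma \sim \mathcal{N}(\mu, \Sigma)$
    • $\forall n \in \{1 \ldots N\}$:
      • Draw topic assignment
        • $z_{d,n} \mid \eta_d \sim \mathrm{Categorical}(f(\eta_d))$, where $f(\eta)_i = \frac{e^{\eta_i}}{\sum_j e^{\eta_j}}$
      • Draw word
        • $w_{d,n} \mid z_{d,n}, \beta_{1:K} \sim \mathrm{Categorical}(\beta_{z_{d,n}})$
• Parameter estimation:
  • Intractable
  • Use variational inference (later)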
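The generative algorithm on this slide can be sketched as follows. This is an illustrative simulation under stated assumptions, not the authors' implementation: it follows the $k-1$ dimensional parameterization used earlier in the deck (the last topic's natural parameter is fixed at 0), treats the topic-word matrix `beta` as given, and uses hypothetical helper names throughout.

```python
import math
import random


def cholesky(cov):
    """Lower-triangular L with L L^T = cov (cov must be positive definite)."""
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(cov[i][i] - s)
            else:
                low[i][j] = (cov[i][j] - s) / low[j][j]
    return low


def sample_mvn(mu, low, rng):
    """x = mu + L u with u ~ N(0, I)."""
    u = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(low[i][j] * u[j] for j in range(i + 1)) for i, m in enumerate(mu)]


def softmax(eta):
    m = max(eta)
    exps = [math.exp(e - m) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


def generate_ctm_document(mu, cov, beta, doc_len, rng):
    """Draw eta_d, map it to topic proportions, then draw (topic, word) pairs."""
    eta = sample_mvn(mu, cholesky(cov), rng) + [0.0]  # last topic is the reference
    theta = softmax(eta)
    words = []
    for _ in range(doc_len):
        z = rng.choices(range(len(theta)), weights=theta)[0]    # topic assignment
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]  # word from topic z
        words.append((z, w))
    return theta, words
```

The only structural difference from the LDA sketch is the first step: topic proportions come from a Gaussian draw passed through the softmax rather than from a Dirichlet, which is what lets the covariance matrix encode topic correlations.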
Evaluation I: CTM on Test Data
Evaluation II: 10-Fold Cross Validation, LDA vs CTM
• ~1500 documents in corpus.
• ~5600 unique words after pruning.
• Methodology:
  • Partition data into 10 sets (10-fold cross validation).
  • Calculate the log likelihood of each set, given training on the other 9 sets, for both LDA and CTM.
• Figure (left and right panels): L(CTM) − L(LDA)
• CTM shows a much higher log likelihood as the number of topics increases.
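The partitioning step of the methodology can be sketched as below. Only the fold bookkeeping is shown. Fitting LDA or CTM and scoring the held-out fold are left out because the slides do not specify those details. The function name and the round-robin assignment of documents to folds are assumptions.

```python
def k_fold_splits(num_docs, k=10):
    """Yield (train_indices, test_indices) pairs, one per held-out fold."""
    indices = list(range(num_docs))
    # Round-robin assignment keeps fold sizes within one document of each other.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test
```

In a full experiment, each model would be trained on `train`, the held-out log likelihood computed on `test`, and the per-fold difference L(CTM) − L(LDA) averaged over the folds.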
Evaluation II: Predictive Perplexity
• Perplexity measure ≡ expected number of equally likely words
• Lower perplexity means higher word resolution.
• Suppose you see a percentage of the words in a document. How likely are the remaining words in the document according to your model?
• CTM does better with lower numbers of observed words.
• Able to infer certain words given topic probabilities.
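The "number of equally likely words" reading of perplexity follows from its usual definition as the exponential of the negative average log probability of the held-out words. The sketch below assumes that definition and takes the model's per-word predictive probabilities as input. How those probabilities are obtained from LDA or CTM is outside its scope.

```python
import math


def perplexity(word_probs):
    """exp(-(1/N) * sum_n log p(w_n)) over the N held-out words."""
    n = len(word_probs)
    avg_log_prob = sum(math.log(p) for p in word_probs) / n
    return math.exp(-avg_log_prob)
```

If a model assigns every held-out word probability 1/V, the perplexity is V, which is the sense in which it counts equally likely words. Sharper predictions give a lower value.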
Conclusions
• CTM changes the distribution from which per-document topic proportions are drawn, from a Dirichlet to a logistic normal.
• Very similar to LDA.
• Able to model correlations between topics.
• For larger numbers of topics, CTM performs better than LDA.
• With known topics, CTM is able to infer word associations better than LDA.