Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis
Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei, Ming Zhang
ICML 2014 Best Paper
Presenters: Lu Lin & Tianlu Wang
CS6501: Text Mining
Outline
❖ Background and Motivations
❖ Posterior Contraction Analysis
❖ Empirical Study & Practical Guidance
Background
❖ Latent Dirichlet Allocation (LDA) for topic modeling
D documents, each with N words, generated from K topics
➢ $w_{dn}$: observed words; $\theta_d$: document-topic proportions; $z_{dn}$: topic indicators; $\beta_k$: topic-word proportions
❖ Generative process:
➢ $\beta_k \mid \eta \sim \mathrm{Dirichlet}(\eta)$
➢ $\theta_d \mid \alpha \sim \mathrm{Dirichlet}(\alpha)$
➢ $z_{dn} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d)$
➢ $w_{dn} \mid \beta, z_{dn} \sim \mathrm{Multinomial}(\beta_{z_{dn}})$
❖ Bayesian estimation:
$(\hat{\beta}, \hat{\theta}) = \arg\max_{\beta,\theta} P(\beta, \theta \mid w) \propto \arg\max_{\beta,\theta} P(w \mid \beta, \theta)\, P(\beta, \theta)$
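To make the generative story concrete, here is a minimal NumPy sketch of sampling a toy corpus from LDA. All sizes and hyperparameter values are illustrative choices, not the paper's.

```python
import numpy as np

# Toy sizes and hyperparameters (illustrative, not from the paper).
D, N, K, V = 100, 50, 3, 1000   # documents, words per doc, topics, vocabulary
alpha, eta = 0.1, 0.01          # Dirichlet hyperparameters

rng = np.random.default_rng(0)
beta = rng.dirichlet(np.full(V, eta), size=K)          # topic-word proportions (K x V)

docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))           # document-topic proportions
    z = rng.choice(K, size=N, p=theta)                 # topic indicator for each word
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # draw each observed word
    docs.append(w)
```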
Motivation
❖ Latent Dirichlet Allocation (LDA) for topic modeling
❖ Questions:
➢ Is my data topic-model "friendly"? Why did LDA fail on my data?
➢ How many documents do I need to learn 100 topics?
❖ What factors affect LDA's performance?
➢ # documents D
➢ length of individual documents N
➢ # topics K
➢ Dirichlet hyperparameters
❖ Formulating the goal:
➢ How fast (at what rate) does the posterior distribution of the topics $\beta_k$ converge to the true value as D and N approach infinity? ⟹ posterior contraction analysis
Posterior Contraction Analysis
❖ Latent Topic Polytope in LDA
➢ Represent the latent topic structure by a convex hull, the topic polytope $G = \mathrm{conv}(\beta_1, \ldots, \beta_K)$
➢ Distance between two polytopes $G$ and $G'$: the minimum-matching Euclidean distance
$d_{\mathcal{M}}(G, G') = \max\{d(G, G'),\, d(G', G)\}$, where $d(G, G') = \max_{\beta \in \mathrm{extr}(G)} \min_{\beta' \in \mathrm{extr}(G')} \|\beta - \beta'\|_2$
o "extr" denotes the extreme points, i.e., the topics in LDA
o Equivalent to the Hausdorff metric from convex geometry
❖ Posterior Contraction Analysis
➢ How fast does the posterior contract toward the true topic polytope $G^*$, i.e., at what rate $\epsilon$ can we guarantee $d_{\mathcal{M}}(G, G^*) \leq \epsilon$?
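The minimum-matching distance is straightforward to compute once each polytope is represented by its extreme points (one topic per row). A sketch, with function names of our own choosing; symmetrizing the one-sided distance penalizes both missing and spurious topics.

```python
import numpy as np

def one_sided(G1, G2):
    # For each extreme point of G1, Euclidean distance to its nearest
    # extreme point of G2; then take the worst (largest) such distance.
    dists = np.linalg.norm(G1[:, None, :] - G2[None, :, :], axis=2)
    return dists.min(axis=1).max()

def d_minmatch(G1, G2):
    # d_M(G1, G2) = max{ d(G1, G2), d(G2, G1) }
    return max(one_sided(G1, G2), one_sided(G2, G1))
```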
Posterior Contraction Analysis
❖ Theorem 1
Let the Dirichlet parameters for the topic proportions satisfy $\alpha_k \in (0,1]$, and assume either one holds:
(A1) $K = K^*$, i.e., the true number of topics is known;
(A2) the Euclidean distance between each pair of topics is bounded from below.
Then as $D \to \infty$ and $N \to \infty$ such that $N > \log D$:
$\Pi\big(d_{\mathcal{M}}(G, G^*) \leq C\,\delta_{D,N} \mid \text{data}\big) \to 1$,
where the upper bound on the contraction rate is $\delta_{D,N} = \big(\tfrac{\log D}{D} + \tfrac{\log N}{N}\big)^{1/2}$
❖ Insights
➢ The length of documents N needs only be on the order of the logarithm of the number of documents D
➢ Convergence rate: $\max\big\{\sqrt{\tfrac{\log D}{D}},\, \sqrt{\tfrac{\log N}{N}}\big\}$
➢ The rate does not depend on the number of topics K if $K^*$ is known
➢ An overfitted setting, i.e., $K \gg K^*$, is preferred over underfitting, because underfitting leaves a persistent error (see the next theorem)
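As a quick sanity check on the document-length insight, here is the Theorem 1 rate plugged in at concrete values; a sketch that ignores the constant $C$.

```python
import math

def rate_known_K(D, N):
    # Theorem 1 contraction rate, up to a constant factor.
    assert N > math.log(D), "theorem assumes N grows faster than log D"
    return math.sqrt(math.log(D) / D + math.log(N) / N)

# Even D = 1e6 documents only requires N > log(1e6) ~ 13.8 words each.
print(rate_known_K(10**6, 100))   # ~0.21, dominated by the log(N)/N term
```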
Posterior Contraction Analysis
❖ Theorem 2
Under the same conditions as the previous theorem, except that neither (A1) nor (A2) holds:
(A1) $K = K^*$, i.e., the true number of topics is known;
(A2) the Euclidean distance between each pair of topics is bounded from below.
Then for $K^* < K \leq |V|$:
$\Pi\big(d_{\mathcal{M}}(G, G^*) \leq C\,\delta_{D,N} \mid \text{data}\big) \to 1$,
where the upper bound on the contraction rate is $\delta_{D,N} = \big(\tfrac{\log D}{D} + \tfrac{\log N}{N}\big)^{\frac{1}{2(K+1)}}$
❖ Insights
➢ The convergence is very slow, with an exponent that degrades as K grows
➢ Underfitting ($K < K^*$) results in a persistent error even with infinite data, and is therefore not considered
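The same quantities under Theorem 2 show how weak the guarantee becomes as K grows. The exponent $\frac{1}{2(K+1)}$ below follows the rate as reconstructed above (the extraction was ambiguous, so treat the exact exponent as an assumption); constants are again ignored.

```python
import math

def rate_unknown_K(D, N, K):
    # Theorem 2 contraction rate, up to a constant factor.
    return (math.log(D) / D + math.log(N) / N) ** (1.0 / (2 * (K + 1)))

for K in (3, 10, 50):
    print(K, rate_unknown_K(10**6, 100, K))
# K=3: ~0.68, K=10: ~0.87, K=50: ~0.97 -- the bound barely contracts for large K
```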
Empirical study & practical guidance
Synthetic data:
● Ground-truth number of topics $K^* = 3$
● Vocabulary size $|V| = 5000$
● Metric: the minimum-matching Euclidean distance $d_{\mathcal{M}}$ defined earlier
● Focus on varying the following parameters:
○ D: number of documents
○ N: length of documents
○ β: topic-word Dirichlet hyperparameter (η in the generative process above)
○ K: number of topics specified for inference
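A scaled-down sketch of one axis of this study (varying D), using scikit-learn's variational LDA as a stand-in for the paper's inference procedure, and reusing `d_minmatch` from the earlier sketch. The ground-truth topics here are our own simulated ones.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
K, V, N = 3, 5000, 50
beta_true = rng.dirichlet(np.full(V, 0.01), size=K)   # sparse ground-truth topics

def sample_corpus(D):
    # Document-term count matrix drawn from the LDA generative process
    # (topic indicators marginalized out via the mixture theta @ beta_true).
    X = np.zeros((D, V))
    for d in range(D):
        theta = rng.dirichlet(np.full(K, 0.1))
        words = rng.choice(V, size=N, p=theta @ beta_true)
        np.add.at(X[d], words, 1)
    return X

for D in (200, 1000, 5000):
    lda = LatentDirichletAllocation(n_components=K, doc_topic_prior=0.1,
                                    topic_word_prior=0.01, random_state=0)
    lda.fit(sample_corpus(D))
    beta_hat = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    print(D, d_minmatch(beta_hat, beta_true))   # error should shrink as D grows
```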
Synthetic Data
● Larger number of documents => better performance
● Overly large number of topics for the model => worse performance
● Topics are known to be word-sparse, so the topic-word Dirichlet hyperparameter should be set small (e.g., β = 0.01)
● Longer documents => better performance
Synthetic Data
Goal: empirically verify the theoretical contraction-rate bounds provided by the theorems
Empirical study & practical guidance
Real data:
● Metric: point-wise mutual information (PMI; Newman et al., 2011)
● Focus on varying the following parameters:
○ D: number of documents
○ N: length of documents
○ β: topic-word Dirichlet hyperparameter
○ α: document-topic Dirichlet hyperparameter
○ K: number of topics specified for inference
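For reference, a sketch of a PMI-based coherence score in the spirit of Newman et al. (2011), estimating probabilities from document co-occurrence; the +1 smoothing and the interface are our choices, not a quotation of that paper's exact formula.

```python
import itertools
import math

def pmi_coherence(top_words, docs):
    """Average PMI over pairs of a topic's top words.
    `docs` is a list of sets of token ids; probabilities come from
    document frequencies, i.e., p(w) = df(w) / n."""
    n = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in top_words}
    score, pairs = 0.0, 0
    for wi, wj in itertools.combinations(top_words, 2):
        co = sum(1 for d in docs if wi in d and wj in d)
        # PMI = log p(wi, wj) / (p(wi) p(wj)); +1 smoothing avoids log 0.
        score += math.log((co + 1) * n / (df[wi] * df[wj] + 1))
        pairs += 1
    return score / pairs
```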
Real Data
● Individual documents each associated with a small number of topics => better performance
● Since each document is associated with few topics, the document-topic Dirichlet hyperparameter should be set small (e.g., α = 0.1)