Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis
Jian Tang, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei, Ming Zhang
ICML 2014 best paper
Presenter: Lu Lin & Tianlu Wang
CS6501: Text Mining

Outline
❖ Background and Motivations
❖ Posterior Contraction Analysis
❖ Empirical Study & Practical Guidance

Background
❖ Latent Dirichlet Allocation (LDA) for topic modeling
D documents, each with N words, generated from K topics:
➢ $w_{dn}$: observed words
➢ $\theta_d$: document-topic proportion
➢ $z_{dn}$: topic indicators
➢ $\phi_k$: topic-word proportion
❖ Generative process:
$\phi_k \mid \beta \sim \mathrm{Dirichlet}(\beta)$
$\theta_d \mid \alpha \sim \mathrm{Dirichlet}(\alpha)$
$z_{dn} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d)$
$w_{dn} \mid \phi, z_{dn} \sim \mathrm{Multinomial}(\phi_{z_{dn}})$
❖ Bayesian estimation:
$(\hat{\theta}, \hat{\phi}) = \arg\max_{\theta, \phi} p(\theta, \phi \mid w) = \arg\max_{\theta, \phi} p(w \mid \theta, \phi)\, p(\theta, \phi)$
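
The generative process above can be sketched in a few lines of NumPy. This is only an illustrative sampler, not the presenters' code: the function name `sample_lda_corpus`, the symmetric Dirichlet priors, and the example sizes are assumptions made for the sketch.

```python
import numpy as np

def sample_lda_corpus(D, N, K, V, alpha, beta, seed=0):
    """Draw a synthetic corpus from the LDA generative process.

    Assumes symmetric Dirichlet priors (hypothetical simplification).
    """
    rng = np.random.default_rng(seed)
    # phi_k | beta ~ Dirichlet(beta): one word distribution per topic
    phi = rng.dirichlet(np.full(V, beta), size=K)      # shape (K, V)
    # theta_d | alpha ~ Dirichlet(alpha): one topic mixture per document
    theta = rng.dirichlet(np.full(K, alpha), size=D)   # shape (D, K)
    z = np.empty((D, N), dtype=int)
    w = np.empty((D, N), dtype=int)
    for d in range(D):
        # z_dn | theta_d ~ Multinomial(theta_d)
        z[d] = rng.choice(K, size=N, p=theta[d])
        for n in range(N):
            # w_dn | phi, z_dn ~ Multinomial(phi_{z_dn})
            w[d, n] = rng.choice(V, p=phi[z[d, n]])
    return phi, theta, z, w

# Illustrative sizes only; word and topic ids are integers in [0, V) and [0, K).
phi, theta, z, w = sample_lda_corpus(D=20, N=50, K=3, V=100, alpha=0.5, beta=0.1)
```

Only `w` would be observed in practice; `phi`, `theta`, and `z` are the latent quantities that inference tries to recover.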

Motivation
❖ Latent Dirichlet Allocation (LDA) for topic modeling
❖ Questions:
➢ Is my data topic-model “friendly”? Why did LDA fail on my data?
➢ How many documents do I need to learn 100 topics?
❖ What factors affect LDA’s performance?
➢ Number of documents D
➢ Length of individual documents N
➢ Number of topics K
➢ Dirichlet hyper-parameters
❖ Formulate the goal:
➢ How fast (at what rate) does the posterior distribution of the topics $\phi_k$ converge to the true value as D and N approach infinity?
⟹ posterior contraction analysis

Posterior Contraction Analysis
❖ Latent Topic Polytope in LDA
➢ Representation of the latent topic structure through a convex hull: topic polytope $G = \mathrm{conv}(\phi_1, \ldots, \phi_K)$
➢ Distance between two polytopes $G_1$ and $G_2$:
$d_{\mathcal{M}}(G_1, G_2) = \max\{d(G_1, G_2),\, d(G_2, G_1)\}$
$d(G_1, G_2) = \max_{\phi_1 \in \mathrm{extr}(G_1)} \min_{\phi_2 \in \mathrm{extr}(G_2)} \|\phi_1 - \phi_2\|_2$
o “extr” means the extreme points, i.e., topics in LDA
o Equivalent to Hausdorff metric in convex geometry
❖ Posterior Contraction Analysis
➢ How fast does the posterior concentrate around the true topic polytope $G^*$, i.e., $d_{\mathcal{M}}(G, G^*) \le\ ?$
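
As a rough illustration of the polytope distance on this slide, the sketch below computes the minimum-matching Euclidean distance between two sets of topic vectors. It assumes each polytope is given as a NumPy array whose rows are its extreme points (topics); the function names are hypothetical.

```python
import numpy as np

def directed_distance(topics_a, topics_b):
    """d(G_a, G_b): for each topic in G_a, distance to its nearest topic
    in G_b, then the worst case over G_a."""
    diffs = topics_a[:, None, :] - topics_b[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=2)   # shape (K_a, K_b)
    return pairwise.min(axis=1).max()

def minimum_matching_distance(topics_a, topics_b):
    """Symmetric version: the larger of the two directed distances."""
    return max(directed_distance(topics_a, topics_b),
               directed_distance(topics_b, topics_a))

# Two topics versus three topics over a 3-word vocabulary.
g1 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
g2 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
dist = minimum_matching_distance(g1, g2)
```

The distance does not depend on the order of the rows, which matters because topic labels returned by inference are arbitrary. In the example, every topic of `g1` has an exact match in `g2`, but the third topic of `g2` has no close match in `g1`, so the symmetric distance is nonzero.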

Posterior Contraction Analysis
❖ Theorem 1
Let the Dirichlet parameters for topic proportion satisfy $\alpha_j \in (0, 1]$, and assume that either one of the following holds:
(A1) $K = K^*$, i.e., the true number of topics is known;
(A2) the Euclidean distance between each pair of topics is bounded from below.
Then, as $D \to \infty$ and $N \to \infty$ such that $N > \log D$:
$\Pi\big(d_{\mathcal{M}}(G, G^*) \le C\,\delta_{D,N} \mid w\big) \to 1$
where the upper bound for the contraction rate is
$\delta_{D,N} = \left(\frac{\log D}{D} + \frac{\log N}{N} + \frac{\log N}{D}\right)^{1/2}$
❖ Insights
➢ Length of documents N should be at least on the order of the logarithm of the number of documents D
➢ Convergence rate is governed by $\max\left\{\frac{\log D}{D}, \frac{\log N}{N}, \frac{\log N}{D}\right\}$
➢ Rate does not depend on the number of topics K, if $K^*$ is known
➢ Overfitted setting is preferred, i.e., $K \gg K^*$, because:
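
To get a feel for the Theorem 1 bound, the following sketch evaluates the rate $(\log D / D + \log N / N + \log N / D)^{1/2}$ for given corpus sizes. The function name `contraction_rate` is hypothetical, the unknown constant C is ignored, and the values are only meaningful for comparing settings, not as absolute error levels.

```python
import math

def contraction_rate(D, N):
    """Upper bound on the contraction rate from Theorem 1 (constant omitted)."""
    if N <= math.log(D):
        raise ValueError("Theorem 1 requires N > log D")
    return math.sqrt(math.log(D) / D + math.log(N) / N + math.log(N) / D)

# With N fixed, adding documents stops helping once log(N)/N dominates.
rate_small_corpus = contraction_rate(D=1_000, N=100)
rate_large_corpus = contraction_rate(D=1_000_000, N=100)
floor_from_length = math.sqrt(math.log(100) / 100)
```

Under this bound, `rate_large_corpus` stays above `floor_from_length` no matter how large D becomes, which is one way to read the insight that both the number of documents and their length limit performance.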

Posterior Contraction Analysis
❖ Theorem 2
Under the same conditions as the previous theorem, except that neither (A1) nor (A2) holds:
(A1) $K = K^*$, i.e., the true number of topics is known;
(A2) the Euclidean distance between each pair of topics is bounded from below.
Then, for $K^* < K \le |V|$:
$\Pi\big(d_{\mathcal{M}}(G, G^*) \le C\,\delta_{D,N} \mid w\big) \to 1$
where the upper bound for the contraction rate is
$\delta_{D,N} = \left(\frac{\log D}{D} + \frac{\log N}{N} + \frac{\log N}{D}\right)^{\frac{1}{2(K-1)}}$
❖ Insights
➢ The convergence is very slow, depending on K
➢ Underfitting ($K < K^*$) results in a persistent error even with infinite data, and is thus not considered
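
The dependence on K in Theorem 2 can be made concrete with a small sketch that evaluates the bound with exponent $1/(2(K-1))$. The function name `overfitted_rate` is hypothetical and the constant C is again omitted, so only the relative comparison across K is meaningful.

```python
import math

def overfitted_rate(D, N, K):
    """Upper bound on the contraction rate from Theorem 2 (constant omitted)."""
    if K < 2:
        raise ValueError("the exponent 1 / (2 (K - 1)) needs K >= 2")
    base = math.log(D) / D + math.log(N) / N + math.log(N) / D
    return base ** (1.0 / (2 * (K - 1)))

# Same data, increasingly overfitted number of topics.
rates = {K: overfitted_rate(D=10_000, N=1_000, K=K) for K in (2, 5, 10, 50)}
```

For data sizes where the term in parentheses is below 1, the bound moves toward 1 as K grows, which is the sense in which convergence becomes very slow for a heavily overfitted model.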

Empirical study & practical guidance
Synthetic Data:
● Ground truth number of topics K* = 3
● Vocabulary size |V| = 5000
● Metric: Minimum-matching Euclidean distance defined in (1)
● Focus on the variation of the following parameters:
○ D: number of documents
○ N: length of documents
○ β: topic-word Dirichlet distribution hyperparameter
○ K: number of topics specified for inference

Synthetic Data
● Larger number of documents => better performance
● Overly large number of topics for the model => worse performance
● Topics are known to be word-sparse, so the topic-word Dirichlet parameter should be set small (e.g., β = 0.01)
● Longer documents => better performance

Synthetic Data
To verify the exponential theoretical bounds provided by the theorems

Empirical study & practical guidance
Real Data:
● Metric: Point-wise mutual information (Newman et al., 2011)
● Focus on the variation of the following parameters:
○ D: number of documents
○ N: length of documents
○ β: topic-word Dirichlet distribution hyperparameter
○ α: document-topic Dirichlet distribution hyperparameter
○ K: number of topics specified for inference
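
For readers unfamiliar with the metric, here is a simplified sketch of a PMI-based topic score. It assumes document-level co-occurrence counts in a reference corpus and averages PMI over all pairs of a topic's top words; the slides do not give the exact setup used in the experiments, so the function name, the smoothing constant `eps`, and the toy corpus are all assumptions.

```python
import math
from itertools import combinations

def pmi_coherence(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words.

    Probabilities are document-level co-occurrence frequencies
    (an assumed simplification); eps avoids log(0).
    """
    if len(top_words) < 2:
        raise ValueError("need at least two top words")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing all the given words.
        return sum(all(word in s for word in words) for s in doc_sets) / n_docs

    scores = []
    for w_i, w_j in combinations(top_words, 2):
        joint = prob(w_i, w_j)
        independent = prob(w_i) * prob(w_j)
        scores.append(math.log((joint + eps) / (independent + eps)))
    return sum(scores) / len(scores)

# Toy reference corpus: "cat"/"dog" co-occur, "cat"/"car" never do.
docs = [["cat", "dog"], ["cat", "dog"], ["car", "road"], ["car", "road"]]
coherent = pmi_coherence(["cat", "dog"], docs)
incoherent = pmi_coherence(["cat", "car"], docs)
```

A topic whose top words tend to appear together scores higher than one whose top words rarely co-occur, which is why the score can stand in for topic quality when no ground truth is available.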

Real Data
● Documents that are each associated mostly with a small number of topics => better performance
● Each document is associated with few topics, so the document-topic Dirichlet parameter should be set small (e.g., α = 0.1)
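
Why a small α corresponds to documents with few topics can be checked numerically. The sketch below draws document-topic proportions from a symmetric Dirichlet and counts how many topics receive non-negligible weight per document; the function name, the 0.05 threshold, and the sizes are arbitrary choices made for illustration.

```python
import numpy as np

def mean_effective_topics(alpha, K=20, n_docs=2000, threshold=0.05, seed=0):
    """Average number of topics per document whose proportion exceeds
    `threshold`, under theta_d ~ Dirichlet(alpha, ..., alpha)."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(K, alpha), size=n_docs)  # shape (n_docs, K)
    return (theta > threshold).sum(axis=1).mean()

sparse_docs = mean_effective_topics(alpha=0.1)  # mass concentrated on few topics
dense_docs = mean_effective_topics(alpha=1.0)   # mass spread over more topics
```

With the smaller α, each simulated document puts most of its mass on fewer topics, which matches the guidance to set α small when documents are believed to cover only a few topics.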
