LDA-Based Document Models for Ad-hoc Retrieval (Xing Wei and W. Bruce Croft)
Haiyun Jin, Lei Ju, Ruize Qin
Background
● Text document representation is a critical part of information retrieval (IR).
● Topic models can capture the relationships between words.
Introduction
● Introduce a retrieval model based on LDA.
● Compare LDA with other topic models.
● Evaluate the LDA model on the IR problem through experiments, and compare its performance and efficiency with other models.
Related Work
● Probabilistic Latent Semantic Indexing (pLSI)
● Cluster-based Retrieval
Probabilistic Latent Semantic Indexing (pLSI)
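For reference, Hofmann's pLSI associates each word occurrence with a latent topic z and models a document-word pair as a mixture over topics:

P(d, w) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d)

Both P(w|z) and P(z|d) are free parameters fit to the training corpus, which is the source of the drawbacks on the next slide.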
Probabilistic Latent Semantic Indexing (pLSI)
● Drawbacks
○ Too many parameters (the number of P(z|d) parameters grows with the number of training documents) -> Overfitting!
○ The generative semantics are not well-defined: there is no natural way to assign probabilities to previously unseen documents.
Cluster-based Retrieval
● Assumption
○ Each document is related to only one topic.
● Linear interpolation smoothing (see the sketch below)
○ The document language model is smoothed with the language model of the cluster the document belongs to.
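A minimal sketch of that interpolation, with P_ML denoting maximum-likelihood estimates and λ the interpolation weight; Liu and Croft's full model also smooths with the collection model, so the exact parameterization should be taken from their paper:

P(w \mid d) = \lambda\, P_{ML}(w \mid d) + (1 - \lambda)\, P_{ML}(w \mid \mathrm{Cluster}_d)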
Cluster-based Retrieval
● Drawbacks
○ Assumption: each document is related to only one topic.
○ This does not hold for long documents and large corpora.
LDA
Instead of involving a large number of free parameters, LDA uses only two hyperparameter vectors, alpha and beta, and feeds them into Dirichlet distributions to automatically generate a topic distribution for each document and a word distribution for each topic.
LDA framework
N_d: number of words in document d; K: number of topics; N: number of documents; z: a topic; w: a word token
● Use parameter α to generate a topic distribution θ_d for each document d.
● Use parameter β to generate a word distribution φ_z for each topic z.
● Pick a topic z from the topic distribution θ_d of document d, and use φ_z to generate the word w in d.
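To make the generative story concrete, here is a minimal sketch of LDA's generative process in Python with numpy. The corpus size, vocabulary size, and document lengths are made-up illustration values, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4           # number of topics (illustrative)
V = 1000        # vocabulary size (illustrative)
N_docs = 3      # number of documents (illustrative)
alpha = 50 / K  # symmetric Dirichlet prior on theta (the paper's default)
beta = 0.01     # symmetric Dirichlet prior on phi (the paper's default)

# One word distribution phi_z per topic, drawn from Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=K)       # shape (K, V)

docs = []
for d in range(N_docs):
    # Topic distribution theta_d for this document, drawn from Dirichlet(alpha)
    theta_d = rng.dirichlet(np.full(K, alpha))      # shape (K,)
    N_d = rng.poisson(100)                          # document length (illustrative)
    # For each token: pick a topic z ~ theta_d, then a word w ~ phi_z
    topics = rng.choice(K, size=N_d, p=theta_d)
    docs.append([rng.choice(V, p=phi[z]) for z in topics])
```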
LDA equation revisited
After we get the posterior estimates θ̂ and φ̂ of θ and φ:

P_{lda}(w \mid d) = \sum_{z=1}^{K} P(w \mid z, \hat{\varphi})\, P(z \mid d, \hat{\theta}) = \sum_{z=1}^{K} \hat{\varphi}_{z,w}\, \hat{\theta}_{d,z}
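In numpy terms this sum over topics is a matrix product; theta_hat and phi_hat below are hypothetical stand-ins for the posterior estimates:

```python
import numpy as np

# Hypothetical posterior estimates: theta_hat is (N_docs, K), phi_hat is (K, V)
theta_hat = np.random.default_rng(2).dirichlet(np.ones(4), size=3)    # 3 docs, 4 topics
phi_hat = np.random.default_rng(3).dirichlet(np.ones(1000), size=4)   # 4 topics, 1000 words

# p_lda[d, w] = sum_z theta_hat[d, z] * phi_hat[z, w]
p_lda = theta_hat @ phi_hat    # shape (3, 1000); each row sums to 1
```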
Why is LDA better than pLSI?
● With the help of the Dirichlet distribution, LDA treats the topic mixture distribution, i.e., the θ, as a K-parameter hidden random variable rather than a large set of individual parameters linked to the training set. This makes LDA a well-defined generative model.
● Reduced complexity (fewer parameters): pLSI needs roughly kV + kM parameters, growing with the number of training documents M, while LDA needs only k + kV.
● Avoids the situation of too many parameters and too many local maxima.
Why is LDA better than cluster-based retrieval?
● Clustering assumes each document can only talk about one topic.
● LDA changes the assumption so that each document can cover multiple (up to K) topics. The probability of covering each topic is controlled by the θ vector. This is more flexible for large document sets.
LDA-based retrieval
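A sketch of the combination, following our reading of Wei and Croft's model: the Dirichlet-smoothed document language model is linearly interpolated with the LDA estimate, where μ is the Dirichlet smoothing parameter and λ is the interpolation weight tuned below:

P(w \mid d) = \lambda \left( \frac{N_d}{N_d + \mu}\, P_{ML}(w \mid d) + \left(1 - \frac{N_d}{N_d + \mu}\right) P_{ML}(w \mid \mathrm{Coll}) \right) + (1 - \lambda)\, P_{lda}(w \mid d)

Documents are then ranked by query likelihood, P(Q \mid d) = \prod_{q \in Q} P(q \mid d), computed with this smoothed model.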
Experiment
Before doing experiments it is important to preprocess the data, in order to compare results with other models such as cluster-based retrieval.
1. We have five collections which contain the query sets and relevance judgments.
2. However, the Federal Register (FR) collection has been left out:
1) Not enough queries: only 21 (the other collections have around 100).
2) 6 of the 21 queries have only one relevant document, which may cause biased results.
Parameters
1. The parameter lambda and the number of Markov chains in the Gibbs sampling should be tuned: exhaustive search or manual hill-climbing search.
2. Train the model on the AP collection and look at the average performance on the other four collections: WSJ, FT, SJMN, and LA. (Metric: average precision, since the final task is retrieval.)
3. Symmetric priors: alpha = 50/K and beta = 0.01 are normally used. Results are not sensitive to them, so they do not require much tuning.
Continued
How do we choose the number of iterations?
Good: the more iterations, the more likely the Markov chain is to converge.
Bad: once the Markov chain converges, extra iterations are useless.
1. We don't know when convergence will happen.
2. The model becomes inefficient if the number of iterations is large.
Convergence detection for Markov chains is still an open research question.
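For intuition about what each iteration does, here is a minimal collapsed Gibbs sampling sketch for LDA in Python. It illustrates the standard algorithm rather than the paper's implementation; docs, K, V, alpha, beta are assumed to come from the generative sketch above:

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha, beta, n_iter=50, seed=1):
    """One collapsed Gibbs chain: resample each token's topic given all others."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # document-topic counts
    n_kw = np.zeros((K, V))           # topic-word counts
    n_k = np.zeros(K)                 # topic totals
    # Random initialization of topic assignments
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove this token's current assignment from the counts
                n_dk[d, t] -= 1; n_kw[t, w] -= 1; n_k[t] -= 1
                # Conditional distribution over topics for this token
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t
                n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    # Posterior estimates of theta and phi from the final counts
    theta_hat = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    phi_hat = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)
    return theta_hat, phi_hat
```

Running several such chains with different seeds and averaging the estimates is what "number of Markov chains" refers to above.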
Continued
With the number of topics fixed at 400 and lambda = 0.7, different iteration numbers and different numbers of Markov chains were tried to find when the model becomes stable. From the results, 50 iterations with 3 or more Markov chains give quite stable performance, so these values are used in the final experiments.
Choosing the number of topics
With the same tuning procedure, K = 800 gives the best average precision on the large collections. This K = 800 is much smaller than the optimal K = 2000 in the cluster-based model.
Experiment Results
LDA-based Retrieval (LBDM) improves:
● 21.64% over Query Likelihood Retrieval (QL)
● 13.97% over Cluster-based Retrieval (CBDM)
Settings: lambda = 0.7, 50 iterations, 3 Markov chains.
LBDM gives good results on different collections.
Why does LBDM Perform Better? An Example
Q: "buyout leverage"
D: "Farley Unit Defaults on Pepperell Buyout Loan" (a relevant document)
D does not contain the words "buyout" and "leverage" simultaneously, so it is ranked low in cluster models (single topic). But in the LDA-based model, D is a mixture of topics: Economic + Money_Market + .... Words in these topics are highly correlated with the word "leverage", so LBDM ranks D higher.
Comparing LBDM with the Relevance Model (RM)
RM: uses pseudo-feedback and needs online processing.
● Good: achieves the best performance.
● Bad: requires an extra search for each query, so it is less efficient.
LDA: an offline-processing model.
● Needs no extra processing per query, so it is more efficient, with similar performance.
Using LBDM as the pseudo-feedback for the Relevance Model (RM), moderate improvements are obtained, which is better than the small improvements for the combination of RM and CBDM.
Conclusion
1. Experiments show that the LDA-based approach consistently outperforms the cluster-based approach.
2. The performance of LBDM is close to that of the Relevance Model incorporating pseudo-feedback.
3. Applying the LDA model to IR tasks is feasible with suitable parameters.
4. LDA estimation is done offline as a one-time calculation, so LDA can be a good substitute for pseudo-relevance feedback.
More work needs to be done with larger collections.
References
1. Chang, Jonathan, et al. "Reading Tea Leaves: How Humans Interpret Topic Models." Advances in Neural Information Processing Systems, 2009.
2. Liu, Xiaoyong, and W. Bruce Croft. "Cluster-Based Retrieval Using Language Models." Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2004.
3. Hofmann, Thomas. "Probabilistic Latent Semantic Indexing." ACM SIGIR Forum, vol. 51, no. 2, ACM, 2017.