LDA-Based Document Models for Ad-hoc Retrieval (Xing Wei and W. Bruce Croft)
Haiyun Jin, Lei Ju, Ruize Qin
Background
● Text document representation is a critical part of information retrieval (IR).
● Topic models can capture the relationships between words.
Introduction
● Introduce a retrieval model based on LDA.
● Compare LDA with other topic models.
● Evaluate the LDA model on the IR problem through experiments, and compare its performance and efficiency with other models.
Related Work
● Probabilistic Latent Semantic Indexing (pLSI)
● Cluster-based Retrieval
Probabilistic Latent Semantic Indexing (pLSI)
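For reference, Hofmann's pLSI associates each word occurrence with a latent topic z and models a document-word pair as a mixture over topics:

P(d, w) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d)

Both P(w|z) and P(z|d) are free parameters fit to the training corpus, which is the source of the drawbacks on the next slide.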
Probabilistic Latent Semantic Indexing (pLSI)
● Drawbacks
○ Too many parameters (the number of P(z|d) parameters grows with the number of training documents) -> Overfitting!
○ The generative semantics are not well-defined: there is no natural way to assign probabilities to previously unseen documents.
Cluster-based Retrieval
● Assumption
○ Each document is related to only one topic.
● Linear interpolation smoothing (see the sketch below)
○ The document language model is smoothed with the language model of the cluster the document belongs to.
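A minimal sketch of that interpolation, with P_ML denoting maximum-likelihood estimates and λ the interpolation weight; Liu and Croft's full model also smooths with the collection model, so the exact parameterization should be taken from their paper:

P(w \mid d) = \lambda\, P_{ML}(w \mid d) + (1 - \lambda)\, P_{ML}(w \mid \mathrm{Cluster}_d)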
Cluster-based Retrieval
● Drawbacks
○ Assumption: each document is related to only one topic.
○ This does not hold for long documents and large corpora.
LDA
Instead of involving a large number of free parameters, LDA uses only two hyperparameter vectors, alpha and beta, and feeds them into Dirichlet distributions to automatically generate a topic distribution for each document and a word distribution for each topic.
LDA framework
N_d: number of words in document d; K: number of topics; N: number of documents; z: a topic; w: a word token
● Use parameter α to generate a topic distribution θ_d for each document d.
● Use parameter β to generate a word distribution φ_z for each topic z.
● Pick a topic z from the topic distribution θ_d of document d, and use φ_z to generate the word w in d.
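To make the generative story concrete, here is a minimal sketch of LDA's generative process in Python with numpy. The corpus size, vocabulary size, and document lengths are made-up illustration values, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4           # number of topics (illustrative)
V = 1000        # vocabulary size (illustrative)
N_docs = 3      # number of documents (illustrative)
alpha = 50 / K  # symmetric Dirichlet prior on theta (the paper's default)
beta = 0.01     # symmetric Dirichlet prior on phi (the paper's default)

# One word distribution phi_z per topic, drawn from Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=K)       # shape (K, V)

docs = []
for d in range(N_docs):
    # Topic distribution theta_d for this document, drawn from Dirichlet(alpha)
    theta_d = rng.dirichlet(np.full(K, alpha))      # shape (K,)
    N_d = rng.poisson(100)                          # document length (illustrative)
    # For each token: pick a topic z ~ theta_d, then a word w ~ phi_z
    topics = rng.choice(K, size=N_d, p=theta_d)
    docs.append([rng.choice(V, p=phi[z]) for z in topics])
```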
LDA equation revisited
After we get the posterior estimates θ̂ and φ̂ of θ and φ:

P_{lda}(w \mid d) = \sum_{z=1}^{K} P(w \mid z, \hat{\varphi})\, P(z \mid d, \hat{\theta}) = \sum_{z=1}^{K} \hat{\varphi}_{z,w}\, \hat{\theta}_{d,z}
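In numpy terms this sum over topics is a matrix product; theta_hat and phi_hat below are hypothetical stand-ins for the posterior estimates:

```python
import numpy as np

# Hypothetical posterior estimates: theta_hat is (N_docs, K), phi_hat is (K, V)
theta_hat = np.random.default_rng(2).dirichlet(np.ones(4), size=3)    # 3 docs, 4 topics
phi_hat = np.random.default_rng(3).dirichlet(np.ones(1000), size=4)   # 4 topics, 1000 words

# p_lda[d, w] = sum_z theta_hat[d, z] * phi_hat[z, w]
p_lda = theta_hat @ phi_hat    # shape (3, 1000); each row sums to 1
```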
Why is LDA better than pLSI?
● With the help of the Dirichlet distribution, LDA treats the topic mixture distribution, i.e., the θ, as a K-parameter hidden random variable rather than a large set of individual parameters linked to the training set. This makes LDA a well-defined generative model.
● Reduced complexity (fewer parameters): pLSI needs roughly kV + kM parameters, growing with the number of training documents M, while LDA needs only k + kV.
● Avoids the situation of too many parameters and too many local maxima.
Why is LDA better than cluster-based retrieval?
● Clustering assumes each document can only talk about one topic.
● LDA changes the assumption so that each document can cover multiple (up to K) topics. The probability of covering each topic is controlled by the θ vector. This is more flexible for large document sets.
LDA-based retrieval
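A sketch of the combination, following our reading of Wei and Croft's model: the Dirichlet-smoothed document language model is linearly interpolated with the LDA estimate, where μ is the Dirichlet smoothing parameter and λ is the interpolation weight tuned below:

P(w \mid d) = \lambda \left( \frac{N_d}{N_d + \mu}\, P_{ML}(w \mid d) + \left(1 - \frac{N_d}{N_d + \mu}\right) P_{ML}(w \mid \mathrm{Coll}) \right) + (1 - \lambda)\, P_{lda}(w \mid d)

Documents are then ranked by query likelihood, P(Q \mid d) = \prod_{q \in Q} P(q \mid d), computed with this smoothed model.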
Experiment
Before doing experiments it is important to preprocess the data, in order to compare results with other models such as cluster-based retrieval.
1. We have five collections which contain the query sets and relevance judgments.
2. However, the Federal Register (FR) collection has been left out:
1) Not enough queries: only 21 (the other collections have around 100).
2) 6 of the 21 queries have only one relevant document, which may cause biased results.
Parameters
1. The parameter lambda and the number of Markov chains in the Gibbs sampling should be tuned: exhaustive search or manual hill-climbing search.
2. Train the model on the AP collection and look at the average performance on the other four collections: WSJ, FT, SJMN, and LA. (Metric: average precision, since the final task is retrieval.)
3. Symmetric priors: alpha = 50/K and beta = 0.01 are normally used. Results are not sensitive to them, so they do not require much tuning.
Continued
How do we choose the number of iterations?
Good: the more iterations, the more likely the Markov chain is to converge.
Bad: once the Markov chain converges, extra iterations are useless.
1. We don't know when convergence will happen.
2. The model becomes inefficient if the number of iterations is large.
Convergence detection for Markov chains is still an open research question.
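For intuition about what each iteration does, here is a minimal collapsed Gibbs sampling sketch for LDA in Python. It illustrates the standard algorithm rather than the paper's implementation; docs, K, V, alpha, beta are assumed to come from the generative sketch above:

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha, beta, n_iter=50, seed=1):
    """One collapsed Gibbs chain: resample each token's topic given all others."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # document-topic counts
    n_kw = np.zeros((K, V))           # topic-word counts
    n_k = np.zeros(K)                 # topic totals
    # Random initialization of topic assignments
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove this token's current assignment from the counts
                n_dk[d, t] -= 1; n_kw[t, w] -= 1; n_k[t] -= 1
                # Conditional distribution over topics for this token
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t
                n_dk[d, t] += 1; n_kw[t, w] += 1; n_k[t] += 1
    # Posterior estimates of theta and phi from the final counts
    theta_hat = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    phi_hat = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)
    return theta_hat, phi_hat
```

Running several such chains with different seeds and averaging the estimates is what "number of Markov chains" refers to above.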
Continued
With the number of topics fixed at 400 and lambda = 0.7, different iteration numbers and different numbers of Markov chains were tried to find when the model becomes stable. From the results, 50 iterations with 3 or more Markov chains give quite stable performance, so these values are used in the final experiments.
Choosing the number of topics
With the same tuning procedure, K = 800 gives the best average precision on the large collections. This K = 800 is much smaller than the optimal K = 2000 in the cluster-based model.
Experiment Results
LDA-based Retrieval (LBDM) improves:
● 21.64% over Query Likelihood Retrieval (QL)
● 13.97% over Cluster-based Retrieval (CBDM)
Settings: lambda = 0.7, 50 iterations, 3 Markov chains.
LBDM gives good results on different collections.
Why does LBDM Perform Better? An Example
Q: "buyout leverage"
D: "Farley Unit Defaults on Pepperell Buyout Loan" (a relevant document)
D does not contain the words "buyout" and "leverage" simultaneously, so it is ranked low in cluster models (single topic). But in the LDA-based model, D is a mixture of topics: Economic + Money_Market + .... Words in these topics are highly correlated with the word "leverage", so LBDM ranks D higher.
Comparing LBDM with the Relevance Model (RM)
RM: uses pseudo-feedback and needs online processing.
● Good: achieves the best performance.
● Bad: requires an extra search for each query, so it is less efficient.
LDA: an offline-processing model.
● Needs no extra processing per query, so it is more efficient, with similar performance.
Using LBDM as the pseudo-feedback for the Relevance Model (RM), moderate improvements are obtained, which is better than the small improvements for the combination of RM and CBDM.
Conclusion
1. Experiments show that the LDA-based approach consistently outperforms the cluster-based approach.
2. The performance of LBDM is close to that of the Relevance Model incorporating pseudo-feedback.
3. Applying the LDA model to IR tasks is feasible with suitable parameters.
4. LDA estimation is done offline as a one-time calculation, so LDA can be a good substitute for pseudo-relevance feedback.
More work needs to be done with larger collections.
References
1. Chang, Jonathan, et al. "Reading Tea Leaves: How Humans Interpret Topic Models." Advances in Neural Information Processing Systems, 2009.
2. Liu, Xiaoyong, and W. Bruce Croft. "Cluster-Based Retrieval Using Language Models." Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2004.
3. Hofmann, Thomas. "Probabilistic Latent Semantic Indexing." ACM SIGIR Forum, vol. 51, no. 2, ACM, 2017.