Latent Semantic Indexing
Mandar Haldekar, CMSC 676
Introduction to LSI
- Retrieval based on word overlap between document and query is not enough:
  - Synonymy decreases recall; polysemy decreases precision.
- Retrieval based on the underlying concept or topic is important.
- Assumption: there is some underlying latent/hidden semantic structure in the corpus.
- LSI projects both documents and terms into a lower-dimensional space that represents the latent semantic concepts/topics in the corpus.
Technical Details
- For a term-document matrix A of size t x d and rank r, there exists a factorization using SVD:

  A = U Σ V^T

  where U is t x r with terms as rows, V is d x r with documents as rows, both orthonormal, and Σ is an r x r diagonal matrix containing the singular values of A in descending order.

Low Rank Approximation
- Keeping only the k largest singular values yields the rank-k approximation A_k = U_k Σ_k V_k^T, where U_k is t x k, Σ_k is k x k, and V_k^T is k x d.
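As a minimal sketch of the factorization and its rank-k truncation, using NumPy; the toy term-document matrix and k = 2 are illustrative assumptions, not from the slides:

```python
import numpy as np

# Assumed toy term-document matrix A (t = 5 terms x d = 4 documents).
A = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, with s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation: keep only the k largest singular values.
k = 2
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]
Ak = Uk @ Sk @ Vtk  # best rank-k approximation of A in the Frobenius norm

print(np.linalg.matrix_rank(Ak))  # 2
```

For large sparse collections one would compute only the top k factors (e.g. with a truncated SVD routine) rather than the full decomposition.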
Query Processing
- A query q must be projected into the k-dimensional space: q_k = Σ_k^{-1} U_k^T q
- The required number of top-ranking similar documents is then retrieved.
- The representation in the k-dimensional space captures semantic structure, so queries are placed near similar terms and documents in the semantic space, even when they share no words.
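The projection and ranking can be sketched as follows; the toy matrix, query vector, and k = 2 are illustrative assumptions:

```python
import numpy as np

# Assumed toy term-document matrix (t = 5 terms x d = 4 documents).
A = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Project the query into the latent space: q_k = Sigma_k^{-1} U_k^T q
q = np.array([1, 0, 0, 0, 1], dtype=float)  # query as a term vector
qk = np.linalg.inv(Sk) @ Uk.T @ q

# Documents in the latent space are the rows of V_k (columns of Vt_k);
# rank them by cosine similarity to the projected query.
docs_k = Vtk.T
sims = docs_k @ qk / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(qk))
ranking = np.argsort(-sims)  # most similar documents first
print(ranking)
```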
Applications
Information Filtering
- Compute the SVD on an initial set of documents.
- Represent the user's interest as one or more document vectors in the latent semantic space.
- New documents that match these vectors are returned.
Cross Language Retrieval
- Apply SVD to a bilingual corpus to generate a semantic space, then process queries in this space without any query translation.
Text Summarization
- Construct a term-sentence matrix and, for each of the top latent patterns (singular vectors), select the sentence with the highest value.
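A minimal sketch of this selection scheme (one sentence per top singular vector, in the spirit of Gong & Liu); the term-sentence matrix and summary length are illustrative assumptions:

```python
import numpy as np

# Assumed toy term-sentence matrix (t = 4 terms x n = 4 sentences).
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Pick one sentence per top singular vector: the sentence whose value in
# that right singular vector is largest (abs() guards against the sign
# ambiguity of singular vectors).
k = 2  # summary length in sentences
summary = [int(np.argmax(np.abs(Vt[i]))) for i in range(k)]
print(summary)  # indices of the selected sentences
```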
Current State of Research
Issue: scaling LSI to large collections. Some recent steps toward it:
- Sparse LSA, 2010: uses L1 regularization to enforce sparsity constraints on the projection matrix, giving a compact representation.
- Regularized LSI, 2011: a new model in which the term-document matrix is represented as the product of two matrices, term-topic and topic-document; it also uses regularization to constrain the solution. Main advantage: it can be parallelized.
References
- S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, September 1990.
- H. Zha and H. D. Simon. On updating problems in latent semantic indexing. SIAM Journal on Scientific Computing, 21(2):782–791, 1999.
- Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, Louisiana, United States, pp. 19–25, 2001.
- M. Littman, S. T. Dumais, and T. K. Landauer. Automatic cross-language information retrieval using latent semantic indexing. AAAI-97 Spring Symposium Series: Cross-Language Text and Speech Retrieval, pp. 18–24, Stanford University, 1997.
- Q. Wang, J. Xu, H. Li, and N. Craswell. Regularized latent semantic indexing. Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, pages 685–694, New York, NY, USA, 2011.
- X. Chen, B. Bai, Y. Qi, Q. Lin, and J. Carbonell. Sparse latent semantic analysis. NIPS Workshop, 2010.
- M. Berry, S. Dumais, and G. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573–595, December 1995.