Latent Semantic Indexing
Mandar Haldekar
CMSC 676
Introduction to LSI
• Retrieval based on word overlap between document and query is not enough:
  ◦ Synonymy (different words, same meaning) decreases recall
  ◦ Polysemy (same word, different meanings) decreases precision
• Retrieval based on the underlying concept or topic is important
Introduction to LSI
• Assumption: there is some underlying latent/hidden semantic structure in the corpus.
• LSI projects both documents and terms into a lower-dimensional space that represents the latent semantic concepts/topics in the corpus.
Technical Details
• For a term-document matrix A of size t x d and rank r, SVD gives the factorization
  A = U Σ V^T
  where U is t x r, Σ is r x r, and V^T is r x d.
• U and V have orthonormal columns.
• Σ is an r x r diagonal matrix containing the singular values of A in descending order.
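A minimal sketch of this factorization in NumPy; the toy 5 x 4 term-document matrix is hypothetical, chosen only to make the shapes concrete.

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (t = 5, d = 4).
A = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Thin SVD: U is t x r, s holds the singular values in descending order,
# and Vt is r x d.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Verify the factorization A = U Σ V^T.
Sigma = np.diag(s)
assert np.allclose(A, U @ Sigma @ Vt)
```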
Low Rank Approximation
• Full SVD: A (t x d) = U (t x r) Σ (r x r) V^T (r x d), where rows of U correspond to terms and columns of V^T to documents.
• Rank-k approximation: keep only the k largest singular values:
  A_k (t x d) = U_k (t x k) Σ_k (k x k) V_k^T (k x d)
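Continuing the sketch above, the rank-k approximation simply truncates the three factors; k = 2 is an arbitrary illustrative choice.

```python
k = 2  # number of latent dimensions to keep (illustrative choice)

# Keep only the k largest singular values and the matching columns/rows.
U_k = U[:, :k]            # t x k
Sigma_k = np.diag(s[:k])  # k x k
Vt_k = Vt[:k, :]          # k x d

# A_k is the best rank-k approximation of A in the Frobenius norm
# (Eckart-Young theorem).
A_k = U_k @ Sigma_k @ Vt_k
```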
Query Processing
• A query q must be projected into the k-dimensional space:
  q_k = Σ_k^-1 U_k^T q
• The required number of top-ranking similar documents is then retrieved, e.g., by cosine similarity in the k-dimensional space.
• The representation in k-dimensional space captures semantic structure, so queries are placed near similar terms and documents in the semantic space even when they share no words with them.
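Continuing the sketch, this folds a query into the latent space and ranks documents by cosine similarity; the query vector is hypothetical, and weighting documents by Σ_k is one common choice (some formulations compare against the rows of V_k directly).

```python
def project_query(q, U_k, s_k):
    """Fold a term-space query vector into the k-dimensional LSI space:
    q_k = Σ_k^-1 U_k^T q."""
    return np.diag(1.0 / s_k) @ (U_k.T @ q)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Hypothetical query containing terms 0 and 2 of the toy vocabulary.
q = np.array([1, 0, 1, 0, 0], dtype=float)
q_k = project_query(q, U_k, s[:k])

# Document coordinates in the k-space: columns of Σ_k V_k^T.
docs_k = Sigma_k @ Vt_k
scores = [cosine(q_k, docs_k[:, j]) for j in range(docs_k.shape[1])]
ranking = np.argsort(scores)[::-1]  # document indices, best match first
```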
Applications
• Information Filtering
  ◦ Compute the SVD on an initial set of documents.
  ◦ Represent the user's interests as one or more document vectors in the latent semantic space.
  ◦ New documents matching these vectors are returned.
• Cross-Language Retrieval
  ◦ Apply SVD to a bilingual corpus to generate a semantic space, then process queries in this space without any query translation.
• Text Summarization (see the sketch below)
  ◦ Construct a term-sentence matrix and, for each of the top singular vectors, select the sentence with the highest value in that vector.
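A minimal sketch of the summarization step in the spirit of Gong and Liu (cited below); the random term-sentence matrix is hypothetical, and duplicate selections are not handled.

```python
import numpy as np

def lsi_summary(term_sentence, num_sentences):
    """Pick one sentence per top right singular vector (Gong & Liu style)."""
    _, _, Vt = np.linalg.svd(term_sentence, full_matrices=False)
    chosen = []
    for i in range(num_sentences):
        # Sentence with the largest weight in the i-th right singular vector;
        # abs() guards against SVD sign ambiguity.
        chosen.append(int(np.argmax(np.abs(Vt[i]))))
    return chosen

# Hypothetical 6-term x 5-sentence matrix.
TS = np.random.default_rng(0).random((6, 5))
print(lsi_summary(TS, 2))  # indices of two summary sentences
```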
Current State of Research
Issue: scaling LSI to large collections.
Some recent steps toward it:
◦ Sparse LSA (2010)
  Uses L1 regularization to enforce sparsity constraints on the projection matrix, yielding a compact representation.
◦ Regularized LSI (2011)
  A new model in which the term-document matrix is represented as the product of two matrices, term-topic and topic-document; regularization constrains the solution (see the sketch below). Main advantage: it can be parallelized.
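A hedged sketch of the kind of regularized objective these methods minimize; the specific penalties and weights below are illustrative, not the papers' exact formulations.

```python
import numpy as np

def regularized_objective(D, U, V, lam1=0.1, lam2=0.1):
    """Reconstruction error plus an L1 sparsity term and an L2 term.

    D: t x d term-document matrix, U: t x m term-topic matrix,
    V: m x d topic-document matrix. Penalties and weights are illustrative,
    not the exact formulations of Sparse LSA or RLSI.
    """
    recon = np.linalg.norm(D - U @ V, "fro") ** 2
    return recon + lam1 * np.abs(U).sum() + lam2 * np.linalg.norm(V, "fro") ** 2
```

Because the reconstruction term decomposes over topics and documents, the factor updates can be distributed across machines, which is the parallelism advantage noted above.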
References
• S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, September 1990.
• H. Zha and H. D. Simon. On updating problems in latent semantic indexing. SIAM Journal on Scientific Computing, 21(2):782-791, 1999.
• Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, Louisiana, United States, pp. 19-25, 2001.
• M. Littman, S. T. Dumais, and T. K. Landauer. Automatic cross-language information retrieval using latent semantic indexing. AAAI-97 Spring Symposium Series: Cross-Language Text and Speech Retrieval, pp. 18-24, Stanford University, 1997.
• Q. Wang, J. Xu, H. Li, and N. Craswell. Regularized latent semantic indexing. Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 685-694, New York, NY, USA, 2011.
• X. Chen, B. Bai, Y. Qi, Q. Lin, and J. Carbonell. Sparse latent semantic analysis. NIPS Workshop, 2010.
• M. Berry, S. Dumais, and G. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595, December 1995.
Thank you! Questions?