  1. Latent Semantic Indexing Mandar Haldekar CMSC 676

  2. Introduction to LSI
     • Retrieval based on word overlap between document and query is not enough
     • Synonymy decreases recall (relevant documents that use different words are missed)
     • Polysemy decreases precision (documents that use the same word in a different sense are retrieved)
     • Retrieval based on the underlying concept or topic is important

  3. Introduction to LSI
     • Assumption: there is some underlying latent/hidden semantic structure in the corpus
     • LSI projects both documents and terms into a lower-dimensional space that represents the latent semantic concepts/topics in the corpus

  4. Technical Details
     • For a term-document matrix A of size t x d and rank r, there exists a factorization using the singular value decomposition (SVD):
         A = U Σ V^T
       where U is t x r, Σ is r x r, and V^T is r x d
     • U and V have orthonormal columns; Σ is an r x r diagonal matrix containing the singular values of A in descending order
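
  As a minimal sketch of this factorization, assuming NumPy and a hypothetical toy 5 x 3 term-document matrix (not from the original slides):

     import numpy as np

     # Hypothetical toy term-document matrix: 5 terms x 3 documents
     A = np.array([
         [1, 0, 1],
         [1, 1, 0],
         [0, 1, 1],
         [0, 1, 0],
         [1, 0, 0],
     ], dtype=float)

     # Thin SVD: U is t x min(t, d), Vt is min(t, d) x d; the inner
     # dimension equals the rank r only when A has full rank
     U, s, Vt = np.linalg.svd(A, full_matrices=False)

     print(s)  # singular values come back in descending order, as stated above

     # Sanity check: U diag(s) V^T reconstructs A
     assert np.allclose(U @ np.diag(s) @ Vt, A)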

  5. Low Rank Approximation
     • Full SVD:  A (t x d) = U (t x r) Σ (r x r) V^T (r x d)
     • Keeping only the k largest singular values, with the corresponding columns of U and V, gives the rank-k approximation:
         A_k (t x d) = U_k (t x k) Σ_k (k x k) V_k^T (k x d)
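
  Continuing the sketch above, truncation to the top k singular triplets gives A_k = U_k Σ_k V_k^T (k = 2 is an arbitrary choice for the toy matrix):

     k = 2
     U_k  = U[:, :k]    # t x k
     s_k  = s[:k]       # top-k singular values
     Vt_k = Vt[:k, :]   # k x d

     A_k = U_k @ np.diag(s_k) @ Vt_k   # rank-k approximation of A, t x d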

  6. Query Processing
     • A query q (a vector of term counts) must be projected into the k-dimensional space:
         q_k = Σ_k^{-1} U_k^T q
     • The required number of top-ranking similar documents is then retrieved
     • The representation in the k-dimensional space captures semantic structure, so a query is placed near related terms and documents in the semantic space even when they share no words
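
  A sketch of this fold-in and retrieval step, continuing the toy example (the query vector q is hypothetical, and comparing q_k against the rows of V_k by cosine similarity is one common convention; some implementations scale the document coordinates by Σ_k first):

     q = np.array([1, 0, 0, 1, 0], dtype=float)   # hypothetical query term vector, length t

     q_k = np.diag(1.0 / s_k) @ U_k.T @ q          # q_k = Σ_k^{-1} U_k^T q

     docs_k = Vt_k.T                               # each row: one document in k-dim space
     sims = docs_k @ q_k / (
         np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k) + 1e-12
     )
     ranking = np.argsort(-sims)                   # most similar documents first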

  7. Applications
     • Information Filtering (see the sketch after this list)
        ◦ Compute the SVD on an initial set of documents
        ◦ Represent the user's interests as one or more document vectors in the latent semantic space
        ◦ New documents matching these vectors are returned
     • Cross-Language Retrieval
        ◦ Apply the SVD to a bilingual corpus to generate a shared semantic space, then process queries in this space without any query translation
     • Text Summarization
        ◦ Construct a term-sentence matrix and, for each leading singular vector (pattern), select the sentence that scores highest on it
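
  A sketch of the information-filtering fold-in from the first bullet, continuing the toy example (the new-document vector d and the interest profile are hypothetical; new documents are folded in exactly like queries, without recomputing the SVD):

     d = np.array([0, 1, 1, 0, 0], dtype=float)   # hypothetical new document's term counts
     d_k = np.diag(1.0 / s_k) @ U_k.T @ d          # fold-in: d_k = Σ_k^{-1} U_k^T d

     interest = q_k                                # stand-in for a stored interest vector
     score = d_k @ interest / (np.linalg.norm(d_k) * np.linalg.norm(interest) + 1e-12)
     # return the document to the user if score exceeds a chosen threshold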

  8. Current State of Research
     • Issue: scaling LSI to large collections
     • Some recent steps toward it:
        ◦ Sparse LSA, 2010
           Uses L1 regularization to enforce sparsity constraints on the projection matrix
           Gives a compact representation
        ◦ Regularized LSI, 2011
           A new model in which the term-document matrix is represented as the product of two matrices: term-topic and topic-document
           It also uses regularization to constrain the solution
           Main advantage: it can be parallelized
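
  On the scaling issue itself, one standard building block (plain truncated SVD on a sparse matrix, not Sparse LSA or RLSI themselves) is to compute only the top k singular triplets with SciPy; the corpus size and k below are hypothetical:

     import numpy as np
     from scipy.sparse import random as sparse_random
     from scipy.sparse.linalg import svds

     A_big = sparse_random(10000, 5000, density=0.001, format="csr")  # hypothetical sparse corpus
     U, s, Vt = svds(A_big, k=100)   # only the top 100 singular triplets

     # svds returns singular values in ascending order; reverse them to match
     # the descending convention on slide 4
     order = np.argsort(-s)
     U, s, Vt = U[:, order], s[order], Vt[order, :]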

  9. References
     S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, September 1990.
     H. Zha and H. D. Simon. On updating problems in latent semantic indexing. SIAM Journal on Scientific Computing, 21(2):782-791, 1999.
     Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, Louisiana, United States, pp. 19-25, 2001.
     M. Littman, S. T. Dumais, and T. K. Landauer. Automatic cross-language information retrieval using latent semantic indexing. AAAI-97 Spring Symposium Series: Cross-Language Text and Speech Retrieval, pp. 18-24, Stanford University, 1997.
     Q. Wang, J. Xu, H. Li, and N. Craswell. Regularized latent semantic indexing. Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, pp. 685-694, New York, NY, USA, 2011.
     X. Chen, B. Bai, Y. Qi, Q. Lin, and J. Carbonell. Sparse latent semantic analysis. NIPS Workshop, 2010.
     M. Berry, S. Dumais, and G. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595, December 1995.

  10. Thank you! Questions?
