SLIDE 1

Latent Semantic Indexing

Mandar Haldekar CMSC 676

SLIDE 2

Introduction to LSI

• Retrieval based on word overlap between document and query is not enough
• Synonymy decreases recall
• Polysemy decreases precision
• Retrieval based on the underlying concept or topic is important

SLIDE 3

Introduction to LSI

• Assumption: there is some underlying latent/hidden semantic structure in the corpus
• LSI projects both documents and terms into a lower-dimensional space that represents the latent semantic concepts/topics in the corpus

SLIDE 4

Technical Details

• For a term-document matrix A of size t x d and rank r, there exists a factorization using the singular value decomposition (SVD):

A = U x Σ x VT

where U (t x r) and V (d x r) have orthonormal columns, and Σ is an r x r diagonal matrix containing the singular values of A in descending order.
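The factorization A = U Σ Vᵀ can be sketched with NumPy; the toy term-document matrix below is invented purely for illustration (t = 4 terms, d = 3 documents):

```python
import numpy as np

# Hypothetical toy term-document matrix; rows are terms, columns are
# documents, and the counts are made up for illustration.
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Thin SVD: U is t x r, s holds the singular values in descending
# order, and Vt is V transposed (r x d).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The product U * diag(s) * V^T reconstructs A exactly.
A_rebuilt = U @ np.diag(s) @ Vt
```

Note that NumPy returns Vᵀ directly and the singular values as a vector rather than a diagonal matrix.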

SLIDE 5

Low Rank Approximation

• Full SVD: A (t x d) = U (t x r) x Σ (r x r) x VT (r x d), with terms as rows and documents as columns
• Truncating to the k largest singular values gives the rank-k approximation: Ak (t x d) = Uk (t x k) x Σk (k x k) x VkT (k x d)
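The rank-k truncation can be sketched as follows; the matrix is the same invented toy example, and by the Eckart-Young theorem the spectral-norm error of the best rank-k approximation equals the first discarded singular value:

```python
import numpy as np

# Same hypothetical toy term-document matrix (t=4 terms, d=3 documents).
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and the matching
# columns of U / rows of V^T.
k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # (t x k)(k x k)(k x d) -> t x d

# Spectral-norm error of the rank-k approximation; Eckart-Young says
# this equals s[k], the largest singular value that was dropped.
err = np.linalg.norm(A - Ak, ord=2)
```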

SLIDE 6

Query Processing

• Query q must be projected into the k-dimensional space: qk = Σk-1 x UkT x q (where Σk-1 is the inverse of Σk and UkT is the transpose of Uk)
• The required number of top-ranking similar documents is then retrieved
• The representation in k-dimensional space captures semantic structure, so queries are placed near similar terms and documents in the semantic space (even when they have no word overlap)
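Query folding and ranking can be sketched as below; the term-document matrix and the query are invented for illustration, and cosine similarity in the latent space is one common (assumed) choice of ranking function:

```python
import numpy as np

# Hypothetical toy term-document matrix (terms: rows, documents: columns).
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Each document is a row of V_k, i.e. a column of V_k^T, in the
# k-dimensional latent space.
docs_k = Vtk.T                                # d x k

def project_query(q):
    """Fold a term-space query into the latent space: qk = Sigma_k^-1 Uk^T q."""
    return np.diag(1.0 / sk) @ Uk.T @ q

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0.0, 0.0, 0.0])            # query containing only term 0
qk = project_query(q)

# Rank documents by cosine similarity to the folded-in query.
ranking = sorted(range(docs_k.shape[0]),
                 key=lambda j: cosine(qk, docs_k[j]), reverse=True)
```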

SLIDE 7

Applications

• Information Filtering
  • Compute the SVD on an initial set of documents
  • Represent the user's interests as one or more document vectors in the latent semantic space
  • New documents matching these vectors are returned
• Cross-Language Retrieval
  • Apply the SVD to a bilingual corpus to generate a semantic space, then process queries in this semantic space without any query translation
• Text Summarization
  • Construct a term-sentence matrix and, for each singular vector (latent pattern), select the sentence with the highest index value
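The summarization idea above can be sketched as follows; the term-sentence counts are invented, and taking absolute values of the Vᵀ rows is a simplification here (SVD signs are arbitrary), not necessarily the exact selection rule of the cited method:

```python
import numpy as np

# Hypothetical term-sentence matrix: rows = terms, columns = sentences
# of the document to summarize; counts are made up for illustration.
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def summarize(Vt, n_sentences):
    """For each singular vector (latent 'topic') in order of importance,
    pick the not-yet-chosen sentence with the largest magnitude in the
    corresponding row of V^T."""
    chosen = []
    for row in np.abs(Vt):
        for j in np.argsort(row)[::-1]:       # sentences, best first
            if j not in chosen:
                chosen.append(int(j))
                break
        if len(chosen) == n_sentences:
            return chosen
    return chosen

summary_ids = summarize(Vt, n_sentences=2)    # indices of selected sentences
```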

SLIDE 8

Current State of Research

• Issue: scaling LSI to large collections
• Some recent steps towards it:
  • Sparse LSA, 2010
    • Uses L1 regularization to enforce sparsity constraints on the projection matrix
    • Yields a compact representation
  • Regularized LSI, 2011
    • A new model in which the term-document matrix is represented as the product of two matrices: term-topic and topic-document
    • Also uses regularization to constrain the solution
    • Main advantage: can be parallelized

SLIDE 9

References

• S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, September 1990.
• H. Zha and H. D. Simon. On updating problems in latent semantic indexing. SIAM Journal on Scientific Computing, 21(2):782–791, 1999.
• Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, Louisiana, United States, pp. 19–25, 2001.
• M. Littman, S. T. Dumais, and T. K. Landauer. Automatic cross-language information retrieval using latent semantic indexing. AAAI-97 Spring Symposium Series: Cross-Language Text and Speech Retrieval, pp. 18–24, Stanford University, 1997.
• Q. Wang, J. Xu, H. Li, and N. Craswell. Regularized latent semantic indexing. Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, pp. 685–694, New York, NY, USA, 2011.
• X. Chen, B. Bai, Y. Qi, Q. Lin, and J. Carbonell. Sparse latent semantic analysis. NIPS Workshop, 2010.
• M. Berry, S. Dumais, and G. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573–595, December 1995.
SLIDE 10

Thank you! Questions?