Latent Semantic Indexing
Mandar Haldekar
CMSC 676
Introduction to LSI
• Retrieval based on word overlap between document and query is not enough:
  ◦ Synonymy (different words, same meaning) decreases recall
  ◦ Polysemy (same word, different meanings) decreases precision
• Retrieval based on the underlying concept or topic is important
Introduction to LSI
• Assumption: there is some underlying latent/hidden semantic structure in the corpus.
• LSI projects both documents and terms into a lower-dimensional space that represents the latent semantic concepts/topics in the corpus.
Technical Details
• For a term-document matrix A of size t x d and rank r, SVD gives the factorization
  A = U Σ V^T
  where U is t x r, Σ is r x r, and V^T is r x d.
• U and V have orthonormal columns.
• Σ is an r x r diagonal matrix containing the singular values of A in descending order.
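A minimal sketch of this factorization in NumPy; the toy 5 x 4 term-document matrix is hypothetical, chosen only to make the shapes concrete.

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (t = 5, d = 4).
A = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Thin SVD: U is t x r, s holds the singular values in descending order,
# and Vt is r x d.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Verify the factorization A = U Σ V^T.
Sigma = np.diag(s)
assert np.allclose(A, U @ Sigma @ Vt)
```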
Low Rank Approximation
• Full SVD: A (t x d) = U (t x r) Σ (r x r) V^T (r x d), where rows of U correspond to terms and columns of V^T to documents.
• Rank-k approximation: keep only the k largest singular values:
  A_k (t x d) = U_k (t x k) Σ_k (k x k) V_k^T (k x d)
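Continuing the sketch above, the rank-k approximation simply truncates the three factors; k = 2 is an arbitrary illustrative choice.

```python
k = 2  # number of latent dimensions to keep (illustrative choice)

# Keep only the k largest singular values and the matching columns/rows.
U_k = U[:, :k]            # t x k
Sigma_k = np.diag(s[:k])  # k x k
Vt_k = Vt[:k, :]          # k x d

# A_k is the best rank-k approximation of A in the Frobenius norm
# (Eckart-Young theorem).
A_k = U_k @ Sigma_k @ Vt_k
```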
Query Processing
• A query q must be projected into the k-dimensional space:
  q_k = Σ_k^-1 U_k^T q
• The required number of top-ranking similar documents is then retrieved, e.g., by cosine similarity in the k-dimensional space.
• The representation in k-dimensional space captures semantic structure, so queries are placed near similar terms and documents in the semantic space even when they share no words with them.
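Continuing the sketch, this folds a query into the latent space and ranks documents by cosine similarity; the query vector is hypothetical, and weighting documents by Σ_k is one common choice (some formulations compare against the rows of V_k directly).

```python
def project_query(q, U_k, s_k):
    """Fold a term-space query vector into the k-dimensional LSI space:
    q_k = Σ_k^-1 U_k^T q."""
    return np.diag(1.0 / s_k) @ (U_k.T @ q)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Hypothetical query containing terms 0 and 2 of the toy vocabulary.
q = np.array([1, 0, 1, 0, 0], dtype=float)
q_k = project_query(q, U_k, s[:k])

# Document coordinates in the k-space: columns of Σ_k V_k^T.
docs_k = Sigma_k @ Vt_k
scores = [cosine(q_k, docs_k[:, j]) for j in range(docs_k.shape[1])]
ranking = np.argsort(scores)[::-1]  # document indices, best match first
```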
Applications
• Information Filtering
  ◦ Compute the SVD on an initial set of documents.
  ◦ Represent the user's interests as one or more document vectors in the latent semantic space.
  ◦ New documents matching these vectors are returned.
• Cross-Language Retrieval
  ◦ Apply SVD to a bilingual corpus to generate a semantic space, then process queries in this space without any query translation.
• Text Summarization (see the sketch below)
  ◦ Construct a term-sentence matrix and, for each of the top singular vectors, select the sentence with the highest value in that vector.
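A minimal sketch of the summarization step in the spirit of Gong and Liu (cited below); the random term-sentence matrix is hypothetical, and duplicate selections are not handled.

```python
import numpy as np

def lsi_summary(term_sentence, num_sentences):
    """Pick one sentence per top right singular vector (Gong & Liu style)."""
    _, _, Vt = np.linalg.svd(term_sentence, full_matrices=False)
    chosen = []
    for i in range(num_sentences):
        # Sentence with the largest weight in the i-th right singular vector;
        # abs() guards against SVD sign ambiguity.
        chosen.append(int(np.argmax(np.abs(Vt[i]))))
    return chosen

# Hypothetical 6-term x 5-sentence matrix.
TS = np.random.default_rng(0).random((6, 5))
print(lsi_summary(TS, 2))  # indices of two summary sentences
```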
Current State of Research
Issue: scaling LSI to large collections.
Some recent steps toward it:
◦ Sparse LSA (2010)
  Uses L1 regularization to enforce sparsity constraints on the projection matrix, yielding a compact representation.
◦ Regularized LSI (2011)
  A new model in which the term-document matrix is represented as the product of two matrices, term-topic and topic-document; regularization constrains the solution (see the sketch below). Main advantage: it can be parallelized.
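A hedged sketch of the kind of regularized objective these methods minimize; the specific penalties and weights below are illustrative, not the papers' exact formulations.

```python
import numpy as np

def regularized_objective(D, U, V, lam1=0.1, lam2=0.1):
    """Reconstruction error plus an L1 sparsity term and an L2 term.

    D: t x d term-document matrix, U: t x m term-topic matrix,
    V: m x d topic-document matrix. Penalties and weights are illustrative,
    not the exact formulations of Sparse LSA or RLSI.
    """
    recon = np.linalg.norm(D - U @ V, "fro") ** 2
    return recon + lam1 * np.abs(U).sum() + lam2 * np.linalg.norm(V, "fro") ** 2
```

Because the reconstruction term decomposes over topics and documents, the factor updates can be distributed across machines, which is the parallelism advantage noted above.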
References
• S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, September 1990.
• H. Zha and H. D. Simon. On updating problems in latent semantic indexing. SIAM Journal on Scientific Computing, 21(2):782-791, 1999.
• Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, Louisiana, United States, pp. 19-25, 2001.
• M. Littman, S. T. Dumais, and T. K. Landauer. Automatic cross-language information retrieval using latent semantic indexing. AAAI-97 Spring Symposium Series: Cross-Language Text and Speech Retrieval, pp. 18-24, Stanford University, 1997.
• Q. Wang, J. Xu, H. Li, and N. Craswell. Regularized latent semantic indexing. Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 685-694, New York, NY, USA, 2011.
• X. Chen, B. Bai, Y. Qi, Q. Lin, and J. Carbonell. Sparse latent semantic analysis. NIPS Workshop, 2010.
• M. Berry, S. Dumais, and G. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595, December 1995.
Thank you! Questions?