Latent Semantic Indexing: A Regularized Approach to Large-Scale Modeling. Parth Guntoorkar, parth.gun@umbc.edu, WI52610
INTRODUCTION ● LSI finds the hidden (latent) relationships between words (semantics) in order to improve information understanding (indexing). ● Document similarity is defined by the ways in which words do or do not co-occur across documents. ● LSI computes a low-rank approximation of the term-document matrix (typically of rank 100-300). ● Documents are retrieved as relevant based on their underlying meaning or subject rather than on exact keyword matches.
EXAMPLE
GENERAL IDEA ● Map documents (and terms) to a low-dimensional representation. ● Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space). ● Compute document similarity based on the inner product in this latent semantic space ● It uses SVD(Singular Value Decomposition).
● SVD decomposes a matrix into a product of three matrices. For a term-document matrix A of size t x d and rank r, SVD gives the factorization A = U Σ Vᵀ, ● where U and V are the left and right singular matrices respectively, and Σ is an r x r diagonal matrix containing the singular values of A in descending order.
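To make the truncation step concrete, here is a minimal sketch (assuming NumPy; the toy matrix A and the rank k are illustrative placeholders) of the rank-k approximation A ≈ U_k Σ_k V_kᵀ that LSI relies on:

```python
import numpy as np

# Toy 5-term x 4-document count matrix (rows = terms, columns = documents).
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
    [1, 0, 0, 1],
], dtype=float)

# Full SVD: A = U * diag(s) * Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top-k singular triplets (k would be 100-300 for a real collection).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents in the latent semantic space: columns of diag(s_k) @ Vt_k.
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]
print(A_k.round(2))
print(doc_vectors.round(2))
```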
BUILDING LSI 1. Preprocess the collection of documents. a. Stemming b. Removing stop words 2. Build the frequency matrix 3. Apply pre-weights (e.g., tf-idf) 4. Decompose the frequency matrix into U, Σ, V 5. Project queries into the latent space (see the sketch below)
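One possible end-to-end realization of these steps is sketched below using scikit-learn's TfidfVectorizer and TruncatedSVD (stemming is omitted for brevity; the documents, query, and number of components are toy assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

# Steps 1-3: stop-word removal and a pre-weighted (tf-idf) frequency matrix.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)           # documents x terms

# Step 4: low-rank decomposition of the frequency matrix.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_latent = svd.fit_transform(X)            # documents in the latent semantic space

# Step 5: project a query into the same space and rank documents by cosine similarity.
query_latent = svd.transform(vectorizer.transform(["pet cats"]))
scores = cosine_similarity(query_latent, doc_latent).ravel()
print(scores.argsort()[::-1])                # document indices, best match first
```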
WHY USE LSI ● Provides a defense against 'keyword stuffing' ● LSI addresses synonymy and polysemy ● It also tends to produce better retrieval results and better-ranked pages
ISSUE AND SOLUTION ● The main issue with LSI is scalability: scaling to larger document collections via parallelization is difficult. ● A few alternatives are available, such as PLSI (Probabilistic LSI) and LDA (Latent Dirichlet Allocation), but most solutions require drastic steps such as vastly reducing the input vocabulary. ● Regularized LSI addresses this problem by representing the term-document matrix as the product of two matrices: a term-topic matrix and a topic-document matrix. ● It also uses regularization to constrain the solution. ● The main advantage is that it can be parallelized.
REGULARIZED LSI (RLSI) ● RLSI differs from LSI in that it uses regularization instead of orthogonality to constrain the solution. ● Two methods of RLSI: ○ batch Regularized Latent Semantic Indexing (bRLSI) ○ online Regularized Latent Semantic Indexing (oRLSI) ● Both methods are formalized as the minimization of a quadratic loss function regularized by ℓ1 and/or ℓ2 norms. ● The collection is represented as a term-document matrix, where each entry is the occurrence count (or tf-idf score) of a term in a document.
● The term-document matrix is then approximated by the product of two matrices: a term-topic matrix and a topic-document matrix. ○ term-topic matrix: represents each latent topic as a weighted combination of terms ○ topic-document matrix: represents each document as a weighted combination of topics
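Wang et al. formulate this factorization as a regularized quadratic loss, roughly min over U, V of ||D − UV||_F² + λ₁ Σ_k ||u_k||₁ + λ₂ Σ_n ||v_n||₂², where D is the term-document matrix. The sketch below is only an illustrative alternating-minimization loop in that spirit, not the paper's exact algorithm; the parameter values, update rules, and initialization are assumptions:

```python
import numpy as np

def rlsi_batch(D, k, lam1=0.1, lam2=0.1, iters=50, seed=0):
    """Tiny batch-RLSI-style sketch: D (terms x docs) ~= U (terms x k) @ V (k x docs)."""
    rng = np.random.default_rng(seed)
    t, n = D.shape
    U = rng.standard_normal((t, k)) * 0.01
    V = rng.standard_normal((k, n)) * 0.01
    for _ in range(iters):
        # V-update: the l2-regularized least squares has a closed form (ridge regression per document).
        V = np.linalg.solve(U.T @ U + lam2 * np.eye(k), U.T @ D)
        # U-update: one proximal-gradient (ISTA) step for the l1-regularized least-squares problem.
        step = 1.0 / (np.linalg.norm(V @ V.T, 2) + 1e-8)   # 1 / Lipschitz constant of the gradient
        Z = U - step * (U @ V - D) @ V.T
        U = np.sign(Z) * np.maximum(np.abs(Z) - step * lam1, 0.0)   # soft-thresholding keeps U sparse
    return U, V

# Toy usage: factor a random 100-term x 40-document matrix into 5 latent topics.
D = np.abs(np.random.default_rng(1).standard_normal((100, 40)))
U, V = rlsi_batch(D, k=5)
print("nonzero fraction of U:", float((np.abs(U) > 0).mean()))
```

Because the V-update is solved document by document and the U-update term by term, each sub-problem can be distributed across machines, which is what makes the approach parallelizable.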
Performance of RLSI ● TREC datasets are used to compare different RLSI regularization strategies and to compare RLSI with existing topic modeling methods. ● The TREC datasets used were AP, WSJ, and OHSUMED, which are widely used in relevance ranking experiments. ● Different regularization strategies were compared on (batch) RLSI, for example RLSI (Uℓ1-Vℓ2), RLSI (Uℓ2-Vℓ1), RLSI (Uℓ1-Vℓ1), and RLSI (Uℓ2-Vℓ2).
● Topics Discovered by RLSI Variants on AP ● Average topic compactness is defined as the average ratio of terms with nonzero weights per topic
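Read literally, that definition can be computed as in this small sketch (assuming NumPy, with a term-topic matrix U whose columns are topics; the matrix values are a toy example):

```python
import numpy as np

def average_topic_compactness(U, tol=1e-12):
    """Mean fraction of terms with nonzero weight per topic (columns of U are topics)."""
    nonzero_per_topic = (np.abs(U) > tol).mean(axis=0)   # one ratio per topic column
    return float(nonzero_per_topic.mean())

# Example: a sparse 6-term x 2-topic matrix -> compactness (3/6 + 2/6) / 2 ~= 0.42.
U = np.array([[0.9, 0.0],
              [0.0, 0.7],
              [0.4, 0.0],
              [0.0, 0.0],
              [0.2, 0.0],
              [0.0, 0.1]])
print(average_topic_compactness(U))
```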
Retrieval Performance of RLSI Variants on AP and WSJ ● Topic-matching scores were combined with the term-matching scores given by the conventional IR model BM25 ● Normalized Discounted Cumulative Gain (NDCG) is a measure of ranking quality
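For reference, here is a minimal NDCG@k sketch (assuming NumPy and the common exponential-gain formulation; the relevance grades and k are illustrative):

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the top-k results (graded relevance, log2 position discount)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))      # positions 1..k -> log2(2..k+1)
    return float(np.sum((2 ** rel - 1) / discounts))

def ndcg_at_k(relevances, k):
    """DCG of the system's ranking divided by DCG of the ideal (relevance-sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance grades of documents in the order the system ranked them; the ideal order is [3, 2, 1, 0].
print(ndcg_at_k([2, 3, 0, 1], k=4))
```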
Retrieval Performance of different methods on AP dataset
● RLSI variants were evaluated in terms of topic readability, topic compactness, and retrieval performance. ● It is better practice in RLSI to apply the ℓ1 norm on U and the ℓ2 norm on V to achieve good topic readability, topic compactness, and retrieval performance, ● where U is the term-topic matrix and V is the topic-document matrix.
Application ● Cross-Language Retrieval ○ Apply SVD to a bilingual corpus to generate a shared semantic space, then process queries in this semantic space without any query translation. ● Text Summarization ○ Construct a term-sentence matrix and, for each latent pattern, select the sentence with the highest weight in the corresponding singular vector (see the sketch below). ● Search Engine Optimization (SEO)
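As an illustration of the summarization idea, the sketch below (assuming NumPy and scikit-learn; the sentences and the number of patterns are toy choices) builds a term-sentence matrix, applies SVD, and picks the highest-weighted sentence for each of the top latent patterns:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Latent semantic indexing maps terms and documents to a low dimensional space.",
    "The decomposition is computed with the singular value decomposition.",
    "The weather today is sunny with a light breeze.",
    "Low rank approximation keeps only the largest singular values.",
]

# Term-sentence matrix: rows are terms, columns are sentences.
X = CountVectorizer(stop_words="english").fit_transform(sentences)
A = X.T.toarray().astype(float)                  # terms x sentences

# Rows of Vt describe how strongly each sentence expresses each latent pattern.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Pick the highest-weighted sentence for each of the top latent patterns.
n_patterns = 2
picked = []
for i in range(n_patterns):
    idx = int(np.argmax(np.abs(Vt[i])))
    if idx not in picked:
        picked.append(idx)
for idx in sorted(picked):
    print(sentences[idx])
```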
REFERENCES 1. Wang, Q., Xu, J., Li, H., & Craswell, N. (2013). Regularized Latent Semantic Indexing. ACM Transactions on Information Systems, 31(1), 1–44. DOI: 10.1145/2414782.2414787 2. Atreya, A., & Elkan, C. (2011). Latent semantic indexing (LSI) fails for TREC collections. ACM SIGKDD Explorations Newsletter, 12(2), 5. DOI: 10.1145/1964897.1964900 3. Chen, X., Qi, Y., Bai, B., Lin, Q., & Carbonell, J. G. (2011). Sparse Latent Semantic Analysis. Proceedings of the 2011 SIAM International Conference on Data Mining. DOI: 10.1137/1.9781611972818.41 4. Crain, S. P., Zhou, K., Yang, S.-H., & Zha, H. (2012). Dimensionality Reduction and Topic Modeling: From Latent Semantic Indexing to Latent Dirichlet Allocation and Beyond. In: Aggarwal, C., & Zhai, C. (eds.) Mining Text Data. Springer, Boston, MA.
Thank You!!