Latent Semantic Indexing: A Regularized Approach to Large-Scale Modeling. Parth Guntoorkar, parth.gun@umbc.edu, WI52610
INTRODUCTION ● LSI finds the hidden (latent) relationships between words (semantics) in order to improve information understanding (indexing). ● Document similarity is defined by the ways in which words do or do not co-occur across documents. ● LSI computes a low-rank approximation of the term-document matrix (typically of rank 100-300). ● Documents are retrieved as relevant based on their underlying meaning or subject rather than on exact keyword matches.
EXAMPLE
GENERAL IDEA ● Map documents (and terms) to a low-dimensional representation. ● Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space). ● Compute document similarity based on the inner product in this latent semantic space ● It uses SVD(Singular Value Decomposition).
● SVD decomposes a matrix into a product of three matrices. For a term-document matrix A of size t x d and rank r, SVD gives the factorization A = U Σ Vᵀ, ● where U and V are the left and right singular matrices respectively, and Σ is an r x r diagonal matrix containing the singular values of A in descending order.
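To make the truncation step concrete, here is a minimal sketch (assuming NumPy; the toy matrix A and the rank k are illustrative placeholders) of the rank-k approximation A ≈ U_k Σ_k V_kᵀ that LSI relies on:

```python
import numpy as np

# Toy 5-term x 4-document count matrix (rows = terms, columns = documents).
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
    [1, 0, 0, 1],
], dtype=float)

# Full SVD: A = U * diag(s) * Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top-k singular triplets (k would be 100-300 for a real collection).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents in the latent semantic space: columns of diag(s_k) @ Vt_k.
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]
print(A_k.round(2))
print(doc_vectors.round(2))
```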
BUILDING LSI 1. Preprocess the collection of documents. a. Stemming b. Removing stop words 2. Build the frequency matrix 3. Apply pre-weights (e.g., tf-idf) 4. Decompose the frequency matrix into U, Σ, V 5. Project queries into the latent space (see the sketch below)
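One possible end-to-end realization of these steps is sketched below using scikit-learn's TfidfVectorizer and TruncatedSVD (stemming is omitted for brevity; the documents, query, and number of components are toy assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

# Steps 1-3: stop-word removal and a pre-weighted (tf-idf) frequency matrix.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)           # documents x terms

# Step 4: low-rank decomposition of the frequency matrix.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_latent = svd.fit_transform(X)            # documents in the latent semantic space

# Step 5: project a query into the same space and rank documents by cosine similarity.
query_latent = svd.transform(vectorizer.transform(["pet cats"]))
scores = cosine_similarity(query_latent, doc_latent).ravel()
print(scores.argsort()[::-1])                # document indices, best match first
```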
WHY USE LSI ● Provides a defense against 'keyword stuffing' ● LSI addresses synonymy and polysemy ● It also tends to produce better retrieval results and better-ranked pages
ISSUE AND SOLUTION ● The main issue with LSI is scalability: scaling to larger document collections via parallelization is difficult. ● A few alternatives are available, such as PLSI (Probabilistic LSI) and LDA (Latent Dirichlet Allocation), but most solutions require drastic steps such as vastly reducing the input vocabulary. ● Regularized LSI addresses this problem by representing the term-document matrix as the product of two matrices: a term-topic matrix and a topic-document matrix. ● It also uses regularization to constrain the solution. ● The main advantage is that it can be parallelized.
REGULARIZED LSI (RLSI) ● RLSI differs from LSI in that it uses regularization instead of orthogonality to constrain the solution. ● Two methods of RLSI: ○ batch Regularized Latent Semantic Indexing (bRLSI) ○ online Regularized Latent Semantic Indexing (oRLSI) ● Both methods are formalized as the minimization of a quadratic loss function regularized by ℓ1 and/or ℓ2 norms. ● The collection is represented as a term-document matrix, where each entry is the occurrence count (or tf-idf score) of a term in a document.
● The term-document matrix is then approximated by the product of two matrices: a term-topic matrix and a topic-document matrix. ○ term-topic matrix: represents each latent topic as a weighted combination of terms ○ topic-document matrix: represents each document as a weighted combination of topics
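Wang et al. formulate this factorization as a regularized quadratic loss, roughly min over U, V of ||D − UV||_F² + λ₁ Σ_k ||u_k||₁ + λ₂ Σ_n ||v_n||₂², where D is the term-document matrix. The sketch below is only an illustrative alternating-minimization loop in that spirit, not the paper's exact algorithm; the parameter values, update rules, and initialization are assumptions:

```python
import numpy as np

def rlsi_batch(D, k, lam1=0.1, lam2=0.1, iters=50, seed=0):
    """Tiny batch-RLSI-style sketch: D (terms x docs) ~= U (terms x k) @ V (k x docs)."""
    rng = np.random.default_rng(seed)
    t, n = D.shape
    U = rng.standard_normal((t, k)) * 0.01
    V = rng.standard_normal((k, n)) * 0.01
    for _ in range(iters):
        # V-update: the l2-regularized least squares has a closed form (ridge regression per document).
        V = np.linalg.solve(U.T @ U + lam2 * np.eye(k), U.T @ D)
        # U-update: one proximal-gradient (ISTA) step for the l1-regularized least-squares problem.
        step = 1.0 / (np.linalg.norm(V @ V.T, 2) + 1e-8)   # 1 / Lipschitz constant of the gradient
        Z = U - step * (U @ V - D) @ V.T
        U = np.sign(Z) * np.maximum(np.abs(Z) - step * lam1, 0.0)   # soft-thresholding keeps U sparse
    return U, V

# Toy usage: factor a random 100-term x 40-document matrix into 5 latent topics.
D = np.abs(np.random.default_rng(1).standard_normal((100, 40)))
U, V = rlsi_batch(D, k=5)
print("nonzero fraction of U:", float((np.abs(U) > 0).mean()))
```

Because the V-update is solved document by document and the U-update term by term, each sub-problem can be distributed across machines, which is what makes the approach parallelizable.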
Performance of RLSI ● TREC datasets are used to compare different RLSI regularization strategies and to compare RLSI with existing topic modeling methods. ● The TREC datasets used were AP, WSJ, and OHSUMED, which are widely used in relevance ranking experiments. ● Different regularization strategies were compared on (batch) RLSI, for example RLSI (Uℓ1-Vℓ2), RLSI (Uℓ2-Vℓ1), RLSI (Uℓ1-Vℓ1), and RLSI (Uℓ2-Vℓ2).
● Topics Discovered by RLSI Variants on AP ● Average topic compactness is defined as the average ratio of terms with nonzero weights per topic
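Read literally, that definition can be computed as in this small sketch (assuming NumPy, with a term-topic matrix U whose columns are topics; the matrix values are a toy example):

```python
import numpy as np

def average_topic_compactness(U, tol=1e-12):
    """Mean fraction of terms with nonzero weight per topic (columns of U are topics)."""
    nonzero_per_topic = (np.abs(U) > tol).mean(axis=0)   # one ratio per topic column
    return float(nonzero_per_topic.mean())

# Example: a sparse 6-term x 2-topic matrix -> compactness (3/6 + 2/6) / 2 ~= 0.42.
U = np.array([[0.9, 0.0],
              [0.0, 0.7],
              [0.4, 0.0],
              [0.0, 0.0],
              [0.2, 0.0],
              [0.0, 0.1]])
print(average_topic_compactness(U))
```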
Retrieval Performance of RLSI Variants on AP and WSJ ● Topic-matching scores were combined with the term-matching scores given by the conventional IR model BM25 ● Normalized Discounted Cumulative Gain (NDCG) is a measure of ranking quality
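For reference, here is a minimal NDCG@k sketch (assuming NumPy and the common exponential-gain formulation; the relevance grades and k are illustrative):

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the top-k results (graded relevance, log2 position discount)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))      # positions 1..k -> log2(2..k+1)
    return float(np.sum((2 ** rel - 1) / discounts))

def ndcg_at_k(relevances, k):
    """DCG of the system's ranking divided by DCG of the ideal (relevance-sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance grades of documents in the order the system ranked them; the ideal order is [3, 2, 1, 0].
print(ndcg_at_k([2, 3, 0, 1], k=4))
```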
Retrieval Performance of different methods on AP dataset
● RLSI variants were evaluated in terms of topic readability, topic compactness, and retrieval performance. ● It is better practice in RLSI to apply the ℓ1 norm on U and the ℓ2 norm on V to achieve good topic readability, topic compactness, and retrieval performance, ● where U is the term-topic matrix and V is the topic-document matrix.
Application ● Cross-Language Retrieval ○ Apply SVD to a bilingual corpus to generate a shared semantic space, then process queries in this semantic space without any query translation. ● Text Summarization ○ Construct a term-sentence matrix and, for each latent pattern, select the sentence with the highest weight in the corresponding singular vector (see the sketch below). ● Search Engine Optimization (SEO)
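As an illustration of the summarization idea, the sketch below (assuming NumPy and scikit-learn; the sentences and the number of patterns are toy choices) builds a term-sentence matrix, applies SVD, and picks the highest-weighted sentence for each of the top latent patterns:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Latent semantic indexing maps terms and documents to a low dimensional space.",
    "The decomposition is computed with the singular value decomposition.",
    "The weather today is sunny with a light breeze.",
    "Low rank approximation keeps only the largest singular values.",
]

# Term-sentence matrix: rows are terms, columns are sentences.
X = CountVectorizer(stop_words="english").fit_transform(sentences)
A = X.T.toarray().astype(float)                  # terms x sentences

# Rows of Vt describe how strongly each sentence expresses each latent pattern.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Pick the highest-weighted sentence for each of the top latent patterns.
n_patterns = 2
picked = []
for i in range(n_patterns):
    idx = int(np.argmax(np.abs(Vt[i])))
    if idx not in picked:
        picked.append(idx)
for idx in sorted(picked):
    print(sentences[idx])
```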
REFERENCES 1. Wang, Q., Xu, J., Li, H., & Craswell, N. (2013). Regularized Latent Semantic Indexing. ACM Transactions on Information Systems, 31(1), 1–44. DOI: 10.1145/2414782.2414787 2. Atreya, A., & Elkan, C. (2011). Latent semantic indexing (LSI) fails for TREC collections. ACM SIGKDD Explorations Newsletter, 12(2), 5. DOI: 10.1145/1964897.1964900 3. Chen, X., Qi, Y., Bai, B., Lin, Q., & Carbonell, J. G. (2011). Sparse Latent Semantic Analysis. Proceedings of the 2011 SIAM International Conference on Data Mining. DOI: 10.1137/1.9781611972818.41 4. Crain, S. P., Zhou, K., Yang, S.-H., & Zha, H. (2012). Dimensionality Reduction and Topic Modeling: From Latent Semantic Indexing to Latent Dirichlet Allocation and Beyond. In: Aggarwal, C., & Zhai, C. (eds.) Mining Text Data. Springer, Boston, MA.
Thank You!!