  1. Latent Semantic Indexing: A Regularized Approach to Large-Scale Modeling. Parth Guntoorkar parth.gun@umbc.edu WI52610

  2. INTRODUCTION ● LSI finds the hidden (latent) relationships between words (semantics) in order to improve information understanding (indexing). ● Document similarity is defined by the ways in which those words occur or do not occur. ● LSI performs a low-rank approximation of the document-term matrix (typically rank 100-300). ● Retrieval is based on the underlying concept or subject of a document rather than on exact keyword matches.

  3. EXAMPLE

  4. GENERAL IDEA ● Map documents (and terms) to a low-dimensional representation. ● Design the mapping so that the low-dimensional space reflects semantic associations (the latent semantic space). ● Compute document similarity based on the inner product in this latent semantic space. ● The mapping is computed using SVD (Singular Value Decomposition).

  5. ● SVD decomposes a matrix into a product of three matrices. For a term-document matrix A of size t x d and rank r, there exists a factorization A = U Σ V^T, ● where U (t x r) and V (d x r) are the left and right singular matrices respectively, and Σ is an r x r diagonal matrix containing the singular values of A in descending order.
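
A minimal NumPy sketch of this decomposition and its rank-k truncation (the toy matrix and k = 2 below are illustrative; real LSI deployments keep on the order of 100-300 dimensions):

```python
import numpy as np

# Toy term-document matrix A (t x d): rows are terms, columns are documents.
A = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Full SVD: A = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation keeps only the k largest singular values and the
# corresponding singular vectors, giving the low-rank LSI approximation.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))
```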

  6. BUILDING LSI 1. Preprocess the collection of documents. a. Stemming b. Removing stop words 2. Build Frequency Matrix 3. Apply Pre-weights 4. Decompose FM into U, S, V 5. Project Queries
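
A small end-to-end sketch of steps 1-5 in Python. The corpus, query, and k = 2 are illustrative; stemming is omitted, and scikit-learn's TfidfVectorizer stands in for the pre-weighting step:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Steps 1-3: preprocess, build the frequency matrix, and apply tf-idf pre-weights.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold stocks after the market fell",
]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)       # documents x terms
A = X.T.toarray()                        # terms x documents, as on the slide

# Step 4: decompose the matrix into U, S, V and truncate to k topics.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Step 5: project ("fold in") a query into the latent space: q_hat = S_k^-1 U_k^T q.
q = vectorizer.transform(["stock market"]).T.toarray()
q_hat = np.linalg.inv(Sk) @ Uk.T @ q     # k x 1 query vector in topic space

# Rank documents by cosine similarity to the projected query.
doc_vecs = Vtk                           # k x d document vectors in topic space
scores = (q_hat.T @ doc_vecs).ravel() / (
    np.linalg.norm(q_hat) * np.linalg.norm(doc_vecs, axis=0) + 1e-12)
print("document ranking:", np.argsort(-scores))
```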

  7. WHY USE LSI ● Provides a defense against ‘keyword stuffing’. ● LSI targets synonymy and polysemy. ● It also gives better results and better-ranked pages.

  8. ISSUE AND SOLUTION ● The main issue with LSI is scalability: scaling to larger document collections via parallelization is difficult. ● A few alternatives are available, such as PLSI (Probabilistic LSI) and LDA (Latent Dirichlet Allocation), but most solutions require drastic steps such as vastly reducing the input vocabulary. ● Regularized LSI (RLSI) addresses this problem by representing the term-document matrix as the product of two matrices: a term-topic matrix and a topic-document matrix. ● It also uses regularization to constrain the solution. ● The main advantage is that it can be parallelized.

  9. REGULARIZED LSI (RLSI) ● RLSI differs from LSI in that it uses regularization instead of orthogonality to constrain the solution. ● Two methods of RLSI: ○ batch Regularized Latent Semantic Indexing (bRLSI) ○ online Regularized Latent Semantic Indexing (oRLSI) ● Both methods are formalized as the minimization of a quadratic loss function regularized by the ℓ1 and/or ℓ2 norm. ● The collection is represented as a term-document matrix, where each entry is the occurrence count (or tf-idf score) of a term in a document.
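
Concretely, writing the term-document matrix as D ≈ UV with K topics, the batch objective in the Uℓ1-Vℓ2 setting takes roughly the following form (a sketch after Wang et al. [1]; λ1 and λ2 are regularization weights, and the penalties are swapped or doubled in the other variants):

```latex
\min_{U,\,V}\; \|D - UV\|_F^2
  \;+\; \lambda_1 \sum_{k=1}^{K} \|u_k\|_1
  \;+\; \lambda_2 \sum_{n=1}^{N} \|v_n\|_2^2
```

where u_k are the columns of the term-topic matrix U (one per topic) and v_n are the columns of the topic-document matrix V (one per document).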

  10. ● The term-document matrix is then approximated by the product of two matrices: a term-topic matrix and a topic-document matrix. ○ term-topic matrix: represents each latent topic as a weighted combination of terms ○ topic-document matrix: represents each document as a mixture of the latent topics
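
A simplified Python sketch of how such a factorization can be fit by alternating updates: a closed-form ridge step for the topic-document matrix V and a soft-thresholded (ℓ1) gradient step for the term-topic matrix U. This is an illustrative stand-in, not the exact coordinate-descent algorithm of the RLSI paper, and all hyperparameters are placeholders:

```python
import numpy as np

def rlsi_sketch(D, K=10, lam1=0.1, lam2=0.1, iters=50, seed=0):
    """Factor a term-document matrix D (M x N) into U (M x K) and V (K x N)
    by approximately minimizing ||D - UV||_F^2 + lam1*||U||_1 + lam2*||V||_F^2."""
    rng = np.random.default_rng(seed)
    M, N = D.shape
    U = 0.01 * rng.standard_normal((M, K))
    V = 0.01 * rng.standard_normal((K, N))
    for _ in range(iters):
        # V-step (ridge regression, closed form): V = (U^T U + lam2 I)^-1 U^T D
        V = np.linalg.solve(U.T @ U + lam2 * np.eye(K), U.T @ D)
        # U-step: one proximal-gradient (soft-thresholding) step on the l1-penalized loss
        step = 1.0 / (2.0 * np.linalg.norm(V @ V.T, 2) + 1e-12)
        grad = 2.0 * (U @ V - D) @ V.T
        U = U - step * grad
        U = np.sign(U) * np.maximum(np.abs(U) - step * lam1, 0.0)
    return U, V

# Example: factor a random 200-term x 50-document matrix into 5 sparse topics.
D = np.abs(np.random.default_rng(1).standard_normal((200, 50)))
U, V = rlsi_sketch(D, K=5)
print("fraction of nonzero term-topic weights:", round(float(np.mean(U != 0)), 3))
```

In the actual algorithm, the U update decomposes into an independent ℓ1 problem per term and the V update into an independent ridge problem per document, which is what makes RLSI straightforward to parallelize.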

  11. Performance of RLSI ● TREC datasets are used to compare different RLSI regularization strategies and to compare RLSI with existing topic modeling methods. ● The TREC datasets used were AP, WSJ, and OHSUMED, which are widely used in relevance ranking experiments. ● Different regularization strategies were compared on (batch) RLSI, for example RLSI (Uℓ1-Vℓ2), RLSI (Uℓ2-Vℓ1), RLSI (Uℓ1-Vℓ1), and RLSI (Uℓ2-Vℓ2).

  12. ● Topics Discovered by RLSI Variants on AP ● Average topic compactness is defined as the average ratio of terms with nonzero weights per topic
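
For reference, a one-function sketch of that compactness measure, assuming U is the term-topic matrix with one column per topic:

```python
import numpy as np

def avg_topic_compactness(U):
    """Average fraction of terms with a nonzero weight per topic (columns of U)."""
    terms_per_topic = np.count_nonzero(U, axis=0) / U.shape[0]
    return float(terms_per_topic.mean())
```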

  13. Retrieval Performance of RLSI Variants on AP Retrieval Performance of RLSI Variants on WSJ ● Topic-matching scores were combined with the term-matching scores given by the conventional IR model BM25. ● Normalized Discounted Cumulative Gain (NDCG) is a measure of ranking quality.
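
One common way to combine the two signals is a linear interpolation, e.g. s(q, d) = α·s_topic(q, d) + (1 − α)·s_BM25(q, d), and the resulting rankings are then scored with NDCG. A small sketch of NDCG@k in its standard graded-relevance, log-discount form (the example labels are made up):

```python
import numpy as np

def ndcg_at_k(relevances, k=10):
    """NDCG@k for a ranked list of graded relevance labels (best-ranked first)."""
    rel = np.asarray(relevances, dtype=float)
    gains = 2.0 ** rel - 1.0
    discounts = np.log2(np.arange(2, rel.size + 2))
    dcg = np.sum((gains / discounts)[:k])
    ideal = np.sort(gains)[::-1]                 # gains under the ideal ordering
    idcg = np.sum((ideal / discounts)[:k])
    return float(dcg / idcg) if idcg > 0 else 0.0

# Example: the relevance-2 document was ranked third instead of first.
print(round(ndcg_at_k([1, 0, 2, 0, 1], k=5), 3))
```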

  14. Retrieval Performance of different methods on AP dataset

  15. ● The RLSI variants were compared in terms of topic readability, topic compactness, and retrieval performance. ● It is better practice to apply the ℓ1 norm on U and the ℓ2 norm on V in RLSI to achieve good topic readability, topic compactness, and retrieval performance, ● where U is the term-topic matrix and V is the topic-document matrix.

  16. Applications ● Cross-Language Retrieval ○ Apply SVD to a bilingual corpus to generate a shared semantic space, then process queries in this semantic space without any query translation. ● Text Summarization ○ Construct a term-sentence matrix and, for each latent pattern (singular vector), select the sentence with the highest weight. ● Search Engine Optimization (SEO)
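
A compact sketch of that summarization idea (the sentence set is illustrative, and this follows the classic select-the-top-sentence-per-singular-vector heuristic rather than any specific production system):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def lsa_summary(sentences, n_topics=2):
    """Build a term-sentence matrix, take its SVD, and for each of the top
    n_topics right singular vectors pick the sentence with the largest weight."""
    A = CountVectorizer(stop_words="english").fit_transform(sentences).T.toarray()
    _, _, Vt = np.linalg.svd(A.astype(float), full_matrices=False)
    picked = []
    for topic in Vt[:n_topics]:
        idx = int(np.argmax(np.abs(topic)))
        if idx not in picked:
            picked.append(idx)
    return [sentences[i] for i in sorted(picked)]

sentences = [
    "Latent semantic indexing maps terms and documents to a low-rank space.",
    "The decomposition is computed with the singular value decomposition.",
    "Regularized LSI replaces orthogonality constraints with l1 and l2 penalties.",
    "This makes the factorization easy to parallelize over large collections.",
]
print(lsa_summary(sentences, n_topics=2))
```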

  17. REFERENCES
  1. Wang, Q., Xu, J., Li, H., & Craswell, N. (2013). Regularized Latent Semantic Indexing. ACM Transactions on Information Systems, 31(1), 1–44. DOI: 10.1145/2414782.2414787
  2. Atreya, A., & Elkan, C. (2011). Latent semantic indexing (LSI) fails for TREC collections. ACM SIGKDD Explorations Newsletter, 12(2), 5. DOI: 10.1145/1964897.1964900
  3. Chen, X., Qi, Y., Bai, B., Lin, Q., & Carbonell, J. G. (2011). Sparse Latent Semantic Analysis. Proceedings of the 2011 SIAM International Conference on Data Mining. DOI: 10.1137/1.9781611972818.41
  4. Crain, S. P., Zhou, K., Yang, S.-H., & Zha, H. (2012). Dimensionality Reduction and Topic Modeling: From Latent Semantic Indexing to Latent Dirichlet Allocation and Beyond. In Aggarwal, C., & Zhai, C. (Eds.), Mining Text Data. Springer, Boston, MA.

  18. Thank You!!
