NPFL103: Information Retrieval (11) Latent semantic indexing

Pavel Pecina (pecina@ufal.mff.cuni.cz)
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University

Original slides are courtesy of Hinrich Schütze, University of Stuttgart.

Contents

▶ Latent semantic indexing
▶ Dimensionality reduction
▶ LSI in information retrieval
▶ LSI as soft clustering

Latent semantic indexing

Recall: Term-document matrix

            Anthony and  Julius  The
            Cleopatra    Caesar  Tempest  Hamlet  Othello  Macbeth
anthony     5.25         3.18    0.0      0.0     0.0      0.35
brutus      1.21         6.10    0.0      1.0     0.0      0.0
caesar      8.59         2.54    0.0      1.51    0.25     0.0
calpurnia   0.0          1.54    0.0      0.0     0.0      0.0
cleopatra   2.85         0.0     0.0      0.0     0.0      0.0
mercy       1.51         0.0     1.90     0.12    5.25     0.88
worser      1.37         0.0     0.11     4.15    0.25     1.95
…

This matrix is the basis for computing the similarity between documents and queries.

Today: Can we transform this matrix so that we get a better measure of similarity between documents and queries?

Latent semantic indexing: Overview

▶ We decompose the term-document matrix into a product of matrices.
▶ The particular decomposition: singular value decomposition (SVD).
▶ SVD: $C = U \Sigma V^T$ (where $C$ = term-document matrix)
▶ We use SVD to compute a new, improved term-document matrix $C'$.
▶ We get better similarity values out of $C'$ (compared to $C$).
▶ Using SVD for this purpose is called latent semantic indexing or LSI.

Example of $C = U \Sigma V^T$: The matrix $C$

C      d1  d2  d3  d4  d5  d6
ship    1   0   1   0   0   0
boat    0   1   0   0   0   0
ocean   1   1   0   0   0   0
wood    1   0   0   1   1   0
tree    0   0   0   1   0   1

▶ This is a standard term-document matrix.
▶ Actually, we use a non-weighted matrix here to simplify the example.
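The decomposition of this example matrix can be sketched with NumPy. This is a minimal illustration, not the procedure used to produce the slides: it assumes `numpy.linalg.svd` in its reduced form (`full_matrices=False`), and the variable names are chosen here for illustration. The signs of individual columns of `U` and rows of `Vt` may differ from the values printed on the slides, since an SVD is only unique up to such sign flips.

```python
import numpy as np

# Term-document matrix C from the example
# (rows: ship, boat, ocean, wood, tree; columns: d1..d6).
C = np.array([
    [1, 0, 1, 0, 0, 0],  # ship
    [0, 1, 0, 0, 0, 0],  # boat
    [1, 1, 0, 0, 0, 0],  # ocean
    [1, 0, 0, 1, 1, 0],  # wood
    [0, 0, 0, 1, 0, 1],  # tree
], dtype=float)

# Reduced SVD: U is 5x5, s holds the 5 singular values
# in descending order, Vt is 5x6.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Multiplying the three factors back together recovers C
# (up to floating-point error).
C_rebuilt = U @ np.diag(s) @ Vt

print(np.round(s, 2))
```

Here `s` is returned as a vector; `np.diag(s)` turns it into the diagonal matrix that the slides call Σ.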

Example of $C = U \Sigma V^T$: The matrix $U$

U          1      2      3      4      5
ship   -0.44  -0.30   0.57   0.58   0.25
boat   -0.13  -0.33  -0.59   0.00   0.73
ocean  -0.48  -0.51  -0.37   0.00  -0.61
wood   -0.70   0.35   0.15  -0.58   0.16
tree   -0.26   0.65  -0.41   0.58  -0.09

▶ One row per term, one column per $\min(M, N)$, where $M$ is the number of terms and $N$ is the number of documents.
▶ This is an orthonormal matrix: (i) Row vectors have unit length. (ii) Any two distinct row vectors are orthogonal to each other.
▶ Think of the dimensions as “semantic” dimensions that capture distinct topics like politics, sports, economics. Here, dimension 2 = land/water.
▶ Each number $u_{ij}$ in the matrix indicates how strongly related term $i$ is to the topic represented by semantic dimension $j$.
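The orthonormality claim can be checked directly on the printed values. The sketch below uses plain Python and the two-decimal values of U from the slide, so the dot products only come out approximately 1 and 0; the helper names are illustrative. Because U is square in this example, the check is run on both its rows and its columns.

```python
# U as printed on the slide, rounded to two decimals
# (rows: ship, boat, ocean, wood, tree).
U = [
    [-0.44, -0.30,  0.57,  0.58,  0.25],
    [-0.13, -0.33, -0.59,  0.00,  0.73],
    [-0.48, -0.51, -0.37,  0.00, -0.61],
    [-0.70,  0.35,  0.15, -0.58,  0.16],
    [-0.26,  0.65, -0.41,  0.58, -0.09],
]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_deviation_from_orthonormal(vectors):
    """Largest gap between the pairwise dot products and the identity:
    a vector with itself should give 1, two distinct vectors should give 0."""
    worst = 0.0
    for i, a in enumerate(vectors):
        for j, b in enumerate(vectors):
            target = 1.0 if i == j else 0.0
            worst = max(worst, abs(dot(a, b) - target))
    return worst


rows = U
cols = [list(c) for c in zip(*U)]

row_dev = max_deviation_from_orthonormal(rows)
col_dev = max_deviation_from_orthonormal(cols)
print(f"rows: {row_dev:.3f}, columns: {col_dev:.3f}")
```

Both deviations stay small, which is what one would expect from rounding each entry to two decimals.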

Example of $C = U \Sigma V^T$: The matrix $\Sigma$

Σ      1     2     3     4     5
1   2.16  0.00  0.00  0.00  0.00
2   0.00  1.59  0.00  0.00  0.00
3   0.00  0.00  1.28  0.00  0.00
4   0.00  0.00  0.00  1.00  0.00
5   0.00  0.00  0.00  0.00  0.39

▶ This is a square, diagonal matrix of dimensionality $\min(M, N) \times \min(M, N)$.
▶ The diagonal consists of the singular values of $C$.
▶ The magnitude of the singular value measures the importance of the corresponding semantic dimension.
▶ We’ll make use of this by omitting unimportant dimensions.
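One way to make “importance” concrete, offered here as an assumption rather than something the slides state, is the share of each squared singular value in their total. The squared singular values of a matrix sum to its squared Frobenius norm, which for this 0/1 matrix is simply the number of ones (10), so the shares say how much of C each dimension accounts for. The sketch uses the rounded values from the slide; `retained` is an illustrative helper name.

```python
# Singular values as printed on the slide (rounded to two decimals).
singular_values = [2.16, 1.59, 1.28, 1.00, 0.39]

squares = [s * s for s in singular_values]
total = sum(squares)  # close to 10, the number of ones in C

# Share of the total accounted for by each dimension.
shares = [sq / total for sq in squares]


def retained(k):
    """Fraction of the squared Frobenius norm kept by the top k dimensions."""
    return sum(shares[:k])


for k in range(1, len(singular_values) + 1):
    print(k, round(retained(k), 3))
```

Under this measure, keeping the two largest dimensions retains roughly 72% of the total.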

Example of $C = U \Sigma V^T$: The matrix $V^T$

V^T     d1     d2     d3     d4     d5     d6
1    -0.75  -0.28  -0.20  -0.45  -0.33  -0.12
2    -0.29  -0.53  -0.19   0.63   0.22   0.41
3     0.28  -0.75   0.45  -0.20   0.12  -0.33
4     0.00   0.00   0.58   0.00  -0.58   0.58
5    -0.53   0.29   0.63   0.19   0.41  -0.22

▶ One column per document, one row per $\min(M, N)$, where $M$ is the number of terms and $N$ is the number of documents.
▶ This matrix has orthonormal rows: (i) Row vectors have unit length. (ii) Any two distinct row vectors are orthogonal to each other. (With six documents and only five dimensions, the six columns cannot all be mutually orthogonal.)
▶ These are again the semantic dimensions from matrices $U$ and $\Sigma$ that capture distinct topics like politics, sports, economics.
▶ Each number $v_{ij}$ in the matrix indicates how strongly related document $i$ is to the topic represented by semantic dimension $j$.

Example of $C = U \Sigma V^T$: All four matrices

C = U × Σ × V^T

C      d1  d2  d3  d4  d5  d6
ship    1   0   1   0   0   0
boat    0   1   0   0   0   0
ocean   1   1   0   0   0   0
wood    1   0   0   1   1   0
tree    0   0   0   1   0   1

U          1      2      3      4      5
ship   -0.44  -0.30   0.57   0.58   0.25
boat   -0.13  -0.33  -0.59   0.00   0.73
ocean  -0.48  -0.51  -0.37   0.00  -0.61
wood   -0.70   0.35   0.15  -0.58   0.16
tree   -0.26   0.65  -0.41   0.58  -0.09

Σ      1     2     3     4     5
1   2.16  0.00  0.00  0.00  0.00
2   0.00  1.59  0.00  0.00  0.00
3   0.00  0.00  1.28  0.00  0.00
4   0.00  0.00  0.00  1.00  0.00
5   0.00  0.00  0.00  0.00  0.39

V^T     d1     d2     d3     d4     d5     d6
1    -0.75  -0.28  -0.20  -0.45  -0.33  -0.12
2    -0.29  -0.53  -0.19   0.63   0.22   0.41
3     0.28  -0.75   0.45  -0.20   0.12  -0.33
4     0.00   0.00   0.58   0.00  -0.58   0.58
5    -0.53   0.29   0.63   0.19   0.41  -0.22

LSI is a decomposition of $C$ into a representation of the terms, a representation of the documents, and a representation of the importance of the “semantic” dimensions.

LSI: Summary

▶ We’ve decomposed the term-document matrix $C$ into a product of three matrices: $U \Sigma V^T$.
▶ The term matrix $U$ – consists of one (row) vector for each term
▶ The document matrix $V^T$ – consists of one (column) vector for each document
▶ The singular value matrix $\Sigma$ – diagonal matrix with singular values, reflecting importance of each dimension
▶ Next: Why are we doing this?

Dimensionality reduction

How we use the SVD in LSI

▶ Key property: Each singular value tells us how important its dimension is.
▶ By setting less important dimensions to zero, we keep the important information, but get rid of the “details”.
▶ These details may
  ▶ be noise – the reduced LSI is a better representation because it is less noisy.
  ▶ make things dissimilar that should be similar – the reduced LSI is a better representation because it represents similarity better.
▶ Analogy for “fewer details is better”
  ▶ Image of a blue flower
  ▶ Image of a yellow flower
  ▶ Omitting color makes it easier to see the similarity

Reducing the dimensionality to 2

U          1      2      3      4      5
ship   -0.44  -0.30   0.00   0.00   0.00
boat   -0.13  -0.33   0.00   0.00   0.00
ocean  -0.48  -0.51   0.00   0.00   0.00
wood   -0.70   0.35   0.00   0.00   0.00
tree   -0.26   0.65   0.00   0.00   0.00

Σ2     1     2     3     4     5
1   2.16  0.00  0.00  0.00  0.00
2   0.00  1.59  0.00  0.00  0.00
3   0.00  0.00  0.00  0.00  0.00
4   0.00  0.00  0.00  0.00  0.00
5   0.00  0.00  0.00  0.00  0.00

V^T     d1     d2     d3     d4     d5     d6
1    -0.75  -0.28  -0.20  -0.45  -0.33  -0.12
2    -0.29  -0.53  -0.19   0.63   0.22   0.41
3     0.00   0.00   0.00   0.00   0.00   0.00
4     0.00   0.00   0.00   0.00   0.00   0.00
5     0.00   0.00   0.00   0.00   0.00   0.00

Actually, we only zero out singular values in $\Sigma$. This has the effect of setting the corresponding dimensions in $U$ and $V^T$ to zero when computing the product $C = U \Sigma V^T$.
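The remark that zeroing singular values has the same effect as zeroing the matching dimensions of U and V^T can be illustrated with NumPy. This is a sketch under the assumption that the decomposition is computed with `numpy.linalg.svd`; the names `C2_zeroed`, `C2_truncated` and `k` are illustrative, not taken from the slides.

```python
import numpy as np

# Term-document matrix C from the example
# (rows: ship, boat, ocean, wood, tree; columns: d1..d6).
C = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2  # number of dimensions to keep

# Variant 1: keep U and Vt as they are, zero out all but the
# k largest singular values.
s2 = s.copy()
s2[k:] = 0.0
C2_zeroed = U @ np.diag(s2) @ Vt

# Variant 2: drop the unused columns of U and rows of Vt instead.
C2_truncated = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Both variants give the same rank-k matrix.
print(np.allclose(C2_zeroed, C2_truncated))
print(np.round(C2_zeroed, 2))
```

The second variant is what one would typically store in practice, since the zeroed columns and rows carry no information.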

Reducing the dimensionality to 2

C2 = U × Σ2 × V^T

C2        d1     d2     d3     d4     d5     d6
ship    0.85   0.52   0.28   0.13   0.21  -0.08
boat    0.36   0.36   0.16  -0.20  -0.02  -0.18
ocean   1.01   0.72   0.36  -0.04   0.16  -0.21
wood    0.97   0.12   0.20   1.03   0.62   0.41
tree    0.12  -0.39  -0.08   0.90   0.41   0.49

Σ2     1     2     3     4     5
1   2.16  0.00  0.00  0.00  0.00
2   0.00  1.59  0.00  0.00  0.00
3   0.00  0.00  0.00  0.00  0.00
4   0.00  0.00  0.00  0.00  0.00
5   0.00  0.00  0.00  0.00  0.00

$U$ and $V^T$ are the unmodified matrices from the decomposition of $C$; only $\Sigma$ is replaced by $\Sigma_2$.
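To see what the reduced matrix does to similarity values, one can compare two documents before and after the reduction. The sketch below picks d2 and d3 (a choice made here for illustration) and uses plain Python with the column values printed in C and C2; the helper names are illustrative. In C the two documents share no term, so their dot product is zero; in C2 they come out as clearly similar.

```python
import math

# Columns d2 and d3 of the original matrix C
# (term order: ship, boat, ocean, wood, tree).
d2_c = [0, 1, 1, 0, 0]
d3_c = [1, 0, 0, 0, 0]

# The same two columns in the reduced matrix C2 (values from the slide).
d2_c2 = [0.52, 0.36, 0.72, 0.12, -0.39]
d3_c2 = [0.28, 0.16, 0.36, 0.20, -0.08]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


dot_original = dot(d2_c, d3_c)
dot_reduced = dot(d2_c2, d3_c2)
cos_reduced = cosine(d2_c2, d3_c2)

print("dot in C: ", dot_original)
print("dot in C2:", round(dot_reduced, 2))
print("cosine in C2:", round(cos_reduced, 2))
```

d2 contains boat and ocean while d3 contains ship; the reduced representation treats these water-related terms as belonging to the same semantic dimension, which is where the non-zero similarity comes from.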
