Latent semantic indexing | Dimensionality reduction | LSI in information retrieval | LSI as soft clustering

NPFL103: Information Retrieval (11)
Latent semantic indexing

Pavel Pecina
pecina@ufal.mff.cuni.cz
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University

Original slides are courtesy of Hinrich Schütze, University of Stuttgart.

Contents

▶ Latent semantic indexing
▶ Dimensionality reduction
▶ LSI in information retrieval
▶ LSI as soft clustering

Latent semantic indexing

Recall: Term-document matrix

            Anthony and   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra     Caesar   Tempest
anthony     5.25          3.18     0.0       0.0      0.0       0.35
brutus      1.21          6.10     0.0       1.0      0.0       0.0
caesar      8.59          2.54     0.0       1.51     0.25      0.0
calpurnia   0.0           1.54     0.0       0.0      0.0       0.0
cleopatra   2.85          0.0      0.0       0.0      0.0       0.0
mercy       1.51          0.0      1.90      0.12     5.25      0.88
worser      1.37          0.0      0.11      4.15     0.25      1.95
…

This matrix is the basis for computing the similarity between documents and queries.
Today: Can we transform this matrix so that we get a better measure of similarity between documents and queries?
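
As a reminder of how this matrix yields similarities, here is a minimal Python sketch that compares two document columns by cosine similarity. It uses only the seven terms shown on the slide, so the numbers are illustrative rather than the true similarities over the full vocabulary. The names `PLAYS`, `WEIGHTS`, `column` and `cosine` are made up for this example.

```python
import math

# Columns of the term-document matrix, in slide order.
PLAYS = ["Anthony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# tf-idf weights for the seven terms shown on the slide (one row per term).
WEIGHTS = {
    "anthony":   [5.25, 3.18, 0.0,  0.0,  0.0,  0.35],
    "brutus":    [1.21, 6.10, 0.0,  1.0,  0.0,  0.0],
    "caesar":    [8.59, 2.54, 0.0,  1.51, 0.25, 0.0],
    "calpurnia": [0.0,  1.54, 0.0,  0.0,  0.0,  0.0],
    "cleopatra": [2.85, 0.0,  0.0,  0.0,  0.0,  0.0],
    "mercy":     [1.51, 0.0,  1.90, 0.12, 5.25, 0.88],
    "worser":    [1.37, 0.0,  0.11, 4.15, 0.25, 1.95],
}


def column(play):
    """Return the document vector (one weight per term) for a play."""
    j = PLAYS.index(play)
    return [row[j] for row in WEIGHTS.values()]


def cosine(x, y):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


sim_caesar = cosine(column("Anthony and Cleopatra"), column("Julius Caesar"))
sim_tempest = cosine(column("Anthony and Cleopatra"), column("The Tempest"))
print(f"{sim_caesar:.2f} {sim_tempest:.2f}")
```

On these seven terms, Anthony and Cleopatra comes out closer to Julius Caesar than to The Tempest, which matches intuition. LSI asks whether a transformed matrix can give better similarities of this kind.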

Latent semantic indexing: Overview

▶ We decompose the term-document matrix into a product of matrices.
▶ The particular decomposition: singular value decomposition (SVD).
▶ SVD: C = UΣV^T (where C = term-document matrix)
▶ We use SVD to compute a new, improved term-document matrix C′.
▶ We get better similarity values out of C′ (compared to C).
▶ Using SVD for this purpose is called latent semantic indexing or LSI.

Example of C = UΣV^T: The matrix C

C       d1  d2  d3  d4  d5  d6
ship     1   0   1   0   0   0
boat     0   1   0   0   0   0
ocean    1   1   0   0   0   0
wood     1   0   0   1   1   0
tree     0   0   0   1   0   1

▶ This is a standard term-document matrix.
▶ Actually, we use a non-weighted matrix here to simplify the example.

Example of C = UΣV^T: The matrix U

U         1      2      3      4      5
ship    −0.44  −0.30   0.57   0.58   0.25
boat    −0.13  −0.33  −0.59   0.00   0.73
ocean   −0.48  −0.51  −0.37   0.00  −0.61
wood    −0.70   0.35   0.15  −0.58   0.16
tree    −0.26   0.65  −0.41   0.58  −0.09

▶ One row per term, one column per min(M, N), where M is the number of terms and N is the number of documents.
▶ This is an orthonormal matrix: (i) Column vectors have unit length. (ii) Any two distinct column vectors are orthogonal to each other. Here U is square (5 × 5), so the same holds for its row vectors.
▶ Think of the dimensions as “semantic” dimensions that capture distinct topics like politics, sports, economics. Example: dimension 2 = land/water.
▶ Each number u_ij in the matrix indicates how strongly related term i is to the topic represented by semantic dimension j.
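
The orthonormality claim can be checked directly from the numbers on the slide. The following is a minimal sketch in plain Python. Because the slide values are rounded to two decimals, the check uses a tolerance of 0.02, which is an assumption of this example. The helper `max_deviation_from_identity` is a made-up name.

```python
# U from the slide, rounded to two decimals (rows: ship, boat, ocean, wood, tree).
U = [
    [-0.44, -0.30,  0.57,  0.58,  0.25],
    [-0.13, -0.33, -0.59,  0.00,  0.73],
    [-0.48, -0.51, -0.37,  0.00, -0.61],
    [-0.70,  0.35,  0.15, -0.58,  0.16],
    [-0.26,  0.65, -0.41,  0.58, -0.09],
]


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def max_deviation_from_identity(vectors):
    """Largest |<v_i, v_j> - delta_ij| over all pairs of vectors.

    A value near 0 means the vectors are (approximately) orthonormal.
    """
    worst = 0.0
    for i, vi in enumerate(vectors):
        for j, vj in enumerate(vectors):
            target = 1.0 if i == j else 0.0
            worst = max(worst, abs(dot(vi, vj) - target))
    return worst


rows = U
cols = [list(col) for col in zip(*U)]  # transpose to get column vectors

row_dev = max_deviation_from_identity(rows)
col_dev = max_deviation_from_identity(cols)
print(row_dev < 0.02, col_dev < 0.02)
```

Both deviations stay within rounding error, because U is square in this example. In a reduced SVD with more terms than documents, only the columns of U would be orthonormal.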

Example of C = UΣV^T: The matrix Σ

Σ      1     2     3     4     5
1    2.16  0.00  0.00  0.00  0.00
2    0.00  1.59  0.00  0.00  0.00
3    0.00  0.00  1.28  0.00  0.00
4    0.00  0.00  0.00  1.00  0.00
5    0.00  0.00  0.00  0.00  0.39

▶ This is a square, diagonal matrix of dimensionality min(M, N) × min(M, N).
▶ The diagonal consists of the singular values of C.
▶ The magnitude of the singular value measures the importance of the corresponding semantic dimension.
▶ We’ll make use of this by omitting unimportant dimensions.
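
One way to make “importance” concrete is the following reading, which is added here and not stated on the slide. The squared singular values of a matrix sum to its squared Frobenius norm (the sum of its squared entries), so each σ² can be read as that dimension's share of the matrix. The sketch checks this on the example, up to the two-decimal rounding of the slide values. The variable names are illustrative.

```python
# Term-document matrix C from the example (rows: ship, boat, ocean, wood, tree).
C = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
]

# Diagonal of Sigma as printed on the slide (rounded to two decimals).
singular_values = [2.16, 1.59, 1.28, 1.00, 0.39]

# Squared Frobenius norm: sum of squared entries of C.
frobenius_sq = sum(x * x for row in C for x in row)

# Squared singular values and their total.
energy = [s * s for s in singular_values]
total = sum(energy)

# Share of the total carried by each dimension, and by the top two together.
shares = [e / total for e in energy]
share_top2 = sum(shares[:2])

print(frobenius_sq, round(total, 2))
print([round(s, 2) for s in shares], round(share_top2, 2))
```

Under this reading, the first two dimensions carry roughly 70 percent of the matrix, and the fifth carries very little.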

Example of C = UΣV^T: The matrix V^T

V^T     d1     d2     d3     d4     d5     d6
1     −0.75  −0.28  −0.20  −0.45  −0.33  −0.12
2     −0.29  −0.53  −0.19   0.63   0.22   0.41
3      0.28  −0.75   0.45  −0.20   0.12  −0.33
4      0.00   0.00   0.58   0.00  −0.58   0.58
5     −0.53   0.29   0.63   0.19   0.41  −0.22

▶ One column per document, one row per min(M, N), where M is the number of terms and N is the number of documents.
▶ This is an orthonormal matrix: (i) Row vectors have unit length. (ii) Any two distinct row vectors are orthogonal to each other. These rows are the columns of V. Because V^T is 5 × 6 here, its columns are not all of unit length (for example, column d4 has length of about 0.82).
▶ These are again the semantic dimensions from matrices U and Σ that capture distinct topics like politics, sports, economics.
▶ Each number v_ij in the matrix indicates how strongly related document i is to the topic represented by semantic dimension j.

Example of C = UΣV^T: All four matrices

C       d1  d2  d3  d4  d5  d6
ship     1   0   1   0   0   0
boat     0   1   0   0   0   0
ocean    1   1   0   0   0   0
wood     1   0   0   1   1   0
tree     0   0   0   1   0   1

=

U         1      2      3      4      5
ship    −0.44  −0.30   0.57   0.58   0.25
boat    −0.13  −0.33  −0.59   0.00   0.73
ocean   −0.48  −0.51  −0.37   0.00  −0.61
wood    −0.70   0.35   0.15  −0.58   0.16
tree    −0.26   0.65  −0.41   0.58  −0.09

×

Σ      1     2     3     4     5
1    2.16  0.00  0.00  0.00  0.00
2    0.00  1.59  0.00  0.00  0.00
3    0.00  0.00  1.28  0.00  0.00
4    0.00  0.00  0.00  1.00  0.00
5    0.00  0.00  0.00  0.00  0.39

×

V^T     d1     d2     d3     d4     d5     d6
1     −0.75  −0.28  −0.20  −0.45  −0.33  −0.12
2     −0.29  −0.53  −0.19   0.63   0.22   0.41
3      0.28  −0.75   0.45  −0.20   0.12  −0.33
4      0.00   0.00   0.58   0.00  −0.58   0.58
5     −0.53   0.29   0.63   0.19   0.41  −0.22

LSI is decomposition of C into a representation of the terms, a representation of the documents and a representation of the importance of the “semantic” dimensions.
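
To see that the three factors really multiply back to C, the product can be computed from the values printed on the slide. This is a minimal sketch in plain Python with hand-written helpers `matmul` and `diag` (illustrative names). Because the slide values are rounded to two decimals, the product matches C only approximately. In practice one would more likely obtain the factors from a library routine such as `numpy.linalg.svd`, which is not used here so that the example has no dependencies.

```python
# Factors as printed on the slide (rows of U: ship, boat, ocean, wood, tree).
U = [
    [-0.44, -0.30,  0.57,  0.58,  0.25],
    [-0.13, -0.33, -0.59,  0.00,  0.73],
    [-0.48, -0.51, -0.37,  0.00, -0.61],
    [-0.70,  0.35,  0.15, -0.58,  0.16],
    [-0.26,  0.65, -0.41,  0.58, -0.09],
]
SIGMA = [2.16, 1.59, 1.28, 1.00, 0.39]  # diagonal of Sigma
VT = [
    [-0.75, -0.28, -0.20, -0.45, -0.33, -0.12],
    [-0.29, -0.53, -0.19,  0.63,  0.22,  0.41],
    [ 0.28, -0.75,  0.45, -0.20,  0.12, -0.33],
    [ 0.00,  0.00,  0.58,  0.00, -0.58,  0.58],
    [-0.53,  0.29,  0.63,  0.19,  0.41, -0.22],
]
# Original term-document matrix (columns d1..d6).
C = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def diag(values):
    """Square diagonal matrix with the given diagonal."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)]
            for i in range(n)]


approx = matmul(matmul(U, diag(SIGMA)), VT)

# Largest entry-wise difference between the product and C.
max_err = max(abs(a - c)
              for row_a, row_c in zip(approx, C)
              for a, c in zip(row_a, row_c))

rounded = [[round(x) for x in row] for row in approx]
print(rounded == C, max_err < 0.05)
```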

LSI: Summary

▶ We’ve decomposed the term-document matrix C into a product of three matrices: UΣV^T.
▶ The term matrix U: consists of one (row) vector for each term.
▶ The document matrix V^T: consists of one (column) vector for each document.
▶ The singular value matrix Σ: diagonal matrix with singular values, reflecting importance of each dimension.
▶ Next: Why are we doing this?

Dimensionality reduction

How we use the SVD in LSI

▶ Key property: Each singular value tells us how important its dimension is.
▶ By setting less important dimensions to zero, we keep the important information, but get rid of the “details”.
▶ These details may
   ▶ be noise: the reduced LSI is a better representation because it is less noisy.
   ▶ make things dissimilar that should be similar: the reduced LSI is a better representation because it represents similarity better.
▶ Analogy for “fewer details is better”
   ▶ Image of a blue flower
   ▶ Image of a yellow flower
   ▶ Omitting color makes it easier to see the similarity.

Reducing the dimensionality to 2

U         1      2      3      4      5
ship    −0.44  −0.30   0.00   0.00   0.00
boat    −0.13  −0.33   0.00   0.00   0.00
ocean   −0.48  −0.51   0.00   0.00   0.00
wood    −0.70   0.35   0.00   0.00   0.00
tree    −0.26   0.65   0.00   0.00   0.00

Σ_2    1     2     3     4     5
1    2.16  0.00  0.00  0.00  0.00
2    0.00  1.59  0.00  0.00  0.00
3    0.00  0.00  0.00  0.00  0.00
4    0.00  0.00  0.00  0.00  0.00
5    0.00  0.00  0.00  0.00  0.00

V^T     d1     d2     d3     d4     d5     d6
1     −0.75  −0.28  −0.20  −0.45  −0.33  −0.12
2     −0.29  −0.53  −0.19   0.63   0.22   0.41
3      0.00   0.00   0.00   0.00   0.00   0.00
4      0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.00   0.00   0.00   0.00   0.00

Actually, we only zero out singular values in Σ. This has the effect of setting the corresponding dimensions in U and V^T to zero when computing the product C = UΣV^T.

Reducing the dimensionality to 2

C_2      d1     d2     d3     d4     d5     d6
ship    0.85   0.52   0.28   0.13   0.21  −0.08
boat    0.36   0.36   0.16  −0.20  −0.02  −0.18
ocean   1.01   0.72   0.36  −0.04   0.16  −0.21
wood    0.97   0.12   0.20   1.03   0.62   0.41
tree    0.12  −0.39  −0.08   0.90   0.41   0.49

=

U         1      2      3      4      5
ship    −0.44  −0.30   0.57   0.58   0.25
boat    −0.13  −0.33  −0.59   0.00   0.73
ocean   −0.48  −0.51  −0.37   0.00  −0.61
wood    −0.70   0.35   0.15  −0.58   0.16
tree    −0.26   0.65  −0.41   0.58  −0.09

×

Σ_2    1     2     3     4     5
1    2.16  0.00  0.00  0.00  0.00
2    0.00  1.59  0.00  0.00  0.00
3    0.00  0.00  0.00  0.00  0.00
4    0.00  0.00  0.00  0.00  0.00
5    0.00  0.00  0.00  0.00  0.00

×

V^T     d1     d2     d3     d4     d5     d6
1     −0.75  −0.28  −0.20  −0.45  −0.33  −0.12
2     −0.29  −0.53  −0.19   0.63   0.22   0.41
3      0.28  −0.75   0.45  −0.20   0.12  −0.33
4      0.00   0.00   0.58   0.00  −0.58   0.58
5     −0.53   0.29   0.63   0.19   0.41  −0.22
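
The rank-2 matrix C_2 can be recomputed from the factors on the slide in two equivalent ways: by zeroing all but the first two singular values, or by keeping only the first two columns of U and the first two rows of V^T. The following is a minimal sketch in plain Python. The names `matmul`, `diag`, `C2_SLIDE` and `k` are illustrative, and agreement with the slide is only up to two-decimal rounding.

```python
# Factors as printed on the slide (rows of U: ship, boat, ocean, wood, tree).
U = [
    [-0.44, -0.30,  0.57,  0.58,  0.25],
    [-0.13, -0.33, -0.59,  0.00,  0.73],
    [-0.48, -0.51, -0.37,  0.00, -0.61],
    [-0.70,  0.35,  0.15, -0.58,  0.16],
    [-0.26,  0.65, -0.41,  0.58, -0.09],
]
SIGMA = [2.16, 1.59, 1.28, 1.00, 0.39]  # diagonal of Sigma
VT = [
    [-0.75, -0.28, -0.20, -0.45, -0.33, -0.12],
    [-0.29, -0.53, -0.19,  0.63,  0.22,  0.41],
    [ 0.28, -0.75,  0.45, -0.20,  0.12, -0.33],
    [ 0.00,  0.00,  0.58,  0.00, -0.58,  0.58],
    [-0.53,  0.29,  0.63,  0.19,  0.41, -0.22],
]
# C_2 as printed on the slide, used only for comparison.
C2_SLIDE = [
    [0.85,  0.52,  0.28,  0.13,  0.21, -0.08],
    [0.36,  0.36,  0.16, -0.20, -0.02, -0.18],
    [1.01,  0.72,  0.36, -0.04,  0.16, -0.21],
    [0.97,  0.12,  0.20,  1.03,  0.62,  0.41],
    [0.12, -0.39, -0.08,  0.90,  0.41,  0.49],
]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def diag(values):
    """Square diagonal matrix with the given diagonal."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)]
            for i in range(n)]


def max_abs_diff(a, b):
    """Largest entry-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


k = 2  # number of semantic dimensions to keep

# Variant 1: zero out the singular values beyond the first k.
sigma_k = SIGMA[:k] + [0.0] * (len(SIGMA) - k)
C2_zeroed = matmul(matmul(U, diag(sigma_k)), VT)

# Variant 2: drop the unused columns of U and rows of V^T entirely.
U_k = [row[:k] for row in U]
C2_truncated = matmul(matmul(U_k, diag(SIGMA[:k])), VT[:k])

print(max_abs_diff(C2_zeroed, C2_truncated) < 1e-9)
print(max_abs_diff(C2_truncated, C2_SLIDE) < 0.02)
```

The second variant is the one usually stored in practice, since the dropped columns and rows never contribute to the product.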