compsci 514: algorithms for data science
Cameron Musco, University of Massachusetts Amherst. Fall 2019.
Lecture 14
logistics

• Midterm grades are on Moodle.
• Average was 32.67, median 33, standard deviation 6.8.
• Come to office hours if you would like to see your exam/discuss solutions.
summary

Last Few Weeks: Low-Rank Approximation and PCA
• Compress data that lies close to a k-dimensional subspace.
• Equivalent to finding a low-rank approximation of the data matrix X: X ≈ XVV^T.
• Optimal solution via PCA (eigendecomposition of X^T X or, equivalently, SVD of X).

This Class: Non-linear dimensionality reduction.
• How do we compress data that does not lie close to a k-dimensional subspace?
• Spectral methods (SVD and eigendecomposition) are still key techniques in this setting.
• Spectral graph theory, spectral clustering.
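To make the recap concrete, here is a minimal numpy sketch (the synthetic data matrix and variable names are my own illustrative choices, not from the lecture) of compressing data that lies close to a k-dimensional subspace via its top k right singular vectors, i.e., X ≈ XV_kV_k^T:

```python
import numpy as np

# Synthetic data lying close to a k-dimensional subspace (illustrative only).
n, d, k = 500, 50, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))  # exactly rank k
X += 0.01 * rng.normal(size=(n, d))                    # small off-subspace noise

# SVD of X; the columns of V_k are the top k right singular vectors,
# equivalently the top k eigenvectors of X^T X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                                         # d x k

# Near-optimal rank-k compression: X ~= X V_k V_k^T.
X_approx = X @ V_k @ V_k.T
print(np.linalg.norm(X - X_approx) / np.linalg.norm(X))  # small relative error
```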
entity embeddings

End of Last Class: Embedding objects other than vectors into Euclidean space.
• Documents (for topic-based search and classification)
• Words (to identify synonyms, translations, etc.)
• Nodes in a social network

Usual Approach: Convert each item into a high-dimensional feature vector and then apply low-rank approximation.
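As a rough sketch of this usual approach (the toy corpus, the CountVectorizer/TruncatedSVD pipeline, and k = 2 are illustrative assumptions, not the lecture's):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus (illustrative only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]

# Step 1: convert each item (document) into a high-dimensional feature vector.
X = CountVectorizer().fit_transform(docs)   # n_docs x n_words count matrix

# Step 2: apply low-rank approximation to get k-dimensional embeddings.
k = 2
Y = TruncatedSVD(n_components=k, random_state=0).fit_transform(X)  # n_docs x k

# Documents about similar topics tend to have similar (high dot product) embeddings.
print(np.round(Y @ Y.T, 2))
```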
example: latent semantic analysis

Setup: approximate the (document × word) matrix X with a rank-k factorization X ≈ YZ^T, where row y⃗_i of Y gives a k-dimensional embedding of document i and row z⃗_a of Z gives a k-dimensional embedding of word a.
• If the error ∥X − YZ^T∥_F is small, then on average, X_{i,a} ≈ (YZ^T)_{i,a} = ⟨y⃗_i, z⃗_a⟩.
• I.e., ⟨y⃗_i, z⃗_a⟩ ≈ 1 when doc i contains word a.
• If doc i and doc j both contain word a, ⟨y⃗_i, z⃗_a⟩ ≈ ⟨y⃗_j, z⃗_a⟩ = 1.
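Here is a minimal numpy sketch of these bullets, assuming a tiny made-up binary doc-word matrix (not from the lecture): factor X ≈ YZ^T via a truncated SVD and check that ⟨y⃗_i, z⃗_a⟩ roughly tracks whether doc i contains word a.

```python
import numpy as np

# Toy binary doc-word matrix: X[i, a] = 1 if doc i contains word a (illustrative).
X = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Y = U[:, :k]             # doc embeddings y_i as rows (n x k)
Z = Vt[:k].T * S[:k]     # word embeddings z_a as rows (d x k), Z = V_k Sigma_k

# <y_i, z_a> approximates X[i, a]: larger when doc i contains word a.
print(np.round(Y @ Z.T, 2))
```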
If doc i and doc j both contain word a, ⟨y⃗_i, z⃗_a⟩ ≈ ⟨y⃗_j, z⃗_a⟩ = 1.

Another View: Each column of Y represents a 'topic'. y⃗_i(j) indicates how much doc i belongs to topic j, and z⃗_a(j) indicates how much word a associates with that topic.
• Just like with documents, z⃗_a and z⃗_b will tend to have high dot product if word a and word b appear in many of the same documents.
• In an SVD-based solution we set Z = V_k Σ_k (equivalently, Z^T = Σ_k V_k^T).
• The columns of V_k are equivalently the top k eigenvectors of X^T X. The eigendecomposition of X^T X is X^T X = V Σ^2 V^T.
• What is the best rank-k approximation of X^T X? I.e., arg min_{rank-k B} ∥X^T X − B∥_F.
• X^T X ≈ V_k Σ_k^2 V_k^T = ZZ^T, i.e., ZZ^T is the best rank-k approximation of X^T X.
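A quick numpy check of the last two bullets, using an arbitrary toy X of my own (not the lecture's data): with Z = V_k Σ_k, the product ZZ^T coincides with the best rank-k approximation V_k Σ_k^2 V_k^T of X^T X.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 10))     # toy doc-word matrix (illustrative)
k = 3

U, S, Vt = np.linalg.svd(X, full_matrices=False)
Z = Vt[:k].T * S[:k]              # word embeddings, Z = V_k Sigma_k

# Best rank-k approximation of X^T X = V Sigma^2 V^T is V_k Sigma_k^2 V_k^T.
best_rank_k = (Vt[:k].T * S[:k]**2) @ Vt[:k]

# ZZ^T equals that best rank-k approximation.
print(np.allclose(Z @ Z.T, best_rank_k))  # True
```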
example: word embedding

LSA gives a way of embedding words into k-dimensional space.
• Embedding is via low-rank approximation of X^T X, where (X^T X)_{a,b} is the number of documents that both word a and word b appear in.
• Think about X^T X as a similarity matrix (Gram matrix, kernel matrix) with entry (a, b) being the similarity between word a and word b.
• Many ways to measure similarity: the number of sentences both words occur in, the number of times both appear in the same window of w words, whether they appear in similar positions of documents in different languages, etc.
• Replacing X^T X with these different similarity metrics (sometimes appropriately transformed) leads to popular word embedding algorithms: word2vec, GloVe, fastText, etc.
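A rough sketch of this recipe, not an implementation of word2vec or GloVe (the toy corpus, window size w, and log transform are my own illustrative choices): build a word-word co-occurrence matrix over windows of w words and embed each word using the top-k singular vectors of that (transformed) similarity matrix.

```python
import numpy as np
from itertools import combinations

# Toy corpus and window-based co-occurrence counts (illustrative only).
sentences = [
    "the cat chased the dog".split(),
    "the dog chased the cat".split(),
    "stocks and markets fell".split(),
    "markets and stocks rallied".split(),
]
vocab = sorted({word for s in sentences for word in s})
idx = {word: i for i, word in enumerate(vocab)}
w = 2  # window size (illustrative choice)

C = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i, j in combinations(range(len(s)), 2):
        if j - i <= w:  # words within w positions of each other co-occur
            a, b = idx[s[i]], idx[s[j]]
            C[a, b] += 1
            C[b, a] += 1

# Embed via the top-k singular vectors of the (log-transformed) similarity matrix.
k = 2
U, S, _ = np.linalg.svd(np.log1p(C))
E = U[:, :k] * np.sqrt(S[:k])  # one k-dimensional embedding per word (rows)

print(vocab)
print(np.round(E @ E.T, 2))  # related words tend to have larger dot products
```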