Robust Spectral Inference for Joint Stochastic Matrix Factorization Kun Dong Cornell University October 20, 2016 K. Dong (Cornell University) Robust Spectral Inference for Joint Stochastic Matrix Factorization October 20, 2016 1 / 17

Introduction: Topic Modeling
• Idea: Represent each document as a combination of topics.
• Advantages:
  • Low-dimensional representation of documents
  • Uncovers hidden structure in large collections
• Applications:
  • Summarizing documents by their topics
  • Clustering documents by topic similarity

Joint Stochastic Matrix Factorization: Co-occurrence Matrix
• The relationships between words can be more revealing than the words themselves.
    $C \approx B A B^T$
• $C \in \mathbb{R}^{n \times n}$: word-word matrix, $C_{ij} = p(X_1 = i, X_2 = j)$
• $A \in \mathbb{R}^{k \times k}$: topic-topic matrix, $A_{k\ell} = p(Z_1 = k, Z_2 = \ell)$
• $B \in \mathbb{R}^{n \times k}$: word-topic matrix, $B_{ik} = p(X = i \mid Z = k)$
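
The factorization above can be made concrete with a small numerical sketch. The sizes (4 words, 2 topics) and the entries of `B` and `A` are hypothetical values chosen for illustration; they are not from the talk.

```python
import numpy as np

# Hypothetical toy model: n = 4 words, k = 2 topics.
# Each column of B is a distribution over words, p(X = i | Z = k).
B = np.array([[0.5, 0.0],
              [0.3, 0.1],
              [0.2, 0.4],
              [0.0, 0.5]])

# A is a joint distribution over topic pairs, p(Z1 = k, Z2 = l).
A = np.array([[0.4, 0.1],
              [0.1, 0.4]])

# Word-word co-occurrence implied by the model: p(X1 = i, X2 = j).
C = B @ A @ B.T

print(C.sum())   # total probability mass of the joint distribution
```

Because the columns of `B` and the entries of `A` each sum to 1, `C` is itself a joint distribution: symmetric, entrywise nonnegative, and of rank at most k.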

What We Observe

Anchor Word
• Separability: The word-topic matrix $B$ is $p$-separable if for each topic $k$ there is some word $i$ such that $B_{i,k} \geq p$ and $B_{i,\ell} = 0$ for all $\ell \neq k$.
• Every topic $k$ has an anchor word $i$ exclusive to it.
• Documents containing anchor word $i$ must contain topic $k$.
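
The separability condition can be checked directly on a known word-topic matrix. The following is a minimal sketch; the function names `anchor_words` and `is_separable` and the toy matrix `B` are hypothetical and only illustrate the definition.

```python
import numpy as np

def anchor_words(B, p):
    """For each topic k, list the words i with B[i, k] >= p and B[i, l] == 0 for l != k."""
    anchors = []
    for k in range(B.shape[1]):
        others = np.delete(B, k, axis=1)              # columns of all other topics
        ok = (B[:, k] >= p) & np.all(others == 0, axis=1)
        anchors.append(np.flatnonzero(ok).tolist())
    return anchors

def is_separable(B, p):
    """B is p-separable if every topic has at least one anchor word."""
    return all(len(a) > 0 for a in anchor_words(B, p))

# Hypothetical toy word-topic matrix (4 words, 2 topics).
B = np.array([[0.5, 0.0],
              [0.3, 0.1],
              [0.2, 0.4],
              [0.0, 0.5]])

print(anchor_words(B, 0.1))   # word 0 anchors topic 0, word 3 anchors topic 1
```

In practice `B` is unknown, so this check is only useful on synthetic data; the algorithm itself has to locate anchors from the co-occurrence matrix.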

Anchor Word Algorithm
• Under the separability assumption, Arora et al. (2013) showed that the Anchor Word Algorithm computes this decomposition in polynomial time.
• It applies QR with row-pivoting to $C$ after a random projection, choosing the points that are farthest away from each other.
• However, it fails to produce a doubly nonnegative topic-topic matrix.
• It tends to choose rare words as anchors and to generate less meaningful topics.

Probabilistic Structure
• We view the $m$-th document, which has $n_m$ words, as $n_m(n_m - 1)$ word pairs.
• Generate a distribution $A$ over pairs of topics with parameter $\alpha$.
• Sample two topics $(z_1, z_2) \sim A$.
• Sample the actual word pair $(x_1, x_2) \sim (B_{z_1}, B_{z_2})$.
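
The sampling steps above can be sketched as follows. This is an illustration under simplifying assumptions: `A` is held fixed rather than drawn with parameter α, and the function name `sample_word_pairs`, the toy matrices, and the sample size are all hypothetical.

```python
import numpy as np

def sample_word_pairs(A, B, num_pairs, rng):
    """Draw word pairs: (z1, z2) ~ A, then x1 ~ B[:, z1] and x2 ~ B[:, z2]."""
    n, k = B.shape
    flat = rng.choice(k * k, size=num_pairs, p=A.ravel())
    z1, z2 = np.divmod(flat, k)                 # undo the row-major flattening of A
    x1 = np.empty(num_pairs, dtype=int)
    x2 = np.empty(num_pairs, dtype=int)
    for t in range(k):
        m1, m2 = (z1 == t), (z2 == t)
        x1[m1] = rng.choice(n, size=m1.sum(), p=B[:, t])
        x2[m2] = rng.choice(n, size=m2.sum(), p=B[:, t])
    return x1, x2

# Hypothetical toy model (4 words, 2 topics).
B = np.array([[0.5, 0.0], [0.3, 0.1], [0.2, 0.4], [0.0, 0.5]])
A = np.array([[0.4, 0.1], [0.1, 0.4]])

rng = np.random.default_rng(0)
num_pairs = 200_000
x1, x2 = sample_word_pairs(A, B, num_pairs, rng)

# Empirical pair frequencies should approach B A B^T as num_pairs grows.
empirical = np.zeros((4, 4))
np.add.at(empirical, (x1, x2), 1.0)
empirical /= num_pairs
max_gap = np.abs(empirical - B @ A @ B.T).max()
```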

Statistical Structure
• Let $f(\alpha)$ be a distribution over topic distributions.
• Documents are $M$ i.i.d. samples $\{W_1, \cdots, W_M\} \sim f(\alpha)$.
• Let the posterior topic-topic matrix be $A^*_M = \frac{1}{M} \sum_{m=1}^{M} W_m W_m^T$ and the expectation be $A^* = E[W_m W_m^T]$. Then $A^*_M \to A^*$ as $M \to \infty$.
• Let the posterior word-word matrix be $C^*_m = B W_m W_m^T B^T$ and $C^* = \frac{1}{M} \sum_{m=1}^{M} C^*_m$.
• Let $C$ be the noisy observation for all samples.
    $C \to E[C] = C^* = B A^*_M B^T \to B A^* B^T$
• $A^*_M, A^* \in \mathrm{DNN}_K$ and $C^* \in \mathrm{DNN}_N$.

Generating Co-occurrence C
• Let $H_m$ be the vector of word counts for the $m$-th document and $W_m$ be its latent topic distribution.
• Let $p_m = B W_m$, and assume $H_m \sim \mathrm{Multi}(n_m, p_m)$.
• $E[H_m] = n_m p_m = n_m B W_m$ and $\mathrm{Cov}(H_m) = n_m (\mathrm{diag}(p_m) - p_m p_m^T)$.
• Let the co-occurrence be $C_m = \frac{H_m H_m^T - \mathrm{diag}(H_m)}{n_m (n_m - 1)}$.
• $E[C_m \mid W_m] = C^*_m$, so $E[C \mid W] = C^*$.
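
The construction of $C_m$ from word counts can be sketched in a few lines. The count vectors in `H` are hypothetical, and averaging the per-document matrices with equal weight is an assumption made to mirror the definition of $C^*$ as a mean over documents.

```python
import numpy as np

def doc_cooccurrence(h):
    """C_m = (H_m H_m^T - diag(H_m)) / (n_m (n_m - 1)) for one count vector."""
    h = np.asarray(h, dtype=float)
    n_m = h.sum()                       # document length; must be at least 2
    return (np.outer(h, h) - np.diag(h)) / (n_m * (n_m - 1))

# Hypothetical word counts: 3 documents over a 4-word vocabulary.
H = np.array([[2, 1, 0, 1],
              [0, 3, 2, 0],
              [1, 1, 1, 1]])

# Equal-weight average over documents (an assumption of this sketch).
C = np.mean([doc_cooccurrence(h) for h in H], axis=0)
```

Subtracting $\mathrm{diag}(H_m)$ removes the pairing of each word token with itself, which is why every $C_m$ sums to exactly 1: the numerator totals $n_m^2 - n_m$.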

Rectifying Co-occurrence C
• In reality $C$ can still mismatch $C^*$ because of violated model assumptions and limited data.
• We can rectify $C$ to be low-rank, doubly non-negative, and joint-stochastic by Alternating Projection (Dykstra's Algorithm), using the projections:
    $\mathrm{PSD}_{NK}(C) = U \Lambda_K^{+} U^T$
    $\mathrm{NOR}_N(C) = C + \frac{1 - \sum_{i,j} C_{ij}}{N^2} \mathbf{1}\mathbf{1}^T$
    $\mathrm{NN}_N(C) = \max\{C, 0\}$
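
The three projections can be sketched with NumPy as below. Note the assumptions: this cycles through the projections in plain alternating fashion without the correction terms of Dykstra's algorithm, the order (PSD, NOR, NN) and the fixed iteration count are choices made for this sketch, and all function names are hypothetical.

```python
import numpy as np

def proj_psd_rank_k(C, K):
    """Keep the K largest eigenvalues, clipped at zero: U Lambda_K^+ U^T."""
    w, U = np.linalg.eigh((C + C.T) / 2)       # eigenvalues in ascending order
    w_top = np.clip(w[-K:], 0.0, None)
    U_top = U[:, -K:]
    return (U_top * w_top) @ U_top.T

def proj_normalized(C):
    """Shift all entries equally so that they sum to 1."""
    N = C.shape[0]
    return C + (1.0 - C.sum()) / N**2

def proj_nonneg(C):
    """Clip negative entries to zero."""
    return np.maximum(C, 0.0)

def rectify(C, K, iters=50):
    """Plain cyclic alternating projection (no Dykstra correction terms)."""
    for _ in range(iters):
        C = proj_nonneg(proj_normalized(proj_psd_rank_k(C, K)))
    return C

# Hypothetical toy model: an exact C, and a symmetric noisy copy of it.
B = np.array([[0.5, 0.0], [0.3, 0.1], [0.2, 0.4], [0.0, 0.5]])
A = np.array([[0.4, 0.1], [0.1, 0.4]])
C_exact = B @ A @ B.T
noise = np.random.default_rng(0).normal(scale=0.01, size=(4, 4))
C_noisy = C_exact + (noise + noise.T) / 2

R_exact = rectify(C_exact, K=2)
R_noisy = rectify(C_noisy, K=2)
```

An exact $C = B A B^T$ already satisfies all three constraints, so it is left unchanged up to rounding. Because the nonnegativity projection is applied last, the noisy result is nonnegative but may satisfy the other two constraints only approximately.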

Finding Anchor Words
• Use a column-pivoting QR algorithm to greedily find topics farthest away from each other.
• Exploit sparsity and avoid using random projection.
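
The greedy selection can be illustrated with a pivoted Gram-Schmidt loop on the row-normalized co-occurrence matrix, which makes the same pivot choices as column-pivoted QR on its transpose. This sketch uses dense NumPy arrays and does not exploit sparsity; the function name `find_anchors` and the toy model are hypothetical, and every word is assumed to have a nonzero row sum.

```python
import numpy as np

def find_anchors(C, K):
    """Greedily pick K rows of the row-normalized C that are farthest apart."""
    Cbar = C / C.sum(axis=1, keepdims=True)    # each row becomes p(w2 | w1 = i)
    R = Cbar.copy()
    anchors = []
    for _ in range(K):
        norms = np.einsum("ij,ij->i", R, R)    # squared norm of each residual row
        s = int(np.argmax(norms))
        anchors.append(s)
        q = R[s] / np.sqrt(norms[s])           # unit vector along the chosen row
        R = R - np.outer(R @ q, q)             # project that direction out of all rows
    return anchors

# Hypothetical separable toy model: word 0 anchors topic 0, word 3 anchors topic 1.
B = np.array([[0.5, 0.0], [0.3, 0.1], [0.2, 0.4], [0.0, 0.5]])
A = np.array([[0.4, 0.1], [0.1, 0.4]])
C = B @ A @ B.T

anchors = find_anchors(C, K=2)
```

On exact separable data, every row of the normalized matrix is a convex combination of the anchor rows, so the largest remaining norm is always attained at an anchor.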

Recovering Word-Topic Matrix B
• If we row-normalize $C$ to get $\bar{C}$, then $\bar{C}_{ij} = p(w_2 = j \mid w_1 = i)$.
• Under the separability assumption, for the anchor word $s_k$ of topic $k$:
    $\bar{C}_{s_k, j} = \sum_{k'} p(z_1 = k' \mid w_1 = s_k) \, p(w_2 = j \mid z_1 = k') = p(w_2 = j \mid z_1 = k)$
• Every row of $\bar{C}$ lies in the convex hull of the anchor rows $\bar{C}_{s_k}$:
    $\bar{C}_{ij} = \sum_{k} p(z_1 = k \mid w_1 = i) \, p(w_2 = j \mid z_1 = k) = \sum_{k} Q_{ik} \bar{C}_{s_k, j}$
• Find $Q_{ik}$ through NNLS and infer $B_{ik}$ with Bayes' rule.
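
The last two steps can be sketched with `scipy.optimize.nnls`, solving one nonnegative least-squares problem per word and then applying Bayes' rule, $B_{ik} \propto Q_{ik} \, p(w = i)$, with $p(w = i)$ taken from the row sums of $C$. The function name `recover_B`, the toy model, and the assumption that the anchors are already known are all illustrative.

```python
import numpy as np
from scipy.optimize import nnls

def recover_B(C, anchors):
    """Recover the word-topic matrix from C and one anchor word per topic."""
    p_w = C.sum(axis=1)                        # marginal word probabilities p(w = i)
    Cbar = C / p_w[:, None]                    # rows are p(w2 | w1 = i)
    S = Cbar[anchors]                          # anchor rows, shape (K, N)
    # Q[i, k] ~ p(z = k | w = i): nonnegative weights expressing row i via anchor rows.
    Q = np.vstack([nnls(S.T, row)[0] for row in Cbar])
    # Bayes' rule: p(w = i | z = k) is proportional to p(z = k | w = i) p(w = i).
    B = Q * p_w[:, None]
    return B / B.sum(axis=0, keepdims=True)

# Hypothetical separable toy model with anchors at words 0 and 3.
B_true = np.array([[0.5, 0.0], [0.3, 0.1], [0.2, 0.4], [0.0, 0.5]])
A = np.array([[0.4, 0.1], [0.1, 0.4]])
C = B_true @ A @ B_true.T

B_hat = recover_B(C, anchors=[0, 3])
```

On this noise-free example the recovered columns match `B_true`, with topic order following the order of the anchors. With real data the fit is only approximate, and a solver that also enforces that each row of `Q` sums to 1 may be preferable.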

Example of Recovered Topics

Recovering Topic-Topic Matrix A

Conclusion
• The algorithm handles noisy co-occurrence data through rectification.
• It produces quality anchor words and topics, even when the sample size is small.
• It preserves the structure of the decomposition under our assumptions.

Citation
Sanjeev Arora, Rong Ge, Yonatan Halpern, David M. Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. A practical algorithm for topic modeling with provable guarantees. 2013.
Moontae Lee, David Bindel, and David Mimno. Robust spectral inference for joint stochastic matrix factorization. In Advances in Neural Information Processing Systems, pages 2710–2718, 2015.

Thank you!