
Spectral Methods for Analyzing Large Data using Reweighted Topic Modeling - PowerPoint PPT Presentation



  1. Spectral Methods for Analyzing Large Data using Reweighted Topic Modeling Blake Hunter ⋆ Jason Bello, Brian de Silva, Arjuna Flenner † , Jerry Luo Daniel Bernstein ‡ , Yang Hu ‡ , Anna Ma ‡ , Paul Sharkey ‡ ⋆ UCLA Applied Math UCLA Applied Math REU 2013 † China Lake Naval Research Lab and CGU ‡ IPAM RIPS 2013 February 5, 2014 Hunter (UCLA, Applied Math) Reweighted Topic Modeling February 5, 2014 1 / 35

  2. Data

  3. Data Mining
Data mining: extracting knowledge from a dataset, with the goal of transforming it into an understandable and usable structure for future use.
Search and Summarization ◮ Topic Modeling ◮ LDA and NMF ◮ Ranking and PCA ◮ Multiple modalities and data fusion
Clustering and Classification ◮ Spectral Clustering ◮ Diffusion Maps · · ·

  4. Applications
Imaging ◮ Naval Research ◮ Hyperspectral ◮ Medical - MRI, fMRI, PET, EEG
Text Mining ◮ Medical Reports, Exams, Analysis ◮ Large Documents ◮ Classified Documents ◮ Twitter ◮ Emerging Topics
Networks and Social Networks ◮ Community Detection ◮ Twitter ◮ Gang Networks · · ·

  5. Sidewinder Documents
Figure: Original Sidewinder Document | Converted from Image to Text
Thousands of Sidewinder documents from the Navy.

  6. Classification of Sidewinder Documents
Tens of thousands of Sidewinder documents from the Navy
Certain documents can be declassified
The problem is unsupervised
Content-based search ◮ Searching with an entire document as the query, to find documents with similar content ◮ More useful than keyword search for an unsupervised problem
Limitations of current search

  7. Graphs
Data points x_i are represented by nodes in an undirected graph. Similarity is encoded in edge weights w_ij.
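The edge-weight construction on this slide can be sketched in Python. The Gaussian kernel and the `sigma` scale below are common choices for turning distances into similarities, not something the slide specifies:

```python
import numpy as np

def similarity_graph(X, sigma=1.0):
    """Build the weight matrix of an undirected similarity graph.

    X: (n, d) array, one data point x_i per row.
    Returns W with w_ij = exp(-||x_i - x_j||^2 / sigma), a common
    Gaussian-kernel choice (sigma is a hypothetical scale parameter).
    """
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared pairwise distances
    W = np.exp(-np.maximum(d2, 0) / sigma)         # clamp tiny negatives from roundoff
    np.fill_diagonal(W, 0)                         # no self-loops
    return W
```

W is symmetric by construction, which matches the undirected graph in the slide.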

  8. Test Documents
40 test documents

  9. Using 'keyword search' on a document
Table: Search Results for Cincinnati Reds Recap 7/5

Document                    Date
Cincinnati Reds Recap       7/5
Cincinnati Reds Recap       6/23
Cincinnati Reds Recap       6/23
Toronto Blue Jays Recap     6/31
Toronto Blue Jays Recap     7/3
Minnesota Twins Recap       7/25
Cincinnati Reds Recap       8/13
Minnesota Twins Recap       7/30
Toronto Blue Jays Recap     7/23

  10. Converting a Corpus of Documents into a Matrix
Bag-of-Words ◮ Removes the most common words, e.g. "the", "and", "because" ◮ Produces a histogram vector for each document, where each entry is the count of a specific word
Term Frequency - Inverse Document Frequency (TF-IDF) (more popular) ◮ Diminishes the weight of words that occur frequently throughout the corpus and adds weight to those that occur rarely
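A minimal TF-IDF sketch of the weighting described above. The stop-word list and the plain tf · log(N/df) formula are illustrative simplifications; production systems add smoothing and normalization:

```python
import math
from collections import Counter

STOP_WORDS = {"the", "and", "because", "a", "of"}  # tiny illustrative list

def tfidf(corpus):
    """corpus: list of token lists, one per document.

    Returns a list of {word: tf-idf weight} dicts using the common
    tf * log(N / df) scheme: words appearing in every document get
    weight 0, rare words get boosted.
    """
    docs = [[w for w in doc if w not in STOP_WORDS] for doc in corpus]
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequencies
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return out
```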

  11. Histogram Matrix
Rows are indexed by words, columns by documents:
X = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}


  14. Topic Modeling
Topic modeling attempts to uncover the hidden thematic structure in sets of documents, images and other data.
Doc_i = h_{i1} × Word_1 + h_{i2} × Word_2 + · · ·
Doc_i = v_{i1} × Topic_1 + v_{i2} × Topic_2 + · · ·
Topic_i = u_{i1} × Word_1 + u_{i2} × Word_2 + · · ·

  15. Topic Modeling Methods
Latent Dirichlet Allocation (LDA)¹ (computationally expensive)
Nonnegative Matrix Factorization (NMF)²
◮ OMP - Orthogonal Matching Pursuit (Lozano, Swirszcz, and Abe)
◮ LSAS - Alternating least squares using active sets (Kim and Park)
◮ AM - Alternating multiplicative update (Lee and Seung)
◮ ℓ1 - Convex model for NMF (Esser, Moller, Osher and Sapiro)
¹ David Blei, Andrew Ng, and Michael Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
² Daniel D. Lee and H. Sebastian Seung. "Algorithms for Non-negative Matrix Factorization." Advances in Neural Information Processing Systems 13 (2001): 556-562.

  16. Poisson Factor Analysis ᵃ
Assume that each histogram bin X_dw satisfies
X_{dw} \sim \mathrm{Pois}\Big(\sum_{k=1}^{K} \lambda_{dk}\,\psi_{kw}\Big), \qquad \sum_w \psi_{kw} = 1, \quad \lambda_{dk} \ge 0, \ \psi_{kw} \ge 0.
Equivalently, X \sim \mathrm{Pois}(\Lambda \Psi), where \|\psi_k\|_1 = 1, \lambda_{dk} \ge 0, \psi_{kw} \ge 0, and \mathrm{Pois}(\Lambda \Psi) is interpreted componentwise.
ᵃ M. Zhou and L. Carin. "Beta-Negative Binomial Process and Poisson Factor Analysis." 2012.
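The generative model on this slide can be simulated directly. The sizes D, W, K and the Gamma draw for λ are hypothetical choices for illustration; only the Dirichlet rows (∑_w ψ_kw = 1) and the componentwise Poisson are dictated by the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (hypothetical): D documents, W words, K topics.
D, W, K = 5, 20, 3

Lam = rng.gamma(1.0, 1.0, size=(D, K))    # lambda_dk >= 0 (Gamma draw is an assumption)
Psi = rng.dirichlet(np.ones(W), size=K)   # each row psi_k sums to 1, psi_kw >= 0

X = rng.poisson(Lam @ Psi)                # X_dw ~ Pois(sum_k lambda_dk psi_kw), componentwise
```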

  17. Nonnegative Matrix Factorization
\min_{U,V} \| X - U V^T \| \quad \text{where } U = [U]_+, \ V = [V]_+.
X (words × documents) ≈ U (words × topics) · V^T (topics × documents).

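The objective above can be minimized with the Lee-Seung multiplicative updates listed on the methods slide. This is a minimal sketch; the iteration count, random initialization, and the small `eps` guard are arbitrary choices, not the authors' settings:

```python
import numpy as np

def nmf(X, k, iters=200, eps=1e-9):
    """Lee-Seung multiplicative updates for min ||X - U V^T||_F.

    X: (m, n) nonnegative matrix (words x documents).
    Returns U (m, k) and V (n, k), both elementwise nonnegative.
    """
    rng = np.random.default_rng(0)
    m, n = X.shape
    U = rng.random((m, k)) + eps
    V = rng.random((n, k)) + eps
    for _ in range(iters):
        U *= (X @ V) / (U @ (V.T @ V) + eps)    # update keeps U >= 0
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)  # update keeps V >= 0
    return U, V
```

Because the updates only multiply by nonnegative ratios, the constraints U = [U]₊, V = [V]₊ hold automatically at every step.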

  19. Similarity Measures
Euclidean-based similarity ◮ Let u^{(i)} and v^{(j)} denote different histogram vectors in the corpus:
1 - \frac{\| u^{(i)} - v^{(j)} \|}{\max_j \| u^{(i)} - v^{(j)} \|}
Cosine similarity ◮ For u, v ∈ ℝⁿ:
\cos \theta = \frac{u \cdot v}{\| u \| \, \| v \|}
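Both measures take only a few lines. In the Euclidean version, `dmax` stands in for the max_j normalizer over the corpus and must be supplied by the caller; the function names are illustrative:

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (||u|| ||v||)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_similarity(u, v, dmax):
    """1 - ||u - v|| / dmax, where dmax is the largest distance
    from u to any histogram in the corpus (precomputed by the caller)."""
    return 1.0 - np.linalg.norm(u - v) / dmax
```

Identical vectors score 1 under both measures; orthogonal vectors score 0 under cosine similarity.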

  20. Example Topics

Topic 1    Topic 2      Topic 5    Topic 6    Topic 3      Topic 4
blue       reds         heart      truth      twins        invasion
jays       hit          blood      nature     runs         allied
toronto    second       vein       reason     innings      june
hit        season       veins      god        game         german
second     cincinnati   artery     will       minnesota    troops
time       pirates      arteries   objects    inning       normandy
runs       innings      motion     thought    three        british
good       three        cavity     men        third        landing
three      games        small      place      start        france
single     game         body       opinions   hit          beaches

  21. Documents as Linear Combinations of Topics
40 test documents
Dark squares indicate a strong presence of a given topic in a document

  22. Content-Based Search Results

Histogram Similarity            Topic Similarity
Document        Similarity      Document        Similarity
cinc3.txt       1.0000          cinc3.txt       1.0000
cinc2.txt       0.7772          cinc9.txt       0.9991
cinc4.txt       0.7729          cinc1.txt       0.9981
cinc9.txt       0.7569          cinc10.txt      0.9980
minn9.txt       0.7470          cinc4.txt       0.9972
toronto7.txt    0.7468          cinc2.txt       0.9970
cinc7.txt       0.7428          cinc5.txt       0.9959
minn7.txt       0.7419          cinc7.txt       0.9940
dotm14.txt      0.7406          cinc8.txt       0.9929
cinc6.txt       0.7367          cinc6.txt       0.9905
WW2 8.txt       0.7361          WW2 8.txt       0.9001
minn2.txt       0.7358          dotm18.txt      0.8996
toronto5.txt    0.7300          toronto1.txt    0.8993

  23. Disadvantage of Topic Search
Searching on the similarity of each document's topic vector does well at finding documents with similar topic compositions, but it does not take into account the difference or similarity between the distinct topics themselves.

Table: Topic Search
Document          Similarity Index to 'dotm5.txt'
'dotm5.txt'       1
'dotm2.txt'       0.9997
...               ...
'dotm12.txt'      0.9664
'WW2 8.txt'       0.8924
...               ...
'toronto4.txt'    0.7027
'dotm11.txt'      0.7009
...               ...
'dotm14.txt'      0.0212

  24. Reweighted Topic Modeling
Topic Affinity Weighting: define A, the affinity topic matrix of U, by
A_{ij} = e^{-\|U_i - U_j\|^2 / \sigma}
A V^T is a reweighting of V^T using similarity between topics.
Gram Matrix Weighting: define G, the Gram topic matrix of U, by
G_{ij} = \langle U_i, U_j \rangle
For the dot product, G = U^T U, and G V^T is a reweighting of V^T using orthogonality between topics.
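A sketch of both reweightings, assuming topics are the columns of U as in the NMF slides (`sigma` is a hypothetical scale parameter; the function name is illustrative):

```python
import numpy as np

def reweight_topics(U, V, sigma=1.0):
    """Reweight the topic-document coefficients V^T, as in X ~ U V^T.

    U: (words x topics), V: (documents x topics).
    A_ij = exp(-||U_i - U_j||^2 / sigma) compares topic columns U_i, U_j;
    G = U^T U is the Gram matrix of the topics.
    Returns (A @ V.T, G @ V.T), both of shape (topics x documents).
    """
    Ut = U.T                                              # topics as rows
    sq = np.sum(Ut**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Ut @ Ut.T, 0)
    A = np.exp(-d2 / sigma)                               # topic affinity matrix
    G = U.T @ U                                           # topic Gram matrix
    return A @ V.T, G @ V.T
```

With perfectly orthonormal topics, G is the identity and the Gram reweighting leaves V^T unchanged; correlated topics blend each other's coefficients.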
