TopicView: Visually Comparing Topic Models of Text Collections

  1. TopicView: Visually Comparing Topic Models of Text Collections
     November 7, 2011
     Patricia Crossno, Andrew Wilson, Timothy Shead, Daniel Dunlavy
     Sandia National Laboratories
     Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

  2. Modeling Text Data
     • Latent Semantic Analysis (LSA) vs. Latent Dirichlet Allocation (LDA)
     • Similarities
       – Bag-of-words modeling
       – Transform text to term-document frequency matrices
       – User-defined number of dimensions
       – Produce weighted term lists for each concept/topic
       – Produce topic weights for each document
       – Results used to compute document relationship measures
     • Differences
       – LSA: truncated singular value decomposition (SVD) -> correlations (-1 to 1)
       – LDA: Bayesian model -> probabilities (0 to 1)
       – Output quantities have different ranges and meanings
     • Direct numeric comparison not meaningful
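
The bag-of-words step shared by both models can be sketched as follows. This is a minimal illustration, not the TopicView or Titan implementation: the function name `term_document_matrix`, the whitespace tokenizer, and the three toy documents are assumptions made for the example.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    # Bag-of-words: word order is discarded, only per-document counts are kept.
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

# Toy documents (hypothetical, for illustration only).
docs = [
    "pinochet arrest in london",
    "arrest protest in chile",
    "dance hall fire",
]
vocab, matrix = term_document_matrix(docs)

for term, row in zip(vocab, matrix):
    print(f"{term:10s} {row}")
```

Either model would then factor a matrix of this shape into a user-chosen number of concepts or topics; the factorization itself is not shown here.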

  3. Comparing LSA and LDA
     • Focus on how models are used in applications
     • Conceptual content
       – Topic models
       – Labels
     • Document relationships
       – Scatter plots
       – Graphs
       – Landscapes
     • TopicView application
       – Visually compare and interactively explore models
       – Tabbed panels (Conceptual Content & Document Relationships)
       – Linked views
       – Built using Titan Informatics Toolkit

  4. Term Topic Table: Detailed Conceptual Similarity

  5. Bipartite Graph: High-level Conceptual Similarity

  6. LDA Topics, LSA Concepts

  7. Linked Selection: Selected Topics, Selected Concepts (Green = Selected Edges)

  8. Edge Display Controls: All Edges, High Weight Edges
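
A weighted bipartite graph between LSA concepts and LDA topics, with an "all edges" versus "high weight edges" filter, might be sketched as below. The slides do not state how edge weights are computed; cosine similarity between term-weight vectors is an assumption here, as are the function names, the threshold value, and the toy weights.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bipartite_edges(concepts, topics, min_weight=-1.0):
    """Return (concept_index, topic_index, weight) for pairs with weight >= min_weight."""
    edges = []
    for i, concept in enumerate(concepts):
        for j, topic in enumerate(topics):
            weight = cosine(concept, topic)
            if weight >= min_weight:
                edges.append((i, j, weight))
    return edges

# Hypothetical term weights over a shared 3-term vocabulary.
# LSA weights may be negative; LDA weights are probabilities.
lsa_concepts = [[0.9, 0.1, -0.2], [-0.1, 0.8, 0.6]]
lda_topics = [[0.7, 0.2, 0.1], [0.05, 0.5, 0.45]]

all_edges = bipartite_edges(lsa_concepts, lda_topics)                     # "All Edges"
strong_edges = bipartite_edges(lsa_concepts, lda_topics, min_weight=0.8)  # "High Weight Edges"
```

Cosine similarity depends only on vector direction, which is one way to sidestep the different output ranges of the two models; other measures would be equally plausible.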

  9. Document Relationship Graphs: LSA Document Similarity Graph, LDA Document Similarity Graph

  10. Document Topic Table: Document-Topic Weights

  11. Document Full Text Reader

  12. Alphabet Data Case Study
      Synthetic data for verification
      – 26 clusters (one per letter), 10 documents each
      – Each document contains only words starting with a single letter
        • absorbent autonomic appeals anthology aristocrats …
        • bacquire bairbags baiming babomination battorney bafter …
        • cadvisory cassumption cappears camount canthropology
        • …
      – Each algorithm given concept/topic count of 26
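
A corpus with this structure could be generated roughly as follows. The sample words on the slide suggest that each cluster reuses words beginning with "a" and prepends the cluster letter (for example "bacquire", "cadvisory"); that reading, the function name, the five-word base vocabulary, and the `words_per_doc` parameter are assumptions, not details given in the slides.

```python
import random
import string

def make_alphabet_corpus(base_words, docs_per_cluster=10, words_per_doc=20, seed=0):
    """Build 26 clusters of documents; every word in cluster X starts with letter X."""
    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    corpus = {}
    for letter in string.ascii_lowercase:
        # Base words already start with "a", so cluster "a" needs no prefix.
        prefix = "" if letter == "a" else letter
        corpus[letter] = [
            " ".join(prefix + rng.choice(base_words) for _ in range(words_per_doc))
            for _ in range(docs_per_cluster)
        ]
    return corpus

# Base vocabulary taken from the first example line on the slide.
base_words = ["absorbent", "autonomic", "appeals", "anthology", "aristocrats"]
corpus = make_alphabet_corpus(base_words)

print(corpus["b"][0][:60])
```

Because no term is shared between clusters, a model given 26 concepts or topics has an unambiguous ground truth to recover, which is what makes this data useful for verification.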

  13. Alphabet Topic Similarity

  14. Term/Topic Comparison (figure labels: L, F, L?, F?)

  15. Document-Topic Weights (figure labels: L, F, L?, F?)

  16. Clustering Evaluation

  17. DUC Data Case Study
      Document Understanding Conference (DUC) data (real world)
      – 30 clusters, ~10 documents each
      – Human categorized around particular topic/event
      – Associated Press articles
      – New York Times articles
      – Each algorithm given concept/topic count of 30

  18. DUC Topic Similarity

  19. LSA Combines Topics
      • Doc 121 connects Chile, Spanish, Fire
      • Doc 87 connects Pinochet & Timor
      • Figure labels: Pinochet Arrest, Dance Hall Fire, Timor Unrest

  20. LDA Combines Topics
      • Figure labels: Iranian Elections, Bosnian Tribunal

  21. DUC Document Relationships

  22. LDA Unexpected Connections
      • Pinochet’s Arrest (44)
      • Bosnian War Crimes Tribunal (36)
      • Cold Weather Deaths (37)
      • Palestinian Airport Closing (34)

  23. Documents more strongly connected to Topic 30 than conceptual topics

  24. Topic 30 - AP wire source

  25. Bridging documents: conceptual content outweighed by source content

  26. LDA rerun without header tags
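
Removing source header material before rerunning a model might look like the sketch below. The slides do not say which header tags were dropped or how the documents were marked up; the tag names in `HEADER_TAGS`, the SGML-style sample document, and the function name are all hypothetical.

```python
import re

# Hypothetical tag names; the slides do not list which header tags were removed.
HEADER_TAGS = ("DOCNO", "FILEID", "HEAD", "BYLINE")

def strip_header_elements(text, tags=HEADER_TAGS):
    """Drop header elements with their content, then remove remaining markup tags."""
    for tag in tags:
        pattern = rf"<{re.escape(tag)}>.*?</{re.escape(tag)}>"
        text = re.sub(pattern, " ", text, flags=re.DOTALL | re.IGNORECASE)
    # Remaining tags (e.g. <DOC>, <TEXT>) are removed but their content is kept.
    text = re.sub(r"<[^>]+>", " ", text)
    return " ".join(text.split())

# Made-up sample document for illustration.
raw = (
    "<DOC><DOCNO>AP-0001</DOCNO><HEAD>AP wire</HEAD>"
    "<TEXT>Pinochet was arrested in London.</TEXT></DOC>"
)
clean = strip_header_elements(raw)
print(clean)
```

Stripping such fields keeps source-identifying terms out of the term-document matrix, so a model has less opportunity to form a topic around the wire service rather than the story content.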

  27. Conclusions
      • LSA concepts provide good summarizations over broad document groups
      • LDA topics are focused on smaller groups
      • LDA’s limited groups and probabilistic mechanism provide better labeling
      • LSA’s document relationships do not include extraneous connections between disparate topics
      • Better graphs
      • Better labels
