707.009 Foundations of Knowledge Management: "Topic Modeling". Markus Strohmaier, Univ. Ass. / Assistant Professor, Knowledge Management Institute, Graz University of Technology, Austria


  2. Acknowledgements
  Course slides are in part based on the following slide decks and papers:
  • "Probabilistic Topic Models and Associative Memory": Mark Steyvers, UC Irvine; Tom Griffiths, Brown University; Josh Tenenbaum, MIT
  • "Topics in Semantic Representation": Tom Griffiths, Brown University; Mark Steyvers, UC Irvine; Josh Tenenbaum, MIT
  • "Semantic Representations with Probabilistic Topic Models": Mark Steyvers; joint work with Tom Griffiths, UC Berkeley, and Padhraic Smyth, UC Irvine
  • "Modeling Documents": Amruta Joshi, Department of Computer Science, Stanford University
  • "Cognitive Modeling", Lecture 14: Models of Semantic Processing, University of Edinburgh

  3. Overview
  Today's agenda: Topic Modeling
  • Associative memory
  • The topic model
  • Applications to associative memory
  • Applications in machine learning/text mining

  4. Knowledge organization: two approaches
  Formal vs. content structure. Much information exists as unstructured free text (content structure): expressive, but hard to analyze.
  Two approaches:
  – Using a standardized language a priori (strongly formalized)
  – Interpreting the heterogeneous language a posteriori (NLP, …)
  Diagram labels: taxonomies, ontologies, semantic networks; keyword extraction, folksonomies; free text, code, semantic representation.

  5. What are concept systems?
  Concept systems are systems of distinguishable concepts that are related to one another through relations and can be formulated in a more natural language.
  Goal: developing and establishing a shared understanding.
  Representation systems: human language, logic, "computer languages".
  Semiotic triangle (diagram labels): object, "real world"; word, expression, symbol; notion, concept; knowledge; language.

  7. A third approach: Topic Modeling

  8. Overview
  I Associative memory
  II The topic model
  III Applications to associative memory
  IV Applications in machine learning/text mining

  9. Example of associative memory: word association
  CUE: PLAY
  RESPONSES: FUN, BALL, GAME, WORK, GROUND, MATE, CHILD, ENJOY, WIN, ACTOR

  10. Example of associative memory: free recall
  STUDY THESE WORDS: Bed, Rest, Awake, Tired, Dream, Wake, Snooze, Blanket, Doze, Slumber, Snore, Nap, Peace, Yawn, Drowsy
  RECALL WORDS .....
  FALSE RECALL: "Sleep" (61%)

  11. A theory for semantic association
  • Semantic association as probabilistic inference
  • Representation of semantic structure

  12. Inference from the observed words w
  • Infer g (the gist) from w
  • Infer z (the topic assigned to each word) from w
  • Infer $w_{n+1}$ (the next word) from w

  13. GENERATIVE PROCESS
  Figure: two topics (the mixture components) generate the words of two documents according to per-document mixture weights. The weights shown in the figure are .8, .3, .2 and .7. Each word token is marked with the number of the topic that generated it.
  DOCUMENT 1: money 1, bank 1, bank 1, loan 1, river 2, stream 2, bank 1, money 1, river 2, bank 1, money 1, bank 1, loan 1, money 1, stream 2, bank 1, money 1, ...
  DOCUMENT 2: river 2, loan 1, river 2, stream 2, loan 1, bank 2, river 2, bank 2, bank 1, stream 2, river 2, loan 1, bank 2, stream 2, bank 2, money 1, ...
  • Bag of words
  • Capturing polysemy
  • No notion of mutual exclusivity
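  The generative process can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the slide: the uniform word probabilities inside each topic, the assignment of the weights (.8/.2 to Document 1, .3/.7 to Document 2) and the names `TOPICS` and `generate_document` are assumptions made for the example.

```python
import random

# Assumed toy topics: each topic is a probability distribution over words.
# "bank" appears in both topics, which is how the model captures polysemy.
TOPICS = {
    1: {"money": 1 / 3, "loan": 1 / 3, "bank": 1 / 3},
    2: {"river": 1 / 3, "stream": 1 / 3, "bank": 1 / 3},
}


def generate_document(topic_weights, n_words, rng):
    """Generate (word, topic) pairs: draw a topic per token, then a word from it."""
    topic_ids = list(topic_weights)
    tokens = []
    for _ in range(n_words):
        # Step 1: choose a topic according to the document's mixture weights.
        z = rng.choices(topic_ids, weights=[topic_weights[t] for t in topic_ids])[0]
        # Step 2: choose a word according to that topic's word distribution.
        words = list(TOPICS[z])
        w = rng.choices(words, weights=[TOPICS[z][x] for x in words])[0]
        tokens.append((w, z))
    return tokens


rng = random.Random(0)  # fixed seed so the sketch is repeatable
doc1 = generate_document({1: 0.8, 2: 0.2}, 16, rng)  # assumed weights
doc2 = generate_document({1: 0.3, 2: 0.7}, 16, rng)  # assumed weights
print("DOCUMENT 1:", " ".join(f"{w}{z}" for w, z in doc1))
print("DOCUMENT 2:", " ".join(f"{w}{z}" for w, z in doc2))
```

  Word order carries no information here (bag of words), and a document is free to mix both topics rather than belonging to exactly one.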

  14. The probability of choosing a word:
  $P(w) = \sum_{z=1}^{T} P(w \mid z)\, P(z)$
  where $P(w \mid z = j)$ is the word probability in topic j, $P(z = j)$ is the probability of topic j in the document, and T is the number of topics.
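  As a small worked example of this sum, the sketch below evaluates P(w) for a single document. The two toy topics, the document weights 0.8 and 0.2, and the function name `word_probability` are assumptions chosen for illustration.

```python
# Assumed toy topics: P(w | z) for T = 2 topics.
TOPICS = {
    1: {"money": 1 / 3, "loan": 1 / 3, "bank": 1 / 3},
    2: {"river": 1 / 3, "stream": 1 / 3, "bank": 1 / 3},
}


def word_probability(word, topics, topic_weights):
    """P(w) = sum over topics z of P(w | z) * P(z)."""
    return sum(topics[z].get(word, 0.0) * topic_weights[z] for z in topics)


doc_weights = {1: 0.8, 2: 0.2}  # assumed P(z) for one document
for word in ("money", "bank", "river"):
    print(word, round(word_probability(word, TOPICS, doc_weights), 3))
```

  With these numbers, "bank" receives contributions from both topics (1/3 · 0.8 + 1/3 · 0.2 = 1/3), while "money" is only supported by topic 1 (1/3 · 0.8).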

  15. Bayes' rule: Latent Semantic Structure
  Diagram: latent structure l → words w
  • Distribution over words: $P(w) = \sum_{l} P(w, l)$
  • Inferring latent structure: $P(l \mid w) = \frac{P(w \mid l)\, P(l)}{P(w)}$
  • Prediction: $P(w_{n+1} \mid w) = \ldots$
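  A minimal sketch of these two steps follows. It makes simplifying assumptions that are not stated on the slide: the latent structure l is a single topic shared by all observed words, the next word is independent of the observed words given that topic, and the toy topics, the uniform prior and both function names are made up for illustration.

```python
# Assumed toy topics, P(w | l), and a uniform prior P(l).
TOPICS = {
    1: {"money": 1 / 3, "loan": 1 / 3, "bank": 1 / 3},
    2: {"river": 1 / 3, "stream": 1 / 3, "bank": 1 / 3},
}
PRIOR = {1: 0.5, 2: 0.5}


def posterior_over_topics(words, topics, prior):
    """Bayes' rule: P(l | w) = P(w | l) P(l) / P(w), with P(w) = sum_l P(w, l)."""
    joint = {}
    for l, p_l in prior.items():
        likelihood = 1.0
        for w in words:  # assumption: words are independent given the topic
            likelihood *= topics[l].get(w, 0.0)
        joint[l] = likelihood * p_l  # P(w, l)
    evidence = sum(joint.values())  # P(w)
    if evidence == 0.0:
        raise ValueError("observed words have zero probability under every topic")
    return {l: p / evidence for l, p in joint.items()}


def predict_next_word(words, topics, prior):
    """Assumed form of prediction: P(w_next | w) = sum_l P(w_next | l) P(l | w)."""
    post = posterior_over_topics(words, topics, prior)
    vocab = {w for dist in topics.values() for w in dist}
    return {w: sum(topics[l].get(w, 0.0) * post[l] for l in topics) for w in vocab}


print(posterior_over_topics(["bank"], TOPICS, PRIOR))
print(posterior_over_topics(["bank", "stream"], TOPICS, PRIOR))
print(predict_next_word(["bank", "stream"], TOPICS, PRIOR))
```

  On its own, "bank" leaves both topics equally likely. Adding "stream" rules out the money topic, so the predicted next words come from the river topic only.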

  16. Overview
  I Associative memory
  II The topic model
  III Applications to associative memory
  IV Applications in machine learning/text mining

  17. The Big Idea
  Topic Model
  • Model topics as distributions over words
  Document Model
  • Model documents as distributions over words
  Document / Topic Model
  • Probabilistic model for both
  • Model topics as distributions over words
  • Model documents as distributions over topics

  18. Topic Model
  Unsupervised learning of topics ("gist") of documents:
  – articles/chapters
  – conversations
  – emails
  – .... any verbal context
  Topics are useful latent structures to explain semantic association.

  19. Probabilistic Generative Model
  Each topic is a probability distribution over words.
  From the TASA corpus, a collection of over 37,000 text passages from educational materials (e.g., language & arts, social studies, health, sciences) collected by Touchstone Applied Science Associates (see Landauer, Foltz, & Laham, 1998).

  20. Figure (three elements labeled "observed"), taken from "Topics in Semantic Representation", Thomas L. Griffiths, Mark Steyvers, Joshua B. Tenenbaum.

  21. Inference – Constructing Topic Models
  Expectation Maximization
  • But poor results (local maxima)
  Gibbs Sampling
  – Parameters: φ, θ
  – Start with initial random assignment
  – Update parameter using other parameters
  – Converges after 'n' iterations
  – Burn-in time
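  The Gibbs sampling steps listed above can be sketched as follows. The slide does not say which variant is meant; this sketch uses collapsed Gibbs sampling, a common choice for topic models, which resamples the topic of each word given all other assignments and only afterwards derives φ (topic-word) and θ (document-topic) from the counts. The hyperparameters `alpha` and `beta`, the iteration count, the toy documents and the function name `gibbs_lda` are all assumptions, and burn-in is handled crudely by using only the final state.

```python
import random


def gibbs_lda(docs, n_topics, n_iters=200, alpha=0.5, beta=0.1, seed=0):
    """Collapsed Gibbs sampler sketch; returns assignments z, phi and theta."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics                        # tokens per topic

    # Start with an initial random assignment of topics to word tokens.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current token from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic conditioned on all other assignments.
                weights = [
                    (topic_word[t][w] + beta) / (topic_total[t] + n_vocab * beta)
                    * (doc_topic[d][t] + alpha)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Point estimates from the final sample (smoothed relative frequencies).
    phi = [
        {w: (topic_word[t][w] + beta) / (topic_total[t] + n_vocab * beta) for w in vocab}
        for t in range(n_topics)
    ]
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return z, phi, theta


docs = [
    ["money", "bank", "loan", "bank", "money"],
    ["river", "stream", "bank", "river", "stream"],
    ["money", "loan", "bank", "river", "stream", "bank"],
]
z, phi, theta = gibbs_lda(docs, n_topics=2)
for t, dist in enumerate(phi):
    print("topic", t, sorted(dist, key=dist.get, reverse=True)[:3])
```

  A more careful implementation would discard the burn-in iterations and average φ and θ over several later samples instead of relying on the last one.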

  22. INVERTING THE GENERATIVE PROCESS
  DOCUMENT 1: A Play is written to be performed on a stage before a live audience or before motion picture or television cameras (for later viewing by large audiences). A Play is written because playwrights have something ...
  DOCUMENT 2: He was listening to music coming from a passing riverboat. The music had already captured his heart as well as his ear. It was jazz. Bix Beiderbecke had already had music lessons. He wanted to play the cornet. And he wanted to play jazz ...
  TOPIC 1: ?
  TOPIC 2: ?
  We estimate the assignments of topics to words.

  23. INVERTING THE GENERATIVE PROCESS
  DOCUMENT 1: A Play[082] is written[082] to be performed[082] on a stage[082] before a live[093] audience[082] or before motion[270] picture[004] or television[004] cameras[004] (for later[054] viewing[004] by large[202] audiences[082]). A Play[082] is written[082] because playwrights[082] have something ...
  DOCUMENT 2: He was listening[077] to music[077] coming[009] from a passing[043] riverboat. The music[077] had already captured[006] his heart[157] as well as his ear[119]. It was jazz[077]. Bix Beiderbecke had already had music[077] lessons[077]. He wanted[268] to play[077] the cornet. And he wanted[268] to play[077] jazz[077] ...
  TOPIC 1: ?
  TOPIC 2: ?
  We estimate the assignments of topics to words. The number in brackets is the topic assigned to the word. Words without a topic number (blue on the slide) are stopwords or words not used.
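  Once every content word has a topic assignment, a document's topic mixture can be read off by counting. The sketch below does this for the assignments of Document 2 as listed on the slide; using plain, unsmoothed relative frequencies and the function name `topic_proportions` are assumptions for illustration.

```python
from collections import Counter

# (word, topic id) pairs for DOCUMENT 2, copied from the slide.
doc2_assignments = [
    ("listening", "077"), ("music", "077"), ("coming", "009"), ("passing", "043"),
    ("music", "077"), ("captured", "006"), ("heart", "157"), ("ear", "119"),
    ("jazz", "077"), ("music", "077"), ("lessons", "077"), ("wanted", "268"),
    ("play", "077"), ("wanted", "268"), ("play", "077"), ("jazz", "077"),
]


def topic_proportions(assignments):
    """Share of a document's assigned tokens that belongs to each topic."""
    counts = Counter(topic for _, topic in assignments)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}


proportions = topic_proportions(doc2_assignments)
dominant = max(proportions, key=proportions.get)
print(dominant, proportions[dominant])
```

  Topic 077 covers 9 of the 16 assigned tokens in this passage. In Document 1 the same word "Play" is assigned to topic 082 instead, which illustrates how the topic assignment disambiguates a polysemous word.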


