
Topic Models with Logical Constraints on Words

Hayato Kobayashi, Hiromi Wakaki, Tomohiro Yamasaki, and Masaru Suzuki. Corporate Research and Development Center, Toshiba Corporation, Japan


  1. Topic Models with Logical Constraints on Words. Hayato Kobayashi, Hiromi Wakaki, Tomohiro Yamasaki, and Masaru Suzuki. Corporate Research and Development Center, Toshiba Corporation, Japan

  2. Topic modeling = Word clustering
     • Method to extract latent topics from a corpus
     • Each topic is a distribution over words
     [Figure: a corpus about Bulgaria is fed to LDA]

  3. Topic modeling = Word clustering
     [Figure: LDA extracts a first topic: yogurt, milk, food, fruit, bacteria, fat, cream, …]

  4. Topic modeling = Word clustering
     [Figure: LDA extracts a second topic: rose, oil, organic, essential, valley, pure, kazanlak, …]

  5. Topic modeling = Word clustering
     [Figure: LDA extracts a third topic: dance, fire, sexy, ancient, bikini, walk, exotic, …]
     • The size of each word represents its frequency

  6. Want to split the “dance” topic into “fire dance” and “sexy dance”
     [Figure: topic with the words dance, fire, sexy, ancient, bikini, walk, exotic, …]

  7. Existing work [Andrzejewski+ ICML2009]
     • Constraints on words for topic modeling
     • Must-Link(A,B): A and B appear in the same topic
     • Cannot-Link(A,B): A and B do not appear in the same topic
     • Goal: split into “fire dance” and “sexy dance” using Cannot-Link(fire, sexy)
     [Figure: two topics separated by a Cannot-Link (CL): (dance, fire, ancient, walk, …) and (dance, sexy, bikini, exotic, …)]

  8. Problem of the existing work
     • Constraints often do not align with the user’s intention
     • With Cannot-Link(fire, sexy) alone, you might get a “blaze” topic instead of a “fire dance” topic
     [Figure: two topics separated by a Cannot-Link (CL): (blaze, fire, ancient, forest, …) and (dance, sexy, bikini, exotic, …)]

  9. This work
     • Logical constraints on words for topic modeling
     • Conjunctions (∧), disjunctions (∨), negations (¬)
     • Goal: split into “fire dance” and “sexy dance” using Cannot-Link(fire, sexy) ∧ (Must-Link(dance, fire) ∨ Must-Link(dance, sexy))
     [Figure: two topics, each joined by a Must-Link (ML) to dance and separated by a Cannot-Link (CL): (dance, fire, ancient, walk, …) and (dance, sexy, bikini, exotic, …)]

  10. Outline of the rest of this talk
     • LDA [Blei+ JMLR2003]: a topic modeling method
     • LDA-DF [Andrzejewski+ ICML2009]: Must-Links and Cannot-Links
     • This work: logical expressions of Must-Links and Cannot-Links
     • Experiment
     • Conclusion

  11. Latent Dirichlet Allocation (LDA) [Blei+ JMLR2003]
     • A well-known topic modeling method
     (i) Assume a generative model of documents
        • Each topic is a distribution over words
        • Each document is a distribution over topics
        • Both are drawn from Dirichlet distributions, which generate discrete distributions
     (ii) Infer the parameters of the two distributions by inverting the generative model

  12. Generative process of documents in LDA
     • Each topic is a distribution over words
     • Each document is a distribution over topics
     [Figure: Topic 1, Topic 2, Document 1, Document 2]

  13. Generative process of documents in LDA
     [Figure: Topic 1 = (yogurt, milk, food, fruit, …); Topic 2 = (rose, oil, organic, essential, …)]

  14. Generative process of documents in LDA
     [Figure: Document 1 draws its words from Topic 1 with weight 0.9 and from Topic 2 with weight 0.1, so it consists mostly of words such as yogurt, milk, and food, with a few occurrences of rose and oil]

  15. Generative process of documents in LDA
     [Figure: Document 2 draws its words from Topic 1 with weight 0.2 and from Topic 2 with weight 0.8, so it consists mostly of words such as rose, oil, and organic, with a few occurrences of yogurt and milk]
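The generative process on slides 12 to 15 can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names, the symmetric hyperparameters `alpha` and `beta`, and the toy vocabulary are assumptions made for the example, not part of the talk.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha=0.5, beta=0.5, seed=0):
    """Toy LDA generative process (hypothetical helper, symmetric priors)."""
    rng = random.Random(seed)
    # Each topic is a distribution over words.
    topics = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    mixtures, docs = [], []
    for _ in range(n_docs):
        # Each document is a distribution over topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # pick a topic
            w = rng.choices(vocab, weights=topics[z])[0]        # pick a word from it
            words.append(w)
        mixtures.append(theta)
        docs.append(words)
    return topics, mixtures, docs

vocab = ["yogurt", "milk", "food", "rose", "oil", "organic"]
topics, mixtures, docs = generate_corpus(vocab, n_topics=2, n_docs=2, doc_len=10)
print(docs[0])
```

Inference, as on the next slide, runs in the opposite direction: only `docs` is observed, and `topics` and `mixtures` have to be estimated.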

  16. Parameter inference in LDA
     • Infer the word and topic distributions from a corpus by inverting the generative process
     [Figure: same layout as slide 15, with the unknown quantities marked “?”]

  17. LDA-DF [Andrzejewski+ ICML2009]
     • Semi-supervised extension of LDA
     • Supports only conjunctions of Must-Links and Cannot-Links
        • Must-Link(A,B): A and B appear in the same topic
        • Cannot-Link(A,B): A and B do not appear in the same topic
     • Extends the generative process
        • Each topic is a constrained distribution over words, drawn from a Dirichlet tree distribution, which is a generalization of a Dirichlet distribution
        • Each document is a distribution over topics, drawn from a Dirichlet distribution

  18. Generative process of LDA-DF
     • Always generates topic distributions in which yogurt and rose do not appear in the same topic
     [Figure: same layout as slide 15, with a Cannot-Link (CL) constraint added]

  19. Algorithm to generate distributions in LDA-DF
     1. Map links to a graph
     2. Contract Must-Links
     3. Extract the maximal independent sets (MIS)
     4. Generate a distribution based on each MIS

  20. Algorithm to generate distributions in LDA-DF: 1. Map links to a graph
     • Any conjunction of links can be mapped to a graph (words → nodes, links → edges)
     • Example: Cannot-Link(A,B) ∧ Cannot-Link(E,G) ∧ Must-Link(B,E) ∧ Must-Link(C,D)
     [Figure: graph on nodes A to G with CL edges A–B and E–G and ML edges B–E and C–D]

  21. Algorithm to generate distributions in LDA-DF: 2. Contract Must-Links
     • Regard the two words on each Must-Link as one word
     [Figure: contracted graph on nodes A, BE, CD, F, G with CL edges A–BE and BE–G]
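Steps 1 and 2 can be sketched with a small union-find over the words. This is an illustrative sketch, not the authors' implementation; the function name `contract_must_links` and the convention of naming a contracted node by concatenating its words are assumptions chosen to match the labels on the slide.

```python
def contract_must_links(words, must_links, cannot_links):
    """Merge words joined by Must-Links and re-attach Cannot-Link edges."""
    parent = {w: w for w in words}

    def find(w):
        # Follow parent pointers to the representative, compressing the path.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in must_links:
        parent[find(a)] = find(b)

    # Collect the members of each contracted node.
    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    name = {root: "".join(sorted(members)) for root, members in groups.items()}

    nodes = sorted(name.values())
    # A Cannot-Link inside one contracted node would yield a one-element
    # edge here; that would signal contradictory constraints.
    edges = {frozenset((name[find(a)], name[find(b)])) for a, b in cannot_links}
    return nodes, edges

nodes, edges = contract_must_links(
    list("ABCDEFG"),
    must_links=[("B", "E"), ("C", "D")],
    cannot_links=[("A", "B"), ("E", "G")],
)
print(nodes)  # ['A', 'BE', 'CD', 'F', 'G']
```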

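  22. Algorithm to generate distributions in LDA-DF: 3. Extract the maximal independent sets (MIS)
     • MIS = maximal set of nodes with no edges between them
     [Figure: from the contracted graph (nodes A, BE, CD, F, G; CL edges A–BE and BE–G), two MISs are extracted: {BE, CD, F} and {A, CD, F, G}]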
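For a graph as small as the one on the slide, the maximal independent sets can be found by brute force. The sketch below enumerates all node subsets, which is exponential in the number of nodes and only meant to illustrate the definition; the function name is hypothetical.

```python
from itertools import combinations

def maximal_independent_sets(nodes, edges):
    """Enumerate all maximal independent sets of a small graph by brute force."""
    edge_set = {frozenset(e) for e in edges}

    def independent(subset):
        # No pair of nodes in the subset may be joined by an edge.
        return all(frozenset(p) not in edge_set for p in combinations(subset, 2))

    candidates = [
        frozenset(s)
        for r in range(1, len(nodes) + 1)
        for s in combinations(nodes, r)
        if independent(s)
    ]
    # Keep only the sets that are not strictly contained in another independent set.
    return [s for s in candidates if not any(s < t for t in candidates)]

mis = maximal_independent_sets(
    ["A", "BE", "CD", "F", "G"],
    [("A", "BE"), ("BE", "G")],
)
print([sorted(s) for s in mis])  # [['BE', 'CD', 'F'], ['A', 'CD', 'F', 'G']]
```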

  23. Algorithm to generate distributions in LDA-DF: 4. Generate a distribution based on each MIS
     • Equalize the frequencies of contracted words
     • Zero the frequencies of words not in the MIS
     [Figure: for MIS {BE, CD, F}, B and E get equal frequency, C and D get the same frequency, and A and G get zero frequency; for MIS {A, CD, F, G}, C and D get the same frequency, and B and E get zero frequency]
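The two rules of step 4 can be illustrated as follows. Note the simplification: LDA-DF draws topics from a Dirichlet tree, whereas this sketch draws node weights from a flat Dirichlet(1, …, 1) via normalized Gamma draws and applies the constraints exactly rather than approximately. The function name and the representation of an MIS as a list of word tuples are assumptions.

```python
import random

def topic_from_mis(words, mis, seed=0):
    """Sample a word distribution that respects one MIS.

    mis: list of tuples, each holding the words of one (possibly contracted) node.
    """
    rng = random.Random(seed)
    # One weight per MIS node, normalized to sum to 1.
    draws = [rng.gammavariate(1.0, 1.0) for _ in mis]
    total = sum(draws)
    probs = {w: 0.0 for w in words}  # words outside the MIS keep zero frequency
    for members, draw in zip(mis, draws):
        for w in members:
            # Contracted words share their node's weight equally.
            probs[w] = draw / total / len(members)
    return probs

probs = topic_from_mis(list("ABCDEFG"), [("B", "E"), ("C", "D"), ("F",)])
print(probs)
```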

  24. This work
     • Algorithm to generate logically constrained distributions in LDA-DF
     • The existing algorithm cannot be applied
     • Example: (¬Cannot-Link(A,B) ∨ Must-Link(A,C)) ∧ Cannot-Link(B,C)
     • This constraint cannot be mapped to a graph (words → nodes, links → edges)

  25. Negations
     • Delete negations (¬) in a preprocessing stage
     • Weak negation: ¬Must-Link(A,B) = no constraint (A and B need not appear in the same topic)
     • Strong negation: ¬Must-Link(A,B) = Cannot-Link(A,B) (A and B must not appear in the same topic)
     • Example: (¬Cannot-Link(A,B) ∨ Must-Link(A,C)) ∧ Cannot-Link(B,C) becomes (Must-Link(A,B) ∨ Must-Link(A,C)) ∧ Cannot-Link(B,C)
     • After this step, focus only on conjunctions and disjunctions
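The preprocessing step can be sketched on expressions stored as nested tuples. Several things here are assumptions: the tuple encoding, the function name, the restriction to negations applied directly to a link, and the treatment of a weakly negated Cannot-Link as "no constraint" (the slide defines weak negation only for Must-Link). Strong negation of a Cannot-Link follows the slide's example.

```python
TRUE = ("true",)  # stands for "no constraint"

def remove_negations(expr, strong=True):
    """Eliminate ("not", link) nodes from a nested-tuple constraint expression."""
    op = expr[0]
    if op in ("ML", "CL", "true"):
        return expr
    if op == "not":
        inner = expr[1]
        if inner[0] not in ("ML", "CL"):
            raise ValueError("this sketch only handles negated links")
        if not strong:
            return TRUE                      # weak negation: no constraint
        flipped = "CL" if inner[0] == "ML" else "ML"
        return (flipped,) + inner[1:]        # strong negation: swap the link type
    args = [remove_negations(a, strong) for a in expr[1:]]
    if op == "and":
        args = [a for a in args if a != TRUE]  # "no constraint" is neutral in a conjunction
        if not args:
            return TRUE
    elif op == "or":
        if TRUE in args:                       # a disjunction with "no constraint" is no constraint
            return TRUE
    else:
        raise ValueError(f"unknown operator: {op}")
    return args[0] if len(args) == 1 else (op, *args)

expr = ("and",
        ("or", ("not", ("CL", "A", "B")), ("ML", "A", "C")),
        ("CL", "B", "C"))
print(remove_negations(expr, strong=True))
# ('and', ('or', ('ML', 'A', 'B'), ('ML', 'A', 'C')), ('CL', 'B', 'C'))
```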

  26. Key observation for logical expressions
     • Any constrained distribution can be represented by a conjunction of two kinds of primitives
        • EqualPrim(A, B): makes p(A) ≈ p(B)
        • ZeroPrim(A): makes p(A) ≈ 0
     • Example: EqualPrim(B, E) ∧ EqualPrim(C, D) ∧ ZeroPrim(A) ∧ ZeroPrim(G)
     [Figure: distribution over words A to G in which B and E have equal frequency, C and D have the same frequency, and A and G have zero frequency]

  27. Substitution of links with primitives
     • Must-Link(A,B) = EqualPrim(A,B)
     • Cannot-Link(A,B) = ZeroPrim(A) ∨ ZeroPrim(B)
     [Figure: two distributions over words A, B, C, …, one with p(A) ≈ 0 and one with p(B) ≈ 0; both satisfy Cannot-Link(A,B)]
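The substitution rule is a straightforward recursive rewrite. The sketch below assumes negation-free expressions encoded as nested tuples with the tags `"ML"`, `"CL"`, `"and"`, and `"or"`; the encoding and the function name are illustrative choices, not taken from the talk.

```python
def links_to_primitives(expr):
    """Rewrite Must-Links and Cannot-Links into EqualPrim / ZeroPrim primitives."""
    op = expr[0]
    if op == "ML":
        # Must-Link(A,B) = EqualPrim(A,B)
        return ("Equal", expr[1], expr[2])
    if op == "CL":
        # Cannot-Link(A,B) = ZeroPrim(A) or ZeroPrim(B)
        return ("or", ("Zero", expr[1]), ("Zero", expr[2]))
    if op in ("and", "or"):
        return (op, *(links_to_primitives(a) for a in expr[1:]))
    raise ValueError(f"unknown operator: {op}")

expr = ("and", ("or", ("ML", "A", "B"), ("ML", "A", "C")), ("CL", "B", "C"))
print(links_to_primitives(expr))
# ('and', ('or', ('Equal', 'A', 'B'), ('Equal', 'A', 'C')), ('or', ('Zero', 'B'), ('Zero', 'C')))
```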

  28. Proposed algorithm for logical expressions
     1. Substitute links with primitives
     2. Calculate the minimum disjunctive normal form (DNF) of the primitives
     3. Generate distributions for each conjunction of the DNF

  29. Proposed algorithm for logical expressions: 1. Substitute links with primitives
     • (Must-Link(A,B) ∨ Must-Link(A,C)) ∧ Cannot-Link(B,C) becomes (EqualPrim(A,B) ∨ EqualPrim(A,C)) ∧ (ZeroPrim(B) ∨ ZeroPrim(C))

  30. Proposed algorithm for logical expressions: 2. Calculate the minimum disjunctive normal form (DNF) of the primitives
     • DNF = disjunction of conjunctions of primitives
     • (EqualPrim(A,B) ∨ EqualPrim(A,C)) ∧ (ZeroPrim(B) ∨ ZeroPrim(C)) becomes (EqualPrim(A,B) ∧ ZeroPrim(B)) ∨ (EqualPrim(A,B) ∧ ZeroPrim(C)) ∨ (EqualPrim(A,C) ∧ ZeroPrim(B)) ∨ (EqualPrim(A,C) ∧ ZeroPrim(C))
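One way to compute such a DNF is to distribute ∧ over ∨ and then drop every conjunction that strictly contains another one (absorption). For negation-free expressions whose primitives are treated as independent Boolean variables, absorption leaves the minimal DNF; whether this matches the minimization used by the authors is an assumption, as are the tuple encoding and the function name.

```python
from itertools import product

def to_dnf(expr):
    """Return the DNF of a negation-free expression as a set of frozensets.

    Each frozenset is one conjunction of primitives such as ("Zero", "B").
    """
    op = expr[0]
    if op in ("Equal", "Zero"):
        return {frozenset([expr])}
    parts = [to_dnf(a) for a in expr[1:]]
    if op == "or":
        terms = set().union(*parts)
    elif op == "and":
        # Distribute: pick one conjunction from each operand and merge them.
        terms = {frozenset().union(*combo) for combo in product(*parts)}
    else:
        raise ValueError(f"unknown operator: {op}")
    # Absorption: X or (X and Y) simplifies to X.
    return {t for t in terms if not any(other < t for other in terms)}

expr = ("and",
        ("or", ("Equal", "A", "B"), ("Equal", "A", "C")),
        ("or", ("Zero", "B"), ("Zero", "C")))
for conjunction in to_dnf(expr):
    print(sorted(conjunction))
```

For the example from the slide this yields the four conjunctions listed above, each pairing one EqualPrim with one ZeroPrim.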

  31. Proposed algorithm for logical expressions: 3. Generate distributions for each conjunction of the DNF
     • Combine the primitives of each conjunction into one constrained distribution
     [Figure: each conjunction of the DNF yields one distribution over the words A to G]
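Step 3 can be sketched by turning one conjunction of primitives into a sampled word distribution. As a simplification, the sketch uses a flat Dirichlet(1, …, 1) over groups of equalized words instead of a Dirichlet tree and enforces the primitives exactly. Zeroing a whole group when one of its members is zeroed is an assumption about how conflicting primitives combine, and the function name is hypothetical.

```python
import random

def sample_from_conjunction(words, conjunction, seed=0):
    """Sample a word distribution satisfying a conjunction of primitives.

    conjunction: iterable of ("Equal", a, b) and ("Zero", a) tuples.
    Raises ZeroDivisionError if every word ends up zeroed.
    """
    rng = random.Random(seed)
    # Merge words joined by EqualPrim into shared groups.
    group = {w: {w} for w in words}
    for prim in conjunction:
        if prim[0] == "Equal":
            merged = group[prim[1]] | group[prim[2]]
            for w in merged:
                group[w] = merged
    zeroed = {prim[1] for prim in conjunction if prim[0] == "Zero"}

    # One weight per group; a group containing a zeroed word gets weight 0.
    weights = {}
    for w in words:
        key = frozenset(group[w])
        if key & zeroed:
            weights[key] = 0.0
        elif key not in weights:
            weights[key] = rng.gammavariate(1.0, 1.0)
    total = sum(weights.values())
    # Words in the same group share the group's weight equally.
    return {w: weights[frozenset(group[w])] / (len(group[w]) * total) for w in words}

conj = frozenset({("Equal", "A", "B"), ("Zero", "C")})
print(sample_from_conjunction(["A", "B", "C", "D"], conj))
```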
