1. Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data
   Zhiyuan Chen and Bing Liu
   University of Illinois at Chicago
   liub@cs.uic.edu
   ICML 2014, Beijing, June 22-24, 2014

2. Introduction
   - Topic models such as LDA (Blei et al., 2003), pLSA (Hofmann, 1999), and their variants are widely used to discover topics in documents.
   - Fully unsupervised models are often insufficient, because their objective functions may not correlate well with human judgments (Chang et al., 2009).
   - Knowledge-based topic models (KBTM) do better (Andrzejewski et al., 2009; Mukherjee & Liu, 2012; etc.), but they are not automatic: they need user-provided prior knowledge for each domain.

3. How to Improve Further?
   - We can invent better topic models.
   - But how about learning like humans (a systems approach)?
     - What we learned in the past helps future learning.
     - Whenever we see a new situation, we almost know it already; few aspects are really new.
     - It shares a lot with what we have seen in the past.

4. Take a Major Step Forward
   - Knowledge-based modeling is still traditional: the knowledge is provided by the user and assumed correct, and it is not automatic (each domain needs new knowledge from the user).
   - Question: can we mine prior knowledge systematically and automatically?
   - Answer: yes, with big data (many domains).
   - Implication: learn forever; past learning results help future learning (lifelong learning).

5. Why? (an example from opinion mining)
   - Topic overlap across domains: although every domain is different, there is a fair amount of topic overlap across domains, e.g.,
     - every product review domain has the topic price,
     - most electronic products share the topic battery,
     - some products share the topic screen.
   - If we have good topics from a large number of past domain collections (big data), then for a new collection we can use the existing topics to generate high-quality prior knowledge automatically.

6. An Example
   - We have reviews from 3 domains, and each domain yields a topic about price:
     - Domain 1: {price, color, cost, life}
     - Domain 2: {cost, picture, price, expensive}
     - Domain 3: {price, money, customer, expensive}
   - Mining quality knowledge: require words to appear together in at least two domains. We get {price, cost} and {price, expensive}.
   - The words in each set are likely to belong to the same topic (see the sketch below).
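This mining step amounts to frequent itemset mining over the top words of matched topics. Below is a minimal Python sketch of the two-domain-support rule from the example; the function name, the restriction to word pairs (length-2 itemsets), and the `top_n` cutoff are illustrative assumptions, not from the slides.

```python
from itertools import combinations
from collections import Counter

def mine_pk_sets(matched_topics, min_support=2, top_n=4):
    """Mine prior-knowledge sets (pk-sets): word pairs that co-occur in
    the top words of matched topics from at least `min_support` domains."""
    pair_counts = Counter()
    for topic in matched_topics:                  # one word list per domain
        for pair in combinations(sorted(topic[:top_n]), 2):
            pair_counts[pair] += 1
    return [set(p) for p, c in pair_counts.items() if c >= min_support]

# The three price topics from the slide:
topics = [
    ["price", "color", "cost", "life"],           # Domain 1
    ["cost", "picture", "price", "expensive"],    # Domain 2
    ["price", "money", "customer", "expensive"],  # Domain 3
]
print(mine_pk_sets(topics))   # pk-sets {price, cost} and {price, expensive}
```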

7. Run a KBTM: the Example (cont.)
   - If we run a KBTM on the reviews of Domain 1 with this knowledge, we may find a new topic about price: {price, cost, expensive, color}.
   - We now get 3 coherent words in the top 4 positions, rather than only 2 as in the old topic {price, color, cost, life}.
   - A good topic improvement.

8. Problem Statement
   - Given a large set of document collections D = {D_1, ..., D_n}, learn from D to produce results S.
   - Goal: given a test collection D^t, learn from D^t with the help of S (and possibly D). D^t may or may not be in D (D^t ∈ D or D^t ∉ D).
   - The results learned this way should be better than those learned without the guidance of S (and D).

9. LTM – the Lifelong Topic Model
   - Cold start (initialization): run LDA on each D_i ∈ D to get topics S_i; S = ∪ S_i.
   - Given a new domain collection D^t:
     - run LDA on D^t to get topics A^t;
     - find matching topics M_j from S for each topic a_j ∈ A^t;
     - mine knowledge k_j from each M_j; K^t = ∪ k_j;
     - run a KBTM on D^t with the help of K^t to get a new A^t (the KBTM uses K^t and also deals with wrong knowledge in K^t);
     - update S with A^t.
   (A schematic sketch of this loop follows.)
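The control flow above, as a schematic Python sketch. The callables `lda`, `kbtm`, `match_topics`, and `mine_knowledge` are placeholders for the components described on this and the following slides, not the authors' implementation.

```python
def ltm(D, D_t, lda, kbtm, match_topics, mine_knowledge):
    """Lifelong Topic Model, one new task (schematic sketch)."""
    # Cold start: run LDA on every past collection and pool the topics.
    S = []
    for D_i in D:
        S.extend(lda(D_i))

    # New domain: initial topics, then knowledge-guided re-modeling.
    A_t = lda(D_t)
    K_t = []
    for a_j in A_t:
        M_j = match_topics(a_j, S)        # similar past topics (p-topics)
        K_t.extend(mine_knowledge(M_j))   # pk-sets mined from M_j
    A_t = kbtm(D_t, K_t)                  # KBTM also handles wrong knowledge
    S.extend(A_t)                         # update the topic base
    return A_t, S
```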

10. Prior Topic Generation (cold start)
   - Run LDA on each D_j ∈ D to produce a set of topics S_j, called prior topics (or p-topics).
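As an illustration, this step could be written with an off-the-shelf LDA implementation such as gensim (an assumed tooling choice; the slides do not prescribe one):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def prior_topics(docs, num_topics=15, top_n=15):
    """Run LDA on one domain collection (docs: lists of tokens) and
    return its p-topics as lists of top words."""
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    return [[w for w, _ in lda.show_topic(t, topn=top_n)]
            for t in range(num_topics)]
```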

11. LTM Topic Model
   - (1) Mine prior knowledge (pk-sets); (2) use the prior knowledge to guide modeling.

12. Knowledge Mining Function
   - Topic match: for each current topic a_j, find similar topics M_j from the p-topics.
   - Pattern mining: find frequent itemsets from M_j.
   (A matching sketch follows.)
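The slides do not spell out the similarity measure used for topic matching; Jaccard overlap of the top words is one simple stand-in (the measure, cutoff, and threshold here are assumptions):

```python
def match_topics(current_topic, p_topics, top_n=10, min_sim=0.2):
    """Return the p-topics whose top words overlap enough with the
    current topic's top words (Jaccard similarity, an assumed measure)."""
    cur = set(current_topic[:top_n])
    matched = []
    for p in p_topics:
        cand = set(p[:top_n])
        if len(cur & cand) / len(cur | cand) >= min_sim:
            matched.append(p)
    return matched
```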

13. Model Inference: Gibbs Sampling
   - How do we use the prior knowledge (pk-sets), e.g., {price, cost} and {price, expensive}?
   - How do we tackle wrong knowledge?
   - The graphical model is the same as LDA, but the inference is different: a Generalized Pólya Urn model (GPU) (Mimno et al., 2011).
   - Idea: when assigning a topic t to a word w, also assign a fraction of t to the words that share a pk-set with w (sketched below).
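A sketch of the GPU count update this idea implies. The layout of `A` is an assumption: `A[word]` maps each word w' sharing a pk-set with `word` to its promoted fraction (the matrix entry defined on the next slide), with `A[word][word] == 1`.

```python
def gpu_update(word, topic, n_tw, A, sign=+1):
    """Generalized Pólya urn update (schematic sketch): assigning `topic`
    to `word` also adds a fraction of a count to every word sharing a
    pk-set with it. Call with sign=-1 to undo before resampling a token."""
    for w_prime, frac in A[word].items():   # includes word itself, frac 1
        n_tw[topic][w_prime] += sign * frac
```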

14. Dealing with Wrong Knowledge
   - Some pieces of automatically mined knowledge (pk-sets) may be wrong.
   - Deal with them in sampling by deciding the promoted fraction, so that words in a pk-set {w, w'} are promoted only to the extent that they are actually associated:

     $\mathbb{A}_{w',w} = \begin{cases} 1 & \text{if } w = w' \\ \mu \times \mathrm{PMI}(w, w') & \text{if } \{w, w'\} \text{ is a pk-set} \\ 0 & \text{otherwise} \end{cases}$

     where $\mathrm{PMI}(w, w') = \log \frac{P(w, w')}{P(w)\,P(w')}$.
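For illustration, these weights could be estimated as follows; the document-frequency estimates of P(·), the value of `mu`, and the negative-PMI clipping are assumptions, not from the slide:

```python
import math

def promotion_matrix(pk_sets, docs, mu=0.3):
    """Compute the off-diagonal promotion weights mu * PMI(w, w') for
    each pk-set {w, w'}. P(.) is estimated from document frequencies
    in the current domain; mu = 0.3 is an assumed scale."""
    doc_sets = [set(d) for d in docs]      # docs: lists of tokens
    N = len(doc_sets)
    def P(*words):
        return sum(all(w in ds for w in words) for ds in doc_sets) / N
    A = {}
    for s in pk_sets:
        w1, w2 = tuple(s)
        joint = P(w1, w2)
        if joint > 0:
            pmi = math.log(joint / (P(w1) * P(w2)))
            # Negative PMI suggests the pair is not really associated;
            # clipping it to zero (no promotion) is one simple treatment.
            A[(w1, w2)] = A[(w2, w1)] = mu * max(pmi, 0.0)
    return A    # diagonal entries are implicitly 1, all others 0
```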

15. Gibbs Sampler
   - With the promotion matrix $\mathbb{A}$, the sampling distribution for the topic of token i (word $w_i$ in document m) becomes:

     $P(z_i = t \mid \mathbf{z}^{-i}, \mathbf{w}, \alpha, \beta) \propto \frac{n^{-i}_{m,t} + \alpha}{\sum_{t'=1}^{T} \left( n^{-i}_{m,t'} + \alpha \right)} \times \frac{\sum_{w'=1}^{V} \mathbb{A}_{w',w_i} \, n^{-i}_{t,w'} + \beta}{\sum_{v=1}^{V} \left( \sum_{w'=1}^{V} \mathbb{A}_{w',v} \, n^{-i}_{t,w'} + \beta \right)}$

     where $n_{m,t}$ counts topic t in document m, $n_{t,w'}$ counts word $w'$ under topic t, and the superscript $-i$ excludes the current token.
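A vectorized sketch of this conditional. The names and data layout are assumptions, and in practice the promoted counts `n_tw @ A` would be maintained incrementally rather than recomputed per token:

```python
import numpy as np

def topic_conditional(m, w, n_mt, n_tw, A, alpha, beta):
    """P(z_i = t | ...) over all topics t, with the current token
    already removed from the counts. n_mt: doc-topic counts (M x T);
    n_tw: topic-word counts (T x V); A: V x V promotion matrix."""
    V = n_tw.shape[1]
    promoted = n_tw @ A[:, w]                    # sum_w' A[w', w] * n[t, w']
    denom = (n_tw @ A).sum(axis=1) + V * beta    # per-topic word normalizer
    # The doc-topic normalizer is constant in t, so p.sum() absorbs it.
    p = (n_mt[m] + alpha) * (promoted + beta) / denom
    return p / p.sum()
```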

16. Evaluation
   - We used review collections D from 50 domains, each with 1,000 reviews.
   - Four domains with 10,000 reviews were used for the large-data test.
   - Two test settings, representing two possible uses of LTM:
     - the test domain has been seen before, i.e., D^t ∈ D;
     - the test domain has not been seen before, i.e., D^t ∉ D.

17. Topic Coherence (Mimno et al., EMNLP-2011)
   [Results chart omitted from this transcription.]
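For reference, the coherence metric of Mimno et al. (2011) scores a topic's top M words by their co-document frequencies; a direct transcription in Python (assuming every top word occurs in the reference corpus):

```python
import math

def topic_coherence(top_words, doc_sets):
    """C(t) = sum_{m=2}^{M} sum_{l<m} log((D(v_m, v_l) + 1) / D(v_l)),
    where D(.) counts the documents containing the given word(s).
    Higher (less negative) scores indicate more coherent topics."""
    def D(*words):
        return sum(all(w in ds for w in words) for ds in doc_sets)
    return sum(math.log((D(top_words[m], top_words[l]) + 1)
                        / D(top_words[l]))
               for m in range(1, len(top_words))
               for l in range(m))
```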

18. Topic Coherence on 4 Large Datasets
   - Can LTM improve with larger data?

19. Splitting a Large Dataset into 10 Smaller Ones
   - Here we use data from only one domain.
   - Result: better topic coherence and better efficiency (30%).

20. Summary
   - Proposed a lifelong topic model, LTM. It keeps a large topic base S.
   - For each new topic-modeling task:
     - run LDA to generate a set of initial topics;
     - find matching old topics from S;
     - mine quality knowledge from the old topics;
     - use the knowledge to help generate better topics.
   - With big data (from diverse domains), we can do what we could not do, or have not done, before.
