Topic Modeling using Topics from Many Domains, Lifelong Learning and Big Data
Zhiyuan Chen and Bing Liu
University of Illinois at Chicago
liub@cs.uic.edu
Introduction
- Topic models, such as LDA (Blei et al., 2003), pLSA (Hofmann, 1999), and their variants, are widely used to discover topics in documents.
- Fully unsupervised models are often insufficient because their objective functions may not correlate well with human judgments (Chang et al., 2009).
- Knowledge-based topic models (KBTM) do better (Andrzejewski et al., 2009; Mukherjee & Liu, 2012; etc.), but they are not automatic: the user must provide prior knowledge for each domain.
How to Improve Further?
- We can invent better topic models, but how about learning like humans?
- What we learned in the past helps future learning.
- Whenever we see a new situation, we almost know it already; few aspects are really new. It shares a lot with what we have seen in the past.
- (A systems approach)
Take a Major Step Forward
- Knowledge-based modeling is still traditional: the knowledge is provided by the user and assumed correct, and it is not automatic (each domain needs new knowledge from the user).
- Question: Can we mine prior knowledge systematically and automatically?
- Answer: Yes, with big data (many domains).
- Implication: Learn forever; past learning results help future learning: lifelong learning.
Why? (An Example from Opinion Mining)
- Topic overlap across domains: although every domain is different, there is a fair amount of topic overlap across domains, e.g.,
  - every product review domain has the topic price,
  - most electronic products share the topic battery,
  - some products share the topic screen.
- If we have good topics from a large number of past domain collections (big data), then for a new collection we can use the existing topics to generate high-quality prior knowledge automatically.
An Example
- We have reviews from three domains, and each domain gives a topic about price:
  - Domain 1: {price, color, cost, life}
  - Domain 2: {cost, picture, price, expensive}
  - Domain 3: {price, money, customer, expensive}
- Mining quality knowledge: require a set of words to appear together in topics from at least two domains. We get {price, cost} and {price, expensive}; the words in each set are likely to belong to the same topic (see the sketch below).
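As a rough illustration of this mining step, a minimal sketch (not the paper's implementation; the names and the support threshold here are purely illustrative) that counts word pairs across the matched domain topics:

```python
from itertools import combinations

# Illustrative sketch of the mining step above: a word pair becomes a piece of
# prior knowledge (pk-set) only if it appears together in topics from at least
# two domains.

domain_price_topics = [
    {"price", "color", "cost", "life"},          # Domain 1
    {"cost", "picture", "price", "expensive"},   # Domain 2
    {"price", "money", "customer", "expensive"}, # Domain 3
]

def mine_pk_sets(topics, min_support=2):
    counts = {}
    for topic in topics:
        for pair in combinations(sorted(topic), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return [set(pair) for pair, c in counts.items() if c >= min_support]

print(mine_pk_sets(domain_price_topics))
# [{'cost', 'price'}, {'expensive', 'price'}]
```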
Run a KBTM: An Example (cont.)
- If we run a KBTM on the reviews of Domain 1 with this knowledge, we may find a new topic about price: {price, cost, expensive, color}.
- We now get 3 coherent words in the top 4 positions, rather than only 2 as in the old topic: {price, color, cost, life}.
- A good topic improvement.
Problem Statement
- Given a large set of document collections D = {D_1, …, D_n}, learn from D to produce results S.
- Goal: Given a test collection D_t, learn from D_t with the help of S (and possibly D). D_t may or may not have been seen before, i.e., D_t ∈ D or D_t ∉ D.
- The results learned this way should be better than those learned without the guidance of S (and D).
LTM – Lifelong Learning Topic Model
- Cold start (initialization): run LDA on each D_i ∈ D => topics S_i; S = ∪ S_i.
- Given a new domain collection D_t:
  - Run LDA on D_t => topics A_t.
  - Find matching topics M_j from S for each topic a_j ∈ A_t.
  - Mine knowledge k_j from each M_j; K_t = ∪ k_j.
  - Run a KBTM on D_t with the help of K_t => new A_t. The KBTM uses K_t and also deals with wrong knowledge in K_t.
  - Update S with A_t.
(A sketch of this loop is given below.)
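A high-level sketch of the loop on this slide; the callables run_lda, run_kbtm, match_topics, and mine_knowledge are passed in as placeholders for the components named above, not real library calls:

```python
# Sketch of the LTM procedure. All helper functions are supplied by the caller.

def ltm(past_collections, new_collection,
        run_lda, run_kbtm, match_topics, mine_knowledge):
    # Cold start: run LDA on each past domain D_i and pool the p-topics into S.
    S = []
    for D_i in past_collections:
        S.extend(run_lda(D_i))                  # topics S_i; S = union of all S_i

    # New task: initial topics A_t for the test domain D_t.
    A_t = run_lda(new_collection)

    # Mine prior knowledge K_t from matching old topics.
    K_t = []
    for a_j in A_t:
        M_j = match_topics(a_j, S)              # similar p-topics for topic a_j
        K_t.extend(mine_knowledge(M_j))         # frequent word sets (pk-sets) k_j

    # Re-run a knowledge-based topic model guided by K_t; the KBTM must also
    # tolerate wrong knowledge in K_t (handled in the sampler, later slides).
    A_t = run_kbtm(new_collection, K_t)

    # Lifelong step: add the new topics to the topic base for future tasks.
    S.extend(A_t)
    return A_t, S
```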
Prior Topic Generation (Cold Start)
- Run LDA on each D_i ∈ D to produce a set of topics S_i, called prior topics (or p-topics).
LTM Topic Model
(1) Mine prior knowledge (pk-sets).
(2) Use the prior knowledge to guide modeling.
Knowledge Mining Function
- Topic match: find the similar p-topics M_j for each current topic a_j ∈ A_t.
- Pattern mining: find frequent itemsets from M_j; these become the pk-sets.
(A stand-in for the matching step is sketched below.)
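The slide does not specify the similarity measure used for topic matching, so the sketch below uses Jaccard overlap of the top words purely as a stand-in; the top-n cutoff and threshold are also assumptions:

```python
# Stand-in sketch of the topic-match step: a p-topic matches a current topic if
# their top-n word lists overlap enough (Jaccard similarity). The actual measure
# and thresholds may differ from what is assumed here.

def match_topics(current_topic, p_topics, top_n=15, min_sim=0.2):
    """current_topic and each element of p_topics: words ranked by probability."""
    a = set(current_topic[:top_n])
    matched = []
    for p in p_topics:
        b = set(p[:top_n])
        if len(a & b) / len(a | b) >= min_sim:
            matched.append(p)
    return matched  # M_j: fed to frequent itemset mining to produce pk-sets
```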
Model Inference: Gibbs Sampling
- How to use the prior knowledge (pk-sets), e.g., {price, cost} and {price, expensive}?
- How to tackle wrong knowledge?
- Graphical model: same as LDA, but the model inference is different.
- Generalized Pólya Urn model (GPU) (Mimno et al., 2011). Idea: when assigning a topic t to a word w, also assign a fraction of t to the words that share a pk-set with w (see the sketch below).
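A minimal sketch of the Generalized Pólya Urn count update just described; the data structures and names are illustrative, not the authors' code:

```python
# GPU-style count update: when word w is assigned topic t, every word w2 that
# shares a pk-set with w also receives a fractional count for t, so related
# words get promoted under the same topic.

def gpu_increment(n_topic_word, t, w, pk_neighbors, promotion):
    """n_topic_word: dict (topic, word) -> fractional count.
    pk_neighbors[w]: words sharing a pk-set with w.
    promotion[(w, w2)]: fraction to propagate (e.g. mu * PMI, next slides)."""
    n_topic_word[(t, w)] = n_topic_word.get((t, w), 0.0) + 1.0
    for w2 in pk_neighbors.get(w, ()):
        n_topic_word[(t, w2)] = (n_topic_word.get((t, w2), 0.0)
                                 + promotion.get((w, w2), 0.0))
```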
Dealing with Wrong Knowledge
- Some pieces of automatically generated knowledge (pk-sets) may be wrong.
- Deal with them during sampling (decide the promotion fraction): ensure that the words in a pk-set {w, w'} are actually associated.

$$\mathbb{A}_{w',w} = \begin{cases} 1 & w = w' \\ \mu \times \mathrm{PMI}(w, w') & \{w, w'\} \text{ is a pk-set} \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{PMI}(w, w') = \log \frac{P(w, w')}{P(w)\,P(w')}$$
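Read as: a word always fully promotes itself, and a pk-set partner is promoted in proportion to how strongly the two words co-occur (PMI), so a wrong pk-set whose words rarely co-occur contributes little. A sketch of building this matrix follows; the value of mu, the epsilon smoothing, the document-frequency estimate of the probabilities, and the clamp on negative PMI are all assumptions of the sketch:

```python
import math

# Sketch of building the promotion matrix A from the pk-sets. P(w) and P(w, w')
# are estimated from document frequencies in the collection.

def pmi(w, w2, doc_word_sets, eps=1e-12):
    n = len(doc_word_sets)
    p_w = sum(1 for d in doc_word_sets if w in d) / n
    p_w2 = sum(1 for d in doc_word_sets if w2 in d) / n
    p_joint = sum(1 for d in doc_word_sets if w in d and w2 in d) / n
    return math.log((p_joint + eps) / (p_w * p_w2 + eps))

def promotion_matrix(vocab, pk_sets, doc_word_sets, mu=0.3):
    A = {(w, w): 1.0 for w in vocab}                        # case w = w'
    for s in pk_sets:
        w, w2 = tuple(s)                                    # pk-sets are word pairs
        weight = mu * max(pmi(w, w2, doc_word_sets), 0.0)   # ignore negative association
        A[(w2, w)] = weight
        A[(w, w2)] = weight
    return A                                                # missing entries mean 0
```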
Gibbs Sampler

$$P(z_i = t \mid \mathbf{z}^{-i}, \mathbf{w}, \alpha, \beta) \propto \frac{n_{m,t}^{-i} + \alpha}{\sum_{t'=1}^{T} \left( n_{m,t'}^{-i} + \alpha \right)} \times \frac{\sum_{w'=1}^{V} \mathbb{A}_{w',w_i} \times n_{t,w'}^{-i} + \beta}{\sum_{v=1}^{V} \left( \sum_{w'=1}^{V} \mathbb{A}_{w',v} \times n_{t,w'}^{-i} + \beta \right)}$$

where n_{m,t} is the number of words in document m assigned to topic t, n_{t,w} is the number of times word w is assigned to topic t, and the superscript −i excludes the current position i from the counts.
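The left factor is the usual LDA document–topic part; the right factor replaces the raw topic–word count with the promoted count Σ_{w'} A_{w',w} · n_{t,w'}. A sketch of evaluating this conditional for one word position (data structures and names are illustrative; the counts are assumed to already exclude the current assignment at position i):

```python
# Sketch of the conditional above for word w_i in document m. n_doc_topic and
# n_topic_word hold the counts n_{m,t} and n_{t,w} with the current assignment
# at position i already removed; A is the promotion matrix from the previous slide.

def topic_distribution(m, w_i, n_doc_topic, n_topic_word, A, vocab, T, alpha, beta):
    def promoted(t, w):
        # sum over w' of A[w', w] * n_{t, w'}
        return sum(A.get((w2, w), 0.0) * n_topic_word.get((t, w2), 0.0)
                   for w2 in vocab)

    doc_den = sum(n_doc_topic.get((m, t2), 0.0) + alpha for t2 in range(T))
    probs = []
    for t in range(T):
        doc_part = (n_doc_topic.get((m, t), 0.0) + alpha) / doc_den
        word_num = promoted(t, w_i) + beta
        word_den = sum(promoted(t, v) + beta for v in vocab)
        probs.append(doc_part * (word_num / word_den))

    z = sum(probs)
    return [p / z for p in probs]   # normalized distribution to sample z_i from
```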
Evaluation
- We used review collections D from 50 domains; each domain has 1,000 reviews. Four domains with 10,000 reviews were used for the large-data test.
- Test settings: two settings to evaluate LTM, representing two possible uses:
  - Seen the test domain before, i.e., D_t ∈ D.
  - Not seen the test domain before, i.e., D_t ∉ D.
Topic Coherence (Mimno et al., EMNLP-2011)
- Topic coherence is used to evaluate the quality of the discovered topics automatically; it has been shown to correlate well with human judgments of topic quality.
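For reference, the Mimno et al. (EMNLP-2011) coherence of a topic's ranked top words is C(t) = Σ_{m=2..M} Σ_{l<m} log((D(v_m, v_l) + 1) / D(v_l)), where D(v) is the number of documents containing word v and D(v, v') the number containing both; higher (less negative) is better. A minimal sketch, assuming every top word occurs in at least one document:

```python
import math

# Topic Coherence of Mimno et al. (EMNLP-2011) for one topic's ranked top words.
# doc_word_sets: one set of words per document. Assumes D(v) > 0 for each top word.

def topic_coherence(top_words, doc_word_sets):
    def df(w):                      # D(v): document frequency
        return sum(1 for d in doc_word_sets if w in d)
    def co_df(w1, w2):              # D(v, v'): co-document frequency
        return sum(1 for d in doc_word_sets if w1 in d and w2 in d)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log((co_df(top_words[m], top_words[l]) + 1)
                              / df(top_words[l]))
    return score
```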
Topic Coherence on 4 Large Datasets
- Can LTM improve further with larger data?
Splitting a Large Dataset into 10 Smaller Ones
- Here we use data from only one domain.
- Result: better topic coherence and better efficiency (30%).
Summary
- Proposed LTM, a lifelong learning topic model. It keeps a large topic base S.
- For each new topic modeling task:
  - run LDA to generate a set of initial topics,
  - find matching old topics from S,
  - mine quality knowledge from the old topics,
  - use the knowledge to help generate better topics.
- With big data (from diverse domains), we can do what we could not do or have not done before.