

  1. DATA130006 Text Management and Analysis: Language Model for Topic Analysis. 魏忠钰, School of Data Science, Fudan University. October 18th, 2017. Adapted from UIUC CS410.

  2. Outline § What is topic mining?

  3. Topic Mining and Analysis: Motivation § Topic ≈ the main idea discussed in text data § Theme/subject of a discussion or conversation § Comes at different granularities (e.g., the topic of a sentence, of an article, etc.) § Many applications require discovery of topics in text § What are Weibo users talking about today? § What are the current research topics in data mining, and how do they differ from those of 5 years ago? § What were the major topics debated in the 2012 presidential election?

  4. Topics as Knowledge About the World [Diagram: text data plus context (time, location, …) and non-text data from the real world are mined into knowledge about the world, represented as Topic 1, Topic 2, …, Topic k.]

  5. Tasks of Topic Mining and Analysis § Task 1: Discover k topics (Topic 1, Topic 2, …, Topic k) from the text data § Task 2: Figure out which documents (Doc 1, Doc 2, …) cover which topics

  6. Formal Definition of Topic Mining and Analysis § Input § A collection of N text documents C = {d_1, …, d_N} § Number of topics: k § Output § k topics: {θ_1, …, θ_k} § Coverage of topics in each d_i: {π_i1, …, π_ik}, where π_ij = probability of d_i covering topic θ_j and Σ_{j=1}^{k} π_ij = 1 § How to define θ_i?

  7. Initial Idea: Topic = Term § Each topic θ_j is a single term, e.g., θ_1 = “Sports”, θ_2 = “Travel”, …, θ_k = “Science” § Each document d_i in the text data gets coverage values, e.g., Doc 1: π_11 = 30% (“Sports”), π_12 = 12% (“Travel”), π_1k = 8% (“Science”); Doc 2 and Doc N: π_21 = 0, π_N1 = 0

  8. Mining k Topical Terms from Collection C § Parse the text in C to obtain candidate terms (e.g., term = word) § Design a scoring function to measure how good each term is as a topic § Favor representative terms (high frequency is favored) § Avoid terms that are too frequent (e.g., “the”, “a”, and other stop words) § TF-IDF weighting from retrieval can be very useful § Domain-specific heuristics are possible (e.g., favor title words, or hashtags in microblogs) § Pick the k terms with the highest scores, but try to minimize redundancy § If multiple terms are very similar or closely related, pick only one of them and ignore the others
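One way to make this concrete is sketched below, assuming term = word, a TF-IDF score, and a simple substring test for redundancy; the function names and the redundancy heuristic are illustrative choices, not prescribed by the slides.

```python
# Sketch of slide 8: score candidate terms with TF-IDF, then greedily pick
# k high-scoring, non-redundant terms. All names here are illustrative.
import math
from collections import Counter

def tfidf_scores(docs):
    """docs: list of token lists. Returns {term: max TF-IDF over the docs}."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                     # document frequency
    scores = {}
    for doc in docs:
        for term, tf in Counter(doc).items():
            idf = math.log((n_docs + 1) / (df[term] + 1)) + 1  # smoothed IDF
            scores[term] = max(scores.get(term, 0.0), tf * idf)
    return scores

def pick_topical_terms(docs, k, redundant=lambda a, b: a in b or b in a):
    """Highest TF-IDF first, skipping terms redundant with ones already picked."""
    ranked = sorted(tfidf_scores(docs).items(), key=lambda x: -x[1])
    picked = []
    for term, _ in ranked:
        if all(not redundant(term, p) for p in picked):
            picked.append(term)
        if len(picked) == k:
            break
    return picked

docs = [["sports", "basketball", "game", "the"],
        ["travel", "hotel", "flight", "the"],
        ["science", "telescope", "genomics", "the"]]
print(pick_topical_terms(docs, k=3))
```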

  9. Computing Topic Coverage: π_ij § With terms as topics, estimate coverage from term counts in document d_i, e.g., count(“sports”, d_i) = 4, count(“travel”, d_i) = 2, count(“science”, d_i) = 1 § π_ij = count(θ_j, d_i) / Σ_{L=1}^{k} count(θ_L, d_i)
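A minimal sketch of this normalization, using the example counts on the slide (4, 2, 1); the helper name is illustrative.

```python
# Sketch of slide 9: with one term per topic, a document's coverage of topic j
# is that term's count normalized over the counts of all k topic terms.
def topic_coverage(doc_tokens, topic_terms):
    counts = [doc_tokens.count(t) for t in topic_terms]
    total = sum(counts)
    return [c / total if total > 0 else 0.0 for c in counts]

# Example counts from the slide: sports = 4, travel = 2, science = 1
doc = ["sports"] * 4 + ["travel"] * 2 + ["science"]
print(topic_coverage(doc, ["sports", "travel", "science"]))  # [4/7, 2/7, 1/7]
```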

  10. How Well Does This Approach Work? § Example doc d_i: “Cavaliers vs. Golden State Warriors: NBA playoff finals … basketball game … travel to Cleveland … star …” § “Sports”: π_i1 ∝ c(“sports”, d_i) = 0 § “Travel”: π_i2 ∝ c(“travel”, d_i) = 1 > 0 § “Science”: π_ik ∝ c(“science”, d_i) = 0 § Problems: 1. We need to count related words too! 2. “Star” can be ambiguous (e.g., star in the sky). 3. How can we mine more complicated topics?

  11. Problems with “Term as Topic” § 1. Lack of expressive power: a single term can only represent simple/general topics, not complicated ones → fix: Topic = {multiple words} § 2. Incompleteness in vocabulary coverage: a term can’t capture variations of vocabulary (e.g., related words) → fix: add weights on words § 3. Word sense ambiguity: a topical term or related term can be ambiguous (e.g., basketball star vs. star in the sky) → fix: split an ambiguous word § A probabilistic topic model can do all of these!

  12. Improved Idea: Topic = Word Distribution § Each topic θ_j is a probability distribution p(w|θ_j) over the vocabulary set V = {w_1, w_2, …}, with Σ_{w∈V} p(w|θ_i) = 1 § θ_1 “Sports”, p(w|θ_1): sports 0.02, game 0.01, basketball 0.005, football 0.004, play 0.003, star 0.003, nba 0.001, travel 0.0005, … § θ_2 “Travel”, p(w|θ_2): travel 0.05, attraction 0.03, trip 0.01, flight 0.004, hotel 0.003, island 0.003, culture 0.001, play 0.0002, … § θ_k “Science”, p(w|θ_k): science 0.04, scientist 0.03, spaceship 0.006, telescope 0.004, genomics 0.004, star 0.002, genetics 0.001, travel 0.00001, …
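A minimal sketch of the representation, assuming a tiny toy vocabulary and renormalized weights (the slide shows only a few entries of each distribution, so the numbers below are illustrative).

```python
# Sketch of slide 12: a topic is a word distribution over the vocabulary V.
# The vocabulary and probabilities are toy values, renormalized to sum to 1.
import random

V = ["sports", "game", "star", "travel", "attraction", "science"]

def normalize(weights):
    """Raw weights over V -> distribution p(w|theta) that sums to 1."""
    total = sum(weights.values())
    return {w: weights.get(w, 0.0) / total for w in V}

theta_sports = normalize({"sports": 0.02, "game": 0.01, "star": 0.003})
theta_travel = normalize({"travel": 0.05, "attraction": 0.03})

assert abs(sum(theta_sports.values()) - 1.0) < 1e-9   # sum_w p(w|theta) = 1

# A word distribution is also a word generator: sample words from p(w|theta).
print(random.choices(V, weights=[theta_sports[w] for w in V], k=5))
```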

  13. Probabilistic Topic Mining and Analysis § Input § A collection of N text documents C = {d_1, …, d_N} § Vocabulary set: V = {w_1, …, w_M} § Number of topics: k § Output § k topics, each a word distribution: {θ_1, …, θ_k}, with Σ_{w∈V} p(w|θ_i) = 1 § Coverage of topics in each d_i: {π_i1, …, π_ik}, where π_ij = probability of d_i covering topic θ_j and Σ_{j=1}^{k} π_ij = 1

  14. The Computation Task § INPUT: C, k, V § OUTPUT: {θ_1, …, θ_k}, {π_i1, …, π_ik} for each document § Example output: θ_1 (sports 0.02, game 0.01, basketball 0.005, football 0.004, …), θ_2 (travel 0.05, attraction 0.03, trip 0.01, …), θ_k (science 0.04, scientist 0.03, spaceship 0.006, …); Doc 1 coverage: π_11 = 30%, π_12 = 12%, π_1k = 8%; Doc 2 and Doc N: π_21 = 0%, π_N1 = 0%

  15. Generative Model for Text Mining § Modeling of data generation: P(Data | Model, Λ), where Λ = ({θ_1, …, θ_k}, {π_11, …, π_1k}, …, {π_N1, …, π_Nk}) § How many parameters are there in total? § Parameter estimation/inference: Λ* = argmax_Λ P(Data | Model, Λ)

  16. Simplest Case of Topic Model: Mining One Topic § INPUT: C = {d}, V § OUTPUT: {θ} § The single document d covers θ 100% § p(w|θ): text ?, mining ?, association ?, database ?, query ?, … (these word probabilities are what we need to estimate)

  17. Language Model Setup § Data: document d = x_1 x_2 … x_{|d|}, where each x_i ∈ V = {w_1, …, w_M} is a word § Model: unigram LM θ: {θ_i = p(w_i|θ)}, i = 1, …, M; θ_1 + … + θ_M = 1 § Likelihood function: p(d|θ) = p(x_1|θ) × … × p(x_{|d|}|θ) = p(w_1|θ)^{c(w_1,d)} × … × p(w_M|θ)^{c(w_M,d)} = ∏_{i=1}^{M} p(w_i|θ)^{c(w_i,d)} = ∏_{i=1}^{M} θ_i^{c(w_i,d)} § ML estimate: (θ̂_1, …, θ̂_M) = argmax_{θ_1,…,θ_M} p(d|θ) = argmax_{θ_1,…,θ_M} ∏_{i=1}^{M} θ_i^{c(w_i,d)}
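A minimal sketch of the likelihood computation, done in log space to avoid underflow; the toy model and document are illustrative, not from the slides.

```python
# Sketch of slide 17: log-likelihood of a document under a unigram LM,
# log p(d|theta) = sum_i c(w_i, d) * log p(w_i|theta).
import math
from collections import Counter

def unigram_log_likelihood(doc_tokens, theta):
    """theta: dict mapping each word w to p(w|theta)."""
    counts = Counter(doc_tokens)                 # c(w, d)
    return sum(c * math.log(theta[w]) for w, c in counts.items())

theta = {"text": 0.4, "mining": 0.3, "association": 0.2, "database": 0.1}
doc = ["text", "mining", "text", "association"]
print(unigram_log_likelihood(doc, theta))
```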

  18. Computation of Maximum Likelihood Estimate § Maximize p(d|θ): (θ̂_1, …, θ̂_M) = argmax_{θ_1,…,θ_M} ∏_{i=1}^{M} θ_i^{c(w_i,d)} § Equivalently, maximize the log-likelihood: (θ̂_1, …, θ̂_M) = argmax_{θ_1,…,θ_M} Σ_{i=1}^{M} c(w_i,d) log θ_i § Subject to the constraint Σ_{i=1}^{M} θ_i = 1 → use the Lagrange multiplier approach § Lagrange function: f(θ|d) = Σ_{i=1}^{M} c(w_i,d) log θ_i + λ(Σ_{i=1}^{M} θ_i − 1) § ∂f(θ|d)/∂θ_i = c(w_i,d)/θ_i + λ = 0 → θ_i = −c(w_i,d)/λ § Σ_{i=1}^{M} θ_i = 1 → λ = −Σ_{i=1}^{M} c(w_i,d) → θ̂_i = p(w_i|θ̂) = c(w_i,d) / Σ_{i'=1}^{M} c(w_{i'},d) = c(w_i,d)/|d| (normalized counts)
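As a sanity check on the closed-form result, a minimal sketch that computes the ML estimate as normalized counts; the function name is illustrative.

```python
# Sketch of slide 18: the ML estimate of a unigram LM from one document is
# the normalized word count, p(w_i|theta_hat) = c(w_i, d) / |d|.
from collections import Counter

def ml_unigram_estimate(doc_tokens):
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)                    # |d| = sum of all counts
    return {w: c / doc_len for w, c in counts.items()}

doc = ["text", "mining", "text", "paper"]
print(ml_unigram_estimate(doc))  # {'text': 0.5, 'mining': 0.25, 'paper': 0.25}
```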

  19. What Does the Topic Look Like? § Example document d: a text mining paper § Estimated p(w|θ): the 0.031, a 0.018, …, text 0.04, mining 0.035, association 0.03, clustering 0.005, computer 0.0009, …, food 0.000001, … § Can we get rid of these common words (“the”, “a”, …)?

  20. Factoring out Background Words § Same document d (a text mining paper) and estimated p(w|θ): the 0.031, a 0.018, …, text 0.04, mining 0.035, association 0.03, clustering 0.005, computer 0.0009, …, food 0.000001, … § How can we get rid of these common words?

  21. Generate d Using Two Word Distributions § Topic θ_d (the topic of d), chosen with P(θ_d) = 0.5: text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001, … § Background topic θ_B, chosen with P(θ_B) = 0.5: the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006, … § Topic choice: p(θ_d) + p(θ_B) = 1 § Each word of the text mining paper d is generated by first making the topic choice and then sampling a word from the chosen distribution
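A minimal sketch of this generative process, using the (truncated) probabilities quoted on the slide as raw weights; the helper names are illustrative.

```python
# Sketch of slide 21: generate a document by repeatedly choosing a topic
# (theta_d vs. theta_B) and then sampling a word from the chosen distribution.
# The distributions are truncated to the entries shown on the slide.
import random

theta_d = {"text": 0.04, "mining": 0.035, "association": 0.03, "the": 0.000001}
theta_B = {"the": 0.03, "a": 0.02, "is": 0.015, "text": 0.000006}
p_theta_d = 0.5                          # P(theta_d); P(theta_B) = 1 - P(theta_d)

def sample_word(dist):
    words, weights = zip(*dist.items())  # random.choices renormalizes the weights
    return random.choices(words, weights=weights, k=1)[0]

def generate_document(length):
    return [sample_word(theta_d if random.random() < p_theta_d else theta_B)
            for _ in range(length)]

print(generate_document(10))
```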

  22. What’s the Probability of Observing a Word w? § Using θ_d with P(θ_d) = 0.5 (text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001, …) and the background topic θ_B with P(θ_B) = 0.5 (the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006, …): § P(“the”) = p(θ_d) p(“the”|θ_d) + p(θ_B) p(“the”|θ_B) = 0.5 × 0.000001 + 0.5 × 0.03 § P(“text”) = p(θ_d) p(“text”|θ_d) + p(θ_B) p(“text”|θ_B) = 0.5 × 0.04 + 0.5 × 0.000006
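The two-line computation from the slide, written out; only the probabilities quoted on the slide are filled in.

```python
# Sketch of slide 22: probability of observing a word w under the mixture,
# p(w) = p(theta_d) * p(w|theta_d) + p(theta_B) * p(w|theta_B).
p_d, p_B = 0.5, 0.5                        # topic choice probabilities
theta_d = {"text": 0.04, "the": 0.000001}  # topic of d (truncated)
theta_B = {"text": 0.000006, "the": 0.03}  # background topic (truncated)

def p_word(w):
    return p_d * theta_d[w] + p_B * theta_B[w]

print(p_word("the"))   # 0.5 * 0.000001 + 0.5 * 0.03     = 0.0150005
print(p_word("text"))  # 0.5 * 0.04     + 0.5 * 0.000006 = 0.020003
```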

  23. The Idea of a Mixture Model § The two distributions θ_d (text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001, …) and θ_B (the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006, …) together form a mixture model § To generate a word w (“the”? “text”?), first make the topic choice, θ_d with P(θ_d) = 0.5 or θ_B with P(θ_B) = 0.5, and then sample w from the chosen distribution

  24. As a Generative Model… § The mixture formally defines the following generative model for a word w: p(w) = p(θ_d) p(w|θ_d) + p(θ_B) p(w|θ_B) § Estimating the model “discovers” two topics plus the topic coverage § What if p(θ_d) = 1 or p(θ_B) = 1?

  25. Mixture of Two Unigram Language Models § Data: document d § Mixture model: parameters Λ = ({p(w|θ_d)}, {p(w|θ_B)}, p(θ_B), p(θ_d)) § Two unigram LMs: θ_d (the topic of d) and θ_B (the background topic) § Mixing weight (topic choice): p(θ_d) + p(θ_B) = 1 § Likelihood function: p(d|Λ) = ∏_{i=1}^{|d|} p(x_i|Λ) = ∏_{i=1}^{|d|} [p(θ_d) p(x_i|θ_d) + p(θ_B) p(x_i|θ_B)] = ∏_{i=1}^{M} [p(θ_d) p(w_i|θ_d) + p(θ_B) p(w_i|θ_B)]^{c(w_i,d)} § ML estimate: Λ* = argmax_Λ p(d|Λ), subject to Σ_{i=1}^{M} p(w_i|θ_d) = Σ_{i=1}^{M} p(w_i|θ_B) = 1 and p(θ_d) + p(θ_B) = 1
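A minimal sketch that only evaluates this likelihood (in log space) for a toy document; how to maximize it over Λ (e.g., with the EM algorithm) is not covered on these slides, and the truncated distributions below are illustrative.

```python
# Sketch of slide 25: log-likelihood of a document under the two-component
# mixture, log p(d|Lambda) = sum_i c(w_i,d) * log[ p(theta_d) p(w_i|theta_d)
#                                                  + p(theta_B) p(w_i|theta_B) ].
import math
from collections import Counter

def mixture_log_likelihood(doc_tokens, theta_d, theta_B, p_d):
    p_B = 1.0 - p_d                              # mixing weights sum to 1
    counts = Counter(doc_tokens)                 # c(w, d)
    return sum(
        c * math.log(p_d * theta_d.get(w, 0.0) + p_B * theta_B.get(w, 0.0))
        for w, c in counts.items()
    )

theta_d = {"text": 0.04, "mining": 0.035, "the": 0.000001}
theta_B = {"the": 0.03, "a": 0.02, "text": 0.000006}
doc = ["text", "mining", "the", "text"]
print(mixture_log_likelihood(doc, theta_d, theta_B, p_d=0.5))
```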
