

  1. DATA130006 Text Management and Analysis: Text Clustering. Zhongyu Wei (魏忠钰), School of Data Science, Fudan University. October 18th, 2017. Adapted from UIUC CS410.

  2. What Is Text Clustering? § Discover "natural structure" § Group similar objects together § Objects can be documents, terms, passages, websites, … § The task is not well defined: what does "similar" mean?

  3. The "Clustering Bias" § Any two objects can be similar, depending on how you look at them! § Are "car" and "horse" similar? § A user must define the perspective (i.e., a "bias") for assessing similarity; this bias is also the basis for evaluation.

  4. Examples of Text Clustering § Clustering of documents in the whole collection § Term clustering to define “concept”/“theme”/“topic” § Clustering of passages/sentences or any selected text segments from larger text objects § Clustering of websites (text object has multiple documents) § Text clusters can be further clustered to generate a hierarchy

  5. Why Text Clustering? § In general, very useful for text mining and exploratory text analysis: § Get a sense about the overall content of a collection (e.g., what are some of the “typical”/representative documents in a collection?) § Link (similar) text objects (e.g., removing duplicated content) § Create a structure on the text data (e.g., for browsing) § As a way to induce additional features (i.e., clusters) for classification of text objects § Examples of applications § Clustering of search results § Understanding major complaints in emails from customers

  6. Topic Mining Revisited
INPUT: a collection C, number of topics k, vocabulary V
OUTPUT: {θ_1, …, θ_k}, {π_i1, …, π_ik}
[Figure: each topic θ_j is a word distribution, e.g., θ_1 "sports" (sports 0.02, game 0.01, basketball 0.005, football 0.004, …), θ_2 "travel" (travel 0.05, attraction 0.03, trip 0.01, …), θ_k "science" (science 0.04, scientist 0.03, spaceship 0.006, …); each document d_i covers the topics with proportions π_ij, e.g., 30% sports, 12% travel, 8% science for Doc 1, while π_21 = π_N1 = 0%.]

  7. One Topic (= Cluster) Per Document
INPUT: C, k, V
OUTPUT: {θ_1, …, θ_k}, {c_1, …, c_N}, c_i ∈ [1, k]
[Figure: the same topic word distributions (sports, travel, …, science), but each document is now generated from exactly one topic, so its coverage is 100% for that topic and 0% for all the others, e.g., π_11 = 100%, π_12 = 0, π_1k = 0; π_22 = 100%, π_21 = 0%; π_N1 = 100%, π_N2 = 0, π_Nk = 0.]

  8. Mining One Topic Revisited
INPUT: C = {d}, V
OUTPUT: {θ}, i.e., the word distribution p(w|θ)
[Figure: a single document d generates 100% of its words from one unknown distribution θ: p("text"|θ) = ?, p("mining"|θ) = ?, p("association"|θ) = ?, p("database"|θ) = ?, p("query"|θ) = ?, …]
(1 doc, 1 topic) → (N docs, N topics) → with k < N: (N docs, k shared topics) = clustering!

  9. What Generative Model Can Do Clustering?
INPUT: C, k, V
OUTPUT: {θ_1, …, θ_k}, {c_1, …, c_N}, c_i ∈ [1, k]
[Figure: the same topic word distributions (sports, travel, …, science), with each document generated entirely from one of them, e.g., π_11 = 100%, π_21 = 0%, π_N1 = 100%, π_22 = 100%, π_12 = π_N2 = 0, π_1k = π_Nk = 0.]
How can we force every document to be generated using one topic (instead of k topics)?

  10. Generative Topic Model Revisited
Why can't this model be used for clustering?
[Figure: two word distributions with p(θ_1) + p(θ_2) = 1, here p(θ_1) = p(θ_2) = 0.5. θ_1: text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001. θ_2: the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006. To generate each word w of document d (is it "the"? "text"?), the model first makes a topic choice and then samples w from the chosen distribution, so different words of the same document can come from different topics.]

  11. Mixture Model for Document Clustering
Difference from the topic model?
[Figure: the same two word distributions θ_1 (text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001) and θ_2 (the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006), with p(θ_1) = p(θ_2) = 0.5 and p(θ_1) + p(θ_2) = 1, but the topic choice is now made once per document: after choosing θ_i, all L words of d = x_1 x_2 … x_L are generated from p(w|θ_i).]
What if p(θ_1) = 1 or p(θ_2) = 1?
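The contrast between the two generative processes can be made concrete in a few lines of Python. This is only an illustrative sketch, not part of the original slides: the function names are assumptions, the vocabularies are the abridged ones shown in the figure, and the probabilities are renormalized so they can be sampled from.

```python
import random

# Abridged word distributions copied from the slide; they do not cover a full
# vocabulary, so sample_word() renormalizes them before sampling.
theta = {
    1: {"text": 0.04, "mining": 0.035, "association": 0.03, "clustering": 0.005, "the": 0.000001},
    2: {"the": 0.03, "a": 0.02, "is": 0.015, "we": 0.01, "food": 0.003, "text": 0.000006},
}
p_theta = {1: 0.5, 2: 0.5}   # p(θ_1) = p(θ_2) = 0.5

def sample_word(dist):
    words, probs = zip(*dist.items())
    total = sum(probs)
    return random.choices(words, weights=[p / total for p in probs])[0]

def generate_topic_model(length):
    # Topic model: a fresh topic choice is made for EVERY word of the document.
    return [sample_word(theta[random.choices([1, 2], weights=[p_theta[1], p_theta[2]])[0]])
            for _ in range(length)]

def generate_clustering_mixture(length):
    # Clustering mixture: ONE topic choice per document, then all words from it.
    z = random.choices([1, 2], weights=[p_theta[1], p_theta[2]])[0]
    return z, [sample_word(theta[z]) for _ in range(length)]

print(generate_topic_model(10))
print(generate_clustering_mixture(10))
```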

  12. Likelihood Function: p(d) = ?
p(d) = p(θ_1) p(d|θ_1) + p(θ_2) p(d|θ_2)
     = p(θ_1) ∏_{i=1}^{L} p(x_i|θ_1) + p(θ_2) ∏_{i=1}^{L} p(x_i|θ_2),   where d = x_1 x_2 … x_L
How is this different from a topic model?
Topic model: p(d) = ∏_{i=1}^{L} [ p(θ_1) p(x_i|θ_1) + p(θ_2) p(x_i|θ_2) ]
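To see the difference numerically, here is a small sketch that evaluates both likelihoods for the same document; it reuses the two word distributions and the document (two occurrences each of "text" and "mining") from the worked example on slide 19, and the variable names are illustrative.

```python
from math import prod

# Two clusters and a document with c("text", d) = c("mining", d) = 2 (slide 19).
p_theta = [0.5, 0.5]
p_w = [
    {"text": 0.5, "mining": 0.2, "medical": 0.2, "health": 0.1},    # p(w|θ_1)
    {"text": 0.1, "mining": 0.1, "medical": 0.75, "health": 0.05},  # p(w|θ_2)
]
d = ["text", "mining", "text", "mining"]   # d = x_1 x_2 ... x_L

# Clustering mixture: the sum over topics is OUTSIDE the product over positions.
p_d_mixture = sum(p_theta[i] * prod(p_w[i][x] for x in d) for i in range(2))

# Topic model: the sum over topics is INSIDE the product, i.e. taken per word.
p_d_topic = prod(sum(p_theta[i] * p_w[i][x] for i in range(2)) for x in d)

print(p_d_mixture, p_d_topic)   # 0.00505 vs 0.002025
```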

  13. Likelihood Function: p(d) = ?
p(d) = p(θ_1) p(d|θ_1) + p(θ_2) p(d|θ_2)
     = p(θ_1) ∏_{i=1}^{L} p(x_i|θ_1) + p(θ_2) ∏_{i=1}^{L} p(x_i|θ_2),   where d = x_1 x_2 … x_L
How can we generalize it to include k topics/clusters?

  14. Mixture Model for Document Clustering
§ Data: a collection of documents C = {d_1, …, d_N}
§ Model: a mixture of k unigram LMs: Λ = ({θ_i}; {p(θ_i)}), i ∈ [1, k]
§ To generate a document, first choose a θ_i according to p(θ_i), and then generate all words in the document using p(w|θ_i)
§ Likelihood:
p(d|Λ) = Σ_{i=1}^{k} p(θ_i) ∏_{j=1}^{|d|} p(x_j|θ_i) = Σ_{i=1}^{k} p(θ_i) ∏_{w∈V} p(w|θ_i)^{c(w,d)}
§ Maximum likelihood estimate: Λ* = arg max_Λ p(d|Λ)
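A direct transcription of this k-topic likelihood might look as follows; the data layout (a document as a {word: count} dict, parameters as parallel lists) is an assumption made for illustration.

```python
from math import prod

def doc_likelihood(counts, p_theta, p_w):
    """p(d|Λ) = Σ_i p(θ_i) ∏_{w∈V} p(w|θ_i)^c(w,d).

    counts : {word: c(w, d)} for one document
    p_theta: list of cluster priors p(θ_i)
    p_w    : list of dicts, p_w[i][w] = p(w|θ_i)
    """
    return sum(p_theta[i] * prod(p_w[i][w] ** c for w, c in counts.items())
               for i in range(len(p_theta)))

# With the slide-19 numbers: doc_likelihood({"text": 2, "mining": 2}, ...) == 0.00505
```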

  15. Cluster Allocation After Parameter Estimation
§ Parameters of the mixture model: Λ = ({θ_i}; {p(θ_i)}), i ∈ [1, k]
§ Each θ_i represents the content of cluster i: p(w|θ_i)
§ p(θ_i) indicates the size of cluster i
§ Which cluster should document d belong to? c_d = ?
§ Likelihood only: assign d to the cluster whose topic θ_i most likely generated d: c_d = arg max_i p(d|θ_i)
§ Likelihood + prior p(θ_i) (Bayesian): c_d = arg max_i p(θ_i) p(d|θ_i), which favors large clusters
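The two allocation rules differ only in whether the prior p(θ_i) enters the score. A minimal sketch (the function name and argument layout are assumptions, matching the likelihood helper above):

```python
from math import prod

def assign_cluster(counts, p_theta, p_w, use_prior=True):
    """Hard cluster allocation for one document after the parameters are estimated."""
    # p(d|θ_i) for each cluster i
    scores = [prod(p_w[i][w] ** c for w, c in counts.items()) for i in range(len(p_theta))]
    if use_prior:
        # Bayesian variant: multiply by the prior p(θ_i), which favors large clusters.
        scores = [p_theta[i] * s for i, s in enumerate(scores)]
    return max(range(len(scores)), key=lambda i: scores[i])
```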

  16. How Can We Compute the ML Estimate?
§ Data: a collection of documents C = {d_1, …, d_N}
§ Model: a mixture of k unigram LMs: Λ = ({θ_i}; {p(θ_i)}), i ∈ [1, k]
§ To generate a document, first choose a θ_i according to p(θ_i), and then generate all words in the document using p(w|θ_i)
§ Likelihood:
p(d|Λ) = Σ_{i=1}^{k} p(θ_i) ∏_{w∈V} p(w|θ_i)^{c(w,d)}
p(C|Λ) = ∏_{j=1}^{N} p(d_j|Λ)
§ Maximum likelihood estimate: Λ* = arg max_Λ p(C|Λ)

  17. EM Algorithm for Document Clustering
§ Initialization: randomly set Λ = ({θ_i}; {p(θ_i)}), i ∈ [1, k]
§ Repeat until the likelihood p(C|Λ) converges
§ E-Step: infer which distribution has been used to generate document d (hidden variable Z_d ∈ [1, k]):
p^(n)(Z_d = i | d) ∝ p^(n)(θ_i) ∏_{w∈V} p^(n)(w|θ_i)^{c(w,d)},   with Σ_{i=1}^{k} p^(n)(Z_d = i | d) = 1
§ M-Step: re-estimate all parameters:
p^(n+1)(θ_i) ∝ Σ_{j=1}^{N} p^(n)(Z_{d_j} = i | d_j),   with Σ_{i=1}^{k} p^(n+1)(θ_i) = 1
p^(n+1)(w|θ_i) ∝ Σ_{j=1}^{N} c(w, d_j) p^(n)(Z_{d_j} = i | d_j),   with Σ_{w∈V} p^(n+1)(w|θ_i) = 1,   ∀ i ∈ [1, k]
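Putting the two steps together, a compact and deliberately simple sketch of the whole algorithm is shown below. The data layout, the fixed iteration count in place of a convergence test, and the absence of smoothing are assumptions for illustration; every document word is assumed to be in vocab, and long documents would additionally need the underflow normalization discussed on slide 20.

```python
import random
from math import prod

def em_document_clustering(docs, k, vocab, n_iters=50, seed=0):
    """docs: list of {word: count} dicts; returns (p_theta, p_w)."""
    rng = random.Random(seed)

    # Initialization: randomly set Λ = ({θ_i}; {p(θ_i)})
    p_theta = [1.0 / k] * k
    p_w = []
    for _ in range(k):
        raw = {w: rng.random() for w in vocab}
        total = sum(raw.values())
        p_w.append({w: x / total for w, x in raw.items()})

    for _ in range(n_iters):
        # E-step: p(Z_d = i | d) ∝ p(θ_i) ∏_w p(w|θ_i)^c(w,d), normalized over i
        post = []
        for d in docs:
            scores = [p_theta[i] * prod(p_w[i][w] ** c for w, c in d.items())
                      for i in range(k)]
            z = sum(scores)
            post.append([s / z for s in scores])

        # M-step: p(θ_i) ∝ Σ_j p(Z_dj = i | dj)
        p_theta = [sum(post[j][i] for j in range(len(docs))) / len(docs) for i in range(k)]
        # M-step: p(w|θ_i) ∝ Σ_j c(w, dj) p(Z_dj = i | dj)
        for i in range(k):
            counts = {w: sum(d.get(w, 0) * post[j][i] for j, d in enumerate(docs))
                      for w in vocab}
            total = sum(counts.values())
            p_w[i] = {w: c / total for w, c in counts.items()}

    return p_theta, p_w
```

After the parameters converge, each document can be allocated to a cluster with the hard-assignment rule from slide 15.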

  18. EM Algorithm for Document Clustering
§ Initialization: Λ = ({θ_i}; {p(θ_i)}), i ∈ [1, k]
§ E-Step: compute p(Z_d = j | d)
§ M-Step: re-estimate all parameters Λ = ({θ_i}; {p(θ_i)})

  19. An Example of 2 Clusters: E-step
Random initialization: p(θ_1) = p(θ_2) = 0.5. Hidden variable: Z_d ∈ {1, 2}.
Document d, with counts c(w, d): text 2, mining 2, medical 0, health 0.

w         p(w|θ_1)   p(w|θ_2)
text      0.5        0.1
mining    0.2        0.1
medical   0.2        0.75
health    0.1        0.05

p(Z_d = 1 | d) = p(θ_1) p("text"|θ_1)^2 p("mining"|θ_1)^2 / [ p(θ_1) p("text"|θ_1)^2 p("mining"|θ_1)^2 + p(θ_2) p("text"|θ_2)^2 p("mining"|θ_2)^2 ]
= (0.5 × 0.5^2 × 0.2^2) / (0.5 × 0.5^2 × 0.2^2 + 0.5 × 0.1^2 × 0.1^2) = 100/101
p(Z_d = 2 | d) = ?
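This arithmetic is easy to check in a few lines; the numbers are the ones from the slide, and the variable names are illustrative.

```python
from math import prod

counts = {"text": 2, "mining": 2}          # c(w, d); zero-count words omitted
p_theta = [0.5, 0.5]                        # random initialization from the slide
p_w = [{"text": 0.5, "mining": 0.2, "medical": 0.2, "health": 0.1},    # p(w|θ_1)
       {"text": 0.1, "mining": 0.1, "medical": 0.75, "health": 0.05}]  # p(w|θ_2)

scores = [p_theta[i] * prod(p_w[i][w] ** c for w, c in counts.items()) for i in range(2)]
posterior = [s / sum(scores) for s in scores]
print(posterior)   # [0.9900990..., 0.0099009...], i.e. 100/101 and 1/101
```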

  20. Normalization to Avoid Underflow
The average of the p(w|θ_i) can serve as a possible normalizer: p̄(w) = (p(w|θ_1) + p(w|θ_2)) / 2.

w         p(w|θ_1)   p(w|θ_2)   p̄(w)
text      0.5        0.1        (0.5 + 0.1)/2
mining    0.2        0.1        (0.2 + 0.1)/2
medical   0.2        0.75       (0.2 + 0.75)/2
health    0.1        0.05       (0.1 + 0.05)/2

p(Z_d = 1 | d) = [ p(θ_1) p("text"|θ_1)^2 p("mining"|θ_1)^2 / (p̄("text")^2 p̄("mining")^2) ] / [ p(θ_1) p("text"|θ_1)^2 p("mining"|θ_1)^2 / (p̄("text")^2 p̄("mining")^2) + p(θ_2) p("text"|θ_2)^2 p("mining"|θ_2)^2 / (p̄("text")^2 p̄("mining")^2) ]
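A minimal sketch of the same E-step posterior computed with this per-word normalizer is given below (the function name and argument layout are assumptions). Dividing each word probability by the cluster-average probability before taking the product cancels in the final ratio but keeps the products near 1 for long documents; an equivalent and common alternative is to accumulate the products in log space.

```python
from math import prod

def e_step_normalized(counts, p_theta, p_w):
    """E-step posterior with the per-word average p̄(w) used as a normalizer."""
    k = len(p_theta)
    avg = {w: sum(p_w[i][w] for i in range(k)) / k for w in counts}   # p̄(w)
    # Dividing every p(w|θ_i) by p̄(w) cancels in the ratio but keeps each
    # product close to 1, so it does not underflow for long documents.
    scores = [p_theta[i] * prod((p_w[i][w] / avg[w]) ** c for w, c in counts.items())
              for i in range(k)]
    return [s / sum(scores) for s in scores]
```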

  21. Summary of Generative Model for Clustering
§ A slight variation of the topic model can be used for clustering documents
§ Each cluster is represented by a unigram LM p(w|θ_i) → a term cluster
§ A document is generated by first choosing a unigram LM and then generating ALL the words in the document with this single LM
§ The estimated model parameters give both a topic characterization of each cluster and a probabilistic assignment of each document to each cluster
§ The EM algorithm can be used to compute the ML estimate
§ Normalization is often needed to avoid underflow

  22. More About Text Clustering
