DATA130006 Text Management and Analysis: Text Clustering
魏忠钰, School of Data Science, Fudan University
October 18th, 2017. Adapted from UIUC CS410.
What Is Text Clustering?
§ Discover the “natural structure” of text data
§ Group similar objects together
§ Objects can be documents, terms, passages, websites, …
§ Clustering is not well defined: what does “similar” mean?
The “Clustering Bias”
§ Any two objects can be similar, depending on how you look at them!
§ Are “car” and “horse” similar?
§ A user must define the perspective (i.e., a “bias”) for assessing similarity!
§ This clustering bias is also the basis for evaluation
Examples of Text Clustering
§ Clustering of documents in the whole collection
§ Term clustering to define a “concept”/“theme”/“topic”
§ Clustering of passages/sentences or any selected text segments from larger text objects
§ Clustering of websites (where each text object consists of multiple documents)
§ Text clusters can be further clustered to generate a hierarchy
Why Text Clustering?
§ In general, very useful for text mining and exploratory text analysis:
  § Get a sense of the overall content of a collection (e.g., what are some “typical”/representative documents in the collection?)
  § Link similar text objects (e.g., removing duplicated content)
  § Create a structure on the text data (e.g., for browsing)
  § Induce additional features (i.e., cluster labels) for classification of text objects
§ Examples of applications:
  § Clustering of search results
  § Understanding major complaints in emails from customers
Topic Mining Revisited
[Figure: topic mining setup. INPUT: collection C, number of topics k, vocabulary V. OUTPUT: k word distributions {θ_1, …, θ_k} (e.g., a “sports” topic: sports 0.02, game 0.01, basketball 0.005, football 0.004, …; a “travel” topic: travel 0.05, attraction 0.03, trip 0.01, …; a “science” topic: science 0.04, scientist 0.03, spaceship 0.006, …) and per-document coverage probabilities {p_i1, …, p_ik}; each document can cover several topics to different extents (e.g., 30% sports, 12% travel, 8% science).]
One Topic (= Cluster) per Document
[Figure: same setup as topic mining, but each document is generated by exactly one topic. INPUT: C, k, V. OUTPUT: {θ_1, …, θ_k} and cluster assignments {c_1, …, c_N} with c_i ∈ [1, k]; equivalently, the coverage of the chosen topic is 100% and the coverage of every other topic is 0% (e.g., p_11 = 100%, p_12 = … = p_1k = 0%).]
Mining One Topic Revisited
[Figure: the single-topic model. INPUT: C = {d}, V. OUTPUT: one word distribution θ with probabilities p(w|θ) for words such as “text”, “mining”, “association”, “database”, “query”, …; the single document is covered 100% by θ.]
(1 Doc, 1 Topic) → (N Docs, N Topics) → with k < N: (N Docs, k shared Topics) = Clustering!
What Generative Model Can Do Clustering?
[Figure: desired output. INPUT: C, k, V. OUTPUT: {θ_1, …, θ_k} and cluster assignments {c_1, …, c_N}, c_i ∈ [1, k], with each document covered 100% by exactly one topic and 0% by all others.]
§ How can we force every document to be generated using one topic (instead of k topics)?
Generative Topic Model Revisited
§ Why can’t this model be used for clustering?
[Figure: a two-topic mixture generating document d one word at a time. Topic choice: p(θ_1) = p(θ_2) = 0.5, with p(θ_1) + p(θ_2) = 1. θ_1 is a “text mining” topic (text 0.04, mining 0.035, association 0.03, clustering 0.005, …, the 0.000001); θ_2 is a background topic (the 0.03, a 0.02, is 0.015, we 0.01, food 0.003, …, text 0.000006). For every word position the model makes a fresh topic choice and then draws a word (e.g., “the” or “text”) from the chosen distribution.]
Mixture Model for Document Clustering
[Figure: the same two distributions θ_1 and θ_2 with p(θ_1) + p(θ_2) = 1, but the topic choice is now made once per document: after choosing θ_i, all L words of d = x_1 x_2 … x_L are generated from p(w|θ_i).]
§ Difference from the topic model?
§ What if p(θ_1) = 1 or p(θ_2) = 1?
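To make this generative story concrete, here is a minimal sketch (in Python, with made-up toy vocabularies and probabilities, not taken from the slides) that samples a document from a two-component mixture: the topic is chosen once, and then every word is drawn from that single distribution.

```python
import random

# Toy parameters (hypothetical values, for illustration only)
p_theta = [0.5, 0.5]                                   # p(theta_1), p(theta_2)
word_dists = [
    {"text": 0.5, "mining": 0.3, "clustering": 0.2},   # p(w | theta_1)
    {"the": 0.5, "a": 0.3, "food": 0.2},               # p(w | theta_2)
]

def generate_document(length):
    """Choose ONE topic for the whole document, then draw ALL words from it."""
    # Step 1: pick theta_i according to p(theta_i)
    i = random.choices(range(len(p_theta)), weights=p_theta, k=1)[0]
    # Step 2: generate every word of the document from p(w | theta_i)
    words, probs = zip(*word_dists[i].items())
    return list(random.choices(words, weights=probs, k=length)), i

doc, cluster = generate_document(10)
print("cluster:", cluster + 1, "words:", doc)
```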
Likelihood Function: p(d) = ?
For a document d = x_1 x_2 … x_L under the two-cluster mixture model:
  p(d) = p(θ_1) p(d|θ_1) + p(θ_2) p(d|θ_2)
       = p(θ_1) ∏_{i=1}^{L} p(x_i|θ_1) + p(θ_2) ∏_{i=1}^{L} p(x_i|θ_2)
How is this different from a topic model?
  Topic model: p(d) = ∏_{i=1}^{L} [ p(θ_1) p(x_i|θ_1) + p(θ_2) p(x_i|θ_2) ]
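A small numeric sketch of this contrast, assuming two toy per-word probabilities per topic (the values echo the earlier figure but are otherwise illustrative): in the clustering mixture the product over words sits inside the sum over topics, while in the topic model the sum over topics sits inside the product over words.

```python
# Toy per-word probabilities for a two-word document d = "text mining"
p1 = [0.04, 0.035]          # p("text"|theta_1), p("mining"|theta_1)
p2 = [0.000006, 0.000001]   # p("text"|theta_2), p("mining"|theta_2)
pi1, pi2 = 0.5, 0.5         # p(theta_1), p(theta_2)

def prod(xs):
    result = 1.0
    for x in xs:
        result *= x
    return result

# Document-clustering mixture: the topic is chosen once per document,
# so the product over words is inside the sum over topics.
p_cluster = pi1 * prod(p1) + pi2 * prod(p2)

# Topic model: the topic is chosen independently for every word,
# so the sum over topics is inside the product over words.
p_topic = prod(pi1 * a + pi2 * b for a, b in zip(p1, p2))

print(p_cluster, p_topic)   # the two likelihoods are generally different
```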
Likelihood Function: p(d) = ?
  p(d) = p(θ_1) p(d|θ_1) + p(θ_2) p(d|θ_2)
       = p(θ_1) ∏_{i=1}^{L} p(x_i|θ_1) + p(θ_2) ∏_{i=1}^{L} p(x_i|θ_2),   where d = x_1 x_2 … x_L
How can we generalize it to include k topics/clusters?
Mixture Model for Document Clustering
§ Data: a collection of documents C = {d_1, …, d_N}
§ Model: mixture of k unigram LMs: Λ = ({θ_i}; {p(θ_i)}), i ∈ [1, k]
§ To generate a document, first choose a θ_i according to p(θ_i), and then generate all words in the document using p(w|θ_i)
§ Likelihood:
  p(d|Λ) = ∑_{i=1}^{k} [ p(θ_i) ∏_{j=1}^{|d|} p(x_j|θ_i) ] = ∑_{i=1}^{k} [ p(θ_i) ∏_{w∈V} p(w|θ_i)^{c(w,d)} ]
§ Maximum likelihood estimate: Λ* = argmax_Λ p(d|Λ)
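A minimal sketch of evaluating this likelihood for one document, using the second form over word counts c(w, d); the function name, the log-space computation, and the small probability floor for unseen words are implementation assumptions, not part of the slide.

```python
import math

def log_p_doc(counts, priors, word_dists):
    """log p(d|Lambda) = log sum_i [ p(theta_i) * prod_w p(w|theta_i)^c(w,d) ],
    computed in log space (log-sum-exp) for numerical stability."""
    log_terms = []
    for prior, dist in zip(priors, word_dists):
        term = math.log(prior)
        for w, c in counts.items():
            term += c * math.log(dist.get(w, 1e-12))  # tiny floor for unseen words
        log_terms.append(term)
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

# Toy usage with the parameters of the later 2-cluster example
counts = {"text": 2, "mining": 2}                      # c(w, d)
priors = [0.5, 0.5]                                    # p(theta_i)
word_dists = [{"text": 0.5, "mining": 0.2, "medical": 0.2, "health": 0.1},
              {"text": 0.1, "mining": 0.1, "medical": 0.75, "health": 0.05}]
print(log_p_doc(counts, priors, word_dists))
```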
Cluster Allocation After Parameter Estimation
§ Parameters of the mixture model: Λ = ({θ_i}; {p(θ_i)}), i ∈ [1, k]
§ Each θ_i represents the content of cluster i: p(w|θ_i)
§ p(θ_i) indicates the size of cluster i
§ Which cluster should document d belong to? c_d = ?
§ Likelihood only: assign d to the cluster whose topic θ_i most likely generated d:
  c_d = argmax_i p(d|θ_i)
§ Likelihood + prior p(θ_i) (Bayesian): c_d = argmax_i p(θ_i) p(d|θ_i), which favors large clusters
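Both allocation rules can be written as a single argmax in log space; the sketch below is a hypothetical helper (the function name and the unseen-word floor are assumptions), applied to the toy document and parameters of the 2-cluster example later in the slides.

```python
import math

def assign_cluster(counts, priors, word_dists, use_prior=True):
    """c_d = argmax_i p(d|theta_i), optionally multiplied by the prior p(theta_i)."""
    best_i, best_score = -1, -math.inf
    for i, (prior, dist) in enumerate(zip(priors, word_dists)):
        score = sum(c * math.log(dist.get(w, 1e-12)) for w, c in counts.items())
        if use_prior:             # Bayesian variant: favors large clusters
            score += math.log(prior)
        if score > best_score:
            best_i, best_score = i, score
    return best_i

# Toy document and parameters (same as the 2-cluster example below)
counts = {"text": 2, "mining": 2}
priors = [0.5, 0.5]
word_dists = [{"text": 0.5, "mining": 0.2, "medical": 0.2, "health": 0.1},
              {"text": 0.1, "mining": 0.1, "medical": 0.75, "health": 0.05}]
print(assign_cluster(counts, priors, word_dists))   # -> 0, i.e., cluster 1
```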
How Can We Compute the ML Estimate?
§ Data: a collection of documents C = {d_1, …, d_N}
§ Model: mixture of k unigram LMs: Λ = ({θ_i}; {p(θ_i)}), i ∈ [1, k]
§ To generate a document, first choose a θ_i according to p(θ_i), and then generate all words in the document using p(w|θ_i)
§ Likelihood:
  p(d|Λ) = ∑_{i=1}^{k} [ p(θ_i) ∏_{w∈V} p(w|θ_i)^{c(w,d)} ]
  p(C|Λ) = ∏_{j=1}^{N} p(d_j|Λ)
§ Maximum likelihood estimate: Λ* = argmax_Λ p(C|Λ)
EM Algorithm for Document Clustering
§ Initialization: randomly set Λ = ({θ_i}; {p(θ_i)}), i ∈ [1, k]
§ Repeat until the likelihood p(C|Λ) converges:
§ E-step: infer which distribution was used to generate each document d (hidden variable Z_d ∈ [1, k]):
  p^(n)(Z_d = i | d) ∝ p^(n)(θ_i) ∏_{w∈V} p^(n)(w|θ_i)^{c(w,d)},   with ∑_{i=1}^{k} p^(n)(Z_d = i | d) = 1
§ M-step: re-estimate all parameters:
  p^(n+1)(θ_i) ∝ ∑_{j=1}^{N} p^(n)(Z_{d_j} = i | d_j),   with ∑_{i=1}^{k} p^(n+1)(θ_i) = 1
  p^(n+1)(w|θ_i) ∝ ∑_{j=1}^{N} c(w, d_j) p^(n)(Z_{d_j} = i | d_j),   with ∑_{w∈V} p^(n+1)(w|θ_i) = 1, ∀ i ∈ [1, k]
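A compact sketch of these E- and M-steps, assuming documents are given as word-count dictionaries; the random initialization scheme, the tiny smoothing constant, and a fixed iteration count (rather than an explicit convergence check on p(C|Λ)) are implementation assumptions.

```python
import math
import random

def em_cluster(docs, k, vocab, iters=50, seed=0):
    """EM for the mixture-of-unigram-LMs clustering model.
    docs: list of {word: count}; vocab must contain every word occurring in docs.
    Returns (priors p(theta_i), word_dists p(w|theta_i))."""
    rng = random.Random(seed)
    priors = [1.0 / k] * k                              # uniform p(theta_i)
    word_dists = []
    for _ in range(k):                                  # random p(w|theta_i)
        weights = {w: rng.random() + 1e-3 for w in vocab}
        z = sum(weights.values())
        word_dists.append({w: v / z for w, v in weights.items()})

    for _ in range(iters):
        # E-step: p(Z_d = i | d) ∝ p(theta_i) * prod_w p(w|theta_i)^c(w,d)
        posteriors = []
        for d in docs:
            log_post = [math.log(priors[i]) +
                        sum(c * math.log(word_dists[i][w]) for w, c in d.items())
                        for i in range(k)]
            m = max(log_post)
            unnorm = [math.exp(lp - m) for lp in log_post]
            z = sum(unnorm)
            posteriors.append([u / z for u in unnorm])

        # M-step: p(theta_i) ∝ sum_j p(Z_dj = i | dj)
        priors = [sum(post[i] for post in posteriors) / len(docs) for i in range(k)]
        # M-step: p(w|theta_i) ∝ sum_j c(w, dj) * p(Z_dj = i | dj)
        word_dists = []
        for i in range(k):
            counts = {w: 1e-6 for w in vocab}           # tiny smoothing, avoids log(0)
            for d, post in zip(docs, posteriors):
                for w, c in d.items():
                    counts[w] += c * post[i]
            z = sum(counts.values())
            word_dists.append({w: v / z for w, v in counts.items()})
    return priors, word_dists
```

For instance, em_cluster([{"text": 2, "mining": 2}, {"medical": 3, "health": 1}], k=2, vocab={"text", "mining", "medical", "health"}) should place the two toy documents into different clusters.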
EM Algorithm for Document Clustering
§ Initialization: Λ = ({θ_i}; {p(θ_i)}), i ∈ [1, k]
§ E-step: compute p(Z_d = i | d)
§ M-step: re-estimate all parameters Λ = ({θ_i}; {p(θ_i)})
An Example of 2 Clusters (E-step)
§ Document d word counts: c("text", d) = 2, c("mining", d) = 2, c("medical", d) = 0, c("health", d) = 0
§ Hidden variable: Z_d ∈ {1, 2}
§ Random initialization: p(θ_1) = p(θ_2) = 0.5
  p(w|θ_1): text 0.5, mining 0.2, medical 0.2, health 0.1
  p(w|θ_2): text 0.1, mining 0.1, medical 0.75, health 0.05
§ E-step:
  p(Z_d = 1 | d) = p(θ_1) p("text"|θ_1)^2 p("mining"|θ_1)^2 / [ p(θ_1) p("text"|θ_1)^2 p("mining"|θ_1)^2 + p(θ_2) p("text"|θ_2)^2 p("mining"|θ_2)^2 ]
                 = (0.5 · 0.5^2 · 0.2^2) / (0.5 · 0.5^2 · 0.2^2 + 0.5 · 0.1^2 · 0.1^2) = 100/101
  p(Z_d = 2 | d) = ?
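The arithmetic of this E-step can be verified directly; the snippet below reproduces the 100/101 posterior from the counts and initial parameters in the example.

```python
# Word counts of d and the randomly initialized parameters from the example
p1 = {"text": 0.5, "mining": 0.2, "medical": 0.2, "health": 0.1}    # p(w|theta_1)
p2 = {"text": 0.1, "mining": 0.1, "medical": 0.75, "health": 0.05}  # p(w|theta_2)
prior1 = prior2 = 0.5                                               # p(theta_1), p(theta_2)

num1 = prior1 * p1["text"] ** 2 * p1["mining"] ** 2   # 0.5 * 0.5^2 * 0.2^2 = 0.005
num2 = prior2 * p2["text"] ** 2 * p2["mining"] ** 2   # 0.5 * 0.1^2 * 0.1^2 = 0.00005
print(num1 / (num1 + num2))   # p(Z_d = 1 | d) = 100/101 ≈ 0.990
print(num2 / (num1 + num2))   # p(Z_d = 2 | d) = 1/101  ≈ 0.010
```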
Normalization to Avoid Underflow
§ The product of many small word probabilities can underflow; divide both the numerator and the denominator by a common factor
§ A possible normalizer: the average of p(w|θ_i) over the clusters for each word w
  text: (0.5 + 0.1)/2, mining: (0.2 + 0.1)/2, medical: (0.2 + 0.75)/2, health: (0.1 + 0.05)/2
§ Let p̄(w) = (p(w|θ_1) + p(w|θ_2))/2. Dividing every term by p̄("text")^2 p̄("mining")^2 leaves the posterior unchanged:
  p(Z_d = 1 | d) = [ p(θ_1) · (p("text"|θ_1)/p̄("text"))^2 (p("mining"|θ_1)/p̄("mining"))^2 ] / [ p(θ_1) · (p("text"|θ_1)/p̄("text"))^2 (p("mining"|θ_1)/p̄("mining"))^2 + p(θ_2) · (p("text"|θ_2)/p̄("text"))^2 (p("mining"|θ_2)/p̄("mining"))^2 ]
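A sketch of this normalization trick, using the slide’s per-word average of p(w|θ_i) as the common divisor; dividing numerator and denominator by the same factor keeps the products in a safe numeric range without changing the posterior (working in log space is an equally common alternative).

```python
# Toy parameters from the 2-cluster example
counts = {"text": 2, "mining": 2}                                    # c(w, d)
p = [{"text": 0.5, "mining": 0.2, "medical": 0.2, "health": 0.1},    # p(w|theta_1)
     {"text": 0.1, "mining": 0.1, "medical": 0.75, "health": 0.05}]  # p(w|theta_2)
prior = [0.5, 0.5]
k = len(p)

# Per-word normalizer: the average of p(w|theta_i) over the clusters
norm = {w: sum(p[i][w] for i in range(k)) / k for w in p[0]}

def scaled_score(i):
    """p(theta_i) * prod_w (p(w|theta_i) / norm(w))^c(w,d): same ratio, larger magnitude."""
    s = prior[i]
    for w, c in counts.items():
        s *= (p[i][w] / norm[w]) ** c
    return s

scores = [scaled_score(i) for i in range(k)]
print([s / sum(scores) for s in scores])   # same posteriors as before: [100/101, 1/101]
```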
Summary of Generative Model for Clustering
§ A slight variation of the topic model can be used for clustering documents
§ Each cluster is represented by a unigram LM p(w|θ_i) → term cluster
§ A document is generated by first choosing a unigram LM and then generating ALL words in the document using this single LM
§ The estimated model parameters give both a topic characterization of each cluster and a probabilistic assignment of each document to each cluster
§ The EM algorithm can be used to compute the ML estimate
§ Normalization is often needed to avoid underflow
More About Text Clustering