Using Wikipedia for Co-clustering Based Cross-domain Text Classification
Pu Wang and Carlotta Domeniconi, George Mason University
Jian Hu, Microsoft Research Asia
ICDM, Pisa, December 16, 2008

Motivation
• Labeled data are seldom available, and often too expensive to obtain.
• Abundant labeled data may exist for a different but related domain.
• Goal: use the labeled data as auxiliary information to accomplish the task (classification) in the target domain.
Main Idea
• Leverage the dictionary shared by the in-domain ($D_i$) and out-of-domain ($D_o$) documents to propagate label information.
  [Diagram: label propagation from $D_i$ to $D_o$ through common words]
• Enrich the document representation to fill the semantic gap.
  [Diagram: label propagation from $D_i$ to $D_o$ through common words and semantic concepts]
Co-clustering based Classification (CoCC) [Dai et al., KDD 07]
• $D_i$: in-domain documents
• $D_o$: out-of-domain documents
• $\mathcal{C}$: set of class labels
• $\mathcal{W}$: dictionary of all the words

• Co-clustering of $D_o$ and $\mathcal{W}$:

  $D_o: \{d_1, \dots, d_m\} \to \{\hat{d}_1, \hat{d}_2, \dots, \hat{d}_{|\mathcal{C}|}\} = \hat{D}_o$
  $\mathcal{W}: \{w_1, \dots, w_n\} \to \{\hat{w}_1, \hat{w}_2, \dots, \hat{w}_k\} = \hat{\mathcal{W}}$

  (The number of document clusters equals the number of class labels $|\mathcal{C}|$.)
Co-clustering based Classification (CoCC) [Dai et al., KDD 07]
[Figure: Step 1 clusters the words of both in-domain and out-of-domain documents; Step 2 clusters the out-of-domain documents, keeping a 1-1 map between class labels and document clusters.]
[Figure: Initialization: document clusters are initialized with a Naive Bayes classifier; word clusters are initialized from the in-domain and out-of-domain documents.]
Co-clustering based Classification (CoCC) [Dai et al., KDD 07]
• Iterative algorithm that achieves

  $\min_{\hat{D}_o, \hat{\mathcal{W}}} \big\{ I(D_o; \mathcal{W}) - I(\hat{D}_o; \hat{\mathcal{W}}) + \lambda \big( I(\mathcal{C}; \mathcal{W}) - I(\mathcal{C}; \hat{\mathcal{W}}) \big) \big\}$

  The first term is the loss in mutual information between documents and words; the second is the loss in mutual information between class labels and words.

Information Theoretic Co-clustering [Dhillon et al., KDD 03]

  $I(X; Y) = \sum_x \sum_y p(x, y) \log \dfrac{p(x, y)}{p(x)\,p(y)}$
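As an aside not in the original slides, the mutual information formula above can be computed directly from a discrete joint distribution; the function name and the toy distributions below are illustrative:

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) p(y))).

    `joint` maps (x, y) pairs to probabilities p(x, y) summing to 1.
    """
    px, py = {}, {}
    for (x, y), p in joint.items():          # marginals p(x) and p(y)
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Perfectly correlated binary variables carry I(X;Y) = log 2 nats;
# independent ones carry 0.
print(mutual_information({(0, 0): 0.5, (1, 1): 0.5}))  # → 0.693... (log 2)
```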
Distributions used in the objective:

  $f(d, w) = p(d, w), \quad f(w) = \sum_{d \in D_o} p(d, w), \quad f(d \mid w) = p(d \mid w)$
  $f(d) = \sum_{w \in \mathcal{W}} p(d, w), \quad f(w \mid d) = p(w \mid d)$
  $\hat{f}(\hat{w}) = p(\hat{w}), \quad \hat{f}(\hat{d}) = p(\hat{d}), \quad \hat{f}(d \mid \hat{d}) = p(d \mid \hat{d}), \quad \hat{f}(w \mid \hat{w}) = p(w \mid \hat{w})$
  $\hat{f}(d \mid \hat{w}) = \hat{f}(d \mid \hat{d})\,\hat{f}(\hat{d} \mid \hat{w}) = p(d \mid \hat{d})\,p(\hat{d} \mid \hat{w})$
  $\hat{f}(w \mid \hat{d}) = \hat{f}(w \mid \hat{w})\,\hat{f}(\hat{w} \mid \hat{d}) = p(w \mid \hat{w})\,p(\hat{w} \mid \hat{d})$

  $\hat{g}(c, w) = p(c, \hat{w})\,p(w \mid \hat{w}) = p(c, \hat{w})\,\dfrac{p(w)}{p(\hat{w})}$
  $g(c, w) = p(c, w), \quad g(w) = \sum_{c \in \mathcal{C}} p(c, w), \quad g(c \mid w) = p(c \mid w)$
  $\hat{g}(c \mid \hat{w}) = \dfrac{\sum_{w \in \hat{w}} p(c \mid w)\,p(w)}{p(\hat{w})} = \dfrac{\sum_{w \in \hat{w}} p(c \mid w)\,p(w)}{\sum_{w \in \hat{w}} p(w)}$
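A small illustration (not from the slides; the words and probabilities are invented) of how $\hat{g}(c \mid \hat{w})$ averages the word-level posteriors $p(c \mid w)$ over a word cluster, weighted by $p(w)$:

```python
def g_hat(c, word_cluster, p_c_given_w, p_w):
    """ĝ(c | ŵ) = Σ_{w∈ŵ} p(c|w) p(w) / Σ_{w∈ŵ} p(w)."""
    num = sum(p_c_given_w[(c, w)] * p_w[w] for w in word_cluster)
    den = sum(p_w[w] for w in word_cluster)
    return num / den

# Hypothetical word cluster {puma, cougar} and class "animal".
p_w = {"puma": 0.2, "cougar": 0.1}
p_c_given_w = {("animal", "puma"): 0.6, ("animal", "cougar"): 0.9}
# Weighted average of 0.6 and 0.9 with weights 0.2 and 0.1 → 0.7
print(g_hat("animal", ["puma", "cougar"], p_c_given_w, p_w))
```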
Co-clustering based Classification (CoCC) [Dai et al., KDD 07]
The objective can be rewritten in terms of KL divergences:

  $I(D_o; \mathcal{W}) - I(\hat{D}_o; \hat{\mathcal{W}}) + \lambda \big( I(\mathcal{C}; \mathcal{W}) - I(\mathcal{C}; \hat{\mathcal{W}}) \big) = D\big( f(D_o, \mathcal{W}) \,\|\, \hat{f}(D_o, \mathcal{W}) \big) + \lambda\, D\big( g(\mathcal{C}, \mathcal{W}) \,\|\, \hat{g}(\mathcal{C}, \mathcal{W}) \big)$

where

  $D\big( p(x) \,\|\, q(x) \big) = \sum_x p(x) \log \dfrac{p(x)}{q(x)}$

and the two terms decompose as

  $D\big( f(D_o, \mathcal{W}) \,\|\, \hat{f}(D_o, \mathcal{W}) \big) = \sum_{\hat{d} \in \hat{D}_o} \sum_{d \in \hat{d}} f(d)\, D\big( f(\mathcal{W} \mid d) \,\|\, \hat{f}(\mathcal{W} \mid \hat{d}) \big)$
  $D\big( f(D_o, \mathcal{W}) \,\|\, \hat{f}(D_o, \mathcal{W}) \big) = \sum_{\hat{w} \in \hat{\mathcal{W}}} \sum_{w \in \hat{w}} f(w)\, D\big( f(D_o \mid w) \,\|\, \hat{f}(D_o \mid \hat{w}) \big)$
  $D\big( g(\mathcal{C}, \mathcal{W}) \,\|\, \hat{g}(\mathcal{C}, \mathcal{W}) \big) = \sum_{\hat{w} \in \hat{\mathcal{W}}} \sum_{w \in \hat{w}} g(w)\, D\big( g(\mathcal{C} \mid w) \,\|\, \hat{g}(\mathcal{C} \mid \hat{w}) \big)$
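The KL divergence used throughout the objective can be transcribed directly (an illustrative helper, not from the slides):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = Σ_x p(x) log(p(x)/q(x)).

    `p` and `q` are dicts mapping outcomes to probabilities;
    assumes q(x) > 0 wherever p(x) > 0.
    """
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
print(kl_divergence(p, p))                     # identical distributions → 0.0
print(kl_divergence(p, {"a": 0.9, "b": 0.1}))  # mismatch → positive
```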
Co-clustering based Classification (CoCC) [Dai et al., KDD 07]
Alternating update rules:

  $C^{(t)}_{D_o}(d) = \arg\min_{\hat{d}} D\big( f(\mathcal{W} \mid d) \,\|\, \hat{f}^{(t-1)}(\mathcal{W} \mid \hat{d}) \big)$
  $C^{(t+1)}_{\mathcal{W}}(w) = \arg\min_{\hat{w}} \big[ f(w)\, D\big( f(D_o \mid w) \,\|\, \hat{f}^{(t)}(D_o \mid \hat{w}) \big) + \lambda\, g(w)\, D\big( g(\mathcal{C} \mid w) \,\|\, \hat{g}^{(t)}(\mathcal{C} \mid \hat{w}) \big) \big]$

Main Idea
• Enrich the document representation to fill the semantic gap.
  [Diagram: label propagation from $D_i$ to $D_o$ through common words and semantic concepts]
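The CoCC document-reassignment update can be put into a runnable toy loop. This is a simplified sketch, not the full algorithm: the word-clustering step and the λ term are omitted, and the cluster profiles $\hat{f}(\mathcal{W} \mid \hat{d})$ are estimated here by simply averaging the member documents' word distributions (an assumption of this sketch):

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence between two distributions given as aligned lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def reassign_documents(docs, assignment, n_clusters):
    """One document-reassignment step: move each document to the cluster
    whose word profile is closest in KL divergence to its own f(W|d)."""
    profiles = []
    for k in range(n_clusters):
        members = [docs[i] for i, a in enumerate(assignment) if a == k]
        if not members:                      # keep empty clusters usable
            profiles.append(normalize([1.0] * len(docs[0])))
            continue
        profiles.append(normalize([sum(col) for col in zip(*members)]))
    return [min(range(n_clusters), key=lambda k: kl(normalize(d), profiles[k]))
            for d in docs]

# Toy word-count vectors: two documents dominated by word 0, two by word 2.
docs = [[8, 1, 1], [7, 2, 1], [1, 1, 8], [1, 2, 7]]
assignment = [0, 0, 0, 1]                    # poor initial assignment
for _ in range(5):
    assignment = reassign_documents(docs, assignment, 2)
print(assignment)  # → [0, 0, 1, 1]
```

After a few passes, documents with similar word profiles end up in the same cluster, which is the mechanism the full objective exploits when it adds the label term.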
Build Thesaurus from Wikipedia
[Figure: Wikipedia structure used to build the thesaurus:
 • Concepts, e.g. "Puma" and "Puma (Car)"; ambiguous concepts belong to several categories.
 • Redirects, e.g. "Cougar" redirects to "Puma".
 • Categories, e.g. concept "Puma" belongs to category "Felidae", and "Puma (Car)" belongs to "Ford Vehicles".
 • Related concepts of "Puma": "Cougar", "Mountain Lion", "Automobile".]

Overall Approach
[Figure: pipeline:
 1. Build thesaurus from Wikipedia.
 2. Search candidate concepts in the documents, e.g. "... The Cougar, also Puma and Mountain lion, is a New World mammal of the Felidae family ..." yields candidate concept Puma: 2.
 3. Disambiguation: "Puma" here means a kind of animal, not a car or a sports brand.
 4. Enriched document representation with Wikipedia concepts, e.g. Puma: 2, Cougar: 2, Felines: 0.98, ...
 5. Build semantic kernels from the term/concept proximity matrix.]

Proximity Matrix
[Figure: block matrix over terms and concepts; the term block is the identity, the term-concept blocks are zero, and the concept block holds pairwise proximities such as a, b, c]

  $S = \lambda_1 S_{BOW} + \lambda_2 S_{OLC} + (1 - \lambda_1 - \lambda_2)(1 - D_{cat})$

  ($S_{BOW}$: content-based; $S_{OLC}$: outlink-category-based; $D_{cat}$: distance-based, from the category structure)
Proximity Matrix
For concepts $c_i$ and $c_j$:

  $P_{ij} = \begin{cases} 1 & \text{if } c_i \text{ and } c_j \text{ are synonyms;} \\ \mu^{-depth} & \text{if } c_i \text{ and } c_j \text{ are hyponyms;} \\ S & \text{if } c_i \text{ and } c_j \text{ are associative concepts;} \\ 0 & \text{otherwise,} \end{cases}$

where

  $S = \lambda_1 S_{BOW} + \lambda_2 S_{OLC} + (1 - \lambda_1 - \lambda_2)(1 - D_{cat})$

Building Semantic Kernels
Example text: "Machine learning, statistical learning and data mining are related subjects."

Original BOW vector:
  <machine:1, statistical:1, learn:2, data:1, mine:1, relate:1, subject:1>

Find Wikipedia concepts; terms that are not concepts are kept as they are:
  $\phi(d)$ = <relate:1, subject:1; machine learning:1, statistical learning:1, data mining:1; ...>

Concept proximity matrix $P$ (excerpt):

                            Machine    Statistical   Data      Artificial
                            Learning   Learning      Mining    Intelligence
  Machine Learning          1          0.6276        0.4044    0.1314
  Statistical Learning      0.6276     1             0.2839    0.1146
  Data Mining               0.4044     0.2839        1         0.0792
  Artificial Intelligence   0.1314     0.1146        0.0792    1

Enriched document vector: $\tilde{\phi}(d) = \phi(d)\,P$
  <relate:1, subject:1; machine learning:1, statistical learning:1, data mining:1; artificial intelligence:0.3252>
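The enrichment step $\tilde{\phi}(d) = \phi(d)\,P$ can be reproduced with a plain vector-matrix product, restricted here to the concept part of the vector and using the numbers from the slide's example:

```python
# Concept part of φ(d) and the concept block of the proximity matrix P,
# with the values shown on the slide.
concepts = ["machine learning", "statistical learning", "data mining",
            "artificial intelligence"]
phi = [1.0, 1.0, 1.0, 0.0]   # "artificial intelligence" is absent from the text
P = [
    [1.0,    0.6276, 0.4044, 0.1314],
    [0.6276, 1.0,    0.2839, 0.1146],
    [0.4044, 0.2839, 1.0,    0.0792],
    [0.1314, 0.1146, 0.0792, 1.0],
]

# φ̃(d) = φ(d) · P: each concept inherits weight from related concepts, so
# "artificial intelligence" appears with weight 0.1314 + 0.1146 + 0.0792.
phi_tilde = [sum(phi[i] * P[i][j] for i in range(len(phi)))
             for j in range(len(concepts))]
print(round(phi_tilde[3], 4))  # → 0.3252
```

Note that under this product the concepts already present also accumulate weight from their neighbours; the slide's enriched vector lists them at their original counts and displays only the newly added concept's weight.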
Empirical Evaluation
• Data sets: 20Newsgroups and SRAA
• Methods:
  • CoCC with and without enrichment
  • NB with and without enrichment

Cross-domain Classification Precision Rates

  Data Set                     | w/o enrichment  | w/ enrichment
                               |  NB      CoCC   |  NB      CoCC
  rec vs talk                  | 0.824   0.921   | 0.853   0.998
  rec vs sci                   | 0.809   0.954   | 0.828   0.984
  comp vs talk                 | 0.927   0.978   | 0.934   0.995
  comp vs sci                  | 0.552   0.898   | 0.673   0.987
  comp vs rec                  | 0.817   0.915   | 0.825   0.993
  sci vs talk                  | 0.804   0.947   | 0.877   0.988
  rec vs sci vs comp           | 0.584   0.822   | 0.635   0.904
  rec vs talk vs sci           | 0.687   0.881   | 0.739   0.979
  sci vs talk vs comp          | 0.695   0.836   | 0.775   0.912
  rec vs talk vs sci vs comp   | 0.487   0.624   | 0.538   0.713
  real vs simulation           | 0.753   0.851   | 0.826   0.977
  auto vs aviation             | 0.824   0.959   | 0.933   0.992
[Figure: CoCC with enrichment: precision as a function of the number of iterations, for several cross-domain data sets]
[Figure: CoCC with enrichment: precision as a function of λ on sci vs talk vs comp, for different numbers of word clusters]