Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge

Xia Hu (1,2), Nan Sun (1), Chao Zhang (1), Tat-Seng Chua (1)

(1) School of Computing, National University of Singapore
(2) School of Computer Science and Engineering, Beihang University

November 2, 2009
Outline

1 Introduction
2 Proposed Framework
3 Evaluation
4 Conclusion and Future Work
Aggregated Search

A form of browsing search results.
Short Texts

- Short texts, such as snippets, product descriptions, QA passages, and image captions, play important roles in current Web and IR applications.
- Unlike standard texts, which contain many words, short texts consist of only a few phrases or 2–3 sentences, and so present great challenges in clustering.
- Problems: "data sparseness" & "semantic gap".
Related Work

- Many methods have been proposed to improve the representation of standard text for clustering and classification, including "surface representation" [3,19] and "integrating world knowledge" [14].
- Several clustering techniques have been employed to place search engine snippets into highly relevant, topic-coherent groups [5,29].
- World knowledge bases have been found useful in improving short text representation [1,23].
The General Framework

Fig: Framework for feature constructor
Hierarchical Resolution

"Jul 18, 2008 ... It is the best American film of the year so far and likely to remain that way. Christopher Nolan's The Dark Knight is revelatory, visceral ..."

Fig: Syntax tree of the snippet
Original Feature Extraction

- Segment-level features.
- Phrase-level features.
  Sentence 1: [NP July 18 2008]
  Sentence 2: [NP It] [VP is] [NP the best American film] [PP of] [NP the year] [ADVP so far] and/CC [ADJP likely] [VP to remain] [NP that way]
  Sentence 3: [NP Christopher Nolan 's] [NP The Dark Knight] [VP is] [NP revelatory visceral]
- Word-level features.
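The phrase-level features above come from a bracket-chunked parse of each sentence. As a minimal sketch (not the authors' code), such annotations can be read into (tag, phrase) pairs with a regular expression; the function name `phrase_features` is illustrative:

```python
import re

def phrase_features(chunked: str):
    """Parse a bracket-chunked sentence such as
    '[NP Christopher Nolan] [VP is] ...' into (tag, phrase) pairs."""
    return re.findall(r"\[(\w+)\s+([^\]]+?)\s*\]", chunked)

sent3 = "[NP Christopher Nolan 's] [NP The Dark Knight] [VP is] [NP revelatory visceral]"
print(phrase_features(sent3))
```

Tokens outside any bracket (such as the conjunction `and/CC` in Sentence 2) are simply skipped by this pattern.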
Feature Generation

Two steps:
- Construction of basic features: seed phrases from internal semantics.
- Generation of external features: external features from world knowledge bases.
Seed Phrase Selection (I)

- There are redundancies between phrase-level features and segment-level features.
- We propose to measure the semantic similarity between the two kinds of features to eliminate information redundancy.
- For Wikipedia, we download the XML corpus, remove the XML tags, and create a Solr index of all articles.
Seed Phrase Selection (II)

- Let P denote a segment-level feature, P = {p_1, p_2, ..., p_n}.
- We calculate the semantic similarity between p_i and the other phrases in {p_1, p_2, ..., p_n} as InfoScore(p_i).
- The p* with the largest similarity to the other features in P is removed as the redundant feature.
Seed Phrase Selection (III)

Given two phrases p_i and p_j, the variants of three popular co-occurrence measures [6] are defined as below:

WikiDice(p_i, p_j) =
  0,                                        if f(p_i|p_j) = 0 or f(p_j|p_i) = 0
  (f(p_i|p_j) + f(p_j|p_i)) / (f(p_i) + f(p_j)),   otherwise    (1)

where WikiDice is a variant of the Dice coefficient.

WikiJaccard(p_i, p_j) = min(f(p_i|p_j), f(p_j|p_i)) / (f(p_i) + f(p_j) - max(f(p_i|p_j), f(p_j|p_i)))    (2)

where WikiJaccard is a variant of the Jaccard coefficient.
Seed Phrase Selection (IV)

WikiOverlap(p_i, p_j) = min(f(p_i|p_j), f(p_j|p_i)) / min(f(p_i), f(p_j))    (3)

where WikiOverlap is a variant of the Overlap (Simpson) coefficient.

The linear normalization formula is defined below:

WD_ij = (WikiDice_ij - min(WikiDice_k)) / (max(WikiDice_k) - min(WikiDice_k))    (4)

A linear combination is then used to incorporate the three similarity measures into an overall semantic similarity between two phrases p_i and p_j, as follows:

WikiSem(p_i, p_j) = (1 - α - β) WD_ij + α WJ_ij + β WO_ij    (5)

where α and β weight the importance of the three similarity measures.
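Equations (1)–(5) can be sketched directly from the phrase frequencies. In this sketch (not the paper's implementation), `fi`/`fj` stand for f(p_i)/f(p_j) and `fij`/`fji` for the conditional frequencies f(p_i|p_j)/f(p_j|p_i); the default `alpha` and `beta` values are illustrative placeholders, not values from the paper:

```python
def wiki_dice(fij, fji, fi, fj):
    # Eq. (1): variant of the Dice coefficient; zero when either
    # conditional frequency is zero
    if fij == 0 or fji == 0:
        return 0.0
    return (fij + fji) / (fi + fj)

def wiki_jaccard(fij, fji, fi, fj):
    # Eq. (2): variant of the Jaccard coefficient
    return min(fij, fji) / (fi + fj - max(fij, fji))

def wiki_overlap(fij, fji, fi, fj):
    # Eq. (3): variant of the Overlap (Simpson) coefficient
    return min(fij, fji) / min(fi, fj)

def min_max(scores):
    # Eq. (4): linear normalization over the scores of all phrase pairs k
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def wiki_sem(wd, wj, wo, alpha=0.3, beta=0.3):
    # Eq. (5): linear combination of the three normalized measures
    return (1 - alpha - beta) * wd + alpha * wj + beta * wo
```

Each measure is normalized with Eq. (4) across all phrase pairs before being combined in Eq. (5).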
Seed Phrase Selection (V)

For each segment-level feature, we rank its child-node features at the phrase level by the information score:

InfoScore(p_i) = Σ_{j=1, j≠i}^{n} WikiSem(p_i, p_j)    (6)

Finally, we remove the phrase-level feature p*, which contributes the most duplicated information to the segment-level feature P:

p* = arg max_{p_i ∈ {p_1, p_2, ..., p_n}} InfoScore(p_i)    (7)
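The redundancy-removal step in Eqs. (6)–(7) can be sketched as follows, assuming any pairwise similarity function `sim` (for instance, WikiSem from the previous slide); the function names here are illustrative:

```python
def info_score(i, phrases, sim):
    # Eq. (6): total similarity of p_i to every other phrase in P
    return sum(sim(phrases[i], phrases[j])
               for j in range(len(phrases)) if j != i)

def drop_redundant(phrases, sim):
    # Eq. (7): remove p*, the phrase that carries the most
    # duplicated information with respect to P
    star = max(range(len(phrases)),
               key=lambda i: info_score(i, phrases, sim))
    return [p for i, p in enumerate(phrases) if i != star]
```

With a toy word-overlap similarity, the phrase most similar to all the others (the most redundant one) is the one dropped.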
Background Knowledge Bases

- Wikipedia, as background knowledge, has wider knowledge coverage than WordNet and is regularly updated to reflect recent events.
- On the other hand, because the construction of WordNet follows a theoretical model and corpus evidence, it contains rich lexical semantic knowledge.