Entity Clustering Across Languages
NAACL 2012, Montreal
Spence Green*, Nicholas Andrews#, Matthew R. Gormley#, Mark Dredze#, Christopher D. Manning*
*Stanford University, #Johns Hopkins University
One Entity, Many Names
Qaddafi, Muammar; Al-Gathafi, Muammar; al-Qadhafi, Muammar; Al Qathafi, Mu’ammar; Al Qathafi, Muammar; El Gaddafi, Moamar; El Kadhafi, Moammar; El Kazzafi, Moamer; El Qathafi, Mu’Ammar
One Entity, Many Names: the romanized variants appear alongside several Arabic-script spellings [Arabic text lost in extraction]
Basic Task: Entity Clustering
Cluster co-referent entity mentions across a corpus (documents and languages)
Clustering/disambiguation relies on:
◮ Mention similarity
◮ Context similarity
Entity Disambiguation: Mention Similarity
Example: Arabic-script spellings [lost in extraction] and romanized variants of one name: Qaddafi, Muammar; Al-Gathafi, Muammar; al-Qadhafi, Muammar; Al Qathafi, Mu’ammar; Al Qathafi, Muammar; El Gaddafi, Moamar; El Kadhafi, Moammar; El Kazzafi, Moamer; El Qathafi, Mu’Ammar
Entity Disambiguation: Context Similarity
The mention "Apple" is shown with several Arabic-script strings [Arabic text lost in extraction]. Glosses given for those strings: Apple Inc. (three spellings), a town in Lebanon, camel.
Entity Disambiguation: Context Similarity
Example: "The Apple chief executive was former Beatles road manager Neil Aspinall..."
Sentential context is usually required
Old: Entity Clustering Tasks
◮ Within-doc coref: "Peter said to himself ..."
◮ Cross-doc coref: doc1: "Peter Jones said ...", doc2: "I told Mr. Jones ..." [Bagga and Baldwin, 1998] [Baron and Freedman, 2008]
◮ Entity linking: mentions "Mr. Jones", "Peter Jones", "Peter" [McNamee et al., 2011] [Rao et al., 2011]
New: Entity Clustering Across Languages
Alongside within-doc coref, cross-doc coref [Bagga and Baldwin, 1998] [Baron and Freedman, 2008], and entity linking [McNamee et al., 2011] [Rao et al., 2011]:
◮ This paper: doc1: "Peter Jones said ...", doc2: [Arabic text lost in extraction]
New: Entity Clustering Across Languages
English context: "The Apple chief executive was former Beatles road manager Neil Aspinall..."
Arabic context: [Arabic text lost in extraction]
No knowledge base of entities
Why? Crisis management
Arab Spring (2011)
◮ French, Arabic dialects
◮ Facebook, Twitter, blog...
Haiti earthquake (2010)
◮ Kreyol, English, French
◮ SMS, Twitter, blog...
Other applications: machine translation, name search
Plan: Extend Existing Monolingual System
[Bar chart: B3 for English, Arabic, and English+Arabic; bar labels as extracted: 91.4, 89.8, 78.8]
Outline:
1. Cross-lingual Mention Similarity
2. Cross-lingual Context Similarity
3. Clustering Algorithms
4. Evaluation
Cross-lingual Mention Similarity
Within-language: Edit Distance
m_i = Muammar Qaddafi
m_j = Moamer El Kazzafi
sim(m_i, m_j) = ?
Algorithm: Sorted Jaro-Winkler [Christen, 2006]
1. Sort tokens
2. Compute Jaro-Winkler distance in O(|m_i| + |m_j|)
3. Evaluate sim(m_i, m_j) < β
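The three steps above can be sketched in Python as follows. This is a minimal illustration, assuming lowercased whitespace tokenization and the standard Jaro-Winkler prefix scale of 0.1 over at most four characters. The function names and the threshold `beta=0.2` are hypothetical and not taken from the talk, and this textbook implementation does not attempt the linear running time quoted on the slide.

```python
def jaro(s: str, t: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters in order; half the mismatches are transpositions.
    s_m = [c for c, h in zip(s, s_hit) if h]
    t_m = [c for c, h in zip(t, t_hit) if h]
    transpositions = sum(a != b for a, b in zip(s_m, t_m)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def sorted_jw_distance(a: str, b: str) -> float:
    """Step 1: sort tokens. Step 2: Jaro-Winkler distance = 1 - similarity."""
    def norm(m: str) -> str:
        return " ".join(sorted(m.lower().split()))
    return 1.0 - jaro_winkler(norm(a), norm(b))


def passes_threshold(a: str, b: str, beta: float = 0.2) -> bool:
    """Step 3: compare against beta (0.2 is an illustrative value only)."""
    return sorted_jw_distance(a, b) < beta


d = sorted_jw_distance("Muammar Qaddafi", "Moamer El Kazzafi")
```

Sorting tokens first makes the comparison insensitive to name order, so "Qaddafi Muammar" and "Muammar Qaddafi" have distance zero.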
Cross-language: Binary classifier
m_i = map(Apple) = abbl
m_j = map([Arabic-script name, lost in extraction]) = abl
sim(m_i, m_j) = ?
Algorithm: Phonetic mapping + classification
1. Apply deterministic mapping map(·)
2. Extract character-level features
3. Classify (Maxent)
Cross-language: Binary classifier
Training: parallel name list
◮ 97.1% accuracy on a held-out set
Phonetic mapping: similar in spirit to Soundex
Best features: character bigrams
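To make the first two steps of the classifier pipeline concrete, here is a small Python sketch. The mapping rules in `toy_map` are a hypothetical stand-in chosen only so that "Apple" maps to "abbl" as on the slide; the actual mapping table is not given in the talk. The feature names are likewise illustrative, and the Maxent classification step is left out.

```python
from collections import Counter

VOWELS = set("aeiou")


def toy_map(name: str) -> str:
    """Hypothetical map(.): lowercase, p -> b, drop non-initial vowels."""
    out = []
    for i, ch in enumerate(name.lower()):
        if not ch.isalpha():
            continue
        if ch == "p":
            ch = "b"
        if i > 0 and ch in VOWELS:
            continue
        out.append(ch)
    return "".join(out)


def char_bigrams(s: str) -> Counter:
    """Character bigrams with boundary markers ^ and $."""
    padded = f"^{s}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def pair_features(mi: str, mj: str) -> dict:
    """Sparse features for a mention pair (illustrative feature templates)."""
    a, b = char_bigrams(mi), char_bigrams(mj)
    feats = {}
    for bg in a.keys() & b.keys():
        feats[f"both:{bg}"] = 1.0   # bigram present in both mapped names
    for bg in a.keys() ^ b.keys():
        feats[f"one:{bg}"] = 1.0    # bigram present in only one of them
    feats["len_diff"] = float(abs(len(mi) - len(mj)))
    return feats


mapped = toy_map("Apple")
feats = pair_features(mapped, "abl")
```

A binary Maxent (logistic regression) model trained on a parallel name list would then consume these feature dictionaries.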
Cross-lingual Context Similarity
Context Similarity: Mapping techniques
Method                    Resource level
Machine Translation       High
Lexicon                   Medium
Polylingual Topic Model   Low
◮ MT: Phrasal with NIST-09 data [Galley et al., 2009]
◮ Lexicon: 31k entries (web and LDC sources)
Context Similarity: Polylingual Topic Model
Polylingual Topic Model (PLTM) [Mimno et al., 2009]
◮ Words linked through cross-lingual priors
◮ Training: Wikipedia document tuples
Map context words to 1-best topics
Context Similarity: Polylingual Topic Model (figure from [Mimno et al., 2009])
Context Similarity: Polylingual Topic Model
English context: "The Apple chief executive was former Beatles road manager Neil Aspinall..."
1-best topics: 3 1 6 6 14 5 103 99 6 3 5...
Arabic context: [Arabic text lost in extraction]
1-best topics: 2 1 14 99 7 7 103 79
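The 1-best mapping illustrated above can be sketched as a per-word argmax. This assumes some per-language table of topic scores for each word is available from the trained topic model; the table `TOY_EN` and its probabilities are invented for illustration (only the topic ids 1 and 103 echo the slide), and the function name is hypothetical.

```python
def one_best_topics(words, word_topic_scores, unk=-1):
    """Map each word to its highest-scoring topic id; unknown words get `unk`."""
    topics = []
    for w in words:
        dist = word_topic_scores.get(w.lower())
        topics.append(max(dist, key=dist.get) if dist else unk)
    return topics


# Invented scores, for illustration only.
TOY_EN = {
    "apple": {1: 0.7, 14: 0.3},
    "beatles": {103: 0.6, 6: 0.4},
}

ids = one_best_topics(["Apple", "Beatles", "zzz"], TOY_EN)
```

Because topic ids are shared across languages, contexts in different languages become sequences over the same symbol set and can be compared directly.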
Context Similarity
Bag of words / smoothed unigram distributions
Measure: Jensen-Shannon divergence
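As a worked illustration of this measure, the sketch below builds additively smoothed unigram distributions over a shared vocabulary and computes their Jensen-Shannon divergence (base 2, so the value lies in [0, 1]). The smoothing constant `alpha=0.1` and the function names are assumptions; the talk does not state which smoothing was used. The two token lists reuse the topic ids shown on the PLTM example slide.

```python
import math
from collections import Counter


def smoothed_unigram(tokens, vocab, alpha=0.1):
    """Additively smoothed unigram distribution over `vocab`."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions on the same keys."""
    def kl(a, m):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


english = [3, 1, 6, 6, 14, 5, 103, 99, 6, 3, 5]
arabic = [2, 1, 14, 99, 7, 7, 103, 79]
vocab = sorted(set(english) | set(arabic))

p = smoothed_unigram(english, vocab)
q = smoothed_unigram(arabic, vocab)
d = js_divergence(p, q)
```

Smoothing keeps every probability positive, so the divergence stays finite and strictly below 1 even when the two contexts share few symbols.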
Clustering Algorithms
Constraint-Based Clustering
Two algorithms:
1. Hierarchical clustering
2. Dirichlet process mixture model
Setup:
◮ Mention similarity as a hard constraint
◮ Cluster distance: context similarity
Constraint-Based Clustering
[Figure, built over three slides: the mentions Muammar Qaddafi, al-Qadhafi, El Kazzafi, Apple, and Apple Corps., each with its context. Candidate links are labeled 0.31, 0.40, and 0.15; the final build retains only the link labeled 0.31.]
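The hierarchical variant of this setup can be sketched as agglomerative clustering in which a merge is allowed only if every cross-cluster mention pair passes the mention-similarity test. This is a sketch under stated assumptions, not the authors' implementation: average linkage, the stopping threshold `stop=0.5`, and the toy items (a name, a stand-in compatibility key, and a scalar stand-in for context) are all hypothetical.

```python
from itertools import combinations


def constrained_agglomerative(items, compatible, distance, stop=0.5):
    """Greedy average-linkage clustering with a hard pairwise constraint."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            # Hard constraint: all cross-cluster mention pairs must be compatible.
            if not all(compatible(items[i], items[j]) for i, j in pairs):
                continue
            d = sum(distance(items[i], items[j]) for i, j in pairs) / len(pairs)
            if d < stop and (best is None or d < best[0]):
                best = (d, a, b)
        if best is None:
            break  # no admissible merge is close enough
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]  # safe because a < b
    return clusters


# Toy data: (mention, compatibility key, scalar "context"); values are invented.
items = [
    ("Muammar Qaddafi", "q", 0.10),
    ("al-Qadhafi", "q", 0.20),
    ("El Kazzafi", "q", 0.15),
    ("Apple", "a", 0.12),
    ("Apple Corps.", "a", 0.90),
]
clusters = constrained_agglomerative(
    items,
    compatible=lambda x, y: x[1] == y[1],
    distance=lambda x, y: abs(x[2] - y[2]),
)
```

In a real system `compatible` would be the mention-similarity test and `distance` a context measure such as Jensen-Shannon divergence; here the two "Apple" mentions stay apart because their contexts differ, even though their names are compatible.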
Evaluation
Evaluation Corpus
Automatic Content Extraction (ACE) 2008 Arabic-English
We annotated 216 cross-lingual entities
Genres:
1. broadcast conversation
2. broadcast news
3. meeting
4. newswire
5. telephone
6. usenet
7. weblog
ACE2008 Evaluation Corpus
Language  Docs  Tokens  Entities  Chains  Mentions
Arabic    412   178k    2.6k      4.2k    9.2k
English   414   246k    2.3k      4.0k    9.1k
◮ Chain: set of mentions (within-doc)
◮ Entity: set of chains
Within-document Processing
Our models cluster chains
Evaluation:
◮ Gold chains
◮ Predicted chains from SERIF [Ramshaw et al., 2011]
Evaluation: Gold within-document processing
[Bar chart: B3 and B3 (cross-lingual only) for MT, Lexicon, PLTM, and Baseline; bar labels as extracted: 85.4, 80.4, 78.8, 77.3, 70.1, 66.4, 58.4, 54.5]
In paper: CEAF, NVI
Automatic within-document processing
[Bar chart: B3 for MT, Lexicon, PLTM, and Baseline; bar labels as extracted: 76.7, 76.0, 75.3, 67.0]
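For readers unfamiliar with the B3 scores reported above, here is a short sketch of the standard B-cubed precision, recall, and F1 [Bagga and Baldwin, 1998] over a flat assignment of items to clusters. It assumes gold and predicted clusterings cover the same items; the "cross-lingual only" variant from the talk is not reproduced, and the function name and toy data are illustrative.

```python
def b_cubed(gold, pred):
    """gold, pred: dict mapping item id -> cluster id. Returns (P, R, F1)."""
    def members(assign):
        groups = {}
        for item, cluster in assign.items():
            groups.setdefault(cluster, set()).add(item)
        return {item: groups[cluster] for item, cluster in assign.items()}

    g, p = members(gold), members(pred)
    n = len(gold)
    # Per-item overlap between its gold cluster and its predicted cluster.
    precision = sum(len(g[m] & p[m]) / len(p[m]) for m in gold) / n
    recall = sum(len(g[m] & p[m]) / len(g[m]) for m in gold) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {"a": 1, "b": 1, "c": 1, "d": 2}
pred = {"a": 1, "b": 1, "c": 2, "d": 2}
precision, recall, f1 = b_cubed(gold, pred)
```

In this toy case item "c" is placed with "d" instead of with "a" and "b", which lowers both precision and recall.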