Entity Clustering Across Languages NAACL 2012 Montreal Spence - PowerPoint PPT Presentation

Entity Clustering Across Languages
NAACL 2012, Montreal
Spence Green*, Nicholas Andrews#, Matthew R. Gormley#, Mark Dredze#, Christopher D. Manning*
*Stanford University, #Johns Hopkins University

One Entity, Many Names
Qaddafi, Muammar
Al-Gathafi, Muammar
al-Qadhafi, Muammar
Al Qathafi, Mu’ammar
Al Qathafi, Muammar
El Gaddafi, Moamar
El Kadhafi, Moammar
El Kazzafi, Moamer
El Qathafi, Mu’Ammar

Basic Task: Entity Clustering
Cluster co-referent entity mentions across a corpus (documents and languages)
Clustering/disambiguation relies on:
◮ Mention similarity
◮ Context similarity

Entity Disambiguation: Mention Similarity
Qaddafi, Muammar; Al-Gathafi, Muammar; al-Qadhafi, Muammar; Al Qathafi, Mu’ammar; Al Qathafi, Muammar; El Gaddafi, Moamar; El Kadhafi, Moammar; El Kazzafi, Moamer; El Qathafi, Mu’Ammar

Entity Disambiguation: Context Similarity
[Figure: the mention "Apple" and mentions in other scripts (lost in extraction), each labeled with a referent: Apple Inc. (three times), town in Lebanon, camel]

Entity Disambiguation: Context Similarity
"The Apple chief executive was former Beatles road manager Neil Aspinall..."
Sentential context is usually required

Old: Entity Clustering Tasks
Within-doc coref
  Peter said to himself ...
Cross-doc coref [Bagga and Baldwin, 1998] [Baron and Freedman, 2008]
  doc1: Peter Jones said ...
  doc2: I told Mr. Jones ...
Entity Linking [McNamee et al., 2011] [Rao et al., 2011]
  Mr. Jones, Peter Jones, Peter

New: Entity Clustering Across Languages
Within-doc coref
  Peter said to himself ...
Cross-doc coref [Bagga and Baldwin, 1998] [Baron and Freedman, 2008]
  doc1: Peter Jones said ...
  doc2: I told Mr. Jones ...
Entity Linking [McNamee et al., 2011] [Rao et al., 2011]
  Mr. Jones, Peter Jones, Peter
This paper
  doc1: Peter Jones said ...
  doc2: [text in another script, lost in extraction]

New: Entity Clustering Across Languages
"The Apple chief executive was former Beatles road manager Neil Aspinall..."
[text in another script, lost in extraction]
No knowledge base of entities

Why?
Crisis management
Arab Spring (2011)
◮ French, Arabic dialects
◮ Facebook, Twitter, blog...
Haiti earthquake (2010)
◮ Kreyol, English, French
◮ SMS, Twitter, blog...
Other applications: machine translation, name search

Plan: Extend Existing Monolingual System
[Bar chart of B3 scores: English 91.4, Arabic 89.8, English+Arabic 78.8]

Cross-lingual Mention Similarity
Cross-lingual Context Similarity
Clustering Algorithms
Evaluation

Within-language: Edit Distance
m_i = Muammar Qaddafi
m_j = Moamer El Kazzafi
sim(m_i, m_j) = ?
Algorithm: Sorted Jaro-Winkler [Christen, 2006]
1. Sort tokens
2. Compute Jaro-Winkler distance in O(|m_i| + |m_j|)
3. Evaluate sim(m_i, m_j) < β
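
The three steps of Sorted Jaro-Winkler can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: lowercasing, the prefix weight of 0.1, the four-character prefix cap, and reading step 3 as "distance below β" are all assumptions, and this straightforward version is not tuned to the linear-time bound quoted on the slide.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    s_matched = [c for c, hit in zip(s, s_hit) if hit]
    t_matched = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed standard settings)."""
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1.0 - sim)


def sorted_jaro_winkler(m_i: str, m_j: str) -> float:
    """Step 1: sort tokens. Step 2: compare the re-joined strings."""
    def normalize(mention: str) -> str:
        return " ".join(sorted(mention.lower().split()))
    return jaro_winkler(normalize(m_i), normalize(m_j))


def name_distance(m_i: str, m_j: str) -> float:
    """Distance form used for step 3: compare against a threshold beta."""
    return 1.0 - sorted_jaro_winkler(m_i, m_j)
```

Sorting tokens first makes the comparison insensitive to name order, so "Muammar Qaddafi" and "Qaddafi Muammar" come out at distance 0. The threshold β would have to be tuned on held-out data; no value is given in the slides.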

Cross-language: Binary classifier
m_i = map(Apple) = abbl
m_j = map([mention in another script, lost in extraction]) = abl
sim(m_i, m_j) = ?
Algorithm: Phonetic mapping + classification
1. Apply deterministic mapping map(·)
2. Extract character-level features
3. Classify (Maxent)

Cross-language: Binary classifier
Training: parallel name list
◮ 97.1% accuracy on a held-out set
Phonetic mapping: think of Soundex
Best features: character bigrams
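
The mapping and feature-extraction steps might look something like the sketch below. Everything here is hypothetical: `toy_map` is an invented rule set chosen only so that it reproduces the slide's example map(Apple) = abbl, not the authors' mapping table, and the feature names are illustrative. The resulting feature dictionaries would be fed to a maxent (logistic regression) classifier trained on the parallel name list, which is not shown.

```python
VOWELS = set("aeiou")


def toy_map(name: str) -> str:
    """Hypothetical deterministic phonetic mapping (not the authors' table).

    Lowercases, merges p into b, and drops non-initial vowels.
    """
    s = "".join(ch for ch in name.lower() if ch.isalpha()).replace("p", "b")
    if not s:
        return ""
    return s[0] + "".join(ch for ch in s[1:] if ch not in VOWELS)


def char_bigrams(s: str) -> set:
    """Character bigrams with boundary markers ^ and $."""
    padded = "^" + s + "$"
    return {padded[k:k + 2] for k in range(len(padded) - 1)}


def pair_features(m_i: str, m_j: str) -> dict:
    """Character-level features for a pair of already-mapped mentions."""
    a, b = char_bigrams(m_i), char_bigrams(m_j)
    feats = {f"shared={g}": 1.0 for g in a & b}
    feats.update({f"unshared={g}": 1.0 for g in a ^ b})
    feats["length_diff"] = float(abs(len(m_i) - len(m_j)))
    return feats
```

For the slide's pair, `pair_features("abbl", "abl")` yields four shared bigrams and a single unshared one (`bb`), which is the kind of evidence a classifier could learn to treat as a match.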

Context Similarity: Mapping techniques
Method                    Resource level
Machine Translation       High
Lexicon                   Medium
Polylingual Topic Model   Low
◮ MT: Phrasal with NIST-09 data [Galley et al., 2009]
◮ Lexicon: 31k entries (web and LDC sources)

Context Similarity: Polylingual Topic Model
Polylingual Topic Model (PLTM) [Mimno et al., 2009]
◮ Words linked through cross-lingual priors
◮ Training: Wikipedia document tuples
Map context words to 1-best topics
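
Mapping context words to their 1-best topics can be pictured as a simple lookup-and-argmax. The sketch below assumes a per-language table of word-to-topic probabilities has already been read off a trained PLTM; the table contents, the function name, and the choice to skip unknown words are all illustrative assumptions.

```python
def one_best_topics(tokens, word_topic_probs):
    """Replace each known word with its highest-probability topic id.

    word_topic_probs: dict mapping word -> {topic_id: probability}.
    Words missing from the table are skipped (an assumption).
    """
    mapped = []
    for tok in tokens:
        dist = word_topic_probs.get(tok)
        if dist:
            mapped.append(max(dist, key=dist.get))
    return mapped


# Invented toy table for illustration only.
toy_table = {
    "apple": {3: 0.7, 14: 0.3},
    "chief": {1: 0.9, 6: 0.1},
    "executive": {6: 0.8, 1: 0.2},
}
topic_ids = one_best_topics(["apple", "chief", "unknown", "executive"], toy_table)
```

Because topic ids are shared across languages, contexts from different languages end up in the same symbol space and can be compared directly.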

Context Similarity: Polylingual Topic Model
[Figure credited to Mimno et al., 2009]

Context Similarity: Polylingual Topic Model
English: The Apple chief executive was former Beatles road manager Neil Aspinall...
Topics:  3 1 6 6 14 5 103 99 6 3 5...
Other language: [text lost in extraction]
Topics:  2 1 14 99 7 7 103 79

Context Similarity
Bag of words / smoothed unigram distributions
Measure: Jensen-Shannon divergence
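
A minimal sketch of this measure, assuming additive (add-alpha) smoothing over a shared vocabulary and base-2 logarithms; the slides say only "smoothed", so the smoothing scheme and the value of `alpha` are assumptions.

```python
import math
from collections import Counter


def smoothed_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tok for tok in tokens if tok in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p[w] == 0 contribute nothing."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by 1 bit."""
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy contexts, already mapped into a common vocabulary.
ctx_a = ["apple", "chief", "executive", "beatles"]
ctx_b = ["apple", "executive", "iphone"]
vocab = set(ctx_a) | set(ctx_b)
p = smoothed_unigram(ctx_a, vocab)
q = smoothed_unigram(ctx_b, vocab)
divergence = js_divergence(p, q)
```

Unlike plain KL divergence, the Jensen-Shannon form is symmetric and always finite, which makes it convenient as a cluster distance.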

Constraint-Based Clustering
Two algorithms:
1. Hierarchical clustering
2. Dirichlet process mixture model
Setup:
◮ Mention similarity as a hard constraint
◮ Cluster distance: context similarity
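
The hierarchical variant of this setup could be sketched as below. This is one plausible reading, not the authors' algorithm: requiring every cross-cluster pair to pass the mention-similarity test, using average linkage over context distance, and stopping at a `max_distance` threshold are all assumptions, as are the toy items and numbers.

```python
from itertools import combinations


def constrained_agglomerative(items, compatible, distance, max_distance):
    """Greedy bottom-up clustering with a hard mention-similarity constraint.

    compatible(a, b) -> bool : mention-similarity test (hard constraint)
    distance(a, b)   -> float: context distance between two items
    """
    clusters = [[x] for x in items]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            ci, cj = clusters[i], clusters[j]
            # Hard constraint: never merge clusters with an incompatible pair.
            if not all(compatible(a, b) for a in ci for b in cj):
                continue
            # Average-linkage cluster distance (an assumption).
            d = sum(distance(a, b) for a in ci for b in cj) / (len(ci) * len(cj))
            if d <= max_distance and (best is None or d < best[0]):
                best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected


# Invented toy data: two name variants and two "Apple" mentions.
ok_pairs = {frozenset(("q1", "q2")), frozenset(("apple", "apple_corps"))}
dists = {frozenset(("q1", "q2")): 0.2, frozenset(("apple", "apple_corps")): 0.6}


def compatible(a, b):
    return frozenset((a, b)) in ok_pairs


def distance(a, b):
    return dists.get(frozenset((a, b)), 0.1)


result = constrained_agglomerative(
    ["q1", "q2", "apple", "apple_corps"], compatible, distance, max_distance=0.5
)
```

In the toy run, the two name variants merge because they are both compatible and close in context, while the two "Apple" mentions stay apart: their names match, but their contexts are too far apart.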

Constraint-Based Clustering
[Figure, three builds: the mentions Muammar Qaddafi, al-Qadhafi, El Kazzafi, Apple, and Apple Corps., shown with their contexts. The second build adds the pairwise values 0.31, 0.40, and 0.15; the final build keeps only 0.31.]

Evaluation Corpus
Automatic Content Extraction (ACE) 2008 Arabic-English
We annotated 216 cross-lingual entities
Genres:
1. broadcast conversation
2. broadcast news
3. meeting
4. newswire
5. telephone
6. usenet
7. weblog

ACE2008 Evaluation Corpus
          Docs  Tokens  Entities  Chains  Mentions
Arabic     412    178k      2.6k    4.2k      9.2k
English    414    246k      2.3k    4.0k      9.1k
◮ Chain: set of mentions (within-doc)
◮ Entity: set of chains

Within-document Processing
Our models cluster chains
Evaluation:
◮ Gold chains
◮ Predicted chains from SERIF [Ramshaw et al., 2011]

Evaluation: Gold within-document processing
[Bar chart. Legend: MT, Lexicon, PLTM, Baseline. Groups: B3 and B3 (cross-lingual only). Bar labels as extracted: 85.4, 80.4, 78.8, 77.3, 70.1, 66.4, 58.4, 54.5; the assignment of values to systems is not recoverable from the extraction.]
In paper: CEAF, NVI
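
For readers unfamiliar with the B3 (B-cubed) metric named on the chart, a minimal sketch of the standard definition follows. It scores arbitrary items, which here could be chains; whether the slides report F1 and exactly which items are scored are assumptions, and the toy assignments are invented.

```python
def b_cubed(gold, predicted):
    """B-cubed precision, recall, and F1.

    gold, predicted: dicts mapping each item to a cluster label,
    defined over the same set of items.
    """
    def members(assignment):
        groups = {}
        for item, label in assignment.items():
            groups.setdefault(label, set()).add(item)
        return groups

    gold_groups, pred_groups = members(gold), members(predicted)
    precision = recall = 0.0
    for item in gold:
        g = gold_groups[gold[item]]
        p = pred_groups[predicted[item]]
        overlap = len(g & p)  # always >= 1: the item itself
        precision += overlap / len(p)
        recall += overlap / len(g)
    n = len(gold)
    precision, recall = precision / n, recall / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented toy example: item b is wrongly grouped with c instead of a.
gold = {"a": 1, "b": 1, "c": 2}
pred = {"a": "x", "b": "y", "c": "y"}
scores = b_cubed(gold, pred)
```

Each item is scored by how much of its predicted cluster is correct (precision) and how much of its true cluster was recovered (recall), so a single misplaced item lowers the score for itself and for its neighbors.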

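Automatic within-document processing
[Bar chart of B3 scores. Legend: MT, Lexicon, PLTM, Baseline. Bar labels as extracted: 76.7, 76.0, 75.3, 67.0; the assignment of values to systems is not recoverable from the extraction.]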
