Lecture 24: NER & Entity Linking
Kai-Wei Chang, CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
CS6501-NLP 1
Organizing knowledge
• It's a version of Chicago – the standard classic Macintosh menu font, with that distinctive thick diagonal in the "N".
• Chicago was used by default for Mac menus through MacOS 7.6, and OS 8 was released mid-1997.
• Chicago VIII was one of the early 70s-era Chicago albums to catch my ear, along with Chicago II.
Slides are adapted from Dan Roth
Cross-document co-reference resolution
[Figure: the three "Chicago" snippets from the "Organizing knowledge" slide, shown again]
Reference resolution (disambiguation to Wikipedia)
[Figure: the three "Chicago" snippets from the "Organizing knowledge" slide, shown again]
The "Reference" Collection has Structure
[Figure: the three "Chicago" snippets from the "Organizing knowledge" slide, shown again]
Relation labels in the figure: Is_a, Is_a, Used_In, Released, Succeeded
Analysis of Information Networks
[Figure: the three "Chicago" snippets from the "Organizing knowledge" slide, shown again]
Wikipedia as a knowledge resource …
Relation labels in the figure: Is_a, Is_a, Used_In, Released, Succeeded
Wikification: The Reference Problem
Cycles of Knowledge: Grounding for/using Knowledge
Blumenthal (D) is a candidate for the U.S. Senate seat now held by Christopher Dodd (D), and he has held a commanding lead in the race since he entered it. But the Times report has the potential to fundamentally reshape the contest in the Nutmeg State.
Challenges
• Dealing with the ambiguity of natural language
  • Mentions of entities and concepts can have multiple meanings
• Dealing with the variability of natural language
  • A given concept can be expressed in many ways
• Wikification addresses these two issues in a specific way:
  • The Reference Problem
    • What is meant by this concept? (WSD + Grounding)
    • More than just co-reference (within and across documents)
General Challenges
Blumenthal (D) is a candidate for the U.S. Senate seat now held by Christopher Dodd (D), and he has held a commanding lead in the race since he entered it. But the Times report has the potential to fundamentally reshape the contest in the Nutmeg State.
• Ambiguity: "Times", "The Times", "The New York Times"
• Variability: "Connecticut", "CT", "The Nutmeg State"
• Concepts outside of Wikipedia (NIL): Blumenthal?
• Scale: millions of labels
Wikification: Subtasks
• Wikification and Entity Linking require addressing several sub-tasks:
  • Identifying target mentions
    • Mentions in the input text that should be Wikified
  • Identifying candidate titles
    • Candidate Wikipedia titles that could correspond to each mention
  • Candidate title ranking
    • Rank the candidate titles for a given mention
  • NIL detection and clustering
    • Identify mentions that do not correspond to a Wikipedia title
    • Entity Linking: cluster NIL mentions that represent the same entity
High-level Algorithmic Approach
• Input: a text document d. Output: a set of pairs (m_i, t_i)
  • m_i are mentions in d; t_i are the corresponding Wikipedia titles, or NIL
• (1) Identify mentions m_i in d
• (2) Local inference
  • For each m_i in d:
    • Identify a set of relevant titles T(m_i)
    • Rank titles t_i ∈ T(m_i) [e.g., consider local statistics of the edges (m_i, t_i), (m_i, *), and (*, t_i) occurring in the Wikipedia graph]
• (3) Global inference
  • For each document d:
    • Consider all m_i ∈ d and all t_i ∈ T(m_i)
    • Re-rank titles t_i ∈ T(m_i) [e.g., if m, m' are related by virtue of being in d, their corresponding titles t, t' may also be related]
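Steps (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the anchor-link counts in `ANCHOR_COUNTS` are made-up toy numbers, the helper names are hypothetical, and the local score is taken to be the relative frequency count(m, t) / count(m, *), which is one simple way to use the edge statistics mentioned above, not necessarily the one used in the lecture.

```python
from collections import Counter

# Toy, made-up statistics: (mention string, Wikipedia title) -> number of links.
ANCHOR_COUNTS = Counter({
    ("Chicago", "Chicago (typeface)"): 5,
    ("Chicago", "Chicago (band)"): 20,
    ("Chicago", "Chicago"): 75,
    ("Times", "The New York Times"): 60,
    ("Times", "The Times"): 40,
})


def candidate_titles(mention):
    """T(m): every title that the mention string has been linked to."""
    return [t for (m, t) in ANCHOR_COUNTS if m == mention]


def local_score(mention, title):
    """Relative frequency count(m, t) / count(m, *); 0.0 for unseen mentions."""
    total = sum(c for (m, _), c in ANCHOR_COUNTS.items() if m == mention)
    return ANCHOR_COUNTS[(mention, title)] / total if total else 0.0


def link_locally(mentions):
    """Pick the best title for each mention independently, or NIL."""
    pairs = []
    for m in mentions:
        candidates = candidate_titles(m)
        if not candidates:
            pairs.append((m, "NIL"))  # no candidate title at all
            continue
        best = max(candidates, key=lambda t: local_score(m, t))
        pairs.append((m, best))
    return pairs


if __name__ == "__main__":
    print(link_locally(["Chicago", "Times", "Blumenthal"]))
```

Because each mention is scored on its own, "Chicago" always resolves to its most frequent title regardless of context, which is the weakness that global inference in step (3) is meant to address.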
Local approach
[Figure labels: a text document; identified mentions; Wikipedia articles; local score of matching the mention to the title]
• Γ is a solution to the problem (decomposed by m_i)
  • A set of pairs (m, t)
  • m: a mention in the document
  • t: the matched Wikipedia title
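The local approach described here (a solution Γ scored by the local match of each mention to its title, decomposed by m_i) can be written as an objective of the following form. This is a sketch: the symbol φ for the local matching score is an assumption, since the slide text does not name it.

```latex
\Gamma^{*}_{\text{local}} = \arg\max_{\Gamma} \sum_{i} \phi(m_i, t_i),
\qquad \Gamma = \{(m_i, t_i)\},\quad t_i \in T(m_i)
```

Since the sum has one term per mention, the maximization splits into an independent choice of t_i for each m_i.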
Global Approach: Using Additional Structure
[Figure labels: text document(s) (news, blogs, …); Wikipedia articles]
Adding a "global" term to evaluate how good the structure of the solution is.
• Use the local solutions Γ' (each mention considered independently).
• Evaluate the structure based on pairwise coherence scores Ψ(t_i, t_j).
• Choose those that satisfy document coherence conditions.
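A minimal Python sketch of adding a pairwise coherence term to the local scores follows. All numbers in `LOCAL` and `COHERENCE` are made-up toy values, the function names are hypothetical, and the exhaustive search over title combinations is used only because the example is tiny; it is not presented as the lecture's inference procedure.

```python
from itertools import combinations, product

# Toy local scores for each mention's candidate titles (made-up values).
LOCAL = {
    "Chicago": {"Chicago": 0.75, "Chicago (band)": 0.20, "Chicago (typeface)": 0.05},
    "MacOS": {"Classic Mac OS": 1.0},
}

# Toy symmetric coherence scores Psi(t, t') between titles (made-up values).
COHERENCE = {
    frozenset(["Chicago (typeface)", "Classic Mac OS"]): 0.9,
}


def psi(t1, t2):
    """Pairwise coherence of two titles; 0.0 when no relation is recorded."""
    return COHERENCE.get(frozenset([t1, t2]), 0.0)


def link_locally(local):
    """Best title per mention, each mention considered independently."""
    return {m: max(scores, key=scores.get) for m, scores in local.items()}


def link_globally(local):
    """Maximize the sum of local scores plus all pairwise coherence scores."""
    mentions = list(local)
    best_score, best_assignment = float("-inf"), None
    for titles in product(*(local[m] for m in mentions)):
        score = sum(local[m][t] for m, t in zip(mentions, titles))
        score += sum(psi(a, b) for a, b in combinations(titles, 2))
        if score > best_score:
            best_score, best_assignment = score, dict(zip(mentions, titles))
    return best_assignment


if __name__ == "__main__":
    print("local: ", link_locally(LOCAL))
    print("global:", link_globally(LOCAL))
```

In this toy setup the local choice for "Chicago" is the city, while the coherence with "Classic Mac OS" moves the global choice to the typeface. The number of combinations grows exponentially with the number of mentions, which is one reason to start from the local solutions Γ' rather than search the full space.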