Tulip: Lightweight Entity Recognition and Disambiguation Using - PowerPoint PPT Presentation

Tulip: Lightweight Entity Recognition and Disambiguation Using Wikipedia-Based Topic Centroids
Marek Lipczak, Arash Koushkestani, Evangelos Milios

Problem definition
• The goal of Entity Recognition and Disambiguation (ERD)
  - Identify mentions of entities
  - Link the mentions to a relevant entry in an external knowledge base
  - The knowledge base is typically a large subset of Wikipedia articles
• Example: The selling offsets decent earnings from Cisco Systems and Home Depot. Techs fall, led by Microsoft and Intel. Michael Kors rises. Gold and oil slip.

Recognition and Disambiguation
The selling offsets decent earnings from Cisco Systems and Home Depot. Techs fall, led by Microsoft and Intel. Michael Kors rises. Gold and oil slip.
• Recognition
  - Is this a valid mention of an entity present in the knowledge base?
• Disambiguation
  - Which of the potential entities (senses) is correct?
• Default sense: the entity with the largest number of wiki-links that use the mention as the anchor text
  - Tulip focuses on default sense entities
  - The main goal is to recognize whether the default sense is consistent with the document
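The default sense definition above can be illustrated with a minimal sketch. It assumes wiki-links are available as (anchor text, target entity) pairs; the function name `default_senses` and the sample links are hypothetical and only serve the illustration.

```python
from collections import Counter, defaultdict

def default_senses(wiki_links):
    """Map each anchor text to the entity it links to most often."""
    counts = defaultdict(Counter)
    for anchor, entity in wiki_links:
        counts[anchor][entity] += 1
    # The default sense is the most frequent link target for the anchor.
    return {anchor: c.most_common(1)[0][0] for anchor, c in counts.items()}

# Hypothetical wiki-links: (anchor text, linked article).
links = [
    ("Intel", "Intel"),
    ("Intel", "Intel"),
    ("Intel", "Intelligence"),
    ("Gold", "Gold"),
    ("Gold", "Gold (color)"),
    ("Gold", "Gold"),
]
senses = default_senses(links)
```

Recognition then reduces to deciding, per mention, whether this single candidate fits the document.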

Our background
• Visual Text Analytics Lab
  - Some experience with using ERD systems
  - No experience implementing ERD systems
• Key issue with state-of-the-art systems: obvious false positive mistakes
  - Visualize Prof. Smith's research interests:
    Data Mining, Machine Learning, 50 cent
• Our goal: minimize the number of false positives

Tulip – system overview
• Spotter
  - Find all mentions of entities in the text (Solr Text Tagger)
  - Special handling for personal names
• Recognizer
  - Retrieve profiles of spotted entities (from Sunflower)
  - Generate a topic centroid representing the document
  - Select entities consistent with the document

Solr Text Tagger
• Solr (Lucene) is a text search engine
  - Indexes textual documents
  - Retrieves documents for keyword-based queries
• Solr Text Tagger
  - Indexes entity surface forms stored in a lexicon
    E.g., Baltimore Ravens, Ravens, Baltimore (…)
  - Uses full text documents as queries
  - Finds all entity mentions in the document
  - Retrieves the mentioned entities (candidate selection)
  - Implemented based on Solr's Finite State Transducers
    By David Smiley and Rupert Westenthaler (thanks!)
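The idea of using the whole document as a query against a lexicon of surface forms can be sketched in a few lines of Python. This is a naive stand-in, not the Solr Text Tagger or its finite state transducer implementation: the function `spot_mentions`, the whitespace tokenization, and the sample lexicon are all assumptions made for illustration.

```python
def spot_mentions(text, lexicon, max_len=4):
    """Return (start_token, end_token, surface_form, candidates) for every lexicon match."""
    tokens = text.split()
    mentions = []
    for i in range(len(tokens)):
        # Try every phrase of up to max_len tokens starting at token i.
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            surface = " ".join(tokens[i:j]).strip(".,")
            if surface in lexicon:
                mentions.append((i, j, surface, lexicon[surface]))
    return mentions

# Hypothetical lexicon: surface form -> candidate entities.
lexicon = {
    "Microsoft": ["Microsoft"],
    "Intel": ["Intel"],
    "Baltimore Ravens": ["Baltimore Ravens"],
    "Ravens": ["Baltimore Ravens"],
}
mentions = spot_mentions("Techs fall, led by Microsoft and Intel.", lexicon)
```

Each match carries its candidate entities, which corresponds to the candidate selection step listed above.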

Building the lexicon
• Three sources of entity surface forms (external datasets)
  - Entity names (from Freebase)
  - Wiki-links anchor text (from Wikipedia)
  - Web anchor text (from Google's Wikilinks corpus)
• Special handling of personal names
  - “Jack” and “London” are not allowed as surface forms for Jack London
  - Instead they are indexed as “generic” personal names and will be matched only if Jack London is mentioned by his full name
• Flagging suspicious surface forms (e.g., “It”, Stephen King's novel)
  - The stop-word filter marks all stop-words or phrases composed of stop-words (e.g., This is)
  - The Wiktionary filter marks all common nouns, verbs, adjectives, etc. found in Wiktionary
  - The lower-case filter marks all lower-case words or phrases
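The three filters for suspicious surface forms could look roughly like the sketch below. The word lists are tiny hypothetical stand-ins for a real stop-word list and for Wiktionary, and the case-insensitive Wiktionary lookup is an assumption; the slides do not specify these details.

```python
# Hypothetical stand-ins for a real stop-word list and for Wiktionary entries.
STOP_WORDS = {"the", "this", "is", "it", "a", "of"}
COMMON_WORDS = {"gold", "oil", "fall", "selling"}

def is_suspicious(surface):
    """Flag a surface form if any of the three filters matches."""
    lowered = [w.lower() for w in surface.split()]
    # Stop-word filter: a stop-word or a phrase made only of stop-words.
    if all(w in STOP_WORDS for w in lowered):
        return True
    # Wiktionary filter: the form is a common word (assumed case-insensitive).
    if surface.lower() in COMMON_WORDS:
        return True
    # Lower-case filter: the whole word or phrase is written in lower case.
    if surface == surface.lower():
        return True
    return False

flags = {s: is_suspicious(s) for s in ["It", "This is", "Gold", "techs", "Microsoft", "Home Depot"]}
```

Flagged forms stay in the lexicon; as the later slides describe, they are only kept out of the entity core and can still be scored against the document.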

Spotter – example
The [1] (...) [97] selling offsets decent earnings from Cisco Systems [1] and Home Depot [1]. Techs fall (1) (...) [7], led by Microsoft [1] (...) [13] and Intel [1] (...) [9]. Michael Kors [1] rises. Gold (1) (...) [31] and oil slip.
• Default sense for all mentions (Freebase only)
• Default sense for all mentions (Freebase + Wikipedia)
• Suspicious mentions removed
• How can we remove Michael Kors and bring back Home Depot?
  - Relatedness of entities to the document

Relatedness score
The selling offsets decent earnings from Cisco Systems and Home Depot. Techs fall, led by Microsoft and Intel. Michael Kors rises. Gold and oil slip.
How strongly is each entity related to the document?
• Our solution
  - Retrieve a profile of every entity mentioned in the text
  - Aggregate the profiles into a centroid representing the document
  - Check which entities are coherent with the topics (relatedness score)
  - How do we create the entity profiles?

Relatedness – Sunflower
• A concept graph based on a unified category graph from 120 Wikipedia language versions
  - Each language version acts like a witness for the importance of a stored relation
• Compact and accurate category profiles for all Wikipedia articles
  - Removal of unimportant categories
  - Inference of more general categories

Sunflower – from graph to term profile
• Sunflower graph is:
  - Directed
  - Weighted (importance score)
  - Sparse (only k most important links per node)
• Category-based profile is a sparse, weighted term vector
  - All categories at distance < d
  - Term weights based on edge weights
  - E.g., k = 3, d = 2
  - Path weight is the product of edge weights
    w(Intel → Comp. of US → Ec. of US) = 0.42 * 0.27 = 0.11
  - Category weight is the sum of path weights
    w(Ec. of US) = 0.11 + 0.19 = 0.3
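The path-product and path-sum rules can be reproduced with a short sketch. The graph below is hypothetical apart from the two edge weights quoted on the slide (0.42 and 0.27); the second path, through a made-up "Other category" node, is chosen only so that its weight equals the 0.19 from the example. The sketch also assumes that paths of up to `max_edges` edges are followed, which matches the two-edge path shown for d = 2.

```python
from collections import defaultdict

def category_profile(graph, entity, max_edges):
    """Sum, per category, the products of edge weights over all paths of up to max_edges edges."""
    profile = defaultdict(float)
    frontier = [(entity, 1.0)]  # (node, weight of the path that reached it)
    for _ in range(max_edges):
        next_frontier = []
        for node, path_weight in frontier:
            for category, edge_weight in graph.get(node, {}).items():
                w = path_weight * edge_weight  # path weight = product of edge weights
                profile[category] += w         # category weight = sum of path weights
                next_frontier.append((category, w))
        frontier = next_frontier
    return dict(profile)

# Hypothetical excerpt of a Sunflower-like graph: node -> {category: edge weight}.
graph = {
    "Intel": {"Comp. of US": 0.42, "Other category": 0.38},
    "Comp. of US": {"Ec. of US": 0.27},
    "Other category": {"Ec. of US": 0.5},
}
profile = category_profile(graph, "Intel", max_edges=2)
```

With these weights the first path contributes 0.42 * 0.27 = 0.1134 and the second 0.38 * 0.5 = 0.19, so "Ec. of US" receives about 0.30, in line with the rounded figures on the slide.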

Topic centroids in Tulip
• Retrieve category-based profiles for all default senses (example next slide)
• Topic Centroid Generation
  - Centroid is a linear combination of entity profiles
  - Default senses of non-suspicious mentions only (entity core)
• Topic Centroid Refinement
  - Entities far from the centroid are removed from the core
  - Cosine similarity with predefined threshold t_coh = 0.2
• Entity Scoring
  - Relatedness score assigned to each default sense entity (including suspicious mentions)
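The generation, refinement, and scoring steps can be sketched as follows. Several details are assumptions rather than facts from the slides: the linear combination is taken to be an unweighted sum, refinement is done in a single pass, the centroid is recomputed after refinement, and the entity profiles are invented toy vectors. Only the threshold t_coh = 0.2 comes from the source.

```python
import math
from collections import defaultdict

T_COH = 0.2  # coherence threshold from the slide

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(profiles):
    """Unweighted sum of profiles (one possible linear combination)."""
    total = defaultdict(float)
    for profile in profiles:
        for term, weight in profile.items():
            total[term] += weight
    return dict(total)

def score_entities(profiles, suspicious, t_coh=T_COH):
    # Generation: the core holds default senses of non-suspicious mentions.
    core = [e for e in profiles if e not in suspicious]
    topic = centroid([profiles[e] for e in core])
    # Refinement: drop core entities that are far from the centroid.
    core = [e for e in core if cosine(profiles[e], topic) >= t_coh]
    topic = centroid([profiles[e] for e in core])
    # Scoring: every entity, including suspicious ones, gets a relatedness score.
    return {e: cosine(p, topic) for e, p in profiles.items()}

# Invented toy profiles: entity -> {category: weight}.
profiles = {
    "Intel": {"tech": 1.0, "economy": 0.3},
    "Microsoft": {"tech": 0.9, "economy": 0.2},
    "Cisco Systems": {"tech": 0.8, "economy": 0.4},
    "Michael Kors": {"fashion": 0.5},
    "Gold": {"economy": 0.5, "metal": 0.8},
}
scores = score_entities(profiles, suspicious={"Gold"})
```

In this toy data Michael Kors falls below the threshold and is dropped from the core, so its final score is zero, while the suspicious mention Gold still receives a small positive score through its shared "economy" category.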
