
Full-document Entity Extraction and Disambiguation, Silviu Cucerzan

TAC Entity Linking by Performing Full-document Entity Extraction and Disambiguation Silviu Cucerzan Microsoft Research Machine Learning Group Gaithersburg, MD November 15, 2011


  1. TAC Entity Linking by Performing Full-document Entity Extraction and Disambiguation
     Silviu Cucerzan
     Microsoft Research Machine Learning Group
     Gaithersburg, MD
     November 15, 2011

  2. KBP Entity Linking - Task Description
     For a name string and a document, determine which entity in a given knowledge base, if any, is being referred to by the name string.
     Example queries:
     <query id="EL006455"> <name>Reserve Bank</name> <docid>eng-NG-31-100316-11150589</docid> <entity>E0700143</entity> </query>
     <query id="EL06472"> <name>Reserve Bank</name> <docid>eng-NG-31-142262-10040510</docid> <entity>E0421510</entity> </query>
     Knowledge base (Wikipedia Oct. 2008): … E0421510: Reserve Bank of Australia … E0700143: Reserve Bank of India … NIL
     Evaluation metrics:
     • Linking accuracy (A)
     • Known-entity linking accuracy (A_Wiki)
     • NIL accuracy (A_NIL)
     • B-cubed precision and recall with equal element weighting:
       $P_{\text{B-cubed+}} = \mathrm{Avg}_{x}\Big(\mathrm{Avg}_{x' \mid T(x)=T(x')}\big(\delta(T(x), S(x), S(x'))\big)\Big)$
       $R_{\text{B-cubed+}} = \mathrm{Avg}_{x}\Big(\mathrm{Avg}_{x' \mid S(x)=S(x')}\big(\delta(S(x), T(x), T(x'))\big)\Big)$
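The three accuracy metrics listed on this slide can be illustrated with a small sketch. It assumes that gold and predicted answers are given as dictionaries from query id to a knowledge base id or the string "NIL"; the function name `linking_accuracies` and the query id `q3` are hypothetical, and the B-cubed measures are not covered here.

```python
from typing import Dict, Iterable

NIL = "NIL"  # assumed label for queries whose entity is not in the knowledge base


def linking_accuracies(gold: Dict[str, str], predicted: Dict[str, str]) -> Dict[str, float]:
    """Return A (all queries), A_Wiki (gold entity in the KB) and A_NIL (gold is NIL)."""

    def accuracy(query_ids: Iterable[str]) -> float:
        ids = list(query_ids)
        if not ids:
            return float("nan")  # undefined when the subset is empty
        correct = sum(predicted.get(q) == gold[q] for q in ids)
        return correct / len(ids)

    known = [q for q, entity in gold.items() if entity != NIL]
    nil = [q for q, entity in gold.items() if entity == NIL]
    return {"A": accuracy(gold), "A_Wiki": accuracy(known), "A_NIL": accuracy(nil)}


if __name__ == "__main__":
    gold = {"EL006455": "E0700143", "EL06472": "E0421510", "q3": NIL}
    predicted = {"EL006455": "E0700143", "EL06472": "E0700143", "q3": NIL}
    print(linking_accuracies(gold, predicted))
```

A missing prediction counts as an error in this sketch, which is a design choice rather than something the slide specifies.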

  3. Employed Resources
     Knowledge base:
     • Wikipedia Oct. 2008: 818,741 nodes
     • Wikipedia June 2011: 3.6 million nodes
     Text corpus: 1 million news articles + 300,000 Web documents
     Annotated data, corpus size in entity mentions:

     | Corpus             | Person | Organization | GPE |
     |--------------------|--------|--------------|-----|
     | 2010 Training Web  | 500    | 500          | 500 |
     | 2010 Eval Newswire | 500    | 500          | 500 |
     | 2010 Eval Web data | 250    | 250          | 250 |

  4. How Ambiguous Are Target Names?
     Queries:
     <query id="EL006455"> <name>Reserve Bank</name> <docid>eng-NG-31-100316-11150589</docid> <entity>E0700143</entity> </query>
     <query id="EL06472"> <name>Reserve Bank</name> <docid>eng-NG-31-142262-10040510</docid> <entity>E0421510</entity> </query>
     Knowledge base (Wikipedia Oct. 2008): … E0421510: Reserve Bank of Australia … E0700143: Reserve Bank of India … NIL
     Matching knowledge base entities: 8

  5. How Ambiguous Are Target Names?
     Queries:
     <query id="EL006455"> <name>Reserve Bank</name> <docid>eng-NG-31-100316-11150589</docid> <entity>E0700143</entity> </query> → Reserve Bank of India
     <query id="EL06472"> <name>Reserve Bank</name> <docid>eng-NG-31-142262-10040510</docid> <entity>E0421510</entity> </query>
     Source document:
     <DOCID> eng-NG-31-100316-11150589 </DOCID> <DOCTYPE SOURCE="usenet"> USENET TEXT </DOCTYPE> <DATETIME> 2008-11-08T05:41:05 </DATETIME> <HEADLINE> India Inc cuts jobs, frills to stay in shape </HEADLINE> <TEXT> <POST> <POSTER> "ekam ber" &lt;ekam...@gmail.com&gt; </POSTER> <POSTDATE> 2008-11-08T05:41:05 </POSTDATE>
     NEW DELHI/MUMBAI: Layoffs, firings and salary cuts are increasingly becoming all too common across India Inc, highlighting a deepening slowdown in the economy that has forced companies to take the knife to costs to protect their bottom line.
     From banking and finance to aviation, from manufacturing to information technology, no sector appears immune, as companies look beyond hiring freezes to job cuts, mirroring a trend across much of the developed world which has seen tens of thousands of people out of employment.
     Admittedly India, among the few major global economies that will see respectable GDP growth this year, may not see job losses quite like that being felt in the West, it has nevertheless got policymakers worried.
     Prime Minister Manmohan Singh earlier this week urged industry to desist from laying off people and promised to cut interest rates and levies to shore up the economy. The Reserve Bank of India (RBI) has already turned its attention to driving up growth from containing inflation, and cut key reserve ratios for banks and a short-term interest rate, signalling a bias in favour of lower rates.
     Yet on Friday, news about job cuts came in from different directions.
     L&amp;T Infotech, a wholly-owned subsidiary of the country's largest engineering company Larsen &amp; Toubro (L&amp;T), is shedding up to 5% of its workforce of nearly 10,000 employees, according to market sources. […]

  6. How Ambiguous Are Target Names?
     Queries:
     <query id="EL006455"> <name>Reserve Bank</name> <docid>eng-NG-31-100316-11150589</docid> <entity>E0700143</entity> </query>
     <query id="EL06472"> <name>Reserve Bank</name> <docid>eng-NG-31-142262-10040510</docid> <entity>E0421510</entity> </query>
     Knowledge base (Wikipedia Oct. 2008): … E0421510: Reserve Bank of Australia … E0700143: Reserve Bank of India … NIL
     • Wiki Oct. 2008: “reserve bank” 8 entities
     • Wiki June 2011: “reserve bank” 9 entities
     • 105 surface forms that contain the string “reserve bank”: 68 entities

  7. Full-document Analysis
     • Perform full-document entity extraction, then match target names against the extracted entities and choose the top-ranked matching entity.
     • Allow sub- and super-string matches: in 7% of the instances, the target name does not exactly match any entity reference extracted from the text, e.g.
       “USC” ~ “USC baseball team” → USC Trojans baseball
       “Koran Tempo newspaper” ~ “Koran Tempo” → Koran Tempo
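A minimal sketch of the matching step described on this slide, under stated assumptions: the full-document analysis is assumed to return (surface form, entity, score) triples, matching is done on lowercased whitespace tokens, and exact matches are preferred over sub- and super-string matches. The name `link_target`, the scores, and the tie-breaking policy are illustrative and not taken from the presented system.

```python
from typing import List, Optional, Tuple

# (surface form found in the document, entity it was resolved to, ranking score)
Extracted = Tuple[str, str, float]


def _tokens(text: str) -> List[str]:
    return text.lower().split()


def _contains(longer: List[str], shorter: List[str]) -> bool:
    """True if `shorter` occurs as a contiguous token sequence inside `longer`."""
    m = len(shorter)
    if m == 0:
        return False
    return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))


def link_target(target: str, extracted: List[Extracted]) -> Optional[str]:
    """Return the top-ranked entity whose surface form matches the target name.

    Exact matches win; otherwise sub- and super-string matches are accepted.
    Returns None when nothing matches (a NIL candidate).
    """
    t = _tokens(target)
    exact = [x for x in extracted if _tokens(x[0]) == t]
    loose = [
        x for x in extracted
        if _contains(_tokens(x[0]), t) or _contains(t, _tokens(x[0]))
    ]
    pool = exact or loose
    if not pool:
        return None
    return max(pool, key=lambda x: x[2])[1]


if __name__ == "__main__":
    extracted = [
        ("USC baseball team", "USC Trojans baseball", 0.9),  # scores are made up
        ("Koran Tempo", "Koran Tempo", 0.7),
    ]
    print(link_target("USC", extracted))
    print(link_target("Koran Tempo newspaper", extracted))
```

Token-level containment is used instead of raw substring search so that a short name such as "US" does not match inside "USC".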

  8. Starting Point
     • Productized concept extraction system, trained on Wikipedia from June 2011
       2010 training data set: A = 86.3%
     • Map the provided KB extracted from Oct 2008 data to the 2011 collection
       2010 training data set: A = 88.2%

  9. Overview of the Information Extracted
     Surface Forms → Entities, e.g.: Texas → (≈ 30 entities)
     • Texas (TV Series)
     • Texas (US State)
     • University of Texas Austin
     • USS Texas
     • Texas (band)
     • Texas (musical)
     • Texas (novel)
     • Texas (SpongeBob episode)
     • Texas Instruments
     • Texas County, OK
     • ...
     Information for the entity Texas (TV Series):
     • Topics: NBC network shows, American television soaps, Television spin-offs
     • Contexts: Another World, TV Series, ...

  10. Surface Form to Entity Mappings • the titles of entity pages

  11. Surface Form to Entity Mappings • the titles of redirecting pages, e.g. http://en.wikipedia.org/wiki/Another_World_in_Texas : Another World in Texas → Texas (TV Series)

  12. Surface Form to Entity Mappings • the disambiguation pages

  13. Surface Form to Entity Mappings • the references to entity pages in other articles Texas (TV Series)
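The four sources of surface forms listed on the preceding slides (entity page titles, redirect titles, disambiguation pages, and link text in other articles) could be merged into one dictionary roughly as follows. This is a sketch over toy inputs: the input shapes, the helper `strip_qualifier`, and the choice to drop a trailing parenthetical such as "(TV Series)" when deriving a surface form are assumptions, not details confirmed by the slides.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def strip_qualifier(title: str) -> str:
    """Drop a trailing parenthetical, e.g. 'Texas (band)' -> 'Texas' (assumed convention)."""
    return re.sub(r"\s*\([^)]*\)$", "", title)


def build_surface_form_map(
    titles: Iterable[str],
    redirects: Dict[str, str],
    disambiguations: Dict[str, List[str]],
    anchors: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Map each surface form to the set of entities it may refer to."""
    mapping: Dict[str, Set[str]] = defaultdict(set)
    # 1. Titles of entity pages, with and without the qualifier.
    for title in titles:
        mapping[title].add(title)
        mapping[strip_qualifier(title)].add(title)
    # 2. Titles of redirecting pages point at their target entity.
    for source, target in redirects.items():
        mapping[source].add(target)
    # 3. A disambiguation page lists the entities for its (unqualified) name.
    for page, targets in disambiguations.items():
        for target in targets:
            mapping[strip_qualifier(page)].add(target)
    # 4. Link text used in other articles to refer to an entity page.
    for text, target in anchors:
        mapping[text].add(target)
    return dict(mapping)


if __name__ == "__main__":
    surface_forms = build_surface_form_map(
        titles=["Texas (TV Series)", "Texas (band)"],
        redirects={"Another World in Texas": "Texas (TV Series)"},
        disambiguations={"Texas (disambiguation)": ["Texas (TV Series)", "Texas (band)"]},
        anchors=[("Texas", "Texas (TV Series)")],
    )
    print(sorted(surface_forms["Texas"]))
```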

  14. Topics • List pages (“List of [...]” “Table of [...]”)

  15. Topics • Wikipedia categories

  16. Topics • Lexico-syntactic patterns, e.g. ENUM_Scotland_Music_#1

  17. Topic Statistics
      • List pages: 80k
      • Categories: 456k
      • Lexico-syntactic patterns: 852k
      • Avg. # topics per entity: 4.5
      • Avg. # entities per topic: 12
      [Bar chart: x-axis bins 0 to 20+; y-axis 0 to 1,000,000; bar labels 766,575; 682,715; 569,566; 398,986; 272,745; 216,038; 190,940; 135,352; 97,411; 71,637; 67,766; 53,775; 40,954; 31,786; 24,968; 20,240; 16,423; 13,346; 11,188; 9,348; 7,768]

  18. Disambiguation - Intuition
      S. Cucerzan. “Large-scale Entity Disambiguation Based on Wikipedia Data”. EMNLP 2007
      $C = \{c_1, \ldots, c_M\}$ - known contexts
      $T = \{t_1, \ldots, t_N\}$ - known topic identifiers
      [Diagram: a text document $D$ containing surface forms $s_1, \ldots, s_n$; each surface form $s_i$ has candidate entities $e_i^1, \ldots, e_i^{|\mathcal{E}(s_i)|}$, and each candidate entity has contexts in $C$ and topic identifiers in $T$.]
      Maximize the similarity between the document context $d = D \cap C$ and each entity’s contexts, as well as between the topic identifiers of each entity pair.

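  19. Disambiguation - Intuition
      S. Cucerzan. “Large-scale Entity Disambiguation Based on Wikipedia Data”. EMNLP 2007
      $$\arg\max_{(e_1,\ldots,e_n)\,\in\,\mathcal{E}(s_1)\times\cdots\times\mathcal{E}(s_n)} \; \sum_{i=1}^{n} \langle \delta_{e_i}|_C,\, d \rangle \;+\; \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} \langle \delta_{e_i}|_T,\, \delta_{e_j}|_T \rangle \qquad (1)$$
      More robust and simpler:
      $$\bar{d} = d + \sum_{s \in S(D)} \sum_{e \in \mathcal{E}(s)} \delta_e|_T$$
      $$\arg\max_{(e_1,\ldots,e_n)\,\in\,\mathcal{E}(s_1)\times\cdots\times\mathcal{E}(s_n)} \; \sum_{i=1}^{n} \big\langle (\delta_{e_i}|_C,\, \delta_{e_i}|_T),\; \bar{d} - (0,\, \delta_{e_i}|_T) \big\rangle$$
      $$\arg\max_{e_i \in \mathcal{E}(s_i)} \; \big\langle (\delta_{e_i}|_C,\, \delta_{e_i}|_T),\; \bar{d} \big\rangle - \big\|\delta_{e_i}|_T\big\|^2, \qquad i = 1..n$$
      where $\|\delta_{e_i}|_T\|^2$ = # topic tags of $e_i$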
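One reading of the simplified, per-surface-form rule on this slide can be sketched with binary vectors represented as sets. The assumptions are loud ones: entity contexts and topic tags are plain sets of strings, the document context is the set of known contexts found in the text, and the extended document vector adds the topic tags of every candidate of every surface form. The function name `disambiguate` and all toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def disambiguate(
    doc_contexts: Set[str],
    candidates: Dict[str, List[str]],
    contexts: Dict[str, Set[str]],
    topics: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Pick, for each surface form, the candidate entity with the best score.

    score(e) = |contexts(e) & doc_contexts|
               + sum over topic tags t of e of (number of candidates carrying t)
               - (number of topic tags of e)
    The last term removes the entity's agreement with itself.
    """
    # Topic part of the extended document vector: how many candidates,
    # over all surface forms in the document, carry each topic tag.
    topic_counts: Counter = Counter()
    for entities in candidates.values():
        for entity in entities:
            topic_counts.update(topics.get(entity, set()))

    def score(entity: str) -> int:
        entity_topics = topics.get(entity, set())
        context_overlap = len(contexts.get(entity, set()) & doc_contexts)
        topic_agreement = sum(topic_counts[t] for t in entity_topics)
        return context_overlap + topic_agreement - len(entity_topics)

    return {s: max(entities, key=score) for s, entities in candidates.items()}


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    candidates = {
        "Texas": ["Texas (TV Series)", "Texas (US State)"],
        "Another World": ["Another World (TV series)"],
    }
    topics = {
        "Texas (TV Series)": {"NBC network shows", "American television soaps"},
        "Texas (US State)": {"States of the United States"},
        "Another World (TV series)": {"NBC network shows", "American television soaps"},
    }
    contexts = {
        "Texas (TV Series)": {"Another World", "TV Series"},
        "Texas (US State)": {"Austin"},
        "Another World (TV series)": {"soap opera"},
    }
    print(disambiguate({"Another World", "soap opera"}, candidates, contexts, topics))
```

Because each surface form is resolved independently against the same extended vector, the cost grows with the total number of candidates rather than with the number of joint assignments.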
