Information Extraction Prof. Sameer Singh CS 295: STATISTICAL NLP WINTER 2017 February 21, 2017 Based on slides from Dan Jurafsky, Chris Manning, Jay Pujara, and everyone else they copied from.
Outline What is Information Extraction Named Entity Recognition Homework 3 CS 295: STATISTICAL NLP (WINTER 2017) 2
Making Sense of Text
[Figure: a massive corpus of unstructured text (documents) can be queried directly with a search query (IR), or passed through information extraction to produce a structured representation (a database or graph) that is queried with a DB query.]
News Articles
Query: Which AI startups have been acquired by tech companies?
[Figure: information extraction turns a massive corpus of news articles into a structured representation with the entity types Company, People, and Industry, connected by the relations acquired, founded, employee, belongsTo, and expertIn.]
Fiction
Query: Which two characters are not related by blood?
[Figure: information extraction turns a collection of books into a structured representation.]
Academic Research
Query: What is the interaction pathway between YY1 and TIP60?
[Figure: information extraction turns a massive corpus of scientific papers into a structured representation.]
Applications
[Figure: documents pass through information extraction into a database or graph, which then supports:]
• Question answering
• Visualization & statistics (example chart with counts from 0 to 250 over April to June)
• Downstream AI applications
Low-level Info. Extraction CS 295: STATISTICAL NLP (WINTER 2017) 9
Slightly better… CS 295: STATISTICAL NLP (WINTER 2017) 10
Slightly better?
The headquarters of BHP Billiton Limited, and the global headquarters of the combined BHP Billiton Group, are located in Melbourne, Australia.
headquarters(“BHP Billiton Limited”, “Melbourne, Australia”)
In the industry…
• Google Knowledge Graph
  ◦ Google Knowledge Vault
• Amazon Product Graph
• Facebook Graph API
• IBM Watson
• Microsoft Satori
  ◦ Project Hanover/Literome
• LinkedIn Knowledge Graph
• Yandex Object Answer
• Diffbot, GraphIQ, Maana, ParseHub, Reactor Labs, SpazioDati
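Knowledge Extraction
Text: John was born in Liverpool, to Julia and Alfred Lennon.
Literal facts: [Figure: a graph with the nodes John Lennon, Alfred Lennon, Julia Lennon, and Liverpool. Two childOf edges connect John Lennon to Alfred Lennon and to Julia Lennon, and a birthplace edge connects John Lennon to Liverpool.]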
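The literal facts on this slide can be written down as (subject, relation, object) triples. The sketch below is one minimal way to store them and query the resulting graph; the `Triple` type, the `build_graph` helper, and the edge directions (child as subject) are assumptions made for illustration, not something the slide specifies.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One extracted fact: relation(subject, obj)."""
    subject: str
    relation: str
    obj: str


# Facts read off the slide. Edge direction is an assumption.
facts = [
    Triple("John Lennon", "childOf", "Alfred Lennon"),
    Triple("John Lennon", "childOf", "Julia Lennon"),
    Triple("John Lennon", "birthplace", "Liverpool"),
]


def build_graph(triples):
    """Index triples by subject so outgoing edges are easy to look up."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.relation, t.obj))
    return dict(graph)


graph = build_graph(facts)
# All objects reachable from "John Lennon" through a childOf edge.
parents = [obj for rel, obj in graph["John Lennon"] if rel == "childOf"]
print(parents)
```

Indexing by subject keeps the structure close to the graph drawn on the slide; a real knowledge base would also need entity identifiers rather than raw strings.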
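Role of NLP?
Text: John was born in Liverpool, to Julia and Alfred Lennon.
[Figure: natural language processing annotates the sentence at several levels.]
• Coreference mentions: Lennon.., Mrs. Lennon.., his father, the Pool, John Lennon..., his mother, Alfred, he
• Named entities: John (Person), Liverpool (Location), Julia (Person), Alfred Lennon (Person)
• POS tags: John/NNP was/VBD born/VBD in/IN Liverpool/NNP to/TO Julia/NNP and/CC Alfred/NNP Lennon/NNP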
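Information Extraction
Text: John was born in Liverpool, to Julia and Alfred Lennon.
• Coreference mentions: Lennon.., Mrs. Lennon.., his father, the Pool, John Lennon..., his mother, Alfred, he
• Named entities: John (Person), Liverpool (Location), Julia (Person), Alfred Lennon (Person)
• POS tags: John/NNP was/VBD born/VBD in/IN Liverpool/NNP to/TO Julia/NNP and/CC Alfred/NNP Lennon/NNP
[Figure: information extraction maps these annotations to a graph with the nodes John Lennon, Alfred Lennon, Julia Lennon, and Liverpool, and the edges childOf (twice), spouse, and birthplace.]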
Breaking it Down
Text: John was born in Liverpool, to Julia and Alfred Lennon.
• Sentence level: dependency parsing, part-of-speech tagging, named entity recognition…
  ◦ POS tags: John/NNP was/VBD born/VBD in/IN Liverpool/NNP to/TO Julia/NNP and/CC Alfred/NNP Lennon/NNP
  ◦ Named entities: John (Person), Liverpool (Location), Julia (Person), Alfred Lennon (Person)
• Document level: coreference resolution...
  ◦ Mentions: Lennon.., Mrs. Lennon.., his father, the Pool, John Lennon..., his mother, Alfred, he
• Information extraction: entity resolution, entity linking, relation extraction…
  ◦ [Figure: graph with the nodes John Lennon, Alfred Lennon, Julia Lennon, and Liverpool, and the edges childOf (twice), spouse, and birthplace.]
Outline What is Information Extraction Named Entity Recognition Homework 3 Relation Extraction CS 295: STATISTICAL NLP (WINTER 2017) 17
Named Entity Recognition An important sub-task: find and classify names in text, for example: ◦ The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
Named Entity Recognition
An important sub-task: find and classify names in text, for example:
◦ The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
Entity types highlighted on the slide: Person, Date, Location, Organization
Detecting Named Entities
John was born in Liverpool, to Julia and Alfred Lennon.
Labels: John (Person), Liverpool (Location), Julia (Person), Alfred Lennon (Person)
How it is done:
• Context is important!
  ◦ Georgia, Washington, …
  ◦ John Deere, Thomas Cook, …
  ◦ Princeton, Amazon, …
• Label whole sentence together
  ◦ Structured prediction again
Uses in Knowledge Extraction:
• Mentions describe the nodes
• Types are incredibly important!
  ◦ Often restrict relations
• Fine-grained types are informative!
  ◦ Brooklyn: city
  ◦ Sanders: politician, senator
NER: Entity Types
Stanford CoreNLP:
• 3 class: Location, Person, Organization
• 4 class: Location, Person, Organization, Misc
• 7 class: Location, Person, Organization, Money, Percent, Date, Time
spaCy.io:
• PERSON: People, including fictional.
• NORP: Nationalities or religious or political groups.
• FACILITY: Buildings, airports, highways, bridges, etc.
• ORG: Companies, agencies, institutions, etc.
• GPE: Countries, cities, states.
• LOC: Non-GPE locations, mountain ranges, bodies of water.
• PRODUCT: Objects, vehicles, foods, etc. (Not services.)
• EVENT: Named hurricanes, battles, wars, sports events, etc.
• WORK_OF_ART: Titles of books, songs, etc.
• LANGUAGE: Any named language.
From Stanford CoreNLP (http://nlp.stanford.edu/software/CRF-NER.shtml)
NER: Entity Types Fine-grained Types From Ling & Weld. AAAI 2012 (http://aiweb.cs.washington.edu/ai/pubs/ling-aaai12.pdf) 23
Sequence Labeling for NER CS 295: STATISTICAL NLP (WINTER 2017) 25
Features: Words and Lexicons Words Lexicons CS 295: STATISTICAL NLP (WINTER 2017) 26
Features: Prefixes/Suffixes CS 295: STATISTICAL NLP (WINTER 2017) 27
Features: Substrings of Words
[Figure: bar charts of how often the substrings "oxa", ":", and "field" occur in names of each category (drug, company, movie, place, person), with the examples Cotrimoxazole, Wethersfield, and Alien Fury: Countdown to Invasion.]
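Features: Word Shapes
\[
\mathrm{Shape}(c) =
\begin{cases}
\texttt{X} & \text{if } c \in \text{A-Z} \\
\texttt{x} & \text{if } c \in \text{a-z} \\
\texttt{d} & \text{if } c \in \text{0-9} \\
c & \text{otherwise}
\end{cases}
\]
Word           John    DC-100    CamelCase
Word shapes    Xxxx    XX-ddd    XxxxxXxxx
Short shapes   Xx      X-d       XxXx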
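The shape mapping above can be sketched in a few lines of Python. The function names `char_shape`, `word_shape`, and `short_shape` are illustrative choices, and collapsing each run of identical shape characters into one is an assumption about how the short shapes on the slide are derived (it reproduces all three examples).

```python
import re


def char_shape(c):
    """Map one character to its shape symbol, as in Shape(c)."""
    if "A" <= c <= "Z":
        return "X"
    if "a" <= c <= "z":
        return "x"
    if "0" <= c <= "9":
        return "d"
    return c  # any other character is kept as-is


def word_shape(word):
    """Full word shape: one shape symbol per character."""
    return "".join(char_shape(c) for c in word)


def short_shape(word):
    """Short shape: collapse each run of repeated shape symbols to one."""
    return re.sub(r"(.)\1+", r"\1", word_shape(word))


for w in ["John", "DC-100", "CamelCase"]:
    print(w, word_shape(w), short_shape(w))
```

Short shapes generalize better across word lengths, while full shapes keep length information; a feature set can include both.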
Features: Surrounding Context
Example: John (i-1) Deere (i) announced (i+1)
The features of the token at position i are also added to its neighbors, with a NEXT_ prefix at i-1 and a PREV_ prefix at i+1:
i-1                     i                   i+1
NEXT_BIAS               BIAS                PREV_BIAS
NEXT_WORD=Deere         WORD=Deere          PREV_WORD=Deere
NEXT_LWORD=deere        LWORD=deere         PREV_LWORD=deere
NEXT_FIRSTCAP=True      FIRSTCAP=True       PREV_FIRSTCAP=True
NEXT_SSHAPE=Xx          SSHAPE=Xx           PREV_SSHAPE=Xx
NEXT_LEXICON=company    LEXICON=company     PREV_LEXICON=company
…                       …                   …
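A rough sketch of how such context features might be generated is shown below. The feature names follow the slide, but the functions `token_features` and `sentence_features`, the single-word lexicon lookup, and the toy `lexicon` dictionary are assumptions made for illustration; they are not the homework's actual interface.

```python
import re


def short_shape(word):
    """Short word shape: map characters to X/x/d and collapse repeats."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(word, lexicon):
    """Features that depend only on the token itself."""
    feats = [
        "BIAS",
        f"WORD={word}",
        f"LWORD={word.lower()}",
        f"FIRSTCAP={word[:1].isupper()}",
        f"SSHAPE={short_shape(word)}",
    ]
    # Hypothetical lexicon: lowercased word -> category name.
    if word.lower() in lexicon:
        feats.append(f"LEXICON={lexicon[word.lower()]}")
    return feats


def sentence_features(words, lexicon):
    """Add PREV_/NEXT_ copies of the neighbors' features to each token."""
    base = [token_features(w, lexicon) for w in words]
    out = []
    for i in range(len(words)):
        feats = list(base[i])
        if i > 0:
            feats += ["PREV_" + f for f in base[i - 1]]
        if i + 1 < len(words):
            feats += ["NEXT_" + f for f in base[i + 1]]
        out.append(feats)
    return out


lexicon = {"deere": "company"}  # toy lexicon, for illustration only
feats = sentence_features(["John", "Deere", "announced"], lexicon)
print(feats[0])
```

With this layout, the features of "Deere" show up as `NEXT_...` on "John" and as `PREV_...` on "announced", matching the three columns of the slide.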
Outline What is Information Extraction Named Entity Recognition Homework 3 Relation Extraction CS 295: STATISTICAL NLP (WINTER 2017) 31
Sequence Tagging on Twitter
Parts of Speech
Tokens:  What  a    productive  day   .  Not  .
Tags:    PRON  DET  ADJ         NOUN  .  ADV  .
Named Entity Recognition
Tokens:  ‘  Breaking  Dawn     ’  Returns  to  Vancouver  on  January  11th
Tags:    O  B-MOVIE   I-MOVIE  O  O        O   B-GEO-LOC  O   O        O
Steedman, 2000
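The NER tags above use a B-/I-/O scheme: B- starts an entity, I- continues it, and O marks tokens outside any entity. As a small illustration, the sketch below converts a tag sequence back into labeled spans; `bio_to_spans` is a hypothetical helper, and treating a stray I- tag as the start of a new entity is a design assumption.

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token and BIO tag lists into (label, text) spans."""
    spans = []
    start, label = None, None
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            # The open entity ends just before position i.
            spans.append((label, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # Start a new entity (a stray I- tag is treated like B-).
            start, label = i, tag[2:]
    return spans


tokens = ["‘", "Breaking", "Dawn", "’", "Returns", "to",
          "Vancouver", "on", "January", "11th"]
tags = ["O", "B-MOVIE", "I-MOVIE", "O", "O", "O",
        "B-GEO-LOC", "O", "O", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)  # [('MOVIE', 'Breaking Dawn'), ('GEO-LOC', 'Vancouver')]
```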
Sequence Tagging Models
• Logistic Regression
• Conditional Random Fields
What do you have to do?
• Feature Engineering
  ◦ Test data will be released very close to the deadline!
• Viterbi Algorithm
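For orientation, here is a generic sketch of Viterbi decoding for a linear-chain model with additive (log-space) scores. It is not the homework's starter code: the function signature, the separate `start` and `end` score vectors, and the list-of-lists score layout are all assumptions chosen for readability.

```python
def viterbi(emission, transition, start, end):
    """Find the highest-scoring tag sequence.

    emission[i][y]   : score of tag y at position i
    transition[p][y] : score of moving from tag p to tag y
    start[y], end[y] : scores for starting / ending in tag y
    Returns (best_score, best_path) with tags as integer indices.
    """
    n, k = len(emission), len(start)
    score = [[0.0] * k for _ in range(n)]
    back = [[0] * k for _ in range(n)]

    for y in range(k):
        score[0][y] = start[y] + emission[0][y]

    for i in range(1, n):
        for y in range(k):
            # Best previous tag for ending at tag y in position i.
            best_prev = max(
                range(k), key=lambda p: score[i - 1][p] + transition[p][y]
            )
            back[i][y] = best_prev
            score[i][y] = (
                score[i - 1][best_prev] + transition[best_prev][y] + emission[i][y]
            )

    last = max(range(k), key=lambda y: score[n - 1][y] + end[y])
    best_score = score[n - 1][last] + end[last]

    # Follow back-pointers from the last position to the first.
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return best_score, path


# Toy example: picking the best tag per position independently gives
# [0, 1], but the -5.0 transition penalty makes [0, 0] the best sequence.
emission = [[2.0, 0.0], [0.0, 1.0]]
transition = [[0.0, -5.0], [0.0, 0.0]]
result = viterbi(emission, transition, [0.0, 0.0], [0.0, 0.0])
print(result)  # (2.0, [0, 0])
```

The toy example shows why the whole sentence is labeled together: transition scores can overrule the locally best tag.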
Upcoming…
Homework
• Homework 3 is due on February 27
• Write-up and data have been released.
Project
• Status report due in 1.5 weeks: March 2, 2017
• Instructions coming soon
• Only 5 pages
Summaries
• Paper summaries: February 28, March 14
• Only 1 page each