Multi-Source Information Extraction
Valentin Tablan
University of Sheffield
University of Sheffield, NLP

Multi-Source IE

[Diagram: inputs 1…N each feed an Information Extraction step; the per-source results are merged into a single output (template / ontology).]

□ Redundant sources: better precision.
□ Complementary sources: better recall.

2009 GATE Summer School, Sheffield
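The precision/recall trade-off above can be illustrated with a minimal sketch (not from the slides; entity representation and function names are assumptions): redundant sources let us keep only entities confirmed by several extractors, while complementary sources are simply unioned.

```python
# Sketch: merging per-source extraction results.
# Redundant sources -> vote, keeping confirmed entities (better precision).
# Complementary sources -> union of everything found (better recall).
from collections import Counter

def merge_redundant(results, min_votes=2):
    """Keep entities found by at least `min_votes` sources."""
    votes = Counter(e for source in results for e in set(source))
    return {e for e, n in votes.items() if n >= min_votes}

def merge_complementary(results):
    """Union of everything any source found."""
    return set().union(*results)

src1 = {("Person", "Tony Blair"), ("Location", "Baghdad")}
src2 = {("Person", "Tony Blair"), ("Organization", "UN")}
src3 = {("Person", "Tony Blair"), ("Location", "Baghdad")}

confirmed = merge_redundant([src1, src2, src3])   # entities seen by >= 2 sources
everything = merge_complementary([src1, src2, src3])
```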
RichNews

□ A prototype addressing the automation of semantic annotation for multimedia material
□ Fully automatic
□ Aimed at news material
□ Not aiming to reach performance comparable to that of human experts
□ TV and radio news broadcasts from the BBC were used during development and testing
Motivation

□ Broadcasters produce many hours of material daily (the BBC has 8 TV and 11 radio national channels)
□ Some of this material can be reused in new productions
□ Access to archive material is provided by some form of semantic annotation and indexing
□ Manual annotation is time-consuming (up to 10x real time) and expensive
□ Currently some 90% of the BBC's output is annotated only at a very basic level
Overview

□ Input: multimedia file
□ Output: OWL/RDF descriptions of content
  ○ Headline (short summary)
  ○ List of entities (Person/Location/Organization/…)
  ○ Related web pages
  ○ Segmentation
□ Multi-source Information Extraction system
  ○ Automatic speech transcript
  ○ Subtitles/closed captions (if available)
  ○ Related web pages
  ○ Legacy metadata
Key Problems

□ Obtaining a transcript:
  ○ Speech recognition produces poor-quality transcripts with many mistakes (error rate ranging from 10 to 90%)
  ○ More reliable sources (subtitles/closed captions) are not always available
□ Broadcast segmentation:
  ○ A news broadcast contains several stories. How do we work out where one starts and another one stops?
Workflow

[Diagram: the media file goes to the THISL speech recogniser, producing an ASR transcript; the C99 topical segmenter splits it into segments. TF/IDF key-phrase extraction produces search terms for web search and document matching, which yields related web pages. KIM information extraction finds entities in the web pages; entity validation produces the output entities. Degraded-text information extraction over the ASR transcript, plus alignment, contributes further entities.]
Using ASR Transcripts

□ ASR is performed by the THISL system.
□ Based on the ABBOT connectionist speech recognizer.
□ Optimized specifically for use on BBC news broadcasts.
□ Average word error rate of 29%.
□ Error rate of up to 90% for out-of-studio recordings.
ASR Errors

Reference: he was suspended after his arrest [SIL] but the Princess was said never to have lost confidence in him
ASR output: he was suspended after his arrest [SIL] but the process were set never to have lost confidence in him

Reference: United Nations weapons inspectors have for the first time entered one of saddam hussein's presidential palaces
ASR output: and other measures weapons inspectors have the first time entered one of saddam hussein's presidential palaces
Topical Segmentation

□ Uses the C99 segmenter:
  ○ Removes common words from the ASR transcripts.
  ○ Stems the other words to get their roots.
  ○ Then looks to see in which parts of the transcripts the same words tend to occur.
→ These parts will probably report the same story.
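The steps above can be sketched in a few lines (this is a toy illustration of the idea, not the actual C99 implementation; the stopword list and suffix-chopping stemmer are stand-ins for real ones): strip common words, crudely stem the rest, and place a story boundary wherever neighbouring stretches of transcript share few word roots.

```python
# Toy sketch of lexical-cohesion segmentation (C99-style idea, not C99 itself).
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "was", "have", "for"}

def roots(sentence):
    """Remove stopwords and crudely stem what is left."""
    out = set()
    for w in sentence.lower().split():
        if w in STOPWORDS:
            continue
        for suf in ("ing", "ers", "er", "ed", "s"):
            if w.endswith(suf) and len(w) > len(suf) + 2:
                w = w[:-len(suf)]
                break
        out.add(w)
    return out

def boundaries(sentences, threshold=0.1):
    """Indices where word-root overlap between neighbours drops."""
    cuts = []
    for i in range(len(sentences) - 1):
        a, b = roots(sentences[i]), roots(sentences[i + 1])
        sim = len(a & b) / max(1, len(a | b))  # Jaccard overlap
        if sim < threshold:
            cuts.append(i + 1)
    return cuts

transcript = [
    "weapons inspectors entered the presidential palace",
    "the inspectors searched the palace for weapons",
    "heavy rain flooded towns in the north of england",
]
cuts = boundaries(transcript)  # boundary before the weather story
```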
Key Phrase Extraction

Term frequency inverse document frequency (TF.IDF):
□ Chooses sequences of words that tend to occur more frequently in the story than they do in the language as a whole.
□ Any sequence of up to three words can be a phrase.
□ Up to four phrases extracted per story.
Web Search and Document Matching

□ The key-phrases are used to search the BBC website and the Times, Guardian and Telegraph newspaper websites for web pages reporting each story in the broadcast.
□ Searches are restricted to the day of broadcast, or the day after.
□ Searches are repeated using different combinations of the extracted key-phrases.
□ The text of the returned web pages is compared to the text of the transcript to find matching stories.
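The final comparison step could look like the following sketch (an assumption, not the system's actual matcher; the bag-of-words overlap and the threshold value are illustrative): treat the transcript and each candidate page as word sets and keep the pages whose overlap is high enough.

```python
# Sketch: match candidate web pages against the ASR transcript
# by bag-of-words overlap.
def word_overlap(transcript, page):
    a = set(transcript.lower().split())
    b = set(page.lower().split())
    return len(a & b) / max(1, len(a | b))

def matching_pages(transcript, pages, threshold=0.2):
    """Pages whose vocabulary overlaps enough with the transcript."""
    return [p for p in pages if word_overlap(transcript, p) >= threshold]

transcript = "weapons inspectors entered presidential palaces"
pages = [
    "weapons inspectors entered presidential palaces today",
    "rain floods northern towns",
]
matches = matching_pages(transcript, pages)
```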
Using the Web Pages

The web pages provide:
□ A headline, summary and section for each story.
□ Good-quality text that is readable and contains correctly spelt proper names.
□ More in-depth coverage of the stories.
Semantic Annotation

The KIM knowledge management system can semantically annotate the text derived from the web pages:
□ KIM identifies people, organizations, locations, etc.
□ KIM performs well on the web page text, but very poorly when run on the transcripts directly.
□ It allows semantic, ontology-aided searches for stories about particular people, locations, etc.
□ For example, we could search for people called Sydney, which would be difficult with a text-based search.
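The "people called Sydney" example can be made concrete with a toy illustration (this is not KIM, and the annotation structure is invented for the sketch): once entities carry a type, a query can filter on it, which a plain text search cannot do.

```python
# Toy typed-entity search: "Sydney" the person vs "Sydney" the city.
annotations = [
    {"text": "Sydney", "type": "Person", "story": 1},
    {"text": "Sydney", "type": "Location", "story": 2},
]

def semantic_search(annotations, text, entity_type):
    """Stories mentioning `text` annotated with the given entity type."""
    return [a["story"] for a in annotations
            if a["text"] == text and a["type"] == entity_type]

people_stories = semantic_search(annotations, "Sydney", "Person")  # → [1]
```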
Entity Matching
[Screenshot]

Search for Entities
[Screenshot]

Story Retrieval
[Screenshot]
Evaluation

Success in finding matching web pages was investigated.
□ Evaluation based on 66 news stories from 9 half-hour news broadcasts.
□ Web pages were found for 40% of stories.
□ 7% of pages reported a closely related story, instead of the one in the broadcast.
Possible Improvements

□ Use teletext subtitles (closed captions) when they are available
□ Better story segmentation through visual cues and latent semantic analysis
□ Use for content augmentation for interactive media consumption
Other Examples: Multiflora

□ Improve recall in analysing botany texts by using multiple sources and unification of populated templates.
□ Store templates as an ontology (which gets populated from the multiple sources).
□ Recall for the full template improves from 22% (1 source) to 71% (6 sources).
□ Precision decreases from 74% to 63%.
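The unification step can be sketched as follows (a simplification; the slot names are made up, and a real system would also have to reconcile conflicting values rather than just take the first): each source fills some slots of a botany template, and merging the partial templates is what raises recall on the full template.

```python
# Sketch: unify partially populated templates from several sources.
def unify(templates):
    merged = {}
    for t in templates:
        for slot, value in t.items():
            merged.setdefault(slot, value)  # first source to fill a slot wins
    return merged

source1 = {"species": "Quercus robur", "leaf_shape": "lobed"}
source2 = {"species": "Quercus robur", "height": "20-40 m"}
full = unify([source1, source2])  # more slots filled than either source alone
```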
Multiflora: IE
[Screenshot]

Multiflora: Output
[Screenshot]
Other Examples: MUMIS

□ Multi-Media Indexing and Search
□ Indexing of football matches, using multiple sources:
  ○ Tickers (time-aligned with the video stream)
  ○ Match reports (more in-depth)
  ○ Comments (extra details, such as player profiles)
MUMIS Interface
[Screenshot]
Thank You! Questions?

More Information:
http://gate.ac.uk
http://nlp.shef.ac.uk