Current Topics in Applied Computer Science: Semantic Technologies (M-TANI) Christian Chiarcos Applied Computational Linguistics chiarcos@informatik.uni-frankfurt.de July 11, 2013
Machine Reading & Open IE • Pretext: Continue from last week – Structured evidence, slide 90ff. • Machine Reading: Definition and goals • Open IE – TextRunner • Applications • Structured Knowledge – Entities, types, ontologies – OpenIE + LOD
Machine Reading „Learning by Reading“ • goal (informally): "acquisition of commonsense knowledge" – machine reading is the automatic, unsupervised understanding of text • Machine reading, or learning by reading, aims to extract knowledge automatically from unstructured text and apply the extracted knowledge to end tasks such as decision making and question answering (Poon et al. 2010) • Our task is to build a formal representation of a specific, coherent topic through deep processing of concise texts focused on that topic. (Barker et al. 2007)
Machine Reading: Desiderata • End-to-end – input raw text, extract knowledge, and be able to answer questions and support other end tasks • High quality – extract knowledge with high accuracy • Large-scale – acquire knowledge at Web-scale and be open to arbitrary domains, genres, and languages • Maximally autonomous – the system should incur minimal human effort • Continuous learning from experience – constantly integrate new information sources and learn from user questions and feedback (Poon et al. 2010)
Breadth/Depth tradeoff (a) broad/shallow (e.g., KnowItAll/TextRunner) – use a broad range of materials – extract repetitive facts from them → set of relational tuples (b) narrow/deep (e.g., Möbius [Barker et al. 2007]) – narrow range of materials (either in terms of simplified NL syntax or being limited to a single domain) – extract as much knowledge as possible from those materials → a coherent and complete semantic model for an entire focused text
Breadth/Depth tradeoff (c) support deep systems with resources built by broad/shallow systems – use Open IE to construct a Background Knowledge Base (BKB)* – consult this BKB in a deep system, e.g., for type inferences or inference of implicit information • Today, we focus on a shallow system (KnowItAll/TextRunner, Oren Etzioni, University of Washington, since 2003) – Slides from Oren Etzioni (2012), Open Information Extraction from the Web. Invited talk at the NAACL-HLT 2012 Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX 2012), June 2012, Montréal, Canada * Other BKBs may be built using syntax-based generalizations as described by Peñas & Hovy (2010)
Definition Machine Reading • “MR is an exploratory, open-ended, serendipitous process” • “In contrast with many NLP tasks, MR is inherently unsupervised” • “Very large scale” • “Forming Generalizations based on extracted assertions” • ontology-free!
Open Information Extraction • Definition and goals • Open IE – TextRunner • Applications • Structured Knowledge – Entities, types, ontologies – OpenIE + LOD
Open IE • Open Information Extraction (IE) is the task of extracting assertions from massive corpora without requiring a pre-specified vocabulary. • Information Extraction (IE) systems learn an extractor for each target relation from labeled training examples – does not scale to corpora where the number of target relations is very large, or where the target relations cannot be specified in advance. • Open IE approach: identifying relation phrases — phrases that denote relations in English sentences – extraction of arbitrary relations from sentences, obviating the restriction to a pre-specified vocabulary Fader et al. (2011), Identifying Relations for Open Information Extraction, EMNLP 2011.
Classical KR research • Declarative KR is expensive & difficult • Formal semantics is at odds with – Broad scope – Distributed authorship • KBs are brittle: “can only be used for tasks whose knowledge needs have been anticipated in advance”
KR-based IE: Hearst Patterns
• Text: Elvis was a great artist, but while all of Elvis’ colleagues loved the song “Oh yeah, honey”, Elvis did not perform that song at his concert in Hintertuepflingen.
• Knowledge Base: several pre-defined relations plus instances, e.g., for is-a relations (class membership)
• Idea (by Hearst): Sentences express class membership in very predictable patterns. Use these patterns for instance extraction.
• Hearst patterns: X was a great Y
• Extracted instance: Entity = Elvis, Class = artist
Slide from Fabian M. Suchanek (2010)
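The pattern on this slide can be tried out with a few lines of Python. This is only a minimal sketch: the regular expression is one hypothetical way to write "X was a great Y" (X as a run of capitalized tokens, Y as a single word), and the function name `extract_is_a` is made up for illustration.

```python
import re

# Hypothetical rendering of the slide's pattern "X was a great Y".
# Assumption: X is one or more capitalized tokens, Y is a single word.
PATTERN = re.compile(r"\b([A-Z]\w*(?: [A-Z]\w*)*) was a great (\w+)")

def extract_is_a(text):
    """Return (entity, class) pairs matched by the pattern."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]

sentence = (
    "Elvis was a great artist, but while all of Elvis' colleagues "
    "loved the song \"Oh yeah, honey\", Elvis did not perform that "
    "song at his concert in Hintertuepflingen."
)
pairs = extract_is_a(sentence)
print(pairs)
```

A single surface pattern like this only finds instances of the one relation it was written for, which is the limitation the following slides address.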
KR-based IE: Bootstrapping • Hand-crafted Hearst pattern: X was a great Y • Seed: manually collected instances of a relation (or use hand-crafted pattern to retrieve such instances)
KR-based IE: Bootstrapping
• Seed: manually collected instances of a relation (or use hand-crafted pattern to retrieve such instances)
  – (June, is-a, month)
  – (52, is-a, comic)
  – (Robert Altman, is-a, pothead)
  – (Lowry, is-a, reporter)
• Search: for every seed instance, retrieve sentences that contain its elements
KR-based IE: Bootstrapping
• Seed: manually collected instances of a relation (or use hand-crafted pattern to retrieve such instances)
• Search: for every seed instance, retrieve sentences that contain its elements
• Generate patterns: for every instance match, replace the matches with variables, keep the immediate context (say, the words between)
  Pattern candidates:
  – X. Education Y (Lowry: 1)
  – X is a Y (Lowry: 4)
  – X is a weekly American Y (52: 2)
  – X new Y (52: 1)
  – X is the sixth Y (June: 1)
  – X is National Y (June: 1)
  – X is PTSD Awareness Y (June: 1)
• Pruning: keep only the most confident (frequent, recurring, etc.) patterns
• Iterate: retrieve new instances with the remaining patterns and repeat
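The loop described above (seed, search, generate patterns, prune, iterate) can be sketched on a toy corpus. Everything here is an assumption made for illustration: the four-sentence corpus, the function names, restricting new X and Y to single words, and using "supported by at least `min_support` distinct seed instances" as the pruning criterion.

```python
import re
from collections import defaultdict

def generate_patterns(corpus, instances):
    """Search + generate: map the words between X and Y to the
    instances that support this context."""
    support = defaultdict(set)
    for x, y in instances:
        rx = re.compile(r"\b" + re.escape(x) + r"\b (.+?) \b" + re.escape(y) + r"\b")
        for sentence in corpus:
            m = rx.search(sentence)
            if m:
                support[m.group(1)].add((x, y))
    return support

def apply_patterns(corpus, patterns):
    """Retrieve new (X, Y) pairs; X and Y are single words here (a simplification)."""
    found = set()
    for p in patterns:
        rx = re.compile(r"\b(\w+) " + re.escape(p) + r" (\w+)\b")
        for sentence in corpus:
            for m in rx.finditer(sentence):
                found.add((m.group(1), m.group(2)))
    return found

def bootstrap(corpus, seeds, rounds=2, min_support=2):
    instances = set(seeds)
    for _ in range(rounds):
        support = generate_patterns(corpus, instances)
        # Pruning: keep patterns backed by enough distinct instances.
        patterns = [p for p, s in support.items() if len(s) >= min_support]
        instances |= apply_patterns(corpus, patterns)
    return instances

corpus = [
    "June is a month",
    "Lowry is a reporter",
    "June is the sixth month",
    "Paris is a city",
]
seeds = {("June", "month"), ("Lowry", "reporter")}
result = bootstrap(corpus, seeds)
print(sorted(result))
```

In this toy run only the context "is a" survives pruning, and it retrieves the new instance (Paris, city). On real text the same generic context would also match many non-instances, which is where the noise discussed on the next slide comes from.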
Limits of Bootstrapping • established methodology to increase the coverage of pattern-based information extraction / information retrieval – cf. http://bootcat.sslmit.unibo.it (to bootstrap corpora for a particular language given a small number of seed words) • noise increases with every generation of patterns and instances • noise level cannot be reliably measured – no negative evidence • cannot extend the number of relations in the KB
KR-based Open IE ? • A “universal ontology” is impossible – Global consistency is like world peace • Micro ontologies ? – Do these scale? Interconnections? • Ontological “glass ceiling” – Limited vocabulary – Pre-determined predicates – Coverage restricted to pre-defined relations
Open vs. Traditional IE
             Traditional IE                     Open IE
Input:       Corpus + O(R) hand-labeled data    Corpus
Relations:   Specified in advance               Discovered automatically
Extractor:   Relation-specific                  Relation-independent
How is Open IE Possible?
Etzioni, University of Washington
Open IE: TextRunner (2007) • Extractor – a single pass over all documents, with POS tagging and NP chunking – for each pair of NPs that are not too far apart,* apply a classifier to determine whether or not to extract a relationship * several other constraints apply as well
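The candidate-generation step of the extractor might look roughly like the sketch below. It is not TextRunner's actual code: the input is assumed to be already chunked into (text, label) pairs, "not too far apart" is modeled as at most `max_gap` chunks between the two NPs (the threshold is an arbitrary choice), and the classifier that would accept or reject each candidate is left out.

```python
from itertools import combinations

def candidate_tuples(chunks, max_gap=2):
    """chunks: list of (text, label), label 'NP' for noun phrases, 'O' otherwise.
    Return (np1, text_between, np2) for every NP pair that has between
    1 and max_gap chunks between its members."""
    np_positions = [i for i, (_, label) in enumerate(chunks) if label == "NP"]
    candidates = []
    for i, j in combinations(np_positions, 2):
        between = chunks[i + 1:j]
        if 0 < len(between) <= max_gap:
            between_text = " ".join(text for text, _ in between)
            candidates.append((chunks[i][0], between_text, chunks[j][0]))
    return candidates

# Hypothetical, hand-chunked example sentence.
chunks = [
    ("Edison", "NP"),
    ("invented", "O"),
    ("the light bulb", "NP"),
    ("in", "O"),
    ("1879", "NP"),
]
candidates = candidate_tuples(chunks)
for c in candidates:
    print(c)
```

Each candidate would then be passed to the classifier, which decides whether the text between the two NPs expresses a relationship worth extracting.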
Open IE: TextRunner (2007) • Self-Supervised Classifier – generate training examples for extraction – using several heuristic constraints, automatically label the training set as trustworthy or untrustworthy (positive and negative examples) – the classifier is trained on these examples • main feature: part-of-speech tags
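The self-supervision idea can be illustrated with a small sketch: a heuristic assigns labels, and a classifier is then trained on those labels using only POS tags as features. The labelling heuristic below (short span containing a verb) is a hypothetical stand-in for the slide's "several heuristic constraints", and the add-one-smoothed Naive Bayes over POS tags is a simple choice made for this example, not a description of the original system's internals.

```python
import math
from collections import Counter

def heuristic_label(tags, max_len=3):
    """Hypothetical heuristic: trust a candidate if the span between
    the two NPs is short and contains a verb tag."""
    return len(tags) <= max_len and any(t.startswith("VB") for t in tags)

def train(spans):
    """Label spans automatically, then collect Naive Bayes counts."""
    tag_counts = {True: Counter(), False: Counter()}
    label_counts = Counter()
    for tags in spans:
        label = heuristic_label(tags)
        label_counts[label] += 1
        tag_counts[label].update(tags)
    vocab = set(tag_counts[True]) | set(tag_counts[False])
    return tag_counts, label_counts, vocab

def predict(model, tags):
    """Return True if the span is classified as trustworthy."""
    tag_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in (True, False):
        n = sum(tag_counts[label].values())
        score = math.log(label_counts[label] / total)
        for t in tags:
            # Add-one smoothing so unseen tags do not zero out a class.
            score += math.log((tag_counts[label][t] + 1) / (n + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]

# POS-tag sequences of the words between two NPs (toy training data;
# both labels must occur, otherwise the class prior would be log(0)).
spans = [
    ("VBD",), ("VBD", "IN"), ("VBZ", "RB", "IN"),
    ("IN",), (",", "CC"), ("CC", "DT"),
]
model = train(spans)
print(predict(model, ("VBD", "IN")), predict(model, (",", "CC")))
```

No example is labelled by hand at any point; the quality of the classifier depends entirely on how well the heuristic separates trustworthy from untrustworthy candidates.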
Number of Relations
DARPA MR Domains                <50
NYU, Yago                       <100
NELL                            ~500
DBpedia 3.2                     940
PropBank                        3,600
VerbNet                         5,000
Wikipedia InfoBoxes, f > 10     ~5,000
TextRunner (phrases)            100,000+
ReVerb (phrases)                1,000,000+
Sample of Extracted Relation Phrases
• invented • acquired by • has a PhD in
• inhibits tumor growth in • denied • voted for
• inherited • born in • mastered the art of
• is the patron saint of • downloaded • aspired to
• expelled • arrived from • wrote the book on
Open IE: TextRunner (2007) • Cleaning up relations – unsupervised, probabilistic synonym detection • Entities: P(Bill Clinton = President Clinton) – count shared (relation, arg2) • Relations: P(acquired = bought) – count shared (arg1, arg2)
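The counting idea behind this clean-up step can be sketched as follows. Note the assumptions: the slide speaks of a probabilistic model, whereas this example substitutes a plain Jaccard overlap of shared contexts as a stand-in score, and the extracted tuples and function names are invented for illustration.

```python
def contexts(tuples, item, kind):
    """Contexts of a relation are its (arg1, arg2) pairs;
    contexts of an entity (as arg1) are its (relation, arg2) pairs."""
    if kind == "relation":
        return {(a1, a2) for a1, rel, a2 in tuples if rel == item}
    return {(rel, a2) for a1, rel, a2 in tuples if a1 == item}

def overlap(tuples, x, y, kind):
    """Jaccard overlap of shared contexts (a stand-in for P(x = y))."""
    cx, cy = contexts(tuples, x, kind), contexts(tuples, y, kind)
    union = cx | cy
    return len(cx & cy) / len(union) if union else 0.0

# Hypothetical extractions in (arg1, relation, arg2) form.
tuples = [
    ("Google", "acquired", "YouTube"),
    ("Google", "bought", "YouTube"),
    ("Oracle", "acquired", "Sun"),
    ("Oracle", "bought", "Sun"),
    ("Facebook", "acquired", "Instagram"),
    ("Bill Clinton", "was born in", "Arkansas"),
    ("President Clinton", "was born in", "Arkansas"),
    ("Bill Clinton", "vetoed", "the bill"),
]
rel_score = overlap(tuples, "acquired", "bought", "relation")
ent_score = overlap(tuples, "Bill Clinton", "President Clinton", "entity")
print(rel_score, ent_score)
```

Pairs that share many contexts become synonym candidates; pairs with no shared contexts, such as "acquired" and "was born in" here, score zero.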