BOA
Bootstrapping the Linked Data Web
Daniel Gerber, Axel-Cyrille Ngonga Ngomo
AKSW, Universität Leipzig
http://aksw.org/files/boa.pdf
http://www.volunteer-conservation-peru.org
Motivation
• most knowledge bases are extracted from (semi-)structured data
• Linked Data Cloud grows
• BUT: only 15-20% of information
• How can we extract data from the document-oriented web?
Idea
• start with triples from the Data Web
• extract natural language patterns which express predicates found in triples
• combine patterns & NLP to find labels which stand in relation with predicate
• generate RDF and feed it into Data Web
Related Work
• NLP & RDF:
  • Fox, Extractiv, Alchemy, OpenCalais
• NELL [CAR+10]
  • initial ontology: 100+ categories/relations
• PROSPERA [NDA+11]
  • harvesting of n-gram-itemset patterns ➤ generalisation without adding noise
• [JUR+10] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. ACL, pages 1003–1011, 2009.
The BOA approach
Corpus extraction
• Crawler
  • starts from seed pages, removes HTML
• Cleaner
  • sentence boundary disambiguation (SBD), UTF-8 filters to remove noise
• Indexer
  • sentences are indexed
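The crawl, clean, and index steps can be illustrated with a minimal sketch. It uses only the Python standard library; the function names (`strip_html`, `split_sentences`, `build_index`) and the regex-based sentence splitter are assumptions for illustration and are not taken from the BOA implementation.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Crawler step: drop the markup and normalise whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def split_sentences(text):
    """Cleaner step: naive sentence boundary detection.

    Splits after '.', '!' or '?' when followed by whitespace and an
    uppercase letter. A real SBD component handles abbreviations etc.
    """
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def build_index(pages):
    """Indexer step: give every sentence a numeric id."""
    index = {}
    for html in pages:
        for sentence in split_sentences(strip_html(html)):
            index[len(index)] = sentence
    return index
```

For example, `build_index(["<p>Leipzig is a city in Saxony. Bach worked there.</p>"])` yields two indexed sentences.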
Knowledge acquisition
• Class C that serves as the rdfs:domain or as the rdfs:range of predicate p
• knowledge base for background knowledge
• extract statements with entities of rdf:type C as subject or object
• db:Place, db:Person, db:Organisation
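A minimal sketch of this selection step, assuming the background knowledge is already available as in-memory tuples. The function name `select_statements` and the `types` dictionary are hypothetical; a real setup would more likely query a SPARQL endpoint.

```python
def select_statements(triples, types, target_classes):
    """Keep triples whose subject or object has an rdf:type in target_classes.

    triples:        iterable of (subject, predicate, object) identifiers
    types:          dict mapping an entity to the set of its rdf:type classes
    target_classes: set of classes C, e.g. {"db:Person", "db:Place"}
    """
    selected = []
    for s, p, o in triples:
        s_types = types.get(s, set())
        o_types = types.get(o, set())
        if s_types & target_classes or o_types & target_classes:
            selected.append((s, p, o))
    return selected
```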
Pattern Search
• set of entities s and o connected through p
• find sentences which contain the labels of s and o, strip the rest
• replace labels with variables (?D?, ?R?)
• A BOA pattern is a pair P = (μ(p), θ), where μ(p) is p's URI and θ is a natural language representation of p.
• A BOA pattern mapping is a function M such that M(p) = S, where S is the set of natural language representations for p.
• stored per pattern: occurrences, sentences, the labels p was learned from, and the number of occurrences for each label combination
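The search step can be sketched as follows. This is a simplified illustration under stated assumptions: labels are matched by plain substring search, only the text between the two labels is kept, and the function name `find_patterns` and the `labels` dictionary are hypothetical. Here ?D? stands for the subject (domain) label and ?R? for the object (range) label.

```python
from collections import defaultdict


def find_patterns(sentences, triples, labels):
    """Build a pattern mapping M: predicate -> {theta -> set of label pairs}.

    sentences: list of sentence strings
    triples:   list of (subject, predicate, object) identifiers
    labels:    dict mapping an entity identifier to its label
    """
    mapping = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        s_label, o_label = labels[s], labels[o]
        for sentence in sentences:
            i, j = sentence.find(s_label), sentence.find(o_label)
            if i < 0 or j < 0:
                continue  # sentence does not contain both labels
            if i + len(s_label) <= j:    # subject label comes first
                theta = "?D?" + sentence[i + len(s_label):j] + "?R?"
            elif j + len(o_label) <= i:  # object label comes first
                theta = "?R?" + sentence[j + len(o_label):i] + "?D?"
            else:
                continue  # labels overlap, skip
            theta = " ".join(theta.split())  # normalise whitespace
            # remember which label pair the pattern was learned from
            mapping[p][theta].add((s_label, o_label))
    return mapping
```

Storing the label pairs per pattern keeps the information needed later to count how many distinct triples support a pattern.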
Pattern Scoring
• Pattern Filtering: Length, Stop Words, Occurrence
• Support: used across several triples in background knowledge
• Typicity: allows mapping ?D?, ?R? to entities with rdf:type of domain/range of p
• Specificity: used exclusively to express p, IDF adapted to patterns
• (Similarity: how similar a pattern is to the label of the predicate)
• Combine Support, Typicity, Specificity to calculate local maximum
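To make the scoring idea concrete, here is a small sketch. The formulas are assumptions chosen for illustration (log-scaled support, an IDF-style specificity, and a plain product as the combination); they are not the exact measures used by BOA. Typicity is passed in as a precomputed number because it needs an NER tagger.

```python
import math


def support(theta, predicate, mapping):
    """Log-scaled number of distinct label pairs the pattern was learned from."""
    return math.log(1 + len(mapping[predicate][theta]))


def specificity(theta, mapping):
    """IDF adapted to patterns: fewer predicates sharing theta means a higher score."""
    n_predicates = len(mapping)
    n_with_theta = sum(1 for patterns in mapping.values() if theta in patterns)
    return math.log(n_predicates / n_with_theta)


def score(theta, predicate, mapping, typicity=1.0):
    """Combine the three features; the product is an assumed combination."""
    return support(theta, predicate, mapping) * typicity * specificity(theta, mapping)


# mapping: predicate -> {theta -> set of (subject label, object label)}
example_mapping = {
    "birthPlace": {
        "?D? was born in ?R?": {("Wagner", "Leipzig"), ("Bach", "Eisenach")},
        "?D? , ?R?": {("Wagner", "Leipzig")},
    },
    "deathPlace": {
        "?D? died in ?R?": {("Wagner", "Venice")},
        "?D? , ?R?": {("Bach", "Leipzig")},
    },
}
```

In this toy mapping the generic pattern `?D? , ?R?` occurs with both predicates, so its specificity and combined score are zero, while `?D? was born in ?R?` scores above zero for birthPlace.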
RDF Generation
• use the top-n patterns for each relation
• find sentences which contain a pattern
• NER-tag the sentence
• look for tokens whose classes match domain/range
• extract labels
• URI retrieval: if a match is above the threshold, do not create a new URI
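The generation step can be sketched as below. Several parts are stand-ins and should be read as assumptions: a regex over capitalised tokens replaces the NER tagger, the `entity_types` dictionary replaces the tagger's class output, `difflib.get_close_matches` replaces the URI retrieval with a similarity threshold, and `http://example.org/resource/` is a hypothetical namespace for newly created URIs.

```python
import re
from difflib import get_close_matches

# One or more capitalised tokens, a crude stand-in for NER spans.
LABEL = r"([A-Z]\w*(?: [A-Z]\w*)*)"
NEW_URI_NS = "http://example.org/resource/"  # hypothetical namespace


def pattern_to_regex(theta):
    """Turn e.g. '?D? was born in ?R?' into a regex with two label groups."""
    subject_first = theta.startswith("?D?")
    middle = theta[3:-3]  # text between the two variables
    return re.compile(LABEL + re.escape(middle) + LABEL), subject_first


def resolve_uri(label, known_uris, threshold=0.9):
    """Reuse a known URI if a label is similar enough, else create a new one."""
    match = get_close_matches(label, list(known_uris), n=1, cutoff=threshold)
    if match:
        return known_uris[match[0]]
    return NEW_URI_NS + label.replace(" ", "_")


def generate_triples(theta, predicate, domain, range_, sentences,
                     entity_types, known_uris):
    regex, subject_first = pattern_to_regex(theta)
    triples = []
    for sentence in sentences:
        for m in regex.finditer(sentence):
            first, second = m.group(1), m.group(2)
            s_label, o_label = (first, second) if subject_first else (second, first)
            # keep only label pairs whose classes match domain/range
            if entity_types.get(s_label) != domain:
                continue
            if entity_types.get(o_label) != range_:
                continue
            triples.append((resolve_uri(s_label, known_uris),
                            predicate,
                            resolve_uri(o_label, known_uris)))
    return triples
```

The type check is what rejects sentences such as "Apple was born in Cupertino." when Apple is tagged as an Organisation and the domain of the relation is Person.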
Demo
http://139.18.2.164:8080/boa
Evaluation
• Corpora
  • en_wiki (44.7M), en_news (256.1M)
• Background Knowledge
  • Organisation, Place and Person (283 relations from 1 to 471920 triples)
• Parameters
  • top-1 and top-2 patterns, kappa, 500 sentences for Typicity, 100 example sentences for 12 different KBs
Results
Examples

| Relation URI | Top-2 patterns (en-wiki) | Top-2 patterns (en-news) | Domain/Range |
|---|---|---|---|
| foundationPerson | 1. R, co-founder of D; 2. R, founder of D | 1. R, the co-founder of D; 2. R, founder of the D | Organisation/Person |
| subsidiary | 1. R, a subsidiary of D; 2. D's acquisition of R | 1. R, a division of D; 2. - (R, a division of D) | Organisation/Organisation |
| birthPlace | 1. D was born in R; 2. - (D, the mayor of R) | 1. D has been named in the R; 2. D, MP for R | Person/PopulatedPlace |
Discussion
• we can use patterns from wiki for every corpus
• we create many new triples
• we create correct triples
• we need 15 minutes for one iteration
• Q1 & Q2 answered with YES
Future Work
• Train NER on DBpedia classes
• Iteration 1+
• Human feedback
• Pattern generalization
• rdf:type extractor
• Languages/Corpora
• Webservices
Thank you! Questions?
References
• [NDA+11] Ndapandula Nakashole, Martin Theobald, and Gerhard Weikum. Scalable knowledge harvesting with high precision and high recall. In WSDM, pages 227–236, Hong Kong, 2011.
• [CAR+10] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.