An Integrated Approach for Large-scale Relation Extraction from the Web
Naimdjon Takhirov, Fabien Duchateau, Trond Aalberg and Ingeborg Sølvberg
APWeb 2013, Sydney, Australia, April 4th, 2013

Knowledge extraction
• creation of knowledge from structured and unstructured text
• machine-processable representation
• similar to information extraction (IE) but goes further (backed by a schema)
• many projects aim to transform databases and other structured and unstructured text into an RDF/OWL representation

[Figure: example graph of entities and relations. Entities: Signet (publisher), Bored of the Rings, Lord of the Rings, J.R.R. Tolkien, Henry N. Beard, Douglas C. Kenney, National Lampoon (magazine). Relation labels: publishedBy, parodyOf, creatorOf, founderOf.]

Background (2)
• proper semantic integration of this data enables advanced semantic services (e.g. semantic and exploratory search, question answering (QA), entity matching and disambiguation)
• projects: Snowball, Dipre, Espresso, NELL, ReVerb, Sofie/Prospera, KnowItAll, Probase, etc.
• also commercial interest: Google Knowledge Graph, Bing Snapshot, trueknowledge.com, etc.
• issues: untyped entities and relations, multiple relations, temporal aspect, recall/precision trade-off, runtime performance

Agenda
• context and overview
• approach
  – pattern generation
  – relationship and example generation
  – scalability
• experimental evaluation
  – relationship discovery
  – performance
• conclusion

Context
• existing domain-specific data models (e.g. libraries) need an “upgrade”
  – data created several decades ago (legacy data)
  – large investments (in infrastructure and manpower)
• new semantic data models require a complete conversion
• recent developments in Linked Open Data (LOD) and the interest in semantic data models
• ad-hoc conversion to semantic data models (RDF, OWL, etc.) is difficult
  – identification of entities
  – ambiguity

Context (2)
• why knowledge extraction from the Web?
  – huge source of information
    • “Every 2 Days We Create As Much Information As We Did Up To 2003”, E. Schmidt, 2010
  – the place where we discuss and share knowledge about our cultural heritage (news, wikis, blogs, personal catalogs, etc.)
• SPIDER: Semantic and Provenance-based Integration for Detecting and Extracting Relations
  – extracting semantic information from the documents
  – reasonable recall/precision with respect to the state of the art

Overview
• two-step approach:
  – pattern generation
  – relationship example generation
• both patterns and examples are stored in a knowledge base
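
The second step, generating relationship examples from stored patterns, is not detailed on this slide. The following is a minimal sketch of one way it could work, assuming a pattern is applied as a regular expression over plain sentences. The function names, the `ENTITY` heuristic (capitalised words optionally joined by "of" or "the") and the triple layout are illustrative assumptions, not the authors' implementation.

```python
import re

# Assumption: an entity mention is a run of capitalised words,
# optionally joined by the connectors "of" and "the".
ENTITY = r"[A-Z]\w*(?: (?:of|the|[A-Z]\w*))*"


def pattern_to_regex(pattern):
    """Compile a pattern such as '{e2} is a parody of {e1}' into a regex
    with one named group per placeholder."""
    pieces = []
    for part in re.split(r"(\{e[12]\})", pattern):
        if part in ("{e1}", "{e2}"):
            name = part[1:-1]  # "{e1}" -> "e1"
            pieces.append(f"(?P<{name}>{ENTITY})")
        else:
            pieces.append(re.escape(part))  # literal text between placeholders
    return re.compile("".join(pieces))


def generate_examples(patterns, sentences):
    """Apply every (relation, pattern) pair to every sentence and collect
    (e2, relation, e1) triples, i.e. new relationship examples."""
    examples = set()
    for relation, pattern in patterns:
        regex = pattern_to_regex(pattern)
        for sentence in sentences:
            for match in regex.finditer(sentence):
                examples.add((match.group("e2"), relation, match.group("e1")))
    return examples


if __name__ == "__main__":
    stored_patterns = [("parodyOf", "{e2} is a parody of {e1}")]
    text = ["Bored of the Rings is a parody of Lord of the Rings."]
    print(generate_examples(stored_patterns, text))
```

In such a design the extracted triples would be written back to the knowledge base next to the patterns that produced them.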

Overview of pattern generation
[Figure: pattern generation pipeline. The input entities e1 and e2 are extended with alternative labels (l11 … l1m and l21 … l2n, e.g. “LOTR”, “LotR”, “Lord_of_the_Rings”, “lord of the rings”) using DBpedia, Freebase and OpenCyc. Stages: Extension, Document Extraction (over a collection of documents), Generalization (simple strategy, contextual strategy), Selection. Intermediate outputs: candidate patterns, generic patterns, ranked patterns. Example candidate patterns: “{e2} is a parody of {e1}”, “a spoof of {e1}, entitled {e2}”, “{e2} is a short satirical novel by …”, “parodying {e1}”. Example generic patterns: “{e2} is/VBZ a/DT parody of/IN {e1}”, “a/DT spoof of/IN {e1}, entitled/VBD {e2}”.]

Extension
[Figure: the pattern generation pipeline from the overview slide, repeated to introduce the Extension stage.]

Extending entities
• variety of spelling forms for entities (e.g. “Lord of the Rings”, “The Lord of the Rings”, “LOTR”, etc.)
• use all alternative labels during extraction (to avoid missing potentially interesting relationships)
• idea is to exploit knowledge bases (DBpedia, Freebase)
• context-driven, based on co-occurrence
• discover alternative labels in knowledge bases (e.g. dbpedia:wikiPageRedirects, freebase:common.topic.alias)
  – how to select the right entity?
  – what to do when disambiguation is not possible?
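
A minimal sketch of label extension with context-driven disambiguation follows. It assumes a small in-memory dictionary (`TOY_KB`, hypothetical) in place of real DBpedia or Freebase lookups, and scores candidate entries by how many context terms they share with the input. Falling back to the original label when candidates tie is only one possible answer to the open question on the slide, not the authors' stated choice.

```python
# Hypothetical stand-in for a knowledge base. The slides obtain labels from
# properties such as dbpedia:wikiPageRedirects; here they are hard-coded.
TOY_KB = {
    "The_Lord_of_the_Rings": {
        "labels": {"The Lord of the Rings", "Lord of the Rings", "LOTR", "LotR"},
        "context": {"tolkien", "fantasy", "novel"},
    },
    "The_Lord_of_the_Rings_(film_series)": {
        "labels": {"The Lord of the Rings", "LOTR"},
        "context": {"jackson", "film", "trilogy"},
    },
}


def extend_entity(label, context_terms, kb=TOY_KB):
    """Return the alternative labels of the KB entry that best matches
    `label`, using terms that co-occur with it to disambiguate."""
    wanted = label.lower()
    context = {term.lower() for term in context_terms}

    # Every KB entry that lists the label (case-insensitively) is a candidate.
    candidates = [
        entry for entry in kb.values()
        if wanted in {known.lower() for known in entry["labels"]}
    ]
    if not candidates:
        return {label}  # unknown entity: keep the label as given

    # Score each candidate by its overlap with the co-occurring terms.
    scores = [len(entry["context"] & context) for entry in candidates]
    best = max(scores)
    if scores.count(best) > 1:
        return {label}  # assumption: on a tie, do not guess

    return set(candidates[scores.index(best)]["labels"])


if __name__ == "__main__":
    print(sorted(extend_entity("LOTR", ["Tolkien"])))
    print(sorted(extend_entity("LOTR", [])))
```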

Document extraction
[Figure: the pattern generation pipeline from the overview slide, repeated to introduce the Document Extraction stage.]

Extracting candidate patterns
• use all the (alternative) labels of entities
• search the collection and rank documents according to relevance score
• parse documents, locate sentences in which the entities co-occur
• consider tokens before and after the entities
• output is a list of candidate patterns
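
The extraction steps above can be sketched as follows, under simplifying assumptions: documents are plain strings, sentences are split on end punctuation, and the search and relevance-ranking step is omitted. The function names and the `window` parameter (number of tokens kept before the first and after the last entity) are illustrative, not taken from the slides.

```python
import re


def _mark(sentence, labels, placeholder):
    """Replace the first occurrence of any label with the placeholder.
    Longer labels are tried first so 'The Lord of the Rings' wins over
    'Lord of the Rings'. Returns None when no label occurs."""
    for label in sorted(labels, key=len, reverse=True):
        regex = r"(?<!\w)" + re.escape(label) + r"(?!\w)"
        marked, count = re.subn(regex, placeholder, sentence, count=1,
                                flags=re.IGNORECASE)
        if count:
            return marked
    return None


def candidate_patterns(documents, labels_e1, labels_e2, window=0):
    """Return one candidate pattern per sentence in which both entities
    co-occur, keeping `window` tokens of context on each side."""
    found = []
    for document in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", document.strip()):
            marked = _mark(sentence, labels_e1, "{e1}")
            if marked is not None:
                marked = _mark(marked, labels_e2, "{e2}")
            if marked is None:
                continue  # the two entities do not co-occur here
            tokens = marked.rstrip(".!?").split()
            hits = [i for i, tok in enumerate(tokens)
                    if "{e1}" in tok or "{e2}" in tok]
            start = max(0, hits[0] - window)
            end = min(len(tokens), hits[-1] + window + 1)
            found.append(" ".join(tokens[start:end]))
    return found


if __name__ == "__main__":
    docs = ["Published in 1969, Bored of the Rings is a parody of "
            "The Lord of the Rings. It was written by Beard and Kenney."]
    e1 = {"Lord of the Rings", "The Lord of the Rings", "LOTR"}
    e2 = {"Bored of the Rings"}
    print(candidate_patterns(docs, e1, e2, window=2))
```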

Generalization
[Figure: the pattern generation pipeline from the overview slide, repeated to introduce the Generalization stage.]

Generalization
• goal is to generalize the extracted candidate patterns
• use of a confidence score to select the “best” patterns
• strategies:
  - simple strategy (based on various operations)
    - clean, tag, merge
    - a strategy is a sequence of operations
  - contextual strategy based on term frequency
    - most candidate patterns contain a few interesting terms that denote the type of relationship (e.g. “Bored of the Rings is a parody of Lord of the Rings”)
    - not only the terms in between, but also the surrounding context
    - use WordNet to build clusters of similar words (hyponyms, synonyms)
    - e.g. the pair “Lord of the Rings” and “Tolkien” leads to “book”, “fantasy”, “writer” clusters
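
As an illustration of the simple strategy as a sequence of operations, here is a hedged sketch of clean, tag and merge. The slides do not specify what each operation does, so the behaviour below is assumed: `clean` lowercases and strips punctuation, `tag` appends Penn Treebank style tags from a hypothetical mini-lexicon (`TOY_TAGS`) instead of calling a real part-of-speech tagger, and `merge` counts identical generalized patterns. The contextual strategy (term frequency and WordNet clusters) is not covered.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon standing in for a real POS tagger. Only function
# words are tagged, mirroring '{e2} is/VBZ a/DT parody of/IN {e1}'.
TOY_TAGS = {"is": "VBZ", "a": "DT", "of": "IN", "entitled": "VBD"}
PLACEHOLDERS = ("{e1}", "{e2}")


def clean(pattern):
    """Separate placeholders from glued text, drop punctuation, lowercase."""
    spaced = re.sub(r"(\{e[12]\})", r" \1 ", pattern)
    kept = re.sub(r"[^\w\s{}]", " ", spaced)
    return " ".join(tok if tok in PLACEHOLDERS else tok.lower()
                    for tok in kept.split())


def tag(pattern):
    """Append a tag to every token found in the lexicon: 'is' -> 'is/VBZ'."""
    return " ".join(f"{tok}/{TOY_TAGS[tok]}" if tok in TOY_TAGS else tok
                    for tok in pattern.split())


def merge(patterns):
    """Collapse identical patterns, counting the candidates behind each."""
    return Counter(patterns)


def simple_strategy(candidates, operations=(clean, tag)):
    """Apply the per-pattern operations in order, then merge the results."""
    for operation in operations:
        candidates = [operation(candidate) for candidate in candidates]
    return merge(candidates)


if __name__ == "__main__":
    generic = simple_strategy([
        "{e2} is a parody of {e1}",
        "{e2} is a Parody of {e1}!",
        "a spoof of {e1}, entitled{e2}",
    ])
    for pattern, count in generic.most_common():
        print(count, pattern)
```

The counts returned by `merge` are one natural input for the occurrence component of the confidence score used in the selection step.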

Selection
[Figure: the pattern generation pipeline from the overview slide, repeated to introduce the Selection stage.]

Selection
• exploit all the information that allowed the discovery of the patterns in order to compare patterns
• confidence score:
$$\mathrm{conf}(p) = \frac{\alpha \, \mathrm{sup}_p + \beta \, \mathrm{occ}_p + \gamma \, \mathrm{prov}_p}{\alpha + \beta + \gamma}$$
• support: ratio between the number of examples a pattern is able to discover and the total number of examples discovered by all patterns
• occurrence: number of candidate patterns that a pattern generalizes
• provenance: takes into account the properties of the documents in which a pattern was discovered
  – PageRank
  – SpamScore
  – RelScore (relevance score)
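
The confidence score is a weighted average of the three components, which a few lines of code make concrete. This sketch assumes that support, occurrence and provenance have already been normalised to the range [0, 1]. The default weights and the example values are illustrative only; the slides do not give concrete settings.

```python
def support(examples_by_pattern, examples_by_all_patterns):
    """Share of all discovered examples that this pattern discovers."""
    if examples_by_all_patterns == 0:
        return 0.0
    return examples_by_pattern / examples_by_all_patterns


def confidence(sup, occ, prov, alpha=1.0, beta=1.0, gamma=1.0):
    """conf(p) = (alpha*sup + beta*occ + gamma*prov) / (alpha + beta + gamma)."""
    total = alpha + beta + gamma
    if total <= 0:
        raise ValueError("the weights must sum to a positive value")
    return (alpha * sup + beta * occ + gamma * prov) / total


def rank(patterns):
    """Sort (pattern, sup, occ, prov) tuples by descending confidence."""
    return sorted(patterns, key=lambda p: confidence(p[1], p[2], p[3]),
                  reverse=True)


if __name__ == "__main__":
    # Illustrative component values, not results from the presentation.
    scored = [
        ("parodying {e1}", support(1, 12), 0.1, 0.4),
        ("{e2} is/VBZ a/DT parody of/IN {e1}", support(6, 12), 0.6, 0.7),
    ]
    for pattern, sup, occ, prov in rank(scored):
        print(round(confidence(sup, occ, prov), 3), pattern)
```

Because the sum is divided by the total weight, the score stays in [0, 1] whenever the three components do, so patterns remain comparable across different weight settings.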
