Name System Mayank Kejriwal 2 Linked Data A set of four best - PowerPoint PPT Presentation

1 Populating a Linked Data Entity Name System Mayank Kejriwal

2 Linked Data
• A set of four best practices for publishing and connecting structured data on the Web
Bizer et al. (2009, 2014)

3 Instance Matching
• Connecting pairs of entities that refer to the same underlying entity
• Also known as ‘entity resolution’, ‘entity matching’, ‘co-reference resolution’, ‘merge-purge’...
Jaffri et al. (2008); Papadakis et al. (2010); Nikolov et al. (2011)

4 Entity Name System: a thesaurus for entities
• Populating an ENS requires solutions to instance matching
• Many applications
[Figure: example ENS entries. ‘Paul Gardner Allen’: freebase:Paul_G._Allen, dbpedia:Allen_,Paul, ... ‘Microsoft’: freebase:Microsoft, dbpedia:Microsoft Corp., ...]
Bouquet et al. (2008)
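The thesaurus idea can be sketched in a few lines: once instance matching has produced pairwise links, identifiers that are transitively linked can be grouped into one ENS entry. The following is a minimal sketch only, using union-find as an assumed grouping strategy (the slides do not say how entries are assembled). The function name `build_ens` is hypothetical; the identifier strings are copied from the slide.

```python
from collections import defaultdict

def build_ens(same_as_links):
    """Group identifiers connected by pairwise links into entity clusters.

    Assumption: links are treated as symmetric and transitive.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in same_as_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for name in list(parent):
        clusters[find(name)].add(name)
    return list(clusters.values())

# Identifier strings as they appear on the slide.
links = [
    ("freebase:Paul_G._Allen", "dbpedia:Allen_,Paul"),
    ("dbpedia:Allen_,Paul", "Paul Gardner Allen"),
    ("freebase:Microsoft", "dbpedia:Microsoft Corp."),
    ("dbpedia:Microsoft Corp.", "Microsoft"),
]
ens = build_ens(links)
for entry in ens:
    print(sorted(entry))
```

Each printed list corresponds to one thesaurus entry; a production ENS would also need to handle conflicting or erroneous links, which this sketch ignores.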

5 Data Integration: Example from e-commerce
[Figure labels: Product X; Mediated schema/Target ontology; Entity Name System; Aggregated Results; Seller 1, Seller 2, ..., Seller n]
Doan et al. (2012)

6 Emerald: Data Integration for RDF and Linked Data

7 Resource Description Framework (RDF)
• An RDF dataset is a set of triples, visualized as a directed labeled graph
• A triple is a 3-element tuple (subject, property, object) and represents an edge in the graph
• Subjects and properties are necessarily URIs
• Objects may be URIs or literals
http://www.w3.org/RDF
Bizer et al. (2009)
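As a rough illustration of the triple model described on this slide, the sketch below stores triples as tuples and builds the directed labeled graph as an adjacency list. The `Triple` type, the `to_graph` helper and all `http://example.org/` URIs are hypothetical names chosen for illustration, not part of the presentation.

```python
from collections import defaultdict
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str      # always a URI
    prop: str         # always a URI (the edge label)
    obj: str          # a URI or a literal value
    is_literal: bool  # True when obj is a literal

def to_graph(triples):
    """Adjacency list: subject -> list of (property, object) edges."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.subject].append((t.prop, t.obj))
    return graph

EX = "http://example.org/"  # hypothetical namespace
triples = [
    Triple(EX + "Paul_Allen", EX + "name", "Paul Gardner Allen", True),
    Triple(EX + "Paul_Allen", EX + "coFounded", EX + "Microsoft", False),
    Triple(EX + "Microsoft", EX + "name", "Microsoft", True),
]
graph = to_graph(triples)
literal_count = sum(t.is_literal for t in triples)
print(graph[EX + "Paul_Allen"])
```

In practice a library such as rdflib would normally be used; plain tuples are used here only to keep the data model visible.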

8 From a Web of Linked ‘Documents’...

9 ...to a Web of Linked ‘Data’
• ‘Linked Open Data’ started in 2007 with just 12 RDF datasets
• At last survey (2014), contains:
  • Millions of resources
  • 1000 datasets
  • 900,000 documents
  • 500 million inter-dataset links
• Many domains!
• Applications include schema.org, Google Knowledge Graph, Constitute...
[Figure: Linked Open Data cloud diagram with domain labels including Cross-domain, Media, Publications, Social Networking; Linkeddata.org]
Cyganiak and Jentzsch (2014)

10 Research question What requirements need to be fulfilled in order to populate a Linked Data Entity Name System?

11 Returning to our example...

12 Linked Open Data
• ‘Linked Open Data’ started in 2007 with just a handful of datasets
• At last survey (2014), contains:
  • Millions of resources
  • 1000 datasets
  • 900,000 documents
  • 500 million inter-dataset links
• Many domains!
[Figure: Linked Open Data cloud diagram with domain labels including Cross-domain, Media, Publications, Social Networking; Linkeddata.org]
Cyganiak and Jentzsch (2014)

13 Thesis statement Populating a Linked Data Entity Name System requires simultaneously fulfilling the four DASH requirements of domain-independence, automation, scalability and heterogeneity Kejriwal and Miranker (2014)

14 Step 1: Type alignment Kejriwal and Miranker (2014) Euzenat and Shvaiko (2007)

15 Step 2: Property alignment Euzenat and Shvaiko (2007)

16 Step 3: Similarity prediction?

17 Step 3: blocking and similarity
• Apply blocking key, e.g. Tokens(LastName)
• Generate candidate set (7 pairs), apply similarity function on each pair
• ‘Exhaustive’ set: 4 × 6 = 24 pairs
[Figure: records from Dataset 1 and Dataset 2 assigned to numbered blocks]
Christen (2012)
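The blocking step can be sketched as follows: each record is placed in one block per token of its blocking-key field, and only pairs that share a block become candidates. This is a minimal sketch under assumptions; the records, the `last_name` field and the helper names are invented for illustration and do not reproduce the figure's data.

```python
from collections import defaultdict
from itertools import product

def tokens(value):
    """Lower-cased whitespace tokens of a field value."""
    return set(value.lower().split())

def block(records, field):
    """Map each token of `field` to the ids of records containing it."""
    blocks = defaultdict(set)
    for rec_id, rec in records.items():
        for tok in tokens(rec[field]):
            blocks[tok].add(rec_id)
    return blocks

def candidate_pairs(ds1, ds2, field):
    """Cross-dataset pairs that share at least one block."""
    b1, b2 = block(ds1, field), block(ds2, field)
    pairs = set()
    for key in b1.keys() & b2.keys():
        pairs.update(product(b1[key], b2[key]))
    return pairs

# Hypothetical toy records.
dataset1 = {
    "a1": {"last_name": "Allen"},
    "a2": {"last_name": "Gates"},
    "a3": {"last_name": "van Rossum"},
}
dataset2 = {
    "b1": {"last_name": "allen"},
    "b2": {"last_name": "Rossum"},
    "b3": {"last_name": "Torvalds"},
    "b4": {"last_name": "Gates"},
}
pairs = candidate_pairs(dataset1, dataset2, "last_name")
exhaustive = len(dataset1) * len(dataset2)
print(len(pairs), "candidate pairs instead of", exhaustive)
```

The similarity function would then be applied only to `pairs`, which is the saving the slide contrasts with the exhaustive set.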

18 Final output

19 Supervised schematic (post type-alignment)
• Presented mainly for static tabular datasets; not viable for dynamic linked datasets
[Figure: pipeline schematic. Components: training set of duplicates/non-duplicates; property alignment; aligned training set; learn blocking key; learn similarity function; blocking key; trained classifier; RDF dataset 1 and RDF dataset 2; execute blocking; candidate set; execute similarity; :sameAs links]
Elmagarmid et al. (2007)

20 Semi-supervised schematic (post type-alignment)
• Hard to realize in practice, both because of class imbalance and because graphs are hard to explore
[Figure: pipeline schematic. Components: seed training set of duplicates/non-duplicates; property alignment; aligned training set; most confident samples; learn blocking key; learn similarity function; blocking key; trained classifier; RDF dataset 1 and RDF dataset 2; execute blocking; candidate set; execute similarity; :sameAs links]
Kejriwal and Miranker (2015)

21 Unsupervised schematic?
[Figure: pipeline schematic. Components: seed training set of duplicates/non-duplicates; property alignment; aligned training set; most confident samples; learn blocking key; learn similarity function; blocking key; trained classifier; RDF dataset 1 and RDF dataset 2; execute blocking; candidate set; execute similarity; :sameAs links]

22 Unsupervised schematic?
[Figure: pipeline schematic. Components: training set generator?; noisy seed training set of duplicates/non-duplicates; property alignment; aligned training set; most confident samples; learn blocking key; learn similarity function; blocking key; trained classifier; RDF dataset 1 and RDF dataset 2; execute blocking; candidate set; execute similarity; :sameAs links]
Kejriwal and Miranker (2013-2015)

23 Our system: a complete, unsupervised schematic
• Implemented both serially and in MapReduce (using standard cloud services)
• Feasible for linking large, cross-domain graphs like DBpedia and Freebase
• Does not ‘assume away’ any of the DASH requirements (e.g. property heterogeneity)
Kejriwal and Miranker (2015)

24 Specific algorithmic contributions
[Figure: timeline of contributions (Motivation, Type Heterogeneity, Automation, Blocking and similarity, Property Heterogeneity, Full system (serial), Scalability), annotated with publication venues and years from 2013 to 2016 (ISWC, ESWC, JWS, ICDM, Know@LOD, OM; one 2016 item marked as submitted)]

25 First contribution: Unsupervised training set generation Kejriwal and Miranker (2013-2015)

26 Training Set Generator (TSG): Intuition
• Generate a seed training set by locating a few easy examples using fast, unsupervised heuristics
[Figure: pipeline schematic. Components: training set generator; noisy seed training set of duplicates/non-duplicates; property alignment; aligned training set; most confident samples; learn blocking key; learn similarity function; blocking key; trained classifier; RDF dataset 1 and RDF dataset 2; execute blocking; candidate set; execute similarity; :sameAs links]
Kejriwal and Miranker (2013-2015)

27 What’s considered ‘easy’?
• Operational definition: a pair on which a token-based heuristic (e.g. Jaccard) gives a high score
• Tokens can be extracted by using an RDF-specific tokenizer
[Figure: an entity from RDF dataset 1 and an entity from RDF dataset 2]
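The slides do not specify how the RDF-specific tokenizer works, so the following is only one plausible sketch: it assumes that URI-valued objects contribute the tokens of their local name (the part after the last `/` or `#`) and that literals are split on non-alphanumeric characters. The function `rdf_tokens` and the example URI are hypothetical.

```python
import re

def rdf_tokens(values):
    """Return the set of lower-cased tokens for an entity's object values.

    Assumptions (not from the slides): URIs are reduced to their local
    name; only ASCII letters and digits are kept as token characters.
    """
    toks = set()
    for v in values:
        if v.startswith(("http://", "https://")):
            # Keep only the local name after the last '/' or '#'.
            v = re.split(r"[/#]", v.rstrip("/#"))[-1]
        toks.update(t for t in re.split(r"[^0-9a-z]+", v.lower()) if t)
    return toks

entity_values = [
    "http://example.org/resource/Paul_Allen",  # hypothetical URI
    "Paul Gardner Allen",
]
entity_tokens = rdf_tokens(entity_values)
print(sorted(entity_tokens))
```

Token sets produced this way can be fed directly to a token-based heuristic such as Jaccard.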

28 Step 1: Fast heuristic that is ‘recall-favoring’ with respect to easy examples
• Found LogTFIDF with cosine similarity to work well for this step
• Prunes much of the quadratic space in slightly super-linear time
Given two bags of tokens (‘words’), $S_1$ and $S_2$:
$\mathrm{LogTFIDF}(S_1, S_2) = \sum_{q \in S_1 \cap S_2} w(S_1, q)\, w(S_2, q)$, where
$w(S, q) = \dfrac{w'(S, q)}{\sqrt{\sum_{q \in S} w'(S, q)^2}}$, where
$w'(S, q) = \log(tf_{S,q} + 1)\, \log\left(\dfrac{P}{df_q} + 1\right)$
Cohen (2000)
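A log TF-IDF cosine score of the kind named on this slide can be sketched as below. Each token's raw weight is the product of a log-scaled term frequency and a log-scaled inverse document frequency; the weights are normalized to unit length per bag and the score is the dot product over shared tokens. The sketch assumes the numerator of the inverse-document-frequency ratio is the number of bags in the corpus, which the slide does not state; the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter

def log_tfidf_weights(bag, doc_freq, num_docs):
    """Unit-normalized log TF-IDF weights for one bag of tokens.

    Assumption: num_docs (corpus size) is the IDF numerator.
    """
    tf = Counter(bag)
    raw = {
        q: math.log(tf[q] + 1) * math.log(num_docs / doc_freq[q] + 1)
        for q in tf
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {q: w / norm for q, w in raw.items()} if norm else {}

def log_tfidf(bag1, bag2, doc_freq, num_docs):
    """Cosine similarity of the two bags' log TF-IDF weight vectors."""
    w1 = log_tfidf_weights(bag1, doc_freq, num_docs)
    w2 = log_tfidf_weights(bag2, doc_freq, num_docs)
    return sum(w1[q] * w2[q] for q in w1.keys() & w2.keys())

# Hypothetical toy corpus: one bag of tokens per entity.
corpus = [
    ["paul", "allen"],
    ["paul", "gardner", "allen"],
    ["bill", "gates"],
]
doc_freq = Counter(tok for bag in corpus for tok in set(bag))
same = log_tfidf(corpus[0], corpus[0], doc_freq, len(corpus))
close = log_tfidf(corpus[0], corpus[1], doc_freq, len(corpus))
disjoint = log_tfidf(corpus[0], corpus[2], doc_freq, len(corpus))
print(same, close, disjoint)
```

This naive version compares bags pairwise; the pruning of the quadratic space mentioned on the slide would need an inverted index or similar structure, which is outside this sketch.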

29 Step 2: ‘Precision-favoring’ heuristic
• Found Jaccard to work well for this ‘re-ranking’ step
• Given two sets of tokens (‘words’), $S_1$ and $S_2$:
$\mathrm{Jaccard}(S_1, S_2) = \dfrac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$
Christen (2012)
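The Jaccard re-ranking step can be illustrated with a short sketch: the score is the size of the intersection of the two token sets divided by the size of their union, and candidate pairs are sorted by that score. The `rerank` helper and the token sets are hypothetical, and returning 0.0 for two empty sets is a convention chosen here, not something the slide specifies.

```python
def jaccard(s1, s2):
    """Jaccard similarity of two token collections (0.0 if both are empty)."""
    s1, s2 = set(s1), set(s2)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def rerank(pairs):
    """Sort candidate pairs of token sets by descending Jaccard score."""
    return sorted(pairs, key=lambda p: jaccard(p[0], p[1]), reverse=True)

a = {"paul", "allen"}
b = {"paul", "gardner", "allen"}
c = {"bill", "gates"}
ranked = rerank([(a, c), (a, b), (a, a)])
for left, right in ranked:
    print(sorted(left), sorted(right), round(jaccard(left, right), 3))
```

Taking only the top of the ranked list is what makes the step precision-favoring: pairs with little token overlap fall to the bottom.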

30 Unsupervised RDF Training Set Generator (TSG)
[Figure: stages of the training set generator (TSG)]
• Use TF-IDF to prune space and favor recall
• Use Jaccard to favor precision
• Make every sample count
• Generate non-duplicates
Kejriwal and Miranker (2015)

31 Baseline and Metrics
• Use Dumas TSG (just uses LogTFIDF) as baseline
• Why not an RDF instance-matching TSG? There were none!
• We evaluate the training set generator using Precision vs. Recall graphs
$\mathrm{Precision} = \dfrac{|\text{True positives}|}{|\text{True positives} \cup \text{False positives}|}$
$\mathrm{Recall} = \dfrac{|\text{True positives}|}{|\text{True positives} \cup \text{False negatives}|}$
$\text{F-Measure} = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
Bilke and Naumann (2005)
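The three metrics can be computed from a set of generated duplicate pairs and a ground-truth set, as in the sketch below. The pair identifiers are invented for illustration, and returning 0.0 when a denominator is empty is a convention assumed here.

```python
def precision_recall_f(predicted, truth):
    """Precision, recall and F-measure of predicted pairs against truth."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # true positives
    fp = len(predicted - truth)   # false positives
    fn = len(truth - predicted)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical generated pairs and ground truth.
predicted = {("a1", "b1"), ("a2", "b4"), ("a3", "b3")}
truth = {("a1", "b1"), ("a2", "b4"), ("a3", "b2"), ("a4", "b5")}
precision, recall, f_measure = precision_recall_f(predicted, truth)
print(precision, recall, f_measure)
```

A precision vs. recall graph of the kind mentioned on the slide would be obtained by repeating this computation while varying how many top-ranked pairs are kept.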

32 Serial Evaluations: Test suite
Test case (pair of datasets) | Domain | Number of properties | Number of instances | Number of duplicate pairs
Persons 1 | People | 15/14 | 2000/1000 | 500
Persons 2 | People | 15/14 | 2400/800 | 400
Restaurants | Restaurants | 8/8 | 339/2256 | 89
Eprints-Rexa | Publications | 24/115 | 1130/18,492 | 171
IM-Similarity | Books | 9/9 | 181/180 | 496
IIMB-059 | Movies | 31/25 | 1549/519 | 412
IIMB-062 | Movies | 31/34 | 1549/265 | 264
Libraries | Point of Interest, Addresses | 4/10 | 17,636/26,583 | 16,789
Parks | Point of Interest, Addresses | 3/10 | 567/359 | 322
Video Game | Point of Interest, Addresses | 11/4 | 20,000/16,755 | 10,000
Kejriwal and Miranker (2015)

33 Some Results Kejriwal and Miranker (2015)

34 How does it scale?
• Implemented in MapReduce in Microsoft Azure
• Scales near linearly, even with millions of entities
• Designed to avoid data skew and ‘curse of the last reducer’ problems

