1 Populating a Linked Data Entity Name System Mayank Kejriwal
2 Linked Data A set of four best practices for publishing and connecting structured data on the Web Bizer et al. (2009, 2014)
3 Instance Matching Connecting pairs of entities that refer to the same underlying entity. Also known as ‘entity resolution’, ‘entity matching’, ‘co-reference resolution’, ‘merge-purge’... Jaffri et al. (2008) Papadakis et al. (2010) Nikolov et al. (2011)
4 Entity Name System: a thesaurus for entities. Populating an ENS requires solutions to instance matching. Many applications. Example entries: Paul Gardner Allen (freebase:Paul_G._Allen, dbpedia:Allen_,Paul, ...); Microsoft (freebase:Microsoft, dbpedia:Microsoft Corp., ...) Bouquet et al. (2008)
5 Data Integration: Example from e-commerce [Figure: a query for Product X is answered through a mediated schema/target ontology and an Entity Name System that sit over Seller 1, Seller 2, ..., Seller n, producing aggregated results] Doan et al. (2012)
6 Emerald: Data Integration for RDF and Linked Data
7 Resource Description Framework (RDF) An RDF dataset is a set of triples, visualized as a directed labeled graph. A triple is a 3-element tuple (subject, property, object) and represents an edge in the graph. Properties are necessarily URIs; subjects are URIs (or blank nodes); objects may be URIs or literals. http://www.w3.org/RDF Bizer et al. (2009)
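The triple-and-graph view above can be illustrated with a minimal sketch. The example.org URIs, the `is_uri` check and the `to_graph` helper are all assumptions made for illustration; they are not taken from the talk or from any RDF library.

```python
from collections import defaultdict

# Hypothetical triples; the example.org URIs are placeholders, not real identifiers.
TRIPLES = {
    ("http://example.org/Paul_Allen", "http://example.org/cofounded", "http://example.org/Microsoft"),
    ("http://example.org/Paul_Allen", "http://example.org/name", "Paul Gardner Allen"),  # literal object
    ("http://example.org/Microsoft", "http://example.org/name", "Microsoft"),  # literal object
}

def is_uri(term):
    """Crude URI test used only for this illustration."""
    return term.startswith(("http://", "https://"))

def to_graph(triples):
    """Map each subject node to its outgoing (property, object) edges."""
    graph = defaultdict(list)
    for subject, prop, obj in triples:
        graph[subject].append((prop, obj))
    return dict(graph)

GRAPH = to_graph(TRIPLES)
```

Each triple becomes one labeled edge: the subject is the source node, the property is the edge label, and the object is the target node or a literal value.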
8 From a Web of Linked ‘Documents’...
9 ...to a Web of Linked ‘Data’ ‘Linked Open Data’ started in 2007 with just 12 RDF datasets. At last survey (2014), contains: millions of resources; 1000 datasets; 900,000 documents; 500 million inter-dataset links. Many domains! Applications include schema.org, Google Knowledge Graph, Constitute... [Figure: Linked Open Data cloud, with regions labeled Cross-domain, Media, Publications, Social Networking] Cyganiak and Jentzsch (2014) Linkeddata.org
10 Research question What requirements need to be fulfilled in order to populate a Linked Data Entity Name System?
11 Returning to our example...
12 Linked Open Data ‘Linked Open Data’ started in 2007 with just a handful of datasets. At last survey (2014), contains: millions of resources; 1000 datasets; 900,000 documents; 500 million inter-dataset links. Many domains! [Figure: Linked Open Data cloud, with regions labeled Cross-domain, Media, Publications, Social Networking] Cyganiak and Jentzsch (2014) Linkeddata.org
13 Thesis statement Populating a Linked Data Entity Name System requires simultaneously fulfilling the four DASH requirements of domain-independence, automation, scalability and heterogeneity Kejriwal and Miranker (2014)
14 Step 1: Type alignment Kejriwal and Miranker (2014) Euzenat and Shvaiko (2007)
15 Step 2: Property alignment Euzenat and Shvaiko (2007)
16 Step 3: Similarity prediction?
17 Step 3: blocking and similarity Apply a blocking key, e.g. Tokens(LastName), to group the entities of Dataset 1 and Dataset 2 into blocks. Generate the candidate set (7 pairs) and apply the similarity function on each pair. The ‘exhaustive’ set would contain 4 × 6 = 24 pairs. [Figure: entities of Dataset 1 and Dataset 2 grouped into numbered blocks] Christen (2012)
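A minimal sketch of token-based blocking on a last-name field may make the step concrete. The records, the `LastName` field values and the helper names are hypothetical; the toy data does not reproduce the figure on the slide, only the idea that blocking compares far fewer pairs than the exhaustive 4 × 6 = 24.

```python
from collections import defaultdict
from itertools import product

def tokens(value):
    """Blocking key: the lowercase whitespace-separated tokens of a field value."""
    return set(value.lower().split())

def candidate_pairs(dataset1, dataset2, field):
    """Return index pairs (i, j) whose records share at least one blocking-key token."""
    blocks = defaultdict(lambda: (set(), set()))
    for i, record in enumerate(dataset1):
        for tok in tokens(record[field]):
            blocks[tok][0].add(i)
    for j, record in enumerate(dataset2):
        for tok in tokens(record[field]):
            blocks[tok][1].add(j)
    pairs = set()
    for left, right in blocks.values():
        pairs.update(product(left, right))  # compare only within a block
    return pairs

# Hypothetical toy data: 4 records against 6 records.
DATASET_1 = [{"LastName": n} for n in ["Allen", "Gates", "Smith", "Jones"]]
DATASET_2 = [{"LastName": n} for n in ["Allen", "Gates", "Gates", "Smith", "Brown", "Lee"]]

CANDIDATES = candidate_pairs(DATASET_1, DATASET_2, "LastName")
EXHAUSTIVE = len(DATASET_1) * len(DATASET_2)
```

The similarity function would then be applied only to `CANDIDATES` rather than to all `EXHAUSTIVE` pairs.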
18 Final output
19 Supervised schematic (post type-alignment) Proposed mainly for static tabular datasets; not viable for dynamic linked datasets. [Schematic: a training set of duplicates/non-duplicates is used to learn the property alignment (giving an aligned training set), the blocking key and the similarity function (classifier). RDF datasets 1 and 2 go through ‘execute blocking’ with the trained blocking key to give a candidate set, then through ‘execute similarity’ with the trained classifier to give :sameAs links] Elmagarmid et al. (2007)
20 Semi-supervised schematic (post type-alignment) Hard to realize in practice, both because of class imbalance and because graphs are hard to explore. [Schematic: a seed training set of duplicates/non-duplicates is used to learn the property alignment (giving an aligned training set), the blocking key and the similarity function (classifier). RDF datasets 1 and 2 go through ‘execute blocking’ with the trained blocking key to give a candidate set, then through ‘execute similarity’ with the trained classifier to give :sameAs links. The most confident samples are fed back into the training set] Kejriwal and Miranker (2015)
21 Unsupervised schematic? [Schematic: same pipeline as the semi-supervised case. A seed training set of duplicates/non-duplicates is used to learn the property alignment, the blocking key and the similarity function (classifier); blocking and similarity are executed on RDF datasets 1 and 2 to give :sameAs links; the most confident samples are fed back into the training set]
22 Unsupervised schematic? [Schematic: same pipeline, but the seed is now a noisy seed training set of duplicates/non-duplicates produced by a ‘training set generator?’ component that reads RDF datasets 1 and 2. Property alignment, blocking key and similarity function (classifier) are learned from it; blocking and similarity are executed to give :sameAs links; the most confident samples are fed back into the training set] Kejriwal and Miranker (2013-2015)
23 Our system: a complete, unsupervised schematic Implemented both serially and in MapReduce (using standard cloud services). Feasible for linking large, cross-domain graphs like DBpedia and Freebase. Does not ‘assume away’ any of the DASH requirements (e.g. property heterogeneity). Kejriwal and Miranker (2015)
24 Specific algorithmic contributions [Figure: timeline of contributions. Topics: Motivation, Type Heterogeneity, Automation, Blocking and similarity, Property Heterogeneity, Full system (serial), Scalability. Venues, 2013 to 2016: ICDM, ISWC, ESWC, JWS, Know@LOD, OM; one ISWC 2016 paper marked as submitted]
25 First contribution: Unsupervised training set generation Kejriwal and Miranker (2013-2015)
26 Training Set Generator (TSG): Intuition Generate a seed training set by locating a few easy examples using fast, unsupervised heuristics. [Schematic: a training set generator reads RDF datasets 1 and 2 and produces a noisy seed training set of duplicates/non-duplicates. Property alignment, blocking key and similarity function (classifier) are learned from it; blocking and similarity are executed to give :sameAs links; the most confident samples are fed back into the training set] Kejriwal and Miranker (2013-2015)
27 What’s considered ‘easy’? Operational definition: a pair on which a token-based heuristic (e.g. Jaccard) gives a high score. Tokens can be extracted by using an RDF-specific tokenizer. [Figure: an entity from RDF dataset 1 next to an entity from RDF dataset 2]
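The slides do not describe the RDF-specific tokenizer, so the following is only a hypothetical stand-in: it keeps the local name of a URI (the part after the last `/` or `#`) and splits everything on non-alphanumeric characters. The function names and the dbpedia.org example URI are assumptions for illustration.

```python
import re

def local_name(term):
    """For a URI, keep only the part after the last '/' or '#'; literals pass through."""
    if term.startswith(("http://", "https://")):
        return re.split(r"[/#]", term.rstrip("/#"))[-1]
    return term

def tokenize(term):
    """Lowercase the term and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", local_name(term).lower())

URI_TOKENS = tokenize("http://dbpedia.org/resource/Paul_Allen")
LITERAL_TOKENS = tokenize("Paul Gardner Allen")
```

With tokens like these, a URI and a literal that name the same entity share most of their tokens, which is what makes a token-based score usable for finding easy pairs.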
28 Step 1: Fast heuristic that is ‘recall-favoring’ with respect to easy examples Found LogTFIDF with cosine similarity to work well for this step. Prunes much of the quadratic space in slightly super-linear time. Given two bags of tokens (‘words’), $S_1$ and $S_2$:
$\mathrm{LogTFIDF}(S_1, S_2) = \sum_{q \in S_1 \cap S_2} w(S_1, q)\, w(S_2, q)$, where
$w(S, q) = \dfrac{w'(S, q)}{\sqrt{\sum_{q \in S} w'(S, q)^2}}$, where
$w'(S, q) = \log(\mathit{tf}_{S,q} + 1)\, \log\!\left(\dfrac{P}{\mathit{df}_q} + 1\right)$
Cohen (2000)
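A minimal sketch of a log-TF-IDF cosine score in the style of Cohen (2000) follows. It assumes raw weights of the form log(tf + 1) · log(P/df + 1), normalized to unit length per bag, and it assumes that P is the number of bags in the corpus; the slide does not define P, and all function names here are hypothetical. This sketch scores one pair at a time and does not show the pruning of the quadratic space.

```python
import math
from collections import Counter

def document_frequencies(corpus):
    """Count, for each token, the number of bags that contain it."""
    df = Counter()
    for bag in corpus:
        df.update(set(bag))
    return df

def weights(bag, df, corpus_size):
    """Unit-length log-TF-IDF weight vector for one bag of tokens."""
    tf = Counter(bag)
    raw = {q: math.log(tf[q] + 1) * math.log(corpus_size / df[q] + 1) for q in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    if norm == 0:  # empty bag
        return {}
    return {q: v / norm for q, v in raw.items()}

def log_tfidf(bag1, bag2, df, corpus_size):
    """Cosine similarity: sum the weight products over shared tokens."""
    w1 = weights(bag1, df, corpus_size)
    w2 = weights(bag2, df, corpus_size)
    return sum(w1[q] * w2[q] for q in w1.keys() & w2.keys())

# Hypothetical toy corpus of token bags.
CORPUS = [["paul", "allen"], ["paul", "gardner", "allen"], ["bill", "gates"]]
DF = document_frequencies(CORPUS)
SCORE = log_tfidf(CORPUS[0], CORPUS[1], DF, len(CORPUS))
```

Because each weight vector has unit length, the score lies between 0 (no shared tokens) and 1 (identical weight vectors).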
29 Step 2: ‘Precision-favoring’ heuristic Found Jaccard to work well for this ‘re-ranking’ step. Given two sets of tokens (‘words’), $S_1$ and $S_2$:
$\mathrm{Jaccard}(S_1, S_2) = \dfrac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$
Christen (2012)
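The Jaccard score is the size of the intersection of two token sets divided by the size of their union. A minimal sketch, with a hypothetical function name and the added convention (an assumption) that two empty sets score 0:

```python
def jaccard(tokens1, tokens2):
    """Jaccard similarity of two token collections, treated as sets."""
    s1, s2 = set(tokens1), set(tokens2)
    union = s1 | s2
    if not union:  # both empty: define the score as 0 by convention
        return 0.0
    return len(s1 & s2) / len(union)

# Two of three distinct tokens are shared.
EXAMPLE = jaccard(["paul", "allen"], ["paul", "gardner", "allen"])
```

Unlike the bag-based score of Step 1, this ignores token frequency and penalizes every unshared token, which is consistent with its use as a precision-favoring re-ranker.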
30 Unsupervised RDF Training Set Generator (TSG) Use TF-IDF to prune space and favor recall. Use Jaccard to favor precision. Make every sample count. Generate non-duplicates. Kejriwal and Miranker (2015)
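The two-step idea (a cheap recall-favoring pass, then a precision-favoring re-ranking) can be sketched as follows. This is not the authors' implementation: a raw shared-token count stands in for LogTFIDF, the loop is quadratic rather than slightly super-linear, the `top_k` and `min_jaccard` parameters are hypothetical, and the generation of non-duplicates is left out because the slides do not specify it.

```python
def overlap(tokens1, tokens2):
    """Stand-in for the recall-favoring score: number of shared tokens."""
    return len(set(tokens1) & set(tokens2))

def jaccard(tokens1, tokens2):
    """Precision-favoring score: intersection size over union size."""
    s1, s2 = set(tokens1), set(tokens2)
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def seed_duplicates(entities1, entities2, top_k=2, min_jaccard=0.5):
    """Return (id1, id2, score) seed duplicates, most confident first."""
    seeds = []
    for id1, toks1 in entities1.items():
        # Step 1: keep only the top_k candidates under the cheap score.
        ranked = sorted(entities2, key=lambda id2: overlap(toks1, entities2[id2]),
                        reverse=True)[:top_k]
        # Step 2: re-score the survivors with Jaccard and apply a threshold.
        for id2 in ranked:
            score = jaccard(toks1, entities2[id2])
            if score >= min_jaccard:
                seeds.append((id1, id2, score))
    return sorted(seeds, key=lambda seed: seed[2], reverse=True)

# Hypothetical tokenized entities from two datasets.
ENTITIES_1 = {"a1": ["paul", "allen"], "a2": ["microsoft"]}
ENTITIES_2 = {"b1": ["paul", "gardner", "allen"], "b2": ["microsoft", "corp"],
              "b3": ["bill", "gates"]}
SEEDS = seed_duplicates(ENTITIES_1, ENTITIES_2)
```

Raising `min_jaccard` trades seed-set size for precision, which matters because the seed set is noisy by construction.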
31 Baseline and Metrics Use Dumas TSG (just uses LogTFIDF) as baseline. Why not an RDF instance-matching TSG? There were none! We evaluate the training set generator using Precision vs. Recall graphs.
$\mathrm{Precision} = \dfrac{|\text{True positives}|}{|\text{True positives} \cup \text{False positives}|}$
$\mathrm{Recall} = \dfrac{|\text{True positives}|}{|\text{True positives} \cup \text{False negatives}|}$
$\text{F-Measure} = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
Bilke and Naumann (2005)
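Precision, recall and F-measure over sets of entity pairs can be computed as in the sketch below. The function name and the toy pairs are hypothetical, and returning 0 when a denominator is empty is an added convention, not something stated on the slide.

```python
def precision_recall_f(predicted, truth):
    """Precision, recall and F-measure of predicted duplicate pairs against ground truth."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # true positives
    fp = len(predicted - truth)   # false positives
    fn = len(truth - predicted)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical example: 3 predicted pairs, 4 true duplicate pairs, 2 in common.
PREDICTED = {(1, 1), (2, 2), (3, 4)}
TRUTH = {(1, 1), (2, 2), (3, 3), (4, 4)}
P, R, F = precision_recall_f(PREDICTED, TRUTH)
```

Here precision is 2/3, recall is 1/2 and the F-measure is 4/7.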
32 Serial Evaluations: Test suite
Test case (pair of datasets) | Domain | Number of properties | Number of instances | Number of duplicate pairs
Persons 1 | People | 15/14 | 2000/1000 | 500
Persons 2 | People | 15/14 | 2400/800 | 400
Restaurants | Restaurants | 8/8 | 339/2256 | 89
Eprints-Rexa | Publications | 24/115 | 1130/18,492 | 171
IM-Similarity | Books | 9/9 | 181/180 | 496
IIMB-059 | Movies | 31/25 | 1549/519 | 412
IIMB-062 | Movies | 31/34 | 1549/265 | 264
Libraries | Point of Interest, Addresses | 4/10 | 17,636/26,583 | 16,789
Parks | Point of Interest, Addresses | 3/10 | 567/359 | 322
Video Game | Point of Interest, Addresses | 11/4 | 20,000/16,755 | 10,000
Kejriwal and Miranker (2015)
33 Some Results Kejriwal and Miranker (2015)
34 How does it scale? Implemented in MapReduce in Microsoft Azure Scales near linearly, even with millions of entities Designed to avoid data skew and ‘curse of the last reducer’ problems