Supporting Data Interlinking in Semantic Libraries with Microtask Crowdsourcing
Cristina Sarasua
SWIB 2014, Bonn
Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
[Figure: resources a and b connected by a relation; labels: MARC 21, EDM, FRBR]
Please share your thoughts on interlinking!

Interlinking on the Web of Data

[Figure: Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak.]

Cross-dataset links

(a, r, b) | a in D1, b in D2

    d1:timbl owl:sameAs d2:timbernerslee;
    d1:donostia owl:sameAs d2:sansebastian;
    d1:bjork dc:creator d2:volta;
    d1:Bonn wgs84:location d2:Germany;
    d1:work2012 o:inspiredBy d2:song1900;
    o1:Conference owl:equivalentClass o2:Congress;
    o1:Democracy skos:related o2:Government;
    o1:Publication skos:broader o2:JournalArticle;
    o1:ImpressionistPainting rdfs:subClassOf o2:Painting;

Why is interlinking important?

- Enhance the description of local entities
- Richer queries over aggregated data
- Cross-dataset browsing

What is known about Berlin?

    x:berlin owl:sameAs dbpedia:Berlin; tour:berlin;
    x:berlin o:homeOf authors:berlin;
    x:img09112014 lode:atPlace geo:brandtor;

    SELECT ?city1
    WHERE {
      ?city1 gov:population ?pop .
      ?city1 owl:sameAs ?city2 .
      ?city2 unesco:count ?mon .
      FILTER (?pop > 1000000 && ?mon > 50)
    }

Generating links

- Identify the resources in D1 and D2 to be connected with relation R
- Comparison criteria
- Decision boundary between link and non-link

Picture: g_Reference_Links
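The link-generation idea above (comparison criteria plus a decision boundary between link and non-link) can be sketched in a few lines. This is only an illustration under stated assumptions, not the method used in the talk: the label-based comparison with `difflib`, the 0.8 threshold, and the two toy datasets are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical toy datasets: resource identifier -> label (not from the talk).
D1 = {"d1:donostia": "Donostia", "d1:timbl": "Tim Berners-Lee"}
D2 = {"d2:sansebastian": "San Sebastian", "d2:timbernerslee": "Tim Berners Lee"}


def similarity(label_a: str, label_b: str) -> float:
    """Assumed comparison criterion: case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio()


def generate_links(d1, d2, relation="owl:sameAs", threshold=0.8):
    """Return (a, r, b) triples with a in d1 and b in d2.

    The threshold plays the role of the decision boundary: pairs at or
    above it are emitted as links, pairs below it are treated as non-links.
    """
    links = []
    for a, label_a in d1.items():
        for b, label_b in d2.items():
            if similarity(label_a, label_b) >= threshold:
                links.append((a, relation, b))
    return links


links = generate_links(D1, D2)
```

With these toy labels the pair Donostia / San Sebastian falls below the boundary even though both names denote the same city, which is the kind of case where purely automatic comparison misses a link and human judgment can help.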
He is already busy … but still would like correct and useful links

Attribution: Thomas Leu
Crowdsourced Interlinking
Crowdsourcing

“Crowdsourcing represents the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call” (Jeff Howe, 2006)

Scalable, fast

Macrotask crowdsourcing
- E.g. writing an e-book
- Months, $30 per hour / hundreds or thousands of dollars
- Freelancer recruitment, interviews

Microtask crowdsourcing
- E.g. tweet sentiment analysis
- Seconds, reward in cents
- Crowd workers register with a simple profile, limited filtering

Contest-based crowdsourcing
- E.g. NLP algorithm for a particular challenging scenario
- Months, up to thousands of dollars
- Final evaluation and winner selection

Citizen Science crowdsourcing
- E.g. classify galaxies in pictures
- Seconds/minutes, no money
- Open to everyone
An interlinking microtask
Approach

1. Parse RDF links / query D1, D2: candidate links cl1, cl2, …, cln, each a triple (s,p,o)
2. Analyse crowd workers
3. Collect crowd responses for the candidate links to be processed: generate and publish microtasks, collect responses
4. Generate RDF file with final links from the aggregated response: crowd interlinking cl5, …, cln, each a triple (s,p,o)
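As a rough illustration of the step that turns candidate links into microtasks, the sketch below wraps each (s,p,o) triple in a question with fixed answer options. The `Microtask` class, the question wording, and the answer options are assumptions made for this example, not the interface shown in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Microtask:
    """Hypothetical microtask: one question about one candidate link."""
    link: tuple
    question: str
    options: list = field(default_factory=lambda: ["yes", "no", "I don't know"])


def make_microtasks(candidate_links):
    """Create one microtask per candidate link (s, p, o)."""
    tasks = []
    for s, p, o in candidate_links:
        question = f"Does the relation {p} hold between {s} and {o}?"
        tasks.append(Microtask(link=(s, p, o), question=question))
    return tasks


# Candidate links reuse identifiers from the cross-dataset link examples.
candidates = [
    ("d1:donostia", "owl:sameAs", "d2:sansebastian"),
    ("d1:bjork", "dc:creator", "d2:volta"),
]
tasks = make_microtasks(candidates)
```

A real microtask would also show workers descriptive information about both resources; that is left out here to keep the sketch short.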
Approach (II)

Analyse crowd workers to filter out people
– With bad intentions (i.e. scammers)
– Who do not have enough knowledge

Select representative links for which the answer is known (ground truth) and assess people → domain expert useful

Two-way feedback

Measure difficulty based on data heuristics

Select different matching cases:

    x:b  rdfs:label "Berlin"; rdf:type o:City;
    x:b2 rdfs:label "Berlinale"; rdf:type o:Event;

    x:b  rdfs:label "Córdoba"; rdf:type o:City;
    x:b2 rdfs:label "Córdoba"; rdf:type o:City;

    x:b  rdfs:label "Córdoba"; rdf:type o:City; wgs84:lat 37.883;
    x:b2 rdf:type o:City; wgs84:lat -31.400;
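One possible way to assess workers against ground-truth links is to compute each worker's accuracy on those links and keep only workers above a cut-off. The sketch below assumes boolean answers, hypothetical link and worker identifiers, and an arbitrary 0.7 cut-off; none of these values come from the talk.

```python
def worker_accuracy(answers, ground_truth):
    """Fraction of ground-truth links that a worker answered correctly.

    answers:      link id -> the worker's answer (True = link, False = non-link)
    ground_truth: link id -> the known correct answer
    """
    judged = [link for link in answers if link in ground_truth]
    if not judged:
        return 0.0  # no evidence about this worker yet
    correct = sum(answers[link] == ground_truth[link] for link in judged)
    return correct / len(judged)


def trusted_workers(all_answers, ground_truth, min_accuracy=0.7):
    """Keep workers whose accuracy on ground-truth links reaches the cut-off."""
    return {
        worker
        for worker, answers in all_answers.items()
        if worker_accuracy(answers, ground_truth) >= min_accuracy
    }


# Hypothetical ground truth and worker answers.
ground_truth = {"gt1": False, "gt2": True, "gt3": False}
all_answers = {
    "w1": {"gt1": False, "gt2": True, "gt3": False},  # careful worker
    "w2": {"gt1": True, "gt2": True, "gt3": True},    # always answers "link"
}
trusted = trusted_workers(all_answers, ground_truth)
```

Choosing ground-truth links of varying difficulty, as the slide suggests, matters here: a worker who always answers "link" would pass if every test case were an easy positive one.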
Approach (annotated)

The same four steps, annotated with:
- Context information
- #workers per link
- Agreement
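The two annotations on the response side, number of workers per link and agreement, suggest an aggregation step along the following lines. This is a minimal majority-vote sketch; the parameter names, the default values, and the rule that only a "yes" majority yields a final link are assumptions for illustration.

```python
from collections import Counter


def aggregate(responses, min_workers=3, min_agreement=0.6):
    """Select final links from crowd responses.

    responses: candidate link -> list of answers ("yes" / "no") from
    different workers. A link is accepted only if enough workers answered
    and the share of the majority answer "yes" reaches min_agreement.
    """
    final_links = []
    for link, answers in responses.items():
        if len(answers) < min_workers:
            continue  # too few judgments for this link
        top_answer, count = Counter(answers).most_common(1)[0]
        if top_answer == "yes" and count / len(answers) >= min_agreement:
            final_links.append(link)
    return final_links


# Hypothetical responses for three candidate links.
responses = {
    ("d1:donostia", "owl:sameAs", "d2:sansebastian"): ["yes", "yes", "no"],
    ("d1:bjork", "dc:creator", "d2:volta"): ["no", "no", "yes"],
    ("d1:Bonn", "wgs84:location", "d2:Germany"): ["yes", "yes"],
}
final_links = aggregate(responses)
```

Raising `min_workers` or `min_agreement` trades cost and coverage for confidence in the accepted links.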
Approach (II)

- Manual interlinking between D1 and D2
- HCOMP interlinking between D1 and D2: algorithm, review, guide
Use cases
Mapping vocabularies

- Context information pre-configured
- Run an automatic ontology alignment tool and post-process the results with the crowd

See also: [Sarasua et al., 2012]