Semantic integration of bibliographic records (Linked Open Data)
Author: Malakhov D. A.
Introduction
Library data comes from many different sources. Each organization typically works only with its own records, which are not connected to those of other sources. Integration through the Linked Open Data (LOD) space is a general solution to this problem. LOD was created to interlink as much information as possible within each subject area. Publishing data in this space makes it possible to enrich the data and to provide access to it.
Formulation of the problem
The goal is to integrate the bibliographic records of the NLR (National Library) with the records of the BNB (British National Bibliography). The NLR dataset contains millions of records (the test set contains 17 thousand). The BNB dataset consists of 3.5 million records and is already published as LOD. Reaching this goal requires solving two tasks:
- Publishing the NLR data according to the principles of LOD.
- Integrating the NLR data with the BNB data.
Publication of data according to the principles of LOD
Steps required to publish the data:
- Describing the subject area (creating an ontology).
- Converting the NLR data (RUSMARC/bin) to RDF.
- Configuring a semantic RDF data repository for the NLR data.
- Providing access to the NLR data (via HTTP and SPARQL).
Ontology
Three vocabularies were considered for presenting bibliographic records in RDF:
- MODS: the data model of the Library of Congress (USA).
- Dublin Core: a set of terms for describing network resources.
- FOAF: a set of terms for describing persons.
The BNB publishes its data using Dublin Core and FOAF, so the same vocabularies were used for the NLR data.
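To make the choice of vocabularies concrete, the following is a minimal sketch of how one record and its author could be written as N-Triples with Dublin Core and FOAF terms. The function names, the example.org URIs, and the selection of properties are assumptions made for illustration; the actual NLR ontology is not given in the slides.

```python
# Namespaces of the two vocabularies named on the slide.
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(text):
    """Escape a string as a plain N-Triples literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def record_to_ntriples(record_uri, title, author_uri, author_name):
    """Describe a record (Dublin Core) and its author (FOAF) as N-Triples lines.

    Hypothetical helper: the property selection is an assumption, not the
    ontology used in the project.
    """
    triples = [
        (record_uri, DCTERMS + "title", literal(title)),
        (record_uri, DCTERMS + "creator", f"<{author_uri}>"),
        (author_uri, RDF_TYPE, f"<{FOAF}Person>"),
        (author_uri, FOAF + "name", literal(author_name)),
    ]
    return [f"<{s}> <{p}> {o} ." for s, p, o in triples]


if __name__ == "__main__":
    for line in record_to_ntriples(
        "http://example.org/record/1",   # hypothetical record URI
        "Example title",
        "http://example.org/person/1",   # hypothetical author URI
        "Ivanov I. I.",
    ):
        print(line)
```

Keeping the author as a separate resource, rather than a literal name on the record, is what later allows records from different libraries to point at the same person.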
[Figure: ontology]
Preparation of RDF
- Preparing an XSLT transformation (RUSMARC/xml to RDF).
- Converting RUSMARC/bin to RDF.
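The project performs this mapping with XSLT. As an illustration of the same idea in Python, the sketch below reads a MARCXML-like record with the standard library and maps two fields to Dublin Core properties. The XML element names (`record`, `datafield`, `subfield`) and the field choices (200 $a for the title, 700 $a and $b for the author's name) are assumptions based on common UNIMARC practice, not the project's actual stylesheet.

```python
import xml.etree.ElementTree as ET

# Dublin Core Elements namespace; its properties accept literal values.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical input: the real RUSMARC/xml layout may differ.
SAMPLE = """<record>
  <datafield tag="200">
    <subfield code="a">Example title</subfield>
  </datafield>
  <datafield tag="700">
    <subfield code="a">Ivanov</subfield>
    <subfield code="b">I. I.</subfield>
  </datafield>
</record>"""


def record_to_pairs(xml_text):
    """Return (property URI, literal value) pairs for one record."""
    record = ET.fromstring(xml_text)
    pairs = []
    # Assumed mapping: field 200, subfield a -> dc:title
    for field in record.findall("datafield[@tag='200']"):
        for sf in field.findall("subfield[@code='a']"):
            pairs.append((DC + "title", (sf.text or "").strip()))
    # Assumed mapping: field 700, subfields a and b joined -> dc:creator
    for field in record.findall("datafield[@tag='700']"):
        parts = [
            (sf.text or "").strip()
            for sf in field.findall("subfield")
            if sf.get("code") in ("a", "b")
        ]
        if parts:
            pairs.append((DC + "creator", " ".join(parts)))
    return pairs


if __name__ == "__main__":
    for prop, value in record_to_pairs(SAMPLE):
        print(prop, "->", value)
```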
Storage creation
Semantic data can be stored in several ways:
- storage in a relational database;
- the TDB format.
Three APIs for semantic storage were considered:
- Jena;
- Sesame;
- Virtuoso.
The TDB format and Jena were selected.
Providing access to NLR data
The Jetty server was chosen for processing HTTP requests. The server returns information about a record, an author, or links; it obtains the full information about the requested object from the semantic storage via SPARQL. The Fuseki SPARQL endpoint, configured with the Pellet OWL reasoner, was selected for processing SPARQL queries to the storage.
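As a rough illustration of the SPARQL access path, the sketch below builds a query for the titles of records by a given author and prepares a SPARQL Protocol GET request for it. The endpoint URL and dataset name, the function names, and the properties used in the query are assumptions; the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: 3030 is Fuseki's default port, but the dataset
# name "nlr" and the service path are assumptions made for this sketch.
ENDPOINT = "http://localhost:3030/nlr/sparql"


def titles_by_author_query(author_name):
    """Build a SPARQL query for titles of records whose creator has this name."""
    escaped = author_name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "SELECT ?record ?title WHERE {\n"
        "  ?record dcterms:title ?title ;\n"
        "          dcterms:creator ?author .\n"
        f'  ?author foaf:name "{escaped}" .\n'
        "}"
    )


def build_request(query, endpoint=ENDPOINT):
    """Prepare (but do not send) a SPARQL Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


if __name__ == "__main__":
    request = build_request(titles_by_author_query("Ivanov I. I."))
    print(request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it to a running endpoint and return the result set as JSON.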
Creating links
A clustering algorithm was developed to create links: documents are linked to one another through the clusters they belong to. The clustering algorithm has two steps:
1) Clusters are created from a base set of data (in several passes over this set).
2) The remaining elements are distributed among the clusters (in one pass over these elements).
First, clusters were created from the NLR data. Then the BNB data were distributed among these clusters. The links between documents and clusters were represented in RDF.
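The two steps above can be sketched as follows. The slides do not state the similarity measure or how clusters are represented, so this sketch makes its own assumptions: titles are compared by Jaccard similarity of their word sets, each cluster is represented by the word set of its first member, and the threshold and number of passes are arbitrary. All function names are hypothetical.

```python
def tokens(title):
    """Lower-cased word set of a title (assumed document representation)."""
    return frozenset(title.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_cluster(item, centers, threshold):
    """Index of the most similar center, or None if below the threshold."""
    best, best_score = None, 0.0
    for i, center in enumerate(centers):
        score = jaccard(item, center)
        if score > best_score:
            best, best_score = i, score
    return best if best_score >= threshold else None


def build_clusters(titles, threshold=0.5, passes=3):
    """Step 1: create clusters from the base set in several passes.

    The first pass creates cluster centers; later passes reassign every
    title to the best center among all centers found so far.
    """
    centers, members = [], []
    for _ in range(passes):
        members = [[] for _ in centers]
        for title in titles:
            item = tokens(title)
            idx = best_cluster(item, centers, threshold)
            if idx is None:
                centers.append(item)
                members.append([])
                idx = len(centers) - 1
            members[idx].append(title)
    return centers, members


def assign_to_clusters(titles, centers, threshold=0.5):
    """Step 2: distribute the remaining titles in a single pass."""
    return {t: best_cluster(tokens(t), centers, threshold) for t in titles}


if __name__ == "__main__":
    nlr = ["War and Peace", "War and Peace novel", "Anna Karenina"]
    bnb = ["Anna Karenina", "Hamlet"]
    centers, members = build_clusters(nlr)
    print(members)
    print(assign_to_clusters(bnb, centers))
```

A title that matches no cluster in the second step is left unlinked (`None`) rather than forced into the nearest cluster; whether the original system behaves this way is not stated in the slides.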
[Figure: scheme of the system]
Conclusion
Further work can be carried out in the following areas:
- full-text search in titles and descriptions;
- a distributed semantic repository;
- search by the UDC and BDC classifiers;
- search by ISSN and ISBN.
Thank you for your attention!