Automatic creation of mappings between classification systems for bibliographic data Prof. Magnus Pfeffer Stuttgart Media University pfeffer@hdm-stuttgart.de
Agenda  Motivation  Instance-based matching  Application to bibliographic data  Evaluation  Ongoing projects  RDF Representation November 26th, 2013 Semantic Web in Libraries, Hamburg 2
Motivation
Current situation in Germany  Five regional library unions  Subject headings  Predominantly RSWK („Regeln für den Schlagwortkatalog“ - „Rules for the subject catalogue“) using a shared authority file  Classification systems  RVK (Regensburg Union Classification)  BK (Basic Classification)  DDC (Dewey Decimal Classification)  Various local classification systems  Low proportion of indexed titles (25-30%)
Current situation in Germany  National library  Subject headings  Predominantly RSWK („Regeln für den Schlagwortkatalog“ - „Rules for the subject catalogue“) using a shared authority file  Classification systems  DDC (Dewey Decimal Classification)  Coarse categories  DDC only for titles published since 2007  Only „Reihe A“ (print trade publications) is fully indexed with RSWK
Austrian National Library  Subject headings  Predominantly RSWK („Regeln für den Schlagwortkatalog“ - „Rules for the subject catalogue“) using a shared authority file  Classification systems  BK since 2007  RVK in the Austrian library union catalogue
Goals  Re-use existing indexing information  National level  BK is used mainly in northern Germany / Austria  RVK mainly in southern Germany  DDC mainly by the National Library  International level  Make RVK data more accessible to DDC users  Use DDC indexing information available from e.g. the Library of Congress
Ideas  Use of appropriate classification systems  Faceted search in resource discovery systems  Should be monohierarchical  Should have a limited number of classes → DDC (first digits) or BK  Browsing of similar titles  Should be fine-grained → DDC (full) or RVK  (Multi-lingual retrieval)
Ideas  Enable the use of existing tools and visualisations Denton (2012) Legrady (2005)
Instance-based Matching
Ontology matching  Well-studied problem in computer science  Several approaches  Based on the descriptors  Based on the structure  Based on the manifestations (instances)
Instances  Entries in catalogues with multiple classifications
Instance-based matching  Assumptions  Classes with semantic overlap co-occur in instances  The more often these classes co-occur, the stronger the overlap  Preparation  Extraction of all pairs of classifications from the data  Count of the extracted pairs
Example
Example  Entry 1  DDC: 179.9  RVK: CC 7200  RVK: CC 7250  Entry 2  DDC: 179.9  RVK: CC 7200  Pairs  179.9 / CC 7200 (from entry 1)  179.9 / CC 7250 (from entry 1)  179.9 / CC 7200 (from entry 2)
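The pair extraction and counting from the example can be sketched in a few lines of Python. This is only an illustration: the record layout (a dict with `ddc` and `rvk` lists) and the function name `count_pairs` are assumptions, not the format used in the project.

```python
from collections import Counter
from itertools import product

# The two entries from the example slide. The dict layout is an assumption
# made for this sketch.
entries = [
    {"ddc": ["179.9"], "rvk": ["CC 7200", "CC 7250"]},
    {"ddc": ["179.9"], "rvk": ["CC 7200"]},
]


def count_pairs(records):
    """Count how often each (DDC, RVK) pair occurs across the records."""
    counts = Counter()
    for record in records:
        # set() ensures a notation repeated within one record counts once.
        for pair in product(set(record["ddc"]), set(record["rvk"])):
            counts[pair] += 1
    return counts


pair_counts = count_pairs(entries)
for (ddc, rvk), n in sorted(pair_counts.items()):
    print(f"{ddc} / {rvk}: {n}")
```

For the two entries above, the pair 179.9 / CC 7200 is counted twice and 179.9 / CC 7250 once.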
Normalisation  Comparing absolute numbers alone is misleading  Some classes are used more often than others  Number of pairs correlates with the number of entries that are classified using a given class  Instead: Use the proportion of co-occurrence to occurrence  $J(c_1, c_2) = \frac{|E_{c_1} \cap E_{c_2}|}{|E_{c_1} \cup E_{c_2}|}$, where $E_c$ is the set of entries classified with class $c$: number of entries with both classifications divided by number of entries with either classification (Jaccard measure for overlap of sets)
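A minimal sketch of the Jaccard measure over sets of entry identifiers. The entry ids and the name `entries_by_class` are made up for illustration; only the formula itself comes from the slide.

```python
def jaccard(entries_c1, entries_c2):
    """Entries with both classes divided by entries with either class."""
    union = entries_c1 | entries_c2
    if not union:
        # Neither class is used at all: treat the overlap as zero.
        return 0.0
    return len(entries_c1 & entries_c2) / len(union)


# Hypothetical index: class notation -> ids of the entries that carry it.
entries_by_class = {
    "179.9": {1, 2, 3},
    "CC 7200": {1, 2},
    "CC 7250": {1, 4},
}

score_7200 = jaccard(entries_by_class["179.9"], entries_by_class["CC 7200"])
score_7250 = jaccard(entries_by_class["179.9"], entries_by_class["CC 7250"])
print(score_7200, score_7250)
```

Here 179.9 and CC 7200 share two of three entries (2/3), while 179.9 and CC 7250 share one of four (0.25), so the first pair is the stronger candidate even though both co-occur.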
Further interpretation  a and b are two classes from two classification systems A and B  The classes a and b only occur together → exact match  a only co-occurs with b, but b co-occurs with other classes from A → a is a narrower concept than b  a co-occurs with several classes from B (including b) → a is a wider concept than b  a and b do not co-occur → cannot infer that a and b are unrelated
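These rules can be turned into a small decision function. The sketch below is one possible reading, not the project's code: class names such as `a1` and `b1` are placeholders, and the case where both classes have several partners is not covered by the slide, so the function reports it as undetermined (an assumption).

```python
def relation(a, b, pairs):
    """Interpret the co-occurrence of class a (system A) and class b (system B).

    pairs is a set of (class_from_A, class_from_B) tuples that co-occur.
    """
    if (a, b) not in pairs:
        # No co-occurrence: this does not show that the classes are unrelated.
        return "unknown"
    partners_of_a = {y for (x, y) in pairs if x == a}
    partners_of_b = {x for (x, y) in pairs if y == b}
    if partners_of_a == {b} and partners_of_b == {a}:
        return "exact"
    if partners_of_a == {b}:
        return "narrower"  # a is narrower than b
    if partners_of_b == {a}:
        return "wider"  # a is wider than b
    # Both sides have several partners; the slide gives no rule (assumption).
    return "undetermined"


# Placeholder classes, not real notations.
pairs = {("a1", "b1"), ("a2", "b2"), ("a2", "b3"), ("a3", "b4"), ("a4", "b4")}

print(relation("a1", "b1", pairs))
print(relation("a2", "b2", pairs))
print(relation("a3", "b4", pairs))
print(relation("a1", "b2", pairs))
```

In practice the pairs would first be filtered by count or Jaccard score, since a single stray co-occurrence would otherwise turn an exact match into a narrower or wider one.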
Prior work  Pfeffer (2009)  Analysis of classification system structure and actual use  Locating classes that describe the same concept  Finding ways to improve existing mappings to RVK  Focus on RVK, using data from library union catalogues  Co-occurrence analysis  Results  High co-occurrence and close in the hierarchy: → classes are hard to assign properly  High co-occurrence and far in the hierarchy: → classes describe identical concepts  Mappings from RSWK to RVK could be augmented
Related work  Isaac et al. (2007)  Applied instance-based matching to bibliographic data  Data from the National Library of the Netherlands  Mapping from a thesaurus to a classification system  Results  Generated mappings are quite good  More sophisticated measures than Jaccard do not lead to better mappings
Application to bibliographic data
Bibliographic data is different  Multiple editions  Multiple document types
Bibliographic data  Skewed data  Multiple editions → More pairs  Some co-occurrences could appear stronger than others  Solution: Pre-clustering individual titles on the „work“ level  Increases the chance of instances with more than one classification  Each cluster contributes only once  Allows using absolute co-occurrence numbers  Cut-off for small numbers  Ranking of competing matches
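A rough sketch of counting on the cluster level, with a cut-off and a ranking of competing matches. The cluster layout, the function names and the threshold `min_count=2` are assumptions chosen for illustration; the slides do not state the cut-off actually used.

```python
from collections import Counter, defaultdict
from itertools import product


def cluster_pair_counts(clusters):
    """Count (DDC, RVK) pairs so that each work cluster contributes once."""
    counts = Counter()
    for records in clusters:
        ddc, rvk = set(), set()
        for record in records:
            ddc.update(record.get("ddc", []))
            rvk.update(record.get("rvk", []))
        # The sets are pooled per cluster, so editions do not inflate counts.
        counts.update(product(ddc, rvk))
    return counts


def ranked_matches(counts, min_count=2):
    """Drop rare pairs and rank the remaining RVK candidates per DDC class."""
    ranked = defaultdict(list)
    for (ddc, rvk), n in counts.items():
        if n >= min_count:
            ranked[ddc].append((rvk, n))
    for candidates in ranked.values():
        candidates.sort(key=lambda item: (-item[1], item[0]))
    return dict(ranked)


# Hypothetical clusters; each inner list holds the editions of one work.
clusters = [
    [{"ddc": ["179.9"]}, {"rvk": ["CC 7200"]}, {"ddc": ["179.9"], "rvk": ["CC 7200"]}],
    [{"ddc": ["179.9"], "rvk": ["CC 7200", "CC 7250"]}],
    [{"ddc": ["179.9"]}, {"rvk": ["CC 7200"]}],
]

counts = cluster_pair_counts(clusters)
matches = ranked_matches(counts)
print(matches)
```

The third cluster shows the benefit of pre-clustering: neither edition carries both classifications on its own, but the cluster as a whole yields a pair.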
Prior work  Pfeffer (2013)  Matching bibliographic records  Based on author, title and uniform title  (as well as information on title changes)  Matches any edition and revision of a work  Including translations  Merge match sets → Discrete clusters  Consolidating indexing information  For indexing purposes, the differences between editions and revisions are irrelevant  Subject headings and classifications are shared between all members of a cluster
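The matching, merging and consolidation steps could look roughly like the sketch below. It is not the method from Pfeffer (2013): the key format, the normalisation, the union-find merge and the sample records are all assumptions, and information on title changes is left out.

```python
import re
from collections import defaultdict


def normalise(text):
    """Lower-case and collapse punctuation and whitespace."""
    return re.sub(r"\W+", " ", text.lower()).strip()


def match_keys(record):
    """Build author/title keys; a uniform title adds a second key."""
    author = normalise(record["author"])
    titles = [record["title"]]
    if record.get("uniform_title"):
        titles.append(record["uniform_title"])
    return {f"{author}|{normalise(title)}" for title in titles}


def cluster_records(records):
    """Merge records that share any key into discrete clusters (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    first_seen = {}
    for idx, record in enumerate(records):
        for key in match_keys(record):
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx
    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return list(clusters.values())


def consolidate(records, cluster):
    """Share all classifications between the members of one cluster."""
    merged = {"ddc": set(), "rvk": set()}
    for idx in cluster:
        for scheme in merged:
            merged[scheme].update(records[idx].get(scheme, []))
    return merged


# Invented records: an original, a translation and a later edition.
records = [
    {"author": "Doe, Jane", "title": "Ein Beispielwerk", "ddc": ["179.9"]},
    {"author": "Doe, Jane", "title": "An example work",
     "uniform_title": "Ein Beispielwerk", "rvk": ["CC 7200"]},
    {"author": "Doe, Jane", "title": "An Example Work"},
    {"author": "Roe, Richard", "title": "Another title", "rvk": ["CC 7250"]},
]

clusters = cluster_records(records)
print(clusters)
print(consolidate(records, clusters[0]))
```

The translation links the German original and the English edition through its uniform title, so all three Doe records end up in one cluster that carries both the DDC and the RVK notation.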
Evaluation
Comparison with existing mappings  Existing (partial) mappings can be used as a basis for evaluation → „Gold standard“  Comparison of automatic and manual mapping  Recall: Are all the mappings found?  Precision: Are all found mappings correct?  Analysis of additional links  Maybe the gold standard can be improved?
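Recall, precision and the set of additional links can be computed from two sets of mapping pairs. The sketch below uses placeholder class names (`bk1`, `rvk1`, ...) and a hypothetical function name; it does not reproduce any evaluation result.

```python
def evaluate(generated, gold):
    """Compare generated mappings with a gold standard.

    Returns (precision, recall, additional links not in the gold standard).
    """
    generated, gold = set(generated), set(gold)
    correct = generated & gold
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall, generated - gold


# Placeholder mappings, not real BK or RVK notations.
gold = {("bk1", "rvk1"), ("bk2", "rvk2"), ("bk3", "rvk3")}
generated = {("bk1", "rvk1"), ("bk2", "rvk2"), ("bk2", "rvk9"), ("bk4", "rvk4")}

precision, recall, additional = evaluate(generated, gold)
print(precision, recall, sorted(additional))
```

Because the gold standard is only partial, the additional links are not necessarily errors. They are the candidates to review by hand when asking whether the gold standard itself can be improved.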
Ongoing projects
Data  Bibliographic data  German library union catalogues  German National Library catalogue  Austrian National Library catalogue  British National Bibliography  Gold standards  Partial mappings BK ↔ RVK
Interesting Mappings  RVK → BK  Gold standard exists  BK well suited for faceted retrieval  RVK has largest proportion of classified titles  RVK ↔ DDC  Enable data sharing between the German National Library and the RVK-using libraries  Not limited to classification systems  See Pfeffer (2009) and Wang et al. (2009)
Implementation: Tasks  Import and mapping of MAB2 and MARC data  Clustering  Generation of keys for the match process  Matching and clustering  Consolidation of indexing and classification information  Statistics  Co-occurrence counts  Jaccard measure  Output  Full mappings
Implementation: State  All steps implemented as a prototype  Perl scripts  File-based data and indexes  Current development  Still Perl scripts (but better documented)  All data is accumulated in a document store  MongoDB  Further plan: Porting to MetaFacture framework
RDF representation