Automatic creation of mappings between classification systems for bibliographic data
Prof. Magnus Pfeffer
Stuttgart Media University
pfeffer@hdm-stuttgart.de

Agenda
- Motivation
- Instance-based matching
- Application to bibliographic data
- Evaluation
- Ongoing projects
- RDF representation

November 26th, 2013, Semantic Web in Libraries, Hamburg

Motivation

Current situation in Germany
- Five regional library unions
  - Subject headings
    - Predominantly RSWK („Regeln für den Schlagwortkatalog“ - „Rules for the subject catalogue“) using a shared authority file
  - Classification systems
    - RVK (Regensburg Union Classification)
    - BK (Basic Classification)
    - DDC (Dewey Decimal Classification)
    - Various local classification systems
  - Low proportion of indexed titles (25-30%)

Current situation in Germany
- National library
  - Subject headings
    - Predominantly RSWK („Regeln für den Schlagwortkatalog“ - „Rules for the subject catalogue“) using a shared authority file
  - Classification systems
    - DDC (Dewey Decimal Classification)
    - Coarse categories
  - DDC only for titles published since 2007
  - Only „Reihe A“ (print trade publications) is fully indexed with RSWK

Austrian National Library
- Subject headings
  - Predominantly RSWK („Regeln für den Schlagwortkatalog“ - „Rules for the subject catalogue“) using a shared authority file
- Classification systems
  - BK since 2007
  - RVK in the Austrian library union catalogue

Goals
- Re-use existing indexing information
- National level
  - BK is used mainly in northern Germany and Austria
  - RVK is used mainly in southern Germany
  - DDC is used mainly by the National Library
- International level
  - Make RVK data more accessible to DDC users
  - Use DDC indexing information available from e.g. the Library of Congress

Ideas
- Use of appropriate classification systems
- Faceted search in resource discovery systems
  - Should be monohierarchical
  - Should have a limited number of classes
  - → DDC (first digits) or BK
- Browsing of similar titles
  - Should be fine-grained
  - → DDC (full) or RVK
- (Multi-lingual retrieval)

Ideas
- Enable the use of existing tools and visualisations
  - Denton (2012)
  - Legrady (2005)

Instance-based matching

Ontology matching
- Well-studied problem in computer science
- Several approaches
  - Based on the descriptors
  - Based on the structure
  - Based on the manifestations (instances)

Instances
- Entries in catalogues with multiple classifications

Instance-based matching
- Assumptions
  - Classes with semantic overlap co-occur in instances
  - The more often these classes co-occur, the stronger the overlap
- Preparation
  - Extraction of all pairs of classifications from the data
  - Count of the extracted pairs

Example

Example
- Entry 1
  - DDC: 179.9
  - RVK: CC 7200
  - RVK: CC 7250
  - Pairs: 179.9 / CC 7200, 179.9 / CC 7250
- Entry 2
  - DDC: 179.9
  - RVK: CC 7200
  - Pairs: 179.9 / CC 7200
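
The pair extraction and counting step can be sketched in a few lines of Python, using the two example entries above. The record layout (a dict from classification system to a list of classes) and the function names are assumptions made for illustration; they are not taken from the prototype described in the talk.

```python
from collections import Counter
from itertools import product

# Assumed record layout: one dict per catalogue entry, mapping a
# classification system to the classes assigned in that entry.
entries = [
    {"DDC": ["179.9"], "RVK": ["CC 7200", "CC 7250"]},  # Entry 1
    {"DDC": ["179.9"], "RVK": ["CC 7200"]},             # Entry 2
]

def extract_pairs(entry, system_a="DDC", system_b="RVK"):
    """Return every (class_a, class_b) combination found in one entry."""
    return list(product(entry.get(system_a, []), entry.get(system_b, [])))

def count_pairs(entries, system_a="DDC", system_b="RVK"):
    """Count how often each pair of classes co-occurs across all entries."""
    counts = Counter()
    for entry in entries:
        counts.update(extract_pairs(entry, system_a, system_b))
    return counts

pair_counts = count_pairs(entries)
for (ddc, rvk), n in pair_counts.most_common():
    print(f"{ddc} / {rvk}: {n}")
```

For the two entries, the pair 179.9 / CC 7200 is counted twice and 179.9 / CC 7250 once.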

Normalisation
- Comparing absolute numbers alone is misleading
  - Some classes are used more often than others
  - The number of pairs correlates with the number of entries that are classified using a given class
- Instead: use the proportion of co-occurrence to occurrence

  $$J(c_1, c_2) = \frac{|E_{c_1} \cap E_{c_2}|}{|E_{c_1} \cup E_{c_2}|}$$

  where $E_c$ is the set of entries classified with class $c$: the number of entries with both classifications divided by the number of entries with either classification (Jaccard measure for the overlap of sets)
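
As a rough illustration of the Jaccard measure, the following Python sketch builds the entry set for each class and divides the size of the intersection by the size of the union. The sample entries, the class "170" and the helper names are made up for this example.

```python
from collections import defaultdict

def entry_sets(entries, system):
    """Map each class of `system` to the set of entry ids that carry it."""
    sets = defaultdict(set)
    for entry_id, entry in enumerate(entries):
        for cls in entry.get(system, []):
            sets[cls].add(entry_id)
    return sets

def jaccard(set_a, set_b):
    """Entries with both classes divided by entries with either class."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Illustrative data only (assumption, not real catalogue records).
entries = [
    {"DDC": ["179.9"], "RVK": ["CC 7200", "CC 7250"]},
    {"DDC": ["179.9"], "RVK": ["CC 7200"]},
    {"DDC": ["170"], "RVK": ["CC 7200"]},
]

ddc_sets = entry_sets(entries, "DDC")
rvk_sets = entry_sets(entries, "RVK")

print(jaccard(ddc_sets["179.9"], rvk_sets["CC 7200"]))  # 2 shared of 3 entries
print(jaccard(ddc_sets["179.9"], rvk_sets["CC 7250"]))  # 1 shared of 2 entries
```

Although 179.9 / CC 7200 has the higher absolute count, the normalised value also reflects that CC 7200 is used in an entry without 179.9.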

Further interpretation
- a and b are two classes from two classification systems A and B
- The classes a and b only occur together → exact match
- a only co-occurs with b, but b co-occurs with other classes from A → a is a narrower concept than b
- a co-occurs with several classes from B (including b) → a is a wider concept than b
- a and b do not co-occur → cannot infer that a and b are unrelated
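
These rules can be expressed as a small decision function. The sketch below is one possible reading, not the method used in the talk: the labels are invented, the class "CA 1000" and the sample pairs are made up, and the "overlapping" case (both classes have several partners) is an assumption, since the slide does not say how to resolve it.

```python
from collections import defaultdict

def build_partners(pairs):
    """Index co-occurring pairs in both directions."""
    a_to_b = defaultdict(set)
    b_to_a = defaultdict(set)
    for a, b in pairs:
        a_to_b[a].add(b)
        b_to_a[b].add(a)
    return a_to_b, b_to_a

def interpret(a, b, a_to_b, b_to_a):
    """Classify the relation of class a (system A) to class b (system B)."""
    partners_of_a = a_to_b.get(a, set())
    partners_of_b = b_to_a.get(b, set())
    if b not in partners_of_a:
        return "unknown"        # no co-occurrence: nothing can be inferred
    a_exclusive = partners_of_a == {b}
    b_exclusive = partners_of_b == {a}
    if a_exclusive and b_exclusive:
        return "exact"          # a and b only occur together
    if a_exclusive:
        return "narrower"       # b also co-occurs with other classes from A
    if b_exclusive:
        return "wider"          # a also co-occurs with other classes from B
    return "overlapping"        # assumption: ambiguous case, not on the slide

# Illustrative pairs only (assumption).
pairs = [
    ("179.9", "CC 7200"),
    ("179.9", "CC 7250"),
    ("170", "CC 7200"),
    ("100", "CA 1000"),
]
a_to_b, b_to_a = build_partners(pairs)

print(interpret("100", "CA 1000", a_to_b, b_to_a))
print(interpret("170", "CC 7200", a_to_b, b_to_a))
print(interpret("179.9", "CC 7250", a_to_b, b_to_a))
```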

Prior work
- Pfeffer (2009)
  - Analysis of classification system structure and actual use
  - Locating classes that describe the same concept
  - Finding ways to improve existing mappings to RVK
- Focus on RVK, using data from library union catalogues
- Co-occurrence analysis
- Results
  - High co-occurrence and close in the hierarchy: → classes are hard to assign properly
  - High co-occurrence and far in the hierarchy: → classes describe identical concepts
  - Mappings from RSWK to RVK could be augmented

Related work
- Isaac et al. (2007)
  - Applied instance-based matching to bibliographic data
  - Data from the National Library of the Netherlands
  - Mapping from a thesaurus to a classification system
- Results
  - Generated mappings are quite good
  - More sophisticated measures than Jaccard do not lead to better mappings

Application to bibliographic data

Bibliographic data is different
- Multiple editions
- Multiple document types

Bibliographic data
- Skewed data
  - Multiple editions → more pairs
  - Some co-occurrences could appear stronger than others
- Solution: pre-clustering individual titles on the „work“ level
  - Increases the chance of instances with more than one classification
  - Each cluster contributes only once
- Allows using absolute co-occurrence numbers
  - Cut-off for small numbers
  - Ranking of competing matches
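
A minimal sketch of cluster-level counting, assuming the clusters already exist as lists of records: the classes of all members are pooled, each cluster contributes each pair at most once, and the resulting absolute counts are cut off and ranked. The cut-off value `min_count=2`, the record layout and the sample clusters are assumptions for illustration.

```python
from collections import Counter
from itertools import product

def cluster_pair_counts(clusters, system_a="DDC", system_b="RVK"):
    """Count class pairs so that each cluster contributes a pair only once."""
    counts = Counter()
    for members in clusters:
        classes_a, classes_b = set(), set()
        for record in members:
            classes_a.update(record.get(system_a, []))
            classes_b.update(record.get(system_b, []))
        # Sets guarantee one contribution per pair and cluster.
        counts.update(product(classes_a, classes_b))
    return counts

def rank_matches(counts, min_count=2):
    """Drop rare pairs, then rank the competing matches for each class."""
    ranked = {}
    for (a, b), n in counts.items():
        if n >= min_count:
            ranked.setdefault(a, []).append((b, n))
    for a in ranked:
        ranked[a].sort(key=lambda item: (-item[1], item[0]))
    return ranked

# Illustrative clusters only (assumption); each inner list is one "work".
clusters = [
    [{"DDC": ["179.9"], "RVK": ["CC 7200"]}, {"RVK": ["CC 7200"]}, {"DDC": ["179.9"]}],
    [{"DDC": ["179.9"], "RVK": ["CC 7200", "CC 7250"]}],
    [{"DDC": ["179.9"]}, {"RVK": ["CC 7200"]}],
]

counts = cluster_pair_counts(clusters)
ranked = rank_matches(counts)
print(ranked)
```

The third cluster shows the second effect named on the slide: neither of its records carries both classifications, but the cluster as a whole does.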

Prior work
- Pfeffer (2013)
- Matching bibliographic records
  - Based on author, title and uniform title (as well as information on title changes)
  - Matches any edition and revision of a work
  - Including translations
  - Merge match sets → discrete clusters
- Consolidating indexing information
  - For indexing purposes, the differences between editions and revisions are irrelevant
  - Subject headings and classifications are shared between all members of a cluster
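
The matching and merging steps might look roughly like the following Python sketch. It is a simplified stand-in, not the procedure from Pfeffer (2013): the normalisation, the key format and the field names (`author`, `title`, `uniform_title`) are assumptions, information on title changes is left out, and a union-find structure is used to merge overlapping match sets into discrete clusters.

```python
import re
import unicodedata
from collections import defaultdict

def normalise(text):
    """Lower-case, strip diacritics and collapse punctuation to spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def match_keys(record):
    """Build one key per available title field (assumed key format)."""
    author = normalise(record.get("author", ""))
    keys = set()
    for field in ("title", "uniform_title"):
        value = record.get(field)
        if value:
            keys.add(f"{author}|{normalise(value)}")
    return keys

def cluster_records(records):
    """Merge records that share any key; return clusters of record indexes."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}
    for idx, record in enumerate(records):
        for key in match_keys(record):
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())

# Illustrative records only (assumption).
records = [
    {"author": "Kant, Immanuel", "title": "Kritik der reinen Vernunft"},
    {"author": "Kant, Immanuel", "title": "Critique of pure reason",
     "uniform_title": "Kritik der reinen Vernunft"},
    {"author": "Kant, Immanuel", "title": "Critique of Pure Reason"},
    {"author": "Kant, Immanuel", "title": "Kritik der Urteilskraft"},
]
clusters = cluster_records(records)
print(clusters)
```

The translation with a uniform title links the German and English editions, so the first three records end up in one cluster even though records 0 and 2 share no key directly.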

Evaluation

Comparison with existing mappings
- Existing (partial) mappings can be used as a basis for evaluation → „Gold standard“
- Comparison of automatic and manual mapping
  - Recall: Are all the mappings found?
  - Precision: Are all found mappings correct?
- Analysis of additional links
  - Maybe the gold standard can be improved?
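
Precision and recall against a gold standard can be computed with plain set operations. The sketch below uses placeholder class labels rather than real RVK or BK notations, and the function name is an assumption; the "additional" links are the generated mappings absent from the gold standard, which are candidates for manual review.

```python
def precision_recall(generated, gold):
    """Compare generated mappings with a gold standard of (source, target) pairs."""
    generated, gold = set(generated), set(gold)
    true_positives = generated & gold
    precision = len(true_positives) / len(generated) if generated else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    additional = generated - gold  # links missing from the gold standard
    return precision, recall, additional

# Placeholder labels only (assumption); not real classification notations.
gold = {("RVK:A", "BK:x"), ("RVK:B", "BK:x"), ("RVK:C", "BK:y")}
generated = {("RVK:A", "BK:x"), ("RVK:B", "BK:x"), ("RVK:A", "BK:z")}

precision, recall, additional = precision_recall(generated, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
print("additional links:", sorted(additional))
```

With a partial gold standard, a low precision does not necessarily mean the additional links are wrong, which is why they are returned separately.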

Ongoing projects

Data
- Bibliographic data
  - German library union catalogues
  - German National Library catalogue
  - Austrian National Library catalogue
  - British national bibliography
- Gold standards
  - Partial mappings BK ↔ RVK

Interesting mappings
- RVK → BK
  - Gold standard exists
  - BK well suited for faceted retrieval
  - RVK has the largest proportion of classified titles
- RVK ↔ DDC
  - Enable data sharing between the German National Library and the RVK-using libraries
- Not limited to classification systems
  - See Pfeffer (2009) and Wang et al. (2009)

Implementation: Tasks
- Import and mapping of MAB2 and MARC data
- Clustering
  - Generation of keys for the match process
  - Matching and clustering
  - Consolidation of indexing and classification information
- Statistics
  - Co-occurrence counts
  - Jaccard measure
- Output
  - Full mappings

Implementation: State
- All steps implemented as a prototype
  - Perl scripts
  - File-based data and indexes
- Current development
  - Still Perl scripts (but better documented)
  - All data is accumulated in a document store (MongoDB)
- Further plan: porting to the Metafacture framework

RDF representation