

  1. Automatic creation of mappings between classification systems for bibliographic data
     Prof. Magnus Pfeffer, Stuttgart Media University, pfeffer@hdm-stuttgart.de

  2. Agenda
     - Motivation
     - Instance-based matching
     - Application to bibliographic data
     - Evaluation
     - Ongoing projects
     - RDF representation

     November 26th, 2013, Semantic Web in Libraries, Hamburg

  3. Motivation

  4. Current situation in Germany
     - Five regional library unions
     - Subject headings
       - Predominantly RSWK („Regeln für den Schlagwortkatalog“ - „Rules for the subject catalogue“) using a shared authority file
     - Classification systems
       - RVK (Regensburg Union Classification)
       - BK (Basic Classification)
       - DDC (Dewey Decimal Classification)
       - Various local classification systems
     - Low proportion of indexed titles (25-30%)

  5. Current situation in Germany
     - National library
     - Subject headings
       - Predominantly RSWK („Regeln für den Schlagwortkatalog“ - „Rules for the subject catalogue“) using a shared authority file
     - Classification systems
       - DDC (Dewey Decimal Classification)
       - Coarse categories
     - DDC only for titles published since 2007
     - Only „Reihe A“ (print trade publications) is fully indexed with RSWK

  6. Austrian National Library
     - Subject headings
       - Predominantly RSWK („Regeln für den Schlagwortkatalog“ - „Rules for the subject catalogue“) using a shared authority file
     - Classification systems
       - BK since 2007
       - RVK in the Austrian library union catalogue

  7. Goals
     - Re-use existing indexing information
     - National level
       - BK is used mainly in northern Germany / Austria
       - RVK mainly in southern Germany
       - DDC mainly by the National Library
     - International level
       - Make RVK data more accessible to DDC users
       - Use DDC indexing information available from e.g. the Library of Congress

  8. Ideas
     - Use of appropriate classification systems
     - Faceted search in resource discovery systems
       - Should be monohierarchical
       - Should have a limited number of classes → DDC (first digits) or BK
     - Browsing of similar titles
       - Should be fine-grained → DDC (full) or RVK
     - (Multi-lingual retrieval)

  9. Ideas
     - Enable the use of existing tools and visualisations
     - (Figures: Denton (2012), Legrady (2005))

  10. Instance-based Matching

  11. Ontology matching
      - Well-studied problem in computer science
      - Several approaches
        - Based on the descriptors
        - Based on the structure
        - Based on the manifestations (instances)

  12. Instances
      - Entries in catalogues with multiple classifications

  13. Instance-based matching
      - Assumptions
        - Classes with semantic overlap co-occur in instances
        - The more often these classes co-occur, the stronger the overlap
      - Preparation
        - Extraction of all pairs of classifications from the data
        - Count of the extracted pairs

  14. Example

  15. Example
      - Entry 1
        - DDC: 179.9
        - RVK: CC 7200
        - RVK: CC 7250
      - Entry 2
        - DDC: 179.9
        - RVK: CC 7200
      - Pairs
        - 179.9 / CC 7200
        - 179.9 / CC 7250
        - 179.9 / CC 7200
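The pair extraction and counting from this example can be sketched in a few lines of Python. This is a minimal illustration, not the Perl prototype mentioned later in the talk; the dictionary layout of an entry and the name `count_pairs` are assumptions made for the sketch.

```python
from collections import Counter
from itertools import product

# Hypothetical in-memory form of the two catalogue entries from the example.
entries = [
    {"DDC": ["179.9"], "RVK": ["CC 7200", "CC 7250"]},  # Entry 1
    {"DDC": ["179.9"], "RVK": ["CC 7200"]},             # Entry 2
]

def count_pairs(entries, system_a="DDC", system_b="RVK"):
    """Count how often each (class_a, class_b) pair co-occurs in an entry."""
    counts = Counter()
    for entry in entries:
        # set() guards against a class being listed twice in one entry
        classes_a = set(entry.get(system_a, []))
        classes_b = set(entry.get(system_b, []))
        # every combination of a class from A and a class from B is one pair
        counts.update(product(classes_a, classes_b))
    return counts

pair_counts = count_pairs(entries)
for (ddc, rvk), n in pair_counts.most_common():
    print(f"{ddc} / {rvk}: {n}")
```

The pair 179.9 / CC 7200 is counted twice (once per entry) and 179.9 / CC 7250 once, matching the three pairs listed on the slide.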

  16. Normalisation
      - Comparing absolute numbers alone is misleading
        - Some classes are used more often than others
        - Number of pairs correlates with the number of entries that are classified using a given class
      - Instead: Use proportion of co-occurrence ↔ occurrence

        $$\frac{|E_{c_1} \cap E_{c_2}|}{|E_{c_1} \cup E_{c_2}|}$$

        where $E_c$ is the set of entries classified with class $c$: number of entries with both classifications divided by number of entries with either classification (Jaccard measure for overlap of sets)
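As a rough sketch of this normalisation, the Jaccard measure can be computed from the sets of entries per class. The two entries are those of the example slide; the helper names `entries_by_class` and `jaccard` are invented for this illustration.

```python
from collections import defaultdict

# The two entries from the example slide (hypothetical dictionary layout).
entries = [
    {"DDC": ["179.9"], "RVK": ["CC 7200", "CC 7250"]},
    {"DDC": ["179.9"], "RVK": ["CC 7200"]},
]

def entries_by_class(entries, system):
    """Map each class of `system` to the set of entry indices that use it."""
    index = defaultdict(set)
    for entry_id, entry in enumerate(entries):
        for cls in entry.get(system, []):
            index[cls].add(entry_id)
    return index

def jaccard(set_a, set_b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

ddc = entries_by_class(entries, "DDC")
rvk = entries_by_class(entries, "RVK")

score_7200 = jaccard(ddc["179.9"], rvk["CC 7200"])  # both entries carry both classes
score_7250 = jaccard(ddc["179.9"], rvk["CC 7250"])  # only entry 1 carries CC 7250
print(score_7200, score_7250)  # 1.0 0.5
```

On this tiny sample, 179.9 and CC 7200 always occur together (score 1.0), while CC 7250 covers only half of the entries that carry either class (score 0.5).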

  17. Further interpretation
      - a and b are two classes from two classification systems A and B
      - The classes a and b only occur together → exact match
      - a only co-occurs with b, but b co-occurs with other classes from A → a is a narrower concept than b
      - a co-occurs with several classes from B (including b) → a is a wider concept than b
      - a and b do not co-occur → cannot infer that a and b are unrelated
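These rules could be applied mechanically to the set of co-occurring pairs. The sketch below makes two simplifying assumptions that are not on the slide: it only looks at which pairs co-occur (ignoring counts and entries that carry a class from one system only), and it returns a separate label, `"overlap"`, for the case the slide does not cover, where both classes have several partners.

```python
def interpret(a, b, pairs):
    """Classify the relation of class a (system A) to class b (system B).

    `pairs` is a set of (class_from_A, class_from_B) tuples that co-occur
    in at least one entry. Function name and labels are hypothetical.
    """
    if (a, b) not in pairs:
        return "unknown"  # no co-occurrence: nothing can be inferred
    partners_of_a = {y for x, y in pairs if x == a}  # classes from B seen with a
    partners_of_b = {x for x, y in pairs if y == b}  # classes from A seen with b
    if partners_of_a == {b} and partners_of_b == {a}:
        return "exact"     # a and b only occur together
    if partners_of_a == {b}:
        return "narrower"  # a only with b, but b also with other classes from A
    if partners_of_b == {a}:
        return "wider"     # a with several classes from B, b only with a
    return "overlap"       # assumption: both sides have several partners

# Pairs from the example slide: 179.9 co-occurs with two RVK classes.
pairs = {("179.9", "CC 7200"), ("179.9", "CC 7250")}
print(interpret("179.9", "CC 7200", pairs))  # wider
```

In practice a threshold on the counts or on the Jaccard score would probably be applied first, so that a single stray co-occurrence does not turn an exact match into a wider or narrower one.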

  18. Prior work
      - Pfeffer (2009)
        - Analysis of classification system structure and actual use
        - Locating classes that describe the same concept
        - Finding ways to improve existing mappings to RVK
      - Focus on RVK, using data from library union catalogues
      - Co-occurrence analysis
      - Results
        - High co-occurrence and close in the hierarchy → classes are hard to assign properly
        - High co-occurrence and far in the hierarchy → classes describe identical concepts
        - Mappings from RSWK to RVK could be augmented

  19. Related work
      - Isaac et al. (2007)
        - Applied instance-based matching to bibliographic data
        - Data from the National Library of the Netherlands
        - Mapping from a thesaurus to a classification system
      - Results
        - Generated mappings are quite good
        - More sophisticated measures than Jaccard do not lead to better mappings

  20. Application to bibliographic data

  21. Bibliographic data is different
      - Multiple editions
      - Multiple document types

  22. Bibliographic data
      - Skewed data
        - Multiple editions → More pairs
        - Some co-occurrences could appear stronger than others
      - Solution: Pre-clustering individual titles on the „work“ level
        - Increases chance for instances with more than one classification
        - Each cluster contributes only once
      - Allows using absolute co-occurrence numbers
        - Cut-off for small numbers
        - Ranking of competing matches
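A minimal sketch of counting on the cluster level, with a cut-off and a ranking of competing matches. The cluster layout, the function names and the default `min_count=2` are assumptions for illustration; the slide does not state the actual cut-off value.

```python
from collections import Counter, defaultdict
from itertools import product

def cluster_pair_counts(clusters, system_a="DDC", system_b="RVK"):
    """Count each class pair at most once per work-level cluster."""
    counts = Counter()
    for records in clusters:
        # pool the classifications of all editions in the cluster
        classes_a = {c for r in records for c in r.get(system_a, [])}
        classes_b = {c for r in records for c in r.get(system_b, [])}
        counts.update(product(classes_a, classes_b))
    return counts

def ranked_matches(counts, min_count=2):
    """Drop pairs below the cut-off; rank competing targets per source class."""
    ranked = defaultdict(list)
    for (a, b), n in counts.most_common():  # highest counts first
        if n >= min_count:
            ranked[a].append((b, n))
    return dict(ranked)

# Hypothetical clusters: each inner list holds the editions of one work.
clusters = [
    [{"DDC": ["179.9"]}, {"RVK": ["CC 7200"]},
     {"DDC": ["179.9"], "RVK": ["CC 7200"]}],
    [{"DDC": ["179.9"], "RVK": ["CC 7200", "CC 7250"]}],
    [{"DDC": ["179.9"]}, {"RVK": ["CC 7200"]}],
]

counts = cluster_pair_counts(clusters)
matches = ranked_matches(counts)
print(matches)  # {'179.9': [('CC 7200', 3)]}
```

The third cluster shows the effect named on the slide: neither edition carries both classifications, but the cluster as a whole does. The first cluster contributes its pair only once despite having three editions.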

  23. Prior work
      - Pfeffer (2013)
      - Matching bibliographic records
        - Based on author, title and uniform title (as well as information on title changes)
        - Matches any edition and revision of a work, including translations
        - Merge match sets → Discrete clusters
      - Consolidating indexing information
        - For indexing purposes, the differences between editions and revisions are irrelevant
        - Subject headings and classifications are shared between all members of a cluster
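One way to merge match sets into discrete clusters is a union-find structure over shared match keys. The sketch below is an assumption about how such a step might look, not the procedure from Pfeffer (2013): the key format, the field names (`author`, `title`, `uniform_title`) and the sample records are all made up for illustration, and title-change information is left out.

```python
import re
from collections import defaultdict

def match_keys(record):
    """Hypothetical keys: normalised author combined with title / uniform title."""
    def norm(text):
        return re.sub(r"\W+", " ", text.lower()).strip()
    author = norm(record.get("author", ""))
    titles = [record.get("title"), record.get("uniform_title")]
    return {f"{author}|{norm(t)}" for t in titles if t}

def cluster(records):
    """Group record indices that are linked by at least one shared key."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # key -> index of the first record carrying it
    for i, record in enumerate(records):
        for key in match_keys(record):
            if key in first_seen:
                parent[find(i)] = find(first_seen[key])  # merge the two sets
            else:
                first_seen[key] = i

    groups = defaultdict(list)
    for i in range(len(records)):
        groups[find(i)].append(i)
    return sorted(groups.values())

# Made-up sample records; the translation links to the original via uniform title.
records = [
    {"author": "Kant, Immanuel", "title": "Kritik der reinen Vernunft"},
    {"author": "Kant, Immanuel", "title": "Critique of pure reason",
     "uniform_title": "Kritik der reinen Vernunft"},
    {"author": "Kant, Immanuel", "title": "Critique of Pure Reason"},
    {"author": "Hume, David", "title": "A treatise of human nature"},
]
clusters = cluster(records)
print(clusters)  # [[0, 1, 2], [3]]
```

Record 2 never matches record 0 directly; the two end up in the same cluster because both match record 1, which is the point of merging match sets.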

  24. Evaluation

  25. Comparison with existing mappings
      - Existing (partial) mappings can be used as a basis for evaluation → „Gold standard“
      - Comparison of automatic and manual mapping
        - Recall: Are all the mappings found?
        - Precision: Are all found mappings correct?
      - Analysis of additional links
        - Maybe the gold standard can be improved?
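The comparison against a gold standard can be sketched as plain set arithmetic. The function name and the placeholder class labels below are hypothetical; since the gold standard is only partial, a real evaluation would presumably restrict the generated mappings to classes that the gold standard covers before computing precision.

```python
def precision_recall(found, gold):
    """Compare generated mappings with a gold standard.

    Both arguments are collections of (source_class, target_class) tuples.
    Returns precision, recall and the additional links not in the gold standard.
    """
    found, gold = set(found), set(gold)
    correct = found & gold
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    additional = found - gold  # candidates for improving the gold standard
    return precision, recall, additional

# Placeholder mappings, not real BK or RVK classes.
gold = {("A1", "X1"), ("A2", "X2"), ("A3", "X3"), ("A4", "X4")}
found = {("A1", "X1"), ("A2", "X2"), ("A5", "X5"), ("A6", "X6")}

precision, recall, additional = precision_recall(found, gold)
print(precision, recall)  # 0.5 0.5
```

The `additional` set corresponds to the "additional links" on the slide: mappings the automatic process proposes that a manual review could either reject or add to the gold standard.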

  26. Ongoing projects

  27. Data
      - Bibliographic data
        - German library union catalogues
        - German National Library catalogue
        - Austrian National Library catalogue
        - British National Bibliography
      - Gold standards
        - Partial mappings BK ↔ RVK

  28. Interesting Mappings
      - RVK → BK
        - Gold standard exists
        - BK well suited for faceted retrieval
        - RVK has largest proportion of classified titles
      - RVK ↔ DDC
        - Enable data sharing between the German National Library and the RVK-using libraries
      - Not limited to classification systems
        - See Pfeffer (2009) and Wang et al. (2009)

  29. Implementation: Tasks
      - Import and mapping of MAB2 and MARC data
      - Clustering
        - Generation of keys for the match process
        - Matching and clustering
        - Consolidation of indexing and classification information
      - Statistics
        - Co-occurrence counts
        - Jaccard measure
      - Output
        - Full mappings

  30. Implementation: State
      - All steps implemented as a prototype
        - Perl scripts
        - File-based data and indexes
      - Current development
        - Still Perl scripts (but better documented)
        - All data is accumulated in a document store
          - MongoDB
      - Further plan: Porting to the Metafacture framework

  31. RDF representation
