
A Tool for Identifying Potential Access Points in Unstructured Text



  1. Semantic Analysis Method (SAM): A Tool for Identifying Potential Access Points in Unstructured Text
     NKOS 2014 (London, UK), September 11-12, 2014
     Karen F. Gracy, Marcia Lei Zeng, and Sammy Davidson
     School of Library and Information Science, Kent State University

  2. The Problem
     • Many legacy descriptions in library, archival, and museum (LAM) information systems contain numerous unstructured text blocks.
     • Many untapped potential access points can be found in this unstructured data.
     • To implement linked data applications in LAM environments, potential access points must be semantically defined and mapped to other vocabularies, such as name authority files and external data sources.
     • LAM professionals need a tool to help them solve the challenge of converting unstructured textual descriptions of cultural heritage material into linked data.

  3. Features of Archival Description
     Can occur at multiple levels:
     • The same collection can be described in whole or in part (e.g., a description of subgroupings and individual items).
     • Descriptions appearing in bibliographic catalogs are often abbreviated collection-level descriptions (top of the hierarchy), and may have some controlled vocabulary terms attached by catalogers.
     • Multi-level finding aids are often generated by processing archivists and may or may not contain controlled vocabulary terms.
     Finding aids can be separated into two major sections:
     • Prefatory notes describing the creator of the materials and the scope and contents of the collection
     • Detailed descriptions at multiple levels, which may or may not contain location information of the material (e.g., Box 3, folder 17)
     Both sections can be characterized by large blocks of unstructured text.
     Full understanding of a particular entity’s importance to the collection as a whole is often reliant on the position of that entity within the larger hierarchy of documents.

  4. Sample Finding Aid: Pearl Harbor Attack (Dec 6-Dec 8, 1941)
     Source: http://www.fdrlibrary.marist.edu/archives/pdfs/findingaids/findingaid_pearlharborattack.pdf

  5. (cont.) Sample Finding Aid: Pearl Harbor Attack (Dec 6-Dec 8, 1941)
     Source: http://www.fdrlibrary.marist.edu/archives/pdfs/findingaids/findingaid_pearlharborattack.pdf

  6. The Proposed Solution
     The Semantic Analysis Method (SAM) tool provides a bridge from unstructured descriptions and narratives to semantically enhanced descriptions containing identified and tagged access points.
     The SAM tool accomplishes the following:
     • Identifies named entities and topics via a semantic analysis engine (OpenCalais).
     • Produces an initial output in the form of a JSON data file, which is then converted to comma-separated values (CSV) format.
     • The resulting CSV file can then be imported into a data cleanup application such as OpenRefine for further editing and removal of misidentified entities.

  7. Overview of SAM Tool Functionality
     The Semantic Analysis Method (SAM) Tool automates identification and extraction of potential access points and parses the resulting data into a database for further cleanup and editing.

  8. SAM Tool Development
     The SAM Tool integrates:
     • The OpenCalais semantic analysis API service;
     • j-calais, a third-party library that provides a Java interface to the OpenCalais API; and
     • Additional scripts in Java to streamline the tasks of:
       1. Obtaining text files from a finding aid data repository;
       2. Calling the OpenCalais web service API;
       3. Performing the tasks of access point extraction and social tagging through the OpenCalais service;
       4. Converting the resulting data to the CSV database format.
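
     The four scripted tasks can be pictured as a small pipeline. The following is only a rough sketch under stated assumptions, written in Python for brevity (the SAM Tool itself uses Java and j-calais). The function `analyze_text` is a hypothetical offline stand-in for the OpenCalais call, and the entity it returns, its type, its relevance value, and the file name are all made up for illustration.

```python
import csv
import io
from pathlib import Path
from tempfile import TemporaryDirectory


def analyze_text(text):
    """Hypothetical stand-in for tasks 2-3 (calling the analysis service).

    A real implementation would send `text` to the web service. Here a
    hard-coded, made-up lookup is used so the sketch runs offline.
    """
    known = {"Pearl Harbor": "Facility"}  # illustrative only
    return [
        {"type": etype, "name": name, "relevance": 0.5}
        for name, etype in known.items()
        if name in text
    ]


def run_pipeline(input_dir):
    """Task 1: read each .txt file; tasks 2-3: analyze; task 4: emit CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Entity-type", "Entity-name", "Relevance-ratio", "File-source"])
    for path in sorted(Path(input_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for entity in analyze_text(text):
            writer.writerow(
                [entity["type"], entity["name"], entity["relevance"], path.name]
            )
    return out.getvalue()


# Usage: one made-up text file standing in for a finding aid.
with TemporaryDirectory() as tmp:
    Path(tmp, "aid1.txt").write_text(
        "Files on the Pearl Harbor attack.", encoding="utf-8"
    )
    csv_text = run_pipeline(tmp)
print(csv_text)
```

     Keeping the analysis step behind a single function means the file handling and CSV conversion can be tested without network access.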

  9. SAM Tool Step 1: Obtaining Text

  10. OpenCalais Viewer
      • Open source, free version of the semantic analysis engine.
      • Creates semantic metadata (lists of entities and social tags), generated in RDF, that can be used for news aggregators and blogs, as well as other linked data applications.
      • Users can copy and paste text from PDFs, websites, databases, etc. directly into the window.
      • The SAM Tool automates this process of inserting text into the window.

  11. Inputting Text into OpenCalais Semantic Analysis Engine Using the SAM Tool
      Options for inputting text for analysis in the SAM Tool include:
      • Manual copy and paste from an existing document
      • Single file upload
      • Batch file upload

  12. OpenCalais with Input Unstructured Text

  13. SAM Tool Step 2: Extracting Entities and Tags

  14. Example of Results from OpenCalais Semantic Analysis

  15. Entities Generated by OpenCalais
      A Few of the More Useful OpenCalais Entity Types
      • Person
      • Company, Facility, Organization, Product (see also Topics)
      • City, Continent, Country, NaturalFeature, ProvinceOrState, Region
      • MusicAlbum, Movie, PublishedMedium, RadioProgram, TVShow
      • IndustryTerm, Position, Product (see also corporate body names), Technology

  16. OpenCalais Entity Types Mapped to Types of Common LAM Access Points

      OpenCalais Entity Types                                                     | Entity Groupings      | Example Matches to LAM Vocabularies
      Person                                                                      | Personal names        | MARC: 100/700; EAD: <persname>
      Company, Facility, Organization, Product (see also Topics)                  | Corporate body names  | MARC: 110/710; EAD: <corpname>
      City, Continent, Country, NaturalFeature, ProvinceOrState, Region           | Geographic names      | MARC: 651; EAD: <geogname>
      MusicAlbum, Movie, PublishedMedium, RadioProgram, TVShow                    | Publications (Titles) | MARC: 240; EAD: <title>
      IndustryTerm, Position, Product (see also corporate body names), Technology | Topics                | MARC: 650; EAD: <subject>
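
      The mapping on this slide can be expressed as a lookup table. This is a minimal Python sketch, not the SAM Tool's own code. The names `AccessPoint`, `GROUPS`, and `candidates` are hypothetical. Because Product appears under both corporate body names and Topics, the lookup returns every candidate grouping rather than choosing one.

```python
from typing import NamedTuple


class AccessPoint(NamedTuple):
    grouping: str  # entity grouping from the slide
    marc: str      # example MARC field(s)
    ead: str       # example EAD element name


# Entity types per grouping, as listed on the slide.
GROUPS = {
    AccessPoint("Personal names", "100/700", "persname"): ["Person"],
    AccessPoint("Corporate body names", "110/710", "corpname"): [
        "Company", "Facility", "Organization", "Product",
    ],
    AccessPoint("Geographic names", "651", "geogname"): [
        "City", "Continent", "Country", "NaturalFeature",
        "ProvinceOrState", "Region",
    ],
    AccessPoint("Publications (Titles)", "240", "title"): [
        "MusicAlbum", "Movie", "PublishedMedium", "RadioProgram", "TVShow",
    ],
    AccessPoint("Topics", "650", "subject"): [
        "IndustryTerm", "Position", "Product", "Technology",
    ],
}


def candidates(entity_type):
    """Return every access point grouping that lists this entity type."""
    return [ap for ap, types in GROUPS.items() if entity_type in types]


for entity_type in ("City", "Product"):
    for ap in candidates(entity_type):
        print(f"{entity_type}: {ap.grouping} (MARC {ap.marc}, EAD <{ap.ead}>)")
```

      An entity type with more than one candidate, such as Product, would still need a human decision during cleanup.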

  17. Relevance Rankings
      “The relevance scoring takes into account the disambiguation of companies and geographies so that each unique entity will get a single relevance score, even if it is referenced in various ways throughout the text.” (OpenCalais website)

  18. Social Tags Generated by OpenCalais
      • “SocialTags … attempts to emulate how a person would tag a specific piece of content … isn’t true semantic extraction.”
      • “A topic extracted by Categorization with a score higher than 0.6 will also be extracted as a SocialTag. If its score is higher than 0.8, its importance (as a SocialTag) will be set to 1. If the score is between 0.6 and 0.8 its importance is set to 2.” (OpenCalais website)
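
      The threshold rule in the second quotation can be restated as a small function. This Python sketch only paraphrases the quoted rule and is not OpenCalais code. The function name is hypothetical, and the treatment of a score of exactly 0.8 (importance 2) is an assumption, since the quotation does not say which side the boundary falls on.

```python
def social_tag_importance(score):
    """Map a Categorization score to a SocialTag importance, or None.

    Assumption: a score of exactly 0.8 counts as "between 0.6 and 0.8".
    """
    if score > 0.8:
        return 1      # higher than 0.8: importance 1
    if score > 0.6:
        return 2      # higher than 0.6, up to 0.8: importance 2
    return None       # 0.6 or lower: not extracted as a SocialTag


for s in (0.9, 0.7, 0.5):
    print(s, social_tag_importance(s))
```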

  19. SAM Tool Step 3: Converting and Clean-Up

  20. The Resulting Database
      • JSON → CSV
      • The CSV table has four fields:
        • Entity-type
        • Entity-name
        • Relevance-ratio
        • File-source
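
      The conversion into the four-field table might look like the following Python sketch. The JSON shape, the file name, the entity values, and the relevance numbers are all simplified and made up for illustration. Actual OpenCalais output is structured differently and is not reproduced here.

```python
import csv
import io
import json

FIELDS = ["Entity-type", "Entity-name", "Relevance-ratio", "File-source"]

# Hypothetical, simplified analysis result for one finding aid.
raw = json.dumps({
    "file": "sample_finding_aid.txt",
    "entities": [
        {"type": "City", "name": "New York, N.Y.", "relevance": 0.31},
        {"type": "Country", "name": "Japan", "relevance": 0.27},
    ],
})


def json_to_csv(json_text):
    """Flatten one analysis result into CSV rows with the four fields."""
    data = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for entity in data["entities"]:
        writer.writerow({
            "Entity-type": entity["type"],
            "Entity-name": entity["name"],
            "Relevance-ratio": entity["relevance"],
            "File-source": data["file"],
        })
    return out.getvalue()


csv_text = json_to_csv(raw)
print(csv_text)
```

      Using the `csv` module rather than joining strings matters here because entity names such as "New York, N.Y." contain commas and must be quoted.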

  21. Example of Extracted Entities from Finding Aids

  22. Example of Cleanup Activity in Resultant Database

  23. Testing the SAM Tool
      • The test collection consisted of 45 archival finding aids drawn from 16 repositories.
      • Collections were selected to provide a variety of types of archival materials, including:
        • Personal papers
        • Corporate records
        • Government records
        • “Artificial collections,” i.e., materials from multiple provenances gathered to document a particular person, family, corporate body, topic, or event.
      • OpenCalais raw analysis of the finding aids for these collections resulted in:
        • 8,096 individual entities
        • 336 suggested social tags

  24. Testing the SAM Tool (cont.)
      • Semantic analysis identified significantly more potential access points into collection descriptions than the number of controlled vocabulary terms that catalogers had assigned to the same collections in collection-level MARC records.
      • In the test collection, the median number of corporate body names assigned in collection-level MARC records ranged from 0 to 2, depending on the type of collection.
      • For the same collections, analysis of the full text of the finding aids (describing the full extent of the collection at all levels) yielded a median number of uncontrolled corporate body entities that could range from 0 to 71, depending on the type of collection and the place in the finding aid. Detailed descriptions of series, subseries, files, and items provided the most potential entities.

  25. Testing the SAM Tool (cont.)
      Data cleanup will reduce the number of unique entities through the processes of:
      • Deduplication;
      • Collapse of synonyms into single data points;
      • Removal of incorrect extractions.
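
      The three cleanup processes can be sketched in a few lines of Python. This is an illustration only. The slides describe doing this work in a cleanup application such as OpenRefine, not in code. The function name, the synonym table, the rejection list, and the sample rows (including the made-up misextraction "Box 3") are all hypothetical.

```python
def clean(rows, synonyms, rejected):
    """Apply the three cleanup steps to (entity_type, entity_name) rows."""
    seen = set()
    cleaned = []
    for entity_type, name in rows:
        name = synonyms.get(name, name)   # collapse synonyms to one form
        key = (entity_type, name)
        if key in rejected:               # remove incorrect extractions
            continue
        if key in seen:                   # deduplicate
            continue
        seen.add(key)
        cleaned.append(key)
    return cleaned


# Made-up sample rows for illustration.
rows = [
    ("City", "New York, N.Y."),
    ("City", "New York"),
    ("City", "New York, N.Y."),
    ("Person", "Box 3"),
]
synonyms = {"New York": "New York, N.Y."}
rejected = {("Person", "Box 3")}

print(clean(rows, synonyms, rejected))
```

      Synonyms are collapsed before the duplicate check so that variant spellings of the same entity end up as a single row.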

  26. Errors Generated by the Semantic Analysis Process
      • Entity Duplication
      • Entity Variants
      • Entity Miscategorization
      • Inclusion of Unrelated Text as Part of Entity Name

  27. Entity duplication
      • Common in archival finding aids, where the same entity can be mentioned in multiple places (history and scope notes, the container listings, series descriptions, etc.)
      • Example: New York, N.Y. (extracted and listed five times from the same finding aid)
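
      One way to handle such repeats is to collapse identical rows while keeping a count of how often each entity was extracted. This short Python sketch assumes rows of (entity type, entity name, source file). The file name and the second entity are made up. Only the five-fold New York, N.Y. repeat comes from the slide.

```python
from collections import Counter

# Five identical extractions from one finding aid, plus one other entity.
rows = [("City", "New York, N.Y.", "aid1.txt")] * 5 + [
    ("Country", "Japan", "aid1.txt"),  # made-up extra row
]

counts = Counter(rows)

# One row per unique entity, with the number of mentions appended.
unique = [(*key, n) for key, n in counts.items()]

for row in unique:
    print(row)
```

      Keeping the count rather than discarding it preserves a rough signal of how prominent the entity is within the finding aid.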

  28. Entity variants
      • Finding aids can contain multiple variants of names, particularly personal and corporate body names.
      • The biography or administrative history is the most likely place for entity variants to appear, as names can change over a person’s life or the life of a corporate body.
      • It can be particularly difficult to resolve names in archival descriptions, as these names are less likely to appear in national/international authority lists.
      • Example below, from the Alexander Pope Papers finding aid (three variants found):
