Migrating The Language Archive to a new repository solution
Paul Trilsbeek
Max Planck Institute for Psycholinguistics
Photo: Gunter Senft

The Language Archive
• Digital archive of language materials based at the Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands (one of 84 research institutes of the German Max Planck Society)
• The archive has existed since the late 1990s, initially archiving language materials from our own field researchers and language acquisition researchers
• Became the central archive for the DOBES endangered languages documentation programme, funded by the Volkswagen Foundation in 2000
• The archive was named “The Language Archive” (TLA) in 2011, as a collaboration between three research funding organisations

Collections in The Language Archive
• Holds more than 350 collections covering more than 250 different languages:
    • Languages from around the world studied by Max Planck Institute field linguists
    • First and second language acquisition corpora
    • Endangered languages documented for the VolkswagenStiftung DOBES programme
    • Spoken Dutch corpus
    • Sign language corpora
• More than 15,000 hours of audio and video recordings
• More than 1 million files
• About 110 TB of data

UNESCO Memory of the World
• October 2015: Selected collections of TLA added to the UNESCO Memory of the World register
• 64 collections, containing materials from 102 different languages
• 3,000 hours of video, 5,000 hours of audio, 43,000 images, 17,000 written documents
• Great recognition of the value of these collections for the world, as well as of the work that TLA has done in preserving them

TLA Repository
• Starting in the late 1990s, a repository solution was developed in-house, since no existing solution suited our needs
• Over the years this grew into a rather complex system using a variety of frameworks and paradigms, developed by many different developers → difficult and costly to maintain, not optimal in terms of user experience, partly using outdated web technology
• Meanwhile, various open source repository systems had been developed and had become widely used
• 2014: decision made to build a new repository on top of an existing open source platform, to reduce maintenance costs and to enhance the user experience

CLARIN Centre
• TLA is a centre of the CLARIN European Research Infrastructure for Language Resources and Technology
• Being a CLARIN “B Type” centre comes with certain technical requirements for the repository, so that it is interoperable with the overall infrastructure
• The Meertens Institute in Amsterdam is also a CLARIN centre and was a partner in TLA. It had similar needs for a repository, so the new repository solution was developed jointly by the Max Planck Institute and the Meertens Institute

CLARIN “B Type” Centre requirements
• Support CLARIN CMDI metadata
• Offer metadata via the OAI-PMH protocol
• Support for Shibboleth/SAML2 authentication in order to be part of the CLARIN Service Provider Federation
• Support for persistent identifiers (e.g. using the Handle system)
• Repository must be able to meet CoreTrustSeal requirements
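
To illustrate the OAI-PMH requirement, here is a minimal sketch of how a harvester might build a ListRecords request. The `verb`, `metadataPrefix` and `set` parameters come from the OAI-PMH protocol itself; the base URL, the helper name `oai_list_records_url` and the prefix value `cmdi` are assumptions made for this example and are not taken from the talk.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="cmdi", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url and the default prefix "cmdi" are illustrative assumptions;
    a real endpoint advertises its prefixes via the ListMetadataFormats verb.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional selective harvesting of one set (e.g. one collection).
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, used only to show the shape of the request.
print(oai_list_records_url("https://repository.example.org/oai"))
```

A harvester would fetch this URL and then follow any `resumptionToken` returned in the response to page through the remaining records.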

Further requirements at TLA
• Support for faceted search
• Versioning support
• Support for data types present in TLA
• Checksum support
• Support for Persistent Identifiers using the Handle system
• File format verification
• Elaborate access control
• LDAP support for authentication
• Preferably built in programming languages in which we had expertise

Existing repository system comparison
• Basic criteria:
    • Open Source
    • “Mature”
    • Widely used
    • Actively maintained
• A number of solutions were evaluated to see whether they met our further technical requirements:
    • DSpace
    • EPrints
    • Fedora Commons 3.8.1/Islandora
    • Fedora Commons 3.8.1/Hydra (now Samvera)
    • Greenstone

Feature comparison

Feature                     Fedora Commons    DSpace    EPrints   Greenstone
Main programming language   Java              Java      Perl      Java
Nested collections          Yes               Partly    No        No
Accommodate CMDI            Yes               Yes       Yes       Yes
Support data types          Yes               Yes       Yes       Yes
File format verification    Islandora/Hydra   Plug-in   No        No
Checksums                   Yes               Yes       Yes       Yes
Versioning                  Yes               Yes       Yes       No
Handle PID                  Plug-in           Yes       No        No
OAI-PMH                     Yes               Yes       Yes       Yes
Access control              Yes               Yes       No        Yes
LDAP                        Yes               Yes       Yes       No
Shibboleth                  Plug-in           Yes       Yes       No
Facet search                Islandora/Hydra   Yes       Plug-in   Yes

Our choice: Fedora/Islandora
• Both DSpace and the two Fedora-based systems met most of the technical requirements
• Two main reasons for choosing a Fedora-based system over DSpace:
    • Deeply nested collection hierarchies could not be easily reflected in the DSpace content model (at least in 2014)
    • Turnkey-style solution with integrated front-ends meant that modifications would likely have to be made in the DSpace core
• Reasons for choosing Islandora over Hydra:
    • Programming language expertise present (PHP vs. Ruby)
    • Highly modular approach of Islandora (Drupal modules)
    • (even though the Hydra/Sufia deposit UI was better suited)

Islandora performance testing
• Tools were developed to transform existing collections with CMDI metadata into FOXML that could be ingested into Fedora
• All 1 million objects were ingested into a Fedora/Islandora instance using Fedora batch ingest scripts
• Most performance bottlenecks could be solved by making use of the (optional) Solr index, rather than the Mulgara triple store
• All data was ingested as “Externally managed” datastreams in Fedora. This made ingest faster and gave us more control over file system locations
• Conclusion: the Fedora/Islandora combination was fast enough for the size of our repository
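
As a rough illustration of what an “Externally managed” datastream looks like in FOXML, the sketch below builds one datastream element whose control group is `E` and whose content is referenced by location rather than copied into Fedora. The element and attribute names follow FOXML 1.1 as used by Fedora Commons 3.x; the helper name `external_datastream`, the datastream ID, the MIME type and the file path are assumptions for this example, and this is not the conversion tooling described in the talk.

```python
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"
ET.register_namespace("foxml", FOXML_NS)


def external_datastream(ds_id, mime_type, location_url, label=""):
    """Return a FOXML datastream element with CONTROL_GROUP="E".

    With control group E, Fedora stores only a reference to the content,
    so the file itself stays where the archive placed it.
    """
    ds = ET.Element(f"{{{FOXML_NS}}}datastream", {
        "ID": ds_id,
        "STATE": "A",            # active
        "CONTROL_GROUP": "E",    # externally referenced content
        "VERSIONABLE": "true",
    })
    version = ET.SubElement(ds, f"{{{FOXML_NS}}}datastreamVersion", {
        "ID": f"{ds_id}.0",
        "LABEL": label,
        "MIMETYPE": mime_type,
    })
    ET.SubElement(version, f"{{{FOXML_NS}}}contentLocation", {
        "TYPE": "URL",
        "REF": location_url,
    })
    return ds


# Hypothetical file location, used only for illustration.
ds = external_datastream("OBJ", "audio/x-wav",
                         "file:///archive/example/recording.wav",
                         label="recording.wav")
print(ET.tostring(ds, encoding="unicode"))
```

Because only the reference is ingested, the cost of an ingest does not depend on the size of the media file, which is consistent with the faster ingest reported above.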

Deposit front-end and back-end
• A custom deposit front-end and back-end was needed, since we required a more controlled ingest workflow as well as a user interface that was easy to use for self-deposit by our researchers
• The “Doorkeeper” ingest workflow engine was developed to perform a customizable set of actions before data is eventually ingested into Fedora, e.g.:
    • Check SIP completeness
    • Check whether ingested files are of accepted types and conform to defined criteria (XPath rules on FITS output)
    • Check CMDI metadata validity and transform to DC + OLAC
    • Issue Handle PIDs and add them to metadata
    • Update parent of ingested object (CMDI has top-down hierarchy)
    • Move files to persistent storage
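
The idea of a customizable sequence of pre-ingest actions can be sketched as a list of functions that each inspect or enrich a SIP and abort the workflow on failure. This is only an illustration under stated assumptions, not the Doorkeeper implementation: every name here (`ActionFailed`, `run_workflow`, the SIP dictionary layout, the extension whitelist) is hypothetical, and the extension check is a simplified stand-in for the XPath rules on FITS output mentioned above.

```python
import hashlib
import os

# Hypothetical whitelist; the real criteria are XPath rules on FITS output.
ACCEPTED_EXTENSIONS = {".wav", ".mp4", ".eaf", ".txt"}


class ActionFailed(Exception):
    """Raised by an action to stop the workflow before ingest."""


def check_completeness(sip):
    # Every file declared in the metadata must be present in the package.
    missing = [n for n in sip["declared_files"] if n not in sip["files"]]
    if missing:
        raise ActionFailed(f"files missing from SIP: {missing}")


def check_file_types(sip):
    for name in sip["files"]:
        if os.path.splitext(name)[1].lower() not in ACCEPTED_EXTENSIONS:
            raise ActionFailed(f"file type not accepted: {name}")


def add_checksums(sip):
    # Record a SHA-256 digest per file so later copies can be verified.
    sip["checksums"] = {name: hashlib.sha256(data).hexdigest()
                        for name, data in sip["files"].items()}


def run_workflow(sip, actions):
    """Run actions in order; stop at the first failure."""
    for action in actions:
        try:
            action(sip)
        except ActionFailed as err:
            return False, f"{action.__name__}: {err}"
    return True, "ready for ingest"


sip = {"declared_files": ["story.wav"], "files": {"story.wav": b"RIFF"}}
print(run_workflow(sip, [check_completeness, check_file_types, add_checksums]))
```

Keeping each step as an independent function means the set and order of actions can be changed per workflow without touching the engine loop.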

Deposit front-end
• Connects to local network shares where internal users store their research data
• Connects to a self-hosted Nextcloud instance running on the same server, where external depositors upload data
• Metadata can be uploaded as XML files or entered using web forms
• “Validation” step checks whether the SIP is complete and conforms to defined archival standards
• Existing objects in the repository can be modified or amended (versions will be created)

Migration: January/February 2018
• Migration of over 1 million objects comprising over 100 TB of data took a bit more than a week
• Production setup uses the Blazegraph triple store rather than Mulgara
• All object datastreams “Externally managed”
• Performance on a 6-core VM with 48 GB of RAM overall very good
• Spring 2018: soft launch of deposit UI with selected researchers
• October 2018: deposit UI made available to all depositors
• CoreTrustSeal certified in January 2019

Repository Browse/Search/Explore

Remaining performance issues
• Changing access permissions on large collections takes a very long time
• Occasional performance issues when requests come in at a fast pace; the cause is not yet known
• Slow loading of “compound objects” with many children (> 100), since SPARQL is still used there instead of Solr
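
For the compound-object case, a child lookup served from the Solr index instead of SPARQL could look roughly like the sketch below. The `select` handler and the `q`, `fl`, `rows` and `wt` parameters are standard Solr; the Solr base URL, the PID, the index field names (`RELS_EXT_isConstituentOf_uri_ms`, `PID`) and the helper name are assumptions, since field names depend on how the index is configured, and this is not presented as the fix adopted at TLA.

```python
from urllib.parse import urlencode


def children_query_url(solr_base, parent_pid,
                       relation_field="RELS_EXT_isConstituentOf_uri_ms",
                       rows=1000):
    """Build a Solr select URL listing the children of a compound object.

    relation_field and the returned "PID" field are assumed index field
    names; adjust them to the actual Solr schema.
    """
    params = {
        # Phrase query on the indexed parent relation of each child.
        "q": f'{relation_field}:"info:fedora/{parent_pid}"',
        "fl": "PID",      # return only the child identifiers
        "rows": rows,     # one request instead of one lookup per child
        "wt": "json",
    }
    return f"{solr_base}/select?{urlencode(params)}"


# Hypothetical Solr location and PID, used only for illustration.
print(children_query_url("http://localhost:8080/solr", "demo:1"))
```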