Migrating The Language Archive to a new repository solution
Paul Trilsbeek, Max Planck Institute for Psycholinguistics
Photo: Gunter Senft

The Language Archive
• Digital archive of language materials based at the Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands (one of 84 research institutes of the German Max Planck Society)
• The archive has existed since the late 1990s, initially archiving language materials from our own field researchers and language acquisition researchers
• In 2000 it became the central archive for the DOBES endangered languages documentation programme, funded by the Volkswagen Foundation
• The archive was named “The Language Archive” (TLA) in 2011, as a collaboration between 3 research funding organisations

Collections in The Language Archive
• Holds more than 350 collections covering more than 250 different languages:
  • Languages from around the world studied by Max Planck Institute field linguists
  • First and second language acquisition corpora
  • Endangered languages documented for the VolkswagenStiftung DOBES programme
  • Spoken Dutch corpus
  • Sign language corpora
• More than 15,000 hours of audio and video recordings
• More than 1 million files
• About 110 TB of data

UNESCO Memory of the World
• October 2015: Selected collections of TLA added to the UNESCO Memory of the World register
• 64 collections, containing materials from 102 different languages
• 3,000 hours of video, 5,000 hours of audio, 43,000 images, 17,000 written documents
• Great recognition of the value of these collections for the world, as well as of the work that TLA has done in preserving them

TLA Repository
• Starting in the late ’90s, a repository solution was developed in-house, since no existing solution suited our needs
• Over the years this grew into a rather complex system using a variety of frameworks and paradigms, developed by many different developers → difficult and costly to maintain, not optimal in terms of user experience, partly using outdated web technology
• Meanwhile, various open source repository systems had been developed and had become widely used
• 2014: decision was made to build a new repository using an existing open source platform as a basis, to reduce maintenance costs and to enhance the user experience

CLARIN Centre
• TLA is a centre of the CLARIN European Research Infrastructure for Language Resources and Technology
• Being a CLARIN “B Type” centre comes with certain technical requirements for the repository, so that it is interoperable with the overall infrastructure
• The Meertens Institute in Amsterdam is also a CLARIN centre and was a partner in TLA. It had similar needs for a repository, so the new repository solution was developed jointly by the Max Planck Institute and the Meertens Institute

CLARIN “B Type” Centre requirements
• Support CLARIN CMDI metadata
• Offer metadata via the OAI-PMH protocol
• Support for Shibboleth/SAML2 authentication in order to be part of the CLARIN Service Provider Federation
• Support for persistent identifiers (e.g. using the Handle system)
• Repository must be able to meet CoreTrustSeal requirements
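
To make the OAI-PMH requirement concrete, here is a minimal sketch of how a harvester might build a ListRecords request. The endpoint URL and the "cmdi" metadata prefix are assumptions for illustration; the slides do not state either. The verb and argument names (verb, metadataPrefix, set, resumptionToken) come from the OAI-PMH 2.0 protocol.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the slides do not give the repository's OAI-PMH URL.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(metadata_prefix="cmdi", set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    "cmdi" is an assumed metadata prefix; a real harvester would first ask
    the endpoint which prefixes it offers (verb=ListMetadataFormats).
    """
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: it replaces
        # metadataPrefix and set when fetching the next page of results.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec is not None:
            params["set"] = set_spec
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(list_records_url())
    print(list_records_url(resumption_token="page-2"))
```

A harvester would fetch the first URL, read the resumptionToken element from the response if present, and repeat with that token until no token is returned.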

Further requirements at TLA
• Support for faceted search
• Versioning support
• Support for data types present in TLA
• Checksum support
• Support for Persistent Identifiers using the Handle system
• File format verification
• Elaborate access control
• LDAP support for authentication
• Preferably built with programming languages for which we had in-house expertise
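
As an illustration of what checksum support involves, the sketch below computes and verifies a file checksum with Python's standard library. The choice of SHA-256 and the function names are assumptions; the slides do not say which algorithm the repository uses.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks so that large
    audio and video files never have to fit in memory at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected, algorithm="sha256"):
    """Compare a stored checksum with a freshly computed one (fixity check)."""
    return file_checksum(path, algorithm) == expected.lower()
```

Storing the digest at ingest time and re-running verify_checksum periodically is one common way to detect silent corruption of archived files.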

Existing repository system comparison
• Basic criteria:
  • Open Source
  • “Mature”
  • Widely used
  • Actively maintained
• A number of solutions were evaluated to see whether they met our further technical requirements:
  • DSpace
  • EPrints
  • Fedora Commons 3.8.1/Islandora
  • Fedora Commons 3.8.1/Hydra (now Samvera)
  • Greenstone

Feature comparison

Feature                      Fedora Commons    DSpace    EPrints   Greenstone
Main programming language    Java              Java      Perl      Java
Nested collections           Yes               Partly    No        No
Accommodate CMDI             Yes               Yes       Yes       Yes
Support data types           Yes               Yes       Yes       Yes
File format verification     Islandora/Hydra   Plug-in   No        No
Checksums                    Yes               Yes       Yes       Yes
Versioning                   Yes               Yes       Yes       No
Handle PID                   Plug-in           Yes       No        No
OAI-PMH                      Yes               Yes       Yes       Yes
Access control               Yes               Yes       No        Yes
LDAP                         Yes               Yes       Yes       No
Shibboleth                   Plug-in           Yes       Yes       No
Faceted search               Islandora/Hydra   Yes       Plug-in   Yes

Our choice: Fedora/Islandora
• Both DSpace and the two Fedora-based systems met most of the technical requirements
• Two main reasons for choosing a Fedora-based system over DSpace:
  • Deeply nested collection hierarchies could not easily be reflected in the DSpace content model (at least in 2014)
  • DSpace is a turnkey-style solution with integrated front-ends, so modifications would likely have had to be made in the DSpace core
• Reasons for choosing Islandora over Hydra:
  • Programming language expertise present (PHP vs. Ruby)
  • Highly modular approach of Islandora (Drupal modules)
  • (even though the Hydra/Sufia deposit UI was better suited)

Islandora performance testing
• Tools were developed to transform existing collections with CMDI metadata into FOXML that could be ingested into Fedora
• All 1 million objects were ingested into a Fedora/Islandora instance using Fedora batch ingest scripts
• Most performance bottlenecks could be solved by making use of the (optional) Solr index, rather than the Mulgara triple store
• All data was ingested as “Externally managed” datastreams in Fedora. This made ingest faster and gave us more control over file system locations
• Conclusion: the Fedora/Islandora combination was fast enough for the size of our repository
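
The sketch below shows roughly what a FOXML object with an externally located datastream could look like when generated from a script. It is not the transformation tool described above. It assumes that “Externally managed” corresponds to FOXML control group "E" (content referenced by location rather than stored inside Fedora) and uses the Islandora convention of an "OBJ" datastream. The PID, label and file URL in the usage example are hypothetical, and the element and attribute names should be checked against the FOXML 1.1 schema before real use.

```python
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"
ET.register_namespace("foxml", FOXML_NS)


def _tag(name):
    """Return the namespace-qualified tag name ElementTree expects."""
    return "{%s}%s" % (FOXML_NS, name)


def foxml_with_external_datastream(pid, label, file_url, mime_type):
    """Build a minimal FOXML document whose single datastream points at a
    file by URL instead of embedding or copying its content."""
    root = ET.Element(_tag("digitalObject"), {"VERSION": "1.1", "PID": pid})

    props = ET.SubElement(root, _tag("objectProperties"))
    ET.SubElement(props, _tag("property"), {
        "NAME": "info:fedora/fedora-system:def/model#state",
        "VALUE": "Active",
    })
    ET.SubElement(props, _tag("property"), {
        "NAME": "info:fedora/fedora-system:def/model#label",
        "VALUE": label,
    })

    # Assumption: control group "E" marks externally referenced content.
    datastream = ET.SubElement(root, _tag("datastream"), {
        "ID": "OBJ", "STATE": "A", "CONTROL_GROUP": "E", "VERSIONABLE": "true",
    })
    version = ET.SubElement(datastream, _tag("datastreamVersion"), {
        "ID": "OBJ.0", "LABEL": label, "MIMETYPE": mime_type,
    })
    ET.SubElement(version, _tag("contentLocation"), {
        "TYPE": "URL", "REF": file_url,
    })
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical PID and file location, for illustration only.
    print(foxml_with_external_datastream(
        "demo:1", "Example recording",
        "file:///archive/demo/recording.wav", "audio/x-wav"))
```

Because the datastream only carries a reference, ingest does not need to copy the file, which fits the slide's remark that this made ingest faster and left control over file system locations with the archive.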

Deposit front- and back-end
• A custom deposit front- and back-end was needed, since we needed a more controlled ingest workflow as well as a user interface that was easy to use for self-deposit by our researchers
• “Doorkeeper” ingest workflow engine was developed to perform a customizable set of actions before data is eventually ingested into Fedora, e.g.:
  • Check SIP completeness
  • Check whether ingested files are of accepted types and conform to defined criteria (XPath rules on FITS output)
  • Check CMDI metadata validity and transform to DC + OLAC
  • Issue Handle PIDs and add them to metadata
  • Update parent of ingested object (CMDI has top-down hierarchy)
  • Move files to persistent storage
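
The idea of a customizable sequence of checks can be sketched as a list of step functions run in order. This is not the Doorkeeper code: the SIP layout, the step names and the XPath rules below are all hypothetical, and the XML reports are a simplified stand-in loosely modelled on file characterisation output (real FITS output is namespaced and richer).

```python
import xml.etree.ElementTree as ET

# Hypothetical acceptance rules: a file passes if any XPath matches its report.
ACCEPTED_TYPE_RULES = [
    ".//identity[@mimetype='audio/x-wav']",
    ".//identity[@mimetype='text/plain']",
]


def check_completeness(sip):
    """Step 1: a SIP needs metadata and at least one file."""
    errors = []
    if not sip.get("metadata"):
        errors.append("missing metadata")
    if not sip.get("files"):
        errors.append("no files in SIP")
    return errors


def check_file_types(sip):
    """Step 2: apply the XPath rules to each file's characterisation report."""
    errors = []
    for name, report_xml in sip.get("files", {}).items():
        report = ET.fromstring(report_xml)
        if not any(report.find(rule) is not None for rule in ACCEPTED_TYPE_RULES):
            errors.append("%s: file type not accepted" % name)
    return errors


def run_pipeline(sip, steps):
    """Run steps in order and stop at the first one that reports errors.

    Returns (name_of_failed_step, errors), or (None, []) if all steps pass.
    """
    for step in steps:
        errors = step(sip)
        if errors:
            return step.__name__, errors
    return None, []


if __name__ == "__main__":
    sip = {
        "metadata": "<CMD/>",
        "files": {
            "rec.wav": "<fits><identification><identity mimetype='audio/x-wav'/></identification></fits>",
            "doc.pdf": "<fits><identification><identity mimetype='application/pdf'/></identification></fits>",
        },
    }
    print(run_pipeline(sip, [check_completeness, check_file_types]))
```

Keeping each action as an independent function makes the set of steps configurable per deposit type: later actions such as issuing PIDs or moving files would simply be further entries in the list, run only after the checks pass.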

Deposit front-end
• Connects to local network shares where internal users store their research data
• Connects to a self-hosted Nextcloud instance running on the same server, where external depositors can upload data
• Metadata can be uploaded as XML files or entered using web forms
• “Validation” step checks whether the SIP is complete and conforms to defined archival standards
• Existing objects in the repository can be modified or amended (versions will be created)

Deposit front-end [slide image]

Migration: January/February 2018
• Migration of over 1 million objects comprising over 100 TB of data took a bit more than a week
• Production setup uses the Blazegraph triple store rather than Mulgara
• All object datastreams “Externally managed”
• Performance on a 6-core VM with 48 GB of RAM overall very good
• Spring 2018: soft launch of deposit UI with selected researchers
• October 2018: deposit UI made available to all depositors
• CoreTrustSeal certified in January 2019

Repository Browse/Search/Explore [slide image]

Remaining performance issues
• Changing access permissions on large collections takes a very long time
• Occasional performance issues when requests come in at a fast pace; the cause is not yet known
• Slow loading of “compound objects” with many children (> 100), since SPARQL is still used there instead of Solr
