CORRESPONDENCE MANAGEMENT SYSTEMS (FOR DIGITAL OBJECT METADATA) - PowerPoint PPT Presentation

LEVERAGING CORRESPONDENCE MANAGEMENT SYSTEMS (FOR DIGITAL OBJECT METADATA) BRIAN THOMAS ELECTRONIC RECORDS SPECIALIST TEXAS STATE LIBRARY AND ARCHIVES COMMISSION

DISCLAIMER This presentation and any subsequent discussion represent work and perspectives on work completed at the Texas State Library and Archives Commission by the presenter. Opinions and perspectives provided by this presenter are their own and may not indicate the official stance of the agency.

CTS: THE CORRESPONDENCE TRACKING SYSTEM Some details 1. Completely homegrown system 2. Interface written in Visual Basic 6 3. Running against a MS SQL Server database 4. The database itself is a record 5. Covers physical mail, webmail, phone calls 6. Each mail/webmail item was supposed to have a corresponding image file or PDF

WHAT IF… The content in the database could be extracted in a way that captured the elements of the Governor’s staff interface? And then paired with the individual images themselves in the preservation/access system for staff research? And possibly indexed for some linked data fun?

FROM: HTTPS://WWW.YOUTUBE.COM/WATCH?V=AOF5LCT5JD0

IF YOU HAVE A HAMMER, EVERYTHING LOOKS LIKE A NAIL About me and the tools at my disposal 1. I had been working on database preservation 2. I love virtualization 3. I had also been using Python extensively for API and data manipulation in other projects 4. Therefore almost all work was done with Python in a virtual machine for this project 5. I like the new Doctor Courtesy https://imgur.com/gallery/NIgUNZZ

OVERVIEW OF THE WORK 1. Preserve the database 2. Study database structure 3. Export and manipulate data 4. Export data to valid sidecar files 5. Final data manipulation 6. Fix miscellaneous problems

THE ACTUAL STEPS ● Get SQL Server 2018 running ● Preserve the database into SIARD format ● Review tables in SQL Server Management Studio and Database Visualization Toolkit to understand data structure ● Review fields in CTS GUI to see what staff would have worked with ● Determine how tables should be connected ● Export tables to CSV format ● Use Python PANDAS to merge tables ● Replace illegal characters in spreadsheets ● Use Python script to export metadata into individual files ● Use Python script to create valid XML ● Use Python script to validate the XML ● Fix broken XML, re-validate until all good ● Transform metadata export to desired schema (x2, see later explanation) ● Use Python script to remove artifacts from transforms ● Use Python to correct filenaming/pairing errors ● Re-upload files with sidecar metadata

STEP ONE: Preserve the database

STEP 1: PRESERVE THE DATABASE Running SQL Server: ● First step, see the database in its actual unmediated format ● Take SQL dump and import it into SQL Server ● Use SQL Server Management Studio or similar software to review structure and contents ● Maybe can export directly to a spreadsheet? Run XML export? ● SQL Server Management Studio available here: https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-2017 Run Database Preservation Toolkit: ● SIARD format, XML-based ○ captures all database content and most functions ● Invented by Swiss Federal Archives ○ SIARD Suite app converted databases to SIARD ● Database Preservation Toolkit is a product of E-ARK and seeks to automate conversion, more detailed SIARD2 standard ● http://www.database-preservation.com/ ● Later Swiss Federal Archives released a tool for SIARD2.1 standard ○ https://www.bar.admin.ch/bar/en/home/archiving/tools/siard-suite.html

IN SQL SERVER MANAGEMENT STUDIO

IN DATABASE VISUALIZATION TOOLKIT

WHAT IT SHOULD HAVE LOOKED LIKE

STEP TWO: Study database structure

STEP 2: STUDY THE DATABASE STRUCTURE 1. Review staff GUI for essential elements 2. Find elements in database tables 3. Develop a plan on how to reconstruct the information elements from all tables 4. Beware programmatic joins not represented in linked tables

STEP THREE: Export and manipulate data

STEP 3: EXPORT AND MANIPULATE DATA 1. Export each table to CSV using a DBVTK export function 2. Load individual CSVs using Python PANDAS 3. Merge CSV files on shared column data a. Use an outer, inner, left/right join? 4. Iteratively save, slice and dice the output

STEP FOUR: Export data to valid sidecar files

STEP 4: EXPORT DATA TO VALID SIDECAR FILES ● Eliminate the illegal characters from the CSV(s) first ○ I didn’t the first time and spent over a day correcting the results ● Load each CSV and run a script to export that data into a metadata file per ??? ○ Make sure it appends data, not overwrites. You may have multiple entries for the same thing ● Run a script to encapsulate the data to create valid XML ● Run another script to validate your XML
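A minimal sketch of the first bullet (cleaning the CSVs before anything else), assuming "illegal characters" means characters that XML 1.0 does not permit, such as most control characters. The slide does not say which characters actually caused trouble, and the names clean_text and clean_csv are hypothetical.

```python
import csv
import io
import re

# Assumed definition of "illegal": anything outside the character
# ranges that XML 1.0 permits in a document.
ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_text(value):
    """Drop characters that cannot appear in an XML 1.0 document."""
    return ILLEGAL_XML_CHARS.sub("", value)


def clean_csv(source, target):
    """Copy CSV rows from one open text file to another, cleaning every cell."""
    writer = csv.writer(target)
    for row in csv.reader(source):
        writer.writerow([clean_text(cell) for cell in row])


# Usage with in-memory files; real CSV files would be opened with newline="".
dirty = io.StringIO("id,subject\r\n1,Road\x0b repair\x1f request\r\n")
cleaned = io.StringIO()
clean_csv(dirty, cleaned)
```

Characters such as & and < are legal in a CSV cell and only need escaping when the XML is written, so they are deliberately left alone here.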

STEP FIVE: Final data manipulation

STEP 5: FINAL DATA MANIPULATION ● Check existing XML schemas for fit ○ 95 data points ○ TEI too simple ○ Qualified Dublin Core not a good fit ● Write your own? ○ Yes! ● Run XSLTs against XML files to match chosen/written schema ● Run more XSLTs to de-dupe content ● Re-arrange XML into correct directory structure ● Pair with files in-system or re-upload files

STEP SIX: Fix miscellaneous problems

PROBLEM ONE: MISSING IMAGES AND DB ENTRIES ● Everything should have been there ● Paper correspondence only sampled ● Some images had no metadata. Outgoing/incoming correspondence not logged? Log name is correct? ● Some metadata had no images. Missing files? Never scanned? ● 353,674 Mail entries without any logged scan. Never scanned? Forgot to add filename? ● Yes to all

PROBLEM ONE: SOLUTION(S) ● Develop a script to identify what might be missing ● Including specific filepaths for processing ● Create a cute no-scan placeholder file for missing scans so metadata is preserved ● Leave items without metadata as is. Still text searchable
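One way the "identify what might be missing" script could look, as a sketch only. It assumes each sidecar metadata file sits at the same relative path as its image and differs only in extension (.xml here); that naming rule, the example paths, and the function name find_unmatched are illustrative assumptions, not details from the slides.

```python
from pathlib import PurePosixPath


def find_unmatched(image_paths, metadata_paths):
    """Return (images without metadata, metadata without images).

    Files are paired on their full relative path minus the extension,
    so the comparison is sensitive to both folder and exact filename.
    """
    images = {str(PurePosixPath(p).with_suffix("")): p for p in image_paths}
    metadata = {str(PurePosixPath(p).with_suffix("")): p for p in metadata_paths}
    images_without_metadata = sorted(images[k] for k in images.keys() - metadata.keys())
    metadata_without_images = sorted(metadata[k] for k in metadata.keys() - images.keys())
    return images_without_metadata, metadata_without_images


# Illustrative paths only.
scans = ["2001/06/0611/200106110167.tif", "2001/06/0611/200106110168.tif"]
sidecars = ["2001/06/0611/200106110167.xml", "2001/06/0611/200106110169.xml"]
no_metadata, no_scan = find_unmatched(scans, sidecars)
```

The second list is the one that would receive the no-scan placeholder file; the first list is left as is.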

PROBLEM TWO: CAPITALIZATION ERRORS ● False negatives for matching XML because… ● Staff did not capitalize database entries the same way they capitalized the images ● Problem because metadata pairing process is sensitive to exact filename Solution ● Use comparative script to generate a list of image/metadata files without matches (with filepath) ● Use a script to de-capitalize listed filenames and compare. ● If there is a match, use the image version of the filename to rename the metadata file
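The capitalization fix can be sketched as a function that only plans the renames, so the result can be reviewed before any file is touched. It assumes the two input lists are the unmatched images and unmatched metadata files from a comparison script, and that a sidecar shares its image's path apart from the extension; the filenames and the name plan_case_fixes are hypothetical.

```python
from pathlib import PurePosixPath


def plan_case_fixes(unmatched_images, unmatched_metadata):
    """Map each metadata path to a new path whose case matches its image."""
    # Index the images by their lower-cased path without the extension.
    by_lower = {
        str(PurePosixPath(p).with_suffix("")).lower(): PurePosixPath(p)
        for p in unmatched_images
    }
    renames = {}
    for meta in unmatched_metadata:
        meta_path = PurePosixPath(meta)
        image = by_lower.get(str(meta_path.with_suffix("")).lower())
        if image is not None:
            # Keep the metadata extension, take everything else from the image.
            renames[meta] = str(image.with_suffix(meta_path.suffix))
    return renames


fixes = plan_case_fixes(
    ["2001/06/0611/Smith_Letter.TIF"],
    ["2001/06/0611/smith_letter.xml", "2001/06/0611/other.xml"],
)
```

Each key/value pair could then be passed to os.rename once the list has been checked by eye.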

PROBLEM THREE: SAME IMAGE IN MULTIPLE PLACES ● False negatives for matching XML because… ● The file is in another folder altogether ● And it is in multiple places Solution ● Use comparative script to generate a list of image/metadata files without matches (with filepath) ● Use a script to de-capitalize listed filenames, drop the filepath and compare. ● If there is a match, copy the file to a new location with the correct filepath
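A sketch of the folder-independent comparison. The slide does not say whether the image or the metadata file is the one copied; this version assumes the metadata file is copied next to every copy of the image, and that sidecars use the image's name with a different extension. The function name plan_sidecar_copies and the paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def plan_sidecar_copies(unmatched_metadata, image_paths):
    """List (source, destination) copies that place metadata beside each image."""
    # Index every image by lower-cased filename, ignoring its folder.
    images_by_name = defaultdict(list)
    for image in image_paths:
        image_path = PurePosixPath(image)
        images_by_name[image_path.stem.lower()].append(image_path)

    copies = []
    for meta in unmatched_metadata:
        meta_path = PurePosixPath(meta)
        for image in images_by_name.get(meta_path.stem.lower(), []):
            destination = image.with_suffix(meta_path.suffix)
            if destination != meta_path:
                copies.append((meta, str(destination)))
    return copies


planned = plan_sidecar_copies(
    ["2001/06/0611/200106110167.xml"],
    ["2001/01/0111/200106110167.tif", "2002/03/0301/200106110167.TIF"],
)
```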

PROBLEM FOUR: MISFILED/MISNAMED FILES ● Files put in the wrong directory ● E.G. 200106110167.tif filed in directory 2001/01/0111 ● Files misnamed ● E.G. 200106110167.tif misnamed as 200101110167.tif Solution ● If no matches in metadata, generate a generic metadata file suggesting a search for the correct metadata based on the content of the file ● SIP creator tool catches duplicate names; correct them at the point that it finds errors.

PROBLEM FIVE: LOGGED PHONE CALLS ● 771,825 logged phone calls ● No document for these ● Need an object to pair metadata to OR ● Upload metadata only and rely on text search? ● Create an html version of metadata? Solution ● Find a cool icon ● Use a script to generate a list of metadata files but with the file extension changed to match the icon file extension ● Use a script to mass copy the icon into an image that can be uploaded
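The icon approach amounts to copying one image once per metadata file, under a matching name. A minimal sketch, assuming the sidecars are .xml files in one directory tree and the placeholder takes the sidecar's name with the icon's extension; create_placeholders and the file names used are hypothetical.

```python
import shutil
from pathlib import Path


def create_placeholders(metadata_dir, icon_path, pattern="*.xml"):
    """Copy the icon beside every metadata file, named to pair with it."""
    icon = Path(icon_path)
    created = []
    for meta in sorted(Path(metadata_dir).rglob(pattern)):
        target = meta.with_suffix(icon.suffix)
        # Never overwrite: an existing file may be a real scan.
        if not target.exists():
            shutil.copyfile(icon, target)
            created.append(target)
    return created
```

For example, create_placeholders("phone_calls", "phone_icon.png") would write call_0001.png beside call_0001.xml. The same function could produce the no-scan placeholders for mail entries that have metadata but no image.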

LESSONS LEARNED/COULD HAVE DONE BETTER ● Expanded conversation to account for more internal stakeholder/staff requests ● Don’t trust that anybody did 100% of what they said they did ● Direct database SQL queries? ● Before-the-fact contingency planning http://4.bp.blogspot.com/-pOMrxILoPV8/TgOfWqGU8SI/AAAAAAAAAlU/XXDsDr4BaS8/s1600/mistake3.jpg

NOW LET’S DISCUSS... 1. How could this have been done better? 2. What situations are other people facing? 3. What limitations do you have to work around? 4. Any other thoughts? Courtesy NBC.com (https://www.nbc.com/saturday-night-live/video/coffee-talk/n10457)

CONTACT INFORMATION BRIAN THOMAS NON-GOVERNMENTAL EMAIL: BRIAN.THE.ARCHIVIST@GMAIL.COM GOVERNMENTAL EMAIL: BTHOMAS@TSL.TEXAS.GOV WORK PHONE: 512-475-3374

SOME USEFUL SCRIPTS/TRICKS

MERGING SPREADSHEETS USING PYTHON/PANDAS
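The script shown on this slide is not part of the transcript. As a stand-in, here is a small sketch of merging two exported tables on a shared column with pandas. The table and column names (mail, constituents, constituent_id and so on) are invented for illustration and are not the CTS schema; in practice each DataFrame would come from pd.read_csv on a table export.

```python
import pandas as pd

# Stand-ins for two exported tables (hypothetical columns).
mail = pd.DataFrame(
    {
        "mail_id": [1, 2, 3],
        "constituent_id": [10, 11, 12],
        "scan_file": ["200106110167.tif", None, "200106110169.tif"],
    }
)
constituents = pd.DataFrame(
    {"constituent_id": [10, 11, 13], "name": ["A. Smith", "B. Jones", "C. Lee"]}
)

# A left join keeps every mail entry; indicator=True adds a _merge column
# recording whether a matching constituent row was found.
merged = mail.merge(constituents, on="constituent_id", how="left", indicator=True)
orphans = merged[merged["_merge"] == "left_only"]

# The other join types answer different questions about the same data.
inner = mail.merge(constituents, on="constituent_id", how="inner")
outer = mail.merge(constituents, on="constituent_id", how="outer")
```

Comparing the row counts of the inner, left and outer results is a quick way to see how much data a given join would silently drop before committing to one.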

EXPORTING TO XML FROM CSV USING PYTHON
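The script shown on this slide is not part of the transcript. The sketch below illustrates the append-not-overwrite point from Step 4 using only the standard library: each CSV row becomes one entry fragment appended to a file named after a key column. The column names, the entry element, and the function name export_rows are assumptions, and column names are assumed to be valid XML element names.

```python
import csv
from pathlib import Path
from xml.sax.saxutils import escape


def export_rows(csv_path, out_dir, key_column):
    """Append one <entry> fragment per CSV row to <key>.xml in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            fields = "".join(
                f"<{name}>{escape(value or '')}</{name}>"
                for name, value in row.items()
                if name and name != key_column
            )
            target = out_dir / f"{row[key_column]}.xml"
            # Mode "a" appends, so several rows with the same key
            # accumulate in one file instead of replacing each other.
            with open(target, "a", encoding="utf-8") as out:
                out.write(f"<entry>{fields}</entry>\n")
```

Because the files are opened in append mode, running the export twice doubles every entry, so the output directory should be empty at the start of a run. The output files hold bare fragments without a root element and are not yet valid XML on their own.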

XML ENCAPSULATION AND VALIDATION USING PYTHON
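The script shown on this slide is not part of the transcript. A minimal sketch of the two ideas, under stated assumptions: "encapsulation" is taken to mean wrapping the exported fragments in an XML declaration and a single root element, and "validation" is reduced to a well-formedness check with the standard library parser. The root element name correspondence and both function names are hypothetical. Checking against a schema would need a third-party library such as lxml and is not shown.

```python
import xml.etree.ElementTree as ET


def encapsulate(fragment, root_tag="correspondence"):
    """Wrap a run of XML fragments in a declaration and one root element."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f"<{root_tag}>\n{fragment}</{root_tag}>\n"
    )


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text.encode("utf-8"))
    except ET.ParseError:
        return False
    return True


good = encapsulate("<entry><subject>Roads &amp; bridges</subject></entry>\n")
bad = encapsulate("<entry><subject>Roads & bridges</subject></entry>\n")
```

Files that fail the check (here the unescaped ampersand in bad) are the ones to fix and re-validate until everything passes.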
