Digital Libraries & Archives Max Kemman University of Luxembourg October 4, 2015 Doing Digital History: Introduction to Tools and Technology
Recap from last time Why would we want to write for the web? Can we write an HTML document?
Today • Libraries & Archives • Turning the "analog signal" into a "digital signal" • Turning the "digital signal" into machine-readable data • Making the machine-readable data searchable • Current state of the art • A Digital Archive of Letters • Next time
Libraries & Archives What is a library? What is an archive?
Aspects of an archive • Provenance • Respect des fonds • Respect de l'ordre • Context • Historical sensation?
What is a digital library/archive? • Content collected on behalf of users • Institution • Service Is a digital library or archive more than a database? Borgman, C. L. (1999). What are digital libraries? Competing visions. Information Processing and Management, 35(3), 227–243.
Reasons for digitising Terras - Digitisation and Digital Resources in the Humanities What are the 8 things Terras describes? 1. Access 2. Search 3. Reinstate out of print materials 4. Display material in inaccessible formats 5. Enhancing of digital images 6. Conserve fragile objects 7. Integration into teaching materials 8. Collection of geographically dispersed material
Reasons for digitising Enhancing of digital images: Google Art Project Collection of geographically dispersed material: Europeana
What is digitisation? Terras describes 3 stages of digitisation, what are they? • Turning the "analog signal" into a "digital signal" • Turning the "digital signal" into machine-readable data • Making the machine-readable data searchable
Turning the "analog signal" into a "digital signal" Terras describes three forms of material: • Text • Sound and moving images • 3D objects
Text 1. Digital photography Grazer Büchertisch Wolfenbütteler Buchspiegel Multispectral photography 2. Scan Flatbed scanner Overhead scanner (all slides concerning digitisation of text kindly provided by eCodicology - Hannah Busch)
Digital photography Grazer Büchertisch
Digital photography Wolfenbütteler Buchspiegel
Digital photography
Digital photography Multispectral Imaging
Scan Flatbed scanner Overhead scanner
Scan Automatic scanning https://www.youtube.com/embed/cmhIJOqepVU
Scanning a book without opening it http://gizmodo.com/mit-invented-a-camera-that-can-read-closed-books-1786522492
Requirements for digital images • Resolution in DPI (dots per inch): minimum of 300 • RGB colour space • TIFF format
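To get a feel for what these requirements mean in practice, the arithmetic can be sketched as below. This is a minimal sketch: the A4 page size is an assumed example, and the size is for uncompressed 8-bit RGB data, ignoring TIFF headers and any compression.

```python
# Pixel dimensions and uncompressed size of a scanned page.
MM_PER_INCH = 25.4


def pixels(length_mm: float, dpi: int) -> int:
    """Number of pixels needed to cover length_mm at the given resolution."""
    return round(length_mm / MM_PER_INCH * dpi)


def uncompressed_bytes(width_mm: float, height_mm: float, dpi: int,
                       channels: int = 3, bits_per_channel: int = 8) -> int:
    """Raw image size: one sample per channel per pixel, no compression."""
    width_px = pixels(width_mm, dpi)
    height_px = pixels(height_mm, dpi)
    return width_px * height_px * channels * bits_per_channel // 8


# Assumed example page: A4 (210 mm x 297 mm) at the 300 DPI minimum.
a4_width_px = pixels(210, 300)
a4_height_px = pixels(297, 300)
a4_bytes = uncompressed_bytes(210, 297, 300)

print(a4_width_px, a4_height_px)          # pixel dimensions of one page
print(round(a4_bytes / 1024 ** 2, 1))     # size of one page in MiB
```

Multiplying the per-page size by the number of pages in a collection gives a first estimate of the storage a digitisation project needs.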
Audio and moving images If you thought text was hard...
Photos kindly provided by NISV - made by Marco Hofsté
Audio and moving images After digitising the film, it needs to be synchronised with the audio
3D objects Two characteristics of interest • Setting: tabletop, tripod, handheld • Light: laser, white
Scent? http://www.atlasobscura.com/articles/meet-the-woman-who-is-preserving-the-smell-of-history
Turning the "digital signal" into machine-readable data Re-keying vs OCR? Re-keying: manual transcription OCR (Optical Character Recognition): computer interprets each letter
Optical Character Recognition difficulties OCR is not perfect (image source) Letters change: s / ſ / f (image source)
OCR difficulties OCR quality depends on • Quality of the original document: letters and pages • Quality of the image • Not possible for hand-written material
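One common way to express OCR quality as a number is the character error rate: the edit distance between the OCR output and a re-keyed reference, divided by the length of the reference. The sketch below is illustrative only; the two strings are hypothetical and simply mimic a long s (ſ) being read as f.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Share of reference characters that the OCR got wrong."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


reference = "Congress"    # hypothetical re-keyed transcription
ocr_output = "Congrefs"   # hypothetical OCR reading of the long s
rate = character_error_rate(reference, ocr_output)
print(rate)
```

A single misread letter in an eight-letter word already means that a full-text search for that word will miss this occurrence.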
Handwritten material (Monk project)
Audio and visual material (simplified) Speech to text Keyframes Edge detection
Making the machine-readable data searchable Bush - As We May Think Too much information out there Compression for storage is not enough: need to be able to consult it Not just extraction, but selection
Selecting material Searching libraries and archives? In non-digital archives & libraries, there is a distinction between: • Data - the object • Metadata - the description of the object Metadata is used to find the object Indexing: data sorted alphabetically or numerically
Index Alphabetical list with pointers to location Full-text search: the contents used to find the object: meta/data? Keyword search: term frequency-inverse document frequency
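Both ideas on this slide can be sketched in a few lines: an index that points from each word to the documents containing it, and a TF-IDF score that ranks those documents. This is a minimal sketch with a hypothetical toy collection; real search engines use more refined tokenisation and weighting.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy collection: document id -> text.
documents = {
    "letter-1": "the archive holds the letters",
    "letter-2": "the library holds the books",
    "letter-3": "the letters and more letters and letters",
}


def tokenize(text):
    """Very simple tokenisation: lower-case and split on whitespace."""
    return text.lower().split()


# Index: each term points to the documents in which it occurs.
index = defaultdict(set)
term_counts = {}
for doc_id, text in documents.items():
    tokens = tokenize(text)
    term_counts[doc_id] = Counter(tokens)
    for token in tokens:
        index[token].add(doc_id)


def tf_idf(term, doc_id):
    """Term frequency in the document times inverse document frequency."""
    document_frequency = len(index.get(term, ()))
    if document_frequency == 0:
        return 0.0
    counts = term_counts[doc_id]
    tf = counts[term] / sum(counts.values())
    idf = math.log(len(documents) / document_frequency)
    return tf * idf


def search(term):
    """Documents containing the term, highest TF-IDF score first."""
    hits = index.get(term, set())
    return sorted(hits, key=lambda doc_id: tf_idf(term, doc_id), reverse=True)


print(search("letters"))
print(tf_idf("the", "letter-1"))
```

A word such as "the" occurs in every document, so its inverse document frequency is log(1) = 0 and it contributes nothing to the ranking.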
Association of documents Bush: human mind works by association Memex: tying items together Web: hyperlinks! Keyword search: Google PageRank
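The idea behind PageRank, that a page is important when important pages link to it, can be sketched with a simple iteration. This is a textbook-style simplification and not Google's actual ranking; the link graph, the damping value of 0.85 and the number of iterations are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along hyperlinks."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small base share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical web of four pages: page -> pages it links to.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C ranks highest because three pages link to it, and A outranks B and D because it is linked from the highly ranked C.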
Association of documents/objects Linked Data / Semantic Web https://www.youtube.com/embed/TJfrNo3Z-DU Keyword search: Google Knowledge Graph (example)
Audiovisual material Similarity search Content search?
Audiovisual material Search in video?
Current state of the art
Heritage digitized in Europe About 10% digitized In Europeana: 12% of digitized material Estimated cost of digitising 100%: €100 billion
Aspects of an archive • Provenance • Respect des fonds • Original order • Context • Historical sensation? Does a digital archive reflect this? Keyword search: no order, limited context No authentic documents
Search Full-text search works, but limited by imperfections of OCR Audiovisual search is starting to get interesting
Search With these millions of objects, Terras states simple access tools are not enough Can we research the digital library or archive as a whole?
A Digital Archive of Letters During this course we will use a collection of letters How are letters different from other texts (Dobson)? Data & Metadata • Content of the letters • Sender • Receiver • Date • Location
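The distinction between data and metadata for a letter could be modelled as below. This is only a sketch: the field names and the three example letters are hypothetical and not taken from the course collection.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Letter:
    # Metadata: the description of the letter.
    sender: str
    receiver: str
    sent_on: date
    location: str
    # Data: the content of the letter itself.
    content: str


# Hypothetical example letters.
letters = [
    Letter("Alice", "Bob", date(1890, 5, 1), "Luxembourg",
           "Thank you for the books"),
    Letter("Bob", "Alice", date(1890, 5, 9), "Paris",
           "The archive was closed"),
    Letter("Alice", "Carol", date(1891, 1, 3), "Luxembourg",
           "Have you visited the archive"),
]


def find_by_metadata(letters, **criteria):
    """Letters whose metadata fields equal all the given values."""
    return [letter for letter in letters
            if all(getattr(letter, field) == value
                   for field, value in criteria.items())]


def find_by_content(letters, word):
    """Letters whose content contains the given word (full-text search)."""
    return [letter for letter in letters
            if word.lower() in letter.content.lower().split()]


from_alice = find_by_metadata(letters, sender="Alice")
about_archive = find_by_content(letters, "archive")
print(len(from_alice), [letter.receiver for letter in about_archive])
```

Searching on metadata answers who wrote to whom, when and where; searching on content answers what the letters are about.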
A single letter What is the letter about? Why did the author write this letter?
A set of letters What are the letters about? Are there differences between the letters? Who are the senders and receivers? Do we find a community?
A whole lot of letters What kind of subjects are covered in the collection? Are there differences in time? Who are the senders and receivers? Do we find communities of people writing one another?
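One very simple way to look for communities is to treat senders and receivers as a network and group everyone who is connected through correspondence. The sketch below uses connected components, which is an assumed, deliberately basic notion of community; the names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (sender, receiver) pairs, one per letter.
correspondence = [
    ("Alice", "Bob"),
    ("Bob", "Alice"),
    ("Bob", "Carol"),
    ("Dave", "Erin"),
]

# How often did each sender write to each receiver?
pair_counts = Counter(correspondence)

# Undirected network: who has exchanged letters with whom.
neighbours = defaultdict(set)
for sender, receiver in correspondence:
    neighbours[sender].add(receiver)
    neighbours[receiver].add(sender)


def communities(neighbours):
    """Groups of people connected directly or indirectly by letters."""
    seen = set()
    groups = []
    for person in sorted(neighbours):
        if person in seen:
            continue
        group = set()
        stack = [person]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbours[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


groups = communities(neighbours)
print(groups)
```

With thousands of letters, the same approach shows whether the collection forms one large network or several separate circles of correspondents.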
Digital letters To do such research with a computer, we need a lot of letters in digital form As we just saw, digitisation is not trivial Can we use born-digital letters?
A Republic of Emails • Hillary Clinton used her own email server for government business • When this was discovered, she was made to disclose her emails, and the government had to provide emails as part of a FOIA request • WikiLeaks then hosted the emails on their website: https://wikileaks.org/clinton-emails/ • We have 30,322 emails & attachments, 50,547 pages, from the period 30 June 2010 to 12 August 2013 • A total of 7,570 emails sent by Hillary Clinton (25%) Some more background: https://en.wikipedia.org/wiki/Hillary_Clinton_email_controversy
Creating a database of emails Let's try with one email: https://wikileaks.org/clinton-emails/emailid/2 Let's try another one: https://wikileaks.org/clinton-emails/emailid/123 What is an email? Is it the same as a letter? Can we do this for 30,322 emails?
Creating a database of emails We 'scraped' WikiLeaks automatically to get all the emails Because of the size, we separated the content from the metadata and saved these in folders of 1,000:
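How emails might be distributed over folders can be sketched as follows. The rule used here, folder number = email id divided by 1,000 and rounded down, and the assumption that ids run consecutively from 1 to 30,322, are guesses for illustration and not the documented scraping procedure. They would be consistent with folder f-0 holding 999 items and f-30 holding 323.

```python
from collections import Counter

BATCH_SIZE = 1000  # emails per folder


def folder_for(email_id: int) -> str:
    """Folder name for an email id (assumed rule: integer division)."""
    return f"f-{email_id // BATCH_SIZE}"


def expected_folder_sizes(first_id: int, last_id: int) -> dict:
    """Number of emails each folder would hold if no id were missing."""
    return dict(Counter(folder_for(email_id)
                        for email_id in range(first_id, last_id + 1)))


# Assumption: ids run consecutively from 1 to 30,322.
sizes = expected_folder_sizes(1, 30_322)
print(folder_for(2), folder_for(123), folder_for(30_322))
print(len(sizes), sizes["f-0"], sizes["f-30"])
```

Under these assumptions every folder between f-1 and f-29 would hold exactly 1,000 emails, so any smaller number points to emails that were not retrieved.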
Current state of the database

Folder  #items   Folder  #items   Folder  #items   Folder  #items
f-0        999   f-10     1,000   f-20     1,000   f-30       323
f-1      1,000   f-11     1,000   f-21     1,000
f-2      1,000   f-12       998   f-22     1,000
f-3      1,000   f-13       997   f-23     1,000
f-4      1,000   f-14       998   f-24     1,000
f-5      1,000   f-15     1,000   f-25       998
f-6      1,000   f-16     1,000   f-26     1,000
f-7      1,000   f-17     1,000   f-27     1,000
f-8      1,000   f-18     1,000   f-28       999
f-9      1,000   f-19       998   f-29       999

Is our database complete? Does it matter?
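The question of completeness can be checked by adding up the folder counts and comparing the sum with the 30,322 emails reported for the collection. The counts below are copied from the table, listed in folder order from f-0 to f-30; the code itself is only a sketch.

```python
EXPECTED_TOTAL = 30_322  # emails & attachments reported for the collection

# Items per folder, in order from f-0 to f-30.
counts = [
    999, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000,   # f-0  to f-9
    1000, 1000, 998, 997, 998, 1000, 1000, 1000, 1000, 998,      # f-10 to f-19
    1000, 1000, 1000, 1000, 1000, 998, 1000, 1000, 999, 999,     # f-20 to f-29
    323,                                                         # f-30
]

total = sum(counts)
missing = EXPECTED_TOTAL - total
coverage = total / EXPECTED_TOTAL

print(total, missing, f"{coverage:.2%}")
```

The database holds 30,309 of the 30,322 items, so 13 are missing. Whether that matters depends on the research question: it hardly affects broad patterns, but it may matter if one of the missing emails is the one you need.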
For next time 11 October Big Data Reading: (see Moodle) • Wallach, H. (2014). Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency. Medium. • Hitchcock, T. (2014). Big Data, Small Data and Meaning. Historyonics.