
Digital Libraries & Archives Max Kemman University of - PowerPoint PPT Presentation

Digital Libraries & Archives Max Kemman University of Luxembourg October 4, 2015 Doing Digital History: Introduction to Tools and Technology Recap from last time Why would we want to write for the web? Can we write an HTML document?


  1. Digital Libraries & Archives Max Kemman University of Luxembourg October 4, 2015 Doing Digital History: Introduction to Tools and Technology

  2. Recap from last time Why would we want to write for the web? Can we write an HTML document?

  3. Today • Libraries & Archives • Turning the "analog signal" into a "digital signal" • Turning the "digital signal" into machine-readable data • Making the machine-readable data searchable • Current state of the art • A Digital Archive of Letters • Next time

  4. Libraries & Archives What is a library? What is an archive?

  5. Aspects of an archive • Provenance • Respect des fonds • Respect de l'ordre • Context • Historical sensation?

  6. What is a digital library/archive? • Content collected on behalf of users • Institution • Service Is a digital library or archive more than a database? Borgman, C. L. (1999). What are digital libraries? Competing visions. Information Processing and Management, 35(3), 227–243.

  7. Reasons for digitising Terras - Digitisation and Digital Resources in the Humanities What are the 8 things Terras describes? 1. Access 2. Search 3. Reinstate out of print materials 4. Display material in inaccessible formats 5. Enhancing of digital images 6. Conserve fragile objects 7. Integration into teaching materials 8. Collection of geographically dispersed material

  8. Reasons for digitising Enhancing of digital images: Google Art Project Collection of geographically dispersed material: Europeana

  9. What is digitisation? Terras describes 3 stages of digitisation, what are they? • Turning the "analog signal" into a "digital signal" • Turning the "digital signal" into machine-readable data • Making the machine-readable data searchable

  10. Turning the "analog signal" into a "digital signal" Terras describes three forms of material: • Text • Sound and moving images • 3D objects

  11. Text 1. Digital photography Grazer Büchertisch Wolfenbütteler Buchspiegel Multispectral photography 2. Scan Flatbed scanner Overhead scanner (all slides concerning digitisation of text kindly provided by eCodicology - Hannah Busch)

  12. Digital photography Grazer Büchertisch

  13. Digital photography Wolfenbütteler Buchspiegel

  14. Digital photography

  15. Digital photography Multispectral Imaging

  16. Scan Flatbed scanner Overhead scanner

  17. Scan Automatic scanning https://www.youtube.com/embed/cmhIJOqepVU

  18. Scanning a book without opening it http://gizmodo.com/mit-invented-a-camera-that-can-read-closed-books-1786522492

  19. Requirements for digital images • Resolution in DPI (dots per inch): minimum of 300 • RGB colour space • TIFF format
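
To give a feel for what the 300 DPI minimum implies, the sketch below computes the pixel dimensions and uncompressed size of a scanned page. The A4 page size and the 8 bits per RGB channel are assumptions chosen for illustration; they are not stated on the slide.

```python
def scan_size(width_mm, height_mm, dpi=300, channels=3, bits_per_channel=8):
    """Return (width_px, height_px, uncompressed_bytes) for a scanned page."""
    mm_per_inch = 25.4
    width_px = round(width_mm / mm_per_inch * dpi)
    height_px = round(height_mm / mm_per_inch * dpi)
    # Every pixel stores one value per colour channel (R, G and B).
    size_bytes = width_px * height_px * channels * bits_per_channel // 8
    return width_px, height_px, size_bytes


# Assumed example: one A4 page (210 mm x 297 mm) at the 300 DPI minimum.
w, h, size = scan_size(210, 297)
print(f"{w} x {h} pixels, about {size / 1e6:.1f} MB uncompressed")
```

Doubling the resolution quadruples the number of pixels, which is one reason storage is a real cost factor in digitisation projects.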

  20. Audio and moving images If you thought text was hard...

  21. Photos kindly provided by NISV - made by Marco Hofsté

  22. Photos kindly provided by NISV - made by Marco Hofsté

  23. Photos kindly provided by NISV - made by Marco Hofsté

  24. Photos kindly provided by NISV - made by Marco Hofsté

  25. Audio and moving images After digitising the film, need to synchronize with the audio

  26. 3D objects Two characteristics of interest • Setting • Tabletop • Tripod • Handheld • Light • Laser • White

  27. Scent? http://www.atlasobscura.com/articles/meet-the-woman-who-is-preserving-the-smell-of-history

  28. Turning the "digital signal" into machine-readable data Re-keying vs OCR? Re-keying: manual transcription

  29. Turning the "digital signal" into machine-readable data Re-keying vs OCR? Re-keying: manual transcription OCR (Optical Character Recognition): computer interprets each letter

  30. Optical Character Recognition difficulties OCR is not perfect (image source) Letters change: s / ſ / f (image source)

  31. OCR difficulties OCR quality depends on • Quality of the original document: letters and pages • Quality of the image • Not possible for hand-written material

  32. Handwritten material (Monk project)

  33. Audio and visual material (simplified) Speech to text Keyframes Edge detection
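
As a very simplified illustration of edge detection, the sketch below marks pixels where brightness changes sharply compared with the pixel to the right or the pixel below. The tiny greyscale image, the threshold value and the function name are all assumptions for this example; real systems use more robust operators.

```python
def detect_edges(image, threshold=50):
    """Return a grid of 0/1 flags, 1 where brightness jumps to the right or below."""
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Brightness difference with the right and lower neighbours (0 at the border).
            dx = image[y][x + 1] - image[y][x] if x + 1 < width else 0
            dy = image[y + 1][x] - image[y][x] if y + 1 < height else 0
            if abs(dx) + abs(dy) >= threshold:
                edges[y][x] = 1
    return edges


# Assumed toy image: a dark area on the left, a bright area on the right.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
for row in detect_edges(image):
    print(row)
```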

  34. Making the machine-readable data searchable Bush - As We May Think Too much information out there Compression for storage is not enough: need to be able to consult it Not just extraction, but selection

  35. Selecting material Searching libraries and archives? In non-digital archives & libraries, distinction between: • Data - the object • Metadata - the description of the object Metadata is used to find the object Indexing: data sorted alphabetically or numerically

  36. Index Alphabetical list with pointers to location Full-text search: the contents used to find the object: meta/data? Keyword search: term frequency-inverse document frequency
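
Term frequency-inverse document frequency (tf-idf) can be sketched in a few lines. The example below uses one common variant (relative term frequency multiplied by the natural log of the inverse document frequency); the three mini-documents and the function names are made up for illustration.

```python
import math
from collections import Counter

# Assumed toy collection: document id -> full text.
docs = {
    "letter1": "the archive holds the letters",
    "letter2": "the library holds books",
    "letter3": "letters about the archive fire",
}


def tf_idf(term, doc_id, docs):
    """Score how characteristic a term is for one document in the collection."""
    words = docs[doc_id].split()
    tf = Counter(words)[term] / len(words)  # how often the term occurs here
    df = sum(1 for text in docs.values() if term in text.split())
    if df == 0:
        return 0.0
    idf = math.log(len(docs) / df)  # terms found in few documents weigh more
    return tf * idf


def search(term, docs):
    """Return ids of documents with a positive score, best match first."""
    scores = {doc_id: tf_idf(term, doc_id, docs) for doc_id in docs}
    hits = [doc_id for doc_id in scores if scores[doc_id] > 0]
    return sorted(hits, key=lambda doc_id: -scores[doc_id])


print(search("fire", docs))
print(search("the", docs))
```

A word such as "the" occurs in every document, so its inverse document frequency is zero and it does not help to tell documents apart.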

  37. Association of documents Bush: human mind works by association Memex: tying items together Web: hyperlinks! Keyword search: Google PageRank
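
The idea behind PageRank, that a page is important when important pages link to it, can be illustrated with a simplified power iteration. This is a textbook-style sketch, not Google's actual implementation; the four-page link graph and the damping factor of 0.85 are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: links maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small base score, independent of its links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page without outgoing links spreads its score over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Assumed toy web: every linked page also appears as a key.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```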

  38. Association of documents/objects Linked Data / Semantic Web https://www.youtube.com/embed/TJfrNo3Z-DU Keyword search: Google Knowledge Graph (example)

  39. Audiovisual material Similarity search Content search?

  40. Audiovisual material Search in video?

  41. Audiovisual material Search in video?

  42. Audiovisual material Search in video?

  43. Audiovisual material Search in video?

  44. Current state of the art

  45. Heritage digitized in Europe About 10% digitized In Europeana: 12% of digitized material Estimated cost of digitising 100%: €100 billion

  46. Aspects of an archive • Provenance • Respect des fonds • Original order • Context • Historical sensation? Does a digital archive reflect this? Keyword search: no order, limited context No authentic documents

  47. Search Full-text search works, but limited by imperfections of OCR Audiovisual search is starting to get interesting
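
A small sketch of how OCR imperfections limit full-text search: if the long s (ſ) was recognised as an f, an exact search no longer finds the word, while approximate matching can recover it. The sample sentence is invented, and the similarity cutoff of 0.8 is an arbitrary assumption.

```python
import difflib

# Assumed OCR output in which the long s was read as "f".
ocr_text = "the Congrefs affembled in the city"
words = ocr_text.split()


def exact_search(query, words):
    """Return words identical to the query (ignoring case)."""
    return [word for word in words if word.lower() == query.lower()]


def fuzzy_search(query, words, cutoff=0.8):
    """Return words that are similar to the query, most similar first."""
    return difflib.get_close_matches(query, words, n=5, cutoff=cutoff)


print(exact_search("Congress", words))
print(fuzzy_search("Congress", words))
```

Approximate matching trades precision for recall: a lower cutoff finds more damaged words, but also more false hits.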

  48. Search With these millions of objects, Terras states simple access tools are not enough Can we research the digital library or archive as a whole?

  49. A Digital Archive of Letters During this course we will use a collection of letters How are letters different from other texts (Dobson)? Data & Metadata • Content of the letters • Sender • Receiver • Date • Location
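
The distinction between data (the content) and metadata (sender, receiver, date, location) could be modelled as below. This is only a sketch: the class name, field names and the two example letters are hypothetical and not taken from the course material.

```python
import datetime
from dataclasses import dataclass


@dataclass
class Letter:
    # Metadata: the description of the letter.
    sender: str
    receiver: str
    date: datetime.date
    location: str
    # Data: the letter itself.
    content: str


def letters_from(sender, letters):
    """Select letters using metadata only."""
    return [letter for letter in letters if letter.sender == sender]


def letters_mentioning(term, letters):
    """Select letters using the content (full-text search)."""
    return [letter for letter in letters if term.lower() in letter.content.lower()]


# Hypothetical example records.
archive = [
    Letter("Alice", "Bob", datetime.date(1890, 5, 1), "Luxembourg", "The harvest was good."),
    Letter("Bob", "Alice", datetime.date(1890, 6, 12), "Trier", "Glad to hear about the harvest."),
]
print(len(letters_from("Alice", archive)), len(letters_mentioning("harvest", archive)))
```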

  50. A single letter What is the letter about? Why did the author write this letter?

  51. A set of letters What are the letters about? Are there differences between the letters? Who are the senders and receivers? Do we find a community?

  52. A whole lot of letters What kind of subjects are covered in the collection? Are there differences in time? Who are the senders and receivers? Do we find communities of people writing one another?
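
Questions about senders, receivers and communities can be approached by treating the metadata as a network. The sketch below counts who writes to whom and groups people who are connected through correspondence; using connected groups as a stand-in for "communities" is a simplifying assumption, and the names and years are invented.

```python
from collections import Counter, defaultdict

# Hypothetical metadata: (sender, receiver, year).
letters = [
    ("Alice", "Bob", 2010),
    ("Bob", "Alice", 2010),
    ("Alice", "Carol", 2011),
    ("Dave", "Erin", 2012),
]


def correspondence_counts(letters):
    """Count how many letters each sender wrote to each receiver."""
    return Counter((sender, receiver) for sender, receiver, _year in letters)


def communities(letters):
    """Group people who are linked, directly or indirectly, by letters."""
    neighbours = defaultdict(set)
    for sender, receiver, _year in letters:
        neighbours[sender].add(receiver)
        neighbours[receiver].add(sender)
    seen, groups = set(), []
    for person in sorted(neighbours):
        if person in seen:
            continue
        group, stack = set(), [person]
        while stack:  # walk the network starting from this person
            current = stack.pop()
            if current not in group:
                group.add(current)
                stack.extend(neighbours[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


print(correspondence_counts(letters).most_common(1))
print(communities(letters))
```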

  53. Digital letters To do such research with a computer, we need a lot of letters in digital form As we just saw, digitisation is not trivial Can we use digital-born letters?

  54. A Republic of Emails • Hillary Clinton used her own email server for government business • When this was discovered, she was made to disclose her email, and the gov had to provide emails as part of a FOIA request • Wikileaks then hosted the emails on their website: https://wikileaks.org/clinton-emails/ • We have 30,322 emails & attachments, 50,547 pages, from the period 30 June 2010 to 12 August 2013 • A total of 7,570 emails sent by Hillary Clinton (25%) Some more background: https://en.wikipedia.org/wiki/Hillary_Clinton_email_controversy

  55. Creating a database of emails Let's try with one email: https://wikileaks.org/clinton-emails/emailid/2 Let's try another one: https://wikileaks.org/clinton-emails/emailid/123 What is an email? Is it the same as a letter? Can we do this for 30,322 emails?

  56. Creating a database of emails We 'scraped' wikileaks automatically to get all the emails Because of the size, we separated the content from the metadata and saved these per 1,000:

  57. Current state of the database

      Folder  #items    Folder  #items    Folder  #items    Folder  #items
      f-0        999    f-10     1,000    f-20     1,000    f-30       323
      f-1      1,000    f-11     1,000    f-21     1,000
      f-2      1,000    f-12       998    f-22     1,000
      f-3      1,000    f-13       997    f-23     1,000
      f-4      1,000    f-14       998    f-24     1,000
      f-5      1,000    f-15     1,000    f-25       998
      f-6      1,000    f-16     1,000    f-26     1,000
      f-7      1,000    f-17     1,000    f-27     1,000
      f-8      1,000    f-18     1,000    f-28       999
      f-9      1,000    f-19       998    f-28       999

      Is our database complete? Does it matter?
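
One way to approach the completeness question is to compare the folder counts with what we would expect. The sketch below assumes that email ids run from 1 to 30,322 and that an email with id n is stored in folder f-(n // 1000); both are inferences from the folder names and counts, not something the slides state. The counts are copied in slide order, including the last row of the third column, which the slide labels f-28 a second time.

```python
EXPECTED_TOTAL = 30_322  # number of emails reported for the collection
BATCH_SIZE = 1000

# Counts in slide order: first, second and third column, then f-30.
listed_counts = (
    [999] + [1000] * 9
    + [1000, 1000, 998, 997, 998, 1000, 1000, 1000, 1000, 998]
    + [1000, 1000, 1000, 1000, 1000, 998, 1000, 1000, 999, 999]
    + [323]
)


def expected_count(folder_index, last_id=EXPECTED_TOTAL, batch=BATCH_SIZE):
    """Number of ids that fall into a folder if ids run from 1 to last_id."""
    first = max(1, folder_index * batch)
    last = min(last_id, folder_index * batch + batch - 1)
    return max(0, last - first + 1)


# Folders (by position) that hold fewer items than expected.
shortfalls = {
    index: expected_count(index) - count
    for index, count in enumerate(listed_counts)
    if expected_count(index) != count
}
missing = EXPECTED_TOTAL - sum(listed_counts)
print(f"{sum(listed_counts)} items stored, {missing} missing")
print(shortfalls)
```

Under these assumptions the first and last folders are complete, and the gaps sit in a handful of folders in between. Whether that matters depends on the research question.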

  58. For next time 11 October Big Data Reading: (see Moodle) • Wallach, H. (2014). Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency. Medium. • Hitchcock, T. (2014). Big Data, Small Data and Meaning. Historyonics.
