Corpus Assembly as Text Data Integration from Digital Libraries and the Web

  1. Corpus Assembly as Text Data Integration from Digital Libraries and the Web. Udo Hahn & Tinghui Duan, Jena University Language & Information Engineering (JULIE) Lab (https://julielab.de/); DFG Graduate School „Romanticism as a Model“ (http://modellromantik.uni-jena.de); Friedrich Schiller University Jena, Germany. Jun 3 2019, Urbana-Champaign IL. JCDL '19, Session 1A: Generation and Linking.

  2. Allgemeine Literatur-Zeitung (1785-1849): General Literature Gazette (ALZ), Jena/Halle, Germany. Very important historical text source for literary studies in German Romanticism (1790-1830).

  3. Allgemeine Literatur-Zeitung (1785-1849). Stages: Corpus, Research Result. Steps: Analyse.

  4. Allgemeine Literatur-Zeitung (1785-1849). Traditional Workflow. Stages: Printed Book, Scanned Picture, Full Text, Corpus, Research Result. Steps: Scan, OCR, Encode, Assemble, Analyse.

  5. Allgemeine Literatur-Zeitung (1785-1849). Traditional Workflow. Scale: 315 Volumes, ≈ 150,000 Pages, ≈ 150,000,000 Tokens. Stages: Printed Book, Scanned Picture, Full Text, Corpus, Research Result. Steps: Scan, OCR, Encode, Assemble, Analyse.

  6. Allgemeine Literatur-Zeitung (1785-1849). Traditional Workflow: Cost- and Time-Consuming. Scale: 315 Volumes, ≈ 150,000 Pages, ≈ 150,000,000 Tokens. Stages: Printed Book, Scanned Picture, Full Text, Corpus, Research Result. Steps: Scan, OCR, Encode, Assemble, Analyse.

  7. Allgemeine Literatur-Zeitung (1785-1849). Alternative Workflow. Stages: Digital Libraries, Full Text, Corpus, Research Result. Steps: Encode, Assemble, Analyse.

  8. Scattered Digital Resources of ALZ. Germany: Bavarian State Library.

  9. Scattered Digital Resources of ALZ. Germany: Bavarian State Library. Austria: Austrian National Library.

  10. Scattered Digital Resources of ALZ. Germany: Bavarian State Library. Austria: Austrian National Library. Switzerland: University of Lausanne.

  11. Scattered Digital Resources of ALZ. Germany: Bavarian State Library. Austria: Austrian National Library. Switzerland: University of Lausanne. UK: University of Oxford.

  12. Scattered Digital Resources of ALZ. Germany: Bavarian State Library. Austria: Austrian National Library. Switzerland: University of Lausanne. UK: University of Oxford. USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan.

  15. Scattered Digital Resources of ALZ. Germany: Bavarian State Library. Austria: Austrian National Library. Switzerland: University of Lausanne. UK: University of Oxford. USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan. Scale: 1,200+ Volumes, 600,000+ Pages, 600,000,000+ Tokens.

  16. Proposed Workflow. Stages: Digital Libraries and the Web, Metadata. Steps: Collect, Correct.

  17. Proposed Workflow. Stages: Digital Libraries and the Web, Metadata. Steps: Collect, Correct. https://archive.org/details/bub_gb_udTjAAAAMAAJ/

  18. Proposed Workflow. Stages: Digital Libraries and the Web, Metadata, Full-Texts. Steps: Collect, Correct, Evaluate, Select.

  19. Proposed Workflow. Stages: Digital Libraries and the Web, Metadata, Full-Texts. Steps: Collect, Correct, Evaluate, Select. 14 different full-text versions for this page!

  20. Proposed Workflow. Stages: Digital Libraries and the Web, Metadata, Full-Texts, Best-Quality Full-Texts. Steps: Collect, Correct, Evaluate, Select, Encode, Assemble.

  21. Proposed Workflow. Stages: Digital Libraries and the Web, Metadata, Full-Texts, Best-Quality Full-Texts, Target-Corpus. Steps: Collect, Correct, Evaluate, Select, Encode, Assemble.

  22. Result. Stages: Digital Libraries and the Web, Metadata, Full-Texts, Best-Quality Full-Texts, Target-Corpus. Steps: Collect, Correct, Evaluate, Select, Encode, Assemble. Assembled: 261 Volumes, 126,612 Pages, 120,369,005 Tokens.

  23. Result. Stages: Digital Libraries and the Web, Metadata, Full-Texts, Best-Quality Full-Texts, Target-Corpus. Steps: Collect, Correct, Evaluate, Select, Encode, Assemble. Assembled: 261 Volumes, 126,612 Pages, 120,369,005 Tokens (≈ 82% coverage). Complete ALZ: 315 Volumes, ≈ 150,000 Pages, ≈ 150,000,000 Tokens.

  24. Result: The Largest Corpus for German Romanticism (https://github.com/JULIELab/ALZ). Stages: Digital Libraries and the Web, Metadata, Full-Texts, Best-Quality Full-Texts, Target-Corpus. Steps: Collect, Correct, Evaluate, Select, Encode, Assemble. Assembled: 261 Volumes, 126,612 Pages, 120,369,005 Tokens (≈ 82% coverage). Complete ALZ: 315 Volumes, ≈ 150,000 Pages, ≈ 150,000,000 Tokens.

  25. Problems • Restricted Accessibility • Heterogeneous Digitizing Conditions and OCR-Qualities

  26. Conclusion • The Largest Corpus for German Romanticism • Big Potential of DLs for Computational Literary Studies • More Cooperation Between DLs Desirable • Better Metadata and OCR-Quality are Desirable

  27. Corpus Assembly as Text Data Integration from Digital Libraries and the Web. Thank you! Udo Hahn & Tinghui Duan, Jena University Language & Information Engineering (JULIE) Lab (https://julielab.de/); DFG Graduate School „Romanticism as a Model“ (http://modellromantik.uni-jena.de); Friedrich Schiller University Jena, Germany.
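
The Evaluate and Select steps of the proposed workflow have to pick one full text when several digitized versions of the same page exist (14 in the example on slide 19). The slides do not say how quality is measured, so the following Python sketch is only one plausible illustration: it assumes a hypothetical word list (`lexicon`) and ranks versions by the share of their tokens found in it. The function names `lexicon_hit_ratio` and `select_best_version` and the sample data are invented for this example.

```python
import re
from typing import Dict, Set

# Runs of Unicode letters (no digits, no underscore, no punctuation).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def lexicon_hit_ratio(text: str, lexicon: Set[str]) -> float:
    """Share of tokens in `text` that occur in `lexicon` (assumed quality proxy)."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens)


def select_best_version(versions: Dict[str, str], lexicon: Set[str]) -> str:
    """Return the key of the version with the highest lexicon hit ratio."""
    return max(versions, key=lambda name: lexicon_hit_ratio(versions[name], lexicon))


# Hypothetical data: three OCR renderings of the same heading.
sample_lexicon = {"allgemeine", "literatur", "zeitung"}
sample_versions = {
    "library_a": "Allgemcine Litcratur Zcitung",
    "library_b": "Allgemeine Literatur-Zeitung",
    "library_c": "Allgemeine Literatur Zeitnng",
}
best = select_best_version(sample_versions, sample_lexicon)
```

A lexicon-based score penalizes typical OCR errors, but it also penalizes historical spellings, so a period-appropriate word list would be needed in practice.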

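The coverage figure on slide 23 can be checked with simple arithmetic. The slides do not state which unit the ≈ 82% refers to, so this sketch computes the ratio for volumes, pages, and tokens alike; the dictionary and function names are illustrative, while the numbers are taken from the slides.

```python
from typing import Dict

# Figures from the Result slides: assembled corpus vs. the complete ALZ.
assembled = {"volumes": 261, "pages": 126_612, "tokens": 120_369_005}
complete = {"volumes": 315, "pages": 150_000, "tokens": 150_000_000}


def coverage(part: Dict[str, int], whole: Dict[str, int]) -> Dict[str, float]:
    """Ratio part/whole for every unit present in `whole`."""
    return {unit: part[unit] / whole[unit] for unit in whole}


ratios = coverage(assembled, complete)
for unit, ratio in ratios.items():
    print(f"{unit}: {ratio:.1%}")
```

All three ratios fall between 80% and 85%, in line with the reported ≈ 82% coverage.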