Corpus Assembly as Text Data Integration from Digital Libraries and the Web
Udo Hahn & Tinghui Duan
Jena University Language & Information Engineering (JULIE) Lab, https://julielab.de/
DFG Graduate School "Romanticism as a Model", http://modellromantik.uni-jena.de
Friedrich Schiller University Jena, Germany
JCDL '19, Session 1A: Generation and Linking. June 3, 2019, Urbana-Champaign, IL
Allgemeine Literatur-Zeitung (1785-1849)
General Literature Gazette (ALZ), Jena/Halle, Germany
A very important historical text source for literary studies of German Romanticism (1790-1830)
Allgemeine Literatur-Zeitung (1785-1849): Traditional Workflow
Stages: Printed Book, Scanned Picture, Full Text, Corpus, Research Result
Steps: • Scan • OCR • Encode • Assemble • Analyse
Scale: 315 volumes, ≈ 150,000 pages, ≈ 150,000,000 tokens
Cost- and time-consuming
Allgemeine Literatur-Zeitung (1785-1849): Alternative Workflow
Stages: Digital Libraries, Full Text, Corpus, Research Result
Steps: • Encode • Assemble • Analyse
Scattered Digital Resources of ALZ
Germany: Bavarian State Library
Austria: Austrian National Library
Switzerland: University of Lausanne
UK: University of Oxford
USA: Harvard University, Indiana University, New York Public Library, Princeton University, Stanford University, University of Illinois, University of Michigan
In total: 1,200+ volumes, 600,000+ pages, 600,000,000+ tokens
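With 1,200+ digitized volumes spread over many libraries and a print run of 315 volumes, most volumes presumably exist in several scanned copies. A minimal sketch of how such copies might be grouped before comparison is given here. The `ScanRecord` schema, its field names, and the toy records are hypothetical and are not taken from the talk or from any library's actual metadata.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanRecord:
    """One digitized copy of a volume (hypothetical schema)."""
    library: str  # holding institution
    year: int     # publication year of the volume
    volume: int   # volume number within that year
    item_id: str  # identifier at the hosting site


def group_by_volume(records):
    """Map (year, volume) to the list of all scans of that volume."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec.year, rec.volume)].append(rec)
    return dict(groups)


# Invented toy records, not actual holdings.
records = [
    ScanRecord("Library A", 1799, 1, "example-a"),
    ScanRecord("Library B", 1799, 1, "example-b"),
    ScanRecord("Library C", 1799, 2, "example-c"),
]

groups = group_by_volume(records)
for key, copies in sorted(groups.items()):
    print(key, [c.library for c in copies])
```

Grouping on a normalized key such as (year, volume) only works once the metadata has been corrected, which may be one reason metadata correction appears as its own step in the workflow.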
Proposed Workflow
Stages: Digital Libraries and the Web, Metadata, Full-Texts, Best-Quality Full-Texts, Target Corpus
Steps: • Collect • Correct • Evaluate • Select • Encode • Assemble
Example: https://archive.org/details/bub_gb_udTjAAAAMAAJ/
14 different full-text versions for this page!
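The Evaluate and Select steps can be illustrated with a small sketch. The slides do not say how full-text quality was measured, so the lexicon-based score used here (the share of tokens found in a reference word list) is an assumption, chosen only because it is a common proxy for OCR quality. The function names, the tiny lexicon, and the sample texts are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def lexicon_score(text, lexicon):
    """Share of tokens that appear in the reference word list."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(tok in lexicon for tok in tokens) / len(tokens)


def select_best(versions, lexicon):
    """Return (source, score) for the highest-scoring version of a page.

    `versions` maps a source label to the OCR text of the same page.
    """
    scored = ((src, lexicon_score(txt, lexicon)) for src, txt in versions.items())
    return max(scored, key=lambda pair: pair[1])


# Invented toy data: two OCR versions of the same page heading.
lexicon = {"allgemeine", "literatur", "zeitung"}
versions = {
    "version_a": "Allgemeine Literatur Zeitung",
    "version_b": "Allgemeinc Litcratur Zeitung",
}

best_source, best_score = select_best(versions, lexicon)
print(best_source, best_score)
```

A real lexicon for this material would need to cover historical German spelling, otherwise correct period spellings would be penalized as if they were OCR errors.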
Result: The Largest Corpus for German Romanticism
https://github.com/JULIELab/ALZ
Result corpus: 261 volumes, 126,612 pages, 120,369,005 tokens (≈ 82% coverage)
Target corpus: 315 volumes, ≈ 150,000 pages, ≈ 150,000,000 tokens
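As a quick check on the coverage figure, the ratios can be computed per unit from the numbers on the slide. The slide does not say which unit the ≈ 82% refers to, and the page and token targets are themselves approximate, so this is only a sanity check under those assumptions.

```python
# Figures as given on the slide; page and token targets are approximate.
collected = {"volumes": 261, "pages": 126_612, "tokens": 120_369_005}
target = {"volumes": 315, "pages": 150_000, "tokens": 150_000_000}

# Coverage per unit: collected amount divided by target amount.
coverage = {unit: collected[unit] / target[unit] for unit in collected}

for unit, ratio in coverage.items():
    print(f"{unit}: {ratio:.1%}")
```

The three ratios come out at roughly 83% (volumes), 84% (pages), and 80% (tokens), which is in line with the stated ≈ 82%.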
Problems • Restricted Accessibility • Heterogeneous Digitization Conditions and OCR Quality
Conclusion • The Largest Corpus for German Romanticism • Big Potential of DLs for Computational Literary Studies • More Cooperation Between DLs Is Desirable • Better Metadata and OCR Quality Are Desirable
Thank you!