
Scholarly Text Curation & Robust Anchoring Requirements



  1. W3C Workshop on Annotations, San Francisco, California, USA, 2 April 2014
     Scholarly Text Curation & Robust Anchoring Requirements
     Timothy W. Cole (t-cole3@illinois.edu)
     Thomas G. Habing (thabing@illinois.edu)

  2. Anchoring Methods to Support Curatorial Annotation of Scholarly Text Resources
     • Should be fine-grained
       – for text this means individual words and phrases
     • Should ensure persistence, e.g., even as adjacent content is updated / corrected
     • Can be aligned across derivative formats and serializations, even across repository boundaries
     • Can support search & replace, e.g., the target is the set of all instances found in a specific context
     • Should help distinguish curatorial annotations of a specific digitization & its derivatives from annotations of intellectual substance
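
The persistence requirement above can be illustrated with a quote-plus-context anchor. This is only a minimal sketch, not the method proposed in the talk: the names `QuoteAnchor`, `make_anchor`, and `resolve` are hypothetical, and the scoring rule (count matching context characters) is an assumption chosen for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class QuoteAnchor:
    """Hypothetical anchor: the exact target text plus a little context."""
    exact: str
    prefix: str = ""
    suffix: str = ""


def _shared_prefix(a: str, b: str) -> int:
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def _shared_suffix(a: str, b: str) -> int:
    """Number of trailing characters that a and b have in common."""
    return _shared_prefix(a[::-1], b[::-1])


def make_anchor(text: str, start: int, end: int, context: int = 12) -> QuoteAnchor:
    """Record text[start:end] together with `context` characters on each side."""
    return QuoteAnchor(
        exact=text[start:end],
        prefix=text[max(0, start - context):start],
        suffix=text[end:end + context],
    )


def resolve(anchor: QuoteAnchor, text: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the occurrence whose context matches best,
    or None if the exact quote no longer occurs in the text."""
    best, best_score = None, -1
    pos = text.find(anchor.exact)
    while pos != -1:
        end = pos + len(anchor.exact)
        before = text[max(0, pos - len(anchor.prefix)):pos]
        after = text[end:end + len(anchor.suffix)]
        score = _shared_suffix(before, anchor.prefix) + _shared_prefix(after, anchor.suffix)
        if score > best_score:
            best, best_score = (pos, end), score
        pos = text.find(anchor.exact, pos + 1)
    return best
```

Because the anchor stores no character offsets, text inserted before the target or a spelling correction in a neighboring word does not invalidate it; the surrounding context only serves to choose between repeated occurrences of the same word.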

  3. Use cases
     • Correction of OCR & manual transcriptions
       – HathiTrust DL: 11 million digitized volumes, automated OCR
       – Text Creation Partnership (TCP): 50,000 manually transcribed
       – Support corrections by curators, outside experts, the crowd
     • Correction of automated annotation, e.g., part-of-speech tagging of TEI
     • Distinct from proposed targeting schemes for other kinds of scholarly use cases, e.g., commentary

  4. Why Annotations?
     • So proposed corrections can be reviewed and themselves annotated as needed
     • To share with other repositories
     • To maintain portability of provenance

  6. Veridian Digital Library Software

  8. Align across representations
     • Treat the OCR as an annotation of a segment of PDF / JPEG page image
       – Annotating agent is OCR program
     • Proposed OCR correction is then an annotation of the OCR annotation of the page image
     • Complicating factor
       – OCR outputs at page level, correction is usually done at line level
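
The layering described above (a correction annotates the OCR annotation, which in turn annotates a region of the page image) can be sketched as a chain of targets. All class and field names here are hypothetical, and the sketch does not address the page-level versus line-level mismatch noted on the slide.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ImageRegion:
    """Hypothetical rectangular segment of a page image."""
    page_image: str
    x: int
    y: int
    width: int
    height: int


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation whose target is an image region or another annotation."""
    id: str
    agent: str       # e.g. the OCR program, a curator, a member of the crowd
    body: str        # the transcribed or corrected text
    target: Union[ImageRegion, "Annotation"]


def root_region(annotation: Annotation) -> ImageRegion:
    """Follow the chain of targets back to the page-image region."""
    target = annotation.target
    while isinstance(target, Annotation):
        target = target.target
    return target


def provenance(annotation: Annotation) -> list:
    """List the agents from the newest annotation back to the OCR program."""
    agents = [annotation.agent]
    target = annotation.target
    while isinstance(target, Annotation):
        agents.append(target.agent)
        target = target.target
    return agents
```

Keeping the correction as a separate annotation rather than overwriting the OCR text means the chain itself records who proposed what, which is what allows a correction to be reviewed or annotated in turn.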

  9. Annotating repeated errors?
     The string “Jrbana” appears 782 times in OCR texts of the Urbana Daily Courier (1903-1935)
     Do we need to require that users find every instance of “Jrbana”?
     Can we have search-and-replace annotations?
     * “Urbana” appears ~200,000 times
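
One way to picture a search-and-replace annotation is as a single annotation whose target is computed: every whole-word match of a string within a named scope. The sketch below is an assumption about how that could look, not the authors' design; `SearchReplaceAnnotation`, `find_targets`, and `apply_correction` are hypothetical names, and the sample issue texts in any usage are made up.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SearchReplaceAnnotation:
    """Hypothetical annotation targeting all instances of a string in a scope."""
    pattern: str       # erroneous string, matched as a whole word
    replacement: str   # proposed correction
    scope: str         # label for the context, e.g. a newspaper title


def _word_regex(pattern: str) -> "re.Pattern[str]":
    # Escape the pattern so it is matched literally, bounded by word edges.
    return re.compile(r"\b" + re.escape(pattern) + r"\b")


def find_targets(annotation: SearchReplaceAnnotation,
                 texts: Dict[str, str]) -> List[Tuple[str, int, int]]:
    """Return (document id, start, end) for every match in the scope's texts."""
    regex = _word_regex(annotation.pattern)
    return [
        (doc_id, match.start(), match.end())
        for doc_id, text in texts.items()
        for match in regex.finditer(text)
    ]


def apply_correction(annotation: SearchReplaceAnnotation, text: str) -> str:
    """Return a corrected copy of one text; the original is left untouched."""
    regex = _word_regex(annotation.pattern)
    # A function replacement avoids backslash interpretation in the new text.
    return regex.sub(lambda _match: annotation.replacement, text)
```

With this shape, a user states the correction once and the instances are enumerated at resolution time, so a reviewer can still inspect each individual match before it is accepted.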

  10. http://annolex.at.northwestern.edu/

  13. From TCP-EEBO Collection as hosted at the University of Michigan

  14. //*[@id="doccontent"]/div/div[39]

      <div class="sp">
        <div class="speaker">Valer.</div>
        <p>Oh Collatin <span class="gap">•…</span>! I am a true Cittizen and in this
        I will best shew my selfe to be one, to take part with the stronger. If
        <span class="rend-italic">Se<span class="gap">•…</span> ius</span> ore-come,
        I am Liegeman to <span class="rend-italic">Serutus,</span> &amp; if
        <span class="rend-italic">Ta<span class="gap">•…</span> quin</span> subdue,
        I am for <span class="rend-italic">Viue Tarquinius.</span>
        </p>
      </div>
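
A positional XPath such as the one on this slide selects by sibling order, so it can silently point at different content once the page changes. The following sketch shows that effect with Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath; the sample markup, the speaker "Other.", and the helper `speaker_at` are invented for illustration and are not taken from the TCP-EEBO page.

```python
import xml.etree.ElementTree as ET

# Invented two-speech sample that mimics the structure shown on the slide.
PAGE = """<html><body><div id="doccontent"><div>
<div class="sp"><div class="speaker">Other.</div><p>First speech.</p></div>
<div class="sp"><div class="speaker">Valer.</div><p>Second speech.</p></div>
</div></div></body></html>"""

# Same shape as the slide's XPath, written in ElementTree's relative syntax.
POSITIONAL_PATH = ".//*[@id='doccontent']/div/div[2]"


def speaker_at(root: ET.Element, path: str):
    """Return the speaker label of the element the path selects, or None."""
    node = root.find(path)
    if node is None:
        return None
    return node.findtext("div[@class='speaker']")


root = ET.fromstring(PAGE)
before_edit = speaker_at(root, POSITIONAL_PATH)

# Simulate a later edit: a new speech block is inserted ahead of the others.
container = root.find(".//*[@id='doccontent']/div")
container.insert(0, ET.Element("div", {"class": "sp"}))
after_edit = speaker_at(root, POSITIONAL_PATH)
```

Here `before_edit` is the speech by "Valer." while `after_edit` is a different speech, even though the path itself never changed, which is the kind of drift a persistent anchoring method needs to avoid.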

  15. Relation between anchoring method & what’s being targeted
      • 1st challenge: understanding the annotator’s intention
      • 2nd challenge: using a targeting approach that is consistent with the annotator’s intent
      • Some schemes limit the possible range of interpretations
        – Chapter, verse & line approaches (e.g., CTS)

  18. Anchoring Methods to Support Curatorial Annotation of Scholarly Text Resources
      • Should be fine-grained
        – for text this means individual words and phrases
      • Should ensure persistence, e.g., even as adjacent content is updated / corrected
      • Can be aligned across derivative formats and serializations, even across repository boundaries
      • Can support search & replace, e.g., the target is the set of all instances found in a specific context
      • Should help distinguish curatorial annotations of a specific digitization & its derivatives from annotations of intellectual substance
