
Constructing a Canonicalized Corpus of Historical German by Text Alignment. Bryan Jurish, Marko Drotschmann (Deutsches Textarchiv, Berlin-Brandenburgische Akademie der Wissenschaften)


  1. Constructing a Canonicalized Corpus of Historical German by Text Alignment
     Bryan Jurish, Marko Drotschmann, Henriette Ast
     Deutsches Textarchiv, Berlin-Brandenburgische Akademie der Wissenschaften
     http://deutschestextarchiv.de
     New Methods in Historical Corpora, Manchester, 29th-30th April, 2011
     NMiHC / 2011-04-30 / Jurish, Drotschmann, Ast / Constructing a canonicalized corpus . . . – p. 1/20

  2. Overview
     The Big Picture
       Canonicalization
       Desiderata
       Proposal
     Construction
       Sources
       Text Alignment
       Manual Annotation
     Applications
       Test Corpus
       Canonicalization Lexicon
     Conclusion

  3. The Big Picture

  4. Canonicalization
     a.k.a. (orthographic) ‘standardization’, ‘normalization’, ‘modernization’, . . .
     The Problem
       Historical text ∌ orthographic conventions
       Conventional NLP tools ⇒ strict orthography
         Fixed lexicon keyed by orthographic form
         Extant lexemes only
       Example: þe Olde Wydgette Shoppe ↦ the old widget shop
     The Approach
       Map each word $w$ to a unique canonical cognate $\tilde{w}$
         Synchronically active “extant equivalents”
         Preserve both root and relevant features of input
       Defer application analysis to canonical forms

  5. Desiderata
     Evaluation
       Compare various canonicalization functions $c(\cdot)$
       Task: information retrieval ⇒ (precision, recall)
       Retrieval via canonical equivalence:
         $\mathrm{retrieved}_c(w) := (c^{-1} \circ c)(w) = \{ v : c(v) = c(w) \}$
       Relevance requires manual verification!
         $\mathrm{relevant}(w) := \;?$
     Ground-Truth Corpus
       Manually verified canonicalization pairs $(w, \tilde{w})$
       “Gold standard” $\tilde{c}(\cdot)$ for training & evaluation:
         $\mathrm{relevant}(w) := \mathrm{retrieved}_{\tilde{c}}(w) = \{ v : \tilde{v} = \tilde{w} \}$
       Minimize manual annotation effort . . . but how?
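The retrieval-based evaluation on this slide can be illustrated with a minimal sketch. It assumes a small hand-made gold-standard mapping (the `GOLD` dictionary and its word forms are hypothetical, not taken from the corpus) and treats the identity function as the canonicalizer under test; the function names are likewise illustrative.

```python
def retrieved(w, vocab, c):
    """retrieved_c(w) = { v in vocab : c(v) == c(w) }."""
    return {v for v in vocab if c(v) == c(w)}


def precision_recall(w, vocab, c, gold):
    """Precision and recall of canonicalizer c for query w.

    Relevance is defined as retrieval under the gold-standard
    canonicalizer, as on the slide. Assumes w is in vocab, so that
    neither set is empty.
    """
    ret = retrieved(w, vocab, c)
    rel = retrieved(w, vocab, gold)
    hits = len(ret & rel)
    return hits / len(ret), hits / len(rel)


# Hypothetical gold-standard pairs (w, canonical form), for illustration only.
GOLD = {
    "Theil": "Teil",
    "Teil": "Teil",
    "theilen": "teilen",
    "Thier": "Tier",
    "Tier": "Tier",
}
vocab = set(GOLD)
gold = GOLD.__getitem__


def identity(w):
    """Baseline canonicalizer: every word is its own canonical form."""
    return w


pr, rc = precision_recall("Teil", vocab, identity, gold)
print(pr, rc)  # 1.0 0.5
```

With the identity canonicalizer, the query "Teil" retrieves only itself, so precision is perfect while the historical variant "Theil" is missed, which halves recall in this toy vocabulary.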

  6. Proposal
     Intuitions
       Contemporary editions of historical works ⇒ already standardized
       Expect mostly identity canonicalizations $w = \tilde{w}$
         (at least for 18th-19th century German)
     Construction (Sketch)
       Align historical text with a contemporary edition
         maximize identity alignments
       Confirm or reject type-wise alignments
         exploit Heaps’ Law
       Manually annotate only unconfirmed tokens
         don’t lose “interesting” anomalous material

  7. Construction

  8. Sources
     Text Resources
       Source texts: Deutsches Textarchiv (DTA)
         Belles lettres, drama, verse, philosophy
       Target texts: gutenberg.org, zeno.org
     Prototype Corpus
       13 volumes, published 1780–1880
       ca. 350k tokens ∼ 28k types (words only)
     Ongoing Construction (‘full’ corpus)
       129 volumes, published 1780–1901
       ca. 5.2M tokens ∼ 219k types (words only)

  9. Text Alignment
     Preprocessing
       Tokenization (1 word / line)
       Transliteration, e.g. (ſ ↦ s), (oͤ ↦ ö)
     Basic Alignment (GNU diff)
       Token-wise LCS
       > 77% identity, > 94% transliterated identity
     Heuristic Alignment
       For each change hunk:
         multi-token alignments, e.g. (zwei und vierzig ↦ zweiundvierzig)
         character-wise ‘best’ match (Levenshtein)
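The alignment pipeline on this slide might be sketched roughly as follows. This is not the authors' implementation: Python's `difflib.SequenceMatcher` stands in for GNU diff (it uses a different matching algorithm than a strict LCS), the transliteration table is a hypothetical subset, and the hunk heuristic is deliberately simplified. The character-wise Levenshtein matching mentioned on the slide is not reproduced here.

```python
import difflib

# Hypothetical transliteration table: long s, and vowels carrying a
# combining superscript e (U+0364). The real table is not given on the slide.
TRANSLIT = {"\u017f": "s", "a\u0364": "ä", "o\u0364": "ö", "u\u0364": "ü"}


def transliterate(token):
    """Map historical characters to their modern equivalents."""
    for old, new in TRANSLIT.items():
        token = token.replace(old, new)
    return token


def align_hunk(src, tgt):
    """Simplified heuristic for one non-identical hunk.

    Equal-length hunks are paired token by token; anything else is kept
    as a single multi-token alignment (an empty side yields "").
    """
    if len(src) == len(tgt):
        return list(zip(src, tgt))
    return [(" ".join(src), " ".join(tgt))]


def align(source, target):
    """Align historical source tokens with contemporary target tokens."""
    keys = [transliterate(t) for t in source]
    matcher = difflib.SequenceMatcher(None, keys, target, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical after transliteration: pair the original tokens.
            pairs.extend(zip(source[i1:i2], target[j1:j2]))
        else:
            pairs.extend(align_hunk(source[i1:i2], target[j1:j2]))
    return pairs


source = ["Jch", "\u017fah", "zwei", "und", "vierzig", "große", "Thiere"]
target = ["Ich", "sah", "zweiundvierzig", "große", "Tiere"]
for pair in align(source, target):
    print(pair)
```

In this toy input the three-token hunk "zwei und vierzig" is kept together and aligned with "zweiundvierzig", mirroring the multi-token example on the slide.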

  10. Type-wise Confirmation
      Idea
        Manually confirm or reject non-identity alignments
        Exploit Heaps’ Law
          vocabulary grows sublinearly (roughly as a power law) with corpus size
        Conservative acceptance only
      Results (prototype corpus)
        Available: 18k tokens ∼ 5.8k types
        Confirmed: 16k tokens (90%) ∼ 4.5k types (77%)
      Throughput
        ca. 3.95 seconds / pair ≈ 15 words / second
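A small sketch may help show why confirming alignments by type pays off: one decision per distinct pair resolves every token occurrence of that pair. The alignment data below are hypothetical. The second part checks one possible reading of the throughput figure, namely that words per second is the prototype corpus size (ca. 350k tokens, from the Sources slide) divided by the time needed to review all 5.8k available types; this reading is an assumption, not something the slide states.

```python
from collections import Counter


def alignment_types(pairs):
    """Count token occurrences per non-identity alignment type (w, canon)."""
    return Counter((w, canon) for w, canon in pairs if w != canon)


def coverage(type_counts, confirmed):
    """Number of tokens resolved by confirming the given alignment types."""
    return sum(type_counts[t] for t in confirmed)


# Hypothetical token-level alignments, for illustration only.
pairs = (
    [("Theil", "Teil")] * 3
    + [("Thier", "Tier")] * 2
    + [("seyn", "sein")]
    + [("und", "und")] * 4
)
type_counts = alignment_types(pairs)
confirmed = {("Theil", "Teil"), ("Thier", "Tier")}
print(len(type_counts), sum(type_counts.values()))  # 3 6
print(coverage(type_counts, confirmed))             # 5


def words_per_second(corpus_tokens, n_types, seconds_per_pair):
    """Corpus tokens processed per second of type-wise review time.

    Assumed reading of the slide's throughput figure.
    """
    return corpus_tokens / (n_types * seconds_per_pair)


rate = words_per_second(350_000, 5_800, 3.95)
print(round(rate))  # 15
```

Under that assumed reading, the slide's figures are mutually consistent: about 22.9k seconds of review for a corpus of roughly 350k tokens.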

  11. Token-wise Annotation
      Idea
        Resolve remaining uncanonicalized tokens (ca. 2%)
        Retain anomalous canonicalization patterns
      Preprocessing Filters
        Block pruning (≈ 2.2%)
        Closed-class lexicon
      Annotations
        Canonical form + administrative flags
        Expert review for problematic cases
      Throughput (total)
        ≈ 1.3 words / second

  12. Experiments

  13. Materials
      Prototype Corpus ⇝ Ground-Truth Relevance
        Most thoroughly annotated corpus subset
        340k tokens; 29k types (words only)
      Full Corpus ⇝ Canonicalization Lexicon (LEX)
        $\mathrm{LEX}(w) = \begin{cases} \arg\max_{\tilde{w}} f(w, \tilde{w}) & \text{if } f(w) > 0 \\ w & \text{otherwise} \end{cases}$
        Strictly disjoint from test corpus (by author)
        Partially annotated (no expert review)
        2.4M tokens; 140k types (words only)
      HMM Canonicalization Cascade (Jurish, 2010c)
        Robust finite-state canonicalizer
      Tested methods: ID, LEX, HMM, HMM + LEX
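The LEX definition above amounts to a most-frequent-canonical-form lookup with an identity fallback for unseen words. A minimal sketch, assuming the training data are available as (word, canonical form) token pairs; the function names and sample pairs are hypothetical:

```python
from collections import Counter, defaultdict


def train_lexicon(pairs):
    """Build a lexicon mapping each word to its most frequent canonical form.

    `pairs` is an iterable of (w, canonical) token-level pairs; the inner
    counters hold the joint frequencies f(w, canonical).
    """
    freq = defaultdict(Counter)
    for w, canon in pairs:
        freq[w][canon] += 1
    return {w: counts.most_common(1)[0][0] for w, counts in freq.items()}


def lex(w, lexicon):
    """LEX(w): the arg max form if w was seen in training, else w itself."""
    return lexicon.get(w, w)


# Hypothetical training pairs, for illustration only.
training = [
    ("Theil", "Teil"),
    ("Theil", "Teil"),
    ("Theil", "Theil"),
    ("seyn", "sein"),
]
lexicon = train_lexicon(training)
print(lex("Theil", lexicon))  # Teil
print(lex("seyn", lexicon))   # sein
print(lex("Haus", lexicon))   # Haus (unseen word, identity fallback)
```

The identity fallback is what limits recall for unknown types: a historical spelling never seen in the training corpus is simply returned unchanged.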

  14. Results
                      % Types                % Tokens
      Method       pr     rc     F        pr     rc     F
      ID           98.3   57.1   72.2     99.7   79.1   88.2
      HMM          97.9   93.2   95.5     99.5   98.9   99.2
      LEX          98.3   85.3   91.3     99.7   98.2   98.9
      HMM + LEX    98.3   94.8   96.5     99.7   99.2   99.5
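Assuming that F in the table is the balanced F-measure (the harmonic mean of precision and recall), which the slide does not state explicitly, the reported values can be recomputed from the pr and rc columns. The sketch below does this; small deviations are expected because the inputs are themselves rounded to one decimal place.

```python
def f_score(pr, rc):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * pr * rc / (pr + rc)


# (pr, rc, F) as reported on the slide, by types and by tokens.
RESULTS = {
    "ID":        {"types": (98.3, 57.1, 72.2), "tokens": (99.7, 79.1, 88.2)},
    "HMM":       {"types": (97.9, 93.2, 95.5), "tokens": (99.5, 98.9, 99.2)},
    "LEX":       {"types": (98.3, 85.3, 91.3), "tokens": (99.7, 98.2, 98.9)},
    "HMM + LEX": {"types": (98.3, 94.8, 96.5), "tokens": (99.7, 99.2, 99.5)},
}

for method, rows in RESULTS.items():
    for level, (pr, rc, reported) in rows.items():
        recomputed = f_score(pr, rc)
        # Agreement within 0.1 is all that rounded inputs can guarantee.
        status = "ok" if abs(recomputed - reported) <= 0.1 else "differs"
        print(f"{method:10s} {level:7s} F={recomputed:.2f} "
              f"(reported {reported}) {status}")
```

Every recomputed value agrees with the reported F to within 0.1, which supports the harmonic-mean reading.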

  15. Conclusion
      Construction
        Alignment with contemporary edition
        Type-wise confirmation
        Token-wise annotation
        ⇝ minimal-effort corpus bootstrapping
      Applications
        Simple corpus-based lexicon ⇒ surprisingly effective
          very high precision
          mediocre recall for unknown types (sparse data)
        ‘Exception’ lexicon for HMM canonicalizer
          best overall performance
          corpus-based and generative techniques complement one another

  16. þe Olde Laſte Slyde (“The End”)
      Thank you for listening!

  17. Addenda

  18. Token Annotation GUI

  19. GUI: Batch Editor Window

  20. Administrivia
      Class    N       % Edited
      LEX      2684    59.22 %
      NE        874    19.29 %
      JOIN      792    17.48 %
      GRAPH     101     2.23 %
      SPLIT      72     1.59 %
      BUG        40     0.88 %
      GONE        8     0.18 %
      FM          1     0.02 %
