Using an Alignment-based Lexicon for Canonicalization of Historical Text
Bryan Jurish, Henriette Ast
Deutsches Textarchiv, Berlin-Brandenburgische Akademie der Wissenschaften
jurish@bbaw.de
Historical Corpora 2012, Goethe Universität, Frankfurt am Main, 6th-8th December, 2012
HistCorp 2012 / 2012-12-06 / Jurish, Ast / Using an alignment-based lexicon . . . – p. 1/24

Overview
  The Big Picture
    Canonicalization
    Aligned Corpus
    Alignment-based Lexicon
  Nasty Surprises
    Identity Pairs
    Sanitation Engineering
    Trimmed Corpus
  Experiments
    Method
    Results
  Conclusion

— The Big Picture —

Canonicalization
a.k.a. (orthographic) ‘standardization’, ‘normalization’, ‘modernization’, . . .
The Problem
  Historical text ∌ orthographic conventions
  Conventional NLP tools ⇒ strict orthography
    Fixed lexicon keyed by orthographic form
    Extant lexemes only
      þe    Olde   Wydgette   Shoppe
      ↓     ↓      ↓          ↓
      the   old    widget     shop
The Approach
  Map each word $w$ to a unique canonical cognate $\tilde{w}$
    Synchronically active “extant equivalents”
    Preserve both root and relevant features of input
  Defer application analysis to canonical forms

Aligned Corpus
Ground-Truth Canonicalizations
  Manually verified canonicalization pairs $(w \mapsto \tilde{w})$
  Full sentential context
Intuitions
  1. Contemporary editions ⇒ already standardized
  2. Expect mostly identity canonicalizations $(w = \tilde{w})$
Construction (sketch) (Jurish, Drotschmann & Ast, [forthcoming])
  Align historical text with a contemporary edition
    maximize identity alignments
  Confirm or reject type-wise alignments
  Manually annotate only unconfirmed tokens
  126 volumes (1780–1901), 5.6M tokens, 212k types
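The alignment step in the construction sketch can be illustrated with a token-level sequence alignment. This is a minimal sketch and not the authors' procedure: it assumes Python's standard `difflib.SequenceMatcher`, which looks for long matching blocks and therefore favours identity alignments. The helper name `align_tokens`, the toy sentence, and the rule that only equal-length replaced spans are paired are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def align_tokens(historical, modern):
    """Return (w, w_canon) pairs for two token lists (hypothetical helper).

    Identical stretches become identity pairs. Equal-length 'replace'
    blocks are paired position by position. Anything else is paired
    with None, i.e. left for manual annotation.
    """
    pairs = []
    matcher = SequenceMatcher(a=historical, b=modern, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pairs.extend((w, w) for w in historical[i1:i2])
        elif tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(historical[i1:i2], modern[j1:j2]))
        else:
            # deletions and n:m replacements stay unaligned;
            # pure insertions contribute no historical tokens
            pairs.extend((w, None) for w in historical[i1:i2])
    return pairs

pairs = align_tokens("die Thür war offen".split(),
                     "die Tür war offen".split())
print(pairs)
```

Pairs produced this way would still need the type-wise confirmation and manual annotation steps listed on the slide.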

Alignment-based Lexicon
Basic Idea
  Deterministic type-wise mapping LEX : $A^* \to A^* : w \mapsto \tilde{w}$
  Choose most frequent modern form for each input word
    use string identity fallback for unknown words
Expected Weaknesses (Kempken et al., 2006; Gotscharek et al., 2009b)
  Can’t handle any ambiguity
  Identity fallback ⇝ sparse data effects
    especially for productive morphological processes
Alternatives
  ID : naïve string identity baseline
  HMM : robust generative HMM canonicalizer (Jurish, 2010c; 2012)
  HMM + LEX : alignment-based lexicon with HMM fallback
  . . . and more!
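The basic idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `train_lexicon`, `canonicalize` and `toy_fallback` are hypothetical, the training pairs are invented, and ties between equally frequent modern forms are broken by first occurrence. The `fallback` parameter defaults to string identity (LEX); passing another canonicalizer mimics the HMM + LEX cascade, with a toy rewrite rule standing in for the HMM here.

```python
from collections import Counter, defaultdict

def train_lexicon(pairs):
    """Map each historical type to its most frequent aligned modern form."""
    counts = defaultdict(Counter)
    for w, w_canon in pairs:
        counts[w][w_canon] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def canonicalize(w, lexicon, fallback=lambda w: w):
    """Look w up in the lexicon; unknown words are passed to `fallback`."""
    return lexicon[w] if w in lexicon else fallback(w)

# Invented training pairs (historical form, modern form)
pairs = [("Theil", "Teil"), ("Theil", "Teil"),
         ("Theil", "Theil"), ("seyn", "sein")]
lex = train_lexicon(pairs)

def toy_fallback(w):
    # Stand-in for a real fallback canonicalizer such as the HMM
    return w.replace("Th", "T")

print(canonicalize("Theil", lex))               # known type: lexicon lookup
print(canonicalize("Thür", lex))                # unknown type: identity fallback
print(canonicalize("Thür", lex, toy_fallback))  # unknown type: custom fallback
```

The second call shows the sparse-data weakness named on the slide: an unseen historical form passes through unchanged.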

— Nasty Surprises —
(and some ways to deal with them)

Nasty Surprises
Intuition (1) Violations
  Assumed: modern edition ⇒ strict orthography
  Implicitly accepted identity pairs $(w \mapsto w)$
    ca. 59% types, 87% tokens identical modulo transliteration
  Not always justified by the editions used (oops)

  Letter Case     bruder ↦ bruder ≠ Bruder “brother”     trost ↦ trost ≠ Trost “comfort”
  Extinct Forms   ward ↦ ward ≠ wurde “was”              däuchte ↦ däuchte ≠ dünkte “seems”
  Prosodic Foot   andre ↦ andre ≠ andere “other”         eignen ↦ eignen ≠ eigenen “own”
  Dialect         kömmt ↦ kömmt ≠ kommt “comes”          nich ↦ nich ≠ nicht “not”
  Apostrophes     in’s ↦ in’s ≠ ins “into the”           s’ist ↦ s’ist ≠ es ist “it is”

Sanitation Engineering
a.k.a. ‘garbage disposal’
Coarse Pruning (by Region)
  Dropped 5 volumes: verse, case, dialect
  Dropped 204 pages in 41 volumes: dialect, foreign material
  245k tokens ∼ 32k types ∼ 12k local types
Heuristic Pruning (by Type)
  Invalidated all types containing
    apostrophes or quotation marks
    mixture of alphabetic and non-alphabetic characters
  16k tokens ∼ 9k types
The Usual Suspects (under review)
  Inconsistency with respect to online error database
  Unknown “modern” forms (TAGH) (Geyken & Hanneforth, 2006)
  57k tokens ∼ 12k types marked “suspicious”
    currently 55k tokens, 10k suspicious types re-validated
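The heuristic pruning rule can be written as a small type-level predicate. This is a sketch of one possible reading of the slide, not the authors' tool: the function name `is_suspicious_type` and the exact set of quotation characters are assumptions, and the second condition is read literally, so hyphenated compounds and forms containing digits are flagged as well.

```python
# Assumed inventory of apostrophe and quotation characters
QUOTE_CHARS = set("'’‘\"“”„»«")

def is_suspicious_type(w):
    """Flag types with apostrophes/quotes or mixed alphabetic/non-alphabetic content."""
    has_quote = any(ch in QUOTE_CHARS for ch in w)
    has_alpha = any(ch.isalpha() for ch in w)
    has_other = any(not ch.isalpha() for ch in w)
    return has_quote or (has_alpha and has_other)

for w in ["in's", "s’ist", "Bruder", "3te", "1780", ","]:
    print(w, is_suspicious_type(w))
```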

Corpus Summary
Text Resources
  Source texts: Deutsches Textarchiv (DTA)
    Belles lettres, drama, verse, philosophy (1780–1901)
  Target texts: gutenberg.org, zeno.org
  126 volumes ∼ 5.6M tokens ∼ 212k pair-types
Corpus Pruning
  Removed all sentences containing “suspicious” material
  13% tokens ∼ 18% types
Trimmed Corpus
  121 volumes ∼ 4.9M tokens ∼ 174k types
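Sentence-level pruning of this kind might look as follows. The data layout is an assumption made for illustration: a sentence is a list of (historical, modern) pairs, and `suspicious` is a set of pair-types already marked for removal. The function name `trim_corpus` and the example sentences are hypothetical.

```python
def trim_corpus(sentences, suspicious):
    """Keep only sentences in which no pair is marked suspicious."""
    return [s for s in sentences
            if not any(pair in suspicious for pair in s)]

sentences = [
    [("Er", "Er"), ("kömmt", "kömmt")],
    [("Der", "Der"), ("Bruder", "Bruder")],
]
suspicious = {("kömmt", "kömmt")}  # unjustified identity pair

trimmed = trim_corpus(sentences, suspicious)
print(trimmed)
```

Dropping whole sentences rather than single tokens keeps the full sentential context intact for the remaining material.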

— Experiments —

Method
‘Prototype’ Corpus ⇝ Ground-Truth Relevance
  $\mathrm{relevant}(w, \tilde{w}) := \{ (v, \tilde{v}) : \tilde{v} = \tilde{w} \}$
  Most thoroughly annotated corpus subset
  13 volumes ∼ 320k tokens ∼ 28k types (words only)
Training Corpus ⇝ Canonicalization Lexicon (LEX)
  $\mathrm{LEX}(w) = \begin{cases} \arg\max_{\tilde{w}} f(w, \tilde{w}) & \text{if } f(w) > 0 \\ w & \text{otherwise} \end{cases}$
  Strictly disjoint from test corpus (by author)
  101 volumes ∼ 3.5M tokens ∼ 158k types (words only)
Evaluation
  Simulated information retrieval (pr, rc, F) (van Rijsbergen, 1979)
  Tested methods: ID, LEX, HMM, HMM + LEX
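The simulated retrieval evaluation can be sketched as below. The relevance definition follows the slide; everything else is an assumption: each gold pair-type acts as a query, the retrieved set consists of the pairs whose historical form receives the same predicted canonical form as the query word, and counts are summed over all queries before computing pr, rc and F. The function name `evaluate` and the toy gold pairs are hypothetical, and only the type-wise variant is shown.

```python
def evaluate(pairs, canon):
    """Type-wise precision, recall and F for a canonicalizer `canon`.

    `pairs` holds gold (w, w_canon) types. For each query pair, relevant
    items share its gold form and retrieved items share its predicted form.
    """
    types = sorted(set(pairs))
    tp = n_retrieved = n_relevant = 0
    for w, w_gold in types:
        relevant = {p for p in types if p[1] == w_gold}
        retrieved = {p for p in types if canon(p[0]) == canon(w)}
        tp += len(relevant & retrieved)
        n_retrieved += len(retrieved)
        n_relevant += len(relevant)
    pr = tp / n_retrieved
    rc = tp / n_relevant
    f = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    return pr, rc, f

gold = [("Theil", "Teil"), ("Teil", "Teil"),
        ("seyn", "sein"), ("sein", "sein")]
toy_lex = {"Theil": "Teil", "seyn": "sein"}

print(evaluate(gold, lambda w: w))                  # ID baseline
print(evaluate(gold, lambda w: toy_lex.get(w, w)))  # LEX with identity fallback
```

On this toy data the identity baseline retrieves only exact matches, so it is precise but misses every variant spelling.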

Results
             % Types               % Tokens
             pr     rc     F       pr     rc     F
  ID         99.1   55.7   71.3    99.8   78.5   87.9
  LEX        99.0   87.8   93.1    99.8   98.5   99.2
  HMM        98.3   93.6   95.9    99.6   98.5   99.1
  HMM + LEX  98.6   95.7   97.1    99.8   99.3   99.5
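As a consistency check on the table, F can be recomputed from the reported pr and rc values, assuming F is the balanced harmonic mean of precision and recall. Because the inputs are already rounded to one decimal place, the recomputed values are only expected to agree with the reported ones to within about 0.1.

```python
def f_score(pr, rc):
    """Balanced F: harmonic mean of precision and recall."""
    return 2 * pr * rc / (pr + rc)

# (pr, rc, F) as reported on the slide: type-wise, then token-wise
results = {
    "ID":      ((99.1, 55.7, 71.3), (99.8, 78.5, 87.9)),
    "LEX":     ((99.0, 87.8, 93.1), (99.8, 98.5, 99.2)),
    "HMM":     ((98.3, 93.6, 95.9), (99.6, 98.5, 99.1)),
    "HMM+LEX": ((98.6, 95.7, 97.1), (99.8, 99.3, 99.5)),
}

for name, rows in results.items():
    for pr, rc, f_reported in rows:
        print(name, f_reported, round(f_score(pr, rc), 2))
```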

Conclusion
Aligned Corpus
  Fast bootstrapping for a canonicalization lexicon
  . . . but beware of identity mappings!
Alignment-based Canonicalization Lexicon
  Surprisingly effective on its own
    very high precision
    mediocre recall for unknown types (sparse data)
  Better as ‘exception’ lexicon for HMM canonicalizer
    best overall performance
    corpus-based and generative techniques complement one another

þe Olde Laſte Slyde (“The End”)
Thank you for listening!

— Addenda —

Results (pre-cleanup)
             % Types               % Tokens
             pr     rc     F       pr     rc     F
  ID         98.3   57.1   72.2    99.7   79.1   88.2
  LEX        98.3   85.3   91.3    99.7   98.2   98.9
  HMM        97.9   93.2   95.5    99.5   98.9   99.2
  HMM + LEX  98.3   94.8   96.5    99.7   99.2   99.5

Pruning Tool: Document List
  http://kaskade.dwds.de/dta-ecp/view.perl

Pruning Tool: Properties
  http://kaskade.dwds.de/dta-ecp/edit.perl?doc=39#tabProps

Pruning Tool: Regions
  http://kaskade.dwds.de/dta-ecp/edit.perl?doc=39#tabPlot

Pruning Tool: Pairs
  http://kaskade.dwds.de/dta-ecp/edit.perl?doc=39#tabPairs

Corpus Editor: Types View
  http://kaskade.dwds.de/dtaec/types.perl?where=wold%3D%27Holle%27

Corpus Editor: KWIC View
  http://kaskade.dwds.de/dtaec/kwic.perl?where=wold%3D%27Holle%27

Corpus Editor: Sentence View
  http://kaskade.dwds.de/dtaec/sent.perl?sent=86493&token=1823583
