Comparing Canonicalizations of Historical German Text (Bryan Jurish)

Comparing Canonicalizations of Historical German Text
  Bryan Jurish
  jurish@bbaw.de
  Project “Deutsches Textarchiv”
  Berlin-Brandenburg Academy of Sciences
  Berlin, Germany
  SIGMORPHON 2010, Uppsala, Sweden
  15 July, 2010
SIGMORPHON-2010 / Jurish / Comparing Canonicalizations . . . – p. 1/22

Overview
  The Big Picture
    The Situation
    The Problem
    The Proposal
  Canonicalization Methods
    Phonetic Identity
    Levenshtein Edit Distance
    Heuristic Rewrite Transducer
  Evaluation
    Test Corpus
    Evaluation Measures
    Results

The Big Picture

The Situation
  Historical Text ∌ Orthographic Conventions
    also applies to OCR text, e-mail, SMS, tweets, . . .
    High variance of graphemic forms:
      fröhlich (“joyful”): frölich, fröhlich, vrœlich, frœlich, froͤlich, froͤhlich, vrölich, fröhlig, frölig, . . .
      Herzenleid (“heart-sorrow”): hertzenleid, herzenleit, hertzenleyd, hertzenlaidt, hertzenlaydt, herzenleyd, . . .
  Conventional NLP Tools ⇒ Strict Orthography
    Document indexers, PoS taggers, stemmers, morphological analyzers, parsers, . . .
    Fixed lexicon keyed by orthographic form
    Extant lexemes only

The Problem
  Historical Corpus ⊕ Conventional Tools = Soup
  Corpus variants missing from application lexicon
    Low coverage (many unknown types)
    Poor recall (relevant data not retrieved)
    Degraded accuracy (poor model fit)
    . . . and more!

The Proposal
  In a Nutshell
    þe  Olde Wydgett Shoppe
    ↓   ↓    ↓       ↓
    the old  widget  shop
    Conflate each word $w$ with its canonical cognates $\tilde{w}$
    Defer application analysis to canonical forms:
      $\mathrm{analyses}_R(w) := \bigcup_{\tilde{w} \in \mathrm{Lex} \cap [w]_R} \mathrm{analyses}(\tilde{w})$
  Canonical Cognates
    Synchronically active “extant equivalents” $\tilde{w} \in \mathrm{Lex}$
    Preserve both root and relevant features of input
  Conflation Relation
    Binary relation $\sim_R$ on strings (words) in $\mathcal{A}^*$
    Prototypically a true equivalence relation
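
The deferral of analysis to canonical cognates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_key` is a made-up conflation key that happens to work for the slide's “Olde Wydgett Shoppe” example, and `LEXICON_ANALYSES` is a hypothetical four-entry application lexicon. Neither is part of the talk.

```python
import re

# Hypothetical application lexicon: extant form -> set of analyses.
LEXICON_ANALYSES = {
    "the": {"DET"},
    "old": {"ADJ"},
    "widget": {"NOUN"},
    "shop": {"NOUN", "VERB"},
}

def toy_key(word):
    """Made-up conflation key; words with equal keys are conflated."""
    form = word.lower().replace("þ", "th").replace("y", "i")
    form = re.sub(r"(.)\1", r"\1", form)  # collapse doubled letters
    return form.rstrip("e")               # drop trailing e

def analyses_r(word, lexicon_analyses=LEXICON_ANALYSES, key=toy_key):
    """Union of the analyses of every lexicon entry conflated with `word`."""
    target = key(word)
    result = set()
    for entry, analyses in lexicon_analyses.items():
        if key(entry) == target:  # entry lies in Lex ∩ [word]_R
            result |= analyses
    return result

if __name__ == "__main__":
    for w in ["þe", "Olde", "Wydgett", "Shoppe"]:
        print(w, sorted(analyses_r(w)))
```

Because conflation is defined through a key function, the relation is automatically an equivalence relation, which matches the prototypical case named on the slide.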

Canonicalization Methods

Phonetic Conflation: Sketch
  Idea (Jurish, 2008)
    Map each word $w$ to a unique phonetic form $\mathrm{pho}(w)$
    Conflate words with identical phonetic forms:
      $w \sim_{\mathrm{Pho}} v \;:\Leftrightarrow\; \mathrm{pho}(w) = \mathrm{pho}(v)$
  Phonetization: Letter-to-Sound (LTS) Conversion
    Well-known in text-to-speech (TTS) research
    ims_german_festival LTS rule-set (Möhler et al., 2001)
      slightly modified for historical input
      compiled as a finite-state transducer (FST):
      $M_{\sim_{\mathrm{Pho}}} = M_{\mathrm{Pho}} \circ M_{\mathrm{Pho}}^{-1} \circ \mathrm{Id}(\mathrm{Lex})$
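
A minimal sketch of conflation by phonetic identity follows. The rule list `TOY_LTS_RULES` is a hypothetical stand-in of six regular-expression rewrites, not the ims_german_festival rule-set, and the dictionary lookup only mimics the effect of the composed transducer on a finite lexicon.

```python
import re
from collections import defaultdict

# Hypothetical letter-to-sound rules, applied in order (NOT the real LTS rule-set).
TOY_LTS_RULES = [
    (r"ie", "i"),               # <ie> as long i
    (r"([aeiouäöü])h", r"\1"),  # lengthening h after a vowel
    (r"v", "f"),
    (r"th", "t"),
    (r"tz", "z"),
    (r"y", "i"),
]

def pho(word):
    """Map a word to a single toy 'phonetic' form."""
    form = word.lower()
    for pattern, replacement in TOY_LTS_RULES:
        form = re.sub(pattern, replacement, form)
    return form

def pho_conflate(word, lexicon):
    """Lexicon entries whose phonetic form equals that of `word`."""
    index = defaultdict(set)
    for entry in lexicon:
        index[pho(entry)].add(entry)
    return index.get(pho(word), set())

if __name__ == "__main__":
    lexicon = {"in", "ihn", "fiel", "gut", "herzenleid"}
    for w in ["hertzenleyd", "vil", "in", "guot"]:
        print(w, sorted(pho_conflate(w, lexicon)))
```

Even this toy version shows both failure modes discussed on the next slide: `in` is conflated with `ihn`, while `guot` finds no cognate at all.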

Phonetic Conflation: Problems
  Insufficient (too permissive)
    Phonetic Identity ⇏ Lexical Equivalence
    Precision Errors (conflated but not equivalent)
    Not too dangerous (yet)
      usz–Uhus (“out”–“owls”), vil–fiel (“much”–“fell”), in–ihn (“in”–“him”)
  Unnecessary (too strict)
    Phonetic Identity ⇍ Lexical Equivalence
    Recall Errors (equivalent but not conflated)
    This is the more severe of the two problems!
      guot–gut (“good”), tiuvel–Teufel (“devil”), umb–um (“around”)

Levenshtein Conflation: Sketch
  Idea
    Relax strict identity criterion (improve recall)
    Map each input word to “nearest” extant type
      string edit distance (Levenshtein, 1966)
      computable even for infinite lexica (Mohri, 2002)
  Gory Details
    $\mathrm{best}_{\mathrm{Lev}}(w) := \arg\min_{v \in \mathrm{Lex}} [\![ M_{\mathrm{Lev}} ]\!](w, v)$
    $w \sim_{\mathrm{Lev}} v \;:\Leftrightarrow\; \mathrm{best}_{\mathrm{Lev}}(w) = \mathrm{best}_{\mathrm{Lev}}(v)$
    Synchronic lexicon $\mathrm{Lex} \subseteq \mathcal{A}^*$
      TAGH input language (Geyken & Hanneforth, 2006)
    Edit Distance WFST $M_{\mathrm{Lev}}$
      Best-first search using gfsmxl C library
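
For a small finite lexicon, the definition of best_Lev can be illustrated by brute force. The sketch below assumes unit edit costs and a plain Python list as lexicon; it does not use gfsmxl or a WFST, and the alphabetical tie-breaking is an arbitrary choice made here for determinism.

```python
def levenshtein(a, b):
    """Unit-cost edit distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def best_lev(word, lexicon):
    """Nearest lexicon entry; ties are broken alphabetically (an arbitrary choice)."""
    return min(lexicon, key=lambda v: (levenshtein(word, v), v))

def lev_conflated(w, v, lexicon):
    """w ~Lev v iff both words map to the same nearest lexicon entry."""
    return best_lev(w, lexicon) == best_lev(v, lexicon)

if __name__ == "__main__":
    lexicon = ["gut", "teufel", "um"]
    for w in ["guot", "tiuvel", "umb"]:
        print(w, "->", best_lev(w, lexicon))
```

With this toy lexicon the three recall errors of phonetic conflation (guot, tiuvel, umb) are all mapped to their intended cognates.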

Levenshtein Conflation: Problems
  Search Space too Large
    Backtracking & heap maintenance are $O(|\mathcal{A}| \cdot |w|)$
    circa 150 times slower than phonetic conflation
  Metric Granularity too Coarse
    No context-sensitivity: c(th → t) = c(uhu → uu) = 1
    No target-sensitivity: c(ü → i) = c(ü → x) = 1
  Examples for d_Lev = 1
    w       best_Lev(w)        canonical cognate
    aug     aus (“out”)        auge (“eye”)
    faszt   fast (“almost”)    fasst (“grabs”)
    ouch    buch (“book”)      auch (“also”)
    ram     rat (“advice”)     rahm (“cream”)
    vol     volk (“people”)    voll (“full”)
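
The coarseness of the metric can be checked directly: under unit costs, each historical form in the examples is exactly one edit away from both the selected form and the intended cognate, so the metric alone cannot prefer the right one. The sketch below assumes plain unit-cost Levenshtein distance; the triples are copied from the slide.

```python
def levenshtein(a, b):
    """Unit-cost edit distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# (historical form, form chosen by best_Lev, intended canonical cognate)
EXAMPLES = [
    ("aug", "aus", "auge"),
    ("faszt", "fast", "fasst"),
    ("ouch", "buch", "auch"),
    ("ram", "rat", "rahm"),
    ("vol", "volk", "voll"),
]

def is_tie(word, chosen, intended):
    """True if both candidates are exactly one edit away from `word`."""
    return levenshtein(word, chosen) == 1 and levenshtein(word, intended) == 1

if __name__ == "__main__":
    for w, chosen, intended in EXAMPLES:
        print(w, chosen, intended, is_tie(w, chosen, intended))
    # No context-sensitivity: both edits cost exactly 1.
    print(levenshtein("th", "t"), levenshtein("uhu", "uu"))
```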

Rewrite Cascade: Sketch
  Idea: Generalized Edit Distance via WFSTs
    Replace coarse Levenshtein metric
    Reduce search space
    Attenuate edit costs for e.g.
      elision         mp → m / # ⟨1⟩,   n → en / # ⟨5⟩
      vowel shift     o → a / u ⟨1⟩,    o → a / ⟨9⟩
      (un)voicing     p → b / ⟨5⟩,      b → p / ⟨8⟩
      corpus quirks   sz → ß / ⟨1⟩,     f → s / ⟨10⟩
  Implementation
    Heuristic “rewrite” transducer $M_{\mathrm{rw}}$ replaces $M_{\mathrm{Lev}}$
      $w \sim_{\mathrm{rw}} v \;:\Leftrightarrow\; \mathrm{best}_{\mathrm{rw}}(w) = \mathrm{best}_{\mathrm{rw}}(v)$
    306 manually constructed SPE-style two-level rules
    circa 40 times faster than Levenshtein conflation
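
The idea of a generalized edit distance with rule-specific costs can be sketched as a best-first search over weighted string rewrites. Everything concrete here is an assumption made for illustration: the six rules in `TOY_RULES`, their costs, the absence of context conditions, and the `max_cost` bound. The actual system uses 306 context-sensitive rules compiled into a weighted transducer.

```python
import heapq

# Hypothetical context-free rewrite rules: (left-hand side, right-hand side, cost).
TOY_RULES = [
    ("sz", "ß", 1),
    ("uo", "u", 1),
    ("mb", "m", 1),
    ("tz", "z", 1),
    ("y", "i", 1),
    ("v", "f", 5),
]

def best_rw(word, lexicon, rules=TOY_RULES, max_cost=10):
    """Cheapest lexicon entry reachable from `word` by rule applications.

    Returns (entry, cost), or (None, None) if nothing is reachable within max_cost.
    """
    heap = [(0, word)]
    seen = set()
    while heap:
        cost, form = heapq.heappop(heap)
        if form in lexicon:
            return form, cost  # first lexicon hit popped is a cheapest one
        if form in seen:
            continue
        seen.add(form)
        for lhs, rhs, rule_cost in rules:
            start = form.find(lhs)
            while start != -1:  # try the rule at every matching position
                new_cost = cost + rule_cost
                if new_cost <= max_cost:
                    rewritten = form[:start] + rhs + form[start + len(lhs):]
                    heapq.heappush(heap, (new_cost, rewritten))
                start = form.find(lhs, start + 1)
    return None, None

if __name__ == "__main__":
    lexicon = {"gut", "um", "herzenleid"}
    for w in ["guot", "umb", "hertzenleyd", "xyz"]:
        print(w, "->", best_rw(w, lexicon))
```

Restricting the search to a small set of licensed rewrites, rather than all single-character edits, is what shrinks the search space relative to the Levenshtein approach.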

Rewrite Cascade: Problems
  Resource-Intensive
    Heuristic rule-set must be manually developed
      requires “expert” knowledge
      time-consuming task
  Language-Specific
    No immediate generalization to other languages
  Computationally Expensive
    circa 4 times slower than Pho
    . . . still a big improvement over Lev

Evaluation

Evaluation: Basics
  Gold Standard Test Corpus G
    Historical German verse from e-DWB1 (Bartz et al., 2004)
    11,242 tokens; 4,157 types
    Canonical cognate manually assigned to each token
  Evaluation Measures
    Simulated information retrieval task
    Type- and token-wise precision (pr), recall (rc), and F
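
The slide does not spell out how the simulated retrieval task is scored, so the following is only one plausible reading: each gold canonical form serves as a query, an item is retrieved if the conflation key matches the query's key, and it is relevant if its gold cognate equals the query. The four-token corpus `GOLD` and both key functions are invented for the example.

```python
from collections import Counter

def prf(tp, fp, fn):
    """Precision, recall and balanced F from raw counts."""
    pr = tp / (tp + fp) if tp + fp else 0.0
    rc = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * pr * rc / (pr + rc) if pr + rc else 0.0
    return pr, rc, f

def evaluate(gold_tokens, key, tokenwise=False):
    """Score a conflation key against (word, canonical cognate) token pairs."""
    freq = Counter(gold_tokens)                # distinct pairs with frequencies
    queries = {canon for _, canon in freq}     # every gold cognate is a query
    tp = fp = fn = 0
    for query in queries:
        for (word, canon), n in freq.items():
            weight = n if tokenwise else 1     # token-wise: weight by frequency
            retrieved = key(word) == key(query)
            relevant = canon == query
            if retrieved and relevant:
                tp += weight
            elif retrieved:
                fp += weight
            elif relevant:
                fn += weight
    return prf(tp, fp, fn)

# Invented toy gold standard: (historical token, canonical cognate).
GOLD = [("in", "in"), ("in", "in"), ("ihn", "ihn"), ("guot", "gut")]

def identity(word):
    return word

def drop_h(word):
    """Toy key that conflates in/ihn, standing in for a phonetic method."""
    return word.replace("h", "")

if __name__ == "__main__":
    print("Id, type-wise: ", evaluate(GOLD, identity))
    print("Id, token-wise:", evaluate(GOLD, identity, tokenwise=True))
    print("h-drop, type-wise:", evaluate(GOLD, drop_h))
```

On this toy corpus string identity is perfectly precise but misses guot, whereas the h-dropping key trades precision for no gain in recall, which mirrors the qualitative pattern reported in the results.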

Evaluation: Results
            Type-wise %            Token-wise %
  R         pr     rc     F        pr_f   rc_f   F_f
  Id        99.9   70.8   82.9     99.1   83.7   90.7
  Pho       96.7   80.1   87.6     92.7   89.6   91.1
  Lev       96.6   78.9   86.9     97.2   87.8   92.2
  rw        98.5   88.4   93.2     98.2   93.4   95.8
  Pho|Lev   94.1   84.3   88.9     91.3   91.6   91.5
  Pho|rw    96.1   89.8   92.8     92.5   94.5   93.5
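
The F columns can be cross-checked against the precision and recall columns. The sketch assumes F is the balanced harmonic mean 2·pr·rc/(pr + rc), which the slide does not state explicitly. Since the published inputs are rounded to one decimal, the comparison allows a small tolerance.

```python
# Rows copied from the results table:
# (method, type pr, type rc, type F, token pr, token rc, token F)
RESULTS = [
    ("Id",      99.9, 70.8, 82.9, 99.1, 83.7, 90.7),
    ("Pho",     96.7, 80.1, 87.6, 92.7, 89.6, 91.1),
    ("Lev",     96.6, 78.9, 86.9, 97.2, 87.8, 92.2),
    ("rw",      98.5, 88.4, 93.2, 98.2, 93.4, 95.8),
    ("Pho|Lev", 94.1, 84.3, 88.9, 91.3, 91.6, 91.5),
    ("Pho|rw",  96.1, 89.8, 92.8, 92.5, 94.5, 93.5),
]

def f_measure(pr, rc):
    """Balanced F: harmonic mean of precision and recall."""
    return 2 * pr * rc / (pr + rc)

def max_deviation(rows=RESULTS):
    """Largest gap between a recomputed F and the reported F."""
    gaps = []
    for _, pr, rc, f, pr_tok, rc_tok, f_tok in rows:
        gaps.append(abs(f_measure(pr, rc) - f))
        gaps.append(abs(f_measure(pr_tok, rc_tok) - f_tok))
    return max(gaps)

if __name__ == "__main__":
    for name, pr, rc, f, pr_tok, rc_tok, f_tok in RESULTS:
        print(f"{name:8s} type F {f_measure(pr, rc):.2f} (reported {f})  "
              f"token F {f_measure(pr_tok, rc_tok):.2f} (reported {f_tok})")
    print("max deviation:", round(max_deviation(), 3))
```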

Evaluation: Results: Id
  Id: type-wise pr 99.9, rc 70.8, F 82.9; token-wise pr 99.1, rc 83.7, F 90.7
  Id: naïve string identity
    Most precise, but worst recall
    Especially poor recall for low-frequency types
    Historical text really is tricky!

Evaluation: Results: Pho
  Pho: type-wise pr 96.7, rc 80.1, F 87.6; token-wise pr 92.7, rc 89.6, F 91.1
  Pho: Phonetic conflation
    Poor token-wise precision
    Small number of errors for high-frequency types
      in–ihn (“in”–“him”)
      wider–wieder (“against”–“again”)

Evaluation: Results: Lev
  Lev: type-wise pr 96.6, rc 78.9, F 86.9; token-wise pr 97.2, rc 87.8, F 92.2
  Lev: Levenshtein conflation
    No recall improvement vs. Pho
      too many spurious conflations
    union Pho | Lev does somewhat better

Evaluation: Results: rw
  rw: type-wise pr 98.5, rc 88.4, F 93.2; token-wise pr 98.2, rc 93.4, F 95.8
  rw: Heuristic rewrite transducer
    Best method overall
    circa 60% fewer recall errors vs. string identity
    Recall further improved by including Pho
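
The “circa 60% fewer recall errors” figure can be reproduced from the reported recall values, assuming a recall error rate is simply 100 minus the recall percentage. The function name is chosen here for illustration.

```python
def recall_error_reduction(rc_baseline, rc_method):
    """Relative reduction in recall errors, with recall given in percent."""
    err_baseline = 100.0 - rc_baseline
    err_method = 100.0 - rc_method
    return (err_baseline - err_method) / err_baseline

if __name__ == "__main__":
    # Recall of Id vs. rw, copied from the results table.
    print("type-wise: ", round(recall_error_reduction(70.8, 88.4), 3))
    print("token-wise:", round(recall_error_reduction(83.7, 93.4), 3))
```

Both the type-wise and the token-wise reduction come out within one percentage point of 60%.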

Conclusion
  Summary
    Historical text corpora and conventional tools won’t play together nicely
    Best canonicalization by heuristic rewrite FST
      implementing linguistic intuitions helps!
    Phonetic, Levenshtein methods more accessible
      improved by exception lexica, cost upper bounds
  Next Steps
    Larger corpus (under construction)
    Precision recovery for overgeneration (alpha)
    Language-independent (pseudo-)metrics

þe Olde Laſt Slyde (“The End”)
  Thank you for listening!
  http://www.deutschestextarchiv.de
