Morphological and Part-of-Speech Tagging of Historical Language Data: A Comparison
Stefanie Dipper
Linguistics Department, Ruhr-University Bochum
5 January 2012
Workshop on Annotation of Corpora for Research in the Humanities, Heidelberg
Stefanie Dipper Morphological and POS Tagging 5.1.2012 1 / 27

Goals
A morphological and a part-of-speech (POS) tagger for texts from Middle High German (MHG, 1050–1350)
  Ex: sach ‘saw’
    Morph: 3.Sg.Past.Ind
    POS: VVFIN
Use of available state-of-the-art taggers
  - without any adaption
  - some preprocessing of the annotated training data
Challenge: highly variant data
  - no spelling conventions, e.g. sitzen, sizzen ‘sit’
  - different dialects, e.g. bruoder, pruder ‘brother’
  → Data sparseness
Questions:
  - (how much) does normalization help?
    – original (“diplomatic”) vs. normalized wordforms
  - does POS preprocessing help morphological tagging?

Outline
1 The corpus
2 Training experiments

The corpus: Project context
Projects “Mittelhochdeutsche Grammatik” and “Referenzkorpus Mittelhochdeutsch” (Universities of Bochum and Bonn)
Goals
  - a balanced, annotated reference corpus of MHG
  - diplomatic transcriptions
  - final size: 300 texts, 2 million wordforms
  - available via the internet (ANNIS)
Annotations
  - parts of speech (POS)
  - morphological tags
  - lemma
  - normalized wordform
Currently: semi-automatic annotation
  - tools by Thomas Klein, Bonn (2001)
  - require a lot of human intervention

The corpus: The data
  - 51 texts with 211,000 tokens from the MHG Reference Corpus
  - From two dialect regions: Upper (UG) and Central German (CG)

The corpus: Upper and Central (and Lower) German
[Figure: Upper, Central and Lower German dialect regions. Source: Wikimedia]

The corpus: Spelling variation
E.g. (normalized) wolte ‘wanted’:
  wolt – wolta – wolte – woltt – wolti – wolthe – walde – uuolde – volde – ...
Normalization: mapping to a virtual, idealized historical wordform
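
To make the effect of normalization on data sparseness concrete, here is a minimal sketch that maps the spelling variants listed above to the normalized wordform. The lookup table and the function name `normalize` are assumptions made for illustration only; in the corpus itself the normalized wordform is an annotation layer, not the output of a simple dictionary.

```python
# Hypothetical lookup table: diplomatic spelling -> normalized wordform.
# The variants are the ones listed on the slide; the table is illustrative.
VARIANTS_OF_WOLTE = [
    "wolt", "wolta", "wolte", "woltt", "wolti",
    "wolthe", "walde", "uuolde", "volde",
]
NORMALIZATION = {variant: "wolte" for variant in VARIANTS_OF_WOLTE}


def normalize(token: str) -> str:
    """Return the normalized wordform, or the token unchanged if unknown."""
    return NORMALIZATION.get(token.lower(), token)


diplomatic = ["uuolde", "wolthe", "volde", "sach"]
normalized = [normalize(t) for t in diplomatic]

print(normalized)
print(len(set(diplomatic)), "diplomatic types ->",
      len(set(normalized)), "normalized types")
```

Collapsing several diplomatic types into one normalized type is what lowers the type count a tagger has to learn from.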

The corpus: The data: some statistics
         Texts  Tokens   Types (dipl.)  Types (norm.)  TTR (dipl.)  TTR (norm.)  [TTR NHG]
total    51     211,000  40,500         20,500         .19          .10          [.14]
CG       27     91,000   22,000         13,000         .24          .14          [.18]
UG       20     67,000   15,000         8,500          .22          .13
mixed    4      53,000
On average: roughly 2 spelling variants (diplomatic) per wordform (normalized)
Type-token ratio (TTR): higher ratio → more diverse data
  – CG more diverse than UG
  [– cf. modern German (NHG): TTR = .14/.18]
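
The ratios on this slide can be recomputed from the (rounded) token and type counts. The sketch below assumes the type-token ratio is simply types divided by tokens; the dictionary `COUNTS` and the function name are illustrative, not part of the original material.

```python
def type_token_ratio(types: int, tokens: int) -> float:
    """Type-token ratio: number of distinct wordforms divided by tokens."""
    return types / tokens


# (tokens, diplomatic types, normalized types), rounded counts from the slide
COUNTS = {
    "total": (211_000, 40_500, 20_500),
    "CG": (91_000, 22_000, 13_000),
    "UG": (67_000, 15_000, 8_500),
}

for name, (tokens, dipl, norm) in COUNTS.items():
    print(f"{name}: TTR dipl {type_token_ratio(dipl, tokens):.2f}, "
          f"norm {type_token_ratio(norm, tokens):.2f}, "
          f"variants per wordform {dipl / norm:.2f}")
```

Dividing the diplomatic by the normalized type count gives the "roughly 2 spelling variants per wordform" figure.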

The corpus: Predictions I
1. Normalized vs. diplomatic: tagging normalized data should be easier
2. CG vs. UG vs. NHG: rather unclear
   a) pro CG: more training data available
   b) pro UG: less diverse (lower type/token ratio)
   c) pro MHG (normalized, equal size): less diverse than NHG

The corpus: Tagsets
POS: based on the STTS tagset (standard German tagset)
  - NN, NE: “normal noun, proper noun”
  - VVFIN, VVINF, VVIMP, VVPP: “finite full verb, infinitive, imperative, past participle”
Morphology: “large” STTS tagset
  - Comp.Fem.Acc.Sg: “(adjective:) comparative form, feminine, accusative, singular”
  - 3.Sg.Past.*: “(verb:) 3rd singular past tense, unspecified for mood”

The corpus: Underspecification
  - POS and morph: more underspecified tags in MHG than in NHG (no native speakers)
  - Gender of nouns: not yet as fixed as nowadays
Example: slange ‘snake’: masc/fem
  daz   si        slangen         bizzen
  -     *.Acc.Pl  MascFem.Nom.Pl  3.Pl.Past.*
  that  them      snakes          bit
  ‘that snakes bit them’
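
One way to read an underspecified tag is as a pattern in which `*` stands for any value of that feature. The sketch below works under that assumption; the function `tag_matches` is hypothetical, and the fully specified tag `3.Pl.Past.Ind` is formed by analogy with `3.Sg.Past.Ind`. Combined values such as `MascFem` are not handled here.

```python
def tag_matches(underspecified: str, full: str) -> bool:
    """True if each feature of `underspecified` is '*' or equals the
    corresponding feature of `full` (features are separated by '.')."""
    u_feats = underspecified.split(".")
    f_feats = full.split(".")
    if len(u_feats) != len(f_feats):
        return False
    return all(u == "*" or u == f for u, f in zip(u_feats, f_feats))


# bizzen 'bit' is tagged 3.Pl.Past.* (unspecified for mood)
print(tag_matches("3.Pl.Past.*", "3.Pl.Past.Ind"))  # mood left open
print(tag_matches("3.Pl.Past.*", "3.Sg.Past.Ind"))  # number differs
```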

The corpus: Tagsets: some statistics (normalized data)
POS          # Tags  Ø Tags/wordform  Median (max)
CG norm      44      1.10 ± 0.37      1 (7)
UG norm      41      1.10 ± 0.35      1 (6)
NHG (210K)   53      1.05 ± 0.23      1 (6)
NHG (90K)    51      1.04 ± 0.21      1 (6)
Morphology   # Tags  Ø Tags/wordform  Median (max)
CG norm      245     1.40 ± 1.16      1 (23)
UG norm      219     1.46 ± 1.28      1 (33)
NHG (210K)   230     1.37 ± 0.97      1 (26)
NHG (90K)    205     1.32 ± 0.86      1 (18)
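
The tags-per-wordform figures measure ambiguity: how many distinct tags each wordform occurs with. A minimal sketch of how such statistics could be computed from (wordform, tag) pairs follows. The toy data and tag assignments are invented for illustration, and the use of the population standard deviation is an assumption, since the slides do not say which variant was used.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def tags_per_wordform(tagged_tokens):
    """Map each wordform to the set of distinct tags it occurs with."""
    tags = defaultdict(set)
    for wordform, tag in tagged_tokens:
        tags[wordform].add(tag)
    return tags


def ambiguity_stats(tagged_tokens):
    """Mean, standard deviation, median and maximum of tags per wordform."""
    counts = [len(t) for t in tags_per_wordform(tagged_tokens).values()]
    return {
        "mean": mean(counts),
        "sd": pstdev(counts),  # assumption: population standard deviation
        "median": median(counts),
        "max": max(counts),
    }


# Toy data, invented for illustration (not taken from the corpus).
toy = [
    ("sach", "VVFIN"),
    ("daz", "ART"), ("daz", "KOUS"), ("daz", "PDS"),
    ("bruoder", "NN"), ("bruoder", "NN"),
]
stats = ambiguity_stats(toy)
print(stats)
```

A median of 1 combined with a high maximum, as in the tables above, indicates that most wordforms are unambiguous while a few are highly ambiguous.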

The corpus: Predictions II
1. Normalized vs. diplomatic: tagging normalized data should be easier
2. CG vs. UG vs. NHG: rather unclear
   a) pro CG: more training data available
   b) pro UG: less diverse (lower type/token ratio)
   c) pro MHG (normalized, equal size): less diverse than NHG
3. Morphology vs. POS: tagging POS should be easier (lower ambiguity rate)
4. CG vs. UG vs. NHG: again, rather unclear
   a) CG and UG rather similar
   b) POS: pro UG, morph: pro CG (lower maxima)
   c) pro NHG: lower ambiguity rates
