Design and Development of Part-of-Speech-Tagging Resources for Wolof

Cheikh M. Bamba Dione, Jonas Kuhn, Sina Zarrieß
Department of Linguistics, University of Potsdam (Germany)
Institute for Natural Language Processing (IMS), University of Stuttgart (Germany)

1. Introduction: Wolof, a Low Resource Language
2. Starting from Scratch: Tagset Design
3. Fast Gold Standard Annotation
4. Experiments with State-of-the-art PoS Taggers

Wolof
- Spoken in Senegal
- Lingua franca for 80% of Senegal's population (9 million speakers)
- 4 million native speakers
- West-Atlantic language

Wolof Language
- Complex system of inflectional markers/pronouns (almost no verbal inflection)
- Very productive derivation morphology

Ex. Object vs. Subject focus
(1) Maa           lekk  mburu.
    FOC-Subj.1SG  eat   bread.
    "It was me who ate bread."
(2) Mburu  laa          lekk.
    Bread  FOC-Obj.1SG  eat.
    "It was bread that I ate."

Ex. Applicative
(3) Togg-al    naa  xale   bi   ceeb.
    Cook-APPL  1SG  child  DET  rice.
    "I cooked rice for the child."

Wolof Resources
- No NLP tools or resources available for Wolof!
- Linguistically quite well documented (some descriptive grammars, recent work on specific aspects of the grammar)
- Some online resources
  - Wolof Wikipedia: 1065 articles (Problem: inconsistent orthography)
- We used the Wolof Bible
  - Consistent orthography
  - Available as a parallel corpus (e.g. English, French, Arabic translations)

Motivation
Low resource languages are ...
- investigated in theoretical linguistics, but annotated corpora are missing
  - University of Potsdam: research programme on information structure
  - NLP resources support corpus-based, cross-lingual investigations of information structure
- a test-bed for NLP techniques existing for well-resourced languages
  - often simulated by using small sets from well-resourced languages (e.g. in research on bootstrapping, unsupervised learning techniques, ...)

Starting from Scratch: Tagset Design
- No established Part-of-Speech inventory for Wolof (not even on the level of coarse-grained lexical categories)
  - Debate about adjectives in Wolof
  - Inconsistent glosses/categorisations in the theoretical literature
  - Inconsistencies for verb categories
- What is the appropriate level of tagset granularity?
  - Should the tagset capture e.g. nominal classes?

Tagset Design: General Strategy
- General desiderata for a tagset:
  - Capture interesting linguistic categories
  - Be predictable/learnable for automatic taggers
  - EAGLES guidelines, Leech and Wilson [1996]
- Interleaving tagset design and annotation experiments
- Distinguishing various granularity levels

Establishing Tagset Granularity
- Started out with fairly detailed tagset (200 tags)
- Experiments with tagset reductions
- Final “standard tagset” includes theoretically interesting distinctions that can be reasonably made by automatic PoS taggers

Granularity levels

Definite Articles        Detailed     Medium     General    Standard
                         (200 tags)   (44 tags)  (14 tags)  (80 tags)
SG/b-class/proximal      ATDs.b.P     ATDs       AT         ARTD
PL/y-class/remote        ATDp.y.R     ATDp       AT         ARTD
SG/b-class/sent. focus   ATDs.b.SF    ATDSF      AT         ARTF
SG/w-class/sent. focus   ATDs.w.SF    ATDSF      AT         ARTF
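
The granularity levels amount to a many-to-one mapping from detailed tags to coarser ones. A minimal Python sketch of such a reduction, using only the four definite-article rows from the table as a lookup. The dictionary layout and the names `REDUCTIONS`, `reduce_tag` and `reduce_corpus` are illustrative assumptions, not the authors' implementation:

```python
from typing import Dict, List, Tuple

# Lookup built from the four definite-article rows shown on the slide.
# The layout and all names here are assumptions made for illustration.
REDUCTIONS: Dict[str, Dict[str, str]] = {
    "ATDs.b.P":  {"medium": "ATDs",  "general": "AT", "standard": "ARTD"},
    "ATDp.y.R":  {"medium": "ATDp",  "general": "AT", "standard": "ARTD"},
    "ATDs.b.SF": {"medium": "ATDSF", "general": "AT", "standard": "ARTF"},
    "ATDs.w.SF": {"medium": "ATDSF", "general": "AT", "standard": "ARTF"},
}


def reduce_tag(detailed_tag: str, level: str) -> str:
    """Map a detailed tag to the requested granularity level."""
    if level == "detailed":
        return detailed_tag
    return REDUCTIONS[detailed_tag][level]


def reduce_corpus(tagged: List[Tuple[str, str]], level: str) -> List[Tuple[str, str]]:
    """Re-tag a (token, detailed_tag) sequence at a coarser level."""
    return [(token, reduce_tag(tag, level)) for token, tag in tagged]
```

Because the annotation is stored at the most detailed level, every coarser tagset can be derived from it without re-annotating.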

Interleaving Tagset Design and Annotation
PoS categories for Wolof verbs

Problem: finiteness
- Theoretical work on Wolof establishes 3 verb finiteness categories: VVFIN, VVINF, VVNFN (Zribi-Hertz and Diagne [2002])
- Automatic PoS taggers do not learn the distinction

Ten most frequent errors on tagset with 3 verb categories

(incorr.) system tag   gold tag   error ratio wrt. gold tag   tokens affected
VVFIN                  VVNFN      5.88%                       0.83%
VVNFN                  VVINF      45.24%                      0.72%
NC                     VVNFN      4.28%                       0.60%
VVNFN                  VVFIN      30.43%                      0.53%
NC                     NP         12.22%                      0.42%
VVNFN                  VVRP       29.17%                      0.26%
VVNFN                  NC         2.23%                       0.23%
VVINF                  VVNFN      1.60%                       0.23%
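
An error table of this kind can be derived from two aligned tag sequences. The sketch below assumes that "error ratio wrt. gold tag" means the number of times a system/gold confusion occurs divided by the frequency of that gold tag, and that "tokens affected" is the same count divided by all tokens. Both readings, and the function name `confusion_table`, are assumptions rather than definitions given on the slide:

```python
from collections import Counter
from typing import List, Tuple


def confusion_table(gold: List[str], system: List[str],
                    top_n: int = 10) -> List[Tuple[str, str, float, float]]:
    """Return the top_n most frequent (system tag, gold tag) confusions.

    Each row is (system_tag, gold_tag, error_ratio_wrt_gold, tokens_affected),
    with both ratios expressed as fractions between 0 and 1.
    """
    if len(gold) != len(system):
        raise ValueError("gold and system must have the same length")
    gold_counts = Counter(gold)
    # Count only the positions where the tagger disagrees with the gold tag.
    errors = Counter((s, g) for s, g in zip(system, gold) if s != g)
    total = len(gold)
    return [
        (sys_tag, gold_tag, n / gold_counts[gold_tag], n / total)
        for (sys_tag, gold_tag), n in errors.most_common(top_n)
    ]
```

A confusion that is rare overall but has a high ratio for its gold tag, like the VVINF row above, signals a distinction the tagger fails to learn rather than random noise.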

Interleaving Tagset Design and Annotation
PoS categories for Wolof verbs

Solution:
- one tag for overtly non-inflected verbs (VV)
- several fine-grained tags for token-internally inflected verbs (e.g. VN for negated verbs)

Ten most frequent errors made on tagset with 1 verb category

(incorr.) system tag   gold tag   error ratio wrt. gold tag   tokens affected
VV                     NC         3.94%                       0.42%
NC                     VV         1.95%                       0.38%
PREL                   PERS       3.07%                       0.34%
NP                     NC         3.23%                       0.34%
PREL                   AT         5.59%                       0.30%
AV                     NC         2.51%                       0.26%
NP                     VV         1.17%                       0.23%
AT                     AP         2.37%                       0.15%

Capturing Linguistically Interesting Categories
PoS categories for focus markers
- Standard tagset captures different focus types
- It should allow for corpus-based investigations of information structure
- Evaluate focus identification based on automatic tagging

Quality of automatic PoS-based focus identification on 100 sentences

Focus type       Precision   Recall   Abs. freq. in test set   Abs. freq. in corpus
Subject (ISuF)   95.65%      100%     39                       1119
Verb (IVF)       100%        90%      11                       759
Object (ICF)     68.75%      90.90%   11                       910
Sentence (ISF)   100%        87.5%    16                       635

3423 focus instances (predicted)
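
Precision and recall for a single focus tag can be computed from aligned gold and predicted tag sequences. The sketch below assumes a token-level comparison. The slide does not say whether the evaluation was done per token or per sentence, and the function name `precision_recall` is illustrative:

```python
from typing import List, Tuple


def precision_recall(gold: List[str], predicted: List[str], tag: str) -> Tuple[float, float]:
    """Token-level precision and recall for one tag (e.g. a focus marker tag)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # True positives: positions where both sequences carry the tag.
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == tag and p == tag)
    n_predicted = sum(1 for p in predicted if p == tag)
    n_gold = sum(1 for g in gold if g == tag)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall
```

Under this reading, the low precision for object focus (ICF) means the tagger assigns ICF to tokens that are something else in the gold standard, while its high recall means few true ICF tokens are missed.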

Creating Gold Standard Data
- Annotated data: ca. 27,000 tokens from the New Testament
- Annotation effort: 1 month for 1 person
- Automatic pre-annotation reduced the effort (by more than 50%)
- Implementation includes:
  - Tokeniser and sentence splitter (based on the GATE environment)
  - Heuristics for stemming and lemmatising

Automatic Pre-Annotation
- Generation of a full form lexicon based on
  - closed-class lexemes (1700 entries)
  - suffix-guessing for open-class lexemes (25000 entries)
- Pre-annotated each token with all options found in the full form lexicon

Suffix guessing on entire corpus
(4) ... gis -leen !
    ... look !
- “-leen” is an imperative suffix
- ... indicates a verbal category
- add “gis” as a verb to the lexicon

Pre-annotation
(5) man de ab kanaara la fi gis .
    “I can only see a turkey here.”
    ↓
(6) man      PERS | DWQ
    de       IJ
    ab       ARTI
    kanaara  NC
    la       PRO | ICF | ARTD
    fi       AV
    gis      VVBP
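
The two pre-annotation steps can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the closed-class entries are copied from example (6), the unsegmented surface form "gisleen" is inferred from the segmentation "gis -leen" in example (4), and pairing the suffix "-leen" with the tag VVBP is an assumption based on the tag that "gis" carries in (6). All function and variable names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Closed-class entries copied from example (6) on the slide.
CLOSED_CLASS: Dict[str, Set[str]] = {
    "man": {"PERS", "DWQ"},
    "de": {"IJ"},
    "ab": {"ARTI"},
    "la": {"PRO", "ICF", "ARTD"},
    "fi": {"AV"},
}

# Suffix -> tag for the stem. Only "-leen" is mentioned on the slide;
# assigning VVBP to its stem is an assumption.
SUFFIX_TAGS: Dict[str, str] = {"leen": "VVBP"}


def guess_open_class(corpus_tokens: List[str],
                     suffix_tags: Dict[str, str]) -> Dict[str, Set[str]]:
    """Collect stems whose suffix signals a category somewhere in the corpus."""
    guessed: Dict[str, Set[str]] = {}
    for token in corpus_tokens:
        for suffix, tag in suffix_tags.items():
            if token.endswith(suffix) and len(token) > len(suffix):
                stem = token[: -len(suffix)]
                guessed.setdefault(stem, set()).add(tag)
    return guessed


def pre_annotate(tokens: List[str],
                 lexicon: Dict[str, Set[str]]) -> List[Tuple[str, List[str]]]:
    """Attach every tag option found in the lexicon to each token."""
    return [(token, sorted(lexicon.get(token, set()))) for token in tokens]


# "gisleen" is the assumed unsegmented form of "gis -leen" from example (4).
corpus = ["gisleen", "!"]
lexicon = {**CLOSED_CLASS, **guess_open_class(corpus, SUFFIX_TAGS)}
annotated = pre_annotate("man de ab kanaara la fi gis .".split(), lexicon)
```

In this toy setup "kanaara" receives no options, because no suffix rule covers it. The annotator then only has to pick among the listed options for ambiguous tokens such as "la" and fill in the uncovered ones.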

Comparing State-of-the-art PoS Taggers
Can our gold standard data be used for training reliable automatic taggers?
1. TnT tagger: Brants [2000]
   - trigram Hidden Markov model
   - 96.7% accuracy on NEGRA
2. TreeTagger: Schmid [1994]
   - decision tree model
   - 96.06% on NEGRA
3. SVMTool: Giménez and Màrquez [2004]
   - support vector machine classifier (very rich, lexical feature model)
   - 97.1% on the Wall Street Journal
