

1. A Bambara Tonalization System for Word Sense Disambiguation Using Differential Coding, Segmentation and Edit Operation Filtering
Luigi (Y.-C.) Liu, Damien Nouvel
ER-TIM, INALCO, 2 rue de Lille, Paris, France
IJCNLP, 2017
Luigi (Y.-C.) Liu, Damien Nouvel (ERTIM). A Bambara Tonalization System for WSD. IJCNLP, 2017. 1 / 16

2. Outline
1 Introduction
2 Bambara Reference Corpus
3 Related Works
4 System Architecture
5 Methodology: Fundamental Definitions; Tonalization as Edit Operation; Segmentation; Edit Operation Filtering
6 Experiment Results
7 Conclusion
8 Perspectives

3. Introduction
Bambara: an African language with 4 tones (´ ` ˇ ˆ).
Orthography: the official orthography does not represent tones.
Word senses are more ambiguous: an unaccented token can correspond to several tonalized forms, a challenge for NLP applications.
Goal: implement an automatic tonalizer for Bambara, to improve subsequent NLP processing and facilitate linguistic analysis of Bambara.

4. Bambara Reference Corpus I
The Bambara Reference Corpus consists of two parts: a non-disambiguated subcorpus and a manually annotated subcorpus.

Table: Corpus statistics
Part          Words (dist.)
Non-disamb.   2160M (58M)
Disamb.       358M (23M)

Table: Annotation statistics
Tonalization  38.73%
Other          8.90%
None          52.35%

[Figure: Corpus composition by medium and source; categories include Written, Audiovisual, Internet, Manuscript, Magazines, Academic, Oral, Popular, Undetermined.]

5. Bambara Reference Corpus II
Annotation is done for each wordform, for 3 main features: POS tagging, tone marker restoration, gloss assignment.
POS tagging: conditional random fields (CRFs; Lafferty et al., 2001) for sequential modeling; over 23 possible morpho-syntactic tags; accuracy 94% (satisfying for an under-resourced language).
Tone marker restoration: we considered similar methods, but with over 20,870 distinct tonal forms, learning is quite inefficient due to the large label set.
Problem: the drawback of modeling sequences over a large label set (of tonal forms) is the expensive computational cost of estimating CRF parameters.

6. Related Works
Word-level modeling (Simard, 1998; Tufis and Chitu, 1990): French accent insertion with a 2-layer Hidden Markov Model (HMM); Romanian automatic diacritization with a 3-gram tagger.
Word- and character-level modeling (Elshafei et al., 2006; Scannell, 2011; Nguyen et al., 2012): Arabic diacritization with a 1-layer HMM; unicodification for African languages with a Naive Bayes classifier; Vietnamese accent restoration with CRFs and other methods.
Hybrid approaches (Said et al., 2013; Metwally et al., 2016): CRF + morphological analyzer; CRF + HMM + morphological analyzer.
Category decomposition (Tellier et al., 2010): decompose the label set into smaller pieces trained separately; result: time-wise efficiency improvement at the training phase.

7. System Architecture
[Figure: Block diagram of the proposed Bambara tonalization system at training stage: the Encoder (E) produces a differential code δ from (x, y); the Segmenter (S), Tone Marker Edit Operation Filter (F) and Dispatcher (D) transform it into per-segment codes δ′(i).]
[Figure: Block diagram of the system at tonalization stage: the Segmenter splits the input x into segments x(i); predicted codes δ′(i) are merged by edit operation Assemblers, and the Decoder outputs the tonalized token y.]

8. Fundamental definitions I
Discrete random variables:
X → non-tonalized token: kelen
Y → tonalized token: kèlen (adj. same), kèlén (intj. already)
Δ → differential code: (+1, 2, `) ; (+1, 2, `)(+1, 4, ´)
Mappings:
Δ = E(Y; X) → encoder function
Y = D(Δ; X) → decoder function
Y = D(E(Y; X); X)
Note: we predict the differential code Δ, then recover Y from Δ with the decoder D.
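For illustration, the encoder/decoder pair above can be sketched in Python for the insertion-only case, treating tone marks as combining characters inserted into the plain string (a minimal sketch under our own assumptions, not the authors' implementation; function names are ours):

```python
# Minimal sketch of differential coding, insertion-only case.
# A codeword (m, p, c) with m = +1 means "insert character c at
# position p of the non-tonalized token x".

def decode(delta, x):
    """D(delta; x): apply the codewords to x to recover y."""
    chars = list(x)
    # Apply insertions right-to-left so earlier positions stay valid.
    for m, p, c in sorted(delta, key=lambda cw: cw[1], reverse=True):
        assert m == +1, "this sketch handles insertions only"
        chars.insert(p, c)
    return "".join(chars)

def encode(y, x):
    """E(y; x): recover the codewords, assuming x is a subsequence of y
    (y is x with tone marks inserted)."""
    delta, i = [], 0
    for c in y:
        if i < len(x) and c == x[i]:
            i += 1                    # character kept from x
        else:
            delta.append((+1, i, c))  # c was inserted before x[i]
    assert i == len(x), "x must be a subsequence of y"
    return delta

# kelen -> kèlén, with combining grave (U+0300) and acute (U+0301)
code = encode("ke\u0300le\u0301n", "kelen")
assert code == [(+1, 2, "\u0300"), (+1, 4, "\u0301")]
assert decode(code, "kelen") == "ke\u0300le\u0301n"
```

This yields exactly the codewords of the slide's example, and the decoder inverts the encoder as in Y = D(E(Y; X); X).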

9. Tonalization as Edit Operation I
The code Δ can be either ∅ (when X = Y, 52.35% of tokens) or a concatenation of codewords σ1 σ2 σ3 ...
A codeword σ is a triplet (m, p, c) containing:
m: operation type (+1 for insertion, -1 for deletion)
p: position of the operation, a positive integer
c: character (if insertion), c ∈ Ω
Encoder E(y; x): apply the Wagner-Fischer algorithm (Wagner and Fischer, 1974) to (x, y) to produce the code δ.
Decoder D(δ; x): apply the edit operations in δ to x to obtain the tonalized token y.
Note: in this work, the Wagner-Fischer algorithm is applied in a special case with only 2 available edit operations, instead of the 3 (including substitution) of the general case.
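The restricted (insert/delete-only) Wagner-Fischer variant can be sketched as the standard dynamic program plus a backtrace that emits codewords (a hedged illustration, not the authors' code):

```python
def edit_script(x, y):
    """Insert/delete-only Wagner-Fischer: fill the edit-distance table,
    then backtrace to a list of codewords (m, p, c), positions in x."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                              # delete all of x[:i]
    for j in range(m + 1):
        d[0][j] = j                              # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]        # match, no cost
            else:
                d[i][j] = 1 + min(d[i - 1][j],   # delete x[i-1]
                                  d[i][j - 1])   # insert y[j-1]
    ops, i, j = [], n, m                         # backtrace from the corner
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and x[i - 1] == y[j - 1]
                and d[i][j] == d[i - 1][j - 1]):
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ops.append((+1, i, y[j - 1]))        # insert at position i of x
            j -= 1
        else:
            ops.append((-1, i - 1, x[i - 1]))    # delete char at position i-1
            i -= 1
    return list(reversed(ops))
```

With tone marks represented as combining characters, tonalization reduces to pure insertions, e.g. `edit_script("kelen", "ke\u0300len")` gives `[(+1, 2, "\u0300")]`.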

10. Segmentation
To facilitate learning, segmentation divides each data pair (x, δ) into several per-segment pairs (x(i), δ(i)), where i is the segment id. Learning on segments is easier because there are fewer edit operations to predict per segment, which simplifies our tonalization modeling.
Segmentation mode w:
w = -1 indicates syllabification (by a morphological parser)
w = 0 means no segmentation
w > 0 specifies a w-width regular segmentation: a segment is formed from every w successive characters of the input string, from left to right (the writing direction of Bambara); by exception, the last segment contains the remainder of the string, so it may be equal to or shorter than w characters.
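The w-width regular segmenter described above is straightforward (a sketch; the syllabification mode w = -1 would require the morphological parser and is omitted here):

```python
def regular_segments(x, w):
    """Regular segmentation: segments of w successive characters, left to
    right; the last segment keeps the remainder and may be shorter.
    w = 0 means no segmentation (the whole string is one segment)."""
    if w <= 0:
        return [x]
    return [x[i:i + w] for i in range(0, len(x), w)]

# The per-segment codes delta(i) are then obtained by splitting the
# codewords of delta by the segment their position falls into (not shown).
```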

11. Edit Operation Filtering
Annotators also introduce typographic and orthographic corrections; since we focus on tonalization operations, we filter the edit operations.
Tone marker filtering, for each position of the input string:
- remove all insertions except tone markers
- keep only the first tone insertion
- keep only the first tone deletion
Edit operation dispatcher F_m: from an input code δin, it extracts the sub-sequence composed of operations of type m, m = -1, +1.
If δin is a filtered result, the inverse mapping from {F-1(δin), F+1(δin)} back to δin exists; we call it the edit operation assembler.
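These three components can be sketched on codeword lists as follows (an illustration under our own assumptions; in particular the exact handling of non-tone deletions, and tie ordering in the assembler, are not specified on the slide):

```python
TONE_MARKS = {"\u0301", "\u0300", "\u030c", "\u0302"}  # combining acute, grave, caron, circumflex

def tone_filter(delta):
    """Tone marker filtering: drop insertions of non-tone characters;
    at each position keep only the first insertion and first deletion."""
    out, seen = [], set()
    for m, p, c in delta:
        if m == +1 and c not in TONE_MARKS:
            continue                 # non-tone insertion: removed
        if (m, p) in seen:
            continue                 # only the first op of this type here
        seen.add((m, p))
        out.append((m, p, c))
    return out

def dispatch(delta, m):
    """F_m: the sub-sequence of codewords of operation type m (+1 or -1)."""
    return [cw for cw in delta if cw[0] == m]

def assemble(insertions, deletions):
    """Assembler: inverse of the dispatcher on a filtered code; merges the
    two sub-sequences back into one position-ordered code (assumes at most
    one insertion and one deletion per position)."""
    return sorted(insertions + deletions, key=lambda cw: cw[1])
```

On a filtered code, dispatching then assembling is the identity, which is the invertibility property the slide states.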

12. Experiment Results I
About half (52.35%) of the tokens in the BRC do not need any tone markers.

Table: Accuracy for our system trained with four different system configurations and eight segmentation modes (p = 50%)
w                -1 (Syll.)    1        2        3        4        0
Majority vote    0.843
S ◦ E            0.923     0.915    0.922    0.922    0.917    0.893
  time           101.63    25.52    42.03    235.35   378.37   2683.72
D ◦ F ◦ S ◦ E    0.923     0.912    0.923    0.923    0.918    0.893
  time           19.88     17.62    13.17    15.67    19.62    261.83

13. Experiment Results II
[Figure: Accuracy (0.7 to 1.0) and training time (0 to 60 min.) of the system (configured as D ◦ F ◦ S ◦ E, using syllabification) with respect to different training sizes (20 to 80%), 90%-10% split.]

14. Experiment Results III

Table: Error distribution by type for insertion operations (p = 50%, system = D ◦ F ◦ S ◦ E)
Error Type          Ratio
Tone Only           58.52%
Position Only        1.17%
Tone and Position    0.023%
Silence             40.08%

Table: Confusion matrix on prediction of tone markers
             Predicted
Actual       ´         `         ˆ         ˇ
´            0.9541    0.0438    0.0021    0.0000
`            0.0841    0.9141    0.0015    0.0003
ˆ            0.0035    0.0322    0.9643    0.0000
ˇ            0.0000    0.0952    0.0000    0.9048

15. Conclusion
Differential encoder: reduces the entropy of the labels to be predicted, making CRF learning efficient; enables the tone marker filter and edit operation decomposition.
Segmentation: increases tonalization accuracy; greatly reduces training time.
Tone marker filter: normalizes the tonalized tokens; helps reduce training time.
Edit operation decomposition unit (dispatcher): splits the codes into insertions and deletions of tone markers; further accelerates training.

16. Perspectives
Take into account more linguistic information for Bambara.
Generalize to other languages such as French, Arabic, Yoruba, etc.
Available resources and tools:
Bambara Reference Corpus (French): http://cormand.huma-num.fr/index.html
Tonalizer, a CRF-based tone reconstitution tool (English): https://github.com/vieenrose/tonalizer
