Canadian AI 2020
Word Representations, Seed Lexicons, Mapping Procedures, and Reference Lists: What Matters in Bilingual Lexicon Induction from Comparable Corpora?
Martin Laville¹, Mérième Bouhandi¹, Emmanuel Morin¹, and Philippe Langlais²
¹LS2N, Université de Nantes, France
²RALI, Université de Montréal, Canada
What is Bilingual Lexicon Induction (BLI)?
● Finding translations of words between languages
● Useful for Machine Translation, Information Retrieval, …
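At its core, BLI with embeddings retrieves, for each source word, the nearest target-language vector. A minimal sketch of this retrieval step, using hypothetical toy 2-dimensional vectors (not the paper's actual embeddings):

```python
import numpy as np

# Hypothetical toy embeddings: one vector per word in each language.
en_vecs = {"cancer": np.array([0.9, 0.1]), "cell": np.array([0.2, 0.8])}
fr_vecs = {"cancer": np.array([0.88, 0.12]), "cellule": np.array([0.25, 0.8])}

def translate(word, src, tgt):
    """Return the target word whose vector is closest (cosine) to the source word's."""
    v = src[word]
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(tgt, key=lambda w: cos(v, tgt[w]))

print(translate("cell", en_vecs, fr_vecs))  # -> cellule
```

Real systems do the same thing over full vocabularies, often with retrieval criteria such as CSLS instead of plain cosine.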
English/French Data
● Corpora
○ General (Wikipedia): 200M words
○ Specialized (Breast Cancer): 1M words
● Reference Lists
○ 1,446 pairs of terms from MUSE (Conneau et al., 2017) for the general corpus
○ 248 pairs of terms from UMLS for the breast cancer corpus
● Seed lexicons
○ MUSE: 10,872 pairs (Conneau et al., 2017)
○ ELRA-M0033: 243,539 pairs
Methods
● BoW: Bag of Words (Rapp, 1999; Fung, 1998)
○ Count of co-occurrences of each pair of words
○ Translation using a seed lexicon
● Embeddings
○ fastText (Bojanowski et al., 2016): uncontextualised, one vector per word
○ ELMo (Peters et al., 2018): contextualised, one vector per token
■ Anchor embeddings (Schuster et al., 2019): mean of all the vectors of the same word
● Mapping (Mikolov et al., 2013; Artetxe et al., 2018)
○ Supervised: needs a seed lexicon
○ Unsupervised: no seed lexicon
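The supervised mapping step learns a linear transform aligning source vectors to target vectors over the seed-lexicon pairs. A common closed-form variant (in the Artetxe et al. line of work) is orthogonal Procrustes via SVD; a minimal sketch on synthetic data (the dimensions and seed-pair count here are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 50 seed pairs of 10-dim source/target vectors,
# where the target space is an orthogonal rotation of the source space.
d = 10
Q_true, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = rng.normal(size=(50, d))   # source-side vectors of the seed pairs
Y = X @ Q_true                 # target-side vectors of the seed pairs

# Orthogonal Procrustes: W = argmin ||XW - Y||_F s.t. W orthogonal,
# solved in closed form from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

print(np.allclose(X @ W, Y))  # -> True
```

Once W is learned from the seed lexicon, every source word vector is mapped with `X @ W` before nearest-neighbour retrieval in the target space.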
Evaluation
Our general list is used in a lot of work, but what is it really? Some pairs:
○ Enjoy / Enjoy
○ Madagascar / Madagascar
○ Hugo / Hugo
We create 3 sublists by:
● Removing pairs with words not in a monolingual dictionary
● Removing pairs that are too graphically close (Levenshtein distance)
● Removing too frequent pairs
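The "graphically close" filter can be sketched with a plain dynamic-programming edit distance; the threshold of 2 below is a hypothetical value for illustration, not the one used in the paper:

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

pairs = [("enjoy", "enjoy"), ("madagascar", "madagascar"), ("liver", "foie")]
# Keep only pairs whose edit distance exceeds a (hypothetical) threshold.
filtered = [(s, t) for s, t in pairs if levenshtein(s, t) > 2]
print(filtered)  # -> [('liver', 'foie')]
```

Identical pairs such as Enjoy / Enjoy have distance 0 and are the first to be dropped by such a filter.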
Experiments
● Results are presented with Precision@1
● We vary:
○ 3 approaches: BoW, Contextualised Embeddings and Uncontextualised Embeddings
○ 2 corpora: Specialized or General
○ 2 seed lexicons: a small one (MUSE) and a bigger and better one (ELRA)
○ 4 reference lists: Original, in-dictionary, edit distance, frequency
● We seek to identify which parameters matter
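Precision@1 simply asks, for each source term in the reference list, whether the system's top-ranked candidate equals the reference translation. A minimal sketch with hypothetical toy data:

```python
# Hypothetical gold reference list and ranked candidate lists.
gold = {"cancer": "cancer", "breast": "sein", "cell": "cellule"}
predictions = {
    "cancer": ["cancer", "tumeur"],
    "breast": ["poitrine", "sein"],   # correct answer ranked 2nd: no P@1 credit
    "cell": ["cellule", "noyau"],
}

# Count source terms whose top-1 candidate matches the reference.
hits = sum(1 for src, ref in gold.items() if predictions[src][0] == ref)
p_at_1 = hits / len(gold)
print(p_at_1)  # -> 0.6666666666666666
```

Precision@k generalises this by crediting a hit whenever the reference appears among the top k candidates.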
Results
● Supervised is still the way to go
● fastText is better
● A bigger/better seed lexicon degrades results
● Specialized domain: for BoW, having a bigger seed lexicon is better
● fastText gets worse while ELMo gets better
● Huge loss for all methods
● BoW is really bad
Analysis (general domain)
● fastText finds graphically close words
● ELMo seems to capture the meaning
● BoW is really affected by occurrences
Analysis (specialized domain)
● Seems easier, as words are less likely to be found in varying contexts
● Still OK for lower frequencies
● fastText still finds graphically close words when frequency is low
Conclusion
● Reference lists need to be questioned and not used as is
● We hope this work will help people better consider this aspect