

  1. Canadian AI 2020
     Word Representations, Seed Lexicons, Mapping Procedures, and Reference Lists: What Matters in Bilingual Lexicon Induction from Comparable Corpora?
     Martin Laville¹, Mérième Bouhandi¹, Emmanuel Morin¹, and Philippe Langlais²
     ¹LS2N, Université de Nantes, France
     ²RALI, Université de Montréal, Canada

  2. What is Bilingual Lexicon Induction (BLI)?
     ● Finding translations of words between languages
     ● Useful for Machine Translation, Information Retrieval…

  3. English/French Data
     ● Corpora
       ○ General (Wikipedia): 200M words
       ○ Specialized (Breast Cancer): 1M words

  4. English/French Data
     ● Corpora
       ○ General (Wikipedia): 200M words
       ○ Specialized (Breast Cancer): 1M words
     ● Reference Lists
       ○ 1,446 pairs of terms from MUSE (Conneau, 2017) for the general corpus
       ○ 248 pairs of terms from UMLS for the breast cancer corpus

  5. English/French Data
     ● Corpora
       ○ General (Wikipedia): 200M words
       ○ Specialized (Breast Cancer): 1M words
     ● Reference Lists
       ○ 1,446 pairs of terms from MUSE (Conneau, 2017) for the general corpus
       ○ 248 pairs of terms from UMLS for the breast cancer corpus
     ● Seed lexicons
       ○ MUSE: 10,872 pairs (Conneau, 2017)
       ○ ELRA-M0033: 243,539 pairs

  6. Methods
     ● BoW: Bag of Words (Rapp 1999, Fung 1998)
       ○ Counts of co-occurrences
       ○ Translation using a seed lexicon
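To make the BoW pipeline concrete, here is a minimal sketch in Python. The window size, the dictionary format, and the function names are our own illustration, not the paper's exact setup:

    import math
    from collections import Counter, defaultdict

    def context_vectors(tokens, window=3):
        """Build one co-occurrence Counter per word over a +/- `window` context."""
        vectors = defaultdict(Counter)
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vectors[word][tokens[j]] += 1
        return vectors

    def translate_vector(vector, seed_lexicon):
        """Project a source context vector into the target language,
        keeping only the dimensions covered by the seed lexicon."""
        translated = Counter()
        for src_word, count in vector.items():
            if src_word in seed_lexicon:
                translated[seed_lexicon[src_word]] += count
        return translated

    def cosine(u, v):
        """Cosine similarity between two sparse count vectors."""
        num = sum(u[w] * v[w] for w in set(u) & set(v))
        den = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
        return num / den if den else 0.0

Translation candidates for a source word are then the target words whose own context vectors are most cosine-similar to its translated vector.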

  7. Methods
     ● BoW: Bag of Words (Rapp 1999, Fung 1998)
       ○ Counts of co-occurrences of each pair of words
       ○ Translation using a seed lexicon
     ● 2 embedding approaches
       ○ fastText (Bojanowski 2016): uncontextualised, one vector per word
       ○ ELMo (Peters 2018): contextualised, one vector per token
         ■ Anchor embeddings (Schuster 2019): mean of all the vectors of the same word
     ● Mapping (Mikolov 2013, Artetxe 2018)
       ■ Supervised: needs a seed lexicon
       ■ Unsupervised: no seed lexicon
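The anchor trick and the supervised mapping step can be sketched as follows: averaging a word's contextual vectors gives its anchor (Schuster 2019), and an orthogonal mapping between the two embedding spaces is fit on the seed lexicon, Procrustes-style, in the spirit of Mikolov (2013) and Artetxe (2018). Variable names and retrieval details are our own assumptions, not the paper's exact implementation:

    import numpy as np

    def anchor_embedding(token_vectors):
        """Anchor embedding: mean of all contextual vectors of one word."""
        return np.mean(token_vectors, axis=0)

    def fit_orthogonal_mapping(X, Y):
        """Solve min ||XW - Y||_F with W orthogonal (Procrustes problem).
        Rows i of X and Y are the source/target embeddings of the i-th seed pair."""
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt

    def nearest_target(query, W, target_matrix, target_words):
        """Map a source vector into the target space and return the word
        whose embedding is closest by cosine similarity."""
        mapped = query @ W
        sims = target_matrix @ mapped
        sims /= np.linalg.norm(target_matrix, axis=1) * np.linalg.norm(mapped) + 1e-9
        return target_words[int(np.argmax(sims))]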

  8. Evaluation
     ● Our general list is used in a lot of work, but what does it really contain? Some pairs:
       ○ Enjoy / Enjoy
       ○ Madagascar / Madagascar
       ○ Hugo / Hugo
     ● We create 3 sublists by:
       ○ Removing pairs with words not in a monolingual dictionary
       ○ Removing pairs that are too graphically close (Levenshtein distance)
       ○ Removing too-frequent pairs
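As one example of how such a sublist can be built, here is a minimal sketch of the edit-distance filter; the length normalisation and the 0.5 cutoff are our own assumptions, since the slide does not give the exact threshold:

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance, kept to two rows."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def drop_graphically_close(pairs, max_ratio=0.5):
        """Keep only pairs whose normalised edit distance exceeds the cutoff,
        so identical pairs like ('Enjoy', 'Enjoy') are filtered out."""
        return [(s, t) for s, t in pairs
                if levenshtein(s, t) / max(len(s), len(t)) > max_ratio]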

  9. Experiments
     ● Results are reported as Precision@1

  10. Experiments
     ● Results are reported as Precision@1 (a sketch of the computation follows this slide)
     ● We vary:
       ○ 3 approaches: BoW, contextualised embeddings, and uncontextualised embeddings
       ○ 2 corpora: specialized or general
       ○ 2 seed lexicons: a small one (MUSE) and a bigger, better one (ELRA)
       ○ 4 reference lists: original, in-dictionary, edit distance, frequency
     ● We investigate which parameters matter
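Precision@1 simply checks whether the single best candidate matches the gold translation. A minimal sketch, where the hypothetical `predict` callable stands in for any of the three approaches above:

    def precision_at_1(predict, reference_pairs):
        """Fraction of reference source words whose top-1 candidate is the gold target.
        `predict` maps a source word to its single best translation."""
        hits = sum(predict(src) == gold for src, gold in reference_pairs)
        return hits / len(reference_pairs)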

  11. Results
     ● Supervised is still the way to go
     ● fastText is better
     ● A bigger/better seed lexicon degrades results

  12. Results
     ● Supervised is still the way to go
     ● fastText is better
     ● A bigger/better seed lexicon degrades results
     ● Specialized domain, BoW: having a bigger seed lexicon is better

  13. Results
     ● Supervised is still the way to go
     ● fastText is better
     ● A bigger/better dictionary degrades results
     ● Specialized domain, BoW: having a bigger dictionary is better
     ● fastText gets worse while ELMo gets better

  14. Results
     ● Supervised is still the way to go
     ● fastText is better
     ● A bigger/better dictionary degrades results
     ● Specialized domain, BoW: having a bigger dictionary is better
     ● fastText gets worse while ELMo gets better
     ● Huge loss for all methods

  15. Results
     ● Supervised is still the way to go
     ● fastText is better
     ● A bigger/better dictionary degrades results
     ● Specialized domain, BoW: having a bigger dictionary is better
     ● fastText gets worse while ELMo gets better
     ● Huge loss for all methods
     ● BoW is really bad

  16. Analysis (general domain)
     ● fastText finds graphically close words

  17. Analysis (general domain)
     ● fastText finds graphically close words
     ● ELMo seems to capture the meaning

  18. Analysis (general domain)
     ● fastText finds graphically close words
     ● ELMo seems to capture the meaning
     ● BoW is heavily affected by occurrence counts

  19. Analysis (specialized domain)
     ● Seems easier, as words are less likely to be found in varying contexts

  20. Analysis (specialized domain)
     ● Seems easier, as words are less likely to be found in varying contexts
     ● Still OK at lower frequencies

  21. Analysis (specialized domain)
     ● Seems easier, as words are less likely to be found in varying contexts
     ● Still OK at lower frequencies
     ● fastText still finds graphically close words at low frequencies

  22. Conclusion
     ● Reference lists need to be questioned, not used as-is
     ● We hope this work will help people better consider this aspect
