Canadian AI 2020
Word Representations, Seed Lexicons, Mapping Procedures, and Reference Lists: What Matters in Bilingual Lexicon Induction from Comparable Corpora?
Martin Laville¹, Mérième Bouhandi¹, Emmanuel Morin¹, and Philippe Langlais²
¹LS2N, Université de Nantes, France
²RALI, Université de Montréal, Canada
What is Bilingual Lexicon Induction (BLI)?
● Finding translations of words between languages
● Useful for Machine Translation, Information Retrieval, …
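At its core, BLI with embeddings retrieves, for each source word, the nearest target-language vector. A minimal sketch of this retrieval step, using hypothetical toy 2-dimensional vectors (not the paper's actual embeddings):

```python
import numpy as np

# Hypothetical toy embeddings: one vector per word in each language.
en_vecs = {"cancer": np.array([0.9, 0.1]), "cell": np.array([0.2, 0.8])}
fr_vecs = {"cancer": np.array([0.88, 0.12]), "cellule": np.array([0.25, 0.8])}

def translate(word, src, tgt):
    """Return the target word whose vector is closest (cosine) to the source word's."""
    v = src[word]
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(tgt, key=lambda w: cos(v, tgt[w]))

print(translate("cell", en_vecs, fr_vecs))  # -> cellule
```

Real systems do the same thing over full vocabularies, often with retrieval criteria such as CSLS instead of plain cosine.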
English/French Data
● Corpora
○ General (Wikipedia): 200M words
○ Specialized (Breast Cancer): 1M words
● Reference Lists
○ 1,446 pairs of terms from MUSE (Conneau et al., 2017) for the general corpus
○ 248 pairs of terms from UMLS for the breast cancer corpus
● Seed lexicons
○ MUSE: 10,872 pairs (Conneau et al., 2017)
○ ELRA-M0033: 243,539 pairs
Methods
● BoW: Bag of Words (Rapp, 1999; Fung, 1998)
○ Count of co-occurrences of each pair of words
○ Translation using a seed lexicon
● Embeddings
○ fastText (Bojanowski et al., 2016): uncontextualised, one vector per word
○ ELMo (Peters et al., 2018): contextualised, one vector per token
■ Anchor embeddings (Schuster et al., 2019): mean of all the vectors of the same word
● Mapping (Mikolov et al., 2013; Artetxe et al., 2018)
○ Supervised: needs a seed lexicon
○ Unsupervised: no seed lexicon
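The supervised mapping step learns a linear transform aligning source vectors to target vectors over the seed-lexicon pairs. A common closed-form variant (in the Artetxe et al. line of work) is orthogonal Procrustes via SVD; a minimal sketch on synthetic data (the dimensions and seed-pair count here are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 50 seed pairs of 10-dim source/target vectors,
# where the target space is an orthogonal rotation of the source space.
d = 10
Q_true, _ = np.linalg.qr(rng.normal(size=(d, d)))
X = rng.normal(size=(50, d))   # source-side vectors of the seed pairs
Y = X @ Q_true                 # target-side vectors of the seed pairs

# Orthogonal Procrustes: W = argmin ||XW - Y||_F s.t. W orthogonal,
# solved in closed form from the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

print(np.allclose(X @ W, Y))  # -> True
```

Once W is learned from the seed lexicon, every source word vector is mapped with `X @ W` before nearest-neighbour retrieval in the target space.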
Evaluation
Our general list is used in a lot of work, but what is it really? Some pairs:
○ Enjoy / Enjoy
○ Madagascar / Madagascar
○ Hugo / Hugo
We create 3 sublists by:
● Removing pairs with words not in a monolingual dictionary
● Removing pairs that are too graphically close (Levenshtein distance)
● Removing too frequent pairs
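The "graphically close" filter can be sketched with a plain dynamic-programming edit distance; the threshold of 2 below is a hypothetical value for illustration, not the one used in the paper:

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

pairs = [("enjoy", "enjoy"), ("madagascar", "madagascar"), ("liver", "foie")]
# Keep only pairs whose edit distance exceeds a (hypothetical) threshold.
filtered = [(s, t) for s, t in pairs if levenshtein(s, t) > 2]
print(filtered)  # -> [('liver', 'foie')]
```

Identical pairs such as Enjoy / Enjoy have distance 0 and are the first to be dropped by such a filter.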
Experiments
● Results are presented with Precision@1
● We vary:
○ 3 approaches: BoW, Contextualised Embeddings and Uncontextualised Embeddings
○ 2 corpora: Specialized or General
○ 2 seed lexicons: a small one (MUSE) and a bigger and better one (ELRA)
○ 4 reference lists: Original, in-dictionary, edit distance, frequency
● We seek to identify which parameters matter
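Precision@1 simply asks, for each source term in the reference list, whether the system's top-ranked candidate equals the reference translation. A minimal sketch with hypothetical toy data:

```python
# Hypothetical gold reference list and ranked candidate lists.
gold = {"cancer": "cancer", "breast": "sein", "cell": "cellule"}
predictions = {
    "cancer": ["cancer", "tumeur"],
    "breast": ["poitrine", "sein"],   # correct answer ranked 2nd: no P@1 credit
    "cell": ["cellule", "noyau"],
}

# Count source terms whose top-1 candidate matches the reference.
hits = sum(1 for src, ref in gold.items() if predictions[src][0] == ref)
p_at_1 = hits / len(gold)
print(p_at_1)  # -> 0.6666666666666666
```

Precision@k generalises this by crediting a hit whenever the reference appears among the top k candidates.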
Results
● Supervised is still the way to go
● fastText is better
● A bigger/better seed lexicon degrades results
● Specialized domain: for BoW, having a bigger seed lexicon is better
● fastText gets worse while ELMo gets better
● Huge loss for all methods
● BoW is really bad
Analysis (general domain)
● fastText finds graphically close words
● ELMo seems to capture the meaning
● BoW is really affected by occurrences
Analysis (specialized domain)
● Seems easier, as words are less likely to be found in varying contexts
● Still OK for lower frequencies
● fastText still finds graphically close words when frequency is low
Conclusion
● Reference lists need to be questioned and not used as is
● We hope this work will help people better consider this aspect