Visual Grounding in Video for Unsupervised Word Translation
G. Sigurdsson, J-B. Alayrac, A. Nematzadeh, L. Smaira, M. Malinowski, J. Carreira, P. Blunsom, A. Zisserman
6/15/20, Video Pentathlon: The End-of-End-to-End: A Video Understanding Pentathlon Workshop
How can we learn a link between different languages from unpaired narrated videos?
Our goal: relate different languages through the visual domain. Example narrations: "Je casse les oeufs."* and "I need to mix the eggs with the flour." (* French for "I break the eggs.")
Our setup: unsupervised word translation. We have different videos in each language (no paired data), e.g. one video narrated with "Je casse les oeufs." and a different video narrated with "I need to mix the eggs with the flour."
Dataset: the HowToWorld dataset. We extend the HowTo100M [a] dataset to other languages (we follow the same collection procedure but obtain different videos, narrated in their original language). [a] HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, Miech, Zhukov, Alayrac, Tapaswi, Laptev and Sivic, ICCV 2019.
Base Model: learn a joint space between languages and video. Video clips and their narrations (e.g. "mix eggs with flour") are embedded and trained with a contrastive loss [b]. [b] MIL-NCE: End-to-End Learning of Visual Representations from Uncurated Instructional Videos, Miech, Alayrac, Smaira, Laptev, Sivic and Zisserman, CVPR 2020.
Base Model: learn a joint space between languages and video. French narrations (e.g. "je casse les oeufs") are embedded with the same contrastive loss, yielding a bilingual-visual joint space.
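For intuition, here is a minimal sketch of such a video-text contrastive objective. This is an assumption on our part: it is a simplified symmetric InfoNCE, not the exact MIL-NCE objective of [b], and the tensor names and temperature are hypothetical.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Simplified NCE-style loss (not the authors' code): pull each video clip
    towards its own narration and push it away from the other narrations in the
    batch. video_emb, text_emb: (batch, dim) embeddings already projected into
    the joint space."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # positives on the diagonal
    # Symmetric cross-entropy: video->text and text->video retrieval
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```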
Base Model: learn a joint space between languages and video. Next, we evaluate the quality of the joint bilingual space with English-to-French word retrieval: given a word in English, we score all French words using dot products in the joint space and report the percentage of times a correct translation is retrieved in the top 1 (R@1).
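The retrieval metric itself is simple; a sketch, assuming `en_emb` and `fr_emb` hold the word vectors already mapped into the joint space and `gt[i]` is the set of correct French indices for English word i (hypothetical names):

```python
import numpy as np

def recall_at_1(en_emb, fr_emb, gt):
    """en_emb: (n_en, dim), fr_emb: (n_fr, dim) L2-normalised word vectors in the
    joint space; gt: list where gt[i] is the set of correct French indices for
    English word i. Returns the fraction of English words whose top-scoring
    French word (by dot product) is a correct translation."""
    scores = en_emb @ fr_emb.T                  # (n_en, n_fr) dot products
    best = scores.argmax(axis=1)                # top-1 French word per English word
    hits = [best[i] in gt[i] for i in range(len(gt))]
    return float(np.mean(hits))
```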
Quantitative results for the Base Model (English-French, reporting recall@1)
Dictionary: 1,000 words in English and French from (Conneau et al., 2017)
Simple words: top 1,000 words from Wikipedia
Visual: restrict to "visual" words (abstract concepts removed)

                  Dictionary (Conneau et al., 2017)    Simple Words (top 1,000 Wikipedia words)
                  All      Visual                      All      Visual
Random Chance     0.1      0.2                         0.1      0.2
Base Model        9.1      15.2                        28.0     45.3
Do we need videos at all? It has been shown that word embeddings in different languages can be aligned via a simple transformation (a rotation), and only a few correspondences are required to estimate it. Unsupervised approaches exist that do not require any paired data to learn this alignment: MUSE (Conneau et al., ICLR 2018), VecMap (Artetxe et al., ACL 2018). However, these methods have robustness issues (e.g. sensitivity to language similarity and training corpora statistics); can vision help there?
MUSE: 1) Find an initial linear mapping via an adversarial approach. 2) From this initialization, find the most aligned word pairs and use them as anchors to refine the mapping with the Procrustes algorithm. 3) Normalize the distances using the local neighborhood (CSLS).
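Step 2 has a closed-form solution: given matched anchor pairs stacked as rows of X (source) and Y (target), the best orthogonal map is obtained from an SVD. A standard orthogonal-Procrustes sketch (not MUSE's actual code):

```python
import numpy as np

def procrustes(X, Y):
    """Solve min_W ||X W - Y||_F over orthogonal W (orthogonal Procrustes).
    X, Y: (n_pairs, dim) embeddings of the anchor word pairs."""
    u, _, vt = np.linalg.svd(X.T @ Y)
    return u @ vt   # orthogonal map sending source embeddings towards targets
```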
MUVE: Multilingual Unsupervised Visual Embeddings
MUSE: 1) Find an initial linear mapping via an adversarial approach.
MUVE: 1) Use the AdaptLayer (the base model's linear mapping into the joint visual space) as the initial linear mapping.
Steps 2) and 3) are unchanged: find the most aligned word pairs under the initial mapping and use them as anchors to refine it with the Procrustes algorithm, then normalize the distances using the local neighborhood.
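A hedged sketch of this refinement loop, assuming `adapt_layer_weight` holds the AdaptLayer's (dim, dim) weight matrix and reusing the `procrustes` helper sketched above; the function name and loop are assumptions, and real implementations additionally mine mutual nearest neighbours and use CSLS-normalised similarities:

```python
import numpy as np

def muve_mapping(adapt_layer_weight, src_emb, tgt_emb, n_refine=5):
    """adapt_layer_weight: (dim, dim) weights of the base model's AdaptLayer,
    used as the initial linear mapping (MUVE step 1). Steps 2-3 then mirror
    MUSE: mine word pairs under the current mapping and refine it with
    Procrustes. Simplified sketch only."""
    W = adapt_layer_weight
    for _ in range(n_refine):
        sims = (src_emb @ W) @ tgt_emb.T        # score all source-target word pairs
        nn = sims.argmax(axis=1)                # nearest target word per source word
        W = procrustes(src_emb, tgt_emb[nn])    # closed-form refinement (see sketch above)
    return W
```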
MUVE vs. Base Model (English-French, reporting R@1)

                  Dictionary (Conneau et al., 2017)    Simple Words (top 1,000 words)
                  All      Visual                      All      Visual
Random Chance     0.1      0.2                         0.1      0.2
Base Model        9.1      15.2                        28.0     45.3
MUVE              28.9     39.5                        58.3     67.5
Performance of models across language pairs (reporting R@1). The gap in performance is larger for more distant languages.

                                 En-Fr   En-Ko   En-Ja
MUSE (Conneau et al., 2017)      26.3    11.8    11.6
VecMap (Artetxe et al., 2018)    28.4    13.0    13.7
MUVE (ours)                      28.9    17.7    15.1
Supervised                       57.9    41.8    41.1
Robustness to dissimilarity of the text corpora used for embedding pretraining (French embeddings trained on HowTo-Fr, reporting R@10)

             MUSE (Conneau et al., 2017)   VecMap (Artetxe et al., 2018)   MUVE (ours)
HowTo-En     45.8                          45.4                            47.3
WMT-En       0.3                           0.2                             26.4
Wiki-En      0.3                           0.1                             32.6
Conclusion and Future work
Takeaways
Conclusion: Unsupervised word translation through visual grounding is possible, based only on unpaired and uncurated narrated videos. The information contained in vision is complementary to that contained in the structure of the languages, which enables a better and more robust approach (MUVE) to unsupervised word translation.
Future work: From words to sentences. Using multilingual datasets to learn better visual representations (more data sources, less bias, ...).
Links: Paper, Blog post
Q&A Time: CVPR 2020, Thursday, June 18, 2020, 9-11 AM and 9-11 PM PT