On the Limitations of Unsupervised Bilingual Dictionary Induction
Anders Søgaard, Sebastian Ruder, Ivan Vulić
Background: Unsupervised MT
‣ Recently: Unsupervised neural machine translation (Artetxe et al., ICLR 2018; Lample et al., ICLR 2018)
‣ Key component: Initialization via unsupervised cross-lingual alignment of word embedding spaces
Background: Cross-lingual word embeddings
‣ Cross-lingual word embeddings enable cross-lingual transfer
‣ Most common approach: Project one word embedding space into another by learning a transformation matrix W between source embeddings x_i and their translations y_i:
  min_W ∑_{i=1}^{n} ‖Wx_i − y_i‖² (Mikolov et al., 2013)
‣ More recently: Use an adversarial setup to learn an unsupervised mapping
‣ Assumption: Word embedding spaces are approximately isomorphic, i.e. they have the same number of vertices, connected the same way.
How similar are embeddings across languages?
‣ Nearest neighbour (NN) graphs of the top 10 most frequent words in English and German are not isomorphic.
‣ NN graphs of the top 10 most frequent English words and their translations into German [figure: English vs. German NN graphs]
‣ Not isomorphic
How similar are embeddings across languages?
‣ NN graphs of the top 10 most frequent English nouns and their translations [figure: English vs. German NN graphs]
‣ Not isomorphic
Word embeddings are not approximately isomorphic across languages.
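The NN graphs above can be built directly from the embedding matrices: connect each word to its nearest neighbour under cosine similarity and symmetrise. A minimal sketch with hypothetical random vectors standing in for the real top-10 frequency lists:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "embeddings" of the 10 most frequent words in two languages
# (hypothetical random vectors in place of real trained embeddings).
E1 = rng.standard_normal((10, 300))
E2 = rng.standard_normal((10, 300))

def nn_graph(E):
    """Adjacency matrix of the undirected 1-NN graph under cosine similarity."""
    X = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)        # a word is not its own neighbour
    nn = sim.argmax(axis=1)               # nearest neighbour of each word
    A = np.zeros((len(E), len(E)))
    A[np.arange(len(E)), nn] = 1
    return np.maximum(A, A.T)             # symmetrise to an undirected graph

A1, A2 = nn_graph(E1), nn_graph(E2)
# Differing sorted degree sequences already rule out isomorphism;
# matching ones are necessary but not sufficient.
print(sorted(A1.sum(axis=1)), sorted(A2.sum(axis=1)))
```

The next slides introduce a graded spectral measure over exactly such adjacency matrices, since a binary isomorphic/not-isomorphic verdict is too coarse.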
How do we quantify similarity?
‣ Need a metric to measure how similar two NN graphs G₁ and G₂ of different languages are
‣ Propose eigenvector similarity
‣ A₁, A₂: adjacency matrices of G₁, G₂
‣ D₁, D₂: degree matrices of G₁, G₂
‣ L₁ = D₁ − A₁, L₂ = D₂ − A₂: Laplacians of G₁, G₂
‣ λ₁, λ₂: eigenvalues (spectra) of L₁, L₂
‣ Metric: Δ = ∑_{i=1}^{k} (λ_{1i} − λ_{2i})², where for each graph j, k_j is the smallest k such that ∑_{i=1}^{k} λ_{ji} / ∑_{i=1}^{n} λ_{ji} > 0.9 (the largest eigenvalues accounting for >90% of the spectrum), and k = min(k₁, k₂)
How do we quantify similarity?
‣ Quantifies how close two NN graphs are to being isospectral, i.e. having the same spectrum (the same multiset of eigenvalues).
‣ Isomorphic → isospectral, but isospectral ↛ isomorphic
‣ Δ : G₁, G₂ → [0, ∞)
‣ Δ = 0: G₁, G₂ are isospectral (very similar)
‣ Δ → ∞: G₁, G₂ become less similar
‣ Metric: Δ = ∑_{i=1}^{k} (λ_{1i} − λ_{2i})², where for each graph j, k_j is the smallest k such that ∑_{i=1}^{k} λ_{ji} / ∑_{i=1}^{n} λ_{ji} > 0.9 (the largest eigenvalues accounting for >90% of the spectrum), and k = min(k₁, k₂)
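The metric above translates almost directly into code. A minimal sketch following the definitions on these slides (Laplacian from the adjacency matrix, truncation k from the >90% spectral-mass rule); the two toy adjacency matrices in the usage example are hypothetical:

```python
import numpy as np

def eigenvector_similarity(A1, A2):
    """Delta between two NN graphs given symmetric 0/1 adjacency matrices."""
    spectra = []
    for A in (A1, A2):
        L = np.diag(A.sum(axis=1)) - A               # Laplacian L = D - A
        lam = np.sort(np.linalg.eigvalsh(L))[::-1]   # eigenvalues, descending
        spectra.append(lam)
    # For each graph j: smallest k_j whose top eigenvalues cover >90% of the
    # total spectral mass; truncate both spectra at k = min(k_1, k_2).
    ks = []
    for lam in spectra:
        cum = np.cumsum(lam) / lam.sum()
        ks.append(int(np.searchsorted(cum, 0.9, side="right")) + 1)
    k = min(ks)
    lam1, lam2 = spectra
    return float(np.sum((lam1[:k] - lam2[:k]) ** 2))

# Identical graphs are isospectral, so Delta = 0 (toy 3-node path graph).
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
print(eigenvector_similarity(A, A))  # → 0.0
```

A path graph versus a triangle, say, yields Δ > 0, and larger values as the spectra diverge further, matching the interpretation on this slide.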
Unsupervised cross-lingual learning assumptions
‣ Besides isomorphism, several other implicit assumptions
‣ May or may not scale to low-resource languages

            Conneau et al. (2018)                      This work
Languages   Dependent-marking, fusional and isolating  Agglutinative, many cases
Corpora     Comparable (Wikipedia)                     Different domains