State-of-the-art in supervised mappings
Two sequences of (optional) linear transformations:
S0 (opt.) Pre-processing: length normalization, mean centering
S1 (opt.) Whitening: turn covariance matrices into the identity matrix
S2 Orthogonal mapping: map into a shared space (Procrustes)
S3 (opt.) Re-weight each component according to its cross-correlation
S4 (opt.) De-whitening: restore original variance in every direction
S5 (opt.) Dimensionality reduction: keep the first n components only
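A minimal sketch of steps S0 to S2, assuming NumPy and row-wise embedding matrices whose rows are already aligned by a dictionary. The function names and the toy data are illustrative assumptions, not taken from the talk; re-weighting, de-whitening and dimensionality reduction (S3 to S5) are left out.

```python
import numpy as np

def length_normalize(m):
    # S0 (l): scale every word vector to unit Euclidean length
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return m / norms

def mean_center(m):
    # S0 (m): subtract the mean vector so each dimension has zero mean
    return m - m.mean(axis=0, keepdims=True)

def whitening_matrix(m):
    # S1: return (m^T m)^(-1/2), so that (m @ result)^T (m @ result) = I.
    # For mean-centered m this is the covariance matrix up to a constant factor.
    vals, vecs = np.linalg.eigh(m.T @ m)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def procrustes(x, z):
    # S2: orthogonal W minimizing ||x @ W - z||_F, obtained from the SVD of x^T z
    u, _, vt = np.linalg.svd(x.T @ z)
    return u @ vt

# Toy check: the "source" space is an exact rotation of the "target" space.
rng = np.random.default_rng(0)
z = mean_center(length_normalize(rng.normal(size=(200, 5))))
q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
x = z @ q.T                                   # hence x @ q equals z
w = procrustes(x, z)                          # recovers q on this noise-free data
```

On real embeddings the two spaces are only approximately related by a rotation, so `x @ w` would approximate `z` rather than reproduce it.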
State-of-the-art in supervised mappings
Steps used by each method (columns: S0 (l), S0 (m), S1, S2, S3, S4 (src), S4 (trg), S5; marks listed in column order)
OLS
  Mikolov et al. (2013): x x src trg trg
  Shigeto et al. (2015): x x trg src src
CCA
  Faruqui and Dyer (2014): x x x x x
Orth.
  Xing et al. (2015): x x
  Artetxe et al. (2016): x x x
  Zhang et al. (2016): x
  Smith et al. (2017): x x x
Our method (AAAI18): x x x x trg src trg x
Evaluating via bilingual dictionary induction
Dataset by Dinu et al. (2015) extended to German, Finnish, Spanish
⇒ Monolingual embeddings (CBOW + negative sampling)
⇒ Seed dictionary: 5,000 word pairs
⇒ Test dictionary: 1,500 pairs (Nearest neighbor, P@1)

Method                     EN-IT    EN-DE    EN-FI    EN-ES
Mikolov et al. (2013)      34.93†   35.00†   25.91†   27.73†
Faruqui and Dyer (2014)    38.40*   37.13*   27.60*   26.80*
Shigeto et al. (2015)      41.53†   43.07†   31.04†   33.73†
Dinu et al. (2015)         37.7     38.93*   29.14*   30.40*
Lazaridou et al. (2015)    40.2     -        -        -
Xing et al. (2015)         36.87†   41.27†   28.23†   31.20†
Artetxe et al. (2016)      39.27    41.87*   30.62*   31.40*
Zhang et al. (2016)        36.73†   40.80†   28.16†   31.07†
Smith et al. (2017)        43.1     43.33†   29.42†   35.13†
Our method (AAAI18)        45.27    44.13    32.94    36.60

† our publicly available reimplementation
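The P@1 figure with nearest-neighbor retrieval can be computed roughly as follows. This is a sketch under assumptions: cosine similarity is used for retrieval, the test dictionary is given as a hypothetical mapping from source word index to the set of acceptable target indices, and the source embeddings are already mapped into the shared space.

```python
import numpy as np

def precision_at_1(src_emb, trg_emb, test_pairs):
    """Percentage of test source words whose nearest target word is a gold translation.

    test_pairs: dict mapping a source row index to a set of acceptable target row indices.
    """
    src_ids = list(test_pairs)
    # Length-normalize so that the dot product equals cosine similarity
    s = src_emb[src_ids] / np.linalg.norm(src_emb[src_ids], axis=1, keepdims=True)
    t = trg_emb / np.linalg.norm(trg_emb, axis=1, keepdims=True)
    nearest = (s @ t.T).argmax(axis=1)  # best target index for each test word
    hits = sum(int(n) in test_pairs[i] for i, n in zip(src_ids, nearest))
    return 100.0 * hits / len(src_ids)

# Toy example with three source and three target words in two dimensions
src = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.5]])
trg = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]])
gold = {0: {0}, 1: {1}, 2: {2}}
score = precision_at_1(src, trg, gold)  # 2 of the 3 test words are retrieved correctly
```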
Why does it work?
Languages are (to a large extent) isometric in word embedding space (!)
[Figure: mapping W]
Outline
• Bilingual embedding mappings
  • Introduction to vector space models (embeddings)
  • Bilingual embedding mappings (AAAI18)
  • Reduced supervision
    • Self-learning, semi-supervised (ACL17)
    • Self-learning, fully unsupervised (ACL18)
  • Conclusions
• Unsupervised neural machine translation
  • Introduction to NMT
  • From bilingual embeddings to uNMT (ICLR18)
  • Unsupervised statistical MT (EMNLP18)
  • Conclusions
Reducing supervision
Bilingual signal for training
Previous work:
- parallel corpora
- comparable corpora
- (big) dictionaries
Our work:
- 25 word dictionary
- numerals (1, 2, 3…)
- nothing
Self-learning
Monolingual embeddings
Dictionary → Mapping → Dictionary (better!) → Mapping → Dictionary (even better!) → Mapping → Dictionary (even better!)
Dictionary and Mapping feed each other in a loop: the proposed self-learning method
Too good to be true?
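The dictionary/mapping loop can be sketched as below. This is a simplified illustration, not the authors' implementation: it assumes an orthogonal (Procrustes) mapping step, dot-product nearest-neighbor dictionary induction, and a stop criterion of "the induced dictionary no longer changes". All function names and the toy data are hypothetical.

```python
import numpy as np

def procrustes(x, z):
    # Orthogonal W minimizing ||x @ W - z||_F for row-aligned x and z
    u, _, vt = np.linalg.svd(x.T @ z)
    return u @ vt

def induce_dictionary(x, z, w):
    # For every source word, pick the target word with the highest dot product
    return (x @ w @ z.T).argmax(axis=1)

def self_learning(x, z, seed_src, seed_trg, max_iters=50):
    src_idx, trg_idx = np.asarray(seed_src), np.asarray(seed_trg)
    all_src = np.arange(x.shape[0])
    for _ in range(max_iters):
        w = procrustes(x[src_idx], z[trg_idx])   # dictionary -> mapping
        new_trg = induce_dictionary(x, z, w)     # mapping -> dictionary
        if len(trg_idx) == len(all_src) and np.array_equal(new_trg, trg_idx):
            break                                # dictionary stopped changing
        src_idx, trg_idx = all_src, new_trg
    return w, trg_idx

# Toy data: the source space is an exact rotation of a shuffled target space.
rng = np.random.default_rng(0)
z = rng.normal(size=(50, 5))
z /= np.linalg.norm(z, axis=1, keepdims=True)
perm = rng.permutation(50)                     # gold translation of source word i is perm[i]
q, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # random orthogonal matrix
x = z[perm] @ q.T
w, dictionary = self_learning(x, z, np.arange(10), perm[:10])
```

Because the toy spaces are perfectly isometric and the seed has more pairs than dimensions, the loop settles after the first round; with real embeddings and a 25-pair seed the dictionary would be expected to improve gradually over many rounds.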
Semi-supervised experiments (ACL17)
• Given monolingual embeddings plus seed bilingual dictionary (train dictionary):
  • 25 word pairs
  • Pairs of numerals
• Induce bilingual dictionary using self-learning for full vocabulary
• Evaluation
  • Compare translations to existing bilingual dictionary (test dictionary)
  • Accuracy
[Figure: English-Italian word translation induction]
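A seed dictionary made of numerals needs no bilingual resource at all. One way to build it, assuming that a "numeral" is any token consisting only of digits and that it translates to itself, is shown below; the function name and the vocabularies are made up for illustration.

```python
import re

def numeral_seed(src_vocab, trg_vocab):
    # Pair every all-digit token with itself when it occurs in both vocabularies
    trg_set = set(trg_vocab)
    return [(w, w) for w in src_vocab
            if re.fullmatch(r"[0-9]+", w) and w in trg_set]

seed = numeral_seed(["the", "1", "25", "3rd", "2018"], ["il", "25", "1", "7"])
# seed == [("1", "1"), ("25", "25")]
```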
Why does it work?
Implicit objective: independent from the seed dictionary!
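The slide text does not spell the objective out. One way to write an objective of this kind, assuming length-normalized source and target embedding matrices $X$ and $Z$ (symbols introduced here for illustration), an orthogonal mapping $W$, and dot-product nearest-neighbor retrieval, is:

```latex
W^{*} = \arg\max_{W} \sum_{i} \max_{j} \, (X_{i*} W) \cdot Z_{j*}
\qquad \text{s.t.} \quad W W^{\top} = W^{\top} W = I
```

Under this reading, each self-learning round alternates between the inner maximization over $j$ (dictionary induction) and the outer maximization over $W$ (mapping). No dictionary term appears in the expression, which is what makes the objective independent from the seed; the seed would only decide where the search starts.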