Cross-linguality and machine translation without bilingual data - PowerPoint PPT Presentation

Eneko Agirre (@eagirre). Joint work with: Mikel Artetxe, Gorka Labaka. IXA NLP group, University


  1. State-of-the-art in supervised mappings
     Two sequences of (optional) linear transformations:
     S0 (opt.) Pre-processing: length normalization, mean centering
     S1 (opt.) Whitening: turn covariance matrices into the identity matrix
     S2 Orthogonal mapping: map into a shared space (Procrustes)
     S3 (opt.) Re-weight each component according to its cross-correlation

  2. State-of-the-art in supervised mappings
     Two sequences of (optional) linear transformations:
     S0 (opt.) Pre-processing: length normalization, mean centering
     S1 (opt.) Whitening: turn covariance matrices into the identity matrix
     S2 Orthogonal mapping: map into a shared space (Procrustes)
     S3 (opt.) Re-weight each component according to its cross-correlation
     S4 (opt.) De-whitening: restore original variance in every direction

  3. State-of-the-art in supervised mappings
     Two sequences of (optional) linear transformations:
     S0 (opt.) Pre-processing: length normalization, mean centering
     S1 (opt.) Whitening: turn covariance matrices into the identity matrix
     S2 Orthogonal mapping: map into a shared space (Procrustes)
     S3 (opt.) Re-weight each component according to its cross-correlation
     S4 (opt.) De-whitening: restore original variance in every direction
     S5 (opt.) Dimensionality reduction: keep the first n components only
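
The steps S0 to S5 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names (`preprocess`, `whitening_matrix`, `fit_mapping`) are hypothetical, and the choices for S3 and S4 (re-weight the target side, de-whiten each side with respect to its own language) are assumptions taken from the "Our method" row of the comparison table on the next slides.

```python
import numpy as np

def preprocess(m):
    """S0: length-normalize each row, then mean-center each dimension."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    m = m / np.where(norms == 0, 1.0, norms)
    return m - m.mean(axis=0, keepdims=True)

def whitening_matrix(m):
    """S1: return W such that (m @ W).T @ (m @ W) is the identity."""
    _, s, vt = np.linalg.svd(m, full_matrices=False)
    return vt.T @ np.diag(1.0 / s) @ vt

def fit_mapping(x, z, n_components=None):
    """Fit the two linear maps from dictionary pairs.

    x[i] and z[i] are the (already preprocessed) embeddings of the i-th
    seed dictionary pair. Returns (wx, wz); mapped embeddings are
    x @ wx and z @ wz. Needs at least as many pairs as dimensions.
    """
    # S1: whitening, computed on the dictionary rows
    wx1, wz1 = whitening_matrix(x), whitening_matrix(z)
    # S2: orthogonal mapping from the SVD of the cross-covariance
    u, s, vt = np.linalg.svd((x @ wx1).T @ (z @ wz1))
    v = vt.T
    wx, wz = wx1 @ u, wz1 @ v
    # S3: re-weight target components by their cross-correlation
    wz = wz * s
    # S4: de-whitening, each side w.r.t. its own language
    wx = wx @ (u.T @ np.linalg.inv(wx1) @ u)
    wz = wz @ (v.T @ np.linalg.inv(wz1) @ v)
    # S5: keep only the first n components
    if n_components is not None:
        wx, wz = wx[:, :n_components], wz[:, :n_components]
    return wx, wz
```

In use, one would call `preprocess` on the full source and target embedding matrices, pass the rows selected by the seed dictionary to `fit_mapping`, and apply the returned matrices to the full vocabularies.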

  4. State-of-the-art in supervised mappings
     Steps applied by each method (columns: S0 (l), S0 (m), S1, S2, S3, S4 (src), S4 (trg), S5; marks listed in column order)
     OLS
       Mikolov et al. (2013): x x src trg trg
       Shigeto et al. (2015): x x trg src src
     CCA
       Faruqui and Dyer (2014): x x x x x
     Orth.
       Xing et al. (2015): x x
       Artetxe et al. (2016): x x x
       Zhang et al. (2016): x
       Smith et al. (2017): x x x

  5. State-of-the-art in supervised mappings
     Steps applied by each method (columns: S0 (l), S0 (m), S1, S2, S3, S4 (src), S4 (trg), S5; marks listed in column order)
     OLS
       Mikolov et al. (2013): x x src trg trg
       Shigeto et al. (2015): x x trg src src
     CCA
       Faruqui and Dyer (2014): x x x x x
     Orth.
       Xing et al. (2015): x x
       Artetxe et al. (2016): x x x
       Zhang et al. (2016): x
       Smith et al. (2017): x x x
     Our method (AAAI18): x x x x trg src trg x

  6. Evaluating via bilingual dictionary induction
     Dataset by Dinu et al. (2015), extended to German, Finnish, Spanish

  7. Evaluating via bilingual dictionary induction
     Dataset by Dinu et al. (2015), extended to German, Finnish, Spanish
     ⇒ Monolingual embeddings (CBOW + negative sampling)

  8. Evaluating via bilingual dictionary induction
     Dataset by Dinu et al. (2015), extended to German, Finnish, Spanish
     ⇒ Monolingual embeddings (CBOW + negative sampling)
     ⇒ Seed dictionary: 5,000 word pairs

  9. Evaluating via bilingual dictionary induction
     Dataset by Dinu et al. (2015), extended to German, Finnish, Spanish
     ⇒ Monolingual embeddings (CBOW + negative sampling)
     ⇒ Seed dictionary: 5,000 pairs
     ⇒ Test dictionary: 1,500 pairs (Nearest neighbor, P@1)

  10. Evaluating via bilingual dictionary induction
     Dataset by Dinu et al. (2015), extended to German, Finnish, Spanish
     ⇒ Monolingual embeddings (CBOW + negative sampling)
     ⇒ Seed dictionary: 5,000 pairs
     ⇒ Test dictionary: 1,500 pairs (Nearest neighbor, P@1)

     Method                    EN-IT    EN-DE    EN-FI    EN-ES

  11. Evaluating via bilingual dictionary induction
     Dataset by Dinu et al. (2015), extended to German, Finnish, Spanish
     ⇒ Monolingual embeddings (CBOW + negative sampling)
     ⇒ Seed dictionary: 5,000 pairs
     ⇒ Test dictionary: 1,500 pairs (Nearest neighbor, P@1)

     Method                    EN-IT    EN-DE    EN-FI    EN-ES
     Mikolov et al. (2013)     34.93†   35.00†   25.91†   27.73†
     Faruqui and Dyer (2014)   38.40*   37.13*   27.60*   26.80*
     Shigeto et al. (2015)     41.53†   43.07†   31.04†   33.73†
     Dinu et al. (2015)        37.7     38.93*   29.14*   30.40*
     Lazaridou et al. (2015)   40.2     -        -        -
     Xing et al. (2015)        36.87†   41.27†   28.23†   31.20†
     Artetxe et al. (2016)     39.27    41.87*   30.62*   31.40*
     Zhang et al. (2016)       36.73†   40.80†   28.16†   31.07†
     Smith et al. (2017)       43.1     43.33†   29.42†   35.13†
     † our publicly available reimplementation

  12. Evaluating via bilingual dictionary induction
     Dataset by Dinu et al. (2015), extended to German, Finnish, Spanish
     ⇒ Monolingual embeddings (CBOW + negative sampling)
     ⇒ Seed dictionary: 5,000 pairs
     ⇒ Test dictionary: 1,500 pairs (Nearest neighbor, P@1)

     Method                    EN-IT    EN-DE    EN-FI    EN-ES
     Mikolov et al. (2013)     34.93†   35.00†   25.91†   27.73†
     Faruqui and Dyer (2014)   38.40*   37.13*   27.60*   26.80*
     Shigeto et al. (2015)     41.53†   43.07†   31.04†   33.73†
     Dinu et al. (2015)        37.7     38.93*   29.14*   30.40*
     Lazaridou et al. (2015)   40.2     -        -        -
     Xing et al. (2015)        36.87†   41.27†   28.23†   31.20†
     Artetxe et al. (2016)     39.27    41.87*   30.62*   31.40*
     Zhang et al. (2016)       36.73†   40.80†   28.16†   31.07†
     Smith et al. (2017)       43.1     43.33†   29.42†   35.13†
     Our method (AAAI18)       45.27    44.13    32.94    36.60
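
The evaluation protocol above (nearest-neighbor retrieval, P@1 against a test dictionary) could be computed roughly as below. This is a sketch under assumptions: the function name `precision_at_1` and the dictionary format (source index mapped to a set of acceptable target indices) are illustrative choices, and cosine similarity is assumed as the nearest-neighbor criterion.

```python
import numpy as np

def precision_at_1(src_mapped, trg_mapped, test_dict):
    """Fraction of test source words whose nearest target word is correct.

    src_mapped, trg_mapped: (n_words, dim) embeddings in the shared space.
    test_dict: {source_index: set of acceptable target indices}
               (hypothetical format chosen for this sketch).
    """
    def unit(m):
        norms = np.linalg.norm(m, axis=1, keepdims=True)
        return m / np.where(norms == 0, 1.0, norms)

    src_ids = list(test_dict)
    # Cosine similarity is the dot product of unit-length vectors.
    sims = unit(src_mapped[src_ids]) @ unit(trg_mapped).T
    nearest = sims.argmax(axis=1)
    hits = sum(int(nearest[k]) in test_dict[i] for k, i in enumerate(src_ids))
    return hits / len(src_ids)
```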

  13. Why does it work?

  18. Why does it work? [figure: mapping W]

  19. Why does it work? Languages are (to a large extent) isometric in word embedding space (!) [figure: mapping W]
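
One way to make the isometry claim concrete is the orthogonal Procrustes problem behind step S2. The notation here is an assumption for illustration: the rows of X and Z hold the source and target embeddings of the dictionary pairs, and W is the mapping.

```latex
\begin{align*}
W^{*} &= \arg\min_{W} \lVert XW - Z \rVert_F^{2}
        \quad \text{s.t. } W^{\top} W = I, \\
W^{*} &= U V^{\top}
        \quad \text{where } U \Sigma V^{\top} = \operatorname{SVD}(X^{\top} Z), \\
(x_i W) \cdot (x_j W) &= x_i W W^{\top} x_j^{\top} = x_i \cdot x_j .
\end{align*}
```

The last line shows why an orthogonal W is an isometry: it leaves dot products, and therefore distances, between source words unchanged. A good orthogonal mapping can only exist if the two embedding spaces already have a similar geometry.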

  20. Outline
     • Bilingual embedding mappings
        • Introduction to vector space models (embeddings)
        • Bilingual embedding mappings (AAAI18)
        • Reduced supervision
           • Self-learning, semi-supervised (ACL17)
           • Self-learning, fully unsupervised (ACL18)
        • Conclusions
     • Unsupervised neural machine translation
        • Introduction to NMT
        • From bilingual embeddings to uNMT (ICLR18)
        • Unsupervised statistical MT (EMNLP18)
        • Conclusions

  21. Reducing supervision

  23. Reducing supervision: bilingual signal for training
     Previous work

  24. Reducing supervision: bilingual signal for training
     Previous work: parallel corpora, comparable corpora, (big) dictionaries

  26. Reducing supervision: bilingual signal for training
     Previous work: parallel corpora, comparable corpora, (big) dictionaries
     Our work: 25-word dictionary, numerals (1, 2, 3…), nothing

  27. Self-learning

  44. Self-learning [figure: iterative loop]
     Monolingual embeddings + Dictionary → Mapping → Dictionary (better!) → Mapping → Dictionary (even better!) → Mapping → Dictionary (even better!)

  45. Self-learning [figure: loop]
     Monolingual embeddings + Dictionary → Mapping → Dictionary

  46. Self-learning [figure: loop]
     Monolingual embeddings + Dictionary → Mapping → Dictionary: proposed self-learning method
     Too good to be true?
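
The dictionary → mapping → dictionary loop can be sketched as follows. This is an illustration under assumptions, not the authors' code: it uses a plain orthogonal (Procrustes) mapping, nearest-neighbor induction by dot product, and a simple stopping rule (stop when the induced dictionary no longer changes); the names `procrustes` and `self_learning` are hypothetical.

```python
import numpy as np

def procrustes(x, z):
    """Orthogonal W minimizing ||x @ W - z|| for row-aligned x and z."""
    u, _, vt = np.linalg.svd(x.T @ z)
    return u @ vt

def self_learning(x, z, seed_pairs, max_iter=10):
    """Alternate between learning a mapping and inducing a dictionary.

    x, z: length-normalized monolingual embeddings, shape (n_words, dim).
    seed_pairs: list of (source_index, target_index), e.g. 25 word pairs.
    Returns the final mapping and, for every source word, the index of
    its induced translation.
    """
    src_idx = np.array([s for s, _ in seed_pairs])
    trg_idx = np.array([t for _, t in seed_pairs])
    for _ in range(max_iter):
        # Dictionary -> mapping
        w = procrustes(x[src_idx], z[trg_idx])
        # Mapping -> dictionary: nearest target word for every source word
        new_trg = (x @ w @ z.T).argmax(axis=1)
        if np.array_equal(new_trg, trg_idx):
            break  # the dictionary stopped changing
        src_idx, trg_idx = np.arange(x.shape[0]), new_trg
    return w, trg_idx
```

Each round trains on the dictionary induced in the previous round, so a small seed only has to be good enough to start the loop in the right region.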

  47. Semi-supervised experiments (ACL17)

  48. Semi-supervised experiments (ACL17)
     • Given monolingual embeddings plus seed bilingual dictionary (train dictionary):
        • 25 word pairs
        • Pairs of numerals

  49. Semi-supervised experiments (ACL17)
     • Given monolingual embeddings plus seed bilingual dictionary (train dictionary):
        • 25 word pairs
        • Pairs of numerals
     • Induce bilingual dictionary using self-learning for full vocabulary

  50. Semi-supervised experiments (ACL17)
     • Given monolingual embeddings plus seed bilingual dictionary (train dictionary):
        • 25 word pairs
        • Pairs of numerals
     • Induce bilingual dictionary using self-learning for full vocabulary
     • Evaluation
        • Compare translations to existing bilingual dictionary (test dictionary)
        • Accuracy

  51. Semi-supervised experiments (ACL17)
     [figure: English-Italian word translation induction]

  52. Why does it work?

  53. Why does it work? Implicit objective:

  71. Why does it work? Implicit objective: Independent from seed dictionary!
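
A hedged reading of the implicit objective, stated as an assumption rather than as the slide's exact formula: the self-learning loop can be viewed as searching for the orthogonal mapping W that maximizes the similarity between every source word and its nearest target word. The notation (X and Z as the source and target embedding matrices, with rows X_{i*} and Z_{j*}) is introduced here for illustration.

```latex
\begin{equation*}
W^{*} = \arg\max_{W} \sum_{i} \max_{j} \, (X_{i*} W) \cdot Z_{j*}
\quad \text{s.t. } W W^{\top} = W^{\top} W = I
\end{equation*}
```

Under this reading the seed dictionary does not appear in the objective at all. It only decides where the iterative search starts, which fits the slide's remark that the objective is independent from the seed dictionary.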
