translation without bilingual parallel corpora

Translation without bilingual parallel corpora, by Chris Callison-Burch



  1. Translation without bilingual parallel corpora. Chris Callison-Burch, with Ann Irvine, Alex Klementiev, and David Yarowsky. Lecture 20.

  2. How to Improve Machine Translation
     ❶ Better models
     ❷ More bilingual training data
     ❸ Eliminate the need for bitexts
     [Figure: translation quality (y-axis, 0 to 30) against bilingual training data (x-axis, 1 to 82,000).]

  3. Bilingual data varies by language. [Figure: bar chart of bilingual data sizes (axis values 1000M, 200M, 50M, 1.5M) for Arabic and Chinese (DARPA GALE), French-English (10^9 word webcrawl), European Parliament, and Urdu.]

  4. Monolingual data is more common
     • Typically we have orders of magnitude more monolingual data
     • Can we use monolingual data to learn translations?
     • Is that a crazy idea?

  5. הקישנ

  6. Scoring Translations: Time. [Figure: occurrences over time. The curves for terrorist (en) and terrorista (es) are similar; the curves for terrorist (en) and riqueza (es) are dissimilar.]

  7. Scoring Translations: Time. Ranked English candidates for each Spanish word:
     eólica: wind, renewable, solar, sources, renewables, energy, energies, electricity, photovoltaic, grid
     estambul: istanbul, erdogan, turkish, turkey, turks, ankara, membership, negotiations, undcp, talks
     terrorista: terrorist, terrorism, terrorists, attacks, fight, attack, terror, acts, threat, september
     vacuno: beef, cattle, bse, compulsory, meat, cows, veal, cow, labelling, papayannakis

  8. Distributional Hypothesis. "If we consider oculist and eye-doctor we find that, as our corpus of utterances grows, these two occur in almost the same environments. In contrast, there are many sentence environments in which oculist occurs but lawyer does not... It is a question of the relative frequency of such environments, and of what we will obtain if we ask an informant to substitute any word he wishes for oculist (not asking what words have the same meaning). These and similar tests all measure the probability of particular environments occurring with particular elements... If A and B have almost identical environments we say that they are synonyms." –Zellig Harris (1954)

  9. Vector Space Models of Word Similarity. Represent a word through the contexts that it has been observed in.
     Example sentences: "He found five fish swimming in an old bathtub." "He slipped down in the bathtub."
     Context vector for bathtub: a 1, down 1, find 1, fish 1, five 1, he 2, in 2, slip 1, swim 1, the 1
     [Figure: the vectors for water, bathtub, and money plotted in the same space.]

  10. Vector Space Models of Word Similarity. Same example as slide 9, now with cos(bathtub, water), the cosine between the bathtub and water context vectors, marked in the figure.
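The vector space idea on slides 9 and 10 can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: it uses whole sentences as the context window and skips lemmatization (the slide counts "find" and "swim", so some normalization was evidently applied there). The last two sentences and the function names are assumptions added for the example.

```python
import math
import re
from collections import Counter

def context_vector(target, sentences):
    """Count every other word that shares a sentence with `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        if target in tokens:
            counts.update(t for t in tokens if t != target)
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

sentences = [
    "He found five fish swimming in an old bathtub.",  # from the slide
    "He slipped down in the bathtub.",                 # from the slide
    "Five fish were swimming in the water.",           # hypothetical
    "She spent the money on an old car.",              # hypothetical
]

bathtub = context_vector("bathtub", sentences)
water = context_vector("water", sentences)
money = context_vector("money", sentences)

print("cos(bathtub, water) =", round(cosine(bathtub, water), 3))
print("cos(bathtub, money) =", round(cosine(bathtub, money), 3))
```

With these toy sentences, bathtub shares more context words with water than with money, so its cosine with water is the larger of the two.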

  11. Scoring Translations: Context. The context vector for crecer is built from the sentences it occurs in:
      ... este número podría crecer muy rápidamente si no se modifica ...
      ... nuestras economías a crecer y desarrollarse de forma saludable ...
      ... que nos permitirá crecer rápidamente cuando el contexto ...
      Resulting counts: rápidamente 2, economías 1. The dimensions planeta, extranjero, and empleo are also shown.

  12. Scoring Translations: Context. [Figure: the Spanish context vector for crecer (dimensions rápidamente, planeta, economías, extranjero, empleo) is projected through a dictionary (dict.) into English dimensions (economic, growth, employment, quickly). The projected vector is then compared with the context vectors of the English words expand, policy, and activity.]
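The projection step in this figure can be sketched as follows. All counts, the seed dictionary entries, and the function names are hypothetical values chosen for illustration; the individual numbers on the slide could not be recovered from the extracted text. The sketch assumes a one-to-one seed dictionary and cosine as the comparison.

```python
import math

def project(vector, dictionary):
    """Map a source-language context vector into target-language dimensions.
    Dimensions with no dictionary entry are dropped."""
    projected = {}
    for word, count in vector.items():
        if word in dictionary:
            target = dictionary[word]
            projected[target] = projected.get(target, 0) + count
    return projected

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical context counts for the Spanish word "crecer".
crecer = {"rápidamente": 7, "economías": 4, "empleo": 3, "planeta": 1, "extranjero": 1}

# Hypothetical seed dictionary; "planeta" and "extranjero" are deliberately missing.
seed_dictionary = {"rápidamente": "quickly", "economías": "economies", "empleo": "employment"}

# Hypothetical English context vectors for three candidate translations.
candidates = {
    "expand": {"quickly": 9, "growth": 5, "economies": 4, "employment": 2},
    "policy": {"economies": 7, "employment": 7, "growth": 2},
    "activity": {"economies": 5, "growth": 4, "quickly": 1},
}

projected = project(crecer, seed_dictionary)
ranked = sorted(candidates, key=lambda w: cosine(projected, candidates[w]), reverse=True)
print(ranked)
```

Dropping dimensions that the dictionary does not cover is one simple choice; a larger seed dictionary leaves fewer dimensions unused.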

  13. Scoring Translations: Context. Ranked English candidates for each Spanish word:
      eólica: wind, nuclear, hydroelectric, geothermal, photovoltaic, purchasing, saving, efficiency, atomic, wielded
      estambul: istanbul, virginia, zagreb, london, oreja, rosales, moscow, attending, washington, johannesburg
      admirable: remarkable, wonderful, admirable, splendid, magnificent, excellent, outstanding, fantastic, producing, commendable
      choque: shock, shocks, clash, disagreement, disparity, link, contradiction, divisions, confrontation, synergies

  14. Scoring Translations: Orthography. Etymologically related words often retain similar spelling across languages with the same writing system, e.g. Spanish democracia and English democracy. Words with lower edit distances are sometimes good translations of each other.
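As a concrete instance of the edit distance signal, here is a minimal sketch using standard Levenshtein distance (unit cost for insertion, deletion, and substitution). The length normalization and the candidate list are assumptions made for the example; the slide does not say how the raw distance is turned into a score.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

def spelling_similarity(a, b):
    """Assumed normalization: 1 minus distance over the longer word's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

# Hypothetical English candidate list for the Spanish word from the slide.
candidates = ["democracy", "democrat", "bureaucracy"]
best = max(candidates, key=lambda c: spelling_similarity("democracia", c))
print(best, edit_distance("democracia", best))
```

Normalizing by length keeps long word pairs from being penalized just for having more characters to mismatch.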

  15. Scoring Translations: Spelling. Ranked English candidates for each Spanish word:
      sanitario: sanitary, sanitation, unitario, sanitarium, sanitation, sagittario, sanitarias, kantaro, sanitorium, santoro
      desarrollos: ferroalloy, barrosos, destroyers, mccarroll, disallows, disallow, scrolls, payrolls, carroll, steamrolls
      volcánica: volcanic, volcanism, voltaic, vacancy, konica, dominica, veronica, monica, volcano, vratnica
      montana: montana, fontana, montane, mentana, montagna, montanha, montan, montano, montani, montand

  16. Scoring Translations: Orthography. We measure edit distance directly for languages that share a writing system (Spanish democracia, English democracy). We transliterate for languages with different writing systems (Russian демократия, transliterated demokratiya, English democracy). Assign a similarity score with edit distance or with a discriminative transliteration model.
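A rough sketch of the transliterate-then-compare idea, using the slide's Russian example. The character table below is a hand-written stand-in covering only the letters needed here; it is an assumption for illustration, not the discriminative transliteration model the slide mentions.

```python
from functools import lru_cache

# Hypothetical, deliberately tiny romanization table (only letters used below).
CYRILLIC_TO_LATIN = {
    "д": "d", "е": "e", "м": "m", "о": "o", "к": "k",
    "р": "r", "а": "a", "т": "t", "и": "i", "я": "ya",
}

def romanize(word):
    """Replace each known Cyrillic letter; leave anything else unchanged."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in word.lower())

def edit_distance(a, b):
    """Levenshtein distance, written recursively with memoization."""
    @lru_cache(maxsize=None)
    def dist(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        substitution = dist(i - 1, j - 1) + (a[i - 1] != b[j - 1])
        return min(dist(i - 1, j) + 1, dist(i, j - 1) + 1, substitution)
    return dist(len(a), len(b))

romanized = romanize("демократия")
distance = edit_distance(romanized, "democracy")
print(romanized, distance)
```

The recursive form is fine for single words; longer strings would call for an iterative table to stay within Python's recursion limit.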

  17. Transliteration using SMT. [Figure: character-level alignment between the Urdu letters ا ل ی گ ز ی ن ڈ ر ی ا and the English letters a l e x a n d r i a.]

  18. Character-based translation
      • Instead of aligning words across sentence pairs, we align characters across name pairs
      • Learn translation rules for sequences of letters
      • Language model is an n-graph letter-sequence model built from English names
      • Requires:
        – Many pairs of foreign-English names
        – Many names written in English for LM
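To make the letter-sequence language model concrete, here is a small sketch of a character bigram model trained on English names. The model order (bigrams), the add-one smoothing, the class name, and the training names are all assumptions made for this example; the slide only says the model is built from letter sequences in English names.

```python
import math
from collections import Counter

class CharBigramLM:
    """Character bigram model with add-one smoothing (illustrative only)."""

    def __init__(self, names):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.alphabet = set()
        for name in names:
            padded = "^" + name.lower() + "$"   # ^ marks start, $ marks end
            self.alphabet.update(padded[1:])    # every symbol that can be predicted
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[prev, cur] += 1
                self.contexts[prev] += 1

    def log_prob(self, name):
        """Log probability of a letter sequence under the smoothed model."""
        padded = "^" + name.lower() + "$"
        vocab = len(self.alphabet)
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigrams[prev, cur] + 1
            denominator = self.contexts[prev] + vocab
            total += math.log(numerator / denominator)
        return total

# Hypothetical training names.
lm = CharBigramLM(["alexandria", "alexander", "andrea", "sandra", "adrian", "diana"])

plausible = lm.log_prob("alexandra")
implausible = lm.log_prob("xqzrtlaa")
print(plausible, implausible)
```

In a character-based system such a model would score candidate output spellings, preferring letter sequences that look like English names.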

  19. Transliteration training data
      • Extracted name pairs from an automatically word-aligned parallel corpus
      • Gathered training data from Wikipedia
        – 890 articles about people w/ inter-language links
      • Hired Urdu speakers on Mechanical Turk to transliterate names
        – gathered 5,470 English->Urdu names and 5,470 Urdu->English names
        – 2/3 of the data was high quality
        – 12,384 additional names for <$300

  20. Learning Curve. [Figure: average edit distance (y-axis) against training data size (x-axis).]

  21. Example transliterations

  22. Scoring Translations: Orthography. Etymologically related words often retain similar spelling across languages with the same writing system (Spanish democracia, English democracy). We transliterate for languages with different writing systems (Russian демократия, transliterated demokratiya, English democracy). Assign a similarity score with edit distance or with a discriminative transliteration model.

  23. Scoring Translations: Topics. Phrases and their translations are used to describe the same topics. The more similar the sets of topics two phrases appear in, the more likely they are to be translations. We treat Wikipedia article pairs with interlingual links as topics. [Figure: Topic 1, Topic 2, Topic 3, ..., Topic N, each with an L1 and an L2 article.]
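A minimal sketch of the topic signal, assuming each linked article pair is one topic and that similarity is the cosine between per-topic occurrence counts. The article texts, the English-Spanish language pair, and the function names are hypothetical and chosen only to keep the example short.

```python
import math

def topic_vector(word, articles):
    """One count per topic: occurrences of `word` in each article."""
    return [article.lower().split().count(word) for article in articles]

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical linked article pairs: index i in both lists is the same topic.
english_articles = [
    "troops left iraq as troops withdrew",
    "music and songs",
    "troops in virginia",
]
spanish_articles = [
    "las tropas salieron de irak y las tropas volvieron",
    "música y canciones y flor",
    "tropas en virginia",
]

troops = topic_vector("troops", english_articles)
tropas = topic_vector("tropas", spanish_articles)
flor = topic_vector("flor", spanish_articles)

print(cosine(troops, tropas), cosine(troops, flor))
```

No dictionary is needed for this signal, because the interlingual links already place both vectors in the same topic space.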

  24. Scoring Translations: Context. [Figure: repeat of the slide 12 diagram, in which the context vector for crecer is projected through a dictionary (dict.) and compared with the English context vectors for expand, policy, and activity.]

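  25. Scoring Translations: Topics. Occurrence counts per Wikipedia article pair for English troops and for three Russian words (labels as extracted: цветок, войска, завтра):
      Barack_Obama / Обама,_Барак: troops 15; Russian counts 8, 1, 0
      Virginia / Виргиния: troops 32; Russian counts 15, 0, 2
      Iraq_War / Иракская_война: troops 10; Russian counts 8, 0, 0
      Ückeritz / Иккериц: troops 0; Russian counts 0, 0, 0
      Otto_von_Bismarck / Бисмарк,_Отто_фон: troops 1; Russian counts 0, 0, 0
      Music / Музыка: troops 4; Russian counts 5, 7, 0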

  26. Scoring Translations: Context. Ranked English candidates for each Spanish word:
      sanitario: health, transcultural, medical, sanitation, patient, deliverables, pharmaceutica, sewerage, healthcare, care
      desarrollos: developments, developed, development, used, using, modern, based, important, history, different
      volcánica: volcanic, eruptions, volcanism, lava, plumes, eruption, volcano, volcanoes, breakouts, volcanically
      montana: montana, miley, hannah, beartooth, cyrus, crazier, bozeman, chelsom, absaroka, baucus

  27. How good is each approach? We have a wide variety of ways of using monolingual texts to measure translation equivalence. Which is the best? We measured the accuracy on 24 languages: Albanian, Azeri, Bengali, Bosnian, Bulgarian, Cebuano, Gujarati, Hindi, Hungarian, Indonesian, Latvian, Nepali, Romanian, Serbian, Slovak, Somali, Swedish, Tamil, Telugu, Turkish, Ukrainian, Uzbek, Vietnamese and Welsh. For each foreign word we computed a ranked list of English words using each signal of translation equivalence. The number of candidate English words varied by language, from 34,000 to 287,000.
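One plausible way to score such ranked lists is top-k accuracy: the fraction of foreign words whose reference translation appears among the first k candidates. The slide only says "accuracy", so the metric, the function name, and the reference translations below are assumptions. The ranked lists are shortened versions of the ones shown on slides 7 and 15.

```python
def top_k_accuracy(ranked_lists, gold, k=10):
    """Share of source words with a gold translation in the top k candidates."""
    if not ranked_lists:
        return 0.0
    hits = 0
    for word, ranking in ranked_lists.items():
        if gold.get(word, set()) & set(ranking[:k]):
            hits += 1
    return hits / len(ranked_lists)

# Ranked candidates taken from the slides (truncated to three entries each).
ranked_lists = {
    "terrorista": ["terrorist", "terrorism", "terrorists"],
    "vacuno": ["cattle", "beef", "bse"],   # order changed here to show k mattering
    "desarrollos": ["ferroalloy", "barrosos", "destroyers"],
}

# Assumed reference translations, not taken from the slides.
gold = {
    "terrorista": {"terrorist"},
    "vacuno": {"beef"},
    "desarrollos": {"developments"},
}

top1 = top_k_accuracy(ranked_lists, gold, k=1)
top3 = top_k_accuracy(ranked_lists, gold, k=3)
print(top1, top3)
```

Reporting several values of k shows whether a signal puts the right translation first or merely somewhere near the top.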
