Machine Translation between Languages with Significant Word Reordering and Rich Target-side Morphology


  1. Machine Translation between Languages with Significant Word Reordering and Rich Target-side Morphology
     20th Week of Doctoral Students, June 3rd, 2011
     ÚFAL, Charles University in Prague
     Bushra Jawaid
     RNDr. Ondřej Bojar (PhD advisor)

  2. Language Pair & Properties
     ● Language pair → English-Urdu
     ● English is an SVO language with strict word order.
     ● Urdu has restricted free word order and mostly follows SOV structure by default.
     English sentence: I understand English and Urdu.
     Urdu translation: میں انگریزی اور اُردو سمجھتی ہوں
     Transliteration: meñ angrezī aor Urdū samjhte hūñ
     Gloss: I English and Urdu understand (auxiliary)

  3. Language Pair & Properties (Cont.)
     ● Urdu has a concatenative inflectional morphological system.
     ● For example, Urdu verbs inflect for tense, mood, aspect, gender and number.
     ● The table below shows three different masculine forms of the verb "be made":

                                    Root           Infinitive          Oblique
     Intransitive/(di)Transitive    bən (بن)       bənnɑ (بننا)        bənne (بننے)
     Direct causative               bənɑ (بنا)     bənɑnɑ (بنانا)      bənɑne (بنانے)
     Indirect causative             bənwɑ (بنوا)   bənwɑnɑ (بنوانا)    bənwɑne (بنوانے)

  4. Research Focus
     ● Explore methods and techniques for translating into morphologically richer languages.
     ● Reduce the word order differences between source and target languages.
     ● Main motivation:
       ● Model the problem of reordering.
       ● Deal with word form choice separately.
       ● Improve generalization.

  5. Possible Solutions
     Translate+Generate (T+T+G) setup (Bojar et al., 2010):

     English       Czech
     Form          Form (+LM)
     Lemma         Lemma (+LM)
     Morphology    Morphology (+LM)

     Issues with this setup:
     ● Factors in Moses are synchronous → all factors have to be fully constructed before the main search.
     ● Many possible options of lemma, tag and final word form → pruning strikes hard.

  6. Possible Solutions (Cont.)
     ● Translation options for the German word "haus" (Koehn et al., 2007):
     ● Translation: mapping lemmas
       { ?|house|?|?, ?|home|?|?, ?|building|?|?, ?|shell|?|? }
     ● Translation: mapping morphology
       { ?|house|NN|plural, ?|home|NN|plural, ?|building|NN|plural, ?|shell|NN|plural, ?|house|NN|singular, ... }
     ● Generation: generating surface forms
       { houses|house|NN|plural, homes|home|NN|plural, buildings|building|NN|plural, shells|shell|NN|plural, house|house|NN|singular, ... }
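
The expansion on this slide can be sketched in a few lines of Python. This is only an illustration of how the number of options multiplies across the mapping steps. The lemma and morphology lists come from the slide; the `generate_surface` rule (append "s" for plural) and the `expand_options` helper are toy assumptions, not the generation model used in Moses.

```python
from itertools import product

# Lemma and morphology candidates for German "haus", as listed on the slide.
LEMMAS = ["house", "home", "building", "shell"]
MORPHOLOGY = [("NN", "plural"), ("NN", "singular")]


def generate_surface(lemma: str, number: str) -> str:
    """Toy generation step (assumption): regular English plural with "s"."""
    return lemma + "s" if number == "plural" else lemma


def expand_options(lemmas, morphology):
    """Combine every lemma with every morphology option, then generate a form."""
    options = []
    for lemma, (pos, number) in product(lemmas, morphology):
        surface = generate_surface(lemma, number)
        options.append(f"{surface}|{lemma}|{pos}|{number}")
    return options


options = expand_options(LEMMAS, MORPHOLOGY)
print(len(options))  # 4 lemmas x 2 morphology options
for option in options:
    print(option)
```

With a morphologically rich target language each factor has far more candidates than in this toy case, so the product grows quickly and most combinations have to be pruned before the main search.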

  7. Two-Step Architecture
     [Figure: Input (plain text) → 1st step: reordering → middle language (middle layer) → 2nd step: morphology → Output (plain text)]
     (Fraser, 2009) and (Bojar, 2010)

  8. Possible Solutions (Cont.)
     ● Two-step setup (to avoid the explosion of translation options):
       ● The first step translates from the source to augmented lemmatized target words.
       ● Monolingual features are *not* represented, for example the gender of adjectives.

     Src:    good                 book
     Mid:    A1XX.اچھا            NSNX.کتاب
     Gloss:  adj+1stdeg...good    noun+sg+nom...book

  9. Possible Solutions (Cont.)
     ● The second step is a monotone translation from lemmatized target words to fully inflected target words.

     Src:    good                 book
     Mid:    A1XX.اچھا            NSNX.کتاب
     Gloss:  adj+1stdeg...good    noun+sg+nom...book
     Out:    اچھی (achi)          کتاب (kitab)

     Idea behind the two-step architecture → model target-side morphology separately if it does not depend on source morphology.
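
A minimal sketch of the second step for the example above, assuming a hand-written lookup rather than the monotone statistical translation the slides describe. The names `NOUN_GENDER`, `ADJ_FORMS` and `inflect` are hypothetical, transliterations stand in for Urdu script, the middle-layer tokens are written as tag.lemma, and the masculine fallback is an arbitrary choice for the toy case.

```python
# Toy lexicon (assumption): gender is a monolingual feature stored only here,
# not in the middle-layer tags.
NOUN_GENDER = {"kitab": "fem"}
ADJ_FORMS = {"acha": {"masc": "acha", "fem": "achi"}}


def inflect(mid_tokens):
    """Monotone pass: inflect each tagged lemma in place, without reordering."""
    # Take the gender of the first noun (tag starts with "N") in the phrase.
    gender = "masc"
    for token in mid_tokens:
        tag, lemma = token.split(".", 1)
        if tag.startswith("N"):
            gender = NOUN_GENDER.get(lemma, "masc")
            break
    output = []
    for token in mid_tokens:
        tag, lemma = token.split(".", 1)
        if tag.startswith("A"):
            # Adjectives agree with the gender of the noun.
            output.append(ADJ_FORMS.get(lemma, {}).get(gender, lemma))
        else:
            output.append(lemma)
    return output


print(inflect(["A1XX.acha", "NSNX.kitab"]))
```

The word order of the input is preserved; only the adjective form changes, which is the sense in which the second step is monotone.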

  10. Basic Two-Step Setup
      [Figure: Input (plain text) → Moses (distance-based and lexicalized reordering) → strings → Moses (monotone) → Output (plain text)]

  11. Two-Step Variants
      [Figure: Input (plain text) → Joshua or Moses-chart → 1-best output (strings) → Moses (monotone) → Output (plain text)]
      1. Reordering options:
      ● Use Moses-chart, Joshua or manual reordering in the 1st step for improved reordering.
      ● Moses-chart and Joshua are hierarchical, i.e. they allow block movements.

  12. Two-Step Variants (Cont.)
      [Figure: Input (plain text) → Transformation System → 1-best output (strings) → Joshua or Moses-chart → 1-best output (strings) → Moses (monotone) → Output (plain text)]
      ● Pre-reorder input sentences using the transformation system (Jawaid, 2010) and pass the 1-best reordered output to the 1st layer.

  13. Two-Step Variants (Cont.)
      [Figure: Input (plain text) → Transformation System → N-best output (lattices) → Moses → 1-best output (strings) → Moses (monotone) → Output (plain text)]
      ● Generate an input lattice from multiple reorderings of each sentence.
      ● Use of lattices: (Niehues et al., 2009) and (Bisazza et al., 2010).
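
One simple way to picture a lattice built from several reorderings is a graph in which shared prefixes are merged. The sketch below is an illustration under that assumption, not the lattice input format that Moses expects; `build_prefix_lattice` is a hypothetical helper, and the two reorderings reuse the English sentence and its gloss from the English-Urdu example on slide 2.

```python
def build_prefix_lattice(reorderings):
    """Merge token sequences into a graph that shares common prefixes.

    Returns (edges, finals): edges maps (state, token) to the next state,
    finals is the set of states where a complete reordering ends.
    State 0 is the start state.
    """
    edges = {}
    finals = set()
    next_state = 1
    for tokens in reorderings:
        state = 0
        for token in tokens:
            key = (state, token)
            if key not in edges:
                edges[key] = next_state
                next_state += 1
            state = edges[key]
        finals.add(state)
    return edges, finals


reorderings = [
    "I English and Urdu understand".split(),  # SOV-style order
    "I understand English and Urdu".split(),  # original SVO order
]
edges, finals = build_prefix_lattice(reorderings)
for (state, token), target in sorted(edges.items(), key=lambda item: item[1]):
    print(f"{state} --{token}--> {target}")
```

Both paths share the first edge for "I" and then diverge, so the decoder in the next step can score either order.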

  14. Two-Step Variants (Cont.)
      [Figure: Input (plain text) → Transformation System → N-best output (lattices) → Moses, Moses-chart or Joshua → 1 or N best output (lattices or strings) → Moses (monotone) → Output (plain text)]
      2. Middle layer options:
      Pass lattices of possible hypotheses from the 1st step to the 2nd step instead of a single string hypothesis. Multiple reorderings are considered, and the 2nd step is free to choose the one that is easiest to inflect.

  15. Two-Step Variants (Cont.)
      [Figure: Input (plain text) → Transformation System → N-best output (lattices) → Moses, Moses-chart or Joshua → 1 or N best output (lattices or strings) → Classifier → Output (plain text)]
      3. 2nd layer options:
      Add a classifier in the 2nd step to select the best hypothesis.

  16. Main Issues
      ● Urdu is an under-resourced language.
      ● Current research work:
        ● Finding and improving taggers
          – Collecting tools such as taggers and morphological analyzers for Urdu.
          – Trying to combine the taggers to improve precision.
          – Need to merge the different tagsets.
        ● Collecting more data.
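
One possible reading of "combine the taggers to improve precision" is to map each tagger's output to a shared tagset and keep a tag only when the taggers agree. The sketch below assumes exactly that; the tagset mappings, the tag names and the `combine` helper are invented for illustration and do not correspond to any particular Urdu tagger.

```python
# Hypothetical mappings from two tagger-specific tagsets to a shared tagset.
TAGGER_A_TO_COMMON = {"NN": "NOUN", "VB": "VERB", "JJ": "ADJ"}
TAGGER_B_TO_COMMON = {"N": "NOUN", "V": "VERB", "ADJ": "ADJ"}


def combine(tags_a, tags_b, unknown="UNK"):
    """Keep a tag only where both taggers agree after mapping to the shared tagset."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both taggers must tag the same tokens")
    merged = []
    for tag_a, tag_b in zip(tags_a, tags_b):
        common_a = TAGGER_A_TO_COMMON.get(tag_a)
        common_b = TAGGER_B_TO_COMMON.get(tag_b)
        if common_a is not None and common_a == common_b:
            merged.append(common_a)
        else:
            merged.append(unknown)
    return merged


print(combine(["JJ", "NN", "VB"], ["ADJ", "N", "N"]))
```

Requiring agreement trades coverage for precision: tokens on which the taggers disagree are left untagged rather than guessed.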

  17. Questions? Feel free to ask questions.
