
Machine Translation 1: Introduction, Approaches, Evaluation, Word Alignment

Ondřej Bojar, Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague. December 2018.

  1. Machine Translation 1: Introduction, Approaches, Evaluation, Word Alignment. Ondřej Bojar, Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague. December 2018. MT1: Intro, Eval and Word Alignment

  2. Outline of Lectures on MT
     1. Introduction.
        • Why is MT difficult.
        • MT evaluation.
        • Approaches to MT.
        • First peek into phrase-based MT.
        • Document, sentence and word alignment.
     2. Statistical Machine Translation.
        • Phrase-based: Assumptions, beam search, key issues.
        • Neural MT: Sequence-to-sequence, attention, self-attentive.
     3. Advanced Topics.
        • Linguistic Features in SMT and NMT.
        • Multilinguality, Multi-Task, Learned Representations.

  3. Supplementary Materials
     Videolectures & Wiki: Slides and Lectures from MT Marathon (see Programme): and the neural /mtm16
     Books:
     • Ondřej Bojar: Čeština a strojový překlad (Czech and Machine Translation). ÚFAL, 2012.
     • Philipp Koehn: Statistical Machine Translation. Cambridge University Press, 2009. With some slides:
     NMT:

  4. Why is MT Difficult?
     • Ambiguity and word senses.
     • Target word forms.
     • Negation.
     • Pronouns.
     • Co-ordination and apposition; word order.
     • Space of possible translations.
     ... aside from the well-known hard things like idioms: John kicked the bucket.

  5. Ambiguity and Word Senses
     • The plant is next to the bank.
     • He is a big data scientist: (big data) scientist or big (data scientist)?
     • Put it on the rusty/velvety coat rack.
     • Spal celou Petkevičovu přednášku. [He slept through the whole of Petkevič's lecture. / Burn the whole of Petkevič's lecture!]
     • Ženu holí stroj. [The machine shaves the woman. / I am driving the machine with a stick.]
     Dictionary entries are not much better: kniha účetní, napětí dovolené, plán prací, tři prdele
     A real-world example:
       SRC      One tap and the machine issues a slip with a number.
       REF      Jedno ťuknutí a ze stroje vyjede papírek s číslem.
       Moses 1  Z jednoho kohoutku a stroj vydá složenky s číslem. ["tap" rendered as a water tap, "slip" as postal payment slips]
       Moses 2  Jeden úder a stroj vydá složenky s číslem. ["tap" rendered as a blow, "slip" as postal payment slips]
       Google   Jedním klepnutím a stroj problémy skluzu s číslem. ["issues" rendered as the noun "problems", "slip" as sliding]

  10. Target Word Form
      Tense:
      • English uses the present perfect for recent past events.
      • Spanish has two types of past tense: one for a specific and one for an indeterminate time in the past.
      Cases, genders, ...:
      • Czech has 7 cases, 3 numbers and 4 genders:
          The cat is on the mat. → kočka
          He saw a cat. → kočku
          He saw a dog with a cat. → kočkou
          He talked about a cat. → kočce
      ⇒ Need to choose the right form when producing Czech.
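The form-selection problem on this slide can be sketched as a lookup. This is a minimal illustration, not a real morphological generator: the table `KOCKA_FORMS` and the function `czech_cat` are hypothetical names, the table holds only the four singular forms shown on the slide, and the case labels are standard grammatical terms added here for illustration.

```python
# Hypothetical lookup table: singular forms of "kočka" (cat) from the slide.
# The case names are standard grammatical labels, added as an assumption.
KOCKA_FORMS = {
    "nominative": "kočka",     # The cat is on the mat.
    "accusative": "kočku",     # He saw a cat.
    "instrumental": "kočkou",  # He saw a dog with a cat.
    "locative": "kočce",       # He talked about a cat.
}


def czech_cat(case: str) -> str:
    """Return the singular form of 'kočka' required by the given case."""
    return KOCKA_FORMS[case]


# One English word, four different Czech outputs depending on context.
for case, form in KOCKA_FORMS.items():
    print(f"cat ({case}) -> {form}")
```

The lookup itself is trivial; the hard part for an MT system is deciding which case the English context calls for, which the sketch leaves to the caller.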

  11. Context Needed to Choose Right
      Source: I saw two green striped cats .
      Candidate Czech translations of each source word:
        I       → já
        saw     → pila, pily, ..., viděl, viděla, ..., uviděl, uviděla, zrak mi utkvěl na, ..., viděl jsem, viděla jsem, ...
        two     → dva, dvě, dvou, dvěma, dvěmi
        green   → zelený, zelená, zelené, zelení, zeleného, zelených, zelenému, zeleným, zelenou, zelenými, ...
        striped → pruhovaný, pruhovaná, pruhované, pruhovaní, pruhovaného, pruhovaných, pruhovanému, pruhovaným, pruhovanou, pruhovanými, ...
        cats    → kočky, koček, kočkám, kočkách, kočkami
        .       → .
      [pila, pily: "saw" as the tool; zrak mi utkvěl na: "my gaze rested on"]
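To get a feel for how many word-by-word combinations these options allow, the count can be sketched as below. This is only an illustration under stated assumptions: the variable `LATTICE` is a hypothetical name, it contains just the forms spelled out on the slide (the "..." entries are left out), and word order is kept fixed.

```python
from itertools import product
from math import prod

# Per-word candidate translations as listed on the slide ("..." omitted).
LATTICE = [
    ["já"],
    ["pila", "pily", "viděl", "viděla", "uviděl", "uviděla",
     "zrak mi utkvěl na", "viděl jsem", "viděla jsem"],
    ["dva", "dvě", "dvou", "dvěma", "dvěmi"],
    ["zelený", "zelená", "zelené", "zelení", "zeleného",
     "zelených", "zelenému", "zeleným", "zelenou", "zelenými"],
    ["pruhovaný", "pruhovaná", "pruhované", "pruhovaní", "pruhovaného",
     "pruhovaných", "pruhovanému", "pruhovaným", "pruhovanou", "pruhovanými"],
    ["kočky", "koček", "kočkám", "kočkách", "kočkami"],
]

# Number of candidate strings: product of the per-word option counts.
total = prod(len(options) for options in LATTICE)

# Enumerate every combination as a space-joined candidate sentence.
candidates = {" ".join(path) for path in product(*LATTICE)}

print(total)
print("já viděl jsem dvě zelené pruhované kočky" in candidates)
```

Even with this small listing and a fixed word order, the number of candidates runs into the tens of thousands, and only a handful agree in case, number and gender. Picking among them requires looking at the neighbouring words.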

  14. Negation
      • French negation is placed around the verb: Je ne parle pas français. [I do not speak French.]
      • Czech negation is doubled: Nemám žádné námitky. [literally "I don't have no objections", meaning "I have no objections".]
      • Northern and southern Italy supposedly differ in the semantics of what you're doing with your public transport ticket upon entering the bus: make valid or invalid (in/validare).
      • Some sentences are even ambiguous with respect to negation:
          Baterky už došly. (No batteries left. / Batteries just arrived.)
          Z práce odcházím dobita. (I leave work exhausted/recharged.)

  15. Pronouns
      • English requires an explicit subject ⇒ it has to be guessed from the Czech verb form:
          Četl knihu. = He read a book.
          Spal jsem. = I slept.
      • The gender must match the referent:
          He saw a book. It was black. → Viděl knihu. Byla černá.
          He saw a pen. It was black. → Viděl pero. Bylo černé.
      • Czech agreement with subject:
          Source  Could I use your cell phone?
          Google  Mohl bych používat svůj mobilní telefon? [Could I use my own mobile phone?]
          Moses   Mohl jsem použít svůj mobil? [Was I able to use my own mobile?]

