
Machine Translation

– Classical and Statistical Approaches

Session 6: Statistical MT – Intro (1)

Jonas Kuhn
Universität des Saarlandes, Saarbrücken / The University of Texas at Austin
jonask@coli.uni-sb.de
DGfS/CL Fall School 2005, Ruhr-Universität Bochum, September 19-30, 2005


Week 2: Overview

• Data-driven, statistical approaches to MT
• The noisy channel model [Brown et al. 1990, Knight 1999]
• Language modeling
• Translation modeling
• Word alignment
• Phrase alignment [Koehn et al. 2003]
• Decoding [Koehn 2004]
• Lab exercise: building a phrase-based statistical MT system from parallel texts taken from the Internet
• Evaluation methods
• Other uses of word alignments [Yarowsky et al. 2001]


Sessions 6/7: Statistical MT – Intro

Acknowledgements: Some slides are borrowed from Kevin Knight, University of Southern California, from Colin Cherry, Alberta (see http://www.cs.ualberta.ca/~colinc), and from Leila Kosseim (http://www.cs.concordia.ca/~kosseim/).

• “Translation without understanding”
• Very brief introduction to probabilities
• The noisy channel model for translation
• Language modeling
• Translation modeling
• Decoding


Translation without understanding?

Translation is easy for (bilingual) people. Process:
• Read the text in French
• Understand it
• Write it down in English

Hard for computers: the human process is invisible, intangible.


One approach: Rule-based MT

Compare week 1. Problems:
• Building a broad-coverage system is an enormous engineering challenge
• Adding new languages/text domains is very costly
• Many disambiguation decisions cannot be made without world knowledge/contextual knowledge


Alternative Approach: Statistical MT

Go back to Warren Weaver’s idea of using statistical techniques: find the most probable translation of a given sentence.

We want to translate from French to English.

Task: given a French sentence, what is the most probable English translation?

Notation: Find E* = arg max_E P(E|F)


Data-Driven Machine Translation

[Cartoon: someone poring over a stack of translated documents] “Hmm, every time he sees ‘banco’, he either types ‘bank’ or ‘bench’ … but if he sees ‘banco de…’, he always types ‘bank’, never ‘bench’… Man, this is so boring.”

Slide from Kevin Knight


Centauri/Arcturan [Knight, 1997]

  • 1a. ok-voon ororok sprok .
  • 1b. at-voon bichat dat .
  • 2a. ok-drubel ok-voon anok plok sprok .
  • 2b. at-drubel at-voon pippat rrat dat .
  • 3a. erok sprok izok hihok ghirok .
  • 3b. totat dat arrat vat hilat .
  • 4a. ok-voon anok drok brok jok .
  • 4b. at-voon krat pippat sat lat .
  • 5a. wiwok farok izok stok .
  • 5b. totat jjat quat cat .
  • 6a. lalok sprok izok jok stok .
  • 6b. wat dat krat quat cat .
  • 7a. lalok farok ororok lalok sprok izok enemok .
  • 7b. wat jjat bichat wat dat vat eneat .
  • 8a. lalok brok anok plok nok .
  • 8b. iat lat pippat rrat nnat .
  • 9a. wiwok nok izok kantok ok-yurp .
  • 9b. totat nnat quat oloat at-yurp .
  • 10a. lalok mok nok yorok ghirok clok .
  • 10b. wat nnat gat mat bat hilat .
  • 11a. lalok nok crrrok hihok yorok zanzanok .
  • 11b. wat nnat arrat mat zanzanat .
  • 12a. lalok rarok nok izok hihok mok .
  • 12b. wat nnat forat arrat vat gat .

Exercise: translate this to Arcturan:

farok crrrok hihok yorok clok kantok ok-yurp
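A data-driven sketch of how the exercise can be attacked automatically (not part of the original slides): count which Arcturan words co-occur with each Centauri word across the twelve sentence pairs and score candidate translations with a Dice coefficient. The corpus is hard-coded from the slide above; the scoring heuristic is a deliberately naive stand-in for the alignment models discussed later in the week.

    from collections import Counter
    from itertools import product

    # The twelve Centauri/Arcturan sentence pairs from the slide.
    pairs = [
        ("ok-voon ororok sprok .", "at-voon bichat dat ."),
        ("ok-drubel ok-voon anok plok sprok .", "at-drubel at-voon pippat rrat dat ."),
        ("erok sprok izok hihok ghirok .", "totat dat arrat vat hilat ."),
        ("ok-voon anok drok brok jok .", "at-voon krat pippat sat lat ."),
        ("wiwok farok izok stok .", "totat jjat quat cat ."),
        ("lalok sprok izok jok stok .", "wat dat krat quat cat ."),
        ("lalok farok ororok lalok sprok izok enemok .", "wat jjat bichat wat dat vat eneat ."),
        ("lalok brok anok plok nok .", "iat lat pippat rrat nnat ."),
        ("wiwok nok izok kantok ok-yurp .", "totat nnat quat oloat at-yurp ."),
        ("lalok mok nok yorok ghirok clok .", "wat nnat gat mat bat hilat ."),
        ("lalok nok crrrok hihok yorok zanzanok .", "wat nnat arrat mat zanzanat ."),
        ("lalok rarok nok izok hihok mok .", "wat nnat forat arrat vat gat ."),
    ]

    cooc, c_count, a_count = Counter(), Counter(), Counter()
    for c_sent, a_sent in pairs:
        c_words, a_words = set(c_sent.split()), set(a_sent.split())
        c_count.update(c_words)
        a_count.update(a_words)
        cooc.update(product(c_words, a_words))

    def best_translation(c_word):
        # Dice coefficient: rewards words that occur in the same sentence
        # pairs and penalizes promiscuous tokens like the final "." .
        scores = {a: 2 * n / (c_count[c_word] + a_count[a])
                  for (c, a), n in cooc.items() if c == c_word}
        return max(scores, key=scores.get)

    for w in "farok crrrok hihok yorok clok kantok ok-yurp".split():
        print(w, "->", best_translation(w))

Even this crude heuristic recovers most of the intended correspondences (e.g. hihok -> arrat, ok-yurp -> at-yurp); words like crrrok, which have no consistent counterpart in the target sentences, are exactly where real alignment models earn their keep.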


Recent Progress in Statistical MT

MT output on the same source text, one year apart:

2002: “insistent Wednesday may recurred her trips to Libya tomorrow for flying Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company ‘insistent for flying’ may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment . And said the official ‘the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air , a situation her receiving replying are so a trip will pull to Libya a morning Wednesday’ .”

2003: “Egyptair Has Tomorrow to Resume Its Flights to Libya Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya. The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning.”

Slide from C. Wayne, DARPA


Very brief intro to probabilities

Using “common sense”, we are pretty good at dealing with the likelihood of (random) events.

Probability functions assign a value between 0 and 1 to the occurrence of a particular outcome of a random event.

Example: rolling a die – P(rolling a particular face) = 1/6 ≈ 0.1667

We need some terminology and notation.


Pop star example

Assume you are a photo reporter and want to take an exclusive picture of an international pop star who’s on tour in Germany.

There are rumors that certain concerts will get cancelled.

You want to guess what route the pop star will take through Germany:
• Each route has a certain probability
• Wait at a location along the route with the highest probability to take the picture


Probabilities

  • Simple probability (prior probability): P(A)

You call up the tour manager and ask whether the concert in Berlin (CiB) will be cancelled or not: “With 60% probability the concert will take place.”

  • Conditional probability (posterior probability): P(A|B)

If the pop star has a concert in Berlin, how likely is it that she will visit the Reichstagsgebäude (Rtg)?
One out of four pop stars who gives a concert in Berlin also visits the Reichstagsgebäude.
Only 10% of the pop stars who don’t give a concert in Berlin visit the Reichstagsgebäude.

P(CiB) = 0.6   P(nCiB) = 0.4   P(Rtg | CiB) = 0.25   P(Rtg | nCiB) = 0.1


Calculations with probabilities

  • How likely is it that the pop star will show up at the Reichstag [what is P(Rtg)]?

All we have are conditional probabilities for the pop star visiting the Reichstag, so we have to consider both options for the precondition.

  • Joint probability P(A,B)

P(Rtg, CiB) = P(CiB) × P(Rtg | CiB) = 0.6 × 0.25 = 0.15
P(Rtg, nCiB) = P(nCiB) × P(Rtg | nCiB) = 0.4 × 0.1 = 0.04

Since CiB and nCiB cover the full space of possibilities we get:

P(Rtg) = P(Rtg, CiB) + P(Rtg, nCiB) = 0.15 + 0.04 = 0.19

  • What’s the use of an exact value like this?

Comparison with alternative options, e.g., P(FRA_Airport) = 0.25
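The same computation as a few lines of Python (a sketch; the numbers are the ones from the slide):

    # Probabilities from the pop star example.
    p_cib = 0.6               # P(CiB): concert in Berlin takes place
    p_ncib = 1 - p_cib        # P(nCiB)
    p_rtg_given_cib = 0.25    # P(Rtg | CiB)
    p_rtg_given_ncib = 0.1    # P(Rtg | nCiB)

    # Chain rule gives the joint probabilities ...
    p_rtg_and_cib = p_cib * p_rtg_given_cib      # 0.15
    p_rtg_and_ncib = p_ncib * p_rtg_given_ncib   # 0.04

    # ... and summing over the exhaustive cases gives the marginal.
    p_rtg = p_rtg_and_cib + p_rtg_and_ncib
    print(round(p_rtg, 2))   # 0.19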


Calculations with probabilities

We just exploited the fact that joint probabilities [i.e., P(A,B)] can be calculated by multiplying the prior probability for one event with the conditional probability for the other event, given the first event.

This is called the “chain rule”.

We can go either way (because P(A,B) = P(B,A)):

P(A,B) = P(A) × P(B | A)   or   P(A,B) = P(B) × P(A | B)

So: P(B) × P(A | B) = P(A) × P(B | A)

Divide both sides of the equation by P(B):

P(A | B) = P(B | A) × P(A) / P(B)


Bayes’ Law

This is called Bayes’ Law:

P(A | B) = P(B | A) × P(A) / P(B)

Importance: Often, training [i.e., statistical parameter estimation from a sample of random experiments] for one of the two conditional probabilities can be done much more reliably than for the other one.


Bayes’ Law

When we are only looking for the most likely outcome A* for an event, given a fixed event B, the denominator doesn’t play a role:

P(A | B) = P(B | A) × P(A) / P(B)

A* = arg max_A P(A | B) = arg max_A [ P(B | A) × P(A) / P(B) ] = arg max_A P(B | A) × P(A)
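Applied to the pop star numbers (a sketch, not from the slides): if we learn that the star was seen at the Reichstag (B = Rtg), dropping the constant denominator P(Rtg) is enough to decide which scenario A is more likely:

    # Priors P(A) and likelihoods P(Rtg | A) from the earlier slide.
    prior = {"CiB": 0.6, "nCiB": 0.4}
    likelihood = {"CiB": 0.25, "nCiB": 0.1}

    # arg max_A P(Rtg | A) * P(A); P(Rtg) is the same for every A,
    # so it can be left out of the comparison.
    scores = {a: likelihood[a] * prior[a] for a in prior}
    print(max(scores, key=scores.get))   # CiB (scores: 0.15 vs. 0.04)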


Crime Scene Analogy

B is a crime scene. A is a person who may have committed the crime.

P(A|B) - look at the scene - who did it?
P(A) - who had a motive? (Profiler)
P(B|A) - could they have done it? (transportation, access to weapons, alibi)

Some people might have great motives, but no means - you need both!


Back to translation

We want to translate from French to English.

Task: given a French sentence, what is the most probable English translation?

Notation: Find E* = arg max_E P(E|F)

With Bayes’ law we can instead search for the E that maximizes P(F|E) × P(E):

Find the English string E for which the product of P(E) [language model probability] times P(F|E) [translation model probability E → F] is maximal.


Why Bayes’ rule at all?

Why not model P(E|F) directly? The P(F|E) × P(E) decomposition allows us to be sloppy:

P(E) worries about good English: fluency
P(F|E) worries about French that matches English: faithfulness

The two can be trained independently.


On voit Jon à la télévision

Candidate English translations, each judged on the two criteria (the two columns of the original table): good match to French? [P(F|E)] and good English? [P(E)]

  • Jon was not happy.
  • TV in Jon appeared.
  • TV appeared on Jon.
  • Jon appeared on TV.
  • Jon is happy today.
  • In Jon appeared TV.
  • Appeared on Jon TV.
  • Jon appeared in TV.

Table borrowed from Jason Eisner


Fluency vs. Faithfulness

Note that even theoretically, it is sometimes impossible to have a translation that is maximally faithful to the source language, but also fluent in the target language.

Example:
Japanese: “fukaku hansei shite orimasu”
Fluent translation: “we apologize”
Faithful translation: “we are deeply reflecting (on our past behaviour, and what we did wrong, and how to avoid the problem next time)”


The Noisy Channel Model

Statistical MT is based on the noisy channel model.

Developed by Shannon to model communication (e.g., over a phone line).


The Noisy Channel Model

Noisy channel model in SMT (example: F → E):

Assume that the true text is in English, but when it was transmitted over the noisy channel, it somehow got corrupted and came out in French.

I.e., the noisy channel has deformed/corrupted the original English input into French.

So really… French is a form of noisy English ☺

The task is to recover the original English sentence (or to decode the French into English).


We need three things (for F → E)

1. A Language Model of English: P(E)
  • Measures fluency
  • Probability of an English sentence
  • ~ Provides a set of fluent sentences to test for potential translation

2. A Translation Model: P(F | E)
  • Measures faithfulness
  • Probability of a (French, English) pair (given the English sentence)
  • ~ Tests if a given fluent sentence is a translation

3. A Decoder: arg max
  • An effective and efficient search technique to find E*
  • The search space is infinite and rather unstructured, so heuristic search has to be applied
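How the three pieces fit together, as a toy sketch (not from the slides): both models are stand-in probability tables over Jason Eisner’s example sentences with hypothetical numbers, and the “decoder” merely scores a fixed candidate list, whereas a real decoder searches a vast space of candidate strings:

    import math

    F = "On voit Jon à la télévision"

    # Stand-in language model P(E): hypothetical, hard-coded numbers.
    lm = {"Jon appeared on TV.": 0.01,
          "In Jon appeared TV.": 0.00001,
          "Jon is happy today.": 0.02}

    # Stand-in translation model P(F | E): likewise hypothetical.
    tm = {(F, "Jon appeared on TV."): 0.1,
          (F, "In Jon appeared TV."): 0.3,
          (F, "Jon is happy today."): 0.0001}

    # "Decoder": arg max_E P(F | E) * P(E), in log space for stability.
    def decode(f, candidates):
        return max(candidates,
                   key=lambda e: math.log(tm[(f, e)]) + math.log(lm[e]))

    print(decode(F, list(lm)))   # Jon appeared on TV.

Note how the faithful-but-disfluent candidate wins on P(F|E) alone and the fluent-but-unfaithful one wins on P(E) alone; only the product singles out “Jon appeared on TV.”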


Where will we get P(E)?

Language modeling is a common task in Natural Language Processing.

Application contexts (besides MT):
• Speech recognition
• Handwriting recognition
• Augmentative communication systems for the disabled
• Context-sensitive spelling error correction (see example on next slide)

Introduction in chapter 6 of Jurafsky, D. and J. H. Martin (2000): Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Prentice-Hall.


N-gram language models (Quick intro)

Given a sequence of words, what will be the next word?

Hard to guess – but if we don’t demand extremely high accuracy, it is not that hard:

I’d like to make a collect …
  • … call
  • … telephone
  • … international


Probability of a sequence of words

Two closely related problems:
• Guessing the next word
• Computing the probability of a sequence of words


Counting words in corpora

To estimate probabilities, we need to count frequencies.

What do people count?
• Word forms
• Lemmas

The type/token distinction:
• Number of (word form) types: distinct words in a corpus (i.e., the size of the vocabulary)
• Number of (word form) tokens: total number of running words


Counting words in corpora

Switchboard corpus (spoken English):
• 2.4 million word form tokens
• c. 20,000 word form types

Shakespeare’s complete works:
• 884,647 word form tokens
• 29,066 word form types

Brown corpus:
• 1 million word form tokens
• 61,805 word form types (37,851 lemma types)


Estimating word probabilities

How probable is an English word (form) w1 as the next word in a sequence?

Simplest model: every word has the same probability of occurring. Assume vocabulary size 100,000.

Single word: the probability of finding w1 is

P(w1) = 1/100,000

Word w_n in a sequence, assuming conditional independence from the context:

P(w_n | w_1 ... w_{n-1}) = P(w_n) = 1/100,000


Estimating word probabilities

Sequence of two words w1 w2, still assuming:
• that each word form is equally likely
• that w1 and w2 are conditionally independent from each other
• that w1 and w2 are conditionally independent from the context

P(w1, w2) = P(w1) × P(w2 | w1) = P(w1) × P(w2) = 1/100,000 × 1/100,000 = 1/10,000,000,000 = 10^-10


A slightly more complex model

Still assume that any word can follow any other word.

Take into account that different word forms occur with different frequencies:
• “the” occurs 69,971 times in the 1,000,000 tokens of the Brown corpus
• “rabbit” occurs 11 times in the Brown corpus
• “Austin” occurs 20 times
• “linguist” occurs 13 times


A slightly more complex model

Estimating probabilities based on relative frequency.

Sample: 1,000,000 trials of producing a random English word (N = 1,000,000).

Relative frequency of outcome u:

f_u = C(u) / N

f_the = C(w_i = the) / N = 69,971 / 1,000,000 ≈ .07
f_rabbit = C(w_i = rabbit) / N = 11 / 1,000,000 ≈ .00001
f_Austin = C(w_i = Austin) / N = 20 / 1,000,000 = .00002


Conditional probability of a word

Relative frequencies are not a good model for the probability of words in a given context:

Just then, the white …   [P(the) = .07, P(rabbit) = .00001]

We should take the previous words that have occurred into account.

We will get: P(rabbit | white) > P(rabbit)

P(rabbit | white) = P(rabbit, white) / P(white)


Probability of a string of words

Using the chain rule of probability:

P(w_1 ... w_n) = P(w1) × P(w2 | w1) × P(w3 | w1 w2) × ... × P(w_n | w_1 ... w_{n-1}) = ∏_{k=1..n} P(w_k | w_1 ... w_{k-1})

But how can we estimate such probabilities? If we wanted to count the frequency of every word appearing after a long sequence of other words, we would need a far too large corpus as a sample.


Chain of probabilities

P(USA, Ffm, Köln, Berlin, Potsdam) =
P(USA) × P(Ffm | USA) × P(Köln | USA, Ffm) × P(Berlin | USA, Ffm, Köln) × P(Potsdam | USA, Ffm, Köln, Berlin)

[Map graphic of possible tour stops: USA, Frankfurt/M. Flughafen, Köln, Düsseldorf, Hamburg City, Berlin City, Berlin/Tegel, Potsdam, Dresden, München City, München Flughafen, Stuttgart]


Probability of a string of words

Approximate the probability P(w_n | w_1 ... w_{n-1}).

We have to form equivalence classes over word contexts, so we get a larger sample from which to estimate probabilities.

Simple approximation: look only at one preceding word:

P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-1})


Bigram model

Approximate P(rabbit | Just the other day I saw a) by P(rabbit | a).

Markov assumption: predicting a future event based on a limited window of past events.

Bigrams: first-order Markov model (looking back one token into the past).


N-gram models

Bigram model: first-order Markov model, looking back one token

P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-1})

Trigram model: second-order Markov model, looking back two tokens

P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-2} w_{n-1})

N-gram model: (N-1)-th order Markov model, looking back N-1 tokens

P(w_n | w_1 ... w_{n-1}) ≈ P(w_n | w_{n-N+1} ... w_{n-2} w_{n-1})


Bigram approximation of string probability

Chain rule:

P(w_1 ... w_n) = P(w1) × P(w2 | w1) × P(w3 | w1 w2) × ... × P(w_n | w_1 ... w_{n-1}) = ∏_{k=1..n} P(w_k | w_1 ... w_{k-1})

Simplifying assumption:

P(w_k | w_1 ... w_{k-1}) ≈ P(w_k | w_{k-1})

Resulting equation (bigram language model):

P(w_1 ... w_n) ≈ ∏_{k=1..n} P(w_k | w_{k-1})


Bigram language model example

Berkeley Restaurant Project (corpus of c. 10,000 sentences)

Most likely words to follow “eat”:

eat on .16        eat some .06        eat lunch .06      eat dinner .05
eat at .04        eat a .04           eat Indian .04     eat today .03
eat Thai .03      eat breakfast .03   eat in .02         eat Chinese .02
eat Mexican .02   eat tomorrow .01    eat dessert .007   eat British .001


Bigram probabilities

<s> I .25      I want .32    want to .65     to eat .26     British food .60
<s> I’d .06    I would .29   want a .05      to have .14    British restaurant .15
<s> Tell .04   I don’t .08   want some .04   to spend .09   British cuisine .01
<s> I’m .02    I have .04    want thai .01   to be .02      British lunch .01


Computing the sentence probability

P(<s> I want to eat British food)
  = P(I | <s>) × P(want | I) × P(to | want) × P(eat | to) × P(British | eat) × P(food | British)
  = .25 × .32 × .65 × .26 × .002 × .60
  ≈ .000016
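The same product in Python (a sketch using the probabilities from the tables above):

    # Bigram probabilities from the Berkeley Restaurant Project slides.
    bigram_p = {("<s>", "I"): .25, ("I", "want"): .32, ("want", "to"): .65,
                ("to", "eat"): .26, ("eat", "British"): .002,
                ("British", "food"): .60}

    def sentence_prob(words):
        # Bigram approximation: multiply P(w_k | w_{k-1}) along the sentence.
        p = 1.0
        for prev, cur in zip(words, words[1:]):
            p *= bigram_p[(prev, cur)]
        return p

    print(sentence_prob("<s> I want to eat British food".split()))
    # ≈ 1.6e-05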


Training N-gram models

Counting and normalizing:

Count occurrences of a bigram (say, “eat lunch”); divide by the total count of bigrams sharing the first word (i.e., “eat w” for any w):

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / Σ_w C(w_{n-1} w) = C(w_{n-1} w_n) / C(w_{n-1})


Training N-gram models

General case of N-gram parameter estimation, by relative frequency (an example of the Maximum Likelihood Estimation (MLE) technique):

P(w_n | w_{n-N+1} ... w_{n-1}) = C(w_{n-N+1} ... w_{n-1} w_n) / C(w_{n-N+1} ... w_{n-1})
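As a sketch, the count-and-normalize recipe for bigrams in a few lines of Python (the toy corpus is hypothetical):

    from collections import Counter

    def train_bigram_lm(sentences):
        # MLE: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
        unigram, bigram = Counter(), Counter()
        for sent in sentences:
            words = ["<s>"] + sent.split()
            unigram.update(words[:-1])            # history counts C(w_{n-1})
            bigram.update(zip(words, words[1:]))  # bigram counts C(w_{n-1} w_n)
        return {(h, w): c / unigram[h] for (h, w), c in bigram.items()}

    corpus = ["I want to eat Chinese food",
              "I want to eat lunch",
              "I want British food"]
    p = train_bigram_lm(corpus)
    print(p[("want", "to")])   # 2/3: "want" occurs 3 times, "want to" twice
    print(p[("to", "eat")])    # 1.0: every "to" is followed by "eat"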


Relative frequency: example

Bigram counts from the Berkeley Restaurant Project:

          I      want   to     eat    Chinese  food   lunch
I         8      1087   0      13     0        0      0
want      3      0      786    0      6        8      6
to        3      0      10     860    3        0      12
eat       0      0      2      0      19       2      52
Chinese   2      0      0      0      0        120    1
food      19     0      17     0      0        0      0
lunch     4      0      0      0      0        1      0


Relative frequency: example

Unigram counts from corpus

I 3437   want 1215   to 3256   eat 938   Chinese 213   food 1506   lunch 459


Relative frequency: example

Bigram probabilities (after normalizing, i.e., through dividing by unigram counts):

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

          I       want   to     eat    Chinese  food   lunch
I         .0023   .32    0      .0038  0        0      0
want      .0025   0      .65    0      .0049    .0066  .0049
to        .00092  0      .0031  .26    .00092   0      .0037
eat       0       0      .0021  0      .020     .0021  .055
Chinese   .0094   0      0      0      0        .56    .0047
food      .013    0      .011   0      0        0      0
lunch     .0087   0      0      0      0        .0022  0


Language modeling: end of intro…

Plain relative frequency estimation is problematic:

• Unobserved N-grams are assigned zero probability
• Problematic with lower-frequency words

“Smoothing” techniques reserve some probability mass for unobserved events.

Build your own language model: CMU Statistical Language Modeling Toolkit
http://mi.eng.cam.ac.uk/~prc14/toolkit.html
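To make the smoothing idea concrete, here is a sketch of add-one (Laplace) smoothing, the simplest such technique (not necessarily the one the toolkit uses). The counts are taken from the restaurant tables above; V = 1616 is an assumed vocabulary size:

    from collections import Counter

    # Unsmoothed counts in the style of the restaurant corpus tables.
    unigram = Counter({"eat": 938, "to": 3256})
    bigram = Counter({("eat", "lunch"): 52, ("to", "eat"): 860})
    V = 1616   # assumed vocabulary size

    def p_laplace(h, w):
        # Add-one smoothing: (C(h w) + 1) / (C(h) + V); unseen bigrams
        # now get a small nonzero probability instead of zero.
        return (bigram[(h, w)] + 1) / (unigram[h] + V)

    print(p_laplace("eat", "lunch"))    # ≈ 0.021 (was .055 unsmoothed)
    print(p_laplace("eat", "unicorn"))  # ≈ 0.0004, no longer zero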
