CRF Word Alignment & Noisy Channel Translation (Machine Translation Lecture 6)


  1. CRF Word Alignment & Noisy Channel Translation Machine Translation Lecture 6 Instructor: Chris Callison-Burch TAs: Mitchell Stern, Justin Chiu Website: mt-class.org/penn

  2. Last Time ... • Marginalizing over alignments:
$$p(\text{Translation}) = \sum_{\text{Alignment}} p(\text{Translation}, \text{Alignment}) = \sum_{\text{Alignment}} p(\text{Alignment}) \times p(\text{Translation} \mid \text{Alignment})$$
• In symbols:
$$p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \times \prod_{i=1}^{m} p(e_i \mid f_{a_i})$$
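A minimal sketch of this marginalization, assuming a hypothetical lexical table `t` mapping `(target_word, source_word)` pairs to probabilities and a uniform alignment prior (the Model 1 assumption):

```python
from itertools import product

def p_e_given_f(e, f, t):
    """Sum over all alignments: p(e | f, m) = sum_a p(a | f, m) * prod_i t(e_i | f_{a_i}).

    e: target words; f: source words with f[0] = NULL;
    t: hypothetical lexical table, t[(e_word, f_word)] = p(e_word | f_word).
    Assumes a uniform alignment prior p(a | f, m) = 1 / (n+1)^m.
    """
    m, n = len(e), len(f) - 1                    # n source words plus NULL
    total = 0.0
    for a in product(range(n + 1), repeat=m):    # every alignment vector in [0, n]^m
        p_a = (1.0 / (n + 1)) ** m               # uniform p(a | f, m)
        p_words = 1.0
        for i, a_i in enumerate(a):
            p_words *= t.get((e[i], f[a_i]), 1e-9)
        total += p_a * p_words
    return total
```

Since each $a_i$ is chosen independently under this prior, the same quantity factorizes as $\prod_{i=1}^{m} \sum_{j=0}^{n} \frac{1}{n+1}\, t(e_i \mid f_j)$, which is how it is computed in practice rather than by enumerating all $(n+1)^m$ alignments.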

  3. [Figure: three alignment grids compared side by side: the MAP alignment, the IBM Model 4 alignment, and our model's alignment]

  4.–6. A few tricks... [three build slides comparing alignments from $p(f \mid e)$ and $p(e \mid f)$]

  7. Another View • With this model:
$$p(e \mid f, m) = \sum_{a \in [0,n]^m} p(a \mid f, m) \times \prod_{i=1}^{m} p(e_i \mid f_{a_i})$$
the problem of word alignment can be posed as:
$$a^* = \arg\max_{a \in [0,n]^m} p(a \mid e, f, m)$$
• Can we model this distribution directly?
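Under the lexical models' independence assumptions, $p(a \mid e, f, m)$ factorizes across target positions, so the best alignment can be read off one word at a time; a sketch using the same hypothetical table `t` as above:

```python
def map_alignment(e, f, t):
    """argmax_a p(a | e, f, m) when the posterior factorizes per position:
    each a_i independently picks the source word (or NULL at index 0)
    that best explains e_i."""
    return [max(range(len(f)), key=lambda j: t.get((e_i, f[j]), 1e-9))
            for e_i in e]
```

The slide's question is whether we can model $p(a \mid e, f, m)$ directly, without routing through a generative story; the CRF introduced below does exactly that.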

  8. Markov Random Fields (MRFs) • For a chain A–B–C with children X, Y, Z, the directed (Bayes net) factorization is:
$$p(A,B,C,X,Y,Z) = p(A)\, p(B \mid A)\, p(C \mid B)\, p(X \mid A)\, p(Y \mid B)\, p(Z \mid C)$$
• The undirected (MRF) factorization over the same graph uses one factor per edge:
$$p(A,B,C,X,Y,Z) = \frac{1}{Z}\, \Psi_1(A,B)\, \Psi_2(B,C)\, \Psi_3(A,X)\, \Psi_4(B,Y)\, \Psi_5(C,Z)$$
• The $\Psi_i$ are called "factors"

  9. Computing Z • For two variables $X, Y$ over $\mathcal{X} = \{a, b, c\}$:
$$Z = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{X}} \Psi_1(x, y)\, \Psi_2(x)\, \Psi_3(y)$$
• When the graph has certain structures (e.g., chains), you can factor the sum to get polynomial-time dynamic programming algorithms:
$$Z = \sum_{x \in \mathcal{X}} \Psi_2(x) \sum_{y \in \mathcal{X}} \Psi_1(x, y)\, \Psi_3(y)$$
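The factoring trick in miniature, with the potentials as hypothetical callables over $\mathcal{X}$; both routines return the same $Z$:

```python
def Z_brute(xs, psi1, psi2, psi3):
    # Direct double sum over all variable assignments.
    return sum(psi1(x, y) * psi2(x) * psi3(y) for x in xs for y in xs)

def Z_factored(xs, psi1, psi2, psi3):
    # Distributive law: pull psi2(x) out of the inner sum. On a chain of
    # length L the same idea yields an O(L * |X|^2) dynamic program
    # instead of an O(|X|^L) enumeration.
    return sum(psi2(x) * sum(psi1(x, y) * psi3(y) for y in xs) for x in xs)
```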

  10. Log-linear models • With the factorization from before,
$$p(A,B,C,X,Y,Z) = \frac{1}{Z}\, \Psi_1(A,B)\, \Psi_2(B,C)\, \Psi_3(A,X)\, \Psi_4(B,Y)\, \Psi_5(C,Z)$$
each factor is parameterized log-linearly:
$$\Psi(x, y) = \exp \sum_k w_k f_k(x, y)$$
where the $w_k$ are weights (learned) and the $f_k$ are feature functions (specified)
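A log-linear factor as a function, assuming a hypothetical `features(x, y)` extractor that returns a dict of feature values:

```python
import math

def potential(x, y, w, features):
    """Log-linear factor Psi(x, y) = exp(sum_k w_k * f_k(x, y)).
    `w` maps feature names to learned weights; `features` is a
    hypothetical extractor returning {feature_name: value}."""
    return math.exp(sum(w.get(name, 0.0) * value
                        for name, value in features(x, y).items()))
```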

  11. Random Fields • Benefits • Potential functions can be defined with respect to arbitrary features (functions) of the variables • Great way to incorporate knowledge • Drawbacks • Likelihood involves computing Z • Maximizing likelihood usually requires computing Z (often over and over again!)

  12. Conditional Random Fields • Use MRFs to parameterize a conditional distribution. Very easy: let feature functions look at anything they want in the "input":
$$p(y \mid x) = \frac{1}{Z_w(x)} \exp \sum_{F \in G(y)} \sum_k w_k f_k(F, x)$$
where the outer sum runs over all factors $F$ in the graph of $y$

  13. Parameter Learning • CRFs are trained to maximize conditional likelihood:
$$\hat{w}_{\textit{MLE}} = \arg\max_w \prod_{(x_i, y_i) \in \mathcal{D}} p(y_i \mid x_i; w)$$
• Recall we want to directly model $p(a \mid e, f)$ • The likelihood of what alignments? Gold reference alignments!
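A brute-force sketch of the conditional log-likelihood and its gradient for one training pair (observed features minus expected features); the candidate set `Y` and feature extractor `feats` are hypothetical, and real CRFs compute the expectation with dynamic programming rather than enumeration:

```python
import math

def conditional_ll_and_grad(x, y_gold, Y, w, feats):
    """log p(y_gold | x; w) and its gradient for one training pair.
    Gradient = observed features minus expected features under the model."""
    def score(y):
        return sum(w.get(k, 0.0) * v for k, v in feats(x, y).items())
    log_Z = math.log(sum(math.exp(score(y)) for y in Y))
    ll = score(y_gold) - log_Z
    grad = dict(feats(x, y_gold))          # observed feature counts
    for y in Y:                            # subtract model expectations
        p_y = math.exp(score(y) - log_Z)
        for k, v in feats(x, y).items():
            grad[k] = grad.get(k, 0.0) - p_y * v
    return ll, grad
```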

  14. CRF for Alignment • One of many possibilities, due to Blunsom & Cohn (2006):
$$p(a \mid e, f) = \frac{1}{Z_w(e, f)} \exp \sum_{i=1}^{|e|} \sum_k w_k f_k(a_i, a_{i-1}, i, e, f)$$
• $a$ has the same form as in the lexical translation models (still makes a one-to-many assumption) • $w_k$ are the model parameters • $f_k$ are the feature functions • Inference over the label chain is $O(n^2 m) \approx O(n^3)$
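A sketch of the partition function for this chain-structured model via the forward algorithm; the feature extractor `feats(a_i, a_prev, i, e, f)` is a hypothetical stand-in for Blunsom & Cohn's features, and the naive log-sum-exp is kept for clarity:

```python
import math

def log_Z_alignment(e, f, w, feats):
    """Partition function Z_w(e, f) for the chain-structured alignment CRF.
    Labels are source positions 0..n (0 = NULL); `feats` gets a_prev=None
    at i = 0. Runtime O(|e| * (n+1)^2), the O(n^2 m) on the slide."""
    n = len(f)
    def s(a_prev, a_i, i):
        return sum(w.get(k, 0.0) * v
                   for k, v in feats(a_i, a_prev, i, e, f).items())
    alpha = [s(None, a, 0) for a in range(n + 1)]     # forward scores at i = 0
    for i in range(1, len(e)):
        alpha = [math.log(sum(math.exp(prev + s(a_prev, a, i))
                              for a_prev, prev in enumerate(alpha)))
                 for a in range(n + 1)]
    return math.log(sum(math.exp(v) for v in alpha))
```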

  15. Model • Labels (one per target word) index positions in the source sentence • Train models in both directions, (e,f) and (f,e) [inverting the reference alignments]

  16. Alignment Experiments • French-English Canadian Hansards corpus • 484 manually word-aligned sentence pairs (100 training, 37 development, 347 testing) • 1.1 million sentence-aligned pairs • Baseline for comparison: Giza++ implementation of IBM Model 4 • (Also experimented on Romanian-English)

  17. [Alignment grid: pervez musharrafs langer abschied ↔ pervez musharraf 's long goodbye] Feature legend: Identical word

  18. [Same grid] Feature legend: Identical word, Matching prefix

  19. [Same grid] Feature legend: Identical word, Matching prefix, Matching suffix

  20. [Same grid] Feature legend: Identical word, Matching prefix, Matching suffix, Orthographic similarity

  21. [Same grid] Feature legend: Identical word, Matching prefix, Matching suffix, Orthographic similarity, In dictionary, ...

  22. Lexical Features • Word ↔ word indicator features • Various word ↔ word co-occurrence scores • IBM Model 1 probabilities (t→s, s→t) • Geometric mean of Model 1 probabilities • Dice's coefficient [binned] • Products of the above
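For concreteness, Dice's coefficient from corpus counts; the counts are assumed to come from the sentence-aligned data, and the actual feature bins this value rather than using the raw number:

```python
def dice_coefficient(cooc, count_e, count_f):
    """Dice's coefficient 2*C(e,f) / (C(e) + C(f)): C(e,f) counts sentence
    pairs where both words occur, C(e) and C(f) their individual counts."""
    denom = count_e + count_f
    return 2.0 * cooc / denom if denom else 0.0
```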

  23. Lexical Features • Word class ↔ word class indicator features • NN translates as NN (NN_NN=1) • NN does not translate as MD (NN_MD=1) • Identical word feature • 2010 = 2010 (IdentWord=1, IdentNum=1) • Identical prefix feature • Obama ~ Obamu (IdentPrefix=1) • Orthographic similarity measure [binned] • Al-Qaeda ~ Al-Kaida (OrthoSim050_080=1)
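A hypothetical rendering of a few of these indicators; the similarity measure (difflib's ratio) and the 4-character prefix length are stand-ins for whatever the original system used:

```python
from difflib import SequenceMatcher

def ortho_features(src, tgt):
    """Indicator features in the style of the slide's examples."""
    feats = {}
    if src == tgt:
        feats["IdentWord"] = 1.0
        if src.isdigit():                        # e.g., 2010 = 2010
            feats["IdentNum"] = 1.0
    elif len(src) >= 4 and src[:4] == tgt[:4]:   # e.g., Obama ~ Obamu
        feats["IdentPrefix"] = 1.0
    sim = SequenceMatcher(None, src, tgt).ratio()
    if 0.5 <= sim < 0.8:                         # binned: Al-Qaeda ~ Al-Kaida
        feats["OrthoSim050_080"] = 1.0
    return feats
```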

  24. Other Features • Compute features from large amounts of unlabeled text • Does the Model 4 alignment contain this alignment point? • What is the Model 1 posterior probability of this alignment point?
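The Model 1 posterior mentioned in the last bullet has a closed form under the uniform alignment prior; a sketch with the same hypothetical lexical table `t` as earlier:

```python
def model1_posterior(i, j, e, f, t):
    """Model 1 posterior p(a_i = j | e, f): under the uniform alignment
    prior, it is the normalized lexical score of source candidate j
    for target word e[i]."""
    scores = [t.get((e[i], f_j), 1e-9) for f_j in f]
    return scores[j] / sum(scores)
```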

  25. Results

  26. Summary • CRFs outperform unsupervised / latent-variable alignment models, even when only a small number of word-aligned sentences are available • A diverse range of features can be incorporated, and they are beneficial to word-alignment quality • Features from unsupervised models can also be incorporated • Unfortunately, you need gold alignments!

  27. Putting the pieces together • We have seen how to model the following: $p(e)$, $p(e \mid f, m)$, $p(e, a \mid f, m)$, $p(a \mid e, f)$

  28. Putting the pieces together • We have seen how to model the following: $p(e)$, $p(e \mid f, m)$, $p(e, a \mid f, m)$, $p(a \mid e, f)$ • Goal: a better model of $p(e \mid f, m)$ that knows about $p(e)$

  29. "One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.'" Warren Weaver to Norbert Wiener, March 1947

  30. [Noisy channel diagram: Message $M$ → Encoder → Sent transmission $X$ → "Noisy" channel → Received transmission $Y$ → Decoder → Recovered message $M'$] Claude Shannon, "A Mathematical Theory of Communication," 1948

  31.–32. [Same diagram, annotated with $p(x \mid y)$, $p(x)$, and $p(y)$] Claude Shannon, "A Mathematical Theory of Communication," 1948

  33. [Same diagram, annotated with $p(x \mid y)$ and $p(y)$] Shannon's theory tells us: 1) how much data you can send; 2) the limits of compression; 3) why your download is so slow; 4) how to translate. Claude Shannon, "A Mathematical Theory of Communication," 1948

  34.–36. [Diagram without the encoder: Sent transmission $Y$ → "Noisy" channel → Received transmission $X$ → Decoder → Recovered message $Y'$; annotated with $p(x \mid y)$ and $p(y)$]

  37. [Same diagram, annotated with $p(x \mid y)$ and $p(y)$] Decoding:
$$y' = \arg\max_y p(y \mid x) = \arg\max_y \frac{p(x \mid y)\, p(y)}{p(x)} = \arg\max_y p(x \mid y)\, p(y)$$
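The derivation in code: dividing by $p(x)$ scales every candidate equally, so it cannot change the winner, and the decoder only needs the channel model and the prior. All arguments here are hypothetical callables and candidate sets:

```python
def decode(x, Y, p_x_given_y, p_y):
    """Noisy-channel decoding: y' = argmax_y p(x|y) * p(y).
    Omitting the constant denominator p(x) leaves the argmax unchanged."""
    return max(Y, key=lambda y: p_x_given_y(x, y) * p_y(y))
```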

  38. [Same diagram] The channel gives us $p(x \mid y)$, but the decoder needs $p(y \mid x)$, and $p(x \mid y) \neq p(y \mid x)$. Bayes' rule ("I can help.") bridges the two:
$$y' = \arg\max_y p(y \mid x) = \arg\max_y \frac{p(x \mid y)\, p(y)}{p(x)} = \arg\max_y p(x \mid y)\, p(y)$$

  40. [Same diagram]
$$y' = \arg\max_y p(y \mid x) = \arg\max_y \frac{p(x \mid y)\, p(y)}{p(x)} = \arg\max_y p(x \mid y)\, p(y)$$
The denominator doesn't depend on $y$.

  42. [Same diagram]
$$y' = \arg\max_y p(x \mid y)\, p(y)$$

  43. [Same diagram, relabeled for translation: English is sent, "French" is received, English′ is recovered]
$$y' = \arg\max_y p(x \mid y)\, p(y) \qquad\Rightarrow\qquad e' = \arg\max_e p(f \mid e)\, p(e)$$

  44. [Same diagram]
$$e' = \arg\max_e \underbrace{p(f \mid e)}_{\text{translation model}}\, p(e)$$
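The same decision rule for translation, scored in log space to avoid underflow; `candidates`, `tm_logprob`, and `lm_logprob` are hypothetical stand-ins, and a real decoder searches the space of translations rather than enumerating it:

```python
def translate(f, candidates, tm_logprob, lm_logprob):
    """e' = argmax_e p(f|e) * p(e): translation model score plus
    language model score, both in log space."""
    return max(candidates, key=lambda e: tm_logprob(f, e) + lm_logprob(e))
```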
