A Quick Introduction to Neural MT
Christian Hardmeier, 2016-05-16
Why this lecture?

For about 15 years, the MT world was relatively static. The state of the art was defined by phrase-based SMT and syntax-based SMT, both with well-known strengths and weaknesses.
Deep Learning
Image source: http://deeplearning.stanford.edu/wiki/index.php/Exercise%3AVectorization
Continuous-Space Methods
NLP traditionally treated words as discrete, incomparable units. Continuous-space methods map them into a vector space where similarities can be computed. These vectors can be obtained from word co-occurrence counts or learned with deep learning. With deep learning, we can train word embeddings for specific objectives.
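As an illustration of "computing similarities in a vector space", here is a minimal sketch using cosine similarity. The three-dimensional embeddings are made-up toy values, not real trained vectors:

```python
import numpy as np

# Toy 3-dimensional embeddings (made-up values for illustration only).
emb = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, 0.0 for orthogonal ones."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# With useful embeddings, related words end up closer in the space.
sim_cat_dog = cosine(emb["cat"], emb["dog"])
sim_cat_car = cosine(emb["cat"], emb["car"])
```

With discrete word types no such comparison is possible; in the continuous space it is a single dot product.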
Discrete Words
[Figure: a random sample of vocabulary words (Lockheed, ICCAT, pride, aspirin, …), illustrating that discrete word types carry no usable notion of similarity.]
Word Embeddings (Projected)
[Figure: a 2-D projection of trained word embeddings; frequent function words such as determiners, pronouns, prepositions and punctuation tokens cluster together in the space.]
(courtesy of Ali Basirat)
Neural networks
Neural networks are the machine learning paradigm in which most of this happens. They are biologically inspired, but the biological analogy doesn't matter very much. They were very popular in the early 1980s, but the time wasn't ripe. The elementary "neuron" is just a nonlinear function with some trainable parameters. Neurons are combined into a network by function composition.
Logistic Regression

A single neuron computes a weighted sum of its inputs x1 … x5 with weights λ1 … λ5 and squashes it with the logistic function:

f(x) = 1 / (1 + e^(−x))

y = f(∑_i λ_i x_i)
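The two formulas above can be written directly as code; this sketch uses toy input and weight values chosen for illustration:

```python
import math

def logistic(x):
    """The logistic function f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(xs, lambdas):
    """A single neuron: weighted sum of the inputs, squashed by the logistic."""
    return logistic(sum(l * x for l, x in zip(lambdas, xs)))

# Toy inputs x1..x5 and weights λ1..λ5 (arbitrary example values).
y = neuron([1.0, 0.0, 1.0, 0.5, 0.0], [0.2, -0.1, 0.4, 0.3, 0.0])
```

The output y always lies strictly between 0 and 1, which is why this unit doubles as a probabilistic classifier.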
Multiple Decision Steps

Inputs x are mapped to latent features h, which are mapped to outputs y:

h = f(W1 x)
y = f(W2 h)
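Composing the two layers is a pair of matrix-vector products with a nonlinearity in between. A minimal sketch, with randomly initialised (untrained) weights and arbitrary layer sizes:

```python
import numpy as np

def f(x):
    """Element-wise logistic nonlinearity."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 5))  # 5 inputs -> 4 latent features
W2 = rng.normal(size=(3, 4))  # 4 latent features -> 3 outputs

x = np.array([1.0, 0.0, 0.5, -0.5, 1.0])  # example input
h = f(W1 @ x)   # latent features: h = f(W1 x)
y = f(W2 @ h)   # outputs:         y = f(W2 h)
```

The latent features h are what the network learns for itself; only x and y are given during training.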
Training the Network
Neural networks are trained by numerically minimising the error of the output for a training set. The algorithms used are variants of gradient descent. The gradients with respect to all weights can be computed efficiently with a dynamic programming algorithm called back-propagation.
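To make the gradient-descent idea concrete, here is a sketch that trains a single logistic unit on a made-up toy problem. For this one-layer case the cross-entropy gradient has the simple closed form X^T (y − t); for deeper networks, back-propagation computes the corresponding gradients layer by layer:

```python
import numpy as np

def f(x):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

# Toy training set (invented for illustration): the target depends
# only on the first input feature; the third column is a bias term.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
t = np.array([1.0, 1.0, 0.0, 0.0])

w = np.zeros(3)
lr = 1.0
for _ in range(500):
    y = f(X @ w)          # forward pass over the whole training set
    grad = X.T @ (y - t)  # gradient of the cross-entropy error w.r.t. w
    w -= lr * grad        # one gradient-descent step

predictions = f(X @ w)
```

After training, the predictions fall on the correct side of 0.5 for every training example.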
Word Embeddings in Neural Networks
[Figure: a sparse one-hot input vector x is mapped to a dense embedding vector e, which feeds into the remaining network.]
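The sparse-to-dense mapping is just a matrix product with a one-hot vector, which amounts to selecting one row of the embedding matrix. A minimal sketch with an invented three-word vocabulary and random (untrained) embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2}     # toy vocabulary
E = rng.normal(size=(len(vocab), 4))       # one 4-dim embedding per word

def embed(word):
    """One-hot input times the embedding matrix == selecting a row of E."""
    x = np.zeros(len(vocab))
    x[vocab[word]] = 1.0
    return x @ E

dense = embed("cat")  # identical to the direct row lookup E[1]
```

Because the input is one-hot, implementations skip the multiplication entirely and do a table lookup; the embedding matrix E is trained together with the rest of the network.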
Sequence Length Limits
A given network takes a fixed number of inputs. In MT, we need to process input sentences of arbitrary length and produce output of arbitrary length. Input and output length are not necessarily the same.
Input length   Output length        Compression       Network type
fixed          fixed                                   feed-forward
variable       = input (or fixed)                      recurrent
variable       unconstrained        to fixed size      encoder-decoder
variable       unconstrained        no compression     attention-based
Adding a Time Dimension: Recurrent Nets

[Figure: a single recurrent unit: the inputs x_t,1, x_t,2, x_t,3 at time t are combined with the previous output y_{t−1} to produce the current output y_t.]
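The slides do not spell out the recurrence, so the sketch below assumes the standard simple (Elman-style) form h_t = f(W x_t + V h_{t−1}), with random untrained weights and arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 2))  # current input  -> hidden state
V = rng.normal(scale=0.5, size=(3, 3))  # previous state -> hidden state

def rnn(xs):
    """Unroll h_t = tanh(W x_t + V h_{t-1}) over an input sequence."""
    h = np.zeros(3)                     # initial state h_0
    states = []
    for x in xs:
        h = np.tanh(W @ x + V @ h)      # same weights reused at every step
        states.append(h)
    return states

states = rnn([np.array([1.0, 0.0]),
              np.array([0.0, 1.0]),
              np.array([1.0, 1.0])])
```

Because the same W and V are applied at every time step, the network handles sequences of any length with a fixed number of parameters.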
Processing Sequences
[Figure: the network unrolled over a sequence x_1 … x_8: each step has forward connections x_t → h_t → y_t, and recurrent connections link h_{t−1} to h_t.]
Unequal Sequence Length
In this architecture, there are as many inputs x_i as outputs y_i. This is useful for sequence labelling tasks such as POS tagging. In machine translation, however, the lengths of the input and output sequences differ.
Encoder-Decoder Architecture
[Figure: encoder states e_1 … e_8 read the input x_1 x_2 x_3 EOS; decoder states d_1 … d_8 then emit y_1 y_2 y_3 y_4 EOS, feeding each generated word back in as the next input.]

One set of layers of fixed size must hold the contents of the whole input sentence.
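A minimal sketch of this idea, again with random untrained weights: the encoder compresses the whole input into its final hidden state, and the decoder generates outputs from that fixed-size summary. (For simplicity, this sketch omits feeding the generated word back into the decoder.)

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3                                       # hidden-state size (arbitrary)
W_enc = rng.normal(scale=0.5, size=(d, 2))  # input word  -> encoder state
V_enc = rng.normal(scale=0.5, size=(d, d))  # prev. state -> encoder state
V_dec = rng.normal(scale=0.5, size=(d, d))  # prev. state -> decoder state
W_out = rng.normal(scale=0.5, size=(2, d))  # decoder state -> output vector

def encode(xs):
    """Compress the whole input sequence into the final hidden state."""
    h = np.zeros(d)
    for x in xs:
        h = np.tanh(W_enc @ x + V_enc @ h)
    return h

def decode(h, steps):
    """Generate output vectors from the fixed-size summary state."""
    ys = []
    for _ in range(steps):
        h = np.tanh(V_dec @ h)
        ys.append(W_out @ h)
    return ys

summary = encode([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
outputs = decode(summary, steps=3)
```

The bottleneck is visible in the code: however long the input, everything the decoder knows about it is the d-dimensional vector `summary`.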
Attention Mechanism
[Figure: at each decoder step d_1 … d_5, a weighted sum (+) over the encoder states e_1 … e_4 is fed into the decoder, which emits y_1 y_2 y_3 y_4 EOS; the input x_1 x_2 x_3 EOS is never compressed into a single fixed-size vector.]
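The weighted sum over encoder states is the core of the attention mechanism. The slides do not specify the scoring function, so this sketch assumes simple dot-product scoring between each encoder state and the current decoder state:

```python
import numpy as np

def softmax(z):
    """Normalise scores into weights that sum to 1."""
    z = z - z.max()                 # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attend(enc_states, dec_state):
    """One attention step: score every encoder state against the current
    decoder state, normalise, and return the weighted sum (the context
    vector that is fed into the decoder)."""
    scores = np.array([e @ dec_state for e in enc_states])  # dot-product scoring (an assumption)
    weights = softmax(scores)
    context = sum(w * e for w, e in zip(weights, enc_states))
    return context, weights

# Toy encoder states and decoder state (invented values).
enc = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
context, weights = attend(enc, np.array([1.0, 0.0]))
```

Because the decoder recomputes these weights at every output step, it can focus on different parts of the input for each word it generates, instead of relying on one fixed-size summary.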