A Quick Introduction to Neural MT

Christian Hardmeier 2016-05-16

Why this lecture?

For about 15 years, the MT world was relatively static. State of the art defined by phrase-based SMT and syntax-based SMT. Well-known strengths and weaknesses. Neural MT is a new, quite different approach to MT that seems to outperform the previous methods.

This lecture covers three ingredients: deep learning, continuous-space NLP, neural networks.

Deep Learning

Machine learning paradigm that gained popularity very recently. First breakthroughs in computer vision. Multiple layers of prediction: “automated feature engineering”.


[Figure. Image source: http://deeplearning.stanford.edu/wiki/index.php/Exercise%3AVectorization]

Continuous-Space Methods

NLP traditionally treated words as discrete, incomparable units. Continuous-space methods map them into a vector space where similarities can be computed. Methods: word co-occurrence statistics or deep learning. With deep learning, we can train word embeddings for specific objectives.
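As a toy illustration (the three-dimensional vectors below are made up, not trained embeddings), similarity in such a vector space can be measured with the cosine of the angle between word vectors:

```python
import numpy as np

# Made-up 3-dimensional "embeddings", for illustration only.
vectors = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: near 1 for similar directions, near 0 for unrelated ones."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(vectors["king"], vectors["queen"]))  # high: similar words
print(cosine(vectors["king"], vectors["apple"]))  # low: dissimilar words
```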

Discrete Words

[Figure: a scatter of arbitrary words (“Lockheed”, “solution”, “Pacific”, “marched”, “Genocide”, …) placed without any meaningful geometry; as discrete symbols, no two words are comparable.]

Word Embeddings (Projected)

[Figure: two-dimensional projection of trained word embeddings; frequent function words such as “the”, “a”, “that”, “is”, “was”, “with” dominate, and words with similar functions cluster together.]
(courtesy of Ali Basirat)

Neural Networks

Neural networks are the machine learning paradigm in which most of this happens. Biologically inspired, but that doesn’t matter very much. Very popular in the early 1980s, but the time wasn’t ripe. The elementary “neuron” is just a nonlinear function with some trainable parameters. Neurons are combined into a network by function composition.

Logistic Regression

[Diagram: inputs x1…x5 are each multiplied by a weight λi and summed; the logistic function f produces the output y.]

f(x) = 1 / (1 + e^(−x))

y = f(Σ_i λ_i x_i)
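A minimal sketch of this unit in Python (the input and weight values are arbitrary, chosen only for illustration):

```python
import numpy as np

def f(x):
    # Logistic function: f(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

x   = np.array([1.0, 0.5, -0.3, 0.8, 0.0])   # inputs x1..x5 (made up)
lam = np.array([0.2, -0.1, 0.4, 0.3, 0.5])   # weights λ1..λ5 (made up)

y = f(lam @ x)   # y = f(Σ_i λ_i x_i)
print(y)
```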



Multiple Decision Steps

Inputs x are mapped to latent features h, which are mapped to outputs y:

h = f(W1 x)
y = f(W2 h)
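A minimal forward pass for such a two-layer network, with random (untrained) weights of arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 5))   # maps inputs x to latent features h
W2 = rng.normal(size=(3, 4))   # maps latent features h to outputs y

def f(x):
    # Logistic nonlinearity, applied elementwise.
    return 1.0 / (1.0 + np.exp(-x))

x = rng.normal(size=5)   # some input vector
h = f(W1 @ x)            # h = f(W1 x): first decision step
y = f(W2 @ h)            # y = f(W2 h): second decision step
```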

slide-5
SLIDE 5

Training the Network

Neural networks are trained by numerically minimising the error of the output for a training set. The algorithms used are variants of gradient descent. The gradients with respect to all weights can be computed efficiently with a dynamic programming algorithm called back-propagation.
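A sketch of such a training loop for the one-layer logistic unit above, with the gradient derived by hand via the chain rule; back-propagation generalises exactly this computation to deeper networks. The data, loss (squared error) and learning rate are illustrative assumptions:

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy training set: two features plus a constant bias input;
# the targets t implement logical OR of the two features.
X = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 1.],
              [0., 0., 1.]])
t = np.array([1., 1., 1., 0.])

lam = np.zeros(3)   # weights λ, initialised to zero
eta = 0.5           # learning rate (assumed value)

for _ in range(1000):
    y = f(X @ lam)                          # forward pass: y = f(Σ_i λ_i x_i)
    # Gradient of the squared error ½Σ(y−t)² w.r.t. λ, by the chain rule:
    grad = X.T @ ((y - t) * y * (1.0 - y))
    lam -= eta * grad                       # gradient descent step

print(np.round(f(X @ lam)))   # outputs move towards the targets t
```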

Word Embeddings in Neural Networks

[Diagram: a sparse one-hot input x selects a row of the embedding matrix, producing a dense embedding e (e.g. .10 .32 .95 …) that feeds into the remaining network.]
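The lookup itself is just a matrix product with a one-hot vector, which is why embeddings can be trained like any other weights. A minimal sketch (toy vocabulary and random embedding matrix assumed):

```python
import numpy as np

vocab = ["the", "cat", "sat"]            # toy vocabulary
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 4))     # embedding matrix: one dense row per word

word_id = vocab.index("cat")
x = np.eye(len(vocab))[word_id]          # sparse one-hot input x
e = x @ E                                # dense embedding e
assert np.allclose(e, E[word_id])        # identical to looking up row word_id
```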

Sequence Length Limits

A given network takes a fixed number of inputs. In MT, we need to process input sentences of arbitrary length and produce output of arbitrary length. Input and output length are not necessarily the same.

Input length   Output length        Compression      Network type
fixed          fixed                —                feed-forward
variable       = input (or fixed)   —                recurrent
variable       unconstrained        to fixed size    encoder-decoder
variable       unconstrained        no compression   attention-based


Adding a Time Dimension: Recurrent Nets

[Diagram: at each time step t, the current inputs xt,1…xt,3 and the previous output yt−1 feed into the computation of yt (the recurrent connection).]

Processing Sequences

[Diagram: an unrolled recurrent network. Inputs x1…x8 feed forward into hidden states h1…h8 and onwards into outputs y1…y8; recurrent connections link each ht−1 to ht.]
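A minimal forward pass through such an unrolled network, with random untrained weights and arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5)) * 0.1   # input -> hidden (forward connections)
U = rng.normal(size=(3, 3)) * 0.1   # hidden -> hidden (recurrent connections)
V = rng.normal(size=(2, 3)) * 0.1   # hidden -> output (forward connections)

xs = rng.normal(size=(8, 5))        # toy input sequence x1..x8
h = np.zeros(3)                     # initial hidden state
ys = []
for x in xs:
    h = np.tanh(W @ x + U @ h)      # h_t depends on x_t and h_{t-1}
    ys.append(V @ h)                # one output y_t per input x_t
```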

Unequal Sequence Length

In this architecture, there are as many inputs xi as there are outputs yi. This is useful for sequence labelling tasks such as POS tagging. In machine translation, however, the lengths of the input and output sequences differ.


Encoder-Decoder Architecture

[Diagram: encoder states e1…e8 read the input x1 x2 x3 followed by EOS; decoder states d1…d8 emit y1…y4 followed by EOS, each emitted word being fed back as the next decoder input.]

One set of layers of fixed size must hold the contents of the whole input sentence.
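A sketch of the two phases with toy, untrained weights; the sizes, the start symbol and the greedy decoding loop are illustrative assumptions, not the exact architecture from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, n_vocab = 5, 3, 4                  # toy sizes; the last word id is EOS
EOS = n_vocab - 1
We = rng.normal(size=(d_hid, d_in)) * 0.1       # encoder: input -> hidden
Ue = rng.normal(size=(d_hid, d_hid)) * 0.1      # encoder: hidden -> hidden
Ud = rng.normal(size=(d_hid, d_hid)) * 0.1      # decoder: hidden -> hidden
Vd = rng.normal(size=(n_vocab, d_hid)) * 0.1    # decoder: hidden -> word scores
Emb = rng.normal(size=(n_vocab, d_hid)) * 0.1   # embeddings of the previous output word

# Encoder: compress the whole variable-length input into one fixed-size vector h.
h = np.zeros(d_hid)
for x in rng.normal(size=(6, d_in)):            # toy source sentence of 6 words
    h = np.tanh(We @ x + Ue @ h)

# Decoder: generate output words until EOS, feeding each word back as input.
out, prev = [], EOS                             # EOS doubles as the start symbol
while True:
    h = np.tanh(Emb[prev] + Ud @ h)
    prev = int(np.argmax(Vd @ h))               # greedy choice of the next word
    out.append(prev)
    if prev == EOS or len(out) >= 10:           # cap length: the weights are untrained
        break
```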

Attention Mechanism

[Diagram: the input x1 x2 x3 plus EOS is encoded into states e1…e4; each decoder state d1…d5 receives a weighted sum (+) over all encoder states, and the outputs y1…y4 followed by EOS are produced.]
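A sketch of the weighted-sum step, using dot-product scoring as a simplification (the original attention model of Bahdanau et al. scores encoder states with a small feed-forward network instead):

```python
import numpy as np

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 4))   # encoder states e1..e6, one per input position
d = rng.normal(size=4)          # current decoder state

scores = enc @ d                               # one relevance score per encoder state
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax: weights summing to 1
context = alpha @ enc                          # the "+" node: weighted sum over all e_i
```

Because the decoder sees a freshly weighted combination of all encoder states at every step, no single fixed-size vector has to hold the whole input sentence.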

Neural MT: Summary

Very new area: first large-scale systems in 2014. Promising results in public evaluations. We know little about its strengths and weaknesses yet, but they seem to be very different from those of earlier approaches. I’ll tell you more in a few years…