A Quick Introduction to Neural MT
Christian Hardmeier
2016-05-16

Why this lecture?
For about 15 years, the MT world was relatively static. State of the art defined by phrase-based SMT and syntax-based SMT. Well-known strengths and weaknesses. Neural MT is a new, quite different approach to MT that seems to outperform the previous methods.

Outline: Deep Learning · Continuous-space NLP · Neural Networks

Deep Learning
Machine learning paradigm that gained popularity very recently. First breakthroughs in computer vision. Multiple layers of prediction: "automated feature engineering".
Deep Learning
[Figure illustrating layered feature learning. Image source: http://deeplearning.stanford.edu/wiki/index.php/Exercise%3AVectorization]

Continuous-Space Methods
NLP traditionally treated words as discrete, incomparable units. Continuous-space methods map them into a vector space where you can compute similarities (a small sketch of such a similarity computation follows below). Methods: word co-occurrence or deep learning. With deep learning, we can train word embeddings for specific objectives.

Discrete Words
[Figure: a scatter of unrelated word types ("Lockheed", "ICCAT", "aspirin", "Dutch", …) illustrating that discrete word symbols carry no notion of similarity.]
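The similarity sketch referred to above: a toy example of comparing word vectors with cosine similarity. The words, vectors and dimensionality are invented for illustration; trained embeddings typically have hundreds of dimensions.

import numpy as np

# Invented 4-dimensional "embeddings", just to show the mechanics.
embeddings = {
    "king":  np.array([0.8, 0.1, 0.7, 0.2]),
    "queen": np.array([0.7, 0.2, 0.8, 0.1]),
    "car":   np.array([0.1, 0.9, 0.0, 0.6]),
}

def cosine(u, v):
    # 1.0 for vectors pointing the same way, near 0 for unrelated directions.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["king"], embeddings["queen"]))  # relatively high
print(cosine(embeddings["king"], embeddings["car"]))    # relatively low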
Word Embeddings (Projected)
[Figure: a 2-D projection of trained word embeddings; function words ("the", "a", "is", "was", …) and content words ("president", "people", "world", …) cluster by usage. Courtesy of Ali Basirat.]

Neural networks
Neural networks are the machine learning paradigm in which most of this happens. Biologically inspired, but that doesn't matter very much. Very popular in the early 1980s, but the time wasn't ripe. The elementary "neuron" is just a nonlinear function with some trainable parameters. Neurons are combined into a network by function composition.

Logistic Regression
A single neuron computes a weighted sum of its inputs and squashes it with a nonlinearity such as the logistic function:
$f(x) = \frac{1}{1 + e^{-x}}, \qquad y = f\Big(\sum_i \lambda_i x_i\Big)$
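The formula above can be written directly in code. This is only an illustrative sketch (NumPy, invented input values), not material from the lecture:

import numpy as np

def logistic(x):
    # f(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def neuron(inputs, weights):
    # y = f(sum_i lambda_i * x_i): a weighted sum passed through the nonlinearity
    return logistic(np.dot(weights, inputs))

x = np.array([1.0, 0.5, -0.3, 2.0, 0.0])     # x_1 ... x_5 (invented values)
lam = np.array([0.2, -0.1, 0.4, 0.05, 0.3])  # lambda_1 ... lambda_5 (invented)
print(neuron(x, lam))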
Logistic Regression
[Figure: the same computation drawn as a network: inputs x_1 … x_5 connected to the output y through weights \lambda_1 … \lambda_5.]

Multiple Decision Steps
Inputs x, latent features h, outputs y:
$h = f(W_1 x), \qquad y = f(W_2 h)$
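A minimal sketch of these two decision steps, again with invented sizes and random weights (the slides do not fix any dimensions):

import numpy as np

def f(x):
    # Elementwise logistic nonlinearity, as before.
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 5))   # maps 5 inputs to 3 latent features
W2 = rng.normal(size=(2, 3))   # maps 3 latent features to 2 outputs

x = rng.normal(size=5)
h = f(W1 @ x)    # h = f(W_1 x)
y = f(W2 @ h)    # y = f(W_2 h)
print(y)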
Training the Network
Neural networks are trained by numerically minimising the error of the output on a training set. The algorithms used are variants of gradient descent. The gradients with respect to all weights can be computed efficiently with a dynamic programming algorithm called back-propagation. (A toy sketch of such a training loop follows below, after the table.)

Word Embeddings in Neural Networks
[Figure: sparse inputs x are mapped to dense embeddings e, which feed into the remaining network.]

Sequence Length Limits
A given network takes a fixed number of inputs. In MT, we need to process input sentences of arbitrary length and produce output of arbitrary length. Input and output length are not necessarily the same.

Input length | Output length      | Compression     | Network type
fixed        | fixed              | –               | feed-forward
variable     | = input (or fixed) | –               | recurrent
variable     | unconstrained      | to fixed size   | encoder-decoder
variable     | unconstrained      | no compression  | attention-based
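Tying back to "Training the Network" above: a toy sketch of gradient descent on the single logistic neuron from earlier, with an invented dataset. Real NMT systems train far larger networks via back-propagation, but the update rule has the same shape:

import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

# Invented toy data: 4 examples with 3 features each, binary targets.
X = np.array([[0., 1., 1.],
              [1., 0., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])
t = np.array([1., 1., 0., 0.])

lam = np.zeros(3)   # trainable weights lambda_i
eta = 0.5           # learning rate

for step in range(200):
    y = f(X @ lam)                                 # forward pass
    grad = X.T @ ((y - t) * y * (1 - y)) / len(t)  # gradient of mean squared error
    lam -= eta * grad                              # gradient-descent update

print(lam, f(X @ lam))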
Adding a Time Dimension: Recurrent Nets
[Figure: a recurrent unit that combines the current input x_t with the output of the previous time step y_{t−1}.]

Processing Sequences
[Figure: an unrolled recurrent network: inputs x_1 … x_8 feed hidden states h_1 … h_8, which produce outputs y_1 … y_8; forward connections link the layers, recurrent connections link consecutive time steps.]

Unequal Sequence Length
In this architecture (sketched in code below), there are as many inputs x_i as outputs y_i. Useful for sequence labelling tasks such as POS tagging. In machine translation, however, the lengths of the input and output sequences differ.
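A minimal sketch of the unrolled recurrence pictured above, with invented sizes and random weights (real recurrent units, e.g. LSTMs, are more elaborate):

import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))   # input x_t        -> hidden h_t
U = rng.normal(size=(4, 4))   # previous h_{t-1} -> hidden h_t (recurrent connection)
V = rng.normal(size=(2, 4))   # hidden h_t       -> output y_t

xs = [rng.normal(size=3) for _ in range(8)]   # x_1 ... x_8
h = np.zeros(4)                               # initial hidden state
ys = []
for x_t in xs:
    h = np.tanh(W @ x_t + U @ h)   # hidden state depends on the previous one
    ys.append(np.tanh(V @ h))      # one output per input position

print(len(ys))   # 8 outputs for 8 inputs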
Encoder-Decoder Architecture
One set of layers of fixed size must hold the contents of the whole input sentence.
[Figure: an encoder reads the input x_1 x_2 x_3 EOS into hidden states e_1 … e_8; decoder states d_1 … d_8 then emit the output y_1 y_2 y_3 y_4 EOS.]

Attention Mechanism
[Figure: each decoder state d_1 … d_5 computes a weighted sum (+) over the encoder states e_1 … e_4 of the input x_1 x_2 x_3 EOS while producing y_1 … y_4 EOS. A code sketch of this weighted-sum computation follows after the summary.]

Neural MT: Summary
Very new area: first large-scale systems in 2014. Promising results in public evaluations. We know little about its strengths and weaknesses yet, but they seem to be very different from those of earlier approaches. I'll tell you more in a few years…
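The sketch promised above: one attention step in isolation, using a plain dot-product score for brevity. The sizes, weights and scoring function are all invented for illustration; published attention models typically score with a small trained network.

import numpy as np

def softmax(a):
    a = a - a.max()         # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum()

rng = np.random.default_rng(2)
encoder_states = rng.normal(size=(4, 6))   # e_1 ... e_4, one per input position
d_prev = rng.normal(size=6)                # current decoder state

scores = encoder_states @ d_prev           # relevance of each e_i to this decoding step
alpha = softmax(scores)                    # attention weights: non-negative, sum to 1
context = alpha @ encoder_states           # weighted sum of encoder states

# 'context' is combined with the decoder state to predict the next target word,
# so the decoder is no longer limited to one fixed-size summary of the input.
print(alpha, context)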