An Introduction to Neural Machine Translation Prof. John D. Kelleher @johndkelleher ADAPT Centre for Digital Content Technology Dublin Institute of Technology, Ireland June 25, 2018 1 / 57
Outline The Neural Machine Translation Revolution Neural Networks 101 Word Embeddings Language Models Neural Language Models Neural Machine Translation Beyond NMT: Image Annotation
Image from https://blogs.msdn.microsoft.com/translation/
Image from https://www.blog.google/products/translate
Image from https://techcrunch.com/2017/08/03/facebook-finishes-its-move-to-neural-machine-translation/
Image from https://slator.com/technology/linguees-founder-launches-deepl-attempt-challenge-google-translate/
Neural Networks 101 7 / 57
What is a function? A function maps a set of inputs (numbers) to an output (number):¹ sum(2, 5, 4) → 11 ¹This introduction to neural networks and machine translation is based on Kelleher (2016). 8 / 57
What is a weightedSum function?

weightedSum([x_1, x_2, ..., x_m], [w_1, w_2, ..., w_m]) = (x_1 × w_1) + (x_2 × w_2) + ··· + (x_m × w_m)
(input numbers: x_1, ..., x_m; weights: w_1, ..., w_m)

weightedSum([3, 9], [−3, 1]) = (3 × −3) + (9 × 1) = −9 + 9 = 0

9 / 57
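To make the definition concrete, here is a minimal Python sketch of the weightedSum function; the function and variable names are illustrative, not part of the slides:

```python
def weighted_sum(inputs, weights):
    """Multiply each input by its weight and add up the results."""
    return sum(x * w for x, w in zip(inputs, weights))

print(weighted_sum([3, 9], [-3, 1]))  # (3 * -3) + (9 * 1) = 0
```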
What is an activation function? An activation function takes the output of our weightedSum function and applies another mapping to it. 10 / 57
What is an activation function?

logistic(z) = 1 / (1 + e^(−z))
tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))
rectifier(z) = max(0, z)

[Plots of the logistic(z), tanh(z), and rectifier(z) activation functions]

11 / 57
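The three activation functions above can be written directly from their formulas; this small Python sketch is illustrative only:

```python
import math

def logistic(z):
    """Squash z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squash z into the range (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def rectifier(z):
    """Pass positive values through unchanged; clip negative values to 0."""
    return max(0.0, z)

print(logistic(0), tanh(0), rectifier(-2))  # 0.5 0.0 0.0
```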
What is an activation function?

activation = logistic(weightedSum([x_1, x_2, ..., x_m], [w_1, w_2, ..., w_m]))
(input numbers: x_1, ..., x_m; weights: w_1, ..., w_m)

logistic(weightedSum([3, 9], [−3, 1])) = logistic((3 × −3) + (9 × 1)) = logistic(−9 + 9) = logistic(0) = 0.5

12 / 57
What is a Neuron? The simple list of operations that we have just described defines the fundamental building block of a neural network: the Neuron.

Neuron = activation(weightedSum([x_1, x_2, ..., x_m], [w_1, w_2, ..., w_m]))
(input numbers: x_1, ..., x_m; weights: w_1, ..., w_m)

13 / 57
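Putting the pieces together, here is a minimal Python sketch of a single neuron, assuming the logistic activation from the earlier slide (all names are illustrative):

```python
import math

def weighted_sum(inputs, weights):
    return sum(x * w for x, w in zip(inputs, weights))

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights):
    """A neuron: a weighted sum of the inputs followed by an activation function."""
    return logistic(weighted_sum(inputs, weights))

print(neuron([3, 9], [-3, 1]))  # logistic(0) = 0.5
```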
What is a Neuron?

[Diagram: a single neuron with inputs x_0, x_1, ..., x_m, weights w_0, w_1, ..., w_m, a summation node Σ, and an activation function φ]

14 / 57
What is a Neural Network?

[Diagram: a network of neurons arranged in layers: an input layer, hidden layers 1–3, and an output layer (layer 4)]

15 / 57
Training a Neural Network ◮ We train a neural network by iteratively updating the weights ◮ We start by randomly assigning weights to each edge ◮ We then show the network examples of inputs and expected outputs and update the weights using backpropagation so that the network outputs match the expected outputs ◮ We keep updating the weights until the network is working the way we want 16 / 57
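As a hedged illustration of this idea, the sketch below trains a single logistic neuron with gradient descent on a toy task; for this one-neuron case, backpropagation reduces to the simple weight-update rule shown in the comments. The task (learning the logical OR function), learning rate, and variable names are assumptions made for the example, not the slides' own setup.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy task: learn the logical OR function from input / expected-output examples.
examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

random.seed(0)
weights = [random.uniform(-1, 1), random.uniform(-1, 1)]  # random starting weights
bias = random.uniform(-1, 1)
learning_rate = 0.5

for epoch in range(1000):
    for inputs, expected in examples:
        output = logistic(sum(x * w for x, w in zip(inputs, weights)) + bias)
        error = expected - output
        # Gradient-descent update (squared-error loss, logistic activation):
        # nudge each weight in the direction that reduces the error.
        gradient = error * output * (1 - output)
        for i, x in enumerate(inputs):
            weights[i] += learning_rate * gradient * x
        bias += learning_rate * gradient

for inputs, expected in examples:
    output = logistic(sum(x * w for x, w in zip(inputs, weights)) + bias)
    print(inputs, "expected:", expected, "network output:", round(output, 2))
```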
Word Embeddings 17 / 57
Word Embeddings ◮ Language is sequential and has lots of words. 18 / 57
“a word is characterized by the company it keeps” — Firth, 1957 19 / 57
Word Embeddings 1. Train a network to predict the word that is missing from the middle of an n-gram (or predict the n-gram from the word) 2. Use the trained network weights to represent the word in vector space. 20 / 57
Word Embeddings Each word is represented by a vector of numbers that positions the word in a multi-dimensional space, e.g.:
king = ⟨55, −10, 176, 27⟩
man = ⟨10, 79, 150, 83⟩
woman = ⟨15, 74, 159, 106⟩
queen = ⟨60, −15, 185, 50⟩
21 / 57
Word Embeddings vec(King) − vec(Man) + vec(Woman) ≈ vec(Queen)² ²Mikolov et al. (2013), Linguistic Regularities in Continuous Space Word Representations 22 / 57
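A small sketch of this vector arithmetic, reusing the toy four-dimensional vectors from the previous slide (illustrative values, not trained embeddings) and using cosine similarity to find the word closest to the result:

```python
import math

embeddings = {
    "king":  [55, -10, 176, 27],
    "man":   [10,  79, 150, 83],
    "woman": [15,  74, 159, 106],
    "queen": [60, -15, 185, 50],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point in exactly the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman
target = [k - m + w for k, m, w in zip(embeddings["king"],
                                       embeddings["man"],
                                       embeddings["woman"])]

# queen comes out closest (with these toy vectors the identity holds exactly).
for word, vec in embeddings.items():
    print(word, round(cosine(target, vec), 3))
```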
Language Models 23 / 57
Language Models ◮ Language is sequential and has lots of words. 24 / 57
1,2,? 25 / 57
[Bar chart: a probability distribution over the possible next symbols 0–9, given the sequence 1, 2]
Th? 27 / 57
[Bar chart: a probability distribution over the possible next characters a–z, given the prefix "Th"]
◮ A language model can compute:
1. the probability of an upcoming symbol: P(w_n | w_1, ..., w_{n−1})
2. the probability for a sequence of symbols:³ P(w_1, ..., w_n)
³We can go from 1. to 2. using the Chain Rule of Probability: P(w_1, w_2, w_3) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2)
29 / 57
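As an illustrative sketch, the toy model below applies the chain rule under a bigram (Markov) assumption, i.e. each word is conditioned only on its predecessor; the probability values are made up for the example:

```python
# Toy bigram probabilities P(word | previous word); the values are illustrative only.
bigram = {
    ("<s>", "yes"): 0.4,
    ("yes", "i"): 0.5,
    ("i", "can"): 0.3,
    ("can", "help"): 0.2,
    ("help", "you"): 0.6,
}

def sequence_probability(words):
    """Chain rule with a bigram assumption:
    P(w1, ..., wn) ~= P(w1 | <s>) * P(w2 | w1) * ... * P(wn | wn-1)."""
    prob = 1.0
    previous = "<s>"
    for word in words:
        prob *= bigram.get((previous, word), 1e-6)  # tiny probability for unseen bigrams
        previous = word
    return prob

print(sequence_probability(["yes", "i", "can", "help", "you"]))   # relatively high
print(sequence_probability(["help", "you", "i", "can", "yes"]))   # vanishingly small
```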
◮ Language models are useful for machine translation because they help with:
1. word ordering: P(Yes I can help you) > P(Help you I can yes)⁴
2. word choice: P(Feel the Force) > P(Eat the Force)
⁴Unless it's Yoda that's speaking
30 / 57
Neural Language Models 31 / 57
Recurrent Neural Networks A particular type of neural network that is useful for processing sequential data (such as language) is a Recurrent Neural Network . 32 / 57
Recurrent Neural Networks Using an RNN we process our sequential data one input at a time. In an RNN the outputs of some of the neurons for one input are fed back into the network as part of the next input. 33 / 57
Simple Feed-Forward Network

[Diagram: a feed-forward network with an input layer (Input 1, Input 2, Input 3, ...), a hidden layer, and an output layer]

34 / 57
Recurrent Neural Networks

[Diagram, built up over several slides: the feed-forward network from the previous slide with a buffer added; at each time step the hidden layer's output is copied into the buffer and fed back into the hidden layer as part of the next input]

35–41 / 57
h_t = φ((W_hh · h_{t−1}) + (W_xh · x_t))
y_t = φ(W_hy · h_t)

Figure: Recurrent Neural Network (inputs x_t, hidden state h_t, output y_t)

42 / 57
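A minimal NumPy sketch of these two update equations, assuming tanh for the activation φ and randomly initialised weight matrices (in a real system the weights would be learned):

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 3, 2

# Weight matrices (randomly initialised here; learned by backpropagation in practice).
W_xh = rng.normal(size=(hidden_size, input_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))
W_hy = rng.normal(size=(output_size, hidden_size))

def rnn_step(x_t, h_prev):
    """h_t = tanh(W_hh . h_{t-1} + W_xh . x_t);  y_t = tanh(W_hy . h_t)"""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = np.tanh(W_hy @ h_t)
    return h_t, y_t

# Process a sequence of three inputs one at a time, carrying the hidden state forward.
h = np.zeros(hidden_size)
for x in rng.normal(size=(3, input_size)):
    h, y = rnn_step(x, h)
    print(y)
```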
Recurrent Neural Networks

Figure: RNN Unrolled Through Time — inputs x_1, x_2, x_3, ..., x_t, x_{t+1}; hidden states h_1, h_2, h_3, ..., h_t, h_{t+1}; outputs y_1, y_2, y_3, ..., y_t, y_{t+1}

43 / 57
Hallucinating Text

[Diagram: the RNN is given only Word_1 as input; each predicted word *Word_2, *Word_3, *Word_4, ..., *Word_{t+1} is fed back in as the next input]

44 / 57
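The generation loop in the figure can be sketched as follows; the character vocabulary, softmax output layer, and random (untrained) weights are assumptions for illustration, so the sampled text is gibberish rather than Shakespeare:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = list("abcdefghijklmnopqrstuvwxyz ")
vocab_size, hidden_size = len(vocab), 16

# Random, untrained weights: the loop structure is the point, not the output quality.
W_xh = rng.normal(scale=0.5, size=(hidden_size, vocab_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.5, size=(vocab_size, hidden_size))

def one_hot(i):
    v = np.zeros(vocab_size)
    v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Start from 't' and repeatedly feed the sampled character back in as the next input.
h = np.zeros(hidden_size)
index = vocab.index("t")
generated = ["t"]
for _ in range(40):
    h = np.tanh(W_hh @ h + W_xh @ one_hot(index))
    probabilities = softmax(W_hy @ h)
    index = rng.choice(vocab_size, p=probabilities)
    generated.append(vocab[index])

print("".join(generated))
```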
Hallucinating Shakespeare PANDARUS: Alas, I think he shall be come approached and the day When little srain would be attain’d into being never fed, And who is but a chain and subjects of his death, I should not sleep. Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states. DUKE VINCENTIO: Well, your wit is in the care of side and that. From: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ 45 / 57
Neural Machine Translation 46 / 57
Neural Machine Translation 1. RNN Encoders 2. RNN Language Models 47 / 57
Encoders

Figure: Using an RNN to Generate an Encoding of a Word Sequence — the inputs Word_1, Word_2, ..., Word_m, <eos> produce hidden states h_1, h_2, ..., h_m; the final hidden state is the encoding C

48 / 57
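A minimal sketch of an RNN encoder: run the recurrence over the source words and take the final hidden state as the encoding C. The vocabulary, embeddings, and weights below are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
embedding_size, hidden_size = 8, 6

# Random stand-ins for learned word embeddings and encoder weights.
vocabulary = ["life", "is", "beautiful", "<eos>"]
embeddings = {word: rng.normal(size=embedding_size) for word in vocabulary}
W_xh = rng.normal(size=(hidden_size, embedding_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))

def encode(words):
    """Run the RNN over the sentence; the final hidden state is the encoding C."""
    h = np.zeros(hidden_size)
    for word in words:
        h = np.tanh(W_hh @ h + W_xh @ embeddings[word])
    return h

C = encode(["life", "is", "beautiful", "<eos>"])
print(C)
```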
Language Models

Figure: RNN Language Model Unrolled Through Time — each input Word_1, Word_2, Word_3, ..., Word_t produces a hidden state h_1, h_2, h_3, ..., h_t and a predicted next word *Word_2, *Word_3, *Word_4, ..., *Word_{t+1}

49 / 57
Decoder

Figure: Using an RNN Language Model to Generate (Hallucinate) a Word Sequence — only Word_1 is given; each predicted word *Word_2, *Word_3, *Word_4, ..., *Word_{t+1} is fed back in as the next input

50 / 57
Encoder-Decoder Architecture

Figure: Sequence to Sequence Translation using an Encoder-Decoder Architecture — the encoder reads Source_1, Source_2, ..., <eos> into hidden states h_1, h_2, ... and produces the encoding C; the decoder is initialised with C and its states d_1, ..., d_n generate Target_1, Target_2, ..., <eos>

51 / 57
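A sketch of the whole encoder-decoder loop under the same assumptions as the encoder sketch (random stand-in embeddings and weights, <eos> used as the decoder's start symbol): encode the source sentence into C, initialise the decoder with C, and feed each predicted target word back in until <eos> is produced. With untrained weights the output is meaningless; the point is only the control flow:

```python
import numpy as np

rng = np.random.default_rng(3)
embedding_size, hidden_size = 8, 6

source_vocab = ["life", "is", "beautiful", "<eos>"]
target_vocab = ["la", "vie", "est", "belle", "<eos>"]

# Random stand-ins for the learned embeddings and weight matrices.
src_emb = {w: rng.normal(size=embedding_size) for w in source_vocab}
tgt_emb = {w: rng.normal(size=embedding_size) for w in target_vocab}
We_xh = rng.normal(size=(hidden_size, embedding_size))     # encoder input weights
We_hh = rng.normal(size=(hidden_size, hidden_size))        # encoder recurrent weights
Wd_xh = rng.normal(size=(hidden_size, embedding_size))     # decoder input weights
Wd_hh = rng.normal(size=(hidden_size, hidden_size))        # decoder recurrent weights
Wd_hy = rng.normal(size=(len(target_vocab), hidden_size))  # decoder output weights

def encode(words):
    h = np.zeros(hidden_size)
    for word in words:
        h = np.tanh(We_hh @ h + We_xh @ src_emb[word])
    return h  # the encoding C

def decode(C, max_length=10):
    """Initialise the decoder with C and feed each predicted word back in."""
    h, word, output = C, "<eos>", []
    for _ in range(max_length):
        h = np.tanh(Wd_hh @ h + Wd_xh @ tgt_emb[word])
        word = target_vocab[int(np.argmax(Wd_hy @ h))]
        if word == "<eos>":
            break
        output.append(word)
    return output

print(decode(encode(["life", "is", "beautiful", "<eos>"])))
```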
Neural Machine Translation

Figure: Example Translation using an Encoder-Decoder Architecture — the encoder reads "Life is beautiful <eos>" (hidden states h_1–h_4) and produces the encoding C; the decoder (states d_1–d_3) generates "La vie est belle <eos>"

52 / 57
Beyond NMT: Image Annotation 53 / 57
Image Annotation Image from Xu et al. (2015), Show, Attend and Tell: Neural Image Caption Generation with Visual Attention 54 / 57
Thank you for your attention

john.d.kelleher@dit.ie @johndkelleher www.machinelearningbook.com https://ie.linkedin.com/in/johndkelleher

Acknowledgements: The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

Books: Kelleher et al. (2015); Kelleher and Tierney (2018), Data Science, The MIT Press Essential Knowledge Series