Machine Translation
2: Statistical MT: Neural MT and Representations
Ondřej Bojar (bojar@ufal.mff.cuni.cz)
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics, Charles University, Prague
May 2020
Outline of Lectures on MT
1. Introduction. – Phrase-Based MT.
   • Why is MT difficult.
   • MT evaluation.
   • Approaches to MT.
   • Document, sentence and esp. word alignment.
   • Classical Statistical Machine Translation.
2. Neural Machine Translation.
   • Neural MT: Sequence-to-sequence, attention, self-attentive.
   • Sentence representations.
   • Role of Linguistic Features in MT.
Outline of MT Lecture 2
1. Fundamental problems of PBMT.
2. Neural machine translation (NMT).
   • Brief summary of NNs.
   • Sequence-to-sequence, with attention.
   • Transformer, self-attention.
   • Linguistic features in NMT.
Summary of PBMT
Phrase-based MT:
• is a log-linear model
• assumes phrases relatively independent of each other
• decomposes the sentence into contiguous phrases
• search has two parts:
  – lookup of all relevant translation options
  – stack-based beam search, gradually expanding hypotheses
To train a PBMT system:
1. Align words.
2. Extract (and score) phrases consistent with word alignment.
3. Optimize weights (MERT).
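Step 2 of the training pipeline can be illustrated with a small sketch (not the lecture's code): a candidate phrase pair is kept only if no alignment link leaves the box spanned by its source and target spans. The alignment format and the `max_len` limit are assumptions for illustration.

```python
# Hypothetical illustration: enumerate phrase pairs consistent with a word alignment.
# A phrase pair (f_span, e_span) is consistent if no alignment link leaves the box.

def extract_phrase_pairs(n_src, n_tgt, alignment, max_len=7):
    """alignment: set of (src_idx, tgt_idx) links; returns list of span pairs."""
    pairs = []
    for fs in range(n_src):
        for fe in range(fs, min(n_src, fs + max_len)):
            # Target positions linked to the source span [fs, fe].
            tgt = [j for (i, j) in alignment if fs <= i <= fe]
            if not tgt:
                continue
            ts, te = min(tgt), max(tgt)
            # Consistency: no link from inside the target span to outside the source span.
            if any(ts <= j <= te and not (fs <= i <= fe) for (i, j) in alignment):
                continue
            pairs.append(((fs, fe), (ts, te)))
    return pairs

# Toy data resembling the walkthrough below: "Nemám žádného psa ." / "I have no dog ."
src = ["Nemám", "žádného", "psa", "."]
tgt = ["I", "have", "no", "dog", "."]
links = {(0, 0), (0, 1), (1, 2), (2, 3), (3, 4)}
for (fs, fe), (ts, te) in extract_phrase_pairs(len(src), len(tgt), links):
    print(" ".join(src[fs:fe + 1]), "|||", " ".join(tgt[ts:te + 1]))
```

On this toy data the sketch also extracts the pair "Nemám ||| I have", exactly the pair that causes trouble later in the walkthrough.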
1: Align Training Sentences
Nemám žádného psa. → I have no dog.
Viděl kočku. → He saw a cat.
2: Align Words
Nemám žádného psa. ↔ I have no dog.
Viděl kočku. ↔ He saw a cat.
(word alignment links drawn in the figure)
3: Extract Phrase Pairs (MTUs)
Nemám žádného psa. / I have no dog.
Viděl kočku. / He saw a cat.
(phrase pairs extracted from the word-aligned sentences)
4: New Input
New input: Nemám kočku.
(Correct translation: I don't have a cat.)
5: Pick Probable Phrase Pairs (TM)
New input: Nemám kočku.
TM picks: Nemám → I have
6: So That n-Grams Probable (LM)
New input: Nemám kočku.
Output: I have a cat.
Meaning Got Reversed!
New input: Nemám kočku. (= I don't have a cat.)
Output: I have a cat. ✘
What Went Wrong?

$$\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I,\,e_1^I} p(f_1^J \mid e_1^I)\, p(e_1^I)
= \operatorname*{argmax}_{I,\,e_1^I} \Big( \prod_{(\hat{f},\hat{e})\,\in\,\text{phrase pairs of } f_1^J,\,e_1^I} p(\hat{f} \mid \hat{e}) \Big)\, p(e_1^I) \qquad (1)$$

• Too strong phrase-independence assumption.
  – Phrases do depend on each other.
    Here "nemám" and "žádného" jointly express one negation.
  – Word alignments ignored that dependence.
  – But adding it would increase data sparseness.
• Language model is a separate unit.
  – $p(e_1^I)$ models the target sentence independently of $f_1^J$.
Redefining p(e_1^I | f_1^J)

What if we modelled $p(e_1^I \mid f_1^J)$ directly, word by word:

$$p(e_1^I \mid f_1^J) = p(e_1, e_2, \ldots, e_I \mid f_1^J)
= p(e_1 \mid f_1^J) \cdot p(e_2 \mid e_1, f_1^J) \cdot p(e_3 \mid e_2, e_1, f_1^J) \cdots
= \prod_{i=1}^{I} p(e_i \mid e_1, \ldots, e_{i-1}, f_1^J) \qquad (2)$$

…this is "just a cleverer language model": $p(e_1^I) = \prod_{i=1}^{I} p(e_i \mid e_1, \ldots, e_{i-1})$.

Main benefit: All dependencies available.
But what technical device can learn this?
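As a concrete (hypothetical) sketch of what such a device has to do at translation time: given the full source sentence and the target prefix, it produces a distribution over the next word, and the prefix is extended step by step. Here `cond_model` stands in for whatever network computes p(e_i | e_1, …, e_{i−1}, f_1^J); the special symbols and the greedy choice are illustrative assumptions.

```python
import math

# Hypothetical sketch of decoding with a direct model of p(e_i | e_1, ..., e_{i-1}, f_1^J).
# cond_model(prefix, src_words) is assumed to return a dict {next_word: probability}.

def greedy_translate(src_words, cond_model, max_len=50):
    prefix = ["<s>"]                           # the target prefix e_1 .. e_{i-1}
    log_prob = 0.0
    for _ in range(max_len):
        dist = cond_model(prefix, src_words)   # p(e_i | prefix, full source sentence)
        word = max(dist, key=dist.get)         # greedy choice; real systems use beam search
        log_prob += math.log(dist[word])
        if word == "</s>":                     # end-of-sentence symbol
            break
        prefix.append(word)
    return prefix[1:], log_prob                # translation and its log-probability
```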
NNs: Universal Approximators
• A neural network with a single hidden layer (possibly huge) can approximate any continuous function to any precision.
• (Nothing claimed about learnability.)
https://www.quora.com/How-can-a-deep-neural-network-with-ReLU-activations-in-its-hidden-layers-approximate-any-function
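A minimal numpy sketch of the idea (the target function, layer size and training setup are chosen purely for illustration): one hidden tanh layer is fitted to sin(x) with hand-written gradient descent.

```python
import numpy as np

# Fit f(x) = sin(x) with one hidden tanh layer: y = tanh(x W1 + b1) W2 + b2.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(x)

H = 50                                        # hidden units ("possibly huge" in general)
W1, b1 = rng.normal(size=(1, H)), np.zeros(H)
W2, b2 = rng.normal(size=(H, 1)) * 0.1, np.zeros(1)

lr = 0.01
for step in range(5000):
    h = np.tanh(x @ W1 + b1)                  # (200, H) hidden activations
    pred = h @ W2 + b2                        # (200, 1) network output
    err = pred - y
    # Backpropagation written out by hand for this tiny network.
    gW2 = h.T @ err / len(x); gb2 = err.mean(0)
    gh = err @ W2.T * (1 - h ** 2)
    gW1 = x.T @ gh / len(x); gb1 = gh.mean(0)
    W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1

print("final MSE:", float((err ** 2).mean()))
```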
playground.tensorflow.org
Perfect Features
Bad Features & Low Depth
Too Complex NN Fails to Learn
Deep NNs for Image Classification
Representation Learning
• Based on training data (sample inputs and expected outputs),
  the neural network learns by itself
  what is important in the inputs
  to predict the outputs best.
A "representation" is a new set of axes:
• Instead of 3 dimensions (x, y, color), we get
• 2000 dimensions: (elephantity, number of storks, blueness, …)
• designed automatically to help in best prediction of the output.
One Layer
tanh(Wx + b), 2D → 2D
• Skew: W
• Translate: b
• Non-lin.: tanh
Animation by http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
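In code the layer from the animation is a one-liner; W, b and the sample points below are made up for illustration.

```python
import numpy as np

# One layer: tanh(W x + b), mapping 2D points to 2D points.
W = np.array([[1.0, 0.5],
              [0.0, 1.0]])          # linear map ("skew")
b = np.array([0.5, -0.2])           # translation
points = np.random.default_rng(1).normal(size=(5, 2))

transformed = np.tanh(points @ W.T + b)   # pointwise non-linearity
print(transformed)
```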
Four Layers, Disentangling Spirals
Animation by http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Processing Text with NNs
• Map each word to a vector of 0s and 1s ("1-hot repr."):
  cat ↦ (0, 0, …, 0, 1, 0, …, 0)
• Sentence is then a matrix of such vectors.
(Figure: 1-hot matrix for "the cat is on the mat"; vocabulary size: 1.3M English, 2.2M Czech.)
Main drawback: No relations, all words equally close/far.
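A tiny sketch of the 1-hot encoding with a toy vocabulary (the real vocabularies have millions of entries): every word becomes a standard-basis vector, the whole sentence a matrix, and any two distinct words end up equally far apart.

```python
import numpy as np

vocab = ["the", "cat", "is", "on", "mat", "a", "about", "zebra"]  # toy vocabulary
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot_sentence(words):
    """Return a |sentence| x |vocab| matrix with one 1-hot row per word."""
    m = np.zeros((len(words), len(vocab)))
    for row, w in enumerate(words):
        m[row, word2id[w]] = 1.0
    return m

sent = one_hot_sentence("the cat is on the mat".split())
print(sent)
# Every pair of distinct words has the same distance -- "no relations":
print(np.linalg.norm(sent[1] - sent[2]), np.linalg.norm(sent[1] - sent[5]))
```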
Solution: Word Embeddings
• Map each word to a dense vector.
• In practice 300–2000 dimensions are used, not 1–2M.
• Embeddings are trained for each particular task.
  – In NNs: the matrix that maps the 1-hot input to the first layer.
  – The dimensions have no clear interpretation.
• The famous word2vec (Mikolov et al., 2013):
  – CBOW: Predict the word from its four neighbours.
  – Skip-gram: Predict likely neighbours given the word.
Right: CBOW with just a single-word context (http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf)
(Figure: input layer x (1-hot, V-dim) → hidden layer h (N-dim) via W_{V×N} → output layer y (V-dim) via W'_{N×V}.)
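The remark that the embedding "is" the matrix mapping the 1-hot input to the first layer can be shown directly; the sizes, random weights and `cat_id` below are made-up illustration. Multiplying a 1-hot vector by W simply selects one row of W.

```python
import numpy as np

V, N = 10_000, 300                     # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))            # the input-to-hidden matrix of the network

cat_id = 1234                          # arbitrary word index for illustration
one_hot = np.zeros(V); one_hot[cat_id] = 1.0

# Multiplying the 1-hot vector by W equals looking up row cat_id:
assert np.allclose(one_hot @ W, W[cat_id])
embedding_of_cat = W[cat_id]           # the dense "embedding" of the word
print(embedding_of_cat.shape)          # (300,)
```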
Continuous Space of Words
Word2vec embeddings show interesting properties:

$$v(\text{king}) - v(\text{man}) + v(\text{woman}) \approx v(\text{queen}) \qquad (3)$$

Illustrations from https://www.tensorflow.org/tutorials/word2vec
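Property (3) is typically checked as in the following sketch, which assumes some dictionary `emb` of pre-trained word2vec-style numpy vectors (not provided here) and retrieves the nearest neighbour of v(king) − v(man) + v(woman) by cosine similarity.

```python
import numpy as np

def analogy(a, b, c, emb, exclude=True):
    """Return the word whose vector is closest (cosine) to emb[a] - emb[b] + emb[c]."""
    target = emb[a] - emb[b] + emb[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -1.0
    for word, vec in emb.items():
        if exclude and word in (a, b, c):     # skip the query words themselves
            continue
        sim = vec @ target / np.linalg.norm(vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# With word2vec-style embeddings this typically returns "queen":
# analogy("king", "man", "woman", emb)
```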
Further Compression: Sub-Words
• SMT struggled with productive morphology (>1M wordforms):
  nejneobhospodařovávatelnějšími, Donaudampfschifffahrtsgesellschaftskapitän
• NMT can handle only 30–80k dictionaries. ⇒ Resort to sub-word units.

Orig:       český politik svezl migranty
Syllables:  čes ký ⊔ po li tik ⊔ sve zl ⊔ mig ran ty
Morphemes:  česk ý ⊔ politik ⊔ s vez l ⊔ migrant y
BPE 30k:    český politik s@@ vez@@ l mi@@ granty
Char Pairs: če sk ý ⊔ po li ti k ⊔ sv ez l ⊔ mi gr an ty
Chars:      č e s k ý ⊔ p o l i t i k ⊔ s v e z l ⊔ m i g r a n t y

BPE (Byte-Pair Encoding) uses the n most common substrings (incl. frequent words).
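A simplified sketch of the BPE training loop (in the spirit of the algorithm, not a production tokenizer): start from characters, repeatedly merge the most frequent adjacent symbol pair, and record the merges. The toy word-frequency dictionary is made up.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """word_freqs: dict {word: count}; returns the list of learned merges."""
    # Start from characters, with an end-of-word marker so merges stay inside words.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

print(learn_bpe({"politik": 5, "politika": 3, "migranty": 2}, 10))
```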