SFU NatLangLab CMPT 825: Natural Language Processing Neural Machine Translation Spring 2020 2020-03-12 Adapted from slides from Danqi Chen, Karthik Narasimhan, and Jetic Gu. (with some content from slides from Abigail See, Graham Neubig)
Course Logistics ‣ Project proposal is due today ‣ What problem are you addressing? Why is it interesting? ‣ What specific aspects will your project be on? ‣ Re-implement a paper? Compare different methods? ‣ What data do you plan to use? ‣ What is your method? ‣ How do you plan to evaluate? What metrics?
Last time • Statistical MT • Word-based • Phrase-based • Syntactic
Neural Machine Translation ‣ A single neural network is used to translate from source to target ‣ Architecture: Encoder-Decoder ‣ Two main components: ‣ Encoder: Convert source sentence (input) into a vector/matrix ‣ Decoder: Convert encoding into a sentence in target language (output)
Sequence to Sequence learning (Seq2seq) • Encode entire input sequence into a single vector (using an RNN) • Decode one word at a time (again, using an RNN!) • Beam search for better inference • Learning is not trivial! (vanishing/exploding gradients) (Sutskever et al., 2014)
Encoder (Sentence: "This cat is cute") [Figure: the word embeddings $x_1, \ldots, x_4$ of "This", "cat", "is", "cute" are fed one at a time into an RNN, producing hidden states $h_1, \ldots, h_4$; the final hidden state is the encoded representation $h^{enc}$ of the sentence.]
Decoder • A conditioned language model [Figure: the decoder RNN is initialized with $h^{enc}$; starting from <s>, it consumes the embedding of the previous target word at each step, updates its hidden state $z_t$, and outputs a distribution over the next word, producing "ce chat est mignon <e>". The predicted word $y_t$ can be fed back as the input at the next time step.]
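As a concrete (and deliberately simplified) illustration, here is a minimal PyTorch sketch of such an encoder-decoder pair; the class, variable names, and sizes are illustrative rather than taken from any of the cited papers.

```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal GRU encoder-decoder sketch; names and sizes are illustrative."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Encode the whole source sentence into a single vector h_enc.
        _, h_enc = self.encoder(self.src_emb(src))        # h_enc: (1, B, H)
        # Decode conditioned on h_enc; tgt_in is the gold target shifted
        # right (<s> w1 w2 ...), i.e. teacher forcing during training.
        z, _ = self.decoder(self.tgt_emb(tgt_in), h_enc)  # z: (B, T, H)
        return self.out(z)                                # logits: (B, T, |V_tgt|)
```

At test time the decoder's own prediction is fed back as the next input, as in the figure above.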
Seq2seq training ‣ Similar to training a language model! ‣ Minimize cross-entropy loss: $-\sum_{t=1}^{T} \log P(y_t \mid y_1, \ldots, y_{t-1}, x_1, \ldots, x_n)$ ‣ Back-propagate gradients through both decoder and encoder ‣ Need a really big corpus (e.g. 36M sentence pairs) Russian: Машинный перевод - это круто! English: Machine translation is cool!
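A hedged sketch of how this loss is typically computed with a model like the one above; `logits` and `tgt_out` are assumed tensor names, not from the slides.

```python
import torch.nn.functional as F

# logits: (B, T, |V|) from the decoder; tgt_out: (B, T) gold next words.
# Cross-entropy = -sum_t log P(y_t | y_<t, x), averaged over tokens here.
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       tgt_out.reshape(-1))
loss.backward()  # gradients flow back through both decoder and encoder
```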
Seq2seq training (slide credit: Abigail See)
Remember masking: use masking to help compute the loss for batched sequences of different lengths. [Example mask, one row per sequence in the batch: 1 1 1 1 0 0 / 1 0 0 0 0 0 / 1 1 1 1 1 1 / 1 1 1 0 0 0]
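One possible way to apply such a mask when computing the batched loss (an assumption about the implementation, not the slides' exact recipe; `pad_id` is an assumed padding token id):

```python
import torch.nn.functional as F

# Per-token losses, shape (B, T); padding positions are still included here.
tok_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tgt_out.reshape(-1),
                           reduction="none").view(tgt_out.shape)
mask = (tgt_out != pad_id).float()            # 1 for real tokens, 0 for padding
loss = (tok_loss * mask).sum() / mask.sum()   # average over real tokens only
```

In PyTorch, passing `ignore_index=pad_id` to `F.cross_entropy` achieves the same effect.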
Scheduled Sampling: possible decay schedules, where the probability of using the true y decays over training (figure credit: Bengio et al., 2015)
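A minimal sketch of the three decay schedules from Bengio et al. (2015): linear, exponential, and inverse-sigmoid decay of the probability of feeding in the true previous word. The constants and function name are illustrative.

```python
import math
import random

def teacher_forcing_prob(step, schedule="inverse_sigmoid", k=100.0):
    """Probability of feeding the TRUE previous word at this training step.
    Schedules follow Bengio et al. (2015); the constants are illustrative."""
    if schedule == "linear":
        return max(0.1, 1.0 - step / 10000.0)   # linear decay down to a floor
    if schedule == "exponential":
        return 0.999 ** step                    # k^i with k < 1
    return k / (k + math.exp(step / k))         # inverse sigmoid decay

# At each decoder step, flip a coin: gold previous word vs. model's own prediction.
use_gold = random.random() < teacher_forcing_prob(step=5000)
```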
How seq2seq changed the MT landscape
MT Progress (source: Rico Sennrich)
(Wu et al., 2016)
NMT vs SMT
Pros: ‣ Better performance (fluency, longer context) ‣ Single NN optimized end-to-end ‣ Less engineering ‣ Works out of the box for many language pairs
Cons: ‣ Requires more data and compute ‣ Less interpretable ‣ Hard to debug ‣ Uncontrollable ‣ Heavily dependent on data, which could lead to unwanted biases ‣ More parameters
Seq2Seq for more than NMT (Task/Application: Input → Output) ‣ Machine Translation: French → English ‣ Summarization: Document → Short Summary ‣ Dialogue: Utterance → Response ‣ Parsing: Sentence → Parse tree (as sequence) ‣ Question Answering: Context + Question → Answer
Cross-Modal Seq2Seq (Task/Application: Input → Output) ‣ Speech Recognition: Speech Signal → Transcript ‣ Image Captioning: Image → Text ‣ Video Captioning: Video → Text
Issues with vanilla seq2seq ‣ Bottleneck: a single encoding vector, $h^{enc}$, needs to capture all the information about the source sentence ‣ Longer sequences can lead to vanishing gradients ‣ Overfitting
Remember alignments?
Attention ‣ The neural MT equivalent of alignment models ‣ Key idea: at each time step during decoding, focus on a particular part of the source sentence ‣ This depends on the decoder's current hidden state (i.e. a notion of what you are trying to decode) ‣ Usually implemented as a probability distribution over the hidden states of the encoder ($h_i^{enc}$)
Seq2seq with attention (slide credit: Abigail See) [Figure: step-by-step build of the attention mechanism over successive decoder steps; the predicted word $\hat{y}_1$ can also be used as input for the next time step.]
Computing attention ‣ Encoder hidden states: $h_1^{enc}, \ldots, h_n^{enc}$ ‣ Decoder hidden state at time $t$: $h_t^{dec}$ ‣ First, get attention scores for this time step (we will see what $g$ is soon!): $e^t = [g(h_1^{enc}, h_t^{dec}), \ldots, g(h_n^{enc}, h_t^{dec})]$ ‣ Obtain the attention distribution using softmax: $\alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^n$ ‣ Compute the weighted sum of encoder hidden states: $a_t = \sum_{i=1}^{n} \alpha_i^t h_i^{enc} \in \mathbb{R}^h$ ‣ Finally, concatenate with the decoder state and pass on to the output layer: $[a_t; h_t^{dec}] \in \mathbb{R}^{2h}$
Types of attention ‣ Assume encoder hidden states $h_1, h_2, \ldots, h_n$ and decoder hidden state $z$ 1. Dot-product attention (assumes equal dimensions for $h_i$ and $z$): $e_i = g(h_i, z) = z^T h_i \in \mathbb{R}$ 2. Multiplicative attention: $g(h_i, z) = z^T W h_i \in \mathbb{R}$, where $W$ is a weight matrix 3. Additive attention: $g(h_i, z) = v^T \tanh(W_1 h_i + W_2 z) \in \mathbb{R}$, where $W_1, W_2$ are weight matrices and $v$ is a weight vector
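A minimal sketch of the three score functions together with the softmax and weighted sum from the previous slide. The shapes, names, and single-vector (unbatched) setup are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def attention(H, z, W=None, W1=None, W2=None, v=None, kind="dot"):
    """H: encoder states, shape (n, h_enc); z: decoder state, shape (h_dec,).
    Returns the context vector a_t and the attention distribution alpha."""
    if kind == "dot":                    # e_i = z^T h_i   (needs h_enc == h_dec)
        e = H @ z
    elif kind == "multiplicative":       # e_i = z^T W h_i
        e = H @ (W.T @ z)
    else:                                # additive: e_i = v^T tanh(W1 h_i + W2 z)
        e = torch.tanh(H @ W1.T + W2 @ z) @ v
    alpha = F.softmax(e, dim=0)          # attention distribution over source words
    context = alpha @ H                  # weighted sum of encoder states
    return context, alpha                # [context; z] is passed to the output layer
```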
Issues with vanilla seq2seq ‣ Bottleneck: a single encoding vector, $h^{enc}$, needs to capture all the information about the source sentence ‣ Longer sequences can lead to vanishing gradients ‣ Overfitting
Dropout ‣ Form of regularization for RNNs (and any NN in general) ‣ Idea: "handicap" the NN by removing hidden units stochastically ‣ set each hidden unit in a layer to 0 with probability $p$ during training ($p = 0.5$ usually works well) ‣ scale the outputs by $1/(1 - p)$ ‣ hidden units are forced to learn more general patterns ‣ Test time: use all activations (no need to rescale)
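A minimal sketch of this "inverted dropout" variant (scale by $1/(1-p)$ at training time so that test time needs no rescaling); in practice one would simply use `torch.nn.Dropout`, which does the same thing.

```python
import torch

def dropout(h, p=0.5, training=True):
    """Inverted dropout: zero each unit with prob p and rescale by 1/(1-p)."""
    if not training or p == 0.0:
        return h                                 # test time: use all activations
    mask = (torch.rand_like(h) > p).float()      # keep each unit with prob 1-p
    return h * mask / (1.0 - p)                  # rescale so expectations match
```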
Handling large vocabularies ‣ Softmax can be expensive for large vocabularies: $P(y_i) = \frac{\exp(w_i \cdot h + b_i)}{\sum_{j=1}^{|V|} \exp(w_j \cdot h + b_j)}$ (the denominator is expensive to compute) ‣ English vocabulary size: 10K to 100K
Approximate Softmax ‣ Negative Sampling ‣ Structured softmax ‣ Embedding prediction
Negative Sampling • Softmax is expensive when vocabulary size is large (figure credit: Graham Neubig)
Negative Sampling • Sample just a subset of the vocabulary as negative examples • We saw simple negative sampling in word2vec (Mikolov et al., 2013) • Other ways to sample: Importance Sampling (Bengio and Senecal 2003), Noise Contrastive Estimation (Mnih & Teh 2012) (figure credit: Graham Neubig)
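A very simplified sketch of the idea: score the gold word against a small random subset of negatives instead of the full vocabulary. This is not the exact estimator of any of the cited papers (they differ in how negatives are sampled and how the bias is corrected); the function name and the uniform sampler are assumptions.

```python
import torch
import torch.nn.functional as F

def sampled_loss(h, W, b, gold, num_neg=100):
    """h: hidden state (H,); W: output word embeddings (|V|, H); b: biases (|V|,);
    gold: scalar LongTensor with the correct word id."""
    V = W.size(0)
    neg = torch.randint(0, V, (num_neg,))       # uniform negatives (real systems often
                                                # sample by frequency instead)
    idx = torch.cat([gold.view(1), neg])        # gold word first, then the negatives
    logits = W[idx] @ h + b[idx]                # only 1 + num_neg scores, not |V|
    target = torch.zeros(1, dtype=torch.long)   # the gold word sits at position 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```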
Hierarchical softmax (Morin and Bengio 2005) (figure credit: Quora)
Class-based softmax ‣ Two-layer: cluster words into classes, predict the class, then predict the word within the class (figure credit: Graham Neubig) ‣ Clusters can be based on frequency, random assignment, or word contexts (Goodman 2001, Mikolov et al. 2011)
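A minimal sketch of the two-step factorization $P(w \mid h) = P(\mathrm{class}(w) \mid h) \cdot P(w \mid \mathrm{class}(w), h)$, so only $|C| + |\mathrm{class}|$ scores are computed per word instead of $|V|$. The class assignment maps (`word2class`, `word2pos`) are assumed to be precomputed, e.g. from word frequencies.

```python
import torch.nn as nn
import torch.nn.functional as F

class ClassFactoredSoftmax(nn.Module):
    """Two-step softmax: log P(w|h) = log P(class(w)|h) + log P(w | class(w), h)."""
    def __init__(self, hid_dim, class_sizes, word2class, word2pos):
        super().__init__()
        self.cls_out = nn.Linear(hid_dim, len(class_sizes))             # class scores
        self.word_out = nn.ModuleList(nn.Linear(hid_dim, n) for n in class_sizes)
        self.word2class, self.word2pos = word2class, word2pos           # precomputed maps

    def log_prob(self, h, word):
        c = self.word2class[word]                                       # class of the word
        log_p_class = F.log_softmax(self.cls_out(h), dim=-1)[c]
        log_p_word = F.log_softmax(self.word_out[c](h), dim=-1)[self.word2pos[word]]
        return log_p_class + log_p_word         # |C| + |class c| scores instead of |V|
```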
Embedding prediction ‣ Directly predict the embeddings of the outputs themselves (Kumar and Tsvetkov 2019) ‣ What loss to use? L2? Cosine? ‣ Von Mises-Fisher distribution loss: makes embeddings close on the unit ball (slide credit: Graham Neubig)
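For illustration only, the two simple regression losses mentioned above on assumed tensors `pred` and `target_emb`; the actual von Mises-Fisher loss of Kumar and Tsvetkov (2019) is more involved and is not reproduced here.

```python
import torch.nn.functional as F

# pred: predicted output embedding (D,); target_emb: pretrained embedding of the
# gold word (D,) -- both are assumed tensor names.
l2_loss  = ((pred - target_emb) ** 2).sum()                      # L2 regression
cos_loss = 1.0 - F.cosine_similarity(pred, target_emb, dim=0)    # directional closeness
```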
Generation: How can we use our model (decoder) to generate sentences? • Sampling: try to generate a random sentence according to the probability distribution • Argmax: try to generate the best sentence, i.e. the sentence with the highest probability
Decoding Strategies ‣ Ancestral sampling ‣ Greedy decoding ‣ Exhaustive search ‣ Beam search
Ancestral Sampling • Randomly sample words one by one • Provides diverse output (high variance) (figure credit: Luong, Cho, and Manning)
Greedy decoding ‣ Compute the argmax at every step of the decoder to generate the next word ‣ What's wrong?
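A minimal sketch contrasting ancestral sampling and greedy decoding, using a hypothetical `decoder_step` function (an assumed single-step API for the trained decoder). The problem with greedy decoding is that it cannot undo an early, locally optimal but globally poor choice, which is what motivates beam search.

```python
import torch

def generate(decoder_step, h_enc, bos_id, eos_id, max_len=50, greedy=True):
    """decoder_step(prev_word_id, state) -> (logits over vocab, new state);
    a hypothetical single-step API for the trained decoder."""
    word, state, out = bos_id, h_enc, []
    for _ in range(max_len):
        logits, state = decoder_step(word, state)
        if greedy:                                   # argmax: locally best word
            word = int(logits.argmax())
        else:                                        # ancestral sampling: diverse output
            probs = torch.softmax(logits, dim=-1)
            word = int(torch.multinomial(probs, 1))
        if word == eos_id:
            break
        out.append(word)
    return out
```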