Sequence-to-sequence models
Murat Apishev, Katya Artemova
Computational Pragmatics Lab, HSE
December 2, 2019
Today
1. Machine translation
2. Task-oriented chat-bots
3. Constituency parsing
4. Spelling correction
5. Summarization
6. Question answering
7. IR-based QA
   - Datasets
   - Models
Machine translation

Sequence-to-sequence
- Neural encoder-decoder architectures achieve high results on machine translation, spelling correction, summarization and other NLP tasks
- The encoder reads a sequence of tokens x_{1:n} and outputs hidden states, the last of which is h^E_n
- The decoder generates an output sequence of tokens y_{1:m}, initializing its hidden state with the last encoder state: h^D_0 = h^E_n
- seq2seq architectures are trained on parallel corpora
Image source: jeddy92
Seq-to-seq for MT
- Both the encoder and the decoder are recurrent networks
- Input words x_i (i ∈ [1, n]) are represented as word embeddings (w2v, for example)
- The context vector h_n, the last hidden state of the RNN encoder, turns out to be a bottleneck
- It is challenging for such models to deal with long sentences, since the impact of the last words is higher
- The attention mechanism is one possible solution (a minimal attention-free seq2seq sketch follows below)
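To make the data flow concrete, here is a minimal, self-contained PyTorch sketch of a vanilla (attention-free) seq2seq model: a GRU encoder whose last hidden state initializes a GRU decoder. The vocabulary sizes, dimensions and random batches are invented for illustration, not taken from the lecture.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src, tgt):
        _, h_enc = self.encoder(self.src_emb(src))               # h_enc is the last encoder state h^E_n
        dec_states, _ = self.decoder(self.tgt_emb(tgt), h_enc)   # h^D_0 = h^E_n
        return self.out(dec_states)                              # logits over the target vocabulary

src = torch.randint(0, 1000, (2, 7))    # toy batch: 2 source sentences of length 7
tgt = torch.randint(0, 1200, (2, 5))    # shifted target tokens of length 5
logits = Seq2Seq(1000, 1200)(src, tgt)  # shape: (2, 5, 1200)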
Seq-to-seq for MT + attention
- The attention mechanism makes it possible to align input and output words
- The encoder passes all of its hidden states to the decoder: not just h^E_n, but h^E_i for i ∈ [1, n]
- These hidden states can be treated as context-aware word embeddings
- The hidden states are used to produce a context vector c for the decoder
Seq-to-seq for MT + attention
- At step j the decoder takes as input h^D_{j−1}, j ∈ [n+1, m], and a context vector c_j from the encoder
- The context vector c_j is a linear combination of the encoder hidden states:
  c_j = Σ_i α_i h^E_i
- α_i are attention weights which help the decoder to focus on the relevant part of the encoder input
Image source: jalammar
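A tiny illustration of this weighted sum; the dimensions are arbitrary and the unnormalized scores are random placeholders for real alignment scores.

import torch

enc_states = torch.randn(7, 128)                     # encoder hidden states h^E_1 .. h^E_7
scores = torch.randn(7)                              # unnormalized alignment scores for step j (placeholder)
alpha = torch.softmax(scores, dim=0)                 # attention weights α_i, non-negative and summing to 1
c_j = (alpha.unsqueeze(1) * enc_states).sum(dim=0)   # context vector c_j = Σ_i α_i h^E_i, shape (128,)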
Seq-to-seq for MT + attention
To generate a new word, the decoder at step j:
- takes h^D_{j−1} as input and produces h^D_j
- concatenates h^D_j with c_j
- passes the concatenated vector through a linear layer with softmax activation to get a probability distribution over the target vocabulary
Attention weights
- The attention weight α_{ij} measures the similarity between the encoder hidden state h^E_i and the decoder state h^D_j while generating word j:
  α_{ij} = exp(sim(h^E_i, h^D_j)) / Σ_k exp(sim(h^E_k, h^D_j))
- The similarity sim can be computed by
  ◮ dot-product attention: sim(h, s) = h^⊤ s
  ◮ additive attention: sim(h, s) = w^⊤ tanh(W_h h + W_s s)
  ◮ multiplicative attention: sim(h, s) = h^⊤ W s
- The weights are trained jointly with the whole model
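The three scoring functions can be written down directly; this is a sketch with an arbitrary dimension and randomly initialized parameters w, W_h, W_s, W (in a real model these are learned jointly with the rest of the network).

import torch

d = 128
w = torch.randn(d)
W_h, W_s, W = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)

def sim_dot(h, s):             # dot-product attention: h^T s (requires equal dimensions)
    return h @ s

def sim_additive(h, s):        # additive attention: w^T tanh(W_h h + W_s s)
    return w @ torch.tanh(W_h @ h + W_s @ s)

def sim_multiplicative(h, s):  # multiplicative attention: h^T W s
    return h @ W @ s

h, s = torch.randn(d), torch.randn(d)
scores = [f(h, s) for f in (sim_dot, sim_additive, sim_multiplicative)]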
Attention map
Figure: Visualisation of attention weights
MT metrics
BLEU compares the system output to a reference translation (a toy implementation follows below).
Reference translation: E-mail was sent on Tuesday
System output: The letter was sent on Tuesday
For each N (N ∈ [1, 4]) compute the fraction of N-grams of the system output that also appear in the reference translation:
N = 1 ⇒ 4/6, N = 2 ⇒ 3/5, N = 3 ⇒ 2/4, N = 4 ⇒ 1/3
Take the geometric mean: score = (4/6 · 3/5 · 2/4 · 1/3)^{1/4}
Brevity penalty: BP = min(1, 6/5) = 1
Finally, BLEU = BP · score ≈ 0.5081
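The slide's numbers can be reproduced in a few lines of Python. This is a toy version of BLEU (no n-gram count clipping, and the simplified brevity penalty used above), not a reference implementation.

from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def toy_bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    score = 1.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_ngrams & ref_ngrams).values())       # shared n-grams
        score *= overlap / max(sum(cand_ngrams.values()), 1)     # 4/6, 3/5, 2/4, 1/3
    bp = min(1.0, len(cand) / len(ref))                          # simplified brevity penalty
    return bp * score ** (1 / max_n)                             # geometric mean times BP

print(toy_bleu("The letter was sent on Tuesday", "E-mail was sent on Tuesday"))  # ≈ 0.5081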
Task-oriented chat-bots
Natural language understanding
Two tasks, intent detection and slot filling: identify the speaker's intent and extract semantic constituents from the natural language query.
Figure: ATIS corpus sample with intent and slot annotation
- Intent detection is a classification task
- Slot filling is a sequence labelling task
- NLU datasets: ATIS [1], Snips [2]
Joint intent detection and slot filling [3]
1. The encoder is a biLSTM
2. The decoder is a unidirectional LSTM
3. At each step the decoder state s_i is s_i = f(s_{i−1}, y_{i−1}, h_i, c_i), where
   c_i = Σ_{j=1}^{T} α_{i,j} h_j,
   α_{i,j} = exp(e_{i,j}) / Σ_{k=1}^{T} exp(e_{i,k}),
   e_{i,k} = g(s_{i−1}, h_k)
The inputs are explicitly aligned. Costs from both decoders are back-propagated to the encoder.
Figure: Encoder-decoder model
Joint intent detection and slot filling [3]
- The biLSTM reads the source sequence in both directions; slot label dependencies are modeled in the forward RNN
- The hidden state h_i at each step is a concatenation of the forward state fh_i and the backward state bh_i
- The hidden state h_i is combined with the context vector c_i
- c_i is calculated as a weighted average of h = (h_1, ..., h_T)
Figure: RNN-based model
Figure: Attention weights
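The two slides above can be condensed into a rough PyTorch sketch of joint training: a shared biLSTM encoder, a slot-tagging head over every hidden state, and an intent head over the last one. This simplification drops the attention-based LSTM decoder of [3]; vocabulary and label sizes are invented.

import torch
import torch.nn as nn

class JointNLU(nn.Module):
    def __init__(self, vocab_size, n_slots, n_intents, emb_dim=64, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.slot_head = nn.Linear(2 * hid_dim, n_slots)        # one label per token
        self.intent_head = nn.Linear(2 * hid_dim, n_intents)    # one label per utterance

    def forward(self, tokens):
        states, _ = self.encoder(self.emb(tokens))              # (B, T, 2*hid_dim), forward+backward
        return self.slot_head(states), self.intent_head(states[:, -1])

model = JointNLU(vocab_size=1000, n_slots=20, n_intents=7)
tokens = torch.randint(0, 1000, (4, 12))                        # toy batch: 4 utterances, 12 tokens
slot_logits, intent_logits = model(tokens)
slot_gold = torch.randint(0, 20, (4, 12))
intent_gold = torch.randint(0, 7, (4,))
loss = nn.functional.cross_entropy(slot_logits.reshape(-1, 20), slot_gold.reshape(-1)) \
       + nn.functional.cross_entropy(intent_logits, intent_gold)
loss.backward()                                                 # both costs reach the shared encoder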
Constituency parsing
Grammar as a Foreign Language [4]
Figure: Example parsing task and its linearization
Grammar as a Foreign Language [4]
Figure: LSTM+attention encoder-decoder model for parsing
Grammar as a Foreign Language [4]
- The encoder LSTM is used to encode the sequence of input words A_i, |A| = T_A
- The decoder LSTM is used to output symbols B_i, |B| = T_B
- The attention vector at each output time step t over the input words:
  u^t_i = v^⊤ tanh(W_1 h^E_i + W_2 h^D_t)
  a^t_i = softmax(u^t_i)
  d'_t = Σ_{i=1}^{T_A} a^t_i h^E_i
  where the vector v and the matrices W_1, W_2 are learnable parameters of the model.
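These three equations translate almost line by line into code; the hidden size and the encoder/decoder states below are placeholders rather than the paper's actual setup.

import torch
import torch.nn as nn

hid = 128
W1, W2 = nn.Linear(hid, hid, bias=False), nn.Linear(hid, hid, bias=False)
v = nn.Linear(hid, 1, bias=False)

enc = torch.randn(9, hid)                              # h^E_1 .. h^E_{T_A}, here T_A = 9
dec_t = torch.randn(hid)                               # decoder state h^D_t
u = v(torch.tanh(W1(enc) + W2(dec_t))).squeeze(-1)     # scores u^t_i, shape (T_A,)
a = torch.softmax(u, dim=0)                            # attention weights a^t_i
d_t = a @ enc                                          # d'_t = Σ_i a^t_i h^E_i, shape (hid,)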
Spelling correction
Neural Language Correction with Character-Based Attention [5]
- Trained on a parallel corpus of "bad" (x) and "good" (y) sentences
- The encoder has a pyramid structure:
  f_t^{(j)} = GRU(f_{t−1}^{(j)}, c_t^{(j−1)})
  b_t^{(j)} = GRU(b_{t+1}^{(j)}, c_t^{(j−1)})
  h_t^{(j)} = f_t^{(j)} + b_t^{(j)}
  c_t^{(j)} = tanh(W_pyr^{(j)} [h_{2t}^{(j−1)}, h_{2t+1}^{(j−1)}]^⊤ + b_pyr^{(j)})
Figure: An encoder-decoder neural network model with two encoder hidden layers and one decoder hidden layer
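A loose sketch of one pyramid encoder layer, with my own shapes and a bidirectional nn.GRU standing in for the separate forward/backward GRUs: adjacent lower-layer states are concatenated, projected through a tanh, and fed to the layer's GRU, halving the sequence length at each layer.

import torch
import torch.nn as nn

class PyramidLayer(nn.Module):
    def __init__(self, hid=64):
        super().__init__()
        self.pyr = nn.Linear(2 * hid, hid)                       # W_pyr, b_pyr
        self.gru = nn.GRU(hid, hid, batch_first=True, bidirectional=True)

    def forward(self, h_prev):                                   # lower-layer states, (B, T, hid), T even
        B, T, H = h_prev.shape
        pooled = h_prev.reshape(B, T // 2, 2 * H)                # pairs [h_2t, h_2t+1]
        c = torch.tanh(self.pyr(pooled))                         # projected layer input c_t
        states, _ = self.gru(c)                                  # forward and backward passes
        f, b = states.chunk(2, dim=-1)
        return f + b                                             # h_t = f_t + b_t

h0 = torch.randn(2, 8, 64)                                       # toy lower-layer states
h1 = PyramidLayer()(h0)                                          # shape (2, 4, 64): half the length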
Neural Language Correction with Character-Based Attention [5]
- Decoder network: d_t^{(j)} = GRU(d_{t−1}^{(j)}, d_t^{(j−1)})
- Attention mechanism:
  u_{tk} = φ_1(d_t^{(M)})^⊤ φ_2(c_k), where φ_i(x) = tanh(W_i x)
  α_{tk} = u_{tk} / Σ_j u_{tj}
  a_t = Σ_j α_{tj} c_j
- Loss: L(x, y) = Σ_{t=1}^{T} log P(y_t | x, y_{<t})
Neural Language Correction with Character-Based Attention [5]
- Beam search for decoding: s_k(y_{1:k} | x) = log P_NN(y_{1:k} | x) + λ log P_LM(y_{1:k})
- Synthesizing errors: article or determiner errors (ArtOrDet) and noun number errors (Nn)
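The beam-search scoring rule is simply an interpolation of two log-probabilities. Below is a hypothetical helper; the two scorer callables are stand-ins for the actual seq2seq and language models, and the conditioning on x is assumed to live inside log_p_nn.

from typing import Callable, List

def beam_score(prefix: List[str],
               log_p_nn: Callable[[List[str]], float],   # log P_NN(y_1:k | x)
               log_p_lm: Callable[[List[str]], float],   # log P_LM(y_1:k)
               lam: float = 0.3) -> float:
    # s_k(y_1:k | x) = log P_NN(y_1:k | x) + lambda * log P_LM(y_1:k)
    return log_p_nn(prefix) + lam * log_p_lm(prefix)

# Toy usage with dummy scorers:
score = beam_score(["the", "letter", "was"],
                   log_p_nn=lambda y: -2.1,
                   log_p_lm=lambda y: -1.4)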
Summarization