1. NPFL114, Lecture 9: Recurrent Neural Networks III
Milan Straka, April 29, 2019
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

2. Recurrent Neural Networks
Figure: a single RNN cell (input, state, output) and the same cell unrolled over four time steps (inputs 1-4, a state passed between steps, outputs 1-4).

3. Basic RNN Cell
Figure: a cell taking input $x^{(t)}$ and the previous state $s^{(t-1)}$; its output is the new state.
Given an input $x^{(t)}$ and previous state $s^{(t-1)}$, the new state is computed as
$$s^{(t)} = f(s^{(t-1)}, x^{(t)}; \theta).$$
One of the simplest possibilities is
$$s^{(t)} = \tanh(U s^{(t-1)} + V x^{(t)} + b).$$
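As a concrete illustration of the second formula, here is a minimal NumPy sketch of the tanh cell; the dimensions, initialization, and the unrolling loop are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def rnn_cell_step(s_prev, x, U, V, b):
    """One step of the basic cell: s_t = tanh(U s_{t-1} + V x_t + b)."""
    return np.tanh(U @ s_prev + V @ x + b)

# Toy unrolling over 5 time steps (shapes chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
state_dim, input_dim = 4, 3
U = rng.normal(size=(state_dim, state_dim))
V = rng.normal(size=(state_dim, input_dim))
b = np.zeros(state_dim)

s = np.zeros(state_dim)
for x in rng.normal(size=(5, input_dim)):
    s = rnn_cell_step(s, x, U, V, b)
```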

4. Basic RNN Cell
Basic RNN cells suffer a lot from vanishing/exploding gradients (the challenge of long-term dependencies).
If we simplify the recurrence of states to
$$s^{(t)} = U s^{(t-1)},$$
we get
$$s^{(t)} = U^t s^{(0)}.$$
If $U$ has eigenvalue decomposition $U = Q \Lambda Q^{-1}$, we get
$$s^{(t)} = Q \Lambda^t Q^{-1} s^{(0)}.$$
The main problem is that the same function is iteratively applied many times.
Several more complex RNN cell variants have been proposed, which alleviate this issue to some degree, namely LSTM and GRU.
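A tiny numerical illustration of why iterating the same linear map is problematic (the matrix and the horizon of 50 steps are made up for this example): eigenvalues below one shrink the state towards zero, eigenvalues above one blow it up.

```python
import numpy as np

# s_t = U s_{t-1} iterated 50 times; with U = diag(0.5, 1.5) the two components
# scale as 0.5^50 (vanishes) and 1.5^50 (explodes).
U = np.diag([0.5, 1.5])
s = np.ones(2)
for _ in range(50):
    s = U @ s
print(s)   # approximately [8.9e-16, 6.4e+08]
```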

5. Long Short-Term Memory
Later, Gers, Schmidhuber & Cummins (1999) added the possibility to forget information from the memory cell $c_t$.
$$i_t \leftarrow \sigma(W^i x_t + V^i h_{t-1} + b^i)$$
$$f_t \leftarrow \sigma(W^f x_t + V^f h_{t-1} + b^f)$$
$$o_t \leftarrow \sigma(W^o x_t + V^o h_{t-1} + b^o)$$
$$c_t \leftarrow f_t \cdot c_{t-1} + i_t \cdot \tanh(W^y x_t + V^y h_{t-1} + b^y)$$
$$h_t \leftarrow o_t \cdot \tanh(c_t)$$
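The equations above translate directly into code; the following NumPy sketch keeps the slide's naming of the weight matrices, while the dictionary-of-parameters packaging is my own convention.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(c_prev, h_prev, x, W, V, b):
    """One LSTM step; W, V, b map the keys 'i', 'f', 'o', 'y' to the matrices/biases above."""
    i = sigmoid(W['i'] @ x + V['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + V['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + V['o'] @ h_prev + b['o'])   # output gate
    c = f * c_prev + i * np.tanh(W['y'] @ x + V['y'] @ h_prev + b['y'])
    h = o * np.tanh(c)
    return c, h
```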

6. Long Short-Term Memory
Figure from http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png

7. Gated Recurrent Unit
$$r_t \leftarrow \sigma(W^r x_t + V^r h_{t-1} + b^r)$$
$$u_t \leftarrow \sigma(W^u x_t + V^u h_{t-1} + b^u)$$
$$\hat h_t \leftarrow \tanh(W^h x_t + V^h (r_t \cdot h_{t-1}) + b^h)$$
$$h_t \leftarrow u_t \cdot h_{t-1} + (1 - u_t) \cdot \hat h_t$$
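In the same spirit, a sketch of the GRU update; as with the LSTM sketch above, the dictionary-of-parameters packaging is an assumption of the example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(h_prev, x, W, V, b):
    """One GRU step; W, V, b map the keys 'r', 'u', 'h' to the matrices/biases above."""
    r = sigmoid(W['r'] @ x + V['r'] @ h_prev + b['r'])            # reset gate
    u = sigmoid(W['u'] @ x + V['u'] @ h_prev + b['u'])            # update gate
    h_hat = np.tanh(W['h'] @ x + V['h'] @ (r * h_prev) + b['h'])  # candidate state
    return u * h_prev + (1 - u) * h_hat
```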

8. Gated Recurrent Unit
Figure from http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png

9. Word Embeddings
One-hot encoding considers all words to be independent of each other. However, words are not independent – some are more similar than others. Ideally, we would like some kind of similarity in the space of the word representations.
Distributed Representation
The idea behind distributed representation is that objects can be represented using a set of common underlying factors.
We therefore represent words as fixed-size embeddings into the space $\mathbb{R}^d$, with the vector elements playing the role of the common underlying factors.

10. Word Embeddings
The word embedding layer is in fact just a fully connected layer on top of one-hot encoding. However, it is important that this layer is shared across the whole network.
Figure: several words in one-hot encoding (dimension $V$), each mapped by the same shared $V \times D$ weight matrix to $D$-dimensional embeddings $D_1$, $D_2$, $D_3$.
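A small sketch of this claim: multiplying a one-hot vector by the shared weight matrix just selects one of its rows, so the embedding layer is an ordinary (shared) dense layer implemented as a lookup. The sizes below are illustrative.

```python
import numpy as np

vocab_size, dim = 5, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, dim))       # the shared embedding (weight) matrix

word_id = 2
one_hot = np.eye(vocab_size)[word_id]        # "word in one-hot encoding"
assert np.allclose(one_hot @ E, E[word_id])  # dense layer on one-hot == row lookup
```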

11. Word Embeddings for Unknown Words
Recurrent Character-level WEs
Figure 1 of the paper "Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation", https://arxiv.org/abs/1508.02096.

12. Word Embeddings for Unknown Words
Convolutional Character-level WEs
Figure 1 of the paper "Character-Aware Neural Language Models", https://arxiv.org/abs/1508.06615.

13. Basic RNN Applications
Sequence Element Classification
Use outputs for individual elements (figure: an RNN unrolled over inputs 1-4, producing outputs 1-4).
Sequence Representation
Use the state after processing the whole sequence (alternatively, take the output of the last element).
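A sketch of both uses with the basic cell from earlier; all shapes, the output projection W_out, and the random inputs are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, input_dim, n_classes, T = 4, 3, 2, 6
U = rng.normal(size=(state_dim, state_dim))
V = rng.normal(size=(state_dim, input_dim))
b = np.zeros(state_dim)
W_out = rng.normal(size=(n_classes, state_dim))   # hypothetical per-element classifier
inputs = rng.normal(size=(T, input_dim))

s, outputs = np.zeros(state_dim), []
for x in inputs:
    s = np.tanh(U @ s + V @ x + b)                # basic RNN cell; output = new state
    outputs.append(s)

element_logits = np.stack([W_out @ o for o in outputs])  # sequence element classification
sequence_repr = s                                         # sequence representation
```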

14. Structured Prediction
Consider generating a sequence $y_1, \ldots, y_N \in Y^N$ given input $x_1, \ldots, x_N$.
Predicting each sequence element independently models the distribution $P(y_i \mid X)$.
However, there may be dependencies among the $y_i$ themselves, which is difficult to capture by independent element classification.

15. Linear-Chain Conditional Random Fields (CRF)
A linear-chain Conditional Random Field, usually abbreviated only to CRF, acts as an output layer. It can be considered an extension of a softmax – instead of a sequence of independent softmaxes, CRF is a sentence-level softmax, with additional weights for neighboring sequence elements.
$$s(X, y; \theta, A) = \sum_{i=1}^{N} \big( A_{y_{i-1}, y_i} + f_\theta(y_i \mid X) \big)$$
$$p(y \mid X) = \operatorname{softmax}_{z \in Y^N}\big( s(X, z) \big)_y$$
$$\log p(y \mid X) = s(X, y) - \operatorname{logadd}_{z \in Y^N}\big( s(X, z) \big)$$
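A minimal sketch of the score $s(X, y; \theta, A)$ for one given label sequence, assuming the unary scores $f_\theta(\cdot \mid X)$ are precomputed per position; how the initial transition (the undefined $y_0$) is handled is a simplifying assumption of the example.

```python
import numpy as np

def crf_score(emissions, transitions, y):
    """Unnormalized score s(X, y) = sum_i (A[y_{i-1}, y_i] + f_theta(y_i | X)).

    emissions: (N, K) per-position unary scores; transitions: (K, K) weights A;
    y: label sequence of length N. The first position gets no incoming transition here.
    """
    score = emissions[0, y[0]]
    for i in range(1, len(y)):
        score += transitions[y[i - 1], y[i]] + emissions[i, y[i]]
    return score
```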

16. Linear-Chain Conditional Random Fields (CRF)
Computation
We can compute $p(y \mid X)$ efficiently using dynamic programming. We denote by $\alpha_t(k)$ the log probability of all sentences with $t$ elements, the last one being labeled $k$.
The core idea is the following:
$$\alpha_t(k) = f_\theta(y_t = k \mid X) + \operatorname{logadd}_{j \in Y}\big( \alpha_{t-1}(j) + A_{j,k} \big).$$
For efficient implementation, we use the fact that
$$\ln(a + b) = \ln a + \ln(1 + e^{\ln b - \ln a}).$$
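A sketch of the $\alpha$ recursion computing the $\operatorname{logadd}$ over all label sequences; np.logaddexp implements the stable $\ln(a+b)$ identity above. The treatment of the first position is again a simplifying assumption matching the score sketch.

```python
import numpy as np

def crf_log_partition(emissions, transitions):
    """logadd over all z in Y^N of s(X, z), via the alpha_t(k) recursion.

    emissions: (N, K) unary scores f_theta; transitions: (K, K) weights A.
    """
    alpha = emissions[0].copy()
    for t in range(1, len(emissions)):
        # alpha_t(k) = f_theta(y_t = k | X) + logadd_j (alpha_{t-1}(j) + A[j, k])
        alpha = emissions[t] + np.logaddexp.reduce(alpha[:, None] + transitions, axis=0)
    return np.logaddexp.reduce(alpha)

# log p(y | X) = crf_score(emissions, transitions, y) - crf_log_partition(emissions, transitions)
```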

17. Conditional Random Fields (CRF)
Decoding
We can perform optimal decoding by using the same algorithm, only replacing $\operatorname{logadd}$ with $\max$ and tracking where the maximum was attained.
Applications
CRF output layers are useful for span labeling tasks, like
- named entity recognition,
- dialog slot filling.
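The decoding recipe above, with logadd replaced by max and back-pointers recording where each maximum was attained, in the same sketch style as the previous blocks:

```python
import numpy as np

def crf_decode(emissions, transitions):
    """Viterbi decoding: the alpha recursion with logadd replaced by max, plus back-pointers."""
    N, K = emissions.shape
    alpha = emissions[0].copy()
    backptr = np.zeros((N, K), dtype=int)
    for t in range(1, N):
        scores = alpha[:, None] + transitions     # scores[j, k]: previous label j, current k
        backptr[t] = np.argmax(scores, axis=0)    # where the maximum was attained
        alpha = emissions[t] + np.max(scores, axis=0)
    best = [int(np.argmax(alpha))]
    for t in range(N - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]                             # most probable label sequence
```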

18. Connectionist Temporal Classification
Let us again consider generating a sequence $y_1, \ldots, y_M$ given input $x_1, \ldots, x_N$, but this time $M \le N$ and there is no explicit alignment of $x$ and $y$ in the gold data.
Figure 7.1 of the dissertation "Supervised Sequence Labelling with Recurrent Neural Networks" by Alex Graves.

19. Connectionist Temporal Classification
We enlarge the set of output labels by a blank symbol (–) and perform a classification for every input element to produce an extended labeling. We then post-process it by the following rules (denoted $\mathcal{B}$):
1. We remove neighboring identical symbols.
2. We remove the blanks (–).
Because the explicit alignment of inputs and labels is not known, we consider all possible alignments.
Denoting the probability of label $l$ at time $t$ as $p_l^t$, we define
$$\alpha^t(s) \stackrel{\mathrm{def}}{=} \sum_{\text{labeling } \pi:\ \mathcal{B}(\pi_{1:t}) = y_{1:s}} \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}.$$
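The post-processing mapping $\mathcal{B}$ (drop repeated neighboring symbols, then drop blanks) is easy to state in code; representing the blank as the string "-" is purely an illustration choice.

```python
def ctc_collapse(extended, blank="-"):
    """The mapping B: remove neighboring identical symbols, then remove blanks."""
    collapsed = []
    for symbol in extended:
        if not collapsed or symbol != collapsed[-1]:
            collapsed.append(symbol)
    return [s for s in collapsed if s != blank]

assert ctc_collapse(list("aa-b-bb")) == list("abb")
```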

20. CRF and CTC Comparison
In CRF, we normalize over whole sentences; therefore, we need to compute unnormalized probabilities for all the (exponentially many) sentences. Decoding can be performed optimally.
In CTC, we normalize per each label. However, because we do not have an explicit alignment, we compute the probability of a labeling by summing the probabilities of the (generally exponentially many) extended labelings.

21. Connectionist Temporal Classification
Computation
When aligning an extended labeling to a regular one, we need to consider whether the extended labeling ends by a blank or not. We therefore define
$$\alpha_-^t(s) \stackrel{\mathrm{def}}{=} \sum_{\text{labeling } \pi:\ \mathcal{B}(\pi_{1:t}) = y_{1:s},\ \pi_t = -} \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}$$
$$\alpha_*^t(s) \stackrel{\mathrm{def}}{=} \sum_{\text{labeling } \pi:\ \mathcal{B}(\pi_{1:t}) = y_{1:s},\ \pi_t \neq -} \prod_{t'=1}^{t} p_{\pi_{t'}}^{t'}$$
and compute $\alpha^t(s)$ as $\alpha_-^t(s) + \alpha_*^t(s)$.
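A sketch of a dynamic program over $\alpha_-^t(s)$ and $\alpha_*^t(s)$ in log space. The slide above only defines the two quantities; the concrete update rules below follow the standard CTC recursion (Graves) and are therefore an assumption here, as is the indexing convention (blank label 0, target sequence without blanks).

```python
import numpy as np

def ctc_log_likelihood(probs, y, blank=0):
    """log p(y | X) by summing over all extended labelings.

    probs: (T, L) per-frame label probabilities p_l^t; y: target labels without blanks.
    alpha_minus[s] / alpha_star[s]: log-probability of having produced y[:s] so far,
    with the last extended symbol being a blank / the label y[s-1].
    """
    T, S = len(probs), len(y)
    log_p = np.log(probs)
    alpha_minus = np.full(S + 1, -np.inf)
    alpha_star = np.full(S + 1, -np.inf)
    alpha_minus[0] = log_p[0, blank]                 # first frame emits a blank ...
    if S > 0:
        alpha_star[1] = log_p[0, y[0]]               # ... or the first target label
    for t in range(1, T):
        new_minus = np.full(S + 1, -np.inf)
        new_star = np.full(S + 1, -np.inf)
        for s in range(S + 1):
            # emit a blank: either stay on a blank, or follow the label y[s-1]
            new_minus[s] = log_p[t, blank] + np.logaddexp(alpha_minus[s], alpha_star[s])
            if s > 0:
                # emit y[s-1]: repeat it, start it after a blank,
                # or start it directly after a different label y[s-2]
                total = np.logaddexp(alpha_star[s], alpha_minus[s - 1])
                if s > 1 and y[s - 1] != y[s - 2]:
                    total = np.logaddexp(total, alpha_star[s - 1])
                new_star[s] = log_p[t, y[s - 1]] + total
        alpha_minus, alpha_star = new_minus, new_star
    return np.logaddexp(alpha_minus[S], alpha_star[S])
```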
