Some RNN Variants
Arun Mallya
Outline

• Why Recurrent Neural Networks (RNNs)?
• The Vanilla RNN unit
• The RNN forward pass
• Backpropagation refresher
• The RNN backward pass
• Issues with the Vanilla RNN
• The Long Short-Term Memory (LSTM) unit
• The LSTM forward & backward pass
• LSTM variants and tips
  – Peephole LSTM
  – GRU
The Vanilla RNN Cell

[Figure: the vanilla RNN cell; inputs x_t and h_{t-1} pass through the shared weights W to produce h_t.]

h_t = \tanh\left( W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} \right)
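A minimal NumPy sketch of this cell; the weight matrix W and the dimensions are illustrative assumptions, and the bias is omitted to match the slide's equation:

```python
import numpy as np

def vanilla_rnn_cell(x_t, h_prev, W):
    """One step of the vanilla RNN: h_t = tanh(W [x_t; h_{t-1}]).

    x_t:    input at time t, shape (input_dim,)
    h_prev: previous hidden state, shape (hidden_dim,)
    W:      weight matrix, shape (hidden_dim, input_dim + hidden_dim)
    """
    concat = np.concatenate([x_t, h_prev])  # stack x_t on top of h_{t-1}
    return np.tanh(W @ concat)

# Example usage with made-up sizes
input_dim, hidden_dim = 4, 8
W = np.random.randn(hidden_dim, input_dim + hidden_dim) * 0.1
h = np.zeros(hidden_dim)
x = np.random.randn(input_dim)
h = vanilla_rnn_cell(x, h, W)
```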
The Vanilla RNN Forward

[Figure: the RNN unrolled over three time steps; inputs (x_1, h_0), (x_2, h_1), (x_3, h_2) produce hidden states h_1, h_2, h_3, outputs y_1, y_2, y_3, and losses C_1, C_2, C_3. The same weights W are shared across all time steps.]

h_t = \tanh\left( W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} \right)
y_t = F(h_t)
C_t = \mathrm{Loss}(y_t, GT_t)
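A hedged sketch of the unrolled forward pass. The linear readout for F and the squared-error loss are illustrative choices, since the slides leave F and Loss unspecified:

```python
import numpy as np

def rnn_forward(xs, h0, W, W_out, targets):
    """Unrolled forward pass: the same W is reused at every time step."""
    h, hs, ys, losses = h0, [], [], []
    for x_t, gt_t in zip(xs, targets):
        h = np.tanh(W @ np.concatenate([x_t, h]))     # h_t = tanh(W [x_t; h_{t-1}])
        y = W_out @ h                                  # y_t = F(h_t), here a linear map
        hs.append(h)
        ys.append(y)
        losses.append(0.5 * np.sum((y - gt_t) ** 2))   # C_t = Loss(y_t, GT_t)
    return hs, ys, losses
```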
The Vanilla RNN Backward

[Figure: the same unrolled network, with gradients flowing backward from each loss C_t to the earliest hidden state h_1.]

h_t = \tanh\left( W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} \right), \quad y_t = F(h_t), \quad C_t = \mathrm{Loss}(y_t, GT_t)

\frac{\partial C_t}{\partial h_1} = \left( \frac{\partial C_t}{\partial y_t} \right) \left( \frac{\partial y_t}{\partial h_1} \right)
= \left( \frac{\partial C_t}{\partial y_t} \right) \left( \frac{\partial y_t}{\partial h_t} \right) \left( \frac{\partial h_t}{\partial h_{t-1}} \right) \cdots \left( \frac{\partial h_2}{\partial h_1} \right)
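A rough sketch of that chain of Jacobians for the vanilla RNN. Here W_h denotes the columns of W that multiply h_{t-1}, and the local Jacobian \partial h_k / \partial h_{k-1} = \mathrm{diag}(1 - h_k^2)\, W_h follows from the tanh update; names and shapes are illustrative assumptions:

```python
import numpy as np

def grad_ht_wrt_h1(hs, W_h):
    """Product of Jacobians dh_t/dh_1 = prod_k diag(1 - h_k^2) @ W_h,
    where hs = [h_2, ..., h_t] are the hidden states after h_1."""
    hidden_dim = W_h.shape[0]
    J = np.eye(hidden_dim)
    for h_k in reversed(hs):                      # chain dh_k/dh_{k-1} back toward h_1
        J = J @ (np.diag(1.0 - h_k ** 2) @ W_h)   # tanh'(a) = 1 - tanh(a)^2
    return J
```

Because this is a long product of factors, the gradient tends to shrink (vanish) or blow up (explode) with the sequence length, which is the issue the LSTM addresses next.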
The Popular LSTM Cell

[Figure: the LSTM cell with input gate i_t, output gate o_t, forget gate f_t, and cell state c_t; each gate sees (x_t, h_{t-1}). The dashed line indicates a time-lag (c_{t-1}).]

f_t = \sigma\left( W_f \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} + b_f \right), and similarly for i_t, o_t

c_t = f_t \otimes c_{t-1} + i_t \otimes \tanh\left( W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} \right)

h_t = o_t \otimes \tanh(c_t)
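A minimal NumPy sketch of one LSTM step following these equations. The weight names and shapes are assumptions for illustration, and the candidate update has no bias, matching the slide:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W, b_f, b_i, b_o):
    """One LSTM step; the ⊗ in the slides is element-wise multiplication."""
    z = np.concatenate([x_t, h_prev])           # [x_t; h_{t-1}]
    f_t = sigmoid(W_f @ z + b_f)                # forget gate
    i_t = sigmoid(W_i @ z + b_i)                # input gate
    o_t = sigmoid(W_o @ z + b_o)                # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W @ z)   # new cell state
    h_t = o_t * np.tanh(c_t)                    # new hidden state
    return h_t, c_t
```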
LSTM – Forward/Backward

Go to: Illustrated LSTM Forward and Backward Pass
Class Exercise

• Consider the problem of translation from English to French
• E.g. "What is your name" → "Comment tu t'appelles"
• Is the architecture below suitable for this problem?

[Figure: an RNN unrolled over three time steps that reads English words E_1, E_2, E_3 and emits French words F_1, F_2, F_3 at the same positions.]

• No: sentences might be of different lengths and words might not align, so we need to see the entire sentence before translating

Adapted from http://www.cs.toronto.edu/~rgrosse/csc321/lec10.pdf
Class Exercise

• Sentences might be of different lengths and words might not align, so we need to see the entire sentence before translating

[Figure: an encoder-decoder architecture; the network first reads the full English sentence E_1, E_2, E_3, then generates the French output F_1, F_2, F_3, F_4.]

• The input-output structure depends on the structure of the problem at hand

Seq2Seq Learning with Neural Networks, Sutskever et al., 2014
Multi-layer RNNs

• We can of course design RNNs with multiple hidden layers

[Figure: a two-layer RNN unrolled over six time steps, mapping inputs x_1 … x_6 to outputs y_1 … y_6 through two stacked hidden layers.]

• Think exotic: skip connections across layers, across time, …
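A hedged sketch of stacking: each layer's hidden states become the next layer's input sequence. The per-layer weight list and the reuse of the vanilla cell above are illustrative assumptions:

```python
import numpy as np

def stacked_rnn_forward(xs, Ws, hidden_dim):
    """Multi-layer vanilla RNN: layer l consumes layer l-1's hidden states.

    xs: list of input vectors over time
    Ws: one weight matrix per layer, each of shape
        (hidden_dim, layer_input_dim + hidden_dim)
    """
    layer_inputs = xs
    for W in Ws:                                    # one pass per stacked layer
        h = np.zeros(hidden_dim)
        outputs = []
        for x_t in layer_inputs:
            h = np.tanh(W @ np.concatenate([x_t, h]))
            outputs.append(h)
        layer_inputs = outputs                      # feed hidden states upward
    return layer_inputs                             # top-layer hidden states
```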
Bi-directional RNNs

• RNNs can process the input sequence in the forward and in the reverse direction

[Figure: a bi-directional RNN unrolled over six time steps; a forward chain and a backward chain each read x_1 … x_6, and their hidden states are combined to produce y_1 … y_6.]

• Popular in speech recognition
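A minimal sketch of the idea, assuming two independent vanilla RNNs whose states are concatenated at each time step; the combination rule is an assumption, and other choices such as summation are also used in practice:

```python
import numpy as np

def bidirectional_rnn(xs, W_fwd, W_bwd, hidden_dim):
    """Run one RNN left-to-right and one right-to-left, then join their states."""
    def run(seq, W):
        h, states = np.zeros(hidden_dim), []
        for x_t in seq:
            h = np.tanh(W @ np.concatenate([x_t, h]))
            states.append(h)
        return states

    fwd = run(xs, W_fwd)               # reads x_1 ... x_T
    bwd = run(xs[::-1], W_bwd)[::-1]   # reads x_T ... x_1, then re-aligned
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```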
Recap

• RNNs allow processing of variable-length inputs and outputs by maintaining state information across time steps
• Various input-output scenarios are possible (single/multiple inputs and outputs)
• RNNs can be stacked, or made bi-directional
• Vanilla RNNs are improved upon by LSTMs, which address the vanishing gradient problem through the CEC (constant error carousel)
• Exploding gradients are handled by gradient clipping
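Gradient clipping from the last bullet is typically a one-liner; a sketch of clipping by global norm, where the threshold value is an arbitrary illustrative choice:

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale all gradients if their global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```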
Extension I: Peephole LSTM

[Figure: the LSTM cell as before, but the gates now also see the cell state. The dashed line indicates a time-lag (c_{t-1}).]

f_t = \sigma\left( W_f \begin{bmatrix} x_t \\ h_{t-1} \\ c_{t-1} \end{bmatrix} + b_f \right), and similarly for i_t, o_t (o_t uses c_t)

c_t = f_t \otimes c_{t-1} + i_t \otimes \tanh\left( W \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} \right)

h_t = o_t \otimes \tanh(c_t)
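A sketch of the peephole gates on top of the earlier LSTM step; weight names are assumptions, and note that the output gate peeks at the freshly computed c_t:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def peephole_lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W, b_f, b_i, b_o):
    """LSTM step where the gates also see the cell state (peephole connections)."""
    z_prev = np.concatenate([x_t, h_prev, c_prev])    # f_t, i_t peek at c_{t-1}
    f_t = sigmoid(W_f @ z_prev + b_f)
    i_t = sigmoid(W_i @ z_prev + b_i)
    c_t = f_t * c_prev + i_t * np.tanh(W @ np.concatenate([x_t, h_prev]))
    z_new = np.concatenate([x_t, h_prev, c_t])        # o_t peeks at the new c_t
    o_t = sigmoid(W_o @ z_new + b_o)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```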
Peephole LSTM

• Without peepholes, the gates can only see the output from the previous time step, which is close to 0 if the output gate is closed; yet these gates control the CEC cell
• Peepholes helped the LSTM learn better timing on the problems tested: spike timing and counting spike time delays

Recurrent nets that time and count, Gers et al., 2000
Other minor variants

• Coupled input and forget gate: f_t = 1 - i_t

• Full gate recurrence (the gates also see the previous gate activations):

f_t = \sigma\left( W_f \begin{bmatrix} x_t \\ h_{t-1} \\ c_{t-1} \\ i_{t-1} \\ f_{t-1} \\ o_{t-1} \end{bmatrix} + b_f \right)
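The coupled variant only changes how f_t is obtained in the LSTM step sketched earlier; a minimal illustration, with names following the earlier assumed sketch:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def coupled_gate_lstm_step(x_t, h_prev, c_prev, W_i, W_o, W, b_i, b_o):
    """CIFG variant: the forget gate is tied to the input gate, f_t = 1 - i_t."""
    z = np.concatenate([x_t, h_prev])
    i_t = sigmoid(W_i @ z + b_i)
    f_t = 1.0 - i_t                             # no separate forget-gate weights
    o_t = sigmoid(W_o @ z + b_o)
    c_t = f_t * c_prev + i_t * np.tanh(W @ z)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```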
LSTM: A Search Space Odyssey

• Tested the following variants, using the peephole LSTM as the standard:
  1. No Input Gate (NIG)
  2. No Forget Gate (NFG)
  3. No Output Gate (NOG)
  4. No Input Activation Function (NIAF)
  5. No Output Activation Function (NOAF)
  6. No Peepholes (NP)
  7. Coupled Input and Forget Gate (CIFG)
  8. Full Gate Recurrence (FGR)
• On the tasks of:
  – TIMIT speech recognition: audio frame to 1 of 61 phonemes
  – IAM Online handwriting recognition: sketch to characters
  – JSB Chorales: next-step music frame prediction

LSTM: A Search Space Odyssey, Greff et al., 2015
LSTM: A Search Space Odyssey

• The standard LSTM performed reasonably well on multiple datasets, and none of the modifications significantly improved performance
• Coupling the gates and removing peephole connections simplified the LSTM without hurting performance much
• The forget gate and the output activation function are crucial
• The interaction between learning rate and network size was found to be minimal, which indicates that calibration can be done with a small network first

LSTM: A Search Space Odyssey, Greff et al., 2015
Gated Recurrent Unit (GRU)

• A very simplified version of the LSTM
  – Merges the forget and input gates into a single 'update' gate
  – Merges the cell state and hidden state
• Has fewer parameters than an LSTM and has been shown to outperform the LSTM on some tasks

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Cho et al., 2014
GRU

[Figure: the GRU cell, with reset gate r_t, update gate z_t, candidate state h'_t, and output h_t; inputs are x_t and h_{t-1}.]

r_t = \sigma\left( W_r \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} + b_r \right)

h'_t = \tanh\left( W \begin{bmatrix} x_t \\ r_t \otimes h_{t-1} \end{bmatrix} \right)

z_t = \sigma\left( W_z \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix} + b_z \right)

h_t = (1 - z_t) \otimes h_{t-1} + z_t \otimes h'_t
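A minimal NumPy sketch of one GRU step following these equations; the weight names and shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_r, W_z, W, b_r, b_z):
    """One GRU step: reset gate, candidate state, update gate, interpolation."""
    z_in = np.concatenate([x_t, h_prev])
    r_t = sigmoid(W_r @ z_in + b_r)                              # reset gate
    h_cand = np.tanh(W @ np.concatenate([x_t, r_t * h_prev]))    # candidate h'_t
    z_t = sigmoid(W_z @ z_in + b_z)                              # update gate
    return (1.0 - z_t) * h_prev + z_t * h_cand                   # new hidden state
```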