Deep Dive on RNNs Charles Martin
What is an Artificial Neurone? Source - Wikimedia Commons
Feed-Forward Network For each unit: y = \tanh(Wx + b)
Recurrent Network For each unit: y_t = \tanh(Ux_t + Vh_{t-1} + b)
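A minimal NumPy sketch of the two update rules, with toy dimensions and random weights chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: 4 inputs, 3 hidden/output units.
W = rng.normal(size=(3, 4))   # feed-forward weights
U = rng.normal(size=(3, 4))   # input-to-hidden weights (recurrent net)
V = rng.normal(size=(3, 3))   # hidden-to-hidden weights (recurrent net)
b = np.zeros(3)

x_t = rng.normal(size=4)      # current input
h_prev = np.zeros(3)          # previous hidden state

# Feed-forward unit: output depends only on the current input.
y_ff = np.tanh(W @ x_t + b)

# Recurrent unit: output also depends on the previous hidden state.
y_t = np.tanh(U @ x_t + V @ h_prev + b)
```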
Sequence Learning Tasks
Recurrent Network simplifying...
Recurrent Network simplifying and rotating...
“State” in Recurrent Networks ◮ Recurrent Networks are all about storing a “state” in between computations. ◮ A “lossy summary of ... past sequences” ◮ h is the “hidden state” of our RNN ◮ What influences h?
Defining the RNN State We can define a simplified RNN represented by this diagram as follows: h_t = \tanh(Ux_t + Vh_{t-1} + b), \hat{y}_t = \mathrm{softmax}(c + Wh_t)
Unfolding an RNN in Time Figure 1: Unfolding an RNN in Time ◮ By unfolding the RNN we can compute \hat{y} for a given length of sequence. ◮ Note that the weight matrices U, V, W are the same for each timestep; this is the big advantage of RNNs!
Forward Propagation We can now use the following equations to compute \hat{y}_3, by computing h for the previous steps: h_t = \tanh(Ux_t + Vh_{t-1} + b), \hat{y}_t = \mathrm{softmax}(c + Wh_t)
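A small NumPy sketch of this forward pass over three timesteps, assuming toy dimensions and random weights:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 4, 5, 3                  # assumed toy dimensions

U = rng.normal(size=(n_hidden, n_in))
V = rng.normal(size=(n_hidden, n_hidden))
W = rng.normal(size=(n_out, n_hidden))
b = np.zeros(n_hidden)
c = np.zeros(n_out)

xs = [rng.normal(size=n_in) for _ in range(3)]   # x_1, x_2, x_3
h = np.zeros(n_hidden)                           # h_0

for x_t in xs:
    h = np.tanh(U @ x_t + V @ h + b)             # h_t
    y_hat = softmax(c + W @ h)                   # y-hat_t

print(y_hat)   # y-hat_3: a probability distribution over n_out classes
```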
Y-hat is Softmax’d \hat{y} is a probability distribution! A finite set of values that sum to 1: \sigma(z)_j = e^{z_j} / \sum_{k=1}^{K} e^{z_k}, for j = 1, \ldots, K
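A quick numeric check, with assumed scores, that the softmax output really does sum to 1:

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])          # assumed raw scores ("logits")
sigma = np.exp(z) / np.exp(z).sum()    # softmax: e^{z_j} / sum_k e^{z_k}

print(sigma)         # approx. [0.659, 0.242, 0.099]
print(sigma.sum())   # 1.0
```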
Calculating Loss: Categorical Cross Entropy We use the categorical cross-entropy function for loss: h_t = \tanh(b + Vh_{t-1} + Ux_t), \hat{y}_t = \mathrm{softmax}(c + Wh_t), L_t = -y_t \cdot \log(\hat{y}_t), \text{Loss} = \sum_t L_t
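A sketch of the per-timestep loss, assuming a one-hot target and an example prediction:

```python
import numpy as np

y_true = np.array([0.0, 1.0, 0.0])   # one-hot target (assumed)
y_hat = np.array([0.1, 0.7, 0.2])    # softmax output of the model (assumed)

# Categorical cross-entropy for one timestep: L_t = -y_t . log(y-hat_t)
L_t = -np.sum(y_true * np.log(y_hat))
print(L_t)   # ~0.357; the total loss is the sum of L_t over all timesteps
```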
Backpropagation Through Time (BPTT) Propagates error correction backwards through the network graph, adjusting all parameters ( U , V , W ) to minimise loss.
Example: Character-level text model ◮ Training data: a collection of text. ◮ Input ( X ): snippets of 30 characters from the collection. ◮ Target output ( y ) : 1 character, the next one after the 30 in each X .
Training the Character-level Model ◮ Target: a one-hot probability distribution with P(n) = 1, i.e., all probability on the correct next character. ◮ Output: a probability distribution over all possible next letters. ◮ E.g.: “My cat is named Simon” would lead to X: “My cat is named Simo” and y: “n”
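One possible way to slice text into 30-character inputs and next-character targets; the toy corpus and variable names here are assumptions for illustration:

```python
# Assumed toy corpus; in practice this would be a large text collection.
corpus = "My cat is named Simon. " * 100
vocab = sorted(set(corpus))
char_to_index = {ch: i for i, ch in enumerate(vocab)}

seq_len = 30
X, y = [], []
for start in range(len(corpus) - seq_len):
    snippet = corpus[start:start + seq_len]       # 30-character input
    next_char = corpus[start + seq_len]           # the character that follows
    X.append([char_to_index[ch] for ch in snippet])
    y.append(char_to_index[next_char])
# X and y can then be one-hot encoded (or embedded) and fed to the RNN.
```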
Using the trained model to generate text ◮ S: sampling function; sample a letter using the output probability distribution. ◮ The generated letter is reinserted as the next input. ◮ We don’t want to always draw the most likely character: that would give frequent repetition and “copying” from the training text. We need a sampling strategy (see the sketch below).
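One common sampling strategy (an assumption here, not prescribed by the slide) is temperature sampling: rescale the predicted distribution before drawing, so low temperatures behave almost greedily and higher temperatures add variety:

```python
import numpy as np

def sample(probs, temperature=1.0, rng=np.random.default_rng()):
    """Draw an index from a predicted distribution, reweighted by temperature."""
    logits = np.log(probs + 1e-8) / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

# Usage sketch: probs is the softmax output of the char-level model.
probs = np.array([0.5, 0.3, 0.15, 0.05])   # assumed example distribution
next_index = sample(probs, temperature=0.8)
```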
Char-RNN ◮ RNN as a sequence generator. ◮ Input is the current symbol, output is the next predicted symbol. ◮ Connect output to input and continue! ◮ CharRNN simply applies this to a (subset) of ASCII characters. ◮ Train and generate on any text corpus: Fun! See: Karpathy, A. (2015). The unreasonable effectiveness of recurrent neural networks.
Char-RNN Examples Shakespeare (Karpathy, 2015): “Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states. DUKE VINCENTIO: Well, your wit is in the care of side and that.” LaTeX Algebraic Geometry: N.B. “Proof. Omitted.” Lol.
RNN Architectures and LSTM
Bidirectional RNNs ◮ Useful for tasks where the whole sequence is available. ◮ Each output unit (\hat{y}) depends on both past and future inputs, but is most sensitive to nearby timesteps. ◮ Popular in speech recognition, translation, etc.
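A minimal sketch of a bidirectional recurrent layer, assuming a TensorFlow/Keras setup with arbitrary layer sizes:

```python
import tensorflow as tf

# The Bidirectional wrapper runs one LSTM forward and one backward over the
# sequence, so each output step can depend on both past and future inputs.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 64)),                    # (timesteps, features)
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(32, return_sequences=True)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```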
Encoder-Decoder (seq-to-seq) ◮ Learns to generate output sequence ( y ) from an input sequence ( x ). ◮ Final hidden state of encoder is used to compute a context variable C . ◮ For example, translation.
Deep RNNs ◮ Does adding deeper layers to an RNN make it work better? ◮ Several options for architecture. ◮ Simply stacking RNN layers is very popular; shown to work better by Graves et al. (2013) ◮ Intuitively: layers might learn some hierarchical knowledge automatically. ◮ Typical setup: up to three recurrent layers.
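A sketch of the “simply stacking” option, assuming Keras and arbitrary layer sizes; every recurrent layer except the last returns full sequences so the next layer receives a sequence:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 64)),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.LSTM(128),                 # final layer returns last state only
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
```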
Long-Term Dependencies ◮ Learning long-term dependencies is a mathematical challenge. ◮ Basically: gradients propagated through the same weights tend to vanish (mostly) or explode (rarely). ◮ E.g., consider a simplified RNN with no nonlinear activation function and no input: h_t = Wh_{t-1}, so h_t = (W^t)h_0. ◮ Each time step multiplies h_0 by W. Supposing W admits an eigendecomposition with orthogonal matrix Q, W = Q\Lambda Q^\top and h_t = Q\Lambda^t Q^\top h_0. ◮ This corresponds to raising the eigenvalues in \Lambda to the power t. ◮ Eventually, components of h_0 not aligned with the largest eigenvector will be discarded.
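A rough numerical illustration of this argument, using an assumed diagonal W so the eigenvalues (0.5 and 1.1) are visible directly:

```python
import numpy as np

W = np.diag([0.5, 1.1])      # assumed W with one eigenvalue < 1 and one > 1
h = np.array([1.0, 1.0])     # h_0

for t in range(1, 51):
    h = W @ h                # h_t = W h_{t-1} = W^t h_0
    if t % 10 == 0:
        print(t, h)
# The first component vanishes (0.5^t) while the second explodes (1.1^t);
# gradients propagated repeatedly through the same W behave similarly.
```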
Vanishing and Exploding Gradients ◮ “in order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish” ◮ “whenever the model is able to represent long term dependencies, the gradient of a long term interaction has exponentially smaller magnitude than the gradient of a short term interaction.” ◮ Note that this problem is only relevant for recurrent networks, since the weights W affecting the hidden state are the same at each time step. ◮ Goodfellow, Bengio, and Courville (2016): “the problem of learning long-term dependencies remains one of the main challenges in deep learning” ◮ WildML (2015). Backpropagation Through Time and Vanishing Gradients ◮ ML for artists
Gated RNNs ◮ Possible solution! ◮ Provide a gate that can change the hidden state a little bit at each step. ◮ The gates are controlled by learnable weights as well! ◮ Hidden state weights that may change at each time step. ◮ Create paths through time with derivatives that do not vanish/explode. ◮ Gates choose information to accumulate or forget at each time step. ◮ Most effective sequence models used in practice!
Long Short-Term Memory ◮ Self-loop containing an internal state (c). ◮ Three extra gating units: ◮ Forget gate: controls how much memory is preserved. ◮ Input gate: controls how much of the current input is stored. ◮ Output gate: controls how much of the state is shown to the output. ◮ Each gate has its own weights and biases, so this uses many more parameters. ◮ Some variants on this design, e.g., use c as an additional input to the three gate units.
Long Short-Term Memory ◮ Forget gate: f ◮ Internal state: s ◮ Input gate: g ◮ Output gate: q ◮ Output: h
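A minimal NumPy sketch of one LSTM step using the gate names above (f, g, q, s, h); it follows the common tanh-candidate formulation with assumed toy dimensions, so treat it as illustrative rather than the exact variant on the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n_in, n_hidden = 4, 3                               # assumed toy sizes

# Each gate has its own weights and biases (hence the extra parameters).
def make_params():
    return (rng.normal(size=(n_hidden, n_in)),      # input weights
            rng.normal(size=(n_hidden, n_hidden)),  # recurrent weights
            np.zeros(n_hidden))                     # bias

(Uf, Wf, bf), (Ug, Wg, bg), (Uq, Wq, bq), (Uc, Wc, bc) = (
    make_params(), make_params(), make_params(), make_params())

def lstm_step(x_t, h_prev, s_prev):
    f = sigmoid(Uf @ x_t + Wf @ h_prev + bf)        # forget gate: keep old memory?
    g = sigmoid(Ug @ x_t + Wg @ h_prev + bg)        # input gate: store new input?
    q = sigmoid(Uq @ x_t + Wq @ h_prev + bq)        # output gate: expose state?
    c = np.tanh(Uc @ x_t + Wc @ h_prev + bc)        # candidate update
    s = f * s_prev + g * c                          # internal state
    h = q * np.tanh(s)                              # output
    return h, s

h, s = lstm_step(rng.normal(size=n_in), np.zeros(n_hidden), np.zeros(n_hidden))
```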
Other Gating Units ◮ Are three gates necessary? ◮ Other gating units are simpler, e.g., Gated Recurrent Unit (GRU) ◮ For the moment, LSTMs are winning in practical use. ◮ Maybe someone wants to explore alternatives in a project? Source: (Olah, C. 2015.)
Visualising LSTM activations Sometimes, the LSTM cell state corresponds with features of the sequential data: Source: (Karpathy, 2015)
CharRNN Applications: FolkRNN Some kinds of music can be represented in a text-like manner. Source: Sturm et al. 2015. Folk Music Style Modelling by Recurrent Neural Networks with Long Short Term Memory Units
Other CharRNN Applications Teaching Recurrent Neural Networks about Monet
Google Magenta Performance RNN ◮ State-of-the-art in music-generating RNNs. ◮ Encodes MIDI musical sequences as categorical data. ◮ Now supports polyphony (multiple notes), dynamics (volume), and expressive timing.
Neural iPad Band, another CharRNN ◮ iPad music transcribed as sequence of numbers for each performer. ◮ Trick: encode multiple ints as one (preserving ordering). ◮ Video
Books and Learning References ◮ Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. ◮ François Chollet. 2018. Deep Learning with Python. Manning. ◮ Chris Olah. 2015. Understanding LSTMs. ◮ RNNs in Tensorflow ◮ Maybe RNN/LSTM is dead? CNNs can work similarly to BLSTMs. ◮ Karpathy, A. 2015. The Unreasonable Effectiveness of Recurrent Neural Networks.
Summary ◮ Recurrent Neural Networks let us capture and model the structure of sequential data. ◮ Sampling from trained RNNs allows us to generate new, creative sequences. ◮ The internal state of RNNs makes them interesting for interactive applications, since it lets them capture and continue from the current context or “style”. ◮ LSTM units are able to overcome the vanishing gradient problem to some extent.