CS7015 (Deep Learning): Lecture 15
Long Short Term Memory Cells (LSTMs), Gated Recurrent Units (GRUs)
Mitesh M. Khapra
Department of Computer Science and Engineering, Indian Institute of Technology Madras
Module 15.1: Selective Read, Selective Write, Selective Forget - The Whiteboard Analogy
[Figure: an RNN unrolled over time, with inputs $x_1, \dots, x_t$, states $s_1, \dots, s_t$, outputs $y_1, \dots, y_t$, and shared weights $U$ (input to state), $W$ (state to state) and $V$ (state to output).]
The state ($s_i$) of an RNN records information from all previous time steps.
At each new timestep the old information gets morphed by the current input.
One could imagine that after $t$ steps the information stored at time step $t-k$ (for some $k < t$) gets morphed so much that it would be impossible to extract the original information stored at time step $t-k$.
A similar problem occurs when the information flows backwards (backpropagation).
It is very hard to assign the responsibility of the error caused at time step $t$ to the events that occurred at time step $t-k$.
This responsibility is of course in the form of gradients, and we studied this problem in the backward flow of gradients.
We saw a formal argument for this while discussing vanishing gradients.
Let us see an analogy for this.
We can think of the state as a fixed size memory.
Compare this to a fixed size whiteboard that you use to record information.
At each time step (periodic intervals) we keep writing something to the board.
Effectively, at each time step we morph the information recorded till that time point.
After many timesteps it would be impossible to see how the information at time step $t-k$ contributed to the state at timestep $t$.
Continuing our whiteboard analogy, suppose we are interested in deriving an expression on the whiteboard.
We follow this strategy at each time step:
Selectively write on the board
Selectively read the already written content
Selectively forget (erase) some content
Let us look at each of these in detail.
Selective write
a = 1, b = 3, c = 5, d = 11
Compute ac(bd + a) + ad. Say the "board" can have only 3 statements at a time.
1. ac
2. bd
3. bd + a
4. ac(bd + a)
5. ad
6. ac(bd + a) + ad
There may be many steps in the derivation but we may just skip a few.
In other words we select what to write.
Board: ac = 5, bd = 33
Selective read
While writing one step we typically read some of the previous steps we have already written and then decide what to write next.
For example at Step 3 (bd + a), information from Step 2 (bd) is important.
In other words we select what to read.
Board: ac = 5, bd = 33, bd + a = 34
Selective forget
Once the board is full, we need to delete some obsolete information.
But how do we decide what to delete? We will typically delete the least useful information.
In other words we select what to forget.
Board (full): ac = 5, bd = 33, bd + a = 34
Board (final): ac(bd + a) = 170, ad = 11, ad + ac(bd + a) = 181
There are various other scenarios where we can motivate the need for selective write, read and forget.
For example, you could think of our brain as something which can store only a finite number of facts.
At different time steps we selectively read, write and forget some of these facts.
Since the RNN also has a finite state size, we need to figure out a way to allow it to selectively read, write and forget.
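The whiteboard derivation can be traced in a few lines of Python. This is only an illustrative sketch: the `write` helper, the dictionary used as the board, and in particular the choice of which statement to erase at steps 4, 5 and 6 are assumptions made here (the slides only say that the least useful statement is erased). The values a = 1, b = 3, c = 5, d = 11 and the 3-statement limit come from the slides.

```python
# Values given on the slide.
a, b, c, d = 1, 3, 5, 11

CAPACITY = 3   # the board holds at most 3 statements at a time
board = {}     # statement -> value (hypothetical representation of the board)

def write(name, value, forget=None):
    """Selectively write one result; if the board is full, erase `forget` first."""
    if len(board) >= CAPACITY:
        # Selective forget: the caller names the entry judged least useful.
        del board[forget]
    board[name] = value

write("ac", a * c)                                              # step 1
write("bd", b * d)                                              # step 2
write("bd+a", board["bd"] + a)                                  # step 3 reads step 2
write("ac(bd+a)", board["ac"] * board["bd+a"], forget="bd")     # step 4 reads steps 1 and 3
write("ad", a * d, forget="ac")                                 # step 5
write("ac(bd+a)+ad", board["ac(bd+a)"] + board["ad"],
      forget="bd+a")                                            # step 6 reads steps 4 and 5

print(board)
```

With these (assumed) erasure choices the board never holds more than three statements, and the last three entries match the final board on the slide.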
Module 15.2: Long Short Term Memory (LSTM) and Gated Recurrent Units (GRUs)
Questions
Can we give a concrete example where RNNs also need to selectively read, write and forget?
How do we convert this intuition into mathematical equations?
We will see this over the next few slides.
Consider the task of predicting the sentiment (positive/negative) of a review.
Review: The first half of the movie was dry but the second half really picked up pace. The lead actor delivered an amazing performance.
[Figure: an RNN reading the review word by word ("The", "first", ..., "performance") and producing a +/− sentiment prediction at the end.]
The RNN reads the document from left to right and updates the state after every word.
By the time we reach the end of the document the information obtained from the first few words is completely lost.
Ideally we want to:
forget the information added by stop words (a, the, etc.)
selectively read the information added by previous sentiment bearing words (awesome, amazing, etc.)
selectively write new information from the current word to the state
[Figure: the same RNN reading the review, with the state vector $s_t$ drawn in blue and a +/− sentiment output.]
Recall that the blue colored vector ($s_t$) is called the state of the RNN.
It has a finite size ($s_t \in \mathbb{R}^n$) and is used to store all the information up to timestep $t$.
This state is analogous to the whiteboard: sooner or later it will get overloaded, and the information from the initial states will get morphed beyond recognition.
Wishlist: selective write, selective read and selective forget, to ensure that this finite sized state vector is used effectively.
[Figure: the vectors $s_{t-1}$ and $x_t$ feeding into $s_t$, with the three operations selective read, selective write and selective forget marked between them.]
Just to be clear, we have computed a state $s_{t-1}$ at timestep $t-1$ and now we want to overload it with new information ($x_t$) and compute a new state ($s_t$).
While doing so we want to make sure that we use selective write, selective read and selective forget so that only important information is retained in $s_t$.
We will now see how to implement these items from our wishlist.
Selective Write
[Figure: $s_{t-1}$ (through $W$) and $x_t$ (through $U$) are combined and passed through $\sigma$ to give $s_t$.]
Recall that in RNNs we use $s_{t-1}$ to compute $s_t$:
$s_t = \sigma(W s_{t-1} + U x_t)$ (ignoring bias)
But now, instead of passing $s_{t-1}$ as it is to $s_t$, we want to pass (write) only some portions of it to the next state.
In the strictest case our decisions could be binary (for example, retain the 1st and 3rd entries and delete the rest of the entries).
But a more sensible way of doing this would be to assign a value between 0 and 1 which determines what fraction of the current state to pass on to the next state.
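The plain RNN update recalled above, $s_t = \sigma(W s_{t-1} + U x_t)$, can be sketched in NumPy as follows. The dimensions, the random weights, the zero initial state and the helper names `sigmoid` and `rnn_step` are all assumptions chosen for illustration; $\sigma$ is taken to be the logistic function, as in the slides.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(s_prev, x_t, W, U):
    """One vanilla RNN update: s_t = sigma(W s_{t-1} + U x_t), bias ignored."""
    return sigmoid(W @ s_prev + U @ x_t)

rng = np.random.default_rng(0)
n, m = 4, 3                     # assumed state size and input size
W = rng.normal(size=(n, n))     # state-to-state weights (random, for illustration)
U = rng.normal(size=(n, m))     # input-to-state weights (random, for illustration)
s = np.zeros(n)                 # assumed initial state
xs = rng.normal(size=(5, m))    # five made-up input vectors

for x_t in xs:
    # The whole previous state is overwritten at every step; nothing is gated yet.
    s = rnn_step(s, x_t, W, U)

print(s)
```

Note that every entry of the old state takes part in every update, which is exactly the unselective "morphing" that the gates are meant to control.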
Selective Write
[Figure: selective write, $s_{t-1} \odot o_{t-1} = h_{t-1}$, shown with example vectors.]
We introduce a vector $o_{t-1}$ which decides what fraction of each element of $s_{t-1}$ should be passed to the next state.
Each element of $o_{t-1}$ gets multiplied with the corresponding element of $s_{t-1}$.
Each element of $o_{t-1}$ is restricted to be between 0 and 1.
But how do we compute $o_{t-1}$? How does the RNN know what fraction of the state to pass on?
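To make the elementwise gating concrete, here is a small NumPy sketch contrasting a binary mask with a fractional gate. The numbers are made up for illustration and are not taken from the figure; the gate is fixed by hand here, since how to compute it is a separate question.

```python
import numpy as np

# Made-up previous state (illustrative values only).
s_prev = np.array([-1.4, -0.4, 1.0, -2.0])

# Strictest case: a binary decision per entry (keep the 1st and 3rd entries).
binary_mask = np.array([1.0, 0.0, 1.0, 0.0])
h_binary = binary_mask * s_prev

# Softer case: each gate entry lies in [0, 1] and passes on a fraction of the state.
o_prev = np.array([0.2, 0.34, 0.9, 0.29])
h_fractional = o_prev * s_prev      # elementwise (Hadamard) product

print(h_binary)
print(h_fractional)
```

The binary mask is a special case of the fractional gate in which every entry is exactly 0 or 1.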
Selective Write
[Figure: selective write, $s_{t-1} \odot o_{t-1} = h_{t-1}$, shown with example vectors.]
Well, the RNN has to learn $o_{t-1}$ along with the other parameters ($W, U, V$).
We compute $o_{t-1}$ and $h_{t-1}$ as
$o_{t-1} = \sigma(W_o h_{t-2} + U_o x_{t-1} + b_o)$
$h_{t-1} = o_{t-1} \odot \sigma(s_{t-1})$
The parameters $W_o, U_o, b_o$ need to be learned along with the existing parameters $W, U, V$.
The sigmoid (logistic) function ensures that the values are between 0 and 1.
$o_t$ is called the output gate as it decides how much to pass (write) to the next time step.
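The two output-gate equations can be sketched in NumPy as below. This is a forward computation only, under stated assumptions: the sizes, the random parameter values, the random inputs and the function name `selective_write` are hypothetical, and no learning of $W_o, U_o, b_o$ is shown. Following the slide, $\sigma$ is applied to $s_{t-1}$ before gating.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied elementwise; outputs lie between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def selective_write(s_prev, h_prev2, x_prev, W_o, U_o, b_o):
    """Compute the output gate o_{t-1} and the gated state h_{t-1}.

    o_{t-1} = sigma(W_o h_{t-2} + U_o x_{t-1} + b_o)
    h_{t-1} = o_{t-1} * sigma(s_{t-1})      (elementwise product)
    """
    o_prev = sigmoid(W_o @ h_prev2 + U_o @ x_prev + b_o)
    h_prev = o_prev * sigmoid(s_prev)
    return o_prev, h_prev

rng = np.random.default_rng(0)
n, m = 4, 3                        # assumed state size and input size
W_o = rng.normal(size=(n, n))      # random stand-ins for learned parameters
U_o = rng.normal(size=(n, m))
b_o = np.zeros(n)

s_prev = rng.normal(size=n)        # s_{t-1}, made up
h_prev2 = rng.normal(size=n)       # h_{t-2}, made up
x_prev = rng.normal(size=m)        # x_{t-1}, made up

o_prev, h_prev = selective_write(s_prev, h_prev2, x_prev, W_o, U_o, b_o)
print(o_prev)
print(h_prev)
```

Because both factors come out of a sigmoid, each entry of `h_prev` is no larger than the corresponding gate entry in `o_prev`.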