Selective write

Setup: a = 1, b = 3, c = 5, d = 11. Compute ac(bd + a) + ad.
Say the "board" can hold only 3 statements at a time.

Derivation steps:
1. ac
2. bd
3. bd + a
4. ac(bd + a)
5. ad
6. ac(bd + a) + ad

There may be many steps in the derivation, but we may just skip a few.
In other words, we select what to write.

Board: ac = 5, bd = 33
Selective read

While writing one step we typically read some of the previous steps we have already written and then decide what to write next.
For example, at Step 3 the information from Step 2 is important.
In other words, we select what to read.

Board: ac = 5, bd = 33, bd + a = 34
Selective forget

Once the board is full, we need to delete some obsolete information.
But how do we decide what to delete? We will typically delete the least useful information.
In other words, we select what to forget.

Board (successive snapshots):
{ac = 5, bd = 33, bd + a = 34} → forget bd → {ac = 5, bd + a = 34, ac(bd + a) = 170} → forget bd + a → {ac = 5, ac(bd + a) = 170, ad = 11} → forget ac → {ac(bd + a) = 170, ad = 11, ad + ac(bd + a) = 181}
There are various other scenarios where we can motivate the need for selective write, read and forget.
For example, you could think of our brain as something which can store only a finite number of facts. At different time steps we selectively read, write and forget some of these facts.
Since the RNN also has a finite state size, we need to figure out a way to allow it to selectively read, write and forget.
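Before turning to RNNs, the following toy Python sketch (not from the lecture; the 3-slot board, the `write` helper and the hand-chosen eviction order are all illustrative assumptions) mimics the whiteboard example above: we selectively write each step, selectively read earlier entries, and forget an entry we no longer need once the board is full.

```python
# Toy illustration of selective write / read / forget on a 3-slot "board".
a, b, c, d = 1, 3, 5, 11
CAPACITY = 3
board = {}                       # the "board": at most 3 named statements

def write(name, value, forget=None):
    """Selectively write `value`; if the board is full, forget `forget` first."""
    if len(board) == CAPACITY and forget is not None:
        del board[forget]        # selective forget
    board[name] = value          # selective write

write("ac", a * c)                                                   # step 1
write("bd", b * d)                                                   # step 2
write("bd+a", board["bd"] + a)                                       # step 3: reads bd
write("ac(bd+a)", board["ac"] * board["bd+a"], forget="bd")          # step 4
write("ad", a * d, forget="bd+a")                                    # step 5
write("answer", board["ad"] + board["ac(bd+a)"], forget="ac")        # step 6

print(board["answer"])           # 181 = ac(bd + a) + ad
```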
Module 15.2: Long Short Term Memory (LSTM) and Gated Recurrent Units (GRUs)
Questions

Can we give a concrete example where RNNs also need to selectively read, write and forget?
How do we convert this intuition into mathematical equations?
We will see this over the next few slides.
Consider the task of predicting the sentiment (+/−, positive/negative) of a review.
The RNN reads the document from left to right and after every word updates the state.
By the time we reach the end of the document, the information obtained from the first few words is completely lost.

Ideally we want to
- forget the information added by stop words (a, the, etc.)
- selectively read the information added by previous sentiment-bearing words (awesome, amazing, etc.)
- selectively write new information from the current word to the state

Review: The first half of the movie was dry but the second half really picked up pace. The lead actor delivered an amazing performance.

[Figure: an RNN unrolled over the words of the review, from "The first ..." to "... performance", producing a +/− prediction at the end.]
Questions (revisited)

We have now seen a concrete example where RNNs also need to selectively read, write and forget.
How do we convert this intuition into mathematical equations?
Recall that the blue colored vector (s_t) in the figure is called the state of the RNN.
It has a finite size (s_t ∈ R^n) and is used to store all the information up to timestep t.
This state is analogous to the whiteboard: sooner or later it will get overloaded and the information from the initial states will get morphed beyond recognition.
Wishlist: selective write, selective read and selective forget, to ensure that this finite-sized state vector is used effectively.

[Figure: the same unrolled RNN over the review, with the state vector s_t highlighted.]
Just to be clear: we have computed a state s_{t-1} at timestep t−1 and now we want to overload it with new information (x_t) and compute a new state (s_t).
While doing so we want to make sure that we use selective write, selective read and selective forget so that only the important information is retained in s_t.
We will now see how to implement these items from our wishlist.

[Figure: the vectors s_{t-1}, s_t and the input x_t, with the transition labelled selective read / selective write / selective forget.]
Selective Write

Recall that in RNNs we use s_{t-1} to compute s_t:

    s_t = σ(W s_{t-1} + U x_t)    (ignoring the bias)

But now, instead of passing s_{t-1} as it is to s_t, we want to pass (write) only some portions of it to the next state.
In the strictest case our decisions could be binary (for example, retain the 1st and 3rd entries and delete the rest of the entries).
But a more sensible way of doing this is to assign each entry a value between 0 and 1 which determines what fraction of the current state to pass on to the next state.
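As a reference point, here is a minimal NumPy sketch (not from the slides) of the vanilla RNN state update above; the sizes n, m and the randomly initialised W, U are illustrative assumptions, standing in for learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, m = 4, 3                              # illustrative state and input sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(n, n))              # recurrent weights (would be learned)
U = rng.normal(size=(n, m))              # input weights (would be learned)

s_prev = rng.normal(size=n)              # s_{t-1}
x_t = rng.normal(size=m)                 # current input x_t

s_t = sigmoid(W @ s_prev + U @ x_t)      # vanilla RNN update (bias ignored, as on the slide)
```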
We introduce a vector o_{t-1} which decides what fraction of each element of s_{t-1} should be passed to the next state.
Each element of o_{t-1} gets multiplied with the corresponding element of s_{t-1}.
Each element of o_{t-1} is restricted to be between 0 and 1.
But how do we compute o_{t-1}? How does the RNN know what fraction of the state to pass on?

[Figure: each element of o_{t-1} multiplied element-wise with the corresponding element of s_{t-1} to produce h_{t-1} (selective write).]
Well, the RNN has to learn o_{t-1} along with its other parameters (W, U, V).
We compute o_{t-1} and h_{t-1} as

    o_{t-1} = σ(W_o h_{t-2} + U_o x_{t-1} + b_o)
    h_{t-1} = o_{t-1} ⊙ σ(s_{t-1})

The parameters W_o, U_o, b_o need to be learned along with the existing parameters W, U, V.
The sigmoid (logistic) function ensures that the values are between 0 and 1.
o_t is called the output gate as it decides how much to pass (write) to the next time step.
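Continuing the NumPy sketch above (same sigmoid, rng, n, m and s_prev), this is an illustrative rendering of the output gate and the selective write step; W_o, U_o, b_o and the h_{t-2}, x_{t-1} values are placeholders for what would be learned or observed quantities.

```python
# Output gate and selective write (continuation of the earlier sketch).
W_o = rng.normal(size=(n, n))
U_o = rng.normal(size=(n, m))
b_o = np.zeros(n)

h_prev2 = rng.normal(size=n)             # h_{t-2}
x_prev = rng.normal(size=m)              # x_{t-1}

o_prev = sigmoid(W_o @ h_prev2 + U_o @ x_prev + b_o)   # o_{t-1}, each entry in (0, 1)
h_prev = o_prev * sigmoid(s_prev)                      # h_{t-1} = o_{t-1} ⊙ σ(s_{t-1})
```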
Selective Read

We will now use h_{t-1} to compute the new state at the next time step.
We will also use x_t, which is the new input at time step t:

    s̃_t = σ(W h_{t-1} + U x_t + b)

Note that W, U and b are similar to the parameters that we used in the RNN (for simplicity the bias b is not shown in the figure).
s̃_t thus captures all the information from the previous state (h_{t-1}) and the current input x_t.
However, we may not want to use all of this new information and instead only selectively read from it before constructing the new cell state s_t.
To do this we introduce another gate, called the input gate:

    i_t = σ(W_i h_{t-1} + U_i x_t + b_i)

and use i_t ⊙ s̃_t as the selectively read state information.
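Continuing the same sketch, here is an illustrative version of the temporary state, the input gate and the selective read; W and U are the matrices from the vanilla RNN block, while b, W_i, U_i, b_i are assumed placeholders for learned parameters.

```python
# Temporary state, input gate and selective read (continuation of the sketch above).
b = np.zeros(n)
W_i = rng.normal(size=(n, n))
U_i = rng.normal(size=(n, m))
b_i = np.zeros(n)

s_tilde = sigmoid(W @ h_prev + U @ x_t + b)      # s̃_t = σ(W h_{t-1} + U x_t + b)
i_t = sigmoid(W_i @ h_prev + U_i @ x_t + b_i)    # input gate, each entry in (0, 1)
read = i_t * s_tilde                             # selectively read state information
```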
So far we have the following:

    Previous state:             s_{t-1}
    Output gate:                o_{t-1} = σ(W_o h_{t-2} + U_o x_{t-1} + b_o)
    Selective write:            h_{t-1} = o_{t-1} ⊙ σ(s_{t-1})
    Current (temporary) state:  s̃_t = σ(W h_{t-1} + U x_t + b)
    Input gate:                 i_t = σ(W_i h_{t-1} + U_i x_t + b_i)
    Selective read:             i_t ⊙ s̃_t
Selective Forget

How do we combine s_{t-1} and s̃_t to get the new state?
Here is one simple (but effective) way of doing this:

    s_t = s_{t-1} + i_t ⊙ s̃_t

But we may not want to use the whole of s_{t-1}; we may want to forget some parts of it.
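Putting the pieces together, here is a self-contained NumPy sketch (not from the slides) of one such update step, using the slide's indexing and the simple additive combination above (no forget gate yet, which the lecture introduces next); the sizes and the randomly initialised parameters are illustrative stand-ins for learned weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, m = 4, 3                                        # illustrative state and input sizes
rng = np.random.default_rng(0)

W,   U,   b   = rng.normal(size=(n, n)), rng.normal(size=(n, m)), np.zeros(n)
W_o, U_o, b_o = rng.normal(size=(n, n)), rng.normal(size=(n, m)), np.zeros(n)
W_i, U_i, b_i = rng.normal(size=(n, n)), rng.normal(size=(n, m)), np.zeros(n)

def step(s_prev, h_prev2, x_prev, x_t):
    """Given s_{t-1}, h_{t-2}, x_{t-1} and x_t, return s_t and h_{t-1}."""
    o_prev = sigmoid(W_o @ h_prev2 + U_o @ x_prev + b_o)   # output gate o_{t-1}
    h_prev = o_prev * sigmoid(s_prev)                      # selective write: h_{t-1}
    s_tilde = sigmoid(W @ h_prev + U @ x_t + b)            # temporary state s̃_t
    i_t = sigmoid(W_i @ h_prev + U_i @ x_t + b_i)          # input gate
    s_t = s_prev + i_t * s_tilde                           # selective read + simple combine
    return s_t, h_prev

# Run a few time steps on random inputs.
s, h = rng.normal(size=n), rng.normal(size=n)              # s_{t-1}, h_{t-2}
xs = rng.normal(size=(5, m))
for x_prev, x_t in zip(xs[:-1], xs[1:]):
    s, h = step(s, h, x_prev, x_t)
```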