

  1. Recurrent Neural Networks CS 6956: Deep Learning for NLP

  2. Overview 1. Modeling sequences 2. Recurrent neural networks: An abstraction 3. Usage patterns for RNNs 4. Bidirectional RNNs 5. A concrete example: The Elman RNN 6. The vanishing gradient problem 7. Long short-term memory units

  3. Overview 1. Modeling sequences 2. Recurrent neural networks: An abstraction 3. Usage patterns for RNNs 4. Bidirectional RNNs 5. A concrete example: The Elman RNN 6. The vanishing gradient problem 7. Gating and Long short-term memory units

  4. A simple RNN 1. How to generate the current state using the previous state and the current input? Next state: $\mathbf{s}_t = h(\mathbf{s}_{t-1} \mathbf{W}_s + \mathbf{x}_t \mathbf{W}_x + \mathbf{b})$ 2. How to generate the current output using the current state? The output is the state. That is, $\mathbf{y}_t = \mathbf{s}_t$
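To make the two update rules concrete, here is a minimal NumPy sketch of a single step. The row-vector layout and the parameter names $\mathbf{W}_s$, $\mathbf{W}_x$, $\mathbf{b}$ follow the reconstruction above, and the choice of $\tanh$ for $h$ anticipates the Elman RNN discussed later; none of these specifics are fixed by the slides themselves.

```python
import numpy as np

def rnn_step(s_prev, x_t, W_s, W_x, b):
    """One step of the simple RNN on this slide.

    s_prev: previous state s_{t-1}          (shape: d_s)
    x_t:    current input x_t               (shape: d_x)
    W_s:    state-to-state weights          (d_s x d_s)
    W_x:    input-to-state weights          (d_x x d_s)
    b:      bias                            (d_s)
    Returns the new state s_t and the output y_t, which is just s_t.
    """
    s_t = np.tanh(s_prev @ W_s + x_t @ W_x + b)   # s_t = h(s_{t-1} W_s + x_t W_x + b)
    return s_t, s_t
```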

  5. How do we train a recurrent network? We need to specify a problem first. Let's take an example. – Inputs are sequences (say, of words) [Figure: the RNN unrolled over the input sentence "I like cake", starting from an initial state]

  6. How do we train a recurrent network? We need to specify a problem first. Let's take an example. – Inputs are sequences (say, of words) – The outputs are labels associated with each word [Figure: the same unrolled RNN, now with a label for each word: I/Pronoun, like/Verb, cake/Noun]

  7. How do we train a recurrent network? We need to specify a problem first. Let's take an example. – Inputs are sequences (say, of words) – The outputs are labels associated with each word – Losses for each word are added up [Figure: the same unrolled RNN, with a per-word loss (loss1, loss2, loss3) attached to each labeled output; the total loss is their sum]
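A sketch of this setup, unrolling rnn_step from the earlier snippet over the example sentence and summing one loss per word. Everything beyond the slide's recipe (the one-hot inputs, the random parameter sizes, the extra output weights W_o that turn a state into tag scores, and the softmax cross-entropy used as $g$) is illustrative scaffolding, not something the slides specify.

```python
import numpy as np

rng = np.random.default_rng(0)

words  = ["I", "like", "cake"]
labels = ["Pronoun", "Verb", "Noun"]
vocab  = {w: i for i, w in enumerate(words)}
tagset = {t: i for i, t in enumerate(["Pronoun", "Verb", "Noun"])}

d_x, d_s = len(vocab), 4
W_s = rng.normal(scale=0.1, size=(d_s, d_s))           # state-to-state weights
W_x = rng.normal(scale=0.1, size=(d_x, d_s))           # input-to-state weights
b   = np.zeros(d_s)
W_o = rng.normal(scale=0.1, size=(d_s, len(tagset)))   # assumed: state -> tag scores

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def cross_entropy(scores, gold):
    """Toy stand-in for the per-word loss g: softmax + negative log-likelihood."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return -np.log(p[gold])

s = np.zeros(d_s)                          # initial state
total_loss = 0.0
for word, label in zip(words, labels):
    x = one_hot(vocab[word], d_x)          # word -> input vector
    s, y = rnn_step(s, x, W_s, W_x, b)     # state update from the earlier sketch
    total_loss += cross_entropy(y @ W_o, tagset[label])   # loss1 + loss2 + loss3
print(total_loss)
```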

  8. Gradients to the rescue • We have a computation graph • Use backpropagation to compute gradients of the loss with respect to the parameters ($\mathbf{W}_s$, $\mathbf{W}_x$, $\mathbf{b}$) – Sometimes called Backpropagation Through Time (BPTT) • Update the parameters using SGD or a variant – Adam, for example
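In practice a framework builds the unrolled computation graph and runs BPTT for us. A hedged PyTorch sketch of one training step, with toy sizes and random data standing in for a real tagging dataset (none of these values come from the slides):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)   # tanh RNN, as above
clf = nn.Linear(16, 3)                                          # state -> 3 tag scores
loss_fn = nn.CrossEntropyLoss(reduction="sum")                  # per-word losses are added up
opt = torch.optim.Adam(list(rnn.parameters()) + list(clf.parameters()), lr=1e-3)

x    = torch.randn(1, 3, 8)          # one sentence of 3 word vectors (toy data)
gold = torch.tensor([[0, 1, 2]])     # one tag index per word (toy data)

states, _ = rnn(x)                                            # unroll over the sequence
loss = loss_fn(clf(states).reshape(-1, 3), gold.reshape(-1))  # summed per-word losses
loss.backward()                                               # backpropagation through time
opt.step()                                                    # Adam update of the parameters
opt.zero_grad()
```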

  9. A simple RNN 1. How to generate the current state using the previous state and the current input? Next state: $\mathbf{s}_t = h(\mathbf{s}_{t-1} \mathbf{W}_s + \mathbf{x}_t \mathbf{W}_x + \mathbf{b})$ 2. How to generate the current output using the current state? The output is the state. That is, $\mathbf{y}_t = \mathbf{s}_t$

  10. Does this work? Let's see a simple example. To avoid complicating the notation more than necessary, suppose 1. The inputs, states and outputs are all scalars 2. The loss at each step is a function $g$ of the state at that step

  11. Does this work? Let's see a simple example. To avoid complicating the notation more than necessary, suppose 1. The inputs, states and outputs are all scalars 2. The loss at each step is a function $g$ of the state at that step First input: $x_1$ Transform: $z_1 = s_0 W_s + x_1 W_x + b$ State: $s_1 = h(z_1)$ Loss: $\ell_1 = g(s_1)$
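Tracing the forward computation once with made-up scalar values (and, to have something concrete, $h = \tanh$ as on the later slides and a toy squared-error loss standing in for $g$):

```python
import math

s0, x1 = 0.0, 1.0             # initial state and first input (made-up values)
W_s, W_x, b = 0.5, 0.8, 0.1   # scalar parameters (made-up values)

z1 = s0 * W_s + x1 * W_x + b  # transform: z_1 = s_0 W_s + x_1 W_x + b = 0.9
s1 = math.tanh(z1)            # state:     s_1 = h(z_1)              ~ 0.716
l1 = (s1 - 1.0) ** 2          # loss:      l_1 = g(s_1), toy g       ~ 0.081
print(z1, s1, l1)
```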

  12. Does this work? Let's see a simple example. To avoid complicating the notation more than necessary, suppose 1. The inputs, states and outputs are all scalars 2. The loss at each step is a function $g$ of the state at that step First input: $x_1$ Transform: $z_1 = s_0 W_s + x_1 W_x + b$ State: $s_1 = h(z_1)$ Loss: $\ell_1 = g(s_1)$ Let's compute the derivative of the loss with respect to the parameter $W_x$

  13. Does this work? Let's see a simple example. To avoid complicating the notation more than necessary, suppose 1. The inputs, states and outputs are all scalars 2. The loss at each step is a function $g$ of the state at that step First input: $x_1$ Transform: $z_1 = s_0 W_s + x_1 W_x + b$ State: $s_1 = h(z_1)$ Loss: $\ell_1 = g(s_1)$ Let's compute the derivative of the loss with respect to the parameter $W_x$: $\frac{\partial \ell_1}{\partial W_x} = \frac{\partial \ell_1}{\partial s_1} \cdot \frac{\partial s_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_x}$ Follows the chain rule
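The factorization is easy to sanity-check numerically. Continuing the made-up scalar values from the earlier sketch (with the same toy squared-error stand-in for $g$), the product of the three chain-rule factors should match a finite-difference estimate of $\partial \ell_1 / \partial W_x$:

```python
import math

s0, x1, W_s, W_x, b = 0.0, 1.0, 0.5, 0.8, 0.1   # same made-up values as before

def forward(W_x):
    z1 = s0 * W_s + x1 * W_x + b
    s1 = math.tanh(z1)
    return z1, s1, (s1 - 1.0) ** 2               # toy loss g(s) = (s - 1)^2

z1, s1, l1 = forward(W_x)
dl_ds = 2 * (s1 - 1.0)              # dl_1/ds_1 for the toy g
ds_dz = 1 - math.tanh(z1) ** 2      # ds_1/dz_1, the tanh factor examined next
dz_dW = x1                          # dz_1/dW_x
analytic = dl_ds * ds_dz * dz_dW

eps = 1e-6
numeric = (forward(W_x + eps)[2] - forward(W_x - eps)[2]) / (2 * eps)
print(analytic, numeric)            # the two estimates agree closely
```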

  17. Does this work? Let's see a simple example. To avoid complicating the notation more than necessary, suppose 1. The inputs, states and outputs are all scalars 2. The loss at each step is a function $g$ of the state at that step First input: $x_1$ Transform: $z_1 = s_0 W_s + x_1 W_x + b$ State: $s_1 = h(z_1)$ Loss: $\ell_1 = g(s_1)$ Let's compute the derivative of the loss with respect to the parameter $W_x$: $\frac{\partial \ell_1}{\partial W_x} = \frac{\partial \ell_1}{\partial s_1} \cdot \frac{\partial s_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_x}$ Let us examine the non-linearity in this system due to the activation function

  18. Does this work? Let's see a simple example. To avoid complicating the notation more than necessary, suppose 1. The inputs, states and outputs are all scalars 2. The loss at each step is a function $g$ of the state at that step First input: $x_1$ Transform: $z_1 = s_0 W_s + x_1 W_x + b$ State: $s_1 = h(z_1)$ Loss: $\ell_1 = g(s_1)$ Let's compute the derivative of the loss with respect to the parameter $W_x$: $\frac{\partial \ell_1}{\partial W_x} = \frac{\partial \ell_1}{\partial s_1} \cdot \frac{\partial s_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_x}$ Suppose $h(z) = \tanh(z)$

  19. Does this work? Let's see a simple example. To avoid complicating the notation more than necessary, suppose 1. The inputs, states and outputs are all scalars 2. The loss at each step is a function $g$ of the state at that step First input: $x_1$ Transform: $z_1 = s_0 W_s + x_1 W_x + b$ State: $s_1 = h(z_1)$ Loss: $\ell_1 = g(s_1)$ Let's compute the derivative of the loss with respect to the parameter $W_x$: $\frac{\partial \ell_1}{\partial W_x} = \frac{\partial \ell_1}{\partial s_1} \cdot \frac{\partial s_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_x}$ Suppose $h(z) = \tanh(z)$ Then $\frac{dh}{dz} = 1 - \tanh^2(z)$

  20. Does this work? Let's see a simple example. To avoid complicating the notation more than necessary, suppose 1. The inputs, states and outputs are all scalars 2. The loss at each step is a function $g$ of the state at that step First input: $x_1$ Transform: $z_1 = s_0 W_s + x_1 W_x + b$ State: $s_1 = h(z_1)$ Loss: $\ell_1 = g(s_1)$ Let's compute the derivative of the loss with respect to the parameter $W_x$: $\frac{\partial \ell_1}{\partial W_x} = \frac{\partial \ell_1}{\partial s_1} \cdot \frac{\partial s_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_x}$ Suppose $h(z) = \tanh(z)$ Then $\frac{dh}{dz} = 1 - \tanh^2(z)$ Always between zero and one

  21. Does this work? Let's see a simple example. To avoid complicating the notation more than necessary, suppose 1. The inputs, states and outputs are all scalars 2. The loss at each step is a function $g$ of the state at that step First input: $x_1$ Transform: $z_1 = s_0 W_s + x_1 W_x + b$ State: $s_1 = h(z_1)$ Loss: $\ell_1 = g(s_1)$ Let's compute the derivative of the loss with respect to the parameter $W_x$: $\frac{\partial \ell_1}{\partial W_x} = \frac{\partial \ell_1}{\partial s_1} \cdot \frac{\partial s_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_x}$ Suppose $h(z) = \tanh(z)$ Then $\frac{dh}{dz} = 1 - \tanh^2(z)$ That is, $\frac{\partial s_1}{\partial z_1} = 1 - \tanh^2(z_1)$

  22. Does this work? Let's see a simple example. To avoid complicating the notation more than necessary, suppose 1. The inputs, states and outputs are all scalars 2. The loss at each step is a function $g$ of the state at that step First input: $x_1$ Transform: $z_1 = s_0 W_s + x_1 W_x + b$ State: $s_1 = h(z_1)$ Loss: $\ell_1 = g(s_1)$ Let's compute the derivative of the loss with respect to the parameter $W_x$: $\frac{\partial \ell_1}{\partial W_x} = \frac{\partial \ell_1}{\partial s_1} \cdot \frac{\partial s_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_x}$ Suppose $h(z) = \tanh(z)$ Then $\frac{dh}{dz} = 1 - \tanh^2(z)$ That is, $\frac{\partial s_1}{\partial z_1} = 1 - \tanh^2(z_1)$ A number between zero and one.
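A quick numerical look at that factor: $1 - \tanh^2(z)$ equals 1 only at $z = 0$ and decays rapidly toward 0 as $|z|$ grows, so every such factor in the chain lies in $(0, 1]$; this is what the vanishing gradient problem listed in the overview builds on.

```python
import math

# The derivative of tanh: 1 - tanh(z)^2 lies in (0, 1], peaking at z = 0.
for z in [0.0, 0.5, 1.0, 2.0, 4.0]:
    print(z, 1 - math.tanh(z) ** 2)
# 0.0 -> 1.000, 0.5 -> ~0.787, 1.0 -> ~0.420, 2.0 -> ~0.071, 4.0 -> ~0.001
```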
