Recurrent Neural Networks
CS 6956: Deep Learning for NLP
Overview
1. Modeling sequences
2. Recurrent neural networks: An abstraction
3. Usage patterns for RNNs
4. Bidirectional RNNs
5. A concrete example: The Elman RNN
6. The vanishing gradient problem
7. Gating and long short-term memory units
A simple RNN
1. How to generate the current state using the previous state and the current input?
   Next state: $s_t = f(s_{t-1} W_s + x_t W_x + b)$
2. How to generate the current output using the current state?
   The output is the state. That is, $y_t = s_t$.
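To make the abstraction concrete, here is a minimal NumPy sketch of the state update above, assuming the activation $f$ is tanh. The sizes, random initialization, and toy input sequence are illustrative assumptions, not part of the slides.

```python
import numpy as np

# A minimal sketch of the update above, assuming f = tanh. The sizes
# and random initialization are illustrative choices.
d_in, d_s = 4, 8                                # input and state sizes
rng = np.random.default_rng(0)
W_s = rng.normal(scale=0.1, size=(d_s, d_s))    # state-to-state weights
W_x = rng.normal(scale=0.1, size=(d_in, d_s))   # input-to-state weights
b = np.zeros(d_s)                               # bias

def rnn_step(s_prev, x_t):
    """One step: s_t = f(s_{t-1} W_s + x_t W_x + b)."""
    return np.tanh(s_prev @ W_s + x_t @ W_x + b)

# Unroll over a toy input sequence; the output at each step is the state.
s = np.zeros(d_s)                  # initial state
xs = rng.normal(size=(3, d_in))    # a length-3 input sequence
for x_t in xs:
    s = rnn_step(s, x_t)
    y_t = s                        # output = state
```

Note that the same three parameters ($W_s$, $W_x$, $b$) are reused at every step; the recurrence, not the parameter count, is what lets the network handle sequences of any length.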
How do we train a recurrent network?
We need to specify a problem first. Let's take an example.
- Inputs are sequences (say, of words): "I like cake"
- The outputs are labels associated with each word: Pronoun, Verb, Noun
- Losses for each word are added up: Loss = loss1 + loss2 + loss3
[Figure: the network unrolled over "I like cake", starting from an initial state, with a predicted label and a loss at each word.]
Gradients to the rescue
- We have a computation graph
- Use backpropagation to compute gradients of the loss with respect to the parameters ($W_s$, $W_x$, $b$)
  - Sometimes called Backpropagation Through Time (BPTT)
- Update the parameters using SGD or a variant (Adam, for example)
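Putting the last two slides together, here is a hedged sketch of this training loop in PyTorch. The tagging setup is hypothetical: the toy embeddings, the label ids, and the extra output matrix `W_out` (mapping a state to per-label scores) are assumptions added for illustration. The structure (per-word losses summed, one backward pass through the unrolled graph, an SGD update) follows the slides.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_s, n_labels = 4, 8, 3

# Parameters, following the slides' names; W_out is an extra assumption,
# since a tagger needs per-label scores rather than the raw state.
W_s = torch.nn.Parameter(0.1 * torch.randn(d_s, d_s))
W_x = torch.nn.Parameter(0.1 * torch.randn(d_in, d_s))
b = torch.nn.Parameter(torch.zeros(d_s))
W_out = torch.nn.Parameter(0.1 * torch.randn(d_s, n_labels))
opt = torch.optim.SGD([W_s, W_x, b, W_out], lr=0.1)

xs = torch.randn(3, d_in)       # stand-in embeddings for "I", "like", "cake"
gold = torch.tensor([0, 1, 2])  # hypothetical ids for Pronoun, Verb, Noun

for epoch in range(10):
    s = torch.zeros(d_s)        # initial state
    loss = torch.zeros(())
    for x_t, y_t in zip(xs, gold):
        s = torch.tanh(s @ W_s + x_t @ W_x + b)   # the RNN state update
        scores = s @ W_out                        # label scores at this word
        loss = loss + F.cross_entropy(scores.unsqueeze(0), y_t.unsqueeze(0))
    opt.zero_grad()
    loss.backward()  # backpropagation through time over the unrolled graph
    opt.step()
```

Summing the per-word losses before a single `backward()` call is exactly what makes this BPTT: gradients flow back through every step of the unrolled computation graph.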
Recall the simple RNN update: $s_t = f(s_{t-1} W_s + x_t W_x + b)$, with output $y_t = s_t$.
Does this work? Let's see a simple example
To avoid complicating the notation more than necessary, suppose
1. The inputs, states and outputs are all scalars
2. The loss at each step is a function $L$ of the state at that step

The first step of the computation is then:
First input: $x_1$
Transform: $z_1 = s_0 w_s + x_1 w_x + b$
State: $s_1 = f(z_1)$
Loss: $\ell_1 = L(s_1)$

Let's compute the derivative of the loss with respect to the parameter $w_x$. Following the chain rule:
$$\frac{\partial \ell_1}{\partial w_x} = \frac{\partial \ell_1}{\partial s_1} \cdot \frac{\partial s_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_x}$$
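As a sanity check, the sketch below computes this chain-rule product and compares it to a finite-difference estimate. The particular parameter values and the choice of loss $L(s) = s^2/2$ are arbitrary illustrations.

```python
import math

# A numeric check of the chain rule above for the scalar, one-step case.
# The values and the loss L(s) = s**2 / 2 are arbitrary illustrations.
s0, x1 = 0.5, 1.0            # initial state and first input
w_s, w_x, b = 0.3, 0.7, 0.1  # scalar parameters

z1 = s0 * w_s + x1 * w_x + b  # transform
s1 = math.tanh(z1)            # state, with f = tanh

# Chain rule: dl1/dw_x = dl1/ds1 * ds1/dz1 * dz1/dw_x
dl1_ds1 = s1                      # since L(s) = s**2 / 2
ds1_dz1 = 1 - math.tanh(z1) ** 2  # the tanh derivative
dz1_dwx = x1                      # z1 is linear in w_x
analytic = dl1_ds1 * ds1_dz1 * dz1_dwx

# Central finite-difference estimate for comparison
eps = 1e-6
def loss(wx):
    s = math.tanh(s0 * w_s + x1 * wx + b)
    return s * s / 2

numeric = (loss(w_x + eps) - loss(w_x - eps)) / (2 * eps)
print(analytic, numeric)  # the two should agree to several decimal places
```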
Let us examine the non-linearity in this system due to the activation function.
Suppose $f(z) = \tanh(z)$. Then $\frac{df}{dz} = 1 - \tanh^2(z)$, which is always between zero and one.
That is, $\frac{\partial s_1}{\partial z_1} = 1 - \tanh^2(z_1)$: a number between zero and one.
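One factor between zero and one looks harmless, but over a long unrolled sequence the gradient reaching an early step picks up one such factor (times $w_s$) per step. The sketch below shows the product shrinking geometrically; the particular $w_s$ and pre-activations are made up for illustration, and this previews the vanishing gradient problem named in the overview.

```python
import math

# Arbitrary illustration: w_s and the pre-activations z_t are made up.
w_s = 0.9
zs = [0.8] * 50  # pre-activations at 50 consecutive steps

# In the scalar RNN, the gradient flowing from step T back to step 1
# picks up a factor (1 - tanh(z_t)**2) * w_s at every step in between.
grad_factor = 1.0
for t, z in enumerate(zs, start=1):
    grad_factor *= (1 - math.tanh(z) ** 2) * w_s
    if t in (1, 10, 25, 50):
        print(f"after {t:2d} steps: {grad_factor:.3e}")
# The product decays geometrically toward zero, so gradients from
# distant steps vanish: the vanishing gradient problem.
```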