Long/Short-Term Memory
Mark Hasegawa-Johnson
University of Illinois, ECE 417: Multimedia Signal Processing, Fall 2020
All content CC-SA 4.0 unless otherwise specified.
Outline
1. Review: Recurrent Neural Networks
2. Vanishing/Exploding Gradient
3. Running Example: a Pocket Calculator
4. Regular RNN
5. Forget Gate
6. Long Short-Term Memory (LSTM)
7. Backprop for an LSTM
8. Conclusion
Recurrent Neural Net (RNN) = Nonlinear(IIR)
[Figure: a recurrent neural network unfolded in time. Image CC-SA-4.0 by Ixnay, https://commons.wikimedia.org/wiki/File:Recurrent_neural_network_unfold.svg]
Back-Propagation and Causal Graphs
[Figure: a causal graph with input x, hidden nodes h_0, h_1, ..., and output ŷ.]
\[
\frac{d\hat{y}}{dx} = \sum_{i=0}^{N-1}\frac{d\hat{y}}{dh_i}\frac{\partial h_i}{\partial x}
\]
For each h_i, we find the total derivative of ŷ w.r.t. h_i, multiplied by the partial derivative of h_i w.r.t. x.
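As a concrete illustration, here is a small numerical check of this chain-rule decomposition on a made-up two-node causal graph. This is not from the slides; the functions and the test point are invented, and Python/NumPy is assumed.

```python
import numpy as np

# A tiny causal graph: x feeds h0, and both x and h0 feed h1, which feeds yhat.
def forward(x):
    h0 = np.tanh(x)          # h0 = f0(x)
    h1 = x + 3.0 * h0        # h1 = f1(x, h0)
    yhat = h1 ** 2           # yhat = g(h1)
    return h0, h1, yhat

x = 0.7
h0, h1, yhat = forward(x)

# Total derivatives of yhat w.r.t. each hidden node (back-propagated):
dy_dh1 = 2.0 * h1
dy_dh0 = dy_dh1 * 3.0        # yhat depends on h0 only through h1

# Partial derivatives of each hidden node w.r.t. x (direct paths only):
dh0_dx = 1.0 - np.tanh(x) ** 2
dh1_dx = 1.0                 # the direct x term inside f1

# Slide formula: dy/dx = sum_i (total dy/dh_i) * (partial h_i / partial x)
dy_dx = dy_dh0 * dh0_dx + dy_dh1 * dh1_dx

# Check against a finite difference.
eps = 1e-6
numerical = (forward(x + eps)[2] - forward(x - eps)[2]) / (2 * eps)
print(dy_dx, numerical)      # the two numbers should agree to ~6 digits
```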
Back-Propagation Through Time
Back-propagation through time computes the error gradient at each time step based on the error gradients at future time steps. If the forward-prop equation is
\[
\hat{y}[n] = g(e[n]), \qquad e[n] = x[n] + \sum_{m=1}^{M-1} w[m]\,\hat{y}[n-m],
\]
then the BPTT equation is
\[
\delta[n] = \frac{dE}{de[n]} = \frac{\partial E}{\partial e[n]} + \sum_{m=1}^{M-1}\delta[n+m]\,w[m]\,\dot{g}(e[n]).
\]
Weight update, for an RNN, multiplies the back-prop times the forward-prop:
\[
\frac{dE}{dw[m]} = \sum_n \delta[n]\,\hat{y}[n-m].
\]
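These equations translate almost line-for-line into code. The following is a minimal sketch, not the lecture's own code: it assumes g = tanh, a squared-error loss E = ½ Σₙ(ŷ[n] − y[n])², and invented toy values for x, y, and the taps w.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 3                          # sequence length, number of taps
x, y = rng.normal(size=N), rng.normal(size=N)   # toy input and target
w = {1: 0.5, 2: -0.3}                 # feedback taps w[1], ..., w[M-1]

g = np.tanh
gdot = lambda e: 1.0 - np.tanh(e) ** 2

# Forward prop: e[n] = x[n] + sum_m w[m]*yhat[n-m],  yhat[n] = g(e[n])
e, yhat = np.zeros(N), np.zeros(N)
for n in range(N):
    e[n] = x[n] + sum(w[m] * yhat[n - m] for m in range(1, M) if n >= m)
    yhat[n] = g(e[n])

# BPTT: delta[n] = partial E / partial e[n]  +  sum_m delta[n+m]*w[m]*gdot(e[n])
delta = np.zeros(N)
for n in reversed(range(N)):
    dE_de = (yhat[n] - y[n]) * gdot(e[n])      # partial E / partial e[n] for squared error
    dE_de += gdot(e[n]) * sum(delta[n + m] * w[m]
                              for m in range(1, M) if n + m < N)
    delta[n] = dE_de

# Weight gradient: dE/dw[m] = sum_n delta[n]*yhat[n-m]
dE_dw = {m: sum(delta[n] * yhat[n - m] for n in range(m, N)) for m in range(1, M)}
print(dE_dw)
```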
Vanishing/Exploding Gradient
The “vanishing gradient” problem refers to the tendency of dŷ[n+m]/de[n] to disappear, exponentially, when m is large.
The “exploding gradient” problem refers to the tendency of dŷ[n+m]/de[n] to explode toward infinity, exponentially, when m is large.
If the largest feedback coefficient is |w[m]| > 1, then you get exploding gradient. If |w[m]| < 1, you get vanishing gradient.
Example: A Memorizer Network
Suppose that we have a very simple RNN:
\[
\hat{y}[n] = w\,x[n] + u\,\hat{y}[n-1]
\]
Suppose that x[n] is only nonzero at time 0:
\[
x[n] = \begin{cases} x_0 & n = 0 \\ 0 & n \ne 0 \end{cases}
\]
Suppose that, instead of measuring x[0] directly, we are only allowed to measure the output of the RNN m time-steps later. Our goal is to learn w and u so that ŷ[m] remembers x_0, thus:
\[
E = \frac{1}{2}\left(\hat{y}[m] - x_0\right)^2
\]
Example: A Memorizer Network
Now, how do we perform gradient update of the weights? If
\[
\hat{y}[n] = w\,x[n] + u\,\hat{y}[n-1]
\]
then
\[
\frac{dE}{dw} = \sum_n \frac{dE}{d\hat{y}[n]}\frac{\partial\hat{y}[n]}{\partial w}
= \sum_n \frac{dE}{d\hat{y}[n]}\,x[n] = \frac{dE}{d\hat{y}[0]}\,x_0
\]
But the error is defined as
\[
E = \frac{1}{2}\left(\hat{y}[m] - x_0\right)^2
\]
so
\[
\frac{dE}{d\hat{y}[0]} = u\,\frac{dE}{d\hat{y}[1]} = u^2\,\frac{dE}{d\hat{y}[2]} = \cdots = u^m\left(\hat{y}[m] - x_0\right)
\]
Example: Vanishing Gradient
So we find out that the gradient, w.r.t. the coefficient w, is either exponentially small or exponentially large, depending on whether |u| < 1 or |u| > 1:
\[
\frac{dE}{dw} = x_0\left(\hat{y}[m] - x_0\right)u^m
\]
In other words, if our application requires the neural net to wait m time steps before generating its output, then the gradient is exponentially smaller, and therefore training the neural net is exponentially harder.
[Figure: exponential decay. Image CC-SA-4.0, PeterQ, Wikipedia.]
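The closed-form gradient can be sanity-checked numerically. Below is a small sketch (not from the slides; the values of x₀, w, u, and m are invented) that compares the formula against a finite difference of the error.

```python
# Numerical check of the memorizer-network gradient dE/dw = x0*(yhat[m]-x0)*u**m.
x0, w, u, m = 2.0, 0.8, 0.9, 10

def E(w, u):
    yhat = 0.0
    for n in range(m + 1):
        x_n = x0 if n == 0 else 0.0
        yhat = w * x_n + u * yhat       # yhat[n] = w*x[n] + u*yhat[n-1]
    return 0.5 * (yhat - x0) ** 2       # error measured at n = m

yhat_m = w * x0 * u ** m                # closed form: yhat[m] = w*x0*u^m
closed_form = x0 * (yhat_m - x0) * u ** m

eps = 1e-6
finite_diff = (E(w + eps, u) - E(w - eps, u)) / (2 * eps)
print(closed_form, finite_diff)         # should agree; both shrink like u**m
```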
Notation
Today’s lecture will try to use notation similar to the Wikipedia page for LSTM.
x[t] = input at time t
y[t] = target/desired output
c[t] = excitation at time t, OR the LSTM cell
h[t] = activation at time t, OR the LSTM output
u = feedback coefficient
w = feedforward coefficient
b = bias
Running Example: a Pocket Calculator
The rest of this lecture will refer to a toy application called “pocket calculator.”
Pocket Calculator:
When x[t] > 0, add it to the current tally: c[t] = c[t−1] + x[t].
When x[t] = 0,
1. Print out the current tally, h[t] = c[t−1], and then
2. Reset the tally to zero, c[t] = 0.
Example Signals:
Input: x[t] = 1, 2, 1, 0, 1, 1, 1, 0
Target Output: y[t] = 0, 0, 0, 4, 0, 0, 0, 3
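Here is a direct implementation of these rules (plain Python, not part of the original slides), which reproduces the target output from the example input.

```python
# Pocket calculator: accumulate positive inputs, print and reset on a zero input.
def pocket_calculator(x):
    c, h = 0, []
    for x_t in x:
        if x_t > 0:
            c = c + x_t          # add to the running tally
            h.append(0)          # nothing is printed at this time step
        else:                    # x_t == 0
            h.append(c)          # print the tally: h[t] = c[t-1]
            c = 0                # reset the tally
    return h

x = [1, 2, 1, 0, 1, 1, 1, 0]
print(pocket_calculator(x))      # [0, 0, 0, 4, 0, 0, 0, 3]
```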
One-Node One-Tap Linear RNN
Suppose that we have a very simple RNN:
\[
\text{Excitation: } c[t] = x[t] + u\,h[t-1]
\]
\[
\text{Activation: } h[t] = \sigma_h(c[t])
\]
where σ_h() is some feedback nonlinearity. In this simple example, let’s just use σ_h(c[t]) = c[t], i.e., no nonlinearity.
GOAL: Find u so that h[t] ≈ y[t]. In order to make the problem easier, we will only score an “error” when y[t] ≠ 0:
\[
E = \frac{1}{2}\sum_{t:\,y[t]>0}\left(h[t] - y[t]\right)^2
\]
RNN with u = 1
Obviously, if we want to just add numbers, we should just set u = 1. Then the RNN is computing
Excitation: c[t] = x[t] + h[t−1]
Activation: h[t] = σ_h(c[t])
That works until the first zero-valued input. But then it just keeps on adding.
RNN with u = 0.5
Can we get decent results using u = 0.5?
Advantage: by the time we reach x[t] = 0, the sum has kind of leaked away from us (c[t] ≈ 0), so a hard reset is not necessary.
Disadvantage: by the time we reach x[t] = 0, the sum has kind of leaked away from us (h[t] ≈ 0).
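A short sketch (assuming NumPy, and re-using the example sequences) runs the forward recursion with both hand-picked values of u, showing the runaway tally at u = 1 and the leaky tally at u = 0.5.

```python
import numpy as np

x = np.array([1, 2, 1, 0, 1, 1, 1, 0], dtype=float)
y = np.array([0, 0, 0, 4, 0, 0, 0, 3], dtype=float)

def forward(u, x):
    h, h_prev = np.zeros(len(x)), 0.0
    for t in range(len(x)):
        c_t = x[t] + u * h_prev      # excitation
        h[t] = c_t                   # activation: identity nonlinearity
        h_prev = h[t]
    return h

for u in (1.0, 0.5):
    h = forward(u, x)
    err = 0.5 * np.sum((h[y > 0] - y[y > 0]) ** 2)
    print(f"u={u}: h={np.round(h, 2)}, E={err:.2f}")
# u=1.0: the tally is exact at t=3 (h=4) but never resets, so h=7 at t=7.
# u=0.5: the tally leaks away, so h is too small at both scored time steps.
```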
Gradient Descent
\[
c[t] = x[t] + u\,h[t-1], \qquad h[t] = \sigma_h(c[t])
\]
Let’s try initializing u = 0.5, and then performing gradient descent to improve it. Gradient descent has five steps:
1. Forward Propagation: c[t] = x[t] + u h[t−1], h[t] = c[t].
2. Synchronous Backprop: ε[t] = ∂E/∂c[t].
3. Back-Prop Through Time: δ[t] = dE/dc[t].
4. Weight Gradient: dE/du = Σ_t δ[t] h[t−1].
5. Gradient Descent: u ← u − η dE/du.
Gradient Descent
\[
\text{Excitation: } c[t] = x[t] + u\,h[t-1]
\]
\[
\text{Activation: } h[t] = \sigma_h(c[t])
\]
\[
\text{Error: } E = \frac{1}{2}\sum_{t:\,y[t]>0}\left(h[t] - y[t]\right)^2
\]
So the back-prop stages are:
\[
\text{Synchronous Backprop: } \epsilon[t] = \frac{\partial E}{\partial c[t]} = \begin{cases} h[t] - y[t] & y[t] > 0 \\ 0 & \text{otherwise} \end{cases}
\]
\[
\text{BPTT: } \delta[t] = \frac{dE}{dc[t]} = \epsilon[t] + u\,\delta[t+1]
\]
\[
\text{Weight Gradient: } \frac{dE}{du} = \sum_t \delta[t]\,h[t-1]
\]
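Putting the five steps together gives a complete training loop for the scalar RNN. This is a minimal sketch, not the lecture's code; the learning rate η = 0.002 and the iteration count are invented values chosen to keep the toy problem stable.

```python
import numpy as np

x = np.array([1, 2, 1, 0, 1, 1, 1, 0], dtype=float)
y = np.array([0, 0, 0, 4, 0, 0, 0, 3], dtype=float)
T = len(x)
u, eta = 0.5, 0.002

for it in range(300):
    # 1. Forward propagation: c[t] = x[t] + u*h[t-1], h[t] = c[t]
    h = np.zeros(T)
    for t in range(T):
        h[t] = x[t] + u * (h[t - 1] if t > 0 else 0.0)

    # 2. Synchronous backprop: eps[t] = partial E / partial c[t]
    eps = np.where(y > 0, h - y, 0.0)

    # 3. Back-prop through time: delta[t] = eps[t] + u*delta[t+1]
    delta = np.zeros(T)
    for t in reversed(range(T)):
        delta[t] = eps[t] + (u * delta[t + 1] if t + 1 < T else 0.0)

    # 4. Weight gradient: dE/du = sum_t delta[t]*h[t-1]
    dE_du = sum(delta[t] * h[t - 1] for t in range(1, T))

    # 5. Gradient descent
    u -= eta * dE_du

# On this toy example u settles roughly midway between the two hand-tuned
# values (around 0.84): better than u=0.5, but still unable to both
# accumulate exactly and reset.
print(round(u, 3))
```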
Backprop Stages, u = 0.5
[Figure: ε[t], δ[t], and the weight gradient dE/du, evaluated on the running example with u = 0.5 using the equations above.]
Vanishing Gradient and Exploding Gradient
Notice that, with |u| < 1, δ[t] tends to vanish exponentially fast as we go backward in time. This is called the vanishing gradient problem. It is a big problem for RNNs with long time-dependency, and for deep neural nets with many layers.
If we set |u| > 1, we get an even worse problem, sometimes called the exploding gradient problem.
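A one-line numerical illustration of the scalar BPTT recursion above: an error that occurs k time steps in the future reaches time t scaled by u^k, so for a gap of k = 20 steps:

```python
# delta[t] = eps[t] + u*delta[t+1]: after k backward steps the scale is u**k.
k = 20
for u in (0.5, 0.9, 1.1, 2.0):
    print(f"u = {u}: u**{k} = {u ** k:.3g}")
# 0.5 -> ~9.5e-07 (vanishes); 1.1 -> ~6.7; 2.0 -> ~1.0e+06 (explodes)
```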