Recurrent Neural Nets
ECE 417: Multimedia Signal Processing
Mark Hasegawa-Johnson, University of Illinois
November 19, 2019
1. Linear Time Invariant Filtering: FIR & IIR
2. Nonlinear Time Invariant Filtering: CNN & RNN
3. Back-Propagation Training for CNN and RNN
4. Back-Prop Through Time
5. Vanishing/Exploding Gradient
6. Gated Recurrent Units
7. Long Short-Term Memory (LSTM)
8. Conclusion
Section 1: Linear Time Invariant Filtering: FIR & IIR
Basics of DSP: Filtering

$$y[n] = \sum_{m=-\infty}^{\infty} h[m]\, x[n-m]$$

$$Y(z) = H(z)\, X(z)$$
Finite Impulse Response (FIR)

$$y[n] = \sum_{m=0}^{N-1} h[m]\, x[n-m]$$

The coefficients, $h[m]$, are chosen in order to optimally position the $N-1$ zeros of the transfer function, $r_k$, defined according to:

$$H(z) = \sum_{m=0}^{N-1} h[m] z^{-m} = h[0] \prod_{k=1}^{N-1} \left(1 - r_k z^{-1}\right)$$
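As a numerical illustration, here is a minimal numpy sketch; the 3-tap filter and the random input below are made-up examples, not taken from the lecture. It computes the FIR output and recovers the zeros $r_k$ as the roots of the tap polynomial:

```python
import numpy as np

# Hypothetical 3-tap FIR filter (N = 3) and an arbitrary input signal.
h = np.array([1.0, -1.2, 0.5])     # h[0], h[1], h[2]
x = np.random.randn(32)

# y[n] = sum_{m=0}^{N-1} h[m] x[n-m]  (causal convolution, truncated to len(x))
y = np.convolve(x, h)[:len(x)]

# H(z) = h[0] + h[1] z^{-1} + h[2] z^{-2} = z^{-2} (h[0] z^2 + h[1] z + h[2]),
# so the N-1 = 2 zeros r_k are the roots of the polynomial with coefficients h.
zeros = np.roots(h)
print(zeros)
```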
Infinite Impulse Response (IIR)

$$y[n] = \sum_{m=0}^{N-1} b_m x[n-m] + \sum_{m=1}^{M-1} a_m y[n-m]$$

The coefficients, $b_m$ and $a_m$, are chosen in order to optimally position the $N-1$ zeros and $M-1$ poles of the transfer function, $r_k$ and $p_k$, defined according to:

$$H(z) = \frac{\sum_{m=0}^{N-1} b_m z^{-m}}{1 - \sum_{m=1}^{M-1} a_m z^{-m}} = b_0 \frac{\prod_{k=1}^{N-1}\left(1 - r_k z^{-1}\right)}{\prod_{k=1}^{M-1}\left(1 - p_k z^{-1}\right)}$$
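A similar sketch for a first-order IIR filter, assuming scipy is available. Note that scipy.signal.lfilter uses the convention $a[0]\,y[n] = \sum_m b[m]\,x[n-m] - \sum_{m\ge 1} a[m]\,y[n-m]$, so the slide's feedback coefficients enter the denominator with a sign flip:

```python
import numpy as np
from scipy.signal import lfilter

# Hypothetical first-order IIR filter: y[n] = x[n] + 0.9 y[n-1].
b = [1.0]                       # feedforward coefficients b_m
a1 = 0.9                        # feedback coefficient a_1
x = np.random.randn(32)

# lfilter's denominator is [1, -a_1] because of its sign convention.
y = lfilter(b, [1.0, -a1], x)

# The same recursion written out explicitly, in the slide's notation.
y2 = np.zeros_like(x)
for n in range(len(x)):
    y2[n] = x[n] + (a1 * y2[n - 1] if n >= 1 else 0.0)
assert np.allclose(y, y2)
```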
Section 2: Nonlinear Time Invariant Filtering: CNN & RNN
Convolutional Neural Net = Nonlinear(FIR)

$$y[n] = g\left(\sum_{m=0}^{N-1} h[m]\, x[n-m]\right)$$

The coefficients, $h[m]$, are chosen to minimize some kind of error. For example, suppose that the goal is to make $y[n]$ resemble a target signal $t[n]$; then we might use

$$E = \frac{1}{2} \sum_{n=0}^{N} \left(y[n] - t[n]\right)^2$$

and choose

$$h[m] \leftarrow h[m] - \eta \frac{dE}{dh[m]}$$
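A minimal sketch of this nonlinear-FIR forward pass and its error, assuming $g = \tanh$ and made-up input and target signals:

```python
import numpy as np

h = np.array([0.5, 0.3, -0.2])            # hypothetical filter taps h[m], N = 3
x = np.random.randn(50)                   # arbitrary input signal
t = np.sin(0.2 * np.arange(50))           # arbitrary target signal

e = np.convolve(x, h)[:len(x)]            # excitation e[n] = sum_m h[m] x[n-m]
y = np.tanh(e)                            # activation y[n] = g(e[n])
E = 0.5 * np.sum((y - t) ** 2)            # squared error
print(E)
```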
Recurrent Neural Net (RNN) = Nonlinear(IIR)

$$y[n] = g\left(x[n] + \sum_{m=1}^{M-1} a_m y[n-m]\right)$$

The coefficients, $a_m$, are chosen to minimize the error. For example, suppose that the goal is to make $y[n]$ resemble a target signal $t[n]$; then we might use

$$E = \frac{1}{2} \sum_{n=0}^{N} \left(y[n] - t[n]\right)^2$$

and choose

$$a_m \leftarrow a_m - \eta \frac{dE}{da_m}$$
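A minimal sketch of the corresponding RNN forward pass, assuming a first-order recursion ($M = 2$) with $g = \tanh$ and made-up signals:

```python
import numpy as np

# First-order instance: y[n] = tanh(x[n] + a1 * y[n-1]).
def rnn_forward(x, a1):
    e = np.zeros_like(x)                  # excitations e[n]
    y = np.zeros_like(x)                  # activations y[n]
    for n in range(len(x)):
        e[n] = x[n] + (a1 * y[n - 1] if n >= 1 else 0.0)
        y[n] = np.tanh(e[n])
    return e, y

x = np.random.randn(20)
t = np.sin(0.3 * np.arange(20))           # arbitrary target signal
e, y = rnn_forward(x, a1=0.5)
E = 0.5 * np.sum((y - t) ** 2)
print(E)
```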
Section 3: Back-Propagation Training for CNN and RNN
Review: Excitation and Activation

The activation of a hidden node is the output of the nonlinearity (for this reason, the nonlinearity is sometimes called the activation function). For example, in a fully-connected network with outputs $z_l$, weights $\vec{v}$, bias $v_0$, nonlinearity $g()$, and hidden node activations $\vec{y}$, the activation of the $l^{\text{th}}$ output node is

$$z_l = g\left(v_{l0} + \sum_{k=1}^{p} v_{lk} y_k\right)$$

The excitation of a hidden node is the input of the nonlinearity. For example, the excitation of the node above is

$$e_l = v_{l0} + \sum_{k=1}^{p} v_{lk} y_k$$
Backprop = Derivative w.r.t. Excitation

The excitation of a hidden node is the input of the nonlinearity. For example, the excitation of the node above is

$$e_l = v_{l0} + \sum_{k=1}^{p} v_{lk} y_k$$

The gradient of the error w.r.t. the weight is

$$\frac{dE}{dv_{lk}} = \epsilon_l y_k$$

where $\epsilon_l$ is the derivative of the error w.r.t. the $l^{\text{th}}$ excitation:

$$\epsilon_l = \frac{dE}{de_l}$$
Backprop for Fully-Connected Network

Suppose we have a fully-connected network, with inputs $\vec{x}$, weight matrices $U$ and $V$, nonlinearities $g()$ and $h()$, and outputs $z_l$:

$$e_k = u_{k0} + \sum_j u_{kj} x_j$$
$$y_k = g(e_k)$$
$$e_l = v_{l0} + \sum_k v_{lk} y_k$$
$$z_l = h(e_l)$$

Then the back-prop gradients are the derivatives of $E$ with respect to the excitations at each node:

$$\epsilon_l = \frac{dE}{de_l}, \qquad \delta_k = \frac{dE}{de_k}$$
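To make these definitions concrete, here is a hedged sketch of the forward and backward pass for this two-layer network, assuming $g = \tanh$, $h$ = identity, and squared error $E = \frac{1}{2}\|z - t\|^2$; the relation $\delta_k = \dot{g}(e_k)\sum_l v_{lk}\,\epsilon_l$ used in the code follows from the chain rule:

```python
import numpy as np

def forward_backward(x, t, U, u0, V, v0):
    # Forward pass (excitations and activations).
    e_hid = u0 + U @ x                  # e_k = u_k0 + sum_j u_kj x_j
    y = np.tanh(e_hid)                  # y_k = g(e_k)
    e_out = v0 + V @ y                  # e_l = v_l0 + sum_k v_lk y_k
    z = e_out                           # z_l = h(e_l), with h = identity here

    # Backward pass (derivatives w.r.t. the excitations).
    eps = z - t                         # eps_l = dE/de_l  (since h' = 1)
    delta = (V.T @ eps) * (1 - y**2)    # delta_k = g'(e_k) * sum_l v_lk eps_l

    # Weight gradients, as on the previous slide: dE/dv_lk = eps_l y_k, etc.
    dE_dV = np.outer(eps, y)
    dE_dU = np.outer(delta, x)
    return dE_dU, dE_dV
```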
Back-Prop in a CNN

Suppose we have a convolutional neural net, defined by

$$e[n] = \sum_{m=0}^{N-1} h[m]\, x[n-m]$$
$$y[n] = g(e[n])$$

then

$$\frac{dE}{dh[m]} = \sum_n \delta[n]\, x[n-m]$$

where $\delta[n]$ is the back-prop gradient, defined by

$$\delta[n] = \frac{dE}{de[n]}$$
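A sketch of this gradient computation, again assuming $g = \tanh$ and squared error, so that $\delta[n] = (y[n] - t[n])\,\dot{g}(e[n])$ (in a CNN there is no feedback, so the partial and total derivatives w.r.t. $e[n]$ coincide):

```python
import numpy as np

def cnn_grad(x, t, h):
    e = np.convolve(x, h)[:len(x)]          # e[n] = sum_m h[m] x[n-m]
    y = np.tanh(e)                          # y[n] = g(e[n])
    delta = (y - t) * (1 - y**2)            # delta[n] = dE/de[n] for tanh + squared error

    grad = np.zeros(len(h))                 # grad[m] = dE/dh[m] = sum_n delta[n] x[n-m]
    for m in range(len(h)):
        x_shift = np.concatenate([np.zeros(m), x[:len(x) - m]])   # x[n-m], zero-padded
        grad[m] = np.sum(delta * x_shift)
    return grad
```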
Back-Prop in an RNN

Suppose we have a recurrent neural net, defined by

$$e[n] = x[n] + \sum_{m=1}^{M-1} a_m y[n-m]$$
$$y[n] = g(e[n])$$

then

$$\frac{dE}{da_m} = \sum_n \delta[n]\, y[n-m]$$

where $y[n-m]$ is calculated by forward-propagation, and then $\delta[n]$ is calculated by back-propagation as

$$\delta[n] = \frac{dE}{de[n]}$$
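Given $\delta[n]$ (computing it requires back-prop through time, the subject of the next section) and the forward-propagated $y[n]$, accumulating the parameter gradient is a single sum per coefficient. A minimal sketch:

```python
import numpy as np

def rnn_param_grad(delta, y, M):
    # grad[m] = dE/da_m = sum_n delta[n] y[n-m], for m = 1, ..., M-1.
    grad = np.zeros(M)
    for m in range(1, M):
        y_shift = np.concatenate([np.zeros(m), y[:len(y) - m]])   # y[n-m], zero-padded
        grad[m] = np.sum(delta * y_shift)
    return grad
```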
Section 4: Back-Prop Through Time
Partial vs. Full Derivatives

For example, suppose we want $y[n]$ to be as close as possible to some target signal $t[n]$:

$$E = \frac{1}{2} \sum_n \left(y[n] - t[n]\right)^2$$

Notice that $E$ depends on $y[n]$ in many different ways:

$$\frac{dE}{dy[n]} = \frac{\partial E}{\partial y[n]} + \frac{dE}{dy[n+1]} \frac{\partial y[n+1]}{\partial y[n]} + \frac{dE}{dy[n+2]} \frac{\partial y[n+2]}{\partial y[n]} + \cdots$$
Partial vs. Full Derivatives

In general,

$$\frac{dE}{dy[n]} = \frac{\partial E}{\partial y[n]} + \sum_{m=1}^{\infty} \frac{dE}{dy[n+m]} \frac{\partial y[n+m]}{\partial y[n]}$$

where $\frac{dE}{dy[n]}$ is the total derivative, which includes all of the different ways in which $E$ depends on $y[n]$, and $\frac{\partial y[n+m]}{\partial y[n]}$ is the partial derivative, i.e., the change in $y[n+m]$ per unit change in $y[n]$ if all of the other variables (all other values of $y[n+k]$) are held constant.
Partial vs. Full Derivatives

So for example, if

$$E = \frac{1}{2} \sum_n \left(y[n] - t[n]\right)^2$$

then the partial derivative of $E$ w.r.t. $y[n]$ is

$$\frac{\partial E}{\partial y[n]} = y[n] - t[n]$$

and the total derivative of $E$ w.r.t. $y[n]$ is

$$\frac{dE}{dy[n]} = \left(y[n] - t[n]\right) + \sum_{m=1}^{\infty} \frac{dE}{dy[n+m]} \frac{\partial y[n+m]}{\partial y[n]}$$
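The distinction can be checked numerically. The sketch below uses a made-up first-order RNN with $g = \tanh$, and compares the partial derivative $y[n] - t[n]$ against a finite-difference estimate of the total derivative, in which a perturbation of $y[n]$ is allowed to propagate forward through the recursion:

```python
import numpy as np

def forward(x, a1, y_init=None, n_start=0):
    # y[n] = tanh(x[n] + a1 * y[n-1]), optionally restarting from time n_start.
    y = np.zeros(len(x)) if y_init is None else y_init.copy()
    for n in range(n_start, len(x)):
        y[n] = np.tanh(x[n] + (a1 * y[n - 1] if n >= 1 else 0.0))
    return y

a1, eps, n0 = 0.8, 1e-6, 3
x = np.random.randn(10)
t = np.zeros(10)
y = forward(x, a1)
E = 0.5 * np.sum((y - t) ** 2)

# Partial derivative: perturb y[n0] while holding all other y[n] fixed.
partial = y[n0] - t[n0]

# Total derivative: perturb y[n0], then let the change propagate to y[n0+1], ...
y_pert = y.copy()
y_pert[n0] += eps
y_pert = forward(x, a1, y_init=y_pert, n_start=n0 + 1)
total = (0.5 * np.sum((y_pert - t) ** 2) - E) / eps

print(partial, total)    # these differ, because E also depends on y[n0] indirectly
```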
Partial vs. Full Derivatives

So for example, if

$$y[n] = g\left(x[n] + \sum_{m=1}^{M-1} a_m y[n-m]\right)$$

then the partial derivative of $y[n+k]$ w.r.t. $y[n]$ is

$$\frac{\partial y[n+k]}{\partial y[n]} = a_k\, \dot{g}\left(x[n+k] + \sum_{m=1}^{M-1} a_m y[n+k-m]\right)$$

where $\dot{g}(x) = \frac{dg}{dx}$ is the derivative of the nonlinearity. The total derivative of $y[n+k]$ w.r.t. $y[n]$ is

$$\frac{dy[n+k]}{dy[n]} = \frac{\partial y[n+k]}{\partial y[n]} + \sum_{j=1}^{k-1} \frac{dy[n+k]}{dy[n+j]} \frac{\partial y[n+j]}{\partial y[n]}$$
Synchronous Backprop vs. BPTT

The basic idea of back-prop-through-time is divide-and-conquer.

1. Synchronous Backprop: First, calculate the partial derivative of $E$ w.r.t. the excitation $e[n]$ at time $n$, assuming that all other time steps are held constant:
$$\epsilon[n] = \frac{\partial E}{\partial e[n]}$$
2. Back-prop through time: Second, iterate backward through time to calculate the total derivative:
$$\delta[n] = \frac{dE}{de[n]}$$
Synchronous Backprop in an RNN

Suppose we have a recurrent neural net, defined by

$$e[n] = x[n] + \sum_{m=1}^{M-1} a_m y[n-m]$$
$$y[n] = g(e[n])$$
$$E = \frac{1}{2} \sum_n \left(y[n] - t[n]\right)^2$$

then

$$\epsilon[n] = \frac{\partial E}{\partial e[n]} = \left(y[n] - t[n]\right) \dot{g}(e[n])$$

where $\dot{g}(x) = \frac{dg}{dx}$ is the derivative of the nonlinearity.
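Putting the two steps together for the first-order case ($M = 2$): iterating backward in time, the chain rule applied to $e[n+1] = x[n+1] + a_1 y[n]$ gives $\delta[n] = \epsilon[n] + a_1\,\dot{g}(e[n])\,\delta[n+1]$ (this recursion is not stated on the slides shown here, but follows directly from the model's definition), after which $dE/da_1 = \sum_n \delta[n]\, y[n-1]$ as on the earlier RNN back-prop slide. A hedged sketch with $g = \tanh$:

```python
import numpy as np

def bptt_grad(x, t, a1):
    # Forward pass: e[n] = x[n] + a1 * y[n-1], y[n] = tanh(e[n]).
    e = np.zeros_like(x)
    y = np.zeros_like(x)
    for n in range(len(x)):
        e[n] = x[n] + (a1 * y[n - 1] if n >= 1 else 0.0)
        y[n] = np.tanh(e[n])

    gdot = 1 - y**2                      # g'(e[n]) for g = tanh
    eps = (y - t) * gdot                 # synchronous term: eps[n] = dE_partial/de[n]

    # Backward pass: delta[n] = eps[n] + a1 * g'(e[n]) * delta[n+1].
    delta = np.zeros_like(x)
    for n in reversed(range(len(x))):
        delta[n] = eps[n] + (a1 * gdot[n] * delta[n + 1] if n + 1 < len(x) else 0.0)

    # dE/da1 = sum_n delta[n] * y[n-1]
    return np.sum(delta[1:] * y[:-1])
```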