Linear recursions: Vector version
• Consider the vector linear recursion (note change of notation)
  – h_t = W h_{t-1} + C x(t)
  – For an isolated input at t = 0: h_t = W^t C x_0
• Length of response (|h|) to a single input at 0
• We can write W = U Λ U^{-1}
  – W v_i = λ_i v_i
  – For any vector h we can write
    • h = a_1 v_1 + a_2 v_2 + ⋯ + a_n v_n
    • W h = a_1 λ_1 v_1 + a_2 λ_2 v_2 + ⋯ + a_n λ_n v_n
    • W^t h = a_1 λ_1^t v_1 + a_2 λ_2^t v_2 + ⋯ + a_n λ_n^t v_n
  – lim_{t→∞} W^t h = a_m λ_m^t v_m, where m = argmax_j λ_j
• If |λ_max| > 1 the response will blow up; otherwise it will contract and shrink to 0 rapidly
  – What happens at middling values of t depends on the other eigenvalues
• For any input, for large t the length of the hidden vector will expand or contract according to the t-th power of the largest eigenvalue of the hidden-layer weight matrix
  – Unless the input has no component along the eigenvector corresponding to the largest eigenvalue; in that case it will grow according to the second largest eigenvalue, and so on..
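The eigen-decomposition argument above is easy to check numerically. Below is a minimal NumPy sketch (the 4×4 matrix, its eigenvalues, and the step count are illustrative assumptions, not values from the slides): after enough steps, the norm of h_t grows or shrinks by a factor of roughly λ_max per step.

```python
# Minimal numerical check of the eigenvalue argument (illustrative values).
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((4, 4))                      # random (invertible) eigenvector basis
for lam_max in (0.9, 1.1):
    Lambda = np.diag([lam_max, 0.5, 0.3, 0.1])
    W = V @ Lambda @ np.linalg.inv(V)                # W = U Lambda U^-1
    h = np.ones(4)                                   # response to a single input at t = 0
    norms = []
    for t in range(60):
        h = W @ h                                    # h_t = W h_{t-1}, no further input
        norms.append(np.linalg.norm(h))
    print(lam_max, norms[-1] / norms[-2])            # ratio of successive norms -> lam_max
```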
Linear recursions
• Vector linear recursion
  – h_t = W h_{t-1} + C x(t)
  – For an isolated input at t = 0: h_t = W^t C x_0
• Response to a single input [1 1 1 1] at 0
[Figure: |h_t| over time for λ_max = 0.9, 1.1, 1.1, 1, 1]
Linear recursions
• Vector linear recursion
  – h_t = W h_{t-1} + C x(t)
  – For an isolated input at t = 0: h_t = W^t C x_0
• Response to a single input [1 1 1 1] at 0
[Figure: |h_t| over time for λ_max = 0.9, 1.1, 1.1 (λ_2nd = 0.5), 1, 1 (λ_2nd = 0.1), and for complex eigenvalues]
Lesson..
• In linear systems, long-term behavior depends entirely on the eigenvalues of the hidden-layer weight matrix
  – If the largest eigenvalue is greater than 1, the system will "blow up"
  – If it is less than 1, the response will "vanish" very quickly
  – Complex eigenvalues cause oscillatory response
    • Which we may or may not want
    • Force the matrix to have real eigenvalues for smooth behavior
      – Symmetric weight matrix
How about non-linearities
h_t = f(w h_{t-1} + c x_t)
• The behavior of scalar non-linearities
• Left: Sigmoid, Middle: Tanh, Right: Relu
  – Sigmoid: Saturates in a limited number of steps, regardless of w
  – Tanh: Sensitive to w, but eventually saturates
    • "Prefers" weights close to 1.0
  – Relu: Sensitive to w, can blow up
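The same comparison can be simulated directly. The sketch below (weights, start value, and step count are arbitrary choices for illustration) runs the scalar recursion for each of the three activations:

```python
# Scalar recursion h_t = f(w * h_{t-1}) for sigmoid, tanh, and relu.
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def relu(z):    return np.maximum(0.0, z)

def run(f, w, h0=1.0, steps=20):
    h = h0
    for _ in range(steps):
        h = f(w * h)                  # single input at t = 0, then pure recursion
    return h

for w in (0.5, 1.0, 2.0):
    print(f"w={w}: sigmoid={run(sigmoid, w):.3f}  "
          f"tanh={run(np.tanh, w):.3f}  relu={run(relu, w):.3g}")
    # sigmoid saturates regardless of w; tanh saturates at a w-dependent value;
    # relu decays for w < 1 and blows up (roughly as w^t) for w > 1
```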
How about non-linearities
h_t = f(w h_{t-1} + c x_t)
• With a negative start (equivalent to –ve wt)
• Left: Sigmoid, Middle: Tanh, Right: Relu
  – Sigmoid: Saturates in a limited number of steps, regardless of w
  – Tanh: Sensitive to w, but eventually saturates
  – Relu: For negative starts, has no response
Vector Process
h_t = f(W h_{t-1} + C x_t)
• Assuming a uniform unit vector initialization [1, 1, 1, …]/√N
  – Behavior similar to scalar recursion
  – Interestingly, RELU is more prone to blowing up (why?)
• Eigenvalues less than 1.0 retain the most "memory"
[Figure: trajectories for sigmoid, tanh, and relu activations]
Vector Process
h_t = f(W h_{t-1} + C x_t)
• Assuming a uniform unit vector initialization [-1, -1, -1, …]/√N
  – Behavior similar to scalar recursion
  – Interestingly, RELU is more prone to blowing up (why?)
[Figure: trajectories for sigmoid, tanh, and relu activations]
Stability Analysis
• Formal stability analysis considers convergence of "Lyapunov" functions
  – Alternately, Routh's criterion and/or pole-zero analysis
  – Positive definite functions evaluated at h
  – Conclusions are similar: only the tanh activation gives us any reasonable behavior
    • And still has very short "memory"
• Lessons:
  – Bipolar activations (e.g. tanh) have the best behavior
  – Still sensitive to eigenvalues of W
  – Best-case memory is short
  – Exponential memory behavior
    • "Forgets" in exponential manner
How about deeper recursion
• Consider simple, scalar, linear recursion
  – Adding more "taps" adds more "modes" to memory in somewhat non-obvious ways
  – h_t = 0.5 h_{t-1} + 0.25 h_{t-5} + x(t)
  – h_t = 0.5 h_{t-1} + 0.25 h_{t-5} + 0.1 h_{t-8} + x(t)
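A quick way to see those extra modes is to compute the impulse response of each recursion directly. This is a small sketch using the two equations above (the 40-step horizon is an arbitrary choice):

```python
# Impulse response of the multi-tap recursions h_t = sum_d c_d * h_{t-d} + x(t).
def impulse_response(taps, steps=40):
    """taps: dict mapping delay d -> coefficient c_d, e.g. {1: 0.5, 5: 0.25}."""
    h = [0.0] * steps
    for t in range(steps):
        x_t = 1.0 if t == 0 else 0.0                                   # single input at t = 0
        h[t] = x_t + sum(c * h[t - d] for d, c in taps.items() if t - d >= 0)
    return h

print(impulse_response({1: 0.5, 5: 0.25})[:12])
print(impulse_response({1: 0.5, 5: 0.25, 8: 0.1})[:12])
```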
Stability Analysis
• Similar analysis of vector functions with non-linear activations is relatively straightforward
  – Linear systems: Routh's criterion
    • And pole-zero analysis (involves tensors)
    • On board?
  – Non-linear systems: Lyapunov functions
• Conclusions do not change
RNNs..
• Excellent models for time-series analysis tasks
  – Time-series prediction
  – Time-series classification
  – Sequence prediction..
  – They can even simplify problems that are difficult for MLPs
• But the memory isn't all that great..
  – Also..
The vanishing gradient problem
• A particular problem with training deep networks..
  – The gradient of the error with respect to weights is unstable..
Some useful preliminary math: The problem with training deep networks
[Figure: deep MLP with weight matrices W_0, W_1, W_2, …]
• A multilayer perceptron is a nested function
  Y = f_N( W_{N-1} f_{N-1}( W_{N-2} f_{N-2}( … W_0 X ) ) )
• W_k is the weight matrix at the k-th layer
• The error for X can be written as
  Div(X) = D( f_N( W_{N-1} f_{N-1}( W_{N-2} f_{N-2}( … W_0 X ) ) ) )
Training deep networks
• Vector derivative chain rule: for any f(Wg(X)):
  df/dX = df/d(Wg(X)) · d(Wg(X))/dg(X) · dg(X)/dX      (poor notation)
  ∇_X f = ∇_{Wg} f · W · ∇_X g
• Where
  – ∇_X g is the Jacobian matrix of g(X) w.r.t. X
    • Using the notation ∇_X g instead of J_g(X) for consistency
Training deep networks
• For Div(X) = D( f_N( W_{N-1} f_{N-1}( W_{N-2} f_{N-2}( … W_0 X ) ) ) )
• We get:
  ∇_{f_k} Div = ∇D · ∇f_N · W_{N-1} · ∇f_{N-1} · W_{N-2} ⋯ ∇f_{k+1} · W_k
• Where
  – ∇_{f_k} Div is the gradient of the error Div(X) w.r.t. the output of the k-th layer of the network
    • Needed to compute the gradient of the error w.r.t. W_{k-1}
  – ∇f_k is the Jacobian of f_k() w.r.t. its current input
  – All blue terms are matrices
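To see how this chain product behaves, the hedged sketch below pushes a gradient vector backwards through a stack of tanh Jacobians. The layer width, the depth, and the use of orthogonal weight matrices (which preserve norm exactly, so any shrinkage is due to the Jacobians alone) are assumptions made purely for illustration:

```python
# Backward chain product: gradient @ (activation Jacobian) @ (weight matrix), repeated.
import numpy as np

rng = np.random.default_rng(1)
width, depth = 64, 30
g = rng.standard_normal(width)                                  # stand-in for the output gradient
print("initial gradient norm:", np.linalg.norm(g))

for _ in range(depth):
    W, _ = np.linalg.qr(rng.standard_normal((width, width)))    # orthogonal: norm-preserving
    z = rng.standard_normal(width)                               # pre-activations at this layer
    J = np.diag(1.0 - np.tanh(z) ** 2)                           # tanh Jacobian: diagonal, entries <= 1
    g = g @ J @ W                                                # one backward step

print("gradient norm after", depth, "layers:", np.linalg.norm(g))   # shrinks by orders of magnitude
```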
The Jacobian of the hidden layers
∇f_t(z_t) = diag( f_t′(z_1), f_t′(z_2), …, f_t′(z_N) ),  where h_i(t) = f_t(z_i(t))
• ∇f_t() is the derivative of the output of the (layer of) hidden recurrent neurons with respect to their input
  – A matrix whose diagonal entries are the derivatives of the activation of the recurrent hidden layer, and whose off-diagonal entries are 0
The Jacobian
∇f_t(z_t) = diag( f_t′(z_1), f_t′(z_2), …, f_t′(z_N) ),  where h_i(t) = f_t(z_i(t))
• The derivative (or subgradient) of the activation function is always bounded
  – The diagonals of the Jacobian are bounded
• There is a limit on how much multiplying a vector by the Jacobian will scale it
The derivative of the hidden state activation
∇f_t(z_t) = diag( f_t′(z_1), f_t′(z_2), …, f_t′(z_N) )
• Most common activation functions, such as sigmoid, tanh() and RELU, have derivatives that are never greater than 1
• The most common activation for the hidden units in an RNN is the tanh()
  – The derivative of tanh() is never greater than 1, and strictly less than 1 everywhere except at 0
• Multiplication by the Jacobian is always a shrinking operation
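For reference, the standard derivative bounds behind this claim (textbook identities, not derived on the slide):

```latex
\sigma'(z) = \sigma(z)\,\bigl(1-\sigma(z)\bigr) \le \tfrac{1}{4},
\qquad
\tanh'(z) = 1 - \tanh^2(z) \le 1,
\qquad
\mathrm{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0. \end{cases}
```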
Training deep networks
∇_{f_k} Div = ∇D · ∇f_N · W_{N-1} · ∇f_{N-1} · W_{N-2} ⋯ ∇f_{k+1} · W_k
• As we go back in layers, the Jacobians of the activations constantly shrink the derivative
  – After a few instants the derivative of the divergence at any time is totally "forgotten"
What about the weights
∇_{f_k} Div = ∇D · ∇f_N · W_{N-1} · ∇f_{N-1} · W_{N-2} ⋯ ∇f_{k+1} · W_k
• In a single-layer RNN, the weight matrices are identical
• The chain product for ∇_{f_k} Div will
  – Expand ∇D along directions in which the singular values of the weight matrices are greater than 1
  – Shrink ∇D in directions where the singular values are less than 1
  – Exploding or vanishing gradients
Exploding/Vanishing gradients
∇_{f_k} Div = ∇D · ∇f_N · W_{N-1} · ∇f_{N-1} · W_{N-2} ⋯ ∇f_{k+1} · W_k
• Every blue term is a matrix
• ∇D is proportional to the actual error
  – Particularly for L2 and KL divergence
• The chain product for ∇_{f_k} Div will
  – Expand ∇D in directions where each stage has singular values greater than 1
  – Shrink ∇D in directions where each stage has singular values less than 1
Gradient problems in deep networks
∇_{f_k} Div = ∇D · ∇f_N · W_{N-1} · ∇f_{N-1} · W_{N-2} ⋯ ∇f_{k+1} · W_k
• The gradients in the lower/earlier layers can explode or vanish
  – Resulting in insignificant or unstable gradient descent updates
  – Problem gets worse as network depth increases
Vanishing gradient examples..
[Figures on successive slides: per-layer gradient magnitudes from input layer to output layer for ELU, RELU, sigmoid, and tanh activations (batch gradients), and for ELU on individual instances]
• 19-layer MNIST model
  – Different activations: Exponential linear units, RELU, sigmoid, tanh
  – Each layer is 1024 neurons wide
  – Gradients shown at initialization
    • Will actually decrease with additional training
• Figure shows log ‖∇_{W_neuron} E‖ where W_neuron is the vector of incoming weights to each neuron
  – I.e. the gradient of the loss w.r.t. the entire set of weights to each neuron
Vanishing gradients
• ELU activations maintain gradients longest
• But in all cases gradients effectively vanish after about 10 layers!
  – Your results may vary
• Both batch gradients and gradients for individual instances disappear
  – In reality a tiny number may actually blow up.
Recurrent nets are very deep nets
[Figure: RNN unrolled in time from input X(0) and initial state h(-1) up to output Y(T)]
∇_{f_k} Div = ∇D · ∇f_N · W_{N-1} · ∇f_{N-1} · W_{N-2} ⋯ ∇f_{k+1} · W_k
• The relation between X(0) and Y(T) is one of a very deep network
  – Gradients from errors at t = T will vanish by the time they're propagated to t = 0
Recall: Vanishing stuff..
[Figure: RNN unrolled over t = 0 … T, with inputs X(0) … X(T), outputs Y(0) … Y(T), and initial state h_{-1}]
• Stuff gets forgotten in the forward pass too
The long-term dependency problem
PATTERN1 [……………………..] PATTERN 2
"Jane had a quick lunch in the bistro. Then she.."
• Any other pattern of any length can happen between pattern 1 and pattern 2
  – RNN will "forget" pattern 1 if intermediate stuff is too long
  – "Jane" → the next pronoun referring to her will be "she"
• Must know to "remember" for extended periods of time and "recall" when necessary
  – Can be performed with a multi-tap recursion, but how many taps?
  – Need an alternate way to "remember" stuff
And now we enter the domain of..
Exploding/Vanishing gradients
∇_{f_k} Div = ∇D · ∇f_N · W_{N-1} · ∇f_{N-1} · W_{N-2} ⋯ ∇f_{k+1} · W_k
• Can we replace this with something that doesn't fade or blow up?
• ∇_{f_k} Div = ∇D · C · σ_N · C · σ_{N-1} · C ⋯ W_k
• Can we have a network that just "remembers" arbitrarily long, to be recalled on demand?
Enter – the constant error carousel
[Figure: the carried value h(t) passes through a chain of × (gating) nodes to h(t+1), h(t+2), h(t+3), h(t+4), each scaled by a gating term σ(t+1) … σ(t+4), over time steps t+1 … t+4]
• History is carried through uncompressed
  – No weights, no nonlinearities
  – Only scaling is through the σ "gating" term that captures other triggers
    – E.g. "Have I seen Pattern2"?
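The carousel idea fits in a couple of lines of code. In this toy sketch (the gate values are made up for illustration), the carried value is only ever scaled by a gate; it is never squashed by an activation or multiplied by a weight matrix, so it survives as long as the gate stays open:

```python
# Toy constant-error carousel: the carried history is only scaled by gates.
def carousel(h0, gates):
    h = h0
    for g in gates:
        h = g * h                      # the only operation applied to the carried history
    return h

print(carousel(1.0, [0.99] * 100))                          # ~0.37: decays gently while the gate is open
print(carousel(1.0, [0.99] * 50 + [0.0] + [0.99] * 49))     # 0.0: closing the gate once erases the history
```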
Enter – the constant error carousel
[Figure: the same carousel, with the gating terms σ(t+1) … σ(t+4) and the inputs x(t+1) … x(t+4) produced by the "other stuff" in the network]
• Actual non-linear work is done by other portions of the network
Enter the LSTM
• Long Short-Term Memory
• Explicitly latch information to prevent decay / blowup
• Following notes borrow liberally from
  http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Standard RNN
• Recurrent neurons receive past recurrent outputs and current input as inputs
• Processed through a tanh() activation function
  – As mentioned earlier, tanh() is the generally used activation for the hidden layer
• Current recurrent output passed to next higher layer and next time instant
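A minimal sketch of that recurrence (generic weight names and made-up sizes; not tied to any particular library):

```python
# One step of a standard RNN layer: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Example usage: 10-dimensional inputs, 20-dimensional hidden state.
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.standard_normal((20, 10)), rng.standard_normal((20, 20)), np.zeros(20)
h = np.zeros(20)
for x_t in rng.standard_normal((5, 10)):       # a short input sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)      # passed on to the next time instant (and upward)
```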
Long Short-Term Memory
• The σ() are multiplicative gates that decide if something is important or not
• Remember, every line actually represents a vector
LSTM: Constant Error Carousel
• Key component: a remembered cell state
LSTM: CEC
• C_t is the linear history carried by the constant-error carousel
• Carries information through, only affected by a gate
  – And addition of history, which too is gated..
LSTM: Gates
• Gates are simple sigmoidal units with outputs in the range (0,1)
• They control how much of the information is to be let through
LSTM: Forget gate
• The first gate determines whether to carry over the history or to forget it
  – More precisely, how much of the history to carry over
  – Also called the "forget" gate
  – Note, we're actually distinguishing between the cell memory C and the state h that is carried over time! They're related, though
LSTM: Input gate
• The second gate has two parts
  – A perceptron layer that determines if there's something interesting in the input
  – A gate that decides if it's worth remembering
  – If so, it's added to the current memory cell
LSTM: Memory cell update
• The second gate has two parts
  – A perceptron layer that determines if there's something interesting in the input
  – A gate that decides if it's worth remembering
  – If so, it's added to the current memory cell
LSTM: Output and Output gate
• The output of the cell
  – Simply compress it with tanh to make it lie between -1 and 1
    • Note that this compression no longer affects our ability to carry memory forward
  – While we're at it, let's toss in an output gate
    • To decide if the memory contents are worth reporting at this time
LSTM: The "Peephole" Connection
• Why not just let the cell directly influence the gates while at it
  – Party!!
The complete LSTM unit
[Figure: LSTM cell with cell state C_{t-1} → C_t along the top, hidden state h_{t-1} → h_t, input x_t, gates f_t, i_t, o_t (σ units), candidate C̃_t (tanh), and a tanh on the cell output]
• With input, output, and forget gates and the peephole connection..
Backpropagation rules: Forward
[Figure: the same LSTM cell, annotated with the forward equations for the gates and the cell/state variables]
• Forward rules:
  – Gates
  – Variables
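Since the forward equations themselves live in the figure, here is a hedged sketch of one common peephole-LSTM forward step consistent with the gates and cell shown above. The parameter names, the use of elementwise (diagonal) peephole weights, and the grouping into a dictionary are assumptions for illustration, not the slide's own notation:

```python
# One forward step of a peephole LSTM (a common formulation; a sketch, not the slide's exact equations).
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, p):
    # p: dict of parameters, e.g. p["W_xf"] (input->forget), p["W_hf"] (hidden->forget),
    # p["w_Cf"] (elementwise peephole weight), p["b_f"], and similarly for i, o, and the candidate.
    f = sigmoid(p["W_xf"] @ x + p["W_hf"] @ h_prev + p["w_Cf"] * C_prev + p["b_f"])   # forget gate
    i = sigmoid(p["W_xi"] @ x + p["W_hi"] @ h_prev + p["w_Ci"] * C_prev + p["b_i"])   # input gate
    C_tilde = np.tanh(p["W_xC"] @ x + p["W_hC"] @ h_prev + p["b_C"])                  # candidate memory
    C = f * C_prev + i * C_tilde                                                      # carousel update
    o = sigmoid(p["W_xo"] @ x + p["W_ho"] @ h_prev + p["w_Co"] * C + p["b_o"])        # output gate (peeks at C_t)
    h = o * np.tanh(C)                                                                # new hidden state
    return h, C
```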
Backpropagation rules: Backward
[Figure, repeated on the following build-up slides: the LSTM cell unrolled over timesteps t and t+1, with inputs x_t, x_{t+1}, cell states C_{t-1}, C_t, C_{t+1}, hidden states h_{t-1}, h_t, h_{t+1}, gates f, i, o, and output z_t]
• The gradient w.r.t. the cell state:
  ∇_{C_t} Div = ∇_{h_t} Div ∘ ( o_t ∘ tanh′(·)·W_{Ch} + tanh(·) ∘ o′(·)·W_{Co} )
              + ∇_{C_{t+1}} Div ∘ ( f_{t+1} + C_t ∘ f′(·)·W_{Cf} + C̃_{t+1} ∘ i′(·)·W_{Ci} )
• The gradient w.r.t. the hidden state:
  ∇_{h_t} Div = ∇_{z_t} Div · ∇_{h_t} z_t
              + ∇_{C_{t+1}} Div ∘ ( C_t ∘ f′(·)·W_{hf} + C̃_{t+1} ∘ i′(·)·W_{hi} + i_{t+1} ∘ tanh′(·)·W_{hC} )
              + ∇_{h_{t+1}} Div ∘ tanh(·) ∘ o′(·)·W_{ho}
• Not explicitly deriving the derivatives w.r.t. weights; left as an exercise
Gated Recurrent Units: Let's simplify the LSTM
• A simplified LSTM which addresses some of your concerns
Gated Recurrent Units: Let's simplify the LSTM
• Combine forget and input gates
  – If new input is to be remembered, then this means old memory is to be forgotten
    • Why compute twice?
Gated Recurrent Units: Let's simplify the LSTM
• Don't bother to separately maintain compressed and regular memories
  – Pointless computation!
• But compress it before using it to decide on the usefulness of the current input!
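Put together, these simplifications give the standard GRU update. The slide does not spell the equations out, so the sketch below uses the usual formulation from the literature (parameter names and the gating convention are assumptions):

```python
# One step of a GRU: a single update gate replaces the LSTM's input/forget pair,
# and there is no separate cell state.
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    z = sigmoid(p["W_xz"] @ x + p["W_hz"] @ h_prev + p["b_z"])               # update gate (forget + input combined)
    r = sigmoid(p["W_xr"] @ x + p["W_hr"] @ h_prev + p["b_r"])               # reset gate (compresses the history)
    h_tilde = np.tanh(p["W_xh"] @ x + p["W_hh"] @ (r * h_prev) + p["b_h"])   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                                  # single memory, interpolated by z
```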
LSTM Equations
• i = σ(x_t U^i + s_{t-1} W^i)
• f = σ(x_t U^f + s_{t-1} W^f)
• o = σ(x_t U^o + s_{t-1} W^o)
• g = tanh(x_t U^g + s_{t-1} W^g)
• c_t = c_{t-1} ∘ f + g ∘ i
• s_t = tanh(c_t) ∘ o
• y = softmax(V s_t)

• i: input gate, how much of the new information will be let through the memory cell
• f: forget gate, responsible for what information should be thrown away from the memory cell
• o: output gate, how much of the information will be passed to expose to the next time step
• g: self-recurrent, which is equal to the standard RNN
• c_t: internal memory of the memory cell
• s_t: hidden state
• y: final output
[Figure: LSTM memory cell]
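These equations translate line-for-line into code. The sketch below is a direct transcription (the parameter container and the shapes, with row-vector inputs multiplying U and W on the right, are assumptions):

```python
# Direct transcription of the LSTM equations on this slide.
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def softmax(z): e = np.exp(z - z.max()); return e / e.sum()

def lstm_cell(x_t, s_prev, c_prev, p):
    i = sigmoid(x_t @ p["U_i"] + s_prev @ p["W_i"])    # input gate
    f = sigmoid(x_t @ p["U_f"] + s_prev @ p["W_f"])    # forget gate
    o = sigmoid(x_t @ p["U_o"] + s_prev @ p["W_o"])    # output gate
    g = np.tanh(x_t @ p["U_g"] + s_prev @ p["W_g"])    # self-recurrent candidate
    c_t = c_prev * f + g * i                           # internal memory
    s_t = np.tanh(c_t) * o                             # hidden state
    y_t = softmax(s_t @ p["V"])                        # final output
    return y_t, s_t, c_t
```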
LSTM architectures example
[Figure: network unrolled over time, with outputs Y(t) above inputs X(t)]
• Each green box is now an entire LSTM or GRU unit
• Also keep in mind each box is an array of units
Bidirectional LSTM
[Figure: outputs Y(0) … Y(T) computed from a forward LSTM layer (initial state h_f(-1)) running over X(0) … X(T) and a backward LSTM layer (initial state h_b(inf)) running over the same inputs in reverse]
• Like the BRNN, but now the hidden nodes are LSTM units.
• Can have multiple layers of LSTM units in either direction
  – It's also possible to have MLP feed-forward layers between the hidden layers..
• The output nodes (orange boxes) may be complete MLPs
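This structure is available off the shelf; a hedged usage sketch with PyTorch (the sizes are arbitrary): setting bidirectional=True runs a forward and a backward LSTM over the input and concatenates their hidden states at every step.

```python
# Bidirectional, two-layer LSTM over a batch of sequences (illustrative sizes).
import torch
import torch.nn as nn

blstm = nn.LSTM(input_size=40, hidden_size=128, num_layers=2,
                batch_first=True, bidirectional=True)
x = torch.randn(8, 100, 40)              # (batch, time, features)
out, (h_n, c_n) = blstm(x)
print(out.shape)                         # torch.Size([8, 100, 256]): forward and backward states concatenated
```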
Significant issue left out
• The Divergence
Story so far
[Figure: RNN unrolled from t = 0, with inputs X(t), outputs Y(t), desired outputs Y_desired(t), and a DIVERGENCE computed between them]
• Outputs may not be defined at all times
  – Often no clear synchrony between input and desired output
    • Unclear how to specify alignment
    • Unclear how to compute a divergence
  – Obvious choices for divergence may not be differentiable (e.g. edit distance)
    • In later lectures..
Some typical problem settings
• Let's consider a few typical problems
• Issues:
  – How to define the divergence()
  – How to compute the gradient
  – How to backpropagate
  – Specific problem: The constant error carousel..