
Deep Learning: Recurrent Networks. 10/16/2017

Recap of the previous lecture's examples ("The unreasonable effectiveness of recurrent neural networks"): which open source project? Related math, and what is it talking about? And a Wikipedia page explaining it all. All previous examples..


1. Linear recursions: Vector version
• Consider a simple vector linear recursion (note the change of notation)
  – h(t) = W h(t − 1) + C x(t)
  – h_0(t) = W^t C x(0)
• Length of the response (|h|) to a single input at t = 0
• We can write W = U Λ U^{-1}
  – W u_i = λ_i u_i
  – For any vector h we can write
    • h = a_1 u_1 + a_2 u_2 + ⋯ + a_n u_n
    • W h = a_1 λ_1 u_1 + a_2 λ_2 u_2 + ⋯ + a_n λ_n u_n
    • W^t h = a_1 λ_1^t u_1 + a_2 λ_2^t u_2 + ⋯ + a_n λ_n^t u_n
  – lim_{t→∞} W^t h = a_m λ_m^t u_m, where m = argmax_j |λ_j|
• If |λ_max| > 1 the response will blow up; otherwise it will contract and shrink to 0 rapidly
  – What happens at middling values of t depends on the other eigenvalues
• For any input, for large t the length of the hidden vector will expand or contract according to the t-th power of the largest eigenvalue of the hidden-layer weight matrix
  – Unless the input has no component along the eigenvector corresponding to the largest eigenvalue; in that case it will grow according to the second largest eigenvalue.. and so on..
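To make the eigenvalue argument concrete, here is a small numpy sketch (the random matrix, its rescaling to |λ_max| = 0.9, and the impulse input are all invented for illustration): it runs h(t) = W h(t − 1) on a single input at t = 0 and shows that the norm of W^t h tracks |λ_max|^t up to a constant.

```python
# Sketch (assumptions: random W, single impulse input at t = 0, no input afterwards).
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((n, n))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale so |lambda_max| = 0.9

lmax = np.max(np.abs(np.linalg.eigvals(W)))
h = np.ones(n)                                    # response to a single input at t = 0
for t in range(1, 21):
    h = W @ h                                     # h_0(t) = W^t h_0(0)
    print(t, np.linalg.norm(h), lmax ** t)        # norm tracks |lambda_max|^t up to a constant
```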

2. Linear recursions
• Vector linear recursion
  – h(t) = W h(t − 1) + C x(t)
  – h_0(t) = W^t C x(0)
• Response to a single input [1 1 1 1] at t = 0
  – (Figure: response magnitude over time for λ_max = 0.9, 1.1, 1.1, 1, 1)

3. Linear recursions
• Vector linear recursion
  – h(t) = W h(t − 1) + C x(t)
  – h_0(t) = W^t C x(0)
• Response to a single input [1 1 1 1] at t = 0
  – (Figure: responses for λ_max = 0.9; λ_max = 1.1; λ_max = 1.1 with λ_2nd = 0.5; λ_max = 1; λ_max = 1 with λ_2nd = 0.1; and a case with complex eigenvalues)

4. Lesson..
• In linear systems, long-term behavior depends entirely on the eigenvalues of the hidden-layer weight matrix
  – If the largest eigenvalue is greater than 1, the system will "blow up"
  – If it is less than 1, the response will "vanish" very quickly
  – Complex eigenvalues cause oscillatory response
    • Which we may or may not want
  – To force the matrix to have real eigenvalues for smooth behavior, use a symmetric weight matrix

5. How about non-linearities
h(t) = f(w h(t − 1) + c x(t))
• The behavior of scalar non-linearities
  – (Figure: left: sigmoid, middle: tanh, right: ReLU)
  – Sigmoid: saturates in a limited number of steps, regardless of w
  – Tanh: sensitive to w, but eventually saturates
    • "Prefers" weights close to 1.0
  – ReLU: sensitive to w, can blow up
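A minimal sketch of this scalar recursion, assuming a single unit input at t = 0 and a few arbitrary weight values w; it reproduces the qualitative behavior described above (sigmoid saturates quickly, tanh saturates but depends on w, ReLU can blow up).

```python
# Sketch: h(t) = f(w * h(t-1) + c * x(t)) for the three activations discussed above.
# The values of w, the impulse input, and the horizon are arbitrary choices.
import numpy as np

def run(f, w, h0=0.0, x0=1.0, c=1.0, T=20):
    h, out = h0, []
    for t in range(T):
        x = x0 if t == 0 else 0.0       # single input at t = 0
        h = f(w * h + c * x)
        out.append(h)
    return out

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu    = lambda z: max(z, 0.0)

for w in (0.5, 1.0, 2.0):
    print("w =", w)
    print("  sigmoid:", [round(v, 3) for v in run(sigmoid, w)][:8])
    print("  tanh:   ", [round(v, 3) for v in run(np.tanh, w)][:8])
    print("  relu:   ", [round(v, 3) for v in run(relu, w)][:8])
```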

6. How about non-linearities
h(t) = f(w h(t − 1) + c x(t))
• With a negative start (equivalent to a negative weight)
  – (Figure: left: sigmoid, middle: tanh, right: ReLU)
  – Sigmoid: saturates in a limited number of steps, regardless of w
  – Tanh: sensitive to w, but eventually saturates
  – ReLU: for negative starts, has no response

7. Vector Process
h(t) = f(W h(t − 1) + C x(t))
• Assuming a uniform unit-vector initialization
  – [1, 1, 1, …]/√N
  – Behavior similar to the scalar recursion
  – Interestingly, ReLU is more prone to blowing up (why?)
    • Eigenvalues less than 1.0 retain the most "memory"
  – (Figure: sigmoid, tanh, ReLU)

8. Vector Process
h(t) = f(W h(t − 1) + C x(t))
• Assuming a uniform unit-vector initialization
  – [−1, −1, −1, …]/√N
  – Behavior similar to the scalar recursion
  – Interestingly, ReLU is more prone to blowing up (why?)
  – (Figure: sigmoid, tanh, ReLU)

9. Stability Analysis
• Formal stability analysis considers convergence of "Lyapunov" functions
  – Alternately, Routh's criterion and/or pole-zero analysis
  – Positive definite functions evaluated at h
  – Conclusions are similar: only the tanh activation gives us any reasonable behavior
    • And it still has very short "memory"
• Lessons:
  – Bipolar activations (e.g. tanh) have the best behavior
  – Still sensitive to the eigenvalues of W
  – Best-case memory is short
  – Exponential memory behavior
    • "Forgets" in an exponential manner

10. How about deeper recursion
• Consider a simple, scalar, linear recursion
  – Adding more "taps" adds more "modes" to memory in somewhat non-obvious ways
  – h(t) = 0.5 h(t − 1) + 0.25 h(t − 5) + x(t)
  – h(t) = 0.5 h(t − 1) + 0.25 h(t − 5) + 0.1 h(t − 8) + x(t)
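A quick sketch of the two multi-tap recursions above, driven by an impulse at t = 0 (the 60-step horizon is an arbitrary choice); printing the first values shows the extra taps adding slowly decaying modes to the memory.

```python
# Sketch of the multi-tap recursions above with a single impulse input at t = 0.
import numpy as np

def multitap(taps, T=60):
    # taps: {delay: coefficient}, e.g. {1: 0.5, 5: 0.25}
    h = np.zeros(T)
    for t in range(T):
        x = 1.0 if t == 0 else 0.0
        h[t] = x + sum(c * h[t - d] for d, c in taps.items() if t - d >= 0)
    return h

h1 = multitap({1: 0.5, 5: 0.25})
h2 = multitap({1: 0.5, 5: 0.25, 8: 0.1})
print(np.round(h1[:20], 4))
print(np.round(h2[:20], 4))
```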

11. Stability Analysis
• Similar analysis of vector functions with non-linear activations is relatively straightforward
  – Linear systems: Routh's criterion
    • And pole-zero analysis (involves tensors)
    • On board?
  – Non-linear systems: Lyapunov functions
• Conclusions do not change

12. RNNs..
• Excellent models for time-series analysis tasks
  – Time-series prediction
  – Time-series classification
  – Sequence prediction..
  – They can even simplify problems that are difficult for MLPs
• But the memory isn't all that great..
  – Also..

13. The vanishing gradient problem
• A particular problem with training deep networks..
  – The gradient of the error with respect to weights is unstable..

14. Some useful preliminary math: The problem with training deep networks
(Figure: a deep MLP with layer weights W_0, W_1, W_2, …)
• A multilayer perceptron is a nested function
  Y = f_N(W_{N−1} f_{N−1}(W_{N−2} f_{N−2}(… W_0 X)))
  – W_k is the weight matrix at the k-th layer
• The error for X can be written as
  Div(X) = D(f_N(W_{N−1} f_{N−1}(W_{N−2} f_{N−2}(… W_0 X))))

15. Training deep networks
• Vector derivative chain rule: for any f(W g(X)):
  df(W g(X))/dX = [df(W g(X))/d(W g(X))] · [d(W g(X))/dg(X)] · [dg(X)/dX]
  (poor notation; more compactly:)
  ∇_X f = ∇_z f · W · ∇_X g
• Where
  – z = W g(X)
  – ∇_z f is the Jacobian matrix of f(z) w.r.t. z
    • Using the notation ∇_z f instead of J_f(z) for consistency

16. Training deep networks
• For Div(X) = D(f_N(W_{N−1} f_{N−1}(W_{N−2} f_{N−2}(… W_0 X))))
• We get:
  ∇_{f_k} Div = ∇D · ∇f_N · W_{N−1} · ∇f_{N−1} · W_{N−2} … ∇f_{k+1} · W_k
• Where
  – ∇_{f_k} Div is the gradient of the error Div(X) w.r.t. the output of the k-th layer of the network
    • Needed to compute the gradient of the error w.r.t. W_{k−1}
  – ∇f_n is the Jacobian of f_n() w.r.t. its current input
  – All of the terms in this product are matrices

17. The Jacobian of the hidden layers
• h_i(t) = f_t(z_i(t))
• The Jacobian is diagonal:
  ∇f_t(z) = diag( f'_{t,1}(z_1), f'_{t,2}(z_2), …, f'_{t,N}(z_N) )
• ∇f_t() is the derivative of the output of the (layer of) hidden recurrent neurons with respect to their input
  – A matrix where the diagonal entries are the derivatives of the activation of the recurrent hidden layer, and the off-diagonal entries are zero

18. The Jacobian
• h_i(t) = f_t(z_i(t));  ∇f_t(z) = diag( f'_{t,1}(z_1), …, f'_{t,N}(z_N) )
• The derivative (or subgradient) of the activation function is always bounded
  – The diagonals of the Jacobian are bounded
• There is a limit on how much multiplying a vector by the Jacobian will scale it

19. The derivative of the hidden state activation
• ∇f_t(z) = diag( f'_{t,1}(z_1), …, f'_{t,N}(z_N) )
• The most common activation functions, such as sigmoid, tanh() and ReLU, have derivatives that never exceed 1
• The most common activation for the hidden units in an RNN is tanh()
  – The derivative of tanh() never exceeds 1 (and is strictly less than 1 everywhere except at 0)
• Multiplication by the Jacobian is therefore always a shrinking operation
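A small demonstration of the claim, assuming arbitrary pre-activation values z: the Jacobian of tanh is diag(1 − tanh(z)²), whose entries never exceed 1, so repeated multiplication by it can only shrink a vector.

```python
# Sketch: repeated multiplication by the diagonal Jacobian of tanh shrinks the vector.
# The pre-activation values z are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(8)
J = np.diag(1.0 - np.tanh(z) ** 2)    # Jacobian of tanh evaluated at z

v = np.ones(8)
for step in range(10):
    v = J @ v
    print(step, np.linalg.norm(v))    # the norm decreases monotonically
```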

20. Training deep networks
∇_{f_k} Div = ∇D · ∇f_N · W_{N−1} · ∇f_{N−1} · W_{N−2} … ∇f_{k+1} · W_k
• As we go back in layers, the Jacobians of the activations constantly shrink the derivative
  – After a few steps the derivative of the divergence at any time is totally "forgotten"

21. What about the weights
∇_{f_k} Div = ∇D · ∇f_N · W_{N−1} · ∇f_{N−1} · W_{N−2} … ∇f_{k+1} · W_k
• In a single-layer RNN, the weight matrices at every step are identical
• The chain product for ∇_{f_k} Div will
  – Expand ∇D along directions in which the singular values of the weight matrices are greater than 1
  – Shrink ∇D in directions where the singular values are less than 1
  – Exploding or vanishing gradients

22. Exploding/Vanishing gradients
∇_{f_k} Div = ∇D · ∇f_N · W_{N−1} · ∇f_{N−1} · W_{N−2} … ∇f_{k+1} · W_k
• Every term in the product is a matrix
• ∇D is proportional to the actual error
  – Particularly for L2 and KL divergences
• The chain product for ∇_{f_k} Div will
  – Expand ∇D in directions where each stage has singular values greater than 1
  – Shrink ∇D in directions where each stage has singular values less than 1
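A sketch of the chain product for the recurrent case, where the same weight matrix W reappears at every step. The activation derivatives are taken as 1 (the linear case) to isolate the effect of W, and the matrix, scales, and horizon are arbitrary; scaling W slightly below or above its critical point makes the propagated gradient vanish or explode.

```python
# Sketch: the backward chain repeatedly multiplies the error gradient by (roughly) W.
# Here W is normalized so its spectral radius is 1.0, then scaled down or up.
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.standard_normal((n, n))
W /= np.max(np.abs(np.linalg.eigvals(W)))      # spectral radius normalized to 1.0

for scale in (0.9, 1.0, 1.1):
    g = np.ones(n)                              # stand-in for the error gradient at time T
    for _ in range(50):
        g = g @ (scale * W)                     # one step of the chain product
    print(scale, np.linalg.norm(g))             # vanishes, stays moderate, or explodes
```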

23. Gradient problems in deep networks
∇_{f_k} Div = ∇D · ∇f_N · W_{N−1} · ∇f_{N−1} · W_{N−2} … ∇f_{k+1} · W_k
• The gradients in the lower/earlier layers can explode or vanish
  – Resulting in insignificant or unstable gradient descent updates
  – The problem gets worse as network depth increases

24. – 28. Vanishing gradient examples..
(Figures, one per slide, each running from the input layer to the output layer: ELU activation with batch gradients; ReLU activation with batch gradients; sigmoid activation with batch gradients; tanh activation with batch gradients; ELU activation with gradients for individual instances)
• 19-layer MNIST model
  – Different activations: exponential linear units, ReLU, sigmoid, tanh
  – Each layer is 1024 neurons wide
  – Gradients shown at initialization
    • They will actually decrease with additional training
• The figures show log ‖∇_{W_neuron} E‖, where W_neuron is the vector of incoming weights to each neuron
  – I.e. the gradient of the loss w.r.t. the entire set of weights into each neuron

29. Vanishing gradients
• ELU activations maintain gradients longest
• But in all cases gradients effectively vanish after about 10 layers!
  – Your results may vary
• Both batch gradients and gradients for individual instances disappear
  – In reality a tiny number may actually blow up.
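A rough numpy sketch of the kind of experiment behind these figures: a deep fully-connected network at random initialization, with per-layer gradient norms printed from the output back toward the input. The depth, width, dummy input, dummy output error, and initialization scale are stand-ins, not the actual 19-layer MNIST setup.

```python
# Sketch: per-layer gradient norms of a deep MLP at random initialization.
import numpy as np

rng = np.random.default_rng(0)
depth, width = 19, 256
acts = {
    "sigmoid": (lambda z: 1/(1+np.exp(-z)),                  lambda a: a*(1-a)),
    "tanh":    (np.tanh,                                     lambda a: 1-a**2),
    "relu":    (lambda z: np.maximum(z, 0),                  lambda a: (a > 0).astype(float)),
    "elu":     (lambda z: np.where(z > 0, z, np.exp(z) - 1), lambda a: np.where(a > 0, 1.0, a + 1.0)),
}

x = rng.standard_normal(width)
for name, (f, dfda) in acts.items():
    Ws = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)]
    # forward pass, caching the activations
    a, cache = x, []
    for W in Ws:
        a = f(W @ a)
        cache.append(a)
    # backward pass with a dummy unit error at the output
    g, norms = np.ones(width), []
    for W, a in zip(reversed(Ws), reversed(cache)):
        g = (g * dfda(a)) @ W              # gradient w.r.t. the previous layer's output
        norms.append(np.linalg.norm(g))
    print(name, ["%.1e" % v for v in norms[::5]])   # from output toward input, every 5th layer
```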

30. Recurrent nets are very deep nets
(Figure: an RNN unrolled in time from input X(0), with initial state h_f(−1), to output Y(T))
∇_{f_k} Div = ∇D · ∇f_N · W_{N−1} · ∇f_{N−1} · W_{N−2} … ∇f_{k+1} · W_k
• The relation between X(0) and Y(T) is that of a very deep network
  – Gradients from errors at t = T will vanish by the time they're propagated to t = 0

31. Recall: Vanishing stuff..
(Figure: an unrolled RNN with initial state h_{−1}, inputs X(0) … X(T) and outputs Y(0) … Y(T))
• Stuff gets forgotten in the forward pass too

32. The long-term dependency problem
PATTERN 1 [………………….] PATTERN 2
"Jane had a quick lunch in the bistro. Then she.."
• Any other pattern of any length can happen between pattern 1 and pattern 2
  – The RNN will "forget" pattern 1 if the intermediate stuff is too long
  – "Jane" → the next pronoun referring to her will be "she"
• Must know to "remember" for extended periods of time and "recall" when necessary
  – Can be performed with a multi-tap recursion, but how many taps?
  – Need an alternate way to "remember" stuff

  33. And now we enter the domain of..

34. Exploding/Vanishing gradients
∇_{f_k} Div = ∇D · ∇f_N · W_{N−1} · ∇f_{N−1} · W_{N−2} … ∇f_{k+1} · W_k
• Can we replace this with something that doesn't fade or blow up?
• E.g. ∇_{f_k} Div = ∇D · C σ_N · C σ_{N−1} · C … σ_k
• Can we have a network that just "remembers" arbitrarily long, to be recalled on demand?

35. Enter – the constant error carousel
(Figure: the carried state h(t) passes through times t+1 … t+4, scaled at each step only by a gate σ(t+k))
• History is carried through uncompressed
  – No weights, no nonlinearities
  – The only scaling is through the σ "gating" term that captures other triggers
  – E.g. "Have I seen Pattern 2?"
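A tiny numerical illustration of the carousel (the gate values are made up): because the carried value is only ever multiplied by gates, the sensitivity of the state after k steps to the state now is just the product of the intervening gate values, which stays near 1 as long as the gates do.

```python
# Sketch of the constant error carousel: scaling only, no weights, no nonlinearity.
import numpy as np

gates = np.array([0.99, 1.0, 0.98, 1.0, 0.97, 1.0, 0.99, 1.0])   # invented gate values
h = 3.0                                                           # value carried in at time t

for s in gates:
    h = s * h                        # h(t+1) = sigma(t+1) * h(t)

print("carried value:", h)
print("d h(T) / d h(t):", np.prod(gates))   # about 0.93 after 8 steps: memory barely fades
```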

36. – 39. Enter – the constant error carousel
(Figures: animation builds of the same carousel, first adding the inputs X(t+1) … X(t+4) that feed the gates, then the rest of the network, the "other stuff")
• The actual non-linear work is done by other portions of the network

40. Enter the LSTM
• Long Short-Term Memory
• Explicitly latch information to prevent decay / blowup
• The following notes borrow liberally from
  http://colah.github.io/posts/2015-08-Understanding-LSTMs/

41. Standard RNN
• Recurrent neurons receive past recurrent outputs and the current input as inputs
• Processed through a tanh() activation function
  – As mentioned earlier, tanh() is the generally used activation for the hidden layer
• The current recurrent output is passed to the next higher layer and to the next time instant
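A minimal numpy sketch of this standard RNN cell; the shapes, initialization scale, and parameter names (U, W, b) are my own choices, but the computation is exactly h_t = tanh(U x_t + W h_{t−1} + b) as described.

```python
# Sketch of one standard (tanh) recurrent step, unrolled over a short random sequence.
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One time step: h_t = tanh(U x_t + W h_prev + b)."""
    return np.tanh(U @ x_t + W @ h_prev + b)

rng = np.random.default_rng(0)
d_in, d_h = 3, 5
U = rng.standard_normal((d_h, d_in)) * 0.1
W = rng.standard_normal((d_h, d_h)) * 0.1
b = np.zeros(d_h)

h = np.zeros(d_h)
for t in range(4):
    x_t = rng.standard_normal(d_in)
    h = rnn_step(x_t, h, W, U, b)    # h is passed both "up" and to the next time step
print(h)
```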

42. Long Short-Term Memory
• The σ() are multiplicative gates that decide if something is important or not
• Remember, every line actually represents a vector

43. LSTM: Constant Error Carousel
• Key component: a remembered cell state

44. LSTM: CEC
• C_t is the linear history carried by the constant-error carousel
• It carries information through, affected only by a gate
  – And by the addition of new history, which too is gated..

45. LSTM: Gates
• Gates are simple sigmoidal units with outputs in the range (0, 1)
• They control how much of the information is to be let through

46. LSTM: Forget gate
• The first gate determines whether to carry over the history or to forget it
  – More precisely, how much of the history to carry over
  – Also called the "forget" gate
  – Note: we're actually distinguishing between the cell memory C and the state h that is carried over time. They're related, though

47. LSTM: Input gate
• The second gate has two parts
  – A perceptron layer that determines if there's something interesting in the input
  – A gate that decides if it's worth remembering
  – If so, it is added to the current memory cell

48. LSTM: Memory cell update
• The gated old memory and the gated new input are added to form the updated memory cell
  – In the notation of the equations on slide 68: c_t = c_{t−1} ∘ f + g ∘ i

49. LSTM: Output and Output gate
• The output of the cell
  – Simply compress it with tanh to make it lie between −1 and 1
    • Note that this compression no longer affects our ability to carry memory forward
  – While we're at it, let's toss in an output gate
    • To decide if the memory contents are worth reporting at this time

50. LSTM: The "Peephole" Connection
• Why not just let the cell directly influence the gates while we're at it
  – Party!!

51. The complete LSTM unit
(Figure: the full LSTM cell, with inputs x_t and h_{t−1}; gates i_t, f_t, o_t computed with σ(); candidate C̃_t computed with tanh; cell state updated from C_{t−1} to C_t; output h_t produced via tanh and the output gate)
• With input, output, and forget gates and the peephole connection..

52. Backpropagation rules: Forward
(Figure: the same LSTM cell, annotated with the forward-pass variables and gates)
• Forward rules: the gates i_t, f_t, o_t, the candidate C̃_t, the cell state C_t and the state h_t are computed as in the figure (see the LSTM equations on slide 68)

53. – 64. Backpropagation rules: Backward
(Figure: two consecutive LSTM cells, at times t and t+1, used to trace every path from C_t and h_t to the divergence; the equations below are built up one term per slide)
• Collecting one term for each path from the cell state C_t to the divergence:
  ∇_{C_t} Div = ∇_{h_t} Div ∘ o_t ∘ tanh'(C_t)
    + ∇_{h_t} Div ∘ tanh(C_t) ∘ σ'(·) · W_{Co}    (peephole into the output gate)
    + ∇_{C_{t+1}} Div ∘ f_{t+1}    (the carousel itself)
    + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·) · W_{Cf}    (peephole into the next forget gate)
    + ∇_{C_{t+1}} Div ∘ C̃_{t+1} ∘ σ'(·) · W_{Ci}    (peephole into the next input gate)
• Similarly, collecting one term for each path from the state h_t:
  ∇_{h_t} Div = ∇_{z_t} Div · ∇_{h_t} z_t
    + ∇_{C_{t+1}} Div ∘ C_t ∘ σ'(·) · W_{hf}
    + ∇_{C_{t+1}} Div ∘ C̃_{t+1} ∘ σ'(·) · W_{hi}
    + ∇_{C_{t+1}} Div ∘ i_{t+1} ∘ tanh'(·) · W_{hC̃}
    + ∇_{h_{t+1}} Div ∘ tanh(C_{t+1}) ∘ σ'(·) · W_{ho}
  – Here z_t is whatever the rest of the network computes from h_t at time t; σ'(·) and tanh'(·) are evaluated at the corresponding gate or candidate pre-activations; the W_{C·} are the peephole weights from the cell state and the W_{h·} the recurrent weights from the state into each gate or candidate
• The derivatives w.r.t. the weights are not derived explicitly here; they are left as an exercise
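Hand-derived backward rules like these are easy to get subtly wrong, so a cheap sanity check is to differentiate the forward equations numerically. The sketch below uses the non-peephole equations from slide 68, with invented shapes, parameters and a toy L2 divergence, and estimates ∇_{C_0} Div by central differences: the quantity an analytical implementation of the rules above should reproduce.

```python
# Sketch: finite-difference check of dDiv/dC_0 for a small non-peephole LSTM unrolled 3 steps.
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 3, 4
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Invented parameters (names Ui..Wg are my own) and a toy sequence/target.
P = {k: rng.standard_normal((d_h, d_x)) * 0.3 for k in ("Ui", "Uf", "Uo", "Ug")}
P.update({k: rng.standard_normal((d_h, d_h)) * 0.3 for k in ("Wi", "Wf", "Wo", "Wg")})
xs = [rng.standard_normal(d_x) for _ in range(3)]
target = rng.standard_normal(d_h)

def divergence(c0):
    """Run the LSTM forward from initial cell state c0 and return a toy L2 divergence."""
    c, h = c0, np.zeros(d_h)
    for x in xs:
        i = sigmoid(P["Ui"] @ x + P["Wi"] @ h)
        f = sigmoid(P["Uf"] @ x + P["Wf"] @ h)
        o = sigmoid(P["Uo"] @ x + P["Wo"] @ h)
        g = np.tanh(P["Ug"] @ x + P["Wg"] @ h)
        c = c * f + g * i
        h = np.tanh(c) * o
    return 0.5 * np.sum((h - target) ** 2)

# Central-difference estimate of the gradient the slides derive analytically.
c0, eps = rng.standard_normal(d_h), 1e-5
grad = np.zeros(d_h)
for k in range(d_h):
    e = np.zeros(d_h)
    e[k] = eps
    grad[k] = (divergence(c0 + e) - divergence(c0 - e)) / (2 * eps)
print(grad)
```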

65. Gated Recurrent Units: let's simplify the LSTM
• A simplified LSTM which addresses some of your concerns about why things are done the way they are

66. Gated Recurrent Units: let's simplify the LSTM
• Combine the forget and input gates
  – If new input is to be remembered, then this means old memory is to be forgotten
    • Why compute twice?

67. Gated Recurrent Units: let's simplify the LSTM
• Don't bother to separately maintain compressed and regular memories
  – Pointless computation!
• But compress it before using it to decide on the usefulness of the current input!
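The slides describe the GRU only qualitatively; the sketch below follows the standard GRU formulation (an update gate z that merges forget and input, and a reset gate r that filters the old state before it is used), with invented shapes and parameter names. It is a plausible rendering of the simplifications described above, not the lecture's own code.

```python
# Sketch of one GRU step (standard formulation, assumptions as described in the lead-in).
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, P):
    z = sigmoid(P["Uz"] @ x_t + P["Wz"] @ h_prev)              # update gate: merged forget/input
    r = sigmoid(P["Ur"] @ x_t + P["Wr"] @ h_prev)              # reset gate: filters the old state
    h_tilde = np.tanh(P["Uh"] @ x_t + P["Wh"] @ (r * h_prev))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                      # remembering new forgets old

rng = np.random.default_rng(0)
d_x, d_h = 3, 4
P = {**{k: rng.standard_normal((d_h, d_x)) * 0.3 for k in ("Uz", "Ur", "Uh")},
     **{k: rng.standard_normal((d_h, d_h)) * 0.3 for k in ("Wz", "Wr", "Wh")}}

h = np.zeros(d_h)
for t in range(5):
    h = gru_step(rng.standard_normal(d_x), h, P)
print(h)
```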

68. LSTM Equations
• i = σ(x_t U^i + s_{t−1} W^i)
• f = σ(x_t U^f + s_{t−1} W^f)
• o = σ(x_t U^o + s_{t−1} W^o)
• g = tanh(x_t U^g + s_{t−1} W^g)
• c_t = c_{t−1} ∘ f + g ∘ i
• s_t = tanh(c_t) ∘ o
• y = softmax(V s_t)
• i: input gate, how much of the new information will be let through to the memory cell
• f: forget gate, responsible for what information should be thrown away from the memory cell
• o: output gate, how much of the information will be passed on to the next time step
• g: self-recurrent unit, which is equal to the standard RNN
• c_t: internal memory of the memory cell
• s_t: hidden state
• y: final output
(Figure: LSTM memory cell)
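A direct numpy transcription of the equations on this slide (no peephole connection); the shapes, parameter values, and the matrix-vector ordering (U_i @ x_t in place of x_t U^i) are my own choices.

```python
# Sketch of one LSTM step following slide 68's equations.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

def lstm_step(x_t, s_prev, c_prev, P):
    i = sigmoid(P["Ui"] @ x_t + P["Wi"] @ s_prev)   # input gate
    f = sigmoid(P["Uf"] @ x_t + P["Wf"] @ s_prev)   # forget gate
    o = sigmoid(P["Uo"] @ x_t + P["Wo"] @ s_prev)   # output gate
    g = np.tanh(P["Ug"] @ x_t + P["Wg"] @ s_prev)   # "self-recurrent" candidate
    c = c_prev * f + g * i                          # internal memory (the CEC)
    s = np.tanh(c) * o                              # hidden state
    y = softmax(P["V"] @ s)                         # final output
    return s, c, y

rng = np.random.default_rng(0)
d_x, d_h, d_y = 3, 4, 2
P = {**{k: rng.standard_normal((d_h, d_x)) * 0.3 for k in ("Ui", "Uf", "Uo", "Ug")},
     **{k: rng.standard_normal((d_h, d_h)) * 0.3 for k in ("Wi", "Wf", "Wo", "Wg")},
     "V": rng.standard_normal((d_y, d_h)) * 0.3}

s, c = np.zeros(d_h), np.zeros(d_h)
for t in range(5):
    s, c, y = lstm_step(rng.standard_normal(d_x), s, c, P)
print(y)
```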

69. LSTM architectures example
(Figure: a recurrent network unrolled over time, producing Y(t) from X(t))
• Each green box is now an entire LSTM or GRU unit
• Also keep in mind each box is an array of units

70. Bidirectional LSTM
(Figure: a forward chain with initial state h_f(−1) and a backward chain with initial state h_b(∞), both reading X(0) … X(T) and jointly producing Y(0) … Y(T))
• Like the BRNN, but now the hidden nodes are LSTM units
• Can have multiple layers of LSTM units in either direction
  – It's also possible to have MLP feed-forward layers between the hidden layers..
• The output nodes (orange boxes) may be complete MLPs
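A sketch of the bidirectional wiring using a generic per-step cell: one pass left-to-right, one pass right-to-left, and the two states concatenated at each time step for the output network. For brevity the same stand-in tanh cell (with the same weights) is used for both directions; a real bidirectional LSTM would use separate LSTM parameters per direction.

```python
# Sketch of the bidirectional pattern around an arbitrary recurrent step function.
import numpy as np

def bidirectional(xs, step, h0_f, h0_b):
    hf, forward = h0_f, []
    for x in xs:                                   # left to right
        hf = step(x, hf)
        forward.append(hf)
    hb, backward = h0_b, []
    for x in reversed(xs):                         # right to left
        hb = step(x, hb)
        backward.append(hb)
    backward.reverse()
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

rng = np.random.default_rng(0)
d_x, d_h = 3, 4
U = rng.standard_normal((d_h, d_x)) * 0.3
W = rng.standard_normal((d_h, d_h)) * 0.3
step = lambda x, h: np.tanh(U @ x + W @ h)         # stand-in cell; an LSTM step would go here

xs = [rng.standard_normal(d_x) for _ in range(6)]
states = bidirectional(xs, step, np.zeros(d_h), np.zeros(d_h))
print(len(states), states[0].shape)                # 6 time steps, combined 8-dim state each
```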

71. Significant issue left out
• The divergence

72. Story so far
(Figure: an unrolled RNN starting at t = 0 with state h_{−1}, producing Y(t) from X(t); the outputs are compared to Y_desired(t) through a DIVERGENCE)
• Outputs may not be defined at all times
  – Often there is no clear synchrony between input and desired output
    • Unclear how to specify alignment
    • Unclear how to compute a divergence
  – Obvious choices for the divergence may not be differentiable (e.g. edit distance)
• In later lectures..

73. Some typical problem settings
• Let's consider a few typical problems
• Issues:
  – How to define the divergence()
  – How to compute the gradient
  – How to backpropagate
  – Specific problem: the constant error carousel..
