Recurrent Neural Networks: Stability Analysis and LSTMs

Recurrent Neural Networks: Stability analysis and LSTMs. M. Soleymani, Sharif University of Technology, Spring 2019. Most slides have been adopted from Bhiksha Raj, 11-785, CMU 2019, and some from the lectures of Fei-Fei Li and colleagues, cs231n.


  1. Linear recursions: Vector version
• Vector linear recursion (note change of notation): h(t) = W_h h(t-1) + W_x x(t), with h_1(t) the response to a single input x(1) at t = 1
• The length of the response vector to a single input at t = 1 is |h_1(t)|
• We can write W_h = U Λ U^-1, so that W_h u_i = λ_i u_i
  – For any vector h we can write h = a_1 u_1 + a_2 u_2 + ... + a_n u_n
  – Then W_h h = a_1 λ_1 u_1 + a_2 λ_2 u_2 + ... + a_n λ_n u_n, and W_h^t h = a_1 λ_1^t u_1 + a_2 λ_2^t u_2 + ... + a_n λ_n^t u_n
  – lim_{t→∞} W_h^t h = a_m λ_m^t u_m, where m = argmax_j |λ_j|
• If |λ_max| > 1 the response will blow up; otherwise it will contract and shrink to 0 rapidly
• What about at middling values of t? That will depend on the other eigenvalues
• For any input, for large t the length of the hidden vector will expand or contract according to the t-th power of the largest eigenvalue of the hidden-layer weight matrix
  – Unless the input has no component along the eigenvector corresponding to the largest eigenvalue; in that case it will grow according to the second largest eigenvalue, and so on
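
A minimal NumPy sketch of this point (the 4-dimensional state, the random matrices, and the rescaling helper are illustrative assumptions, not from the slides): the norm of h(t) eventually grows or shrinks like the t-th power of the largest-magnitude eigenvalue.

    import numpy as np

    rng = np.random.default_rng(0)

    def random_matrix_with_spectral_radius(n, rho):
        # Illustrative helper: rescale a random matrix so that its
        # largest-magnitude eigenvalue equals rho.
        W = rng.standard_normal((n, n))
        return W * (rho / np.max(np.abs(np.linalg.eigvals(W))))

    h1 = np.ones(4)                       # response to a single input at t = 1
    for rho in (0.9, 1.0, 1.1):
        W_h = random_matrix_with_spectral_radius(4, rho)
        h = h1.copy()
        for _ in range(100):              # h(t) = W_h h(t-1), no further input
            h = W_h @ h
        # shrinks toward 0 for rho < 1, stays bounded for rho = 1, blows up for rho > 1
        print(rho, np.linalg.norm(h))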

  2. Linear recursions
• Vector linear recursion: h(t) = W_h h(t-1) + W_x x(t), with h_1(t) the response to a single input at t = 1
• Response to a single input [1 1 1 1] at t = 1, for λ_max = 0.9, 1, and 1.1
  (plots of the response magnitude over time for each case)

  3. Linear recursions
• Vector linear recursion: h(t) = W_h h(t-1) + W_x x(t), with h_1(t) the response to a single input at t = 1
• Response to a single input [1 1 1 1] at t = 1, for λ_max = 0.9, 1, and 1.1, with second eigenvalue λ_2nd = 0.5 or 0.1, and with complex eigenvalues
  (plots of the response magnitude over time for each case)

  4. Lesson..
• In linear systems, long-term behavior depends entirely on the eigenvalues of the hidden-layer weight matrix
  – If the largest eigenvalue is greater than 1, the system will "blow up"
  – If it is less than 1, the response will "vanish" very quickly
  – Complex eigenvalues cause an oscillatory response
    • Which we may or may not want
  – For smooth behavior, we must force the weight matrix to have real eigenvalues
    • E.g., a symmetric weight matrix

  5. How about non-linearities (scalar)
• The behavior of scalar non-linearities: h(t) = f(w_h h(t-1) + w_x x(t))
• Left: Sigmoid, Middle: Tanh, Right: ReLU (plots of the response over time)
  – Sigmoid: saturates in a limited number of steps, regardless of w_h
  – Tanh: sensitive to w_h, but eventually saturates
    • "Prefers" weights close to 1.0
  – ReLU: sensitive to w_h, can blow up

  6. How about non-linearities (scalar)
• With a negative start: h(t) = f(w_h h(t-1) + w_x x(t))
• Left: Sigmoid, Middle: Tanh, Right: ReLU (plots of the response over time)
  – Sigmoid: saturates in a limited number of steps, regardless of w_h
  – Tanh: sensitive to w_h, but eventually saturates
  – ReLU: for negative starts, has no response

  7. Vector process
• Assuming a uniform unit vector initialization [1, 1, 1, ...]/√N: h(t) = f(W_h h(t-1) + W_x x(t))
  – Behavior is similar to the scalar recursion
  – Interestingly, ReLU is more prone to blowing up (why?)
  – Eigenvalues less than 1.0 retain the most "memory"
• (Plots for sigmoid, tanh, and ReLU)
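
A small sketch of the scalar recursion above (the specific weights, starting value, and step count are arbitrary choices for illustration):

    import numpy as np

    def run(f, w, h0=1.0, steps=50):
        # Scalar recursion h(t) = f(w * h(t-1)), with the single input
        # folded into the initial value h0 and no further input after that.
        h = h0
        for _ in range(steps):
            h = f(w * h)
        return h

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    relu = lambda z: max(z, 0.0)

    for w in (0.5, 1.0, 2.0):
        print(w, run(sigmoid, w), run(np.tanh, w), run(relu, w))
    # sigmoid settles to a fixed point regardless of w; tanh is sensitive to w
    # but eventually saturates; relu either dies out (w < 1) or blows up (w > 1)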

  8. Stability analysis
• Formal stability analysis considers convergence of "Lyapunov" functions
  – Alternately, Routh's criterion and/or pole-zero analysis
  – Positive definite functions evaluated at h
  – Conclusions are similar: only the tanh activation gives us any reasonable behavior
    • And it still has very short "memory"
• Lessons:
  – Bipolar activations (e.g. tanh) have the best memory behavior
  – Still sensitive to the eigenvalues of W_h
  – Best-case memory is short
  – Exponential memory behavior: the network "forgets" in an exponential manner

  9. Story so far
• Recurrent networks retain information from the infinite past in principle
• In practice, they tend to blow up or forget
  – If the largest eigenvalue of the recurrent weight matrix is greater than 1, the network response may blow up
  – If it is less than one, the response dies down very quickly
• The "memory" of the network also depends on the activation of the hidden units
  – Sigmoid activations saturate and the network becomes unable to retain new information
  – ReLUs blow up
  – Tanh activations are the most effective at storing memory
    • But still, not for very long

  10. RNNs..
• Excellent models for time-series analysis tasks
  – Time-series prediction
  – Time-series classification
  – Sequence prediction..
  – They can even simplify problems that are difficult for MLPs
• But the memory isn't all that great..
  – Also..

  11. The vanishing gradient problem
• A particular problem with training deep networks..
  – (Any deep network, not just recurrent nets)
  – The gradient of the error with respect to the weights is unstable..

  12. Reminder: Training deep networks
• Output = Y^[N], where
  Y^[N] = f(z^[N]) = f(W^[N] Y^[N-1]) = f(W^[N] f(W^[N-1] Y^[N-2])) = ... = f(W^[N] f(W^[N-1] ... f(W^[2] f(W^[1] X))))
• For convenience, we use the same activation function for all layers. However, output-layer neurons most commonly do not need an activation function (they produce class scores or real-valued targets).

  13. Reminder: Training deep networks
• For Loss(X) = D( f^[N]( W^[N] f^[N-1]( W^[N-1] f^[N-2]( ... W^[1] X ) ) ) )
• We get: ∇_{Y^[k]} Loss = ∇_{Y^[N]} Loss · ∇f^[N] · W^[N] · ∇f^[N-1] · W^[N-1] ... ∇f^[k+1] · W^[k+1]
• Where
  – ∇_{Y^[k]} Loss is the gradient of the error w.r.t. the output of the k-th layer of the network
    • Needed to compute the gradient of the error w.r.t. W^[k]
  – ∇f^[k] is the Jacobian of f^[k] w.r.t. its current input
  – All terms in the chain are matrices

  14. Reminder: Gradient problems in deep networks
• ∇_{Y^[k]} Loss = ∇_{Y^[N]} Loss · ∇f^[N] · W^[N] · ∇f^[N-1] · W^[N-1] ... ∇f^[k+1] · W^[k+1]
• The gradients in the lower/earlier layers can explode or vanish
  – Resulting in insignificant or unstable gradient descent updates
  – The problem gets worse as network depth increases

  15. Reminder: Training deep networks
• ∇_{Y^[k]} Loss = ∇_{Y^[N]} Loss · ∇f^[N] · W^[N] · ∇f^[N-1] · W^[N-1] ... ∇f^[k+1] · W^[k+1]
• As we go back in layers, the Jacobians of the activations constantly shrink the derivative
  – After a few layers the derivative of the loss at any time is totally "forgotten"

  16. The Jacobian of the hidden layers for an RNN
• With h_i(t) = f(z_i(t)), the Jacobian is
  ∇f(z(t)) = diag( f'(z_1(t)), f'(z_2(t)), ..., f'(z_N(t)) )
  i.e., a matrix with f'(z_i(t)) on the diagonal and zeros elsewhere
• ∇f() is the derivative of the output of the (layer of) hidden recurrent neurons with respect to their input
  – For vector activations: a full matrix
  – For scalar (elementwise) activations: a matrix whose diagonal entries are the derivatives of the activation of the recurrent hidden layer

  17. The Jacobian
• ∇f(z(t)) = diag( f'(z_1(t)), ..., f'(z_N(t)) ), with h_i(t) = f(z_i(t))
• The derivative (or subgradient) of the activation function is always bounded
  – The diagonal entries (and hence the singular values) of the Jacobian are bounded
• There is a limit on how much multiplying a vector by the Jacobian will scale it

  18. The derivative of the hidden state activation
• The most common activation functions, such as sigmoid, tanh() and ReLU, have derivatives that are never greater than 1
• The most common activation for the hidden units in an RNN is tanh()
  – The derivative of tanh() is never greater than 1 (and mostly less than 1)
• Multiplication by the Jacobian is always a shrinking operation
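
A tiny NumPy illustration of the claim (the particular values are arbitrary): for an elementwise tanh activation the Jacobian is diagonal with entries 1 − tanh(z_i)^2 ≤ 1, so applying it never increases the norm of a vector.

    import numpy as np

    def tanh_jacobian(z):
        # Elementwise activation => diagonal Jacobian; for tanh every
        # diagonal entry 1 - tanh(z_i)^2 lies in (0, 1].
        return np.diag(1.0 - np.tanh(z) ** 2)

    z = np.array([-2.0, 0.0, 3.0])
    g = np.array([1.0, 1.0, 1.0])
    J = tanh_jacobian(z)
    print(np.linalg.norm(J @ g), "<=", np.linalg.norm(g))   # shrinks (or at best preserves) the norm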

  19. What about the weights
• For the unrolled RNN: ∇_{h(t)} Loss = ∇_{h(T)} Loss · ∇f(T) · W · ∇f(T-1) · W ... ∇f(t+1) · W
• In a single-layer RNN, the weight matrices in the chain are identical
  – The conclusion below holds for any deep network, though
• The chain product for ∇_{h(t)} Loss will
  – Expand ∇_{h(T)} Loss along directions in which the singular values of the weight matrices are greater than 1
  – Shrink ∇_{h(T)} Loss in directions where the singular values are less than 1
  – Repeated multiplication by the weight matrix will result in exploding or vanishing gradients
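
A quick sketch of why repeating the *same* weight matrix is the dangerous part (the random matrix and sizes are illustrative; the bounded activation Jacobian, omitted here, could only shrink the result further):

    import numpy as np

    rng = np.random.default_rng(1)
    n, steps = 16, 100

    for rho in (0.9, 1.1):
        # Rescale one fixed random matrix so its largest-magnitude eigenvalue
        # is rho, then repeatedly multiply a gradient-like vector by W^T, as
        # the backward chain through an unrolled single-layer RNN does.
        W = rng.standard_normal((n, n))
        W *= rho / np.max(np.abs(np.linalg.eigvals(W)))
        g = rng.standard_normal(n)
        for _ in range(steps):
            g = W.T @ g
        print(rho, np.linalg.norm(g))   # ~rho^steps: vanishes for 0.9, explodes for 1.1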

  20. Exploding/Vanishing gradients
• ∇_{Y^[k]} Loss = ∇_{Y^[N]} Loss · ∇f^[N] · W^[N] · ∇f^[N-1] · W^[N-1] ... ∇f^[k+1] · W^[k+1]
• Every term in the chain is a matrix
• ∇_{Y^[N]} Loss is proportional to the actual loss
  – Particularly for L2 and KL-divergence losses
• The chain product for ∇_{Y^[k]} Loss will
  – Expand it in directions where each stage has singular values greater than 1
  – Shrink it in directions where each stage has singular values less than 1

  21. Training RNN

  22. Training RNN
• h_t = f( W_hh h_{t-1} + W_xh x_t )
• ∂h_t/∂h_{t-1} = W_hh^T diag( f'(z_t) )
• ‖∂h_t/∂h_{t-1}‖ ≤ ‖W_hh^T‖ ‖diag( f'(z_t) )‖ ≤ γ_W γ_f
• ∂h_t/∂h_k = ∏_{j=k+1}^{t} ∂h_j/∂h_{j-1} = ∏_{j=k+1}^{t} W_hh^T diag( f'(z_j) ), so ‖∂h_t/∂h_k‖ ≤ (γ_W γ_f)^(t-k)
• This can become very small or very large quickly (vanishing/exploding gradients) [Bengio et al. 1994]
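
A minimal numerical version of this bound for a tanh RNN with no input after the first step (the weights, sizes, and scales are made up; the point is only that ∂h_T/∂h_1 shrinks geometrically when γ_W γ_f < 1):

    import numpy as np

    rng = np.random.default_rng(2)
    n, T = 8, 50

    def hidden_jacobian_norm(w_scale):
        W_hh = w_scale * rng.standard_normal((n, n)) / np.sqrt(n)
        h = np.tanh(rng.standard_normal(n))
        J = np.eye(n)                                # accumulates d h_t / d h_1
        for _ in range(T):
            h = np.tanh(W_hh @ h)                    # zero input after t = 1
            J = np.diag(1.0 - h ** 2) @ W_hh @ J     # one chain-rule factor per step
        return np.linalg.norm(J)

    for s in (0.25, 0.5, 1.0):
        print(s, hidden_jacobian_norm(s))            # small recurrent weights make it vanish fast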

  23. Recurrent nets are very deep nets
• (Unrolled network from h_f(0) and X(1) through to Y(T))
• The relation between X(1) and Y(T) is that of a very deep network
  – Gradients from errors at t = T will vanish by the time they are propagated back to t = 1

  24. Training RNNs is hard
• The unrolled network can be very deep, and inputs from many time steps ago can modify the output
  – The unrolled network is very deep
• The same matrix is multiplied in at each time step during the forward pass

  25. The vanishing gradient problem: Example
• In the case of language modeling, words from time steps far away are not taken into consideration when training to predict the next word
• Example: Jane walked into the room. John walked in too. It was late in the day. Jane said hi to ____
(This slide has been adopted from Socher's lectures, cs224d, Stanford, 2017)

  26. The long-term dependency problem
• The network must know to "remember" for extended periods of time and "recall" when necessary
  – This can be performed with a multi-tap recursion, but how many taps?
  – We need an alternate way to "remember" stuff

  27. Story so far
• Recurrent networks retain information from the infinite past in principle
• In practice, they are poor at memorization
  – The hidden outputs can blow up, or shrink to zero, depending on the eigenvalues of the recurrent weight matrix
  – The memory is also a function of the activation of the hidden units
    • Tanh activations are the most effective at retaining memory, but even they don't hold it very long
• Deep networks also suffer from a "vanishing or exploding gradient" problem
  – The gradient of the error at the output gets concentrated into a small number of parameters in the earlier layers, and goes to zero for others

  28. Vanilla RNN Gradient Flow

  29. Vanilla RNN Gradient Flow

  30. Vanilla RNN Gradient Flow
• Computing the gradient of h0 involves many factors of W (and repeated tanh)
  – Largest singular value > 1: exploding gradients
  – Largest singular value < 1: vanishing gradients

  31. Trick for exploding gradients: the clipping trick
• The solution, first introduced by Mikolov, is to clip gradients to a maximum value
• Makes a big difference in RNNs
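
A minimal sketch of clipping by global norm (the threshold of 5 is an arbitrary illustrative value, not from the slide):

    import numpy as np

    def clip_by_norm(grad, max_norm=5.0):
        # Rescale the whole gradient vector if its L2 norm exceeds max_norm;
        # the direction is preserved, only the step size is limited.
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        return grad

    g = np.full(100, 3.0)                        # an exploding gradient, norm = 30
    print(np.linalg.norm(clip_by_norm(g)))       # 5.0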

  32. Gradient clipping intuition
• Error surface of a single-hidden-unit RNN
  – High-curvature walls
• Solid lines: standard gradient descent trajectories
• Dashed lines: gradients rescaled to a fixed size

  33. Vanilla RNN Gradient Flow
• Computing the gradient of h0 involves many factors of W (and repeated tanh)
  – Largest singular value > 1: exploding gradients → gradient clipping: scale the gradient if its norm is too big
  – Largest singular value < 1: vanishing gradients → change the RNN architecture

  34. For vanishing gradients: Initialization + ReLUs!
• Initialize the Ws to the identity matrix I and use ReLU activations
• New experiments with recurrent neural nets: Le et al., "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units", 2015
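
A sketch of that initialization for one recurrent layer (shapes and the small input-weight scale are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(3)
    n_hidden, n_input = 128, 64

    W_hh = np.eye(n_hidden)                                  # recurrent weights start as identity
    W_xh = 0.01 * rng.standard_normal((n_hidden, n_input))   # small random input weights
    b = np.zeros(n_hidden)

    def irnn_step(h_prev, x):
        # ReLU recurrence; with W_hh = I the hidden state initially just
        # accumulates (non-negative) evidence, so early in training the
        # recurrent Jacobian has singular values close to 1.
        return np.maximum(0.0, W_hh @ h_prev + W_xh @ x + b)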

  35. Better units for recurrent models
• More complex hidden unit computation in the recurrence!
  – h_t = LSTM(x_t, h_{t-1})
  – h_t = GRU(x_t, h_{t-1})
• Main ideas:
  – Keep around memories to capture long-distance dependencies
  – Allow error messages to flow at different strengths depending on the inputs

  36. And now we enter the domain of..

  37. Exploding/Vanishing gradients
• Can we replace this with something that doesn't fade or blow up?
• Can we have a network that just "remembers" arbitrarily long, to be recalled on demand?
  – Not directly dependent on the vagaries of network parameters, but rather on an input-based determination of whether something must be remembered
  – Replace them, e.g., by a function of the input that decides if things must be forgotten or not

  38. Enter the LSTM
• Long Short-Term Memory
• Explicitly latch information to prevent decay / blowup
• The following notes borrow liberally from
  http://colah.github.io/posts/2015-08-Understanding-LSTMs/

  39. Standard RNN
• Recurrent neurons receive past recurrent outputs and the current input as inputs
• These are processed through a tanh() activation function
  – As mentioned earlier, tanh() is the generally used activation for the hidden layer
• The current recurrent output is passed to the next higher layer and to the next time instant

  40. Some visualization

  41. Long Short-Term Memory
• The σ() are multiplicative gates that decide if something is important or not
• Remember, every line actually represents a vector

  42. LSTM: Constant Error Carousel
• Key component: a remembered cell state

  43. LSTM: CEC
• C_t is the linear history
• It carries information through, affected only by a gate
  – And by the addition of history, which too is gated..

  44. LSTM: Gates
• Gates are simple sigmoidal units with outputs in the range (0,1)
• They control how much of the information is to be let through

  45. LSTM: Forget gate
• The first gate determines whether to carry over the history or to forget it
  – More precisely, how much of the history to carry over
  – Also called the "forget" gate
  – Note, we're actually distinguishing between the cell memory C and the state h that is carried over time; they're related, though

  46. LSTM: Input gate
• The second input has two parts
  – A perceptron layer that determines if there's something new and interesting in the input
  – A gate that decides if it's worth remembering
  – If so, it is added to the current memory cell

  47. LSTM: Memory cell update
• The second input has two parts
  – A perceptron layer that determines if there's something interesting in the input
  – A gate that decides if it's worth remembering
  – If so, it is added to the current memory cell

  48. LSTM: Output and output gate
• The output of the cell
  – Simply compress it with tanh to make it lie between -1 and 1
    • Note that this compression no longer affects our ability to carry memory forward
  – Controlled by an output gate
    • To decide if the memory contents are worth reporting at this time

  49. Long Short-Term Memories (LSTMs)
• Input gate (how much the current cell matters): i_t = σ( W_i [h_{t-1}; x_t] + b_i )
• Forget gate (0 means: forget the past): f_t = σ( W_f [h_{t-1}; x_t] + b_f )
• Output gate (how much the cell is exposed): o_t = σ( W_o [h_{t-1}; x_t] + b_o )
• New memory cell: c̃_t = tanh( W_c [h_{t-1}; x_t] + b_c )
• Final memory cell: c_t = i_t ∘ c̃_t + f_t ∘ c_{t-1}
• Final hidden state: h_t = o_t ∘ tanh( c_t )
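
A NumPy sketch of one step of these equations (the parameter dictionary, names, and sizes are my own choices; each W acts on the concatenation [h_{t-1}; x_t]):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, p):
        v = np.concatenate([h_prev, x])
        i = sigmoid(p["W_i"] @ v + p["b_i"])           # input gate
        f = sigmoid(p["W_f"] @ v + p["b_f"])           # forget gate
        o = sigmoid(p["W_o"] @ v + p["b_o"])           # output gate
        c_tilde = np.tanh(p["W_c"] @ v + p["b_c"])     # new memory candidate
        c = i * c_tilde + f * c_prev                   # final memory cell
        h = o * np.tanh(c)                             # final hidden state
        return h, c

    # toy usage with made-up sizes
    n_h, n_x = 4, 3
    rng = np.random.default_rng(4)
    p = {f"W_{k}": 0.1 * rng.standard_normal((n_h, n_h + n_x)) for k in "ifoc"}
    p.update({f"b_{k}": np.zeros(n_h) for k in "ifoc"})
    h, c = lstm_step(rng.standard_normal(n_x), np.zeros(n_h), np.zeros(n_h), p)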

  50. LSTM Equations
• i_t : input gate, how much of the new information will be let through to the memory cell
• f_t : forget gate, responsible for how much information should be thrown away from the memory cell
• o_t : output gate, how much of the information will be exposed to the next time step
• c̃_t : candidate memory (the self-recurrent part, equivalent to a standard RNN)
• c_t : internal memory of the memory cell
• h_t : hidden state
• y : output

  51. LSTM Gates
• Gates are ways to let information through (or not):
  – Forget gate: look at the previous cell state and the current input, and decide which information to throw away
  – Input gate: decide which information in the current state we want to update
  – Output gate: filter the cell state and output the filtered result
  – Gate (or update) values: propose new values for the cell state
• For instance: store the gender of the subject until another subject is seen

  52. LSTM: The "Peephole" Connection
• The raw memory is informative by itself and can also be used as an input
  – Note, we're using both C and h

  53. Backpropagation rules: Forward
• (Diagram of the LSTM cell forward pass: C_{t-1} → C_t through the forget and input gates; h_{t-1} and x_t feed the σ() gates f_t, i_t, o_t and the tanh candidate c̃_t; h_t = o_t ∘ tanh(C_t))
• Forward rules: see the cell-forward pseudocode on the next slide

  54. LSTM cell forward
# Continuing from the previous slide
# Note: [W,b] is the set of parameters (the individual elements are shown
# in red within the code on the slide). These are passed in.
# Static local variables which aren't required outside this cell:
static local z_f, z_i, z_c, z_o, f, i, o, C_i
function [C_o, h_o] = LSTM_cell.forward(C, h, x, [W,b])
    z_f = W_fc C + W_fh h + W_fx x + b_f
    f = sigmoid(z_f)          # forget gate
    z_i = W_ic C + W_ih h + W_ix x + b_i
    i = sigmoid(z_i)          # input gate
    z_c = W_cc C + W_ch h + W_cx x + b_c
    C_i = tanh(z_c)           # detecting input pattern
    C_o = f ∘ C + i ∘ C_i     # "∘" is component-wise multiplication
    z_o = W_oc C_o + W_oh h + W_ox x + b_o
    o = sigmoid(z_o)          # output gate
    h_o = o ∘ tanh(C_o)       # "∘" is component-wise multiplication
    return C_o, h_o

  55. LSTM network forward
# Assuming h(0,*) is known and C(0,*) = 0
# Assuming L hidden-state layers and an output layer
# Note: LSTM_cell is an indexed class with functions
# [W{l}, b{l}] are the entire set of weights and biases for the l-th hidden layer
# W_o and b_o are the output layer weights and biases
for t = 1:T                    # including both ends of the index
    h(t,0) = x(t)              # vectors; initialize "hidden layer 0" to the input
    for l = 1:L                # hidden layers operate at time t
        [C(t,l), h(t,l)] = LSTM_cell(t,l).forward(C(t-1,l), h(t-1,l), h(t,l-1), [W{l}, b{l}])
    z_o(t) = W_o h(t,L) + b_o
    Y(t) = softmax(z_o(t))

  56. Long Short Term Memory (LSTM)
• g in the previous slides was called c̃

  57. Long Short Term Memory (LSTM) [Hochreiter et al., 1997]

  58. Long Short Term Memory (LSTM) [Hochreiter et al., 1997]

  59. Long Short Term Memory (LSTM) [Hochreiter et al., 1997]

  60. Gated Recurrent Units (GRUs): let's simplify the LSTM
• Don't bother to separately maintain compressed and regular memories
  – Pointless computation!
• But compress it before using it to decide on the usefulness of the current input!

  61. GRUs
• Gated Recurrent Units (GRU), introduced by Cho et al. 2014
• Update gate: z_t = σ( W_z [h_{t-1}; x_t] + b_z )
• Reset gate: r_t = σ( W_r [h_{t-1}; x_t] + b_r )
• Memory: h̃_t = tanh( W_h [r_t ∘ h_{t-1}; x_t] + b_h )
• Final memory: h_t = z_t ∘ h_{t-1} + (1 − z_t) ∘ h̃_t
• If the reset gate unit is ~0, then this ignores the previous memory and only stores the new input
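
And a matching NumPy sketch of one GRU step (again, the parameter names and concatenation layout are my own choices):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x, h_prev, p):
        v = np.concatenate([h_prev, x])
        z = sigmoid(p["W_z"] @ v + p["b_z"])                 # update gate
        r = sigmoid(p["W_r"] @ v + p["b_r"])                 # reset gate
        v_reset = np.concatenate([r * h_prev, x])
        h_tilde = np.tanh(p["W_m"] @ v_reset + p["b_m"])     # candidate memory
        return z * h_prev + (1.0 - z) * h_tilde              # final memory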

  62. GRU intuition
• Units with long-term dependencies have active update gates z
• Illustration:
(This slide has been adopted from Socher's lectures, cs224d, Stanford, 2017)

  63. GRU intuition
• If reset is close to 0, ignore the previous hidden state
  – This allows the model to drop information that is irrelevant in the future
• The update gate z controls how much of the past state should matter now
  – If z is close to 1, then we can copy information in that unit through many time steps: less vanishing gradient!
• Units with short-term dependencies often have very active reset gates
(This slide has been adopted from Socher's lectures, cs224d, Stanford, 2017)

  64. Other RNN Variants

  65. Which of these variants is best?
• Do the differences matter?
  – Greff et al. (2015) perform a comparison of popular variants, finding that they're all about the same
  – Jozefowicz et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks

  66. LSTM Achievements
• LSTMs have essentially replaced n-grams as language models for speech
• Image captioning and other multi-modal tasks, which were very difficult with previous methods, are now feasible
• Many traditional NLP tasks work very well with LSTMs, but they are not necessarily the top performers: e.g., POS tagging and NER (Choi 2016)
• Neural MT: has broken away from the plateau of SMT, especially for grammaticality (partly because of characters/subwords), but is not yet industry strength
[Ann Copestake, Overview of LSTMs and word2vec, 2016. https://arxiv.org/ftp/arxiv/papers/1611/1611.00068.pdf]

  67. Multi-layer RNN

  68. Multi-layer LSTM architecture
• (Diagram: inputs X(t) at the bottom, outputs Y(t) at the top, unrolled over time)
• Each green box is now an entire LSTM or GRU unit
• Also keep in mind that each box is an array of units

  69. Extensions to the RNN: Bidirectional RNN
• Proposed by Schuster and Paliwal, 1997
• An RNN with both forward and backward recursion
  – Explicitly models the fact that just as the future can be predicted from the past, the past can be deduced from the future

  70. Bidirectional RNN
• (Diagram: a forward chain h_f and a backward chain h_b over X(1)..X(T), producing Y(1)..Y(T))
• A forward net processes the data from t = 1 to t = T
• A backward net processes it backward, from t = T down to t = 1

  71. Bidirectional RNN: Processing an input string
• The forward net processes the data from t = 1 to t = T
  – Initially, only the hidden states are computed
• The backward net processes it backward, from t = T down to t = 0

  72. Bidirectional RNN: Processing an input string
• The backward net processes the input data in reverse time, from the end to the beginning
  – Initially, only the hidden state values are computed
• Clearly, this is not an online process and requires the entire input data
  – Note: this is not the backward pass of backprop; the net simply processes the input backward, from t = T down to t = 0

  73. Bidirectional RNN: Processing an input string
• The computed states of both networks are used to compute the final output at each time
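
A compact sketch of this processing order (a plain tanh RNN is used for both directions; all shapes, weights, and the output layer are illustrative):

    import numpy as np

    def run_rnn(xs, W_h, W_x, reverse=False):
        # Process the whole sequence in one direction, returning one hidden
        # state per time step, aligned with the original time order.
        h = np.zeros(W_h.shape[0])
        hs = []
        for x in (reversed(xs) if reverse else xs):
            h = np.tanh(W_h @ h + W_x @ x)
            hs.append(h)
        return hs[::-1] if reverse else hs

    def birnn_outputs(xs, fwd_params, bwd_params, W_y):
        h_f = run_rnn(xs, *fwd_params)                  # t = 1 .. T
        h_b = run_rnn(xs, *bwd_params, reverse=True)    # t = T .. 1
        # The output at each time sees both the past (h_f) and the future (h_b)
        return [W_y @ np.concatenate([hf, hb]) for hf, hb in zip(h_f, h_b)]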

  74. Backpropagation in BRNNs
• h_t = [h_f(t), h_b(t)] represents both the past and the future
• Forward pass: compute both the forward and backward networks and the final output

  75. Backpropagation in BRNNs
• Backward pass: define a divergence (loss) between the desired outputs d_1..d_T and the actual outputs Y(1)..Y(T)
• Separately perform backpropagation on both nets
  – From t = T down to t = 0 for the forward net
  – From t = 0 up to t = T for the backward net
