Recurrent Neural Networks: Stability Analysis and LSTMs



  1. Recurrent Neural Networks: Stability Analysis and LSTMs. M. Soleymani, Sharif University of Technology, Spring 2020. Most slides have been adapted from Bhiksha Raj, 11-785, CMU 2019, and some from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017.

  2. Story so far. [Figure: a time-delay network over stock vectors X(t)..X(t+7) predicting Y(t+6)] • Iterated structures are good for analyzing time-series data with short-time dependence on the past – These are "Time-Delay Neural Nets" (TDNNs), a.k.a. convnets

  3. Story so far. [Figure: an RNN unrolled over time, with inputs X(t), outputs Y(t), and initial state h_{-1} at t=0] • Recurrent structures are good for analyzing time-series data with long-term dependence on the past – These are recurrent neural networks

  4. Recurrent structures can do what static structures cannot. [Figure: an MLP mapping two binary numbers to their sum] • The addition problem: add two N-bit numbers to produce an (N+1)-bit number – Input is binary – Requires a large number of training instances • The output must be specified for every pair of inputs • Weights that generalize will make errors – A network trained for N-bit numbers will not work for (N+1)-bit numbers

  5. MLPs vs RNNs. [Figure: an RNN unit taking the current pair of bits and the previous carry, and producing the sum bit and the carry out] • The addition problem: add two N-bit numbers to produce an (N+1)-bit number • RNN solution: very simple, and it can add two numbers of any size (see the sketch below)
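
A minimal sketch of the idea on slide 5 (the function and variable names are my own, and the slide's actual RNN unit is learned rather than hand-coded): the only state the recurrent unit has to carry from step to step is the carry bit, which is why the same unit adds numbers of any length.

```python
# Illustrative recurrent view of binary addition: the hidden state is the carry bit.
def add_bits_serially(a_bits, b_bits):
    """Add two binary numbers, given least-significant-bit first, one bit pair per time step."""
    carry = 0                      # recurrent state entering the first step
    out = []
    for a, b in zip(a_bits, b_bits):
        s = a + b + carry          # current inputs plus previous state
        out.append(s % 2)          # output bit at this time step
        carry = s // 2             # updated state passed to the next step
    out.append(carry)              # the final carry is the (N+1)-th output bit
    return out

# 6 (110) + 7 (111) = 13 (1101), bits listed least-significant first
print(add_bits_serially([0, 1, 1], [1, 1, 1]))   # -> [1, 0, 1, 1]
```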

  6. MLP: the parity problem. [Figure: an MLP reading the whole bit string at once] • Is the number of "ones" even or odd? • The network must be complex to capture all patterns – Essentially an XOR (parity) network over all inputs, quite complex – Fixed input size

  7. RNN: the parity problem. [Figure: an RNN unit combining the current bit with its previous output] • Trivial solution (see the sketch below) • Generalizes to input of any size
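
The "trivial solution" on slide 7 is just a running XOR of the previous output with each incoming bit; a small illustrative sketch (function name and test input are my own, not from the slides):

```python
def parity(bits):
    """Recurrent solution to the parity problem: the state is a single bit."""
    state = 0                 # initial hidden state / previous output
    for b in bits:
        state = state ^ b     # XOR the previous output with the new input bit
    return state              # 1 if the number of ones is odd, else 0

print(parity([1, 0, 0, 0, 1, 1, 0, 0, 1, 0]))   # four ones -> 0 (even)
```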

  8. Story so far. [Figure: an unrolled RNN with inputs X(t), outputs Y(t), desired outputs Y_desired(t), initial state h_{-1}, and a loss on the output sequence] • Recurrent structures can be trained by minimizing the loss between the sequence of outputs and the sequence of desired outputs – Through gradient descent and backpropagation

  9. Back Propagation Through Time. [Figure: the unrolled network with inputs X(1)..X(T), outputs Y(1)..Y(T), initial state h_0, and a sequence loss Loss(Y_target(1..T), Y(1..T))] • The loss is computed between the sequence of outputs produced by the network and the desired sequence of outputs • This is not just the sum of the divergences at individual times – unless we explicitly define it that way

  10. Time-synchronous recurrence. [Figure: unrolled RNN with inputs X(t), outputs Y(t), targets Y_target(t), initial state h_0] • Usual assumption: the sequence divergence is the sum of the divergences at the individual instants: Loss(Y_target(1..T), Y(1..T)) = Σ_t Loss(Y_target(t), Y(t)) – so the gradient with respect to an individual output is ∇_{Y(t)} Loss(Y_target(1..T), Y(1..T)) = ∇_{Y(t)} Loss(Y_target(t), Y(t)) (a BPTT sketch under this assumption follows below)
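
Under the time-synchronous assumption the loss decomposes over time, and backpropagation through time accumulates gradients backwards across the unrolled steps. Below is a compact numpy sketch of BPTT for a single-layer tanh RNN whose output is its hidden state and whose per-step divergence is squared error; the weight names W_h, W_x and these modeling choices are illustrative assumptions, not the slides' exact setup.

```python
import numpy as np

def bptt(W_h, W_x, xs, y_targets, h_init):
    """BPTT for h_t = tanh(W_h h_{t-1} + W_x x_t),
    with Loss = sum_t 0.5 * ||h_t - y_target(t)||^2 (the output is taken to be h_t)."""
    T = len(xs)
    hs = [h_init]
    for t in range(T):                                   # forward pass through time
        hs.append(np.tanh(W_h @ hs[-1] + W_x @ xs[t]))
    dW_h, dW_x = np.zeros_like(W_h), np.zeros_like(W_x)
    dh_next = np.zeros_like(h_init)
    for t in reversed(range(T)):                         # backward pass through time
        dh = (hs[t + 1] - y_targets[t]) + dh_next        # per-step loss gradient + gradient from t+1
        dz = dh * (1.0 - hs[t + 1] ** 2)                 # backprop through the tanh nonlinearity
        dW_h += np.outer(dz, hs[t])                      # z_t = W_h h_{t-1} + W_x x_t
        dW_x += np.outer(dz, xs[t])
        dh_next = W_h.T @ dz                             # gradient passed back to h_{t-1}
    return dW_h, dW_x
```

Because the sequence loss is a sum of per-instant terms, the gradient reaching the output at time t directly from the loss is just the gradient of that instant's term, exactly as in the equation above.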

  11. Long-term behavior of RNNs • In linear systems, the long-term behavior depends entirely on the eigenvalues of the hidden-layer weight matrix – If the largest eigenvalue is greater than 1, the system will "blow up" – If it is less than 1, the response will "vanish" very quickly (a quick numerical check follows below)
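
This is easy to check numerically: scale a random recurrent weight matrix to a chosen spectral radius, iterate h ← W h, and watch the norm of h (an illustrative sketch; the matrix size and number of steps are arbitrary choices):

```python
import numpy as np

def norm_after_iterations(spectral_radius, steps=50, dim=8, seed=0):
    """Scale a random matrix so its largest |eigenvalue| equals spectral_radius, then iterate h <- W h."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((dim, dim))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    h = rng.standard_normal(dim)
    for _ in range(steps):
        h = W @ h
    return np.linalg.norm(h)

print(norm_after_iterations(1.1))   # "blows up": grows roughly like 1.1**50
print(norm_after_iterations(0.9))   # "vanishes": shrinks roughly like 0.9**50
```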

  12. "BIBO" Stability • "Bounded Input, Bounded Output" stability – This is a highly desirable characteristic

  13. "BIBO" Stability. [Figure: a time-delay network over inputs X(t+1)..X(t+7) producing Y(t+5)] • Returning to an old model: Y(t) = f(X(t − i), i = 1..K) • When will the output "blow up"? • Time-delay structures have bounded output if – The function f() has bounded output for bounded input (which is true of almost every activation function) – X(t) is bounded

  14. Is this BIBO? [Figure: the unrolled RNN with inputs X(t), outputs Y(t), initial state h_{-1} at t=0] • Will an RNN necessarily be BIBO?

  15. Is this BIBO? [Figure: the same unrolled RNN] • Will this necessarily be BIBO? – Guaranteed if the output and hidden activations are bounded • But will it saturate (and where)? (see the sketch below) – What if the activations are linear?
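
One way to see both points at once is to run the same recurrence with a linear activation and with tanh: with large weights the linear state explodes, while the tanh state can never leave (−1, 1) per coordinate, so it stays bounded but saturates (an illustrative sketch; the sizes and weight scale are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W_h = 1.5 * rng.standard_normal((5, 5))      # deliberately large recurrent weights
W_x = rng.standard_normal((5, 3))

def run(activation, steps=30):
    """Iterate h_t = activation(W_h h_{t-1} + W_x x_t) on random inputs and return ||h||."""
    h = np.zeros(5)
    for _ in range(steps):
        x = rng.standard_normal(3)
        h = activation(W_h @ h + W_x @ x)
    return np.linalg.norm(h)

print(run(lambda z: z))   # linear activations: the norm explodes
print(run(np.tanh))       # bounded activations: the norm stays small, units saturate near +/-1
```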

  16. Analyzing recurrence. [Figure: the unrolled RNN] • It is sufficient to analyze the behavior of the hidden layer h_t, since it carries the relevant information – We will assume only a single hidden layer, for simplicity

  17. Analyzing recursion

  18. The streetlight effect. [Figure: the unrolled RNN] • It is easier to analyze linear systems – We will attempt to extrapolate to non-linear systems subsequently • All activations are identity functions – z_t = W_h h_{t-1} + W_x x_t, h_t = z_t (a sketch of this linear recurrence follows below)
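
A direct transcription of this linearized recurrence (the weight names W_h, W_x are assumed, as elsewhere in these notes):

```python
import numpy as np

def run_linear_rnn(W_h, W_x, xs, h_init):
    """Iterate h_t = W_h h_{t-1} + W_x x_t (identity activations) and return all hidden states."""
    h = h_init
    states = []
    for x in xs:
        h = W_h @ h + W_x @ x   # z_t and h_t coincide because the activation is the identity
        states.append(h)
    return states
```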

  19. Linear systems • h_t = W_h h_{t-1} + W_x x_t – h_{t-1} = W_h h_{t-2} + W_x x_{t-1} • h_t = W_h^2 h_{t-2} + W_h W_x x_{t-1} + W_x x_t

  20. Linear systems • h_t = W_h h_{t-1} + W_x x_t – h_{t-1} = W_h h_{t-2} + W_x x_{t-1} • h_t = W_h^2 h_{t-2} + W_h W_x x_{t-1} + W_x x_t • Unrolling all the way back to the initial state: h_t = W_h^{t+1} h_{-1} + W_h^t W_x x_0 + W_h^{t-1} W_x x_1 + ⋯ + W_x x_t

  21. Linear systems • h_t = W_h h_{t-1} + W_x x_t – h_{t-1} = W_h h_{t-2} + W_x x_{t-1} • h_t = W_h^2 h_{t-2} + W_h W_x x_{t-1} + W_x x_t • h_t = W_h^{t+1} h_{-1} + W_h^t W_x x_0 + W_h^{t-1} W_x x_1 + ⋯ + W_x x_t • By linearity this is a weighted sum of unit responses: h_t = h_{-1} h_t(1_{-1}) + x_0 h_t(1_0) + x_1 h_t(1_1) + x_2 h_t(1_2) + ⋯ • where h_t(1_k) is the hidden response at time t when the input sequence is [0 0 … 1 … 0], with the 1 in the k-th position (and 1_{-1} denotes a unit initial state)

  22. The streetlight effect. [Figure: the unrolled RNN] • It is sufficient to analyze the response to a single input at t = 0 – Principle of superposition in linear systems: h_t = h_{-1} h_t(1_{-1}) + x_0 h_t(1_0) + x_1 h_t(1_1) + x_2 h_t(1_2) + ⋯ (a numerical check follows below)
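
Superposition can be verified directly: the response of the linear recurrence to a full input sequence equals the sum of its responses to the initial state alone and to each input presented alone (a numerical sketch; the scalar inputs and random matrices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, T = 4, 6
W_h = 0.7 * rng.standard_normal((dim, dim))
W_x = rng.standard_normal((dim, 1))

def final_state(xs, h_init):
    """Run h_t = W_h h_{t-1} + W_x x_t over the scalar inputs xs and return the last state."""
    h = h_init
    for x in xs:
        h = W_h @ h + W_x @ np.atleast_1d(x)
    return h

xs = rng.standard_normal(T)
h_init = rng.standard_normal(dim)

full = final_state(xs, h_init)
parts = final_state(np.zeros(T), h_init)         # response to the initial state alone
for k in range(T):
    impulse = np.zeros(T)
    impulse[k] = xs[k]                           # single input at time k, zero elsewhere
    parts = parts + final_state(impulse, np.zeros(dim))
print(np.allclose(full, parts))                  # True: superposition holds
```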

  23. Linear recursions • Consider a simple, scalar, linear recursion (note the change of notation) – h(t) = w_h h(t−1) + w_x x(t) – h_1(t) = w_h^{t−1} w_x x(1) • Response to a single input at t = 1. [Figure: plots of the response h_1(k) over time for different values of w_h] (a quick illustration follows below)
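
The scalar impulse response makes the three regimes obvious; a quick illustration (the parameter values are arbitrary):

```python
def scalar_impulse_response(w_h, w_x=1.0, x1=1.0, T=20):
    """h_1(t) = w_h**(t-1) * w_x * x(1): response of h(t) = w_h*h(t-1) + w_x*x(t)
    to a single input at t = 1."""
    return [w_h ** (t - 1) * w_x * x1 for t in range(1, T + 1)]

print(scalar_impulse_response(1.1)[-1])   # |w_h| > 1: the response grows without bound
print(scalar_impulse_response(0.9)[-1])   # |w_h| < 1: the response decays toward 0
print(scalar_impulse_response(1.0)[-1])   # w_h = 1: the response persists unchanged
```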

  24. Linear recursions: vector version • Vector linear recursion (note the change of notation) – h(t) = W_h h(t−1) + W_x x(t) – h_1(t) = W_h^{t−1} W_x x(1) • The length of the response vector to a single input at t = 1 is |h_1(t)| • We can write W_h = U Λ U^{−1} – W_h u_i = λ_i u_i – For any vector h we can write: • h = a_1 u_1 + a_2 u_2 + ⋯ + a_n u_n • W_h h = a_1 λ_1 u_1 + a_2 λ_2 u_2 + ⋯ + a_n λ_n u_n • W_h^t h = a_1 λ_1^t u_1 + a_2 λ_2^t u_2 + ⋯ + a_n λ_n^t u_n – lim_{t→∞} W_h^t h = a_m λ_m^t u_m, where m = argmax_j λ_j

  25. Linear recursions: vector version • Same derivation as slide 24, with the key observation highlighted: – For any input, for large t, the length of the hidden vector will expand or contract according to the t-th power of the largest eigenvalue of the hidden-layer weight matrix

  26. Linear recursions: vector version • Same derivation as slide 24, with two further notes: – For any input, for large t, the length of the hidden vector will expand or contract according to the t-th power of the largest eigenvalue of the hidden-layer weight matrix – Unless the input has no component along the eigenvector corresponding to the largest eigenvalue; in that case it will grow according to the second-largest eigenvalue, and so on (a numerical check follows below)
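
The dominant-eigenvalue behaviour can be checked by building W_h from a chosen eigendecomposition and iterating: the step-to-step growth ratio of ||W_h^t h|| converges to the largest eigenvalue, provided h has a component along the corresponding eigenvector (an illustrative sketch with arbitrary eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
U = rng.standard_normal((n, n))                       # random eigenvector basis (invertible w.p. 1)
lams = np.array([1.3, 0.9, 0.7, 0.5, -0.4, 0.2])      # chosen eigenvalues; the largest is 1.3
W_h = U @ np.diag(lams) @ np.linalg.inv(U)            # W_h = U Lambda U^{-1}

h = rng.standard_normal(n)
prev, cur = h, W_h @ h
for _ in range(60):                                   # iterate cur = W_h^t h
    prev, cur = cur, W_h @ cur
print(np.linalg.norm(cur) / np.linalg.norm(prev))     # ~1.3: growth is governed by the largest eigenvalue
```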
