

  1. RNNs for Timeseries Analysis www.bgoncalves.com github.com/bmtgoncalves/RNN

  2. Disclaimer The views and opinions expressed in this article are those of the authors and do not necessarily reflect the official policy or position of my employer. The examples provided with this tutorial were chosen for their didactic value and are not meant to be representative of my day-to-day work. @bgoncalves www.bgoncalves.com

  3. References @bgoncalves www.bgoncalves.com

  4. How the Brain “Works” (Cartoon version) @bgoncalves www.bgoncalves.com

  5. How the Brain “Works” (Cartoon version) @bgoncalves www.bgoncalves.com

  6. How the Brain “Works” (Cartoon version) • Each neuron receives input from other neurons • 10^11 neurons, each with roughly 10^4 weights • Weights can be positive or negative • Weights adapt during the learning process • “neurons that fire together wire together” (Hebb) • Different areas perform different functions using the same structure (Modularity) @bgoncalves www.bgoncalves.com

  7. How the Brain “Works” (Cartoon version) Inputs f(Inputs) Output @bgoncalves www.bgoncalves.com

  8. Optimization Problem • (Machine) Learning can be thought of as an optimization problem. • Optimization problems have 3 distinct pieces: • The constraints: the Neural Network • The function to optimize: the Prediction Error • The optimization algorithm: Gradient Descent @bgoncalves www.bgoncalves.com

  9. Artificial Neuron • Inputs x_1, …, x_N (plus a constant bias input of 1) are multiplied by the weights w_{1j}, …, w_{Nj} (and the bias weight w_{0j}) and summed: z_j = w^T x • The activation function φ turns this into the output: a_j = φ(z_j) @bgoncalves www.bgoncalves.com
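To make the arithmetic on this slide concrete, here is a minimal NumPy sketch of a single artificial neuron (not code from the talk's repository; the example inputs, weights, and bias values are made up):

    import numpy as np

    def neuron(x, w, b, phi):
        # z = w^T x + b, then a = phi(z)
        z = np.dot(w, x) + b
        return phi(z)

    # Example with made-up numbers and a sigmoid activation
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    x = np.array([0.5, -1.2, 3.0])   # inputs x_1..x_N
    w = np.array([0.1, 0.4, -0.2])   # weights w_1j..w_Nj
    b = 0.3                          # bias weight w_0j (its input is fixed to 1)
    print(neuron(x, w, b, sigmoid))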

  10. Activation Function - Sigmoid http://github.com/bmtgoncalves/Neural-Networks • Non-linear function • Differentiable • φ(z) = 1 / (1 + e^{−z}) • Non-decreasing • Computes new sets of features • Each layer builds up a more abstract representation of the data • Perhaps the most common @bgoncalves www.bgoncalves.com

  11. Activation Function - tanh http://github.com/bmtgoncalves/Neural-Networks • Non-linear function • Differentiable • φ(z) = (e^z − e^{−z}) / (e^z + e^{−z}) • Non-decreasing • Computes new sets of features • Each layer builds up a more abstract representation of the data @bgoncalves www.bgoncalves.com
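A minimal sketch of the two activation functions from slides 10 and 11, assuming plain NumPy (the plotting code from the linked repository is not reproduced here):

    import numpy as np

    def sigmoid(z):
        # phi(z) = 1 / (1 + e^{-z}), maps any real input to (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):
        # phi(z) = (e^z - e^{-z}) / (e^z + e^{-z}), maps to (-1, 1)
        return np.tanh(z)

    z = np.linspace(-5, 5, 11)
    print(sigmoid(z))
    print(tanh(z))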

  12. Forward Propagation • The output of a perceptron is determined by a sequence of steps: • obtain the inputs • multiply the inputs by the respective weights • calculate output using the activation function • To create a multi-layer perceptron, you can simply use the output of one layer as the input to the next one. 
 [Diagram: two layers, where the hidden layer computes a_j = φ(w_j^T x) and the output layer computes a_k = φ(w_k^T a)] • But how can we propagate back the errors and update the weights? @bgoncalves www.bgoncalves.com
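A sketch of forward propagation through two layers in NumPy, following the slide's notation a_j = φ(w_j^T x) and a_k = φ(w_k^T a); the layer sizes and random weights are illustrative assumptions, not the talk's code:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, layers):
        # Each layer is a (W, b) pair; the output of one layer
        # becomes the input of the next one.
        a = x
        for W, b in layers:
            a = sigmoid(W @ a + b)
        return a

    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=(4, 3)), np.zeros(4)),   # hidden layer: a_j
              (rng.normal(size=(2, 4)), np.zeros(2))]   # output layer: a_k
    print(forward(np.array([0.5, -1.0, 2.0]), layers))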

  13. Backward Propagation of Errors (BackProp) • BackProp operates in two phases: • Forward propagate the inputs, then propagate the errors backwards to calculate the deltas • Update the weights • The error at the output is a weighted average difference between the predicted output and the observed one. • For inner layers there is no “real output”! @bgoncalves www.bgoncalves.com
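As an illustration of the two phases, here is a compact BackProp sketch for a single hidden layer with sigmoid activations and quadratic error; the shapes, the learning rate alpha, and the omission of bias terms are simplifying assumptions not spelled out on the slide:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, y, W1, W2, alpha=0.1):
        # Phase 1: forward propagate and compute the deltas
        a1 = sigmoid(W1 @ x)                      # hidden activations
        a2 = sigmoid(W2 @ a1)                     # predicted output
        delta2 = (a2 - y) * a2 * (1 - a2)         # error at the output layer
        delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # error propagated back
        # Phase 2: update the weights (biases omitted for brevity)
        W2 -= alpha * np.outer(delta2, a1)
        W1 -= alpha * np.outer(delta1, x)
        return W1, W2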

  14. Loss Functions • For learning to occur, we must quantify how far off we are from the desired output. There are two common ways of doing this: • Quadratic error function: E = (1/N) Σ_n |y_n − a_n|^2 • Cross Entropy: J = −(1/N) Σ_n [ y_n^T log(a_n) + (1 − y_n)^T log(1 − a_n) ] • The Cross Entropy is complementary to the sigmoid activation in the output layer and improves its stability. @bgoncalves www.bgoncalves.com
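The two loss functions written out in NumPy as a sketch; y and a are assumed to be arrays of shape (N, outputs) holding the targets y_n and the network outputs a_n:

    import numpy as np

    def quadratic_error(y, a):
        # E = (1/N) * sum_n |y_n - a_n|^2
        return np.mean(np.sum((y - a) ** 2, axis=-1))

    def cross_entropy(y, a, eps=1e-12):
        # J = -(1/N) * sum_n [ y_n . log(a_n) + (1 - y_n) . log(1 - a_n) ]
        a = np.clip(a, eps, 1 - eps)   # avoid log(0)
        return -np.mean(np.sum(y * np.log(a) + (1 - y) * np.log(1 - a), axis=-1))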

  15. Gradient Descent • Find the gradient for each training batch • Take a step downhill along the direction of the gradient: θ_{mn} ← θ_{mn} − α ∂H/∂θ_{mn} • where α is the step size. • Repeat until “convergence”. @bgoncalves www.bgoncalves.com
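A generic version of this update loop; grad_H is a stand-in for whatever routine computes ∂H/∂θ on a batch (e.g. BackProp) and is an assumption here, not code from the talk:

    def gradient_descent(theta, batches, grad_H, alpha=0.01, n_epochs=10):
        # theta <- theta - alpha * dH/dtheta, repeated until "convergence"
        for epoch in range(n_epochs):
            for batch in batches:
                theta = theta - alpha * grad_H(theta, batch)
        return theta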

  16. @bgoncalves www.bgoncalves.com

  17. Feed Forward Networks • Input x_t, Output h_t • h_t = f(x_t) @bgoncalves www.bgoncalves.com

  18. Feed Forward Networks • Input x_t, Output h_t • h_t = f(x_t) @bgoncalves www.bgoncalves.com

  19. Feed Forward Networks [Diagram: information flows from the input x_t to the output h_t] • h_t = f(x_t) @bgoncalves www.bgoncalves.com

  20. Recurrent Neural Network (RNN) [Diagram: the previous output h_{t−1} is fed back into the network together with the input x_t] • h_t = f(x_t, h_{t−1}) @bgoncalves www.bgoncalves.com

  21. Recurrent Neural Network (RNN) [Diagram: a recurrent cell takes the input x_t and the previous state h_{t−1} and produces the output h_t] @bgoncalves www.bgoncalves.com

  22. Recurrent Neural Network (RNN) • Each output depends (implicitly) on all previous outputs. • Input sequences generate output sequences (seq2seq) [Diagram: the network unrolled in time, with inputs x_{t−1}, x_t, x_{t+1} and outputs h_{t−1}, h_t, h_{t+1}] @bgoncalves www.bgoncalves.com

  23. Recurrent Neural Network (RNN) [Diagram: inside the cell, a tanh layer combines h_{t−1} and x_t] • h_t = tanh(W h_{t−1} + U x_t) @bgoncalves www.bgoncalves.com

  24. Recurrent Neural Network (RNN) [Diagram: the cell concatenates both inputs, h_{t−1} and x_t] • h_t = tanh(W h_{t−1} + U x_t) @bgoncalves www.bgoncalves.com
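A direct NumPy transcription of the recurrence h_t = tanh(W h_{t−1} + U x_t); the dimensions below are arbitrary, and in practice a library RNN cell would be used rather than this sketch:

    import numpy as np

    def rnn_step(x_t, h_prev, W, U):
        # h_t = tanh(W h_{t-1} + U x_t)
        return np.tanh(W @ h_prev + U @ x_t)

    def rnn_forward(xs, W, U):
        # Run the cell over a whole input sequence, collecting the outputs.
        h = np.zeros(W.shape[0])
        outputs = []
        for x_t in xs:
            h = rnn_step(x_t, h, W, U)
            outputs.append(h)
        return np.array(outputs)

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(8, 8))   # hidden-to-hidden weights
    U = rng.normal(scale=0.1, size=(8, 3))   # input-to-hidden weights
    xs = rng.normal(size=(5, 3))             # a length-5 input sequence
    print(rnn_forward(xs, W, U).shape)       # (5, 8): one output per step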

  25. Timeseries • Temporal sequence of data points • Consecutive points are strongly correlated • Common in statistics, signal processing, econometrics, mathematical finance, earthquake prediction, etc • Numeric (real or discrete) or symbolic data @bgoncalves www.bgoncalves.com
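One common way to feed such a timeseries to a recurrent model, shown here as an assumption about typical preprocessing rather than anything prescribed on the slide, is to slice it into windows of consecutive points and predict the value that follows each window:

    import numpy as np

    def make_windows(series, window):
        # Turn a 1-D series into (window, next value) training pairs.
        X, y = [], []
        for i in range(len(series) - window):
            X.append(series[i:i + window])
            y.append(series[i + window])
        return np.array(X), np.array(y)

    series = np.sin(np.linspace(0, 10, 200))   # toy timeseries
    X, y = make_windows(series, window=20)
    print(X.shape, y.shape)                    # (180, 20) (180,)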

  26. Long Short-Term Memory (LSTM) • What if we want to keep explicit information about previous states (memory)? • How much information is kept can be controlled through gates. • LSTMs were first introduced in 1997 by Hochreiter and Schmidhuber [Diagram: the LSTM unrolled in time, with a cell state c_t passed from step to step alongside the output h_t] @bgoncalves www.bgoncalves.com

  27. Long Short-Term Memory (LSTM) [Diagram legend: + is element-wise addition, × is element-wise multiplication, 1− is 1 minus the input] • g = tanh(W_g h_{t−1} + U_g x_t) • i = σ(W_i h_{t−1} + U_i x_t) • f = σ(W_f h_{t−1} + U_f x_t) • o = σ(W_o h_{t−1} + U_o x_t) • c_t = (c_{t−1} ⊗ f) + (g ⊗ i) • h_t = tanh(c_t) ⊗ o @bgoncalves www.bgoncalves.com

  28. Long Short-Term Memory (LSTM) • Forget gate: how much of the previous state should be kept? • f = σ(W_f h_{t−1} + U_f x_t) @bgoncalves www.bgoncalves.com

  29. Long Short-Term Memory (LSTM) • Input gate: how much of the new input should be remembered? • g = tanh(W_g h_{t−1} + U_g x_t) • i = σ(W_i h_{t−1} + U_i x_t) @bgoncalves www.bgoncalves.com

  30. Long Short-Term Memory (LSTM) • Output gate: how much of the previous output should contribute? • All gates use the same inputs and activation functions, but different weights. • o = σ(W_o h_{t−1} + U_o x_t) @bgoncalves www.bgoncalves.com

  31. Long Short-Term Memory (LSTM) • Output gate: how much of the previous output should contribute? • o = σ(W_o h_{t−1} + U_o x_t) @bgoncalves www.bgoncalves.com

  32. Long Short-Term Memory (LSTM) • State: update the current state. • c_t = (c_{t−1} ⊗ f) + (g ⊗ i) @bgoncalves www.bgoncalves.com

  33. Long Short-Term Memory (LSTM) • Output: combine all available information. • h_t = tanh(c_t) ⊗ o @bgoncalves www.bgoncalves.com
