RNNs for Timeseries Analysis



  1. Bruno Gonçalves 
 RNNs for Timeseries Analysis www.data4sci.com github.com/DataForScience/RNN

  2. Disclaimer The views and opinions expressed in this tutorial are those of the authors and do not necessarily reflect the official policy or position of my employer. The examples provided with this tutorial were chosen for their didactic value and are not meant to be representative of my day-to-day work. @bgoncalves www.data4sci.com

  3. References @bgoncalves www.data4sci.com

  4. How the Brain “Works” (Cartoon version) @bgoncalves www.data4sci.com

  5. How the Brain “Works” (Cartoon version) • Each neuron receives input from other neurons • ~10^11 neurons, each with ~10^4 weights • Weights can be positive or negative • Weights adapt during the learning process • “Neurons that fire together wire together” (Hebb) • Different areas perform different functions using the same structure (Modularity) @bgoncalves www.data4sci.com

  6. How the Brain “Works” (Cartoon version) [diagram: a neuron as a function, Output = f(Inputs)] @bgoncalves www.data4sci.com

  7. Optimization Problem • (Machine) Learning can be thought of as an optimization problem. • Optimization problems have 3 distinct pieces: • The constraints: the Neural Network • The function to optimize: the Prediction Error • The optimization algorithm: Gradient Descent @bgoncalves www.data4sci.com

  8. Artificial Neuron [diagram: inputs x_1 … x_N with weights w_1j … w_Nj and a bias weight w_0j; weighted sum z_j = w^T x; activation function φ; output a_j = φ(z_j)] @bgoncalves www.data4sci.com
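A single artificial neuron is just a weighted sum of its inputs (plus a bias) passed through an activation function. A minimal NumPy sketch, with made-up inputs and weights for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    z = np.dot(w, x) + b      # z_j = w^T x + w_0j
    return sigmoid(z)         # a_j = phi(z_j)

x = np.array([0.5, -1.2, 3.0])   # inputs (illustrative values)
w = np.array([0.1, 0.4, -0.2])   # weights
b = 0.3                          # bias weight w_0j
print(neuron(x, w, b))
```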

  9. Activation Function - Sigmoid http://github.com/bmtgoncalves/Neural-Networks • Non-linear function • Differentiable • Non-decreasing • φ(z) = 1 / (1 + e^−z) • Compute new sets of features • Each layer builds up a more abstract representation of the data • Perhaps the most common @bgoncalves www.data4sci.com

  10. Activation Function - tanh http://github.com/bmtgoncalves/Neural-Networks • Non-linear function • Differentiable • Non-decreasing • φ(z) = (e^z − e^−z) / (e^z + e^−z) • Compute new sets of features • Each layer builds up a more abstract representation of the data @bgoncalves www.data4sci.com
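Both activation functions are one-liners in NumPy; the sketch below simply evaluates them on a small grid to show the different output ranges, (0, 1) for the sigmoid and (−1, 1) for tanh:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # phi(z) = 1 / (1 + e^-z)

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))  # equivalent to np.tanh(z)

z = np.linspace(-5, 5, 11)
print(sigmoid(z))   # values squashed into (0, 1)
print(tanh(z))      # values squashed into (-1, 1)
```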

  11. Forward Propagation • The output of a perceptron is determined by a sequence of steps: • obtain the inputs • multiply the inputs by the respective weights • calculate output using the activation function • To create a multi-layer perceptron, you can simply use the output of one layer as the input to the next one. 
 [diagram: two-layer perceptron; hidden activations a_j = φ(w_j^T x) feed the output unit φ(w_k^T a)] @bgoncalves www.data4sci.com
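In code, forward propagation is just repeated matrix-vector products and activations, with each layer's output used as the next layer's input. A minimal two-layer sketch in NumPy (the layer sizes and random weights are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward propagation through a two-layer perceptron."""
    a = sigmoid(W1 @ x + b1)   # hidden layer: a_j = phi(w_j^T x)
    h = sigmoid(W2 @ a + b2)   # output layer uses a as its input
    return h

rng = np.random.default_rng(0)
x  = rng.normal(size=3)          # 3 inputs
W1 = rng.normal(size=(4, 3))     # 4 hidden units
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4))     # 1 output unit
b2 = np.zeros(1)
print(forward(x, W1, b1, W2, b2))
```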

  12. Backward Propagation of Errors (BackProp) • BackProp operates in two phases: • Forward propagate the inputs and calculate the output error • Backward propagate the error to compute the deltas and update the weights • The error at the output is a weighted average difference between the predicted output and the observed one. • For inner layers there is no “real output”! @bgoncalves www.data4sci.com

  13. Loss Functions • For learning to occur, we must quantify how far off we are from the desired output. There are two common ways of doing this: • Quadratic error function: E = (1/N) Σ_n |y_n − a_n|^2 • Cross entropy: J = −(1/N) Σ_n [ y_n^T log a_n + (1 − y_n)^T log(1 − a_n) ] • The cross entropy is complementary to the sigmoid activation in the output layer and improves its stability. @bgoncalves www.data4sci.com
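Both losses compare the network outputs a_n with the targets y_n; here is a quick NumPy sketch for the scalar-output case (the arrays are made-up predictions and labels):

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0, 1.0])   # observed targets (illustrative)
a = np.array([0.9, 0.2, 0.7, 0.6])   # network outputs (illustrative)

# Quadratic error: E = (1/N) * sum_n |y_n - a_n|^2
E = np.mean((y - a) ** 2)

# Cross entropy: J = -(1/N) * sum_n [y_n log a_n + (1 - y_n) log(1 - a_n)]
J = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

print(E, J)
```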

  14. Gradient Descent • Find the gradient for each training batch • Take a step downhill along the direction of the gradient: θ_mn ← θ_mn − α ∂H/∂θ_mn • where α is the step size. • Repeat until “convergence”. @bgoncalves www.data4sci.com
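The update rule maps directly onto a loop. A sketch for a toy one-parameter function H(θ) = (θ − 2)^2, whose gradient is known in closed form (the starting point and step size are arbitrary choices):

```python
# Gradient descent on H(theta) = (theta - 2)^2, whose gradient is 2 * (theta - 2)
def grad_H(theta):
    return 2.0 * (theta - 2.0)

theta = 10.0   # starting point (illustrative)
alpha = 0.1    # step size
for step in range(100):
    theta = theta - alpha * grad_H(theta)   # theta <- theta - alpha * dH/dtheta
print(theta)   # converges towards the minimum at theta = 2
```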

  15. @bgoncalves www.data4sci.com

  16. Feed Forward Networks [diagram: input x_t → network → output h_t] h_t = f(x_t) @bgoncalves www.data4sci.com

  17. Feed Forward Networks [diagram: information flows from the input x_t up to the output h_t] h_t = f(x_t) @bgoncalves www.data4sci.com

  18. Recurrent Neural Network (RNN) [diagram: unlike the feed-forward case h_t = f(x_t), the RNN also feeds the previous output h_{t−1} back into the cell] h_t = f(x_t, h_{t−1}) @bgoncalves www.data4sci.com

  19. Recurrent Neural Network (RNN) • Each output depends (implicitly) on all previous outputs. • Input sequences generate output sequences (seq2seq) [diagram: network unrolled in time, with inputs x_{t−1}, x_t, x_{t+1} and hidden states h_{t−1}, h_t, h_{t+1} chained together] @bgoncalves www.data4sci.com

  20. Recurrent Neural Network (RNN) [diagram: the cell concatenates the previous state h_{t−1} and the current input x_t and passes them through tanh] h_t = tanh(W h_{t−1} + U x_t) @bgoncalves www.data4sci.com
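The recurrence on this slide fits in a few lines of NumPy: at each step the previous state h_{t−1} and the current input x_t are combined and squashed through tanh (a bias term is added here; the sizes and random weights are illustrative):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One step of a vanilla RNN: h_t = tanh(W h_{t-1} + U x_t + b)."""
    return np.tanh(W @ h_prev + U @ x_t + b)

rng = np.random.default_rng(0)
hidden, features, steps = 8, 3, 5
W = rng.normal(scale=0.1, size=(hidden, hidden))
U = rng.normal(scale=0.1, size=(hidden, features))
b = np.zeros(hidden)

h = np.zeros(hidden)                       # initial state h_0
xs = rng.normal(size=(steps, features))    # input sequence
for x_t in xs:
    h = rnn_step(x_t, h, W, U, b)          # each h_t depends on all previous inputs
print(h)
```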

  21. Timeseries • Temporal sequence of data points • Consecutive points are strongly correlated • Common in statistics, signal processing, econometrics, mathematical finance, earthquake prediction, etc. • Numeric (real or discrete) or symbolic data • Supervised learning problem: X_t = f(X_{t−1}, ⋯, X_{t−n}) @bgoncalves www.data4sci.com
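Framing forecasting as supervised learning means turning the series into (window, target) pairs: the n previous points X_{t−1}, …, X_{t−n} are the features and X_t is the value to predict. A small windowing sketch (the sine series and window length are illustrative):

```python
import numpy as np

def make_windows(series, n):
    """Build (X_{t-n}, ..., X_{t-1}) -> X_t training pairs from a 1-D series."""
    X, y = [], []
    for t in range(n, len(series)):
        X.append(series[t - n:t])   # the n previous points
        y.append(series[t])         # the value to predict
    return np.array(X), np.array(y)

series = np.sin(np.linspace(0, 10, 200))   # illustrative timeseries
X, y = make_windows(series, n=12)
print(X.shape, y.shape)                    # (188, 12) (188,)
```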

  22. github.com/DataForScience/RNN

  23. Long Short-Term Memory (LSTM) • What if we want to keep explicit information about previous states (memory)? • How much information is kept can be controlled through gates. • LSTMs were first introduced in 1997 by Hochreiter and Schmidhuber [diagram: unrolled LSTM, with a cell state c_t passed along the sequence in addition to the output h_t] @bgoncalves www.data4sci.com

  24. Long Short-Term Memory (LSTM) [diagram: LSTM cell; ⊗ denotes element-wise multiplication, + element-wise addition] g = tanh(W_g h_{t−1} + U_g x_t), f = σ(W_f h_{t−1} + U_f x_t), i = σ(W_i h_{t−1} + U_i x_t), o = σ(W_o h_{t−1} + U_o x_t), c_t = (c_{t−1} ⊗ f) + (g ⊗ i), h_t = tanh(c_t) ⊗ o @bgoncalves www.data4sci.com
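The six equations above translate almost line for line into code. A single-step NumPy sketch (biases are omitted and the weight shapes are illustrative), with ⊗ implemented as element-wise multiplication:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U):
    """One LSTM step following the slide's equations (biases omitted for brevity)."""
    g = np.tanh(W["g"] @ h_prev + U["g"] @ x_t)   # candidate values
    f = sigmoid(W["f"] @ h_prev + U["f"] @ x_t)   # forget gate
    i = sigmoid(W["i"] @ h_prev + U["i"] @ x_t)   # input gate
    o = sigmoid(W["o"] @ h_prev + U["o"] @ x_t)   # output gate
    c_t = c_prev * f + g * i                      # new cell state
    h_t = np.tanh(c_t) * o                        # new output
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, features = 4, 3
W = {k: rng.normal(scale=0.1, size=(hidden, hidden)) for k in "gfio"}
U = {k: rng.normal(scale=0.1, size=(hidden, features)) for k in "gfio"}
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=features), h, c, W, U)
print(h, c)
```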

  25. Long Short-Term Memory (LSTM) • Forget gate f: how much of the previous state should be kept? f = σ(W_f h_{t−1} + U_f x_t) @bgoncalves www.data4sci.com

  26. Long Short-Term Memory (LSTM) • Input gate i: how much of the previous output should be remembered? i = σ(W_i h_{t−1} + U_i x_t) @bgoncalves www.data4sci.com

  27. Long Short-Term Memory (LSTM) • Output gate o: how much of the previous output should contribute? • All gates use the same inputs and activation functions, but different weights. o = σ(W_o h_{t−1} + U_o x_t) @bgoncalves www.data4sci.com

  28. Long Short-Term Memory (LSTM) • Output gate o: how much of the previous output should contribute? o = σ(W_o h_{t−1} + U_o x_t) @bgoncalves www.data4sci.com

  29. Long Short-Term Memory (LSTM) • State: update the current cell state. c_t = (c_{t−1} ⊗ f) + (g ⊗ i) @bgoncalves www.data4sci.com

  30. Long Short-Term Memory (LSTM) • Output: combine all available information. h_t = tanh(c_t) ⊗ o @bgoncalves www.data4sci.com

  31. Using LSTMs [diagram: inputs of shape (sequence length × #features) feed a layer of LSTM cells that share the weights W_1; their outputs feed a single neuron with weights W_2 and a sigmoid (σ) activation] @bgoncalves www.data4sci.com
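The diagram maps onto the standard deep-learning-library pattern: a 3-D input of shape (samples, sequence length, #features), an LSTM layer holding the shared weights W_1, and a single output neuron with weights W_2 and a sigmoid activation. A minimal sketch of that architecture, assuming Keras (the layer sizes are illustrative; the notebooks at github.com/DataForScience/RNN are the companion material):

```python
from tensorflow import keras

seq_length, n_features = 12, 1        # window length and #features (illustrative)

model = keras.Sequential([
    keras.Input(shape=(seq_length, n_features)),   # (sequence length, #features) per sample
    keras.layers.LSTM(32),                         # the LSTM cells sharing weights W_1
    keras.layers.Dense(1, activation="sigmoid"),   # output neuron with weights W_2 and sigma
])
model.compile(optimizer="adam", loss="mse")
model.summary()

# X has shape (samples, seq_length, n_features); y has shape (samples,)
# model.fit(X, y, epochs=10, batch_size=32)
```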

  32. github.com/DataForScience/RNN

  33. Applications • Language Modeling and Prediction • Speech Recognition • Machine Translation • Part-of-Speech Tagging • Sentiment Analysis • Summarization • Time series forecasting @bgoncalves www.data4sci.com
