RNNs for Timeseries Analysis
www.bgoncalves.com
github.com/bmtgoncalves/RNN
Disclaimer
The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy or position of my employer. The examples provided with this tutorial were chosen for their didactic value and are not meant to be representative of my day-to-day work.
How the Brain “Works” (Cartoon version)
• Each neuron receives input from other neurons
• 10^11 neurons, each with 10^4 weights
• Weights can be positive or negative
• Weights adapt during the learning process: “neurons that fire together wire together” (Hebb)
• Different areas perform different functions using the same structure (Modularity)
How the Brain “Works” (Cartoon version)
[Diagram: Inputs → f(Inputs) → Output]
Optimization Problem
• (Machine) Learning can be thought of as an optimization problem.
• Optimization problems have 3 distinct pieces:
• The constraints: the neural network
• The function to optimize: the prediction error
• The optimization algorithm: gradient descent
Artificial Neuron
[Diagram: inputs x_1 … x_N, plus a constant bias input of 1, are multiplied by the weights w_1j … w_Nj (and the bias weight w_0j), summed into z_j = w^T x, and passed through the activation function φ(z) to produce the output a_j.]
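A minimal NumPy sketch of this diagram; the toy numbers and the use of tanh here are assumptions for illustration, and the activation functions themselves are defined on the next slides.

```python
import numpy as np

def neuron_output(x, w, w0, phi):
    # Single artificial neuron: weighted sum of the inputs plus the bias
    # weight w0 (whose input is fixed at 1), passed through the activation phi.
    z = w @ x + w0        # z_j = w^T x (plus bias)
    return phi(z)         # a_j = phi(z_j)

# Hypothetical toy values, just for illustration.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.3])
print(neuron_output(x, w, 0.2, np.tanh))
```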
Activation Function - Sigmoid
http://github.com/bmtgoncalves/Neural-Networks
φ(z) = 1 / (1 + e^(−z))
• Non-linear function
• Differentiable
• Non-decreasing
• Computes new sets of features
• Each layer builds up a more abstract representation of the data
• Perhaps the most common choice
Activation Function - tanh
http://github.com/bmtgoncalves/Neural-Networks
φ(z) = (e^z − e^(−z)) / (e^z + e^(−z))
• Non-linear function
• Differentiable
• Non-decreasing
• Computes new sets of features
• Each layer builds up a more abstract representation of the data
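A quick sketch of both activation functions in NumPy (np.tanh already exists; the sigmoid is written out to mirror the formula above, and the test values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    # phi(z) = 1 / (1 + e^(-z)); squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # phi(z) = (e^z - e^(-z)) / (e^z + e^(-z)); squashes z into (-1, 1)
    return np.tanh(z)

z = np.linspace(-5, 5, 11)
print(sigmoid(z))
print(tanh(z))
```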
Forward Propagation
• The output of a perceptron is determined by a sequence of steps:
• obtain the inputs
• multiply the inputs by the respective weights
• calculate the output using the activation function: a_j = φ(w_j^T x)
• To create a multi-layer perceptron, you can simply use the output of one layer as the input to the next one: a_k = φ(w_k^T a), as in the sketch below.
• But how can we propagate back the errors and update the weights?
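A minimal forward-propagation sketch for a two-layer perceptron in NumPy; the layer sizes, random weights, and the use of the sigmoid in both layers are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    a_hidden = sigmoid(W1 @ x + b1)         # hidden layer: a_j = phi(w_j^T x)
    a_output = sigmoid(W2 @ a_hidden + b2)  # next layer reuses the previous output
    return a_hidden, a_output

rng = np.random.default_rng(0)
x = rng.normal(size=3)                        # toy input
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(forward(x, W1, b1, W2, b2)[1])
```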
Backward Propagation of Errors (BackProp)
• BackProp operates in two phases:
• Forward propagate the inputs and calculate the deltas (errors) at each layer
• Update the weights using these deltas
• The error at the output is a weighted average of the differences between the predicted output and the observed one.
• For inner layers there is no “real output”!
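A sketch of one BackProp update for the two-layer network above, using the quadratic error and sigmoid activations; the learning rate and loss choice are assumptions. Since the inner layer has no “real output”, its deltas are built from the weighted deltas of the layer above it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, alpha=0.1):
    # Forward phase: compute every layer's output.
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)

    # Backward phase: deltas at the output come from the prediction error,
    # deltas at the hidden layer come from the weighted deltas above it.
    delta2 = (a2 - y) * a2 * (1 - a2)
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)

    # Gradient-descent update of every weight and bias.
    W2 -= alpha * np.outer(delta2, a1)
    b2 -= alpha * delta2
    W1 -= alpha * np.outer(delta1, x)
    b1 -= alpha * delta1
    return W1, b1, W2, b2
```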
Loss Functions
• For learning to occur, we must quantify how far off we are from the desired output. There are two common ways of doing this:
• Quadratic error function:
E = 1/N Σ_n |y_n − a_n|²
• Cross entropy:
J = −1/N Σ_n [ y_nᵀ log(a_n) + (1 − y_n)ᵀ log(1 − a_n) ]
• The cross entropy is complementary to the sigmoid activation in the output layer and improves its stability.
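Both loss functions in NumPy; the clipping constant and the toy targets/outputs are assumptions to keep the example self-contained:

```python
import numpy as np

def quadratic_error(y, a):
    # E = 1/N * sum_n |y_n - a_n|^2
    return np.mean(np.sum((y - a) ** 2, axis=1))

def cross_entropy(y, a, eps=1e-12):
    # J = -1/N * sum_n [ y_n^T log(a_n) + (1 - y_n)^T log(1 - a_n) ]
    a = np.clip(a, eps, 1 - eps)   # avoid log(0)
    return -np.mean(np.sum(y * np.log(a) + (1 - y) * np.log(1 - a), axis=1))

y = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy targets
a = np.array([[0.9, 0.2], [0.3, 0.8]])   # toy network outputs
print(quadratic_error(y, a), cross_entropy(y, a))
```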
Gradient Descent
• Find the gradient for each training batch
• Take a step downhill along the direction of the gradient:
θ_mn ← θ_mn − α ∂H/∂θ_mn
• where α is the step size.
• Repeat until “convergence”.
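A bare-bones gradient-descent loop on a toy function H(θ) = |θ|², just to make the update rule concrete; the function, step size, and stopping threshold are all assumptions:

```python
import numpy as np

alpha = 0.1                        # step size
theta = np.array([3.0, -2.0])      # hypothetical parameters

for step in range(1000):
    grad = 2 * theta               # dH/dtheta for H(theta) = |theta|^2
    theta = theta - alpha * grad   # step downhill along the gradient
    if np.linalg.norm(grad) < 1e-6:   # "convergence"
        break

print(theta)
```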
Feed Forward Networks
Information flows in one direction, from the input x_t to the output h_t:
h_t = f(x_t)
Recurrent Neural Network (RNN)
The previous output h_{t−1} is fed back in alongside the input, so information flows through time:
h_t = f(x_t, h_{t−1})
Recurrent Neural Network (RNN)
• Each output depends (implicitly) on all previous outputs.
• Input sequences generate output sequences (seq2seq).
[Diagram: the network unrolled in time, with hidden states h_{t−1}, h_t, h_{t+1} connected in sequence and fed by inputs x_{t−1}, x_t, x_{t+1}]
Recurrent Neural Network (RNN)
h_t = tanh(W h_{t−1} + U x_t)
In practice, the two inputs h_{t−1} and x_t can be concatenated and multiplied by a single weight matrix, as in the sketch below.
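A minimal NumPy sketch of this recurrence, unrolled over a short sequence (the sizes and random weights are assumptions). The second function shows the concatenated form, which is equivalent to using the single stacked matrix [W U]:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U):
    # h_t = tanh(W h_{t-1} + U x_t)
    return np.tanh(W @ h_prev + U @ x_t)

def rnn_step_concat(x_t, h_prev, WU):
    # Same computation with both inputs concatenated and one weight matrix.
    return np.tanh(WU @ np.concatenate([h_prev, x_t]))

rng = np.random.default_rng(1)
n_hidden, n_in = 4, 3
W = rng.normal(size=(n_hidden, n_hidden))
U = rng.normal(size=(n_hidden, n_in))
h = np.zeros(n_hidden)                       # initial hidden state
for x_t in rng.normal(size=(5, n_in)):       # unroll over a length-5 sequence
    h = rnn_step(x_t, h, W, U)
print(h)
```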
Timeseries
• Temporal sequence of data points
• Consecutive points are strongly correlated
• Common in statistics, signal processing, econometrics, mathematical finance, earthquake prediction, etc.
• Numeric (real or discrete) or symbolic data
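Because consecutive points are correlated, a timeseries is usually fed to an RNN as overlapping windows of past values paired with the next value. A sketch of that framing (the window length and the sine toy signal are assumptions):

```python
import numpy as np

def make_windows(series, window):
    # Slice a 1-D timeseries into (past window, next value) pairs,
    # turning forecasting into a supervised learning problem.
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X), np.array(y)

series = np.sin(0.1 * np.arange(200))   # toy signal
X, y = make_windows(series, window=10)
print(X.shape, y.shape)                 # (190, 10) (190,)
```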
Long-Short Term Memory (LSTM)
• What if we want to keep explicit information about previous states (memory)?
• How much information is kept can be controlled through gates.
• LSTMs were first introduced in 1997 by Hochreiter and Schmidhuber.
[Diagram: the LSTM unrolled in time, with a cell state c_t carried alongside the hidden state h_t]
Long-Short Term Memory (LSTM)
(In the diagram: + is element-wise addition, × is element-wise multiplication, 1− is one minus the input.)
g = tanh(W_g h_{t−1} + U_g x_t)
i = σ(W_i h_{t−1} + U_i x_t)
f = σ(W_f h_{t−1} + U_f x_t)
o = σ(W_o h_{t−1} + U_o x_t)
c_t = (c_{t−1} ⊗ f) + (g ⊗ i)
h_t = tanh(c_t) ⊗ o
• Forget gate f: how much of the previous state should be kept?
• Input gate i: how much of the candidate values g (built from the previous output and the current input) should be remembered?
• Output gate o: how much of the current state should contribute to the output?
• All gates use the same inputs and activation functions, but different weights.
• State: update the current state, c_t = (c_{t−1} ⊗ f) + (g ⊗ i).
• Output: combine all available information, h_t = tanh(c_t) ⊗ o.
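A NumPy sketch of one LSTM step implementing exactly these equations (bias terms are omitted, as on the slide; the parameter dictionary, sizes, and random weights are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    g = np.tanh(P['Wg'] @ h_prev + P['Ug'] @ x_t)  # candidate values
    i = sigmoid(P['Wi'] @ h_prev + P['Ui'] @ x_t)  # input gate
    f = sigmoid(P['Wf'] @ h_prev + P['Uf'] @ x_t)  # forget gate
    o = sigmoid(P['Wo'] @ h_prev + P['Uo'] @ x_t)  # output gate
    c_t = c_prev * f + g * i                       # state update
    h_t = np.tanh(c_t) * o                         # output
    return h_t, c_t

rng = np.random.default_rng(2)
n_hidden, n_in = 4, 3
P = {'W' + k: rng.normal(size=(n_hidden, n_hidden)) for k in 'gifo'}
P.update({'U' + k: rng.normal(size=(n_hidden, n_in)) for k in 'gifo'})
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):     # run the cell over a short sequence
    h, c = lstm_step(x_t, h, c, P)
print(h)
```

In practice one would use an off-the-shelf implementation such as keras.layers.LSTM rather than writing the cell by hand; the sketch is only meant to make the gate equations concrete.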