Recurrent Neural Networks and Long Short-Term Memory (LSTM)
Jeong Min Lee
CS3750, University of Pittsburgh

Outline
• RNN
• RNN as an Unfolding Computational Graph
• Backpropagation and weight update
• Exploding / vanishing gradient problem
• LSTM
• GRU
• Tasks with RNN
• Software Packages
So far we have been
• Modeling sequences (time series) and predicting future values with probabilistic models (AR, HMM, LDS, particle filtering, Hawkes process, etc.)
• E.g., LDS:
  • The observation y_t is modeled with an emission matrix C applied to the hidden state z_t, plus Gaussian noise w_t:
    y_t = C z_t + w_t,   w_t ~ N(w | 0, Σ)
  • The hidden state is also computed probabilistically, with a transition matrix A and Gaussian noise v_t:
    z_t = A z_{t-1} + v_t,   v_t ~ N(v | 0, Γ)

[Figure: LDS graphical model with hidden-state chain z_{t-1} → z_t → z_{t+1} and emissions to observations y_{t-1}, y_t, y_{t+1}]

Paradigm Shift to RNN
• We are moving into a new world where the model has no probabilistic component
• That is, we may not need to do inference as in LDS and HMM
• In an RNN, the hidden states carry no probabilistic form or assumption
• Given fixed inputs and targets from the data, the RNN learns the intermediate association between them, as well as a real-valued vector representation
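As a point of reference before the shift to RNNs, here is a minimal sketch (not from the slides) of the LDS generative process above. The state dimension, parameter values, and covariances are placeholders chosen only for illustration.

```python
# Sample a trajectory from a toy linear dynamical system:
#   z_t = A z_{t-1} + v_t,  v_t ~ N(0, Gamma)   (hidden state)
#   y_t = C z_t + w_t,      w_t ~ N(0, Sigma)   (observation)
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.95]])   # transition matrix (placeholder)
C = np.array([[1.0, 0.5]])                # emission matrix (placeholder)
Gamma = 0.01 * np.eye(2)                  # transition noise covariance
Sigma = 0.10 * np.eye(1)                  # observation noise covariance

z = np.zeros(2)
ys = []
for t in range(50):
    z = A @ z + rng.multivariate_normal(np.zeros(2), Gamma)   # hidden update
    y = C @ z + rng.multivariate_normal(np.zeros(1), Sigma)   # noisy emission
    ys.append(y)
```

Note the contrast with what follows: every step here involves random number generation, whereas the RNN computation below is fully deterministic.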
RNN
• RNN's input, output, and internal representation (hidden states) are all real-valued vectors
• h_t: hidden state (real-valued vector)
• x_t: input vector (real-valued)
• V h_t: real-valued vector
• ŷ: output vector (real-valued)

  h_t = tanh(U x_t + W h_{t-1})
  ŷ = λ(V h_t)

RNN
• RNN consists of three parameter matrices (U, W, V) with activation functions
• U: input-hidden matrix
• W: hidden-hidden matrix
• V: hidden-output matrix

  h_t = tanh(U x_t + W h_{t-1})
  ŷ = λ(V h_t)
RNN
• tanh(·) is the hyperbolic tangent function. It models non-linearity.

  h_t = tanh(U x_t + W h_{t-1})
  ŷ = λ(V h_t)

[Figure: plot of tanh(z) as a function of z]

RNN
• λ(·) is the output transformation function
• It can be any function, selected according to the task and the type of target in the data
• It can even be another feed-forward neural network, which lets the RNN model almost any kind of output
  • Sigmoid: binary probability distribution
  • Softmax: categorical probability distribution
  • ReLU: positive real-valued output
  • Identity function: real-valued output

  h_t = tanh(U x_t + W h_{t-1})
  ŷ = λ(V h_t)
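The two equations above are all there is to one RNN step. Below is a minimal sketch of that step, assuming small illustrative dimensions; the sizes, the random initial parameter values, and the softmax choice for λ(·) are placeholders, not values from the slides.

```python
# One forward step of a vanilla RNN: h_t = tanh(U x_t + W h_{t-1}), y_hat = λ(V h_t)
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 4, 2          # placeholder sizes

U = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden
V = rng.normal(scale=0.1, size=(output_dim, hidden_dim))  # hidden-to-output

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def rnn_step(x_t, h_prev, out_fn=softmax):
    """h_t = tanh(U x_t + W h_{t-1});  y_hat = out_fn(V h_t)."""
    h_t = np.tanh(U @ x_t + W @ h_prev)
    y_hat = out_fn(V @ h_t)
    return h_t, y_hat

h0 = np.zeros(hidden_dim)          # initial hidden state (zeros)
x1 = rng.normal(size=input_dim)    # current observation
h1, y_hat = rnn_step(x1, h0)       # hidden state and prediction from the first input
```

Swapping `out_fn` for a sigmoid, ReLU, or the identity changes only the output transformation, exactly as listed above.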
Make a prediction
• Let's see how the RNN makes a prediction
• In the beginning, the initial hidden state h_0 is filled with zeros or random values
• We also assume the model is already trained (we will see how it is trained soon)

[Figure: initial hidden state h_0 and first input x_1]

Make a prediction
• Assume we currently have the observation x_1 and want to predict x_2
• We compute the hidden state h_1 first:

  h_1 = tanh(U x_1 + W h_0)

[Figure: h_0 → h_1 through W, with x_1 fed in through U]
Make a prediction
• Then we generate the prediction:

  x̂_2 = ŷ = λ(V h_1)

• V h_1 is a real-valued vector or scalar (depending on the size of the output matrix V)

[Figure: h_1 produces the prediction x̂_2 through V and λ(·)]

Make a prediction multiple steps ahead
• When predicting multiple steps ahead, the predicted value x̂_2 from the previous step is used as the input x_2 at time step 2:

  h_2 = tanh(U x̂_2 + W h_1)
  x̂_3 = ŷ = λ(V h_2)

[Figure: two unrolled steps; x̂_2 is fed back in as the input at t = 2, producing x̂_3]
Make a prediction multiple steps ahead
• The same mechanism applies forward in time:

  h_3 = tanh(U x̂_3 + W h_2)
  x̂_4 = ŷ = λ(V h_3)

[Figure: three unrolled steps producing x̂_2, x̂_3, x̂_4]

RNN Characteristics
• You might have observed that:
  • The parameters U, W, V are shared across all time steps
  • No probabilistic component (random number generation) is involved
  • So everything is deterministic

[Figure: unrolled RNN with the same U, W, V applied at every time step]
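The multi-step prediction above can be written as a short loop that feeds each prediction back in as the next input. The sketch below reuses U, W, and the random generator from the earlier snippet; it assumes the output lives in the same space as the input, so `V_x` (a placeholder name, not from the slides) maps the hidden state back to the input dimension and the output transformation is the identity.

```python
# Autoregressive rollout: prediction at step t becomes the input at step t+1.
V_x = rng.normal(scale=0.1, size=(input_dim, hidden_dim))  # hidden-to-input output matrix

def rollout(x1, h0, n_steps):
    h_t, x_t = h0, x1
    preds = []
    for _ in range(n_steps):
        h_t = np.tanh(U @ x_t + W @ h_t)   # h_t = tanh(U x_t + W h_{t-1})
        x_hat = V_x @ h_t                  # identity output: next-value prediction
        preds.append(x_hat)
        x_t = x_hat                        # the prediction becomes the next input
    return preds

x_hat_2, x_hat_3, x_hat_4 = rollout(x1, h0, n_steps=3)
```

Note that the same U, W, V_x are used at every step and nothing is sampled: running the rollout twice gives identical predictions.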
Another way to see RNN
• RNN is a type of neural network

Neural Network
• A neural network cascades several linear weight matrices with nonlinear activation functions in between
  • y: output
  • V: hidden-output matrix
  • h: hidden units (states)
  • U: input-hidden matrix
  • x: input

[Figure: feed-forward network x → h (through U) → y (through V)]
Neural Network
• In a traditional NN, every input is assumed to be independent of the others
• But with sequential data, the input at the current time step is likely to depend on the input at the previous time step
• We need additional structure that can model dependencies of inputs over time

[Figure: feed-forward network x → h → y]

Recurrent Neural Network
• A type of neural network that has a recurrence structure
• The recurrence structure allows us to operate over a sequence of vectors

[Figure: network x → h → y with a recurrent connection W on the hidden units]
RNN as an Unfolding Computational Graph

[Figure: the recurrent network (left) unfolded over time (right): hidden states …, h_{t-1}, h_t, h_{t+1}, … connected through W, each receiving input x_t through U and emitting output ŷ_t through V]

RNN as an Unfolding Computational Graph
• The RNN can be converted into a feed-forward neural network by unfolding it over time

[Figure: same unfolded graph as above]
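A short sketch of the unfolded view, reusing U, W, V and `softmax` from the earlier snippet: the same three matrices are applied at every time step of a given input sequence, so the unrolled RNN is just a deep feed-forward network whose "layers" (time steps) share weights. The toy 5-step input sequence is made up for illustration.

```python
# Unrolled forward pass over a whole input sequence, sharing U, W, V across steps.
def forward_unrolled(xs, h0):
    hs, y_hats = [], []
    h_t = h0
    for x_t in xs:                          # one "layer" per time step
        h_t = np.tanh(U @ x_t + W @ h_t)
        hs.append(h_t)
        y_hats.append(softmax(V @ h_t))
    return hs, y_hats

xs = [rng.normal(size=input_dim) for _ in range(5)]   # toy input sequence
hs, y_hats = forward_unrolled(xs, np.zeros(hidden_dim))
```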
How to train RNN?
• Before training, we need to define:
  • y_t: true target
  • ŷ_t: output of the RNN (= prediction of the true target)
  • E_t: error (loss); the difference between the true target and the output
• Just as the output transformation function λ is selected according to the task and data, so is the loss:
  • Binary classification: binary cross-entropy
  • Categorical classification: cross-entropy
  • Regression: mean squared error

With the loss, the RNN looks like:

[Figure: unfolded RNN where each output ŷ_t is compared with the true target y_t to produce the loss E_t]
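A sketch of pairing the output transformation with a loss, continuing the snippets above. The one-hot targets are made up for illustration; with softmax outputs the per-step loss is cross-entropy, and with identity outputs it would be the mean squared error instead.

```python
# Per-step losses and the total loss of one training sequence.
def cross_entropy(y_hat, y):               # categorical target (one-hot y)
    return -np.sum(y * np.log(y_hat + 1e-12))

def mean_squared_error(y_hat, y):          # real-valued target
    return 0.5 * np.sum((y_hat - y) ** 2)

targets = [np.eye(output_dim)[rng.integers(output_dim)] for _ in y_hats]  # toy targets
E_steps = [cross_entropy(y_hat, y) for y_hat, y in zip(y_hats, targets)]  # E_t per step
E_total = sum(E_steps)                     # loss of the whole sequence
```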
Back Propagation Through Time (BPTT)
• An extension of standard backpropagation that performs gradient descent on the unfolded network
• The goal is to calculate the gradients of the error with respect to the parameters U, V, and W, and to learn the desired parameters using stochastic gradient descent

[Figure: unfolded RNN over three time steps with losses E_1, E_2, E_3]

Back Propagation Through Time (BPTT)
• To update the parameters on one training example (sequence), we sum up the gradients at each time step of the sequence:

  ∂E/∂W = Σ_t ∂E_t/∂W

[Figure: unfolded RNN over three time steps with losses E_1, E_2, E_3]
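The summation ∂E/∂W = Σ_t ∂E_t/∂W can be checked numerically. The sketch below (not the slides' derivation) perturbs one entry of W with finite differences and measures how each per-step loss E_t changes; the per-step gradients add up to the gradient of the total loss. It reuses U, V, W, `softmax`, `cross_entropy`, `xs`, and `targets` from the snippets above.

```python
# Finite-difference check of dE/dW[i,j] = sum_t dE_t/dW[i,j].
def per_step_losses(W_mat):
    h_t, losses = np.zeros(hidden_dim), []
    for x_t, y_t in zip(xs, targets):
        h_t = np.tanh(U @ x_t + W_mat @ h_t)
        losses.append(cross_entropy(softmax(V @ h_t), y_t))
    return np.array(losses)

eps, i, j = 1e-5, 0, 0
W_plus, W_minus = W.copy(), W.copy()
W_plus[i, j] += eps
W_minus[i, j] -= eps
dE_t = (per_step_losses(W_plus) - per_step_losses(W_minus)) / (2 * eps)  # dE_t/dW[i,j]
dE_dW_ij = dE_t.sum()     # entry (i, j) of dE/dW, summed over all time steps
```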
Learning Parameters

  h_t = tanh(U x_t + W h_{t-1})

• Write the pre-activation as z_t = U x_t + W h_{t-1}, so that h_t = tanh(z_t)
• Let
  α_k = ∂h_k/∂z_k = 1 − h_k²          (derivative of tanh)
  β_k = ∂E_k/∂h_k = (ŷ_k − y_k) V

Learning Parameters
• Gradient with respect to the hidden-hidden matrix W:

  ∂E_k/∂W = (∂E_k/∂h_k)(∂h_k/∂W) = β_k λ_k
  λ_k = ∂h_k/∂W = (∂h_k/∂z_k)(∂z_k/∂W) = α_k (h_{k-1} + W λ_{k-1})

• Gradient with respect to the input-hidden matrix U:

  ψ_k = ∂h_k/∂U = α_k (x_k + W ψ_{k-1}),   so   ∂E_k/∂U = β_k ψ_k
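Below is a minimal BPTT sketch in matrix form. It is an assumption-laden translation of the recursive terms above into a standard backward pass, not the slides' exact scalar derivation: it uses softmax outputs with cross-entropy so that ∂E_t/∂(V h_t) = ŷ_t − y_t, propagates the error through the tanh via the α_t factor, and accumulates the gradients for U, W, V over all time steps before a single SGD update. It reuses U, W, V, `softmax`, `xs`, and `targets` from the earlier snippets; the learning rate is a placeholder.

```python
# Backpropagation through time with gradient accumulation and one SGD step.
def bptt(xs, ys, h0, lr=0.1):
    # forward pass, storing hidden states and predictions
    hs, y_hats, h_t = [h0], [], h0
    for x_t in xs:
        h_t = np.tanh(U @ x_t + W @ h_t)
        hs.append(h_t)
        y_hats.append(softmax(V @ h_t))

    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(hidden_dim)             # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        d_out = y_hats[t] - ys[t]              # dE_t / d(V h_t)
        dV += np.outer(d_out, hs[t + 1])
        dh = V.T @ d_out + dh_next             # dE/dh_t: from the output and from t+1
        dz = (1.0 - hs[t + 1] ** 2) * dh       # through tanh: the alpha_t factor
        dU += np.outer(dz, xs[t])
        dW += np.outer(dz, hs[t])              # uses h_{t-1}, as in the recursion for W
        dh_next = W.T @ dz                     # pass the gradient back to h_{t-1}

    for P, dP in ((U, dU), (W, dW), (V, dV)):  # one stochastic gradient descent update
        P -= lr * dP
    return dU, dW, dV

dU, dW, dV = bptt(xs, targets, np.zeros(hidden_dim))
```

Repeated multiplication by W and by the tanh derivative inside this backward loop is exactly what leads to the exploding/vanishing gradient problem discussed next.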