Deep Learning
Recurrent Networks
10/16/2017
Which open source project is this? What is this math talking about? And a Wikipedia page explaining it all: the unreasonable effectiveness of recurrent neural networks. All previous examples:
– These are “Time delay” neural nets, AKA convnets
– These are recurrent neural networks
[Figure: a window of stock vectors X(t), X(t+1), …, X(t+7) feeding a finite-memory network that outputs Y(t+6)]
[Figure: an RNN unrolled over time, with inputs X(t), outputs Y(t), and initial state h(-1) at t = 0]
The MLP solution for adding two N-bit numbers:
– Input is binary; a large number of training instances is required
– A network trained for N-bit numbers will not work for (N+1)-bit numbers
[Figure: the bits of the two numbers presented simultaneously to an MLP that outputs the sum bits]
[Figure: a single RNN unit adds one pair of bits per step, taking in the previous carry and passing the new carry forward]
– The MLP needs at least one hidden layer of size N plus an output neuron, and a fixed input size
– The RNN unit is reused at every bit position, so it handles numbers of any length (see the sketch below)
[Figure: the fixed-size MLP adder vs. the single recurrent one-bit adder unit]
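A minimal sketch (not from the slides) of the serial-addition idea: one small recurrent unit is applied to one bit pair per time step, least-significant bit first, and the carry plays the role of the hidden state. The unit here is hard-coded rather than learned, purely to show why the same unit generalizes to numbers of any length; the function names are illustrative.

```python
def rnn_add_step(bit_a, bit_b, carry):
    """One time step: consume a pair of bits plus the carried state."""
    s = bit_a + bit_b + carry
    return s % 2, s // 2               # (output bit, new hidden state / carry)

def serial_add(a_bits, b_bits):
    """Add two equal-length bit lists, given least-significant bit first."""
    carry, out = 0, []
    for xa, xb in zip(a_bits, b_bits):
        y, carry = rnn_add_step(xa, xb, carry)
        out.append(y)
    return out + [carry]               # the final carry becomes the top bit

# 6 + 7 = 13: the same unit works for any bit length, unlike the fixed-input MLP
print(serial_add([0, 1, 1], [1, 1, 1]))    # -> [1, 0, 1, 1], i.e. binary 1101
```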
[Figure: training the unrolled RNN; a divergence is computed between the outputs Y(t) and the desired outputs Ydesired(t)]
[Figure: RNN architectures with recurrence at one or more hidden layers, with initial states h-1, h-2, h-3]
– The activation f() has bounded output for bounded input
– So Y(t) is bounded
– This is a highly desirable characteristic
– What if the activations are linear?
– Will attempt to extrapolate to non-linear systems subsequently
– With a linear activation: $z_k = W_h h_{k-1} + W_x x_k$ and $h_k = z_k$
– Expanding the recursion by one step: $h_{k-1} = W_h h_{k-2} + W_x x_{k-1}$, so
  $h_k = W_h^2 h_{k-2} + W_h W_x x_{k-1} + W_x x_k$
– Continuing all the way back to the initial state $h_{-1}$:
  $h_k = W_h^{k+1} h_{-1} + W_h^{k} W_x x_0 + W_h^{k-1} W_x x_1 + W_h^{k-2} W_x x_2 + \cdots$
– I.e. the response is the sum of the responses to the individual inputs:
  $h_k = h_{-1} H_k(-1) + x_0 H_k(0) + x_1 H_k(1) + x_2 H_k(2) + \cdots$
– For the linear recursion $h(t) = W h(t-1) + C x(t)$, the response to a single input at $t = 0$ is $h_0(t) = W^t C\, x(0)$
– Expanding $W^t h$ in terms of the eigenvectors $u_i$ and eigenvalues $\lambda_i$ of $W$:
  $W^t h = a_1 \lambda_1^t u_1 + a_2 \lambda_2^t u_2 + \cdots + a_n \lambda_n^t u_n$
  $\lim_{t \to \infty} W^t h = a_m \lambda_m^t u_m$, where $m = \arg\max_k |\lambda_k|$
– For any input, for large $t$ the length of the hidden vector will expand or contract according to the $t$-th power of the largest eigenvalue of the hidden-layer weight matrix
  – Unless the input has no component along the eigenvector corresponding to the largest eigenvalue; in that case it will grow according to the second largest eigenvalue, and so on
– If $|\lambda_{max}| > 1$ the response will blow up; otherwise it will contract and shrink to 0 rapidly
– What about at middling values of $t$? It will depend on the other eigenvalues as well
[Plots: response magnitude over time for $\lambda_{max} = 0.9$, $1$ and $1.1$; with complex eigenvalues (second eigenvalue 0.5 or 0.1) the response also oscillates]
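A small numerical check (not from the slides) of the eigenvalue argument above: the norm of $W^t h$ grows or shrinks roughly like $|\lambda_{max}|^t$. The matrix size and the three spectral radii are assumed example values, chosen to match the plotted cases.

```python
import numpy as np

rng = np.random.default_rng(0)
for radius in (0.9, 1.0, 1.1):                            # example values, as in the plots
    W = rng.standard_normal((8, 8))
    W *= radius / np.max(np.abs(np.linalg.eigvals(W)))    # set the largest |eigenvalue|
    h = rng.standard_normal(8)
    for t in range(60):
        h = W @ h                                         # the linear recursion, no new input
    print(f"|lambda_max| = {radius}: ||W^60 h|| = {np.linalg.norm(h):.3e}")
```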
– Symmetric weight matrix
– Sigmoid: saturates in a limited number of steps, regardless of $w$
– Tanh: sensitive to $w$, but eventually saturates
– ReLU: sensitive to $w$, can blow up
– ReLU with negative starting values: has no response
– Vector case, initial vector $(1, 1, 1, \ldots)/N$: behavior is similar to the scalar recursion; interestingly, ReLU is more prone to blowing up (why?)
[Plots: sigmoid, tanh and ReLU responses]
– Initial vector $(-1, -1, -1, \ldots)/N$: again similar to the scalar recursion, and again ReLU is more prone to blowing up (why?)
[Plots: sigmoid, tanh and ReLU responses]
– Formal stability analysis would use Lyapunov functions: positive definite functions evaluated at $h$
– Alternately, Routh's criterion and/or pole-zero analysis
– Conclusions are similar: only the tanh activation gives us any reasonable behavior
– Bipolar activations (e.g. tanh) have the best behavior
– Still sensitive to the eigenvalues of $W$
– Best-case memory is short
– Exponential memory behavior
$h(t) = 0.5\,h(t-1) + 0.25\,h(t-5) + x(t)$
$h(t) = 0.5\,h(t-1) + 0.25\,h(t-5) + 0.1\,h(t-8) + x(t)$
– On board?
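A quick sketch (illustrative, not from the slides) of the multi-tap recursions written above: extra taps lengthen the memory, but the influence of an old input still dies away over time.

```python
def multi_tap(x, taps):
    """h(t) = sum_k w_k * h(t - d_k) + x(t), for taps given as (delay, weight) pairs."""
    h = []
    for t, xt in enumerate(x):
        h.append(xt + sum(w * h[t - d] for d, w in taps if t - d >= 0))
    return h

impulse = [1.0] + [0.0] * 19                       # a single input at t = 0
print(multi_tap(impulse, [(1, 0.5), (5, 0.25)]))   # h(t) = 0.5 h(t-1) + 0.25 h(t-5) + x(t)
```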
$Y = f_N\big(W_{N-1}\, f_{N-1}\big(W_{N-2}\, f_{N-2}\big(\cdots W_0 X\big)\big)\big)$
– $W_k$ is the weights matrix at the $k$-th layer
$Div(X) = D\big(f_N\big(W_{N-1}\, f_{N-1}\big(W_{N-2}\, f_{N-2}\big(\cdots W_0 X\big)\big)\big)\big)$
[Figure: the layered network with weight matrices W0, W1, W2, …]
– For a vector activation: $\nabla_X f = \nabla_z f \cdot W$, where $z = W X$
– $\nabla_z f$ is the Jacobian matrix of $f(z)$ w.r.t. $z$
– (Poor notation.)
– The gradient of the divergence w.r.t. the output of the $k$-th layer of the network:
  $\nabla_{f_k} Div = \nabla D \cdot \nabla f_N \cdot W_{N-1} \cdot \nabla f_{N-1} \cdot W_{N-2} \cdots \nabla f_{k+1} W_k$
– Each $\nabla f_k$ is the Jacobian of $f_k()$ w.r.t. its current input
– All of the Jacobian and weight terms here are matrices
– $\nabla f()$ is the derivative of the output of the (layer of) hidden units w.r.t. their pre-activation input, e.g. $h_i^{(1)}(t) = f_1\big(z_i^{(1)}(t)\big)$
– For scalar (element-wise) activations this Jacobian is diagonal:
  $\nabla_z f(z) = \mathrm{diag}\big(f'(z_1),\ f'(z_2),\ \ldots,\ f'(z_N)\big)$
– The diagonal entries of the Jacobian are bounded
– The most common activations (sigmoid, tanh) have derivatives that are always less than 1; the derivative of tanh() never exceeds 1
$\nabla_{f_k} Div = \nabla D \cdot \nabla f_N \cdot W_{N-1} \cdot \nabla f_{N-1} \cdot W_{N-2} \cdots \nabla f_{k+1} W_k$
– This is a product of many Jacobians and weight matrices; with bounded activation derivatives the product rapidly shrinks toward zero (or, with large weights, blows up), so $\nabla_{f_k} Div$ will vanish or explode
– Resulting in insignificant or unstable gradient descent updates
– Problem gets worse as network depth increases
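A toy illustration (not from the slides) of the vanishing-gradient argument: multiply diagonal tanh Jacobians and modest weight matrices over many layers (or time steps) and watch the gradient norm shrink. The dimensions and weight scale are assumptions chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, depth = 64, 50
grad = rng.standard_normal(dim)                 # gradient arriving at the top layer
for k in range(depth):
    W = rng.standard_normal((dim, dim)) * 0.05  # assumed small initial weights
    z = rng.standard_normal(dim)
    J = np.diag(1.0 - np.tanh(z) ** 2)          # Jacobian of tanh: diagonal, entries <= 1
    grad = grad @ (J @ W)                       # one more factor in the chain-rule product
    if (k + 1) % 10 == 0:
        print(f"after {k + 1:2d} layers: ||grad|| = {np.linalg.norm(grad):.3e}")
```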
– Typical gradient magnitudes in a deep network, for different activations: exponential linear units (ELU), ReLU, sigmoid, tanh
– Each layer is 1024 neurons wide
– Gradients shown at initialization
– The quantity plotted is $\|\nabla_{W_{neuron}} E\|$, where $W_{neuron}$ is the vector of incoming weights to each neuron, i.e. the gradient of the loss w.r.t. the entire set of weights to each neuron
[Plots: gradient magnitude from the output layer down to the input layer for ELU, ReLU, sigmoid and tanh activations; batch gradients, and (for ELU) gradients for individual instances]
– The same chain applies across time in an RNN:
  $\nabla_{f_k} Div = \nabla D \cdot \nabla f_N \cdot W_{N-1} \cdot \nabla f_{N-1} \cdot W_{N-2} \cdots \nabla f_{k+1} W_k$
– Gradients from errors at $t = T$ will vanish by the time they're propagated back to $t = 0$
[Figure: an RNN unrolled from t = 0 to t = T, with inputs X(0) … X(T), outputs Y(0) … Y(T), and initial state h(-1)]
– The network must detect an early pattern and recall it when a related pattern (pattern 2) arrives much later
– The RNN will "forget" pattern 1 if the intermediate stuff is too long
– E.g. having seen "Jane", the next pronoun referring to her will be "she"; the network must retain this and produce it when necessary
– This can be performed with a multi-tap recursion, but how many taps?
– We need an alternate way to "remember" stuff
– Compare the RNN gradient
  $\nabla_{f_k} Div = \nabla D \cdot \nabla f_N \cdot W_{N-1} \cdot \nabla f_{N-1} \cdot W_{N-2} \cdots \nabla f_{k+1} W_k$
  with the gradient through a gated memory:
  $\nabla_{f_k} Div = \nabla D \cdot \sigma_N \cdot \sigma_{N-1} \cdots \sigma_k$
– No weights, no nonlinearities
– The only scaling is through the $\sigma$ "gating" terms, which capture other triggers, e.g. "Have I seen pattern 2?"
[Figure: a memory value carried across time, h(t) → h(t+1) → h(t+2) → …, scaled at each step by a gate σ(t+1), σ(t+2), …; the gates are driven by the inputs X(t+1), X(t+2), … and "other stuff"]
– As mentioned earlier, tanh() is the generally used activation for the hidden layer
– And addition of history, which too is gated
– The cell must decide whether to carry the history over or forget it
– More precisely, how much of the history to carry over; this is also called the "forget" gate
– Note: we're actually distinguishing between the cell memory $C$ and the state $h$ that is carried over time; they're related, though
– A perceptron layer determines if there's something interesting in the input
– A gate decides if it's worth remembering
– If so, it's added to the current memory cell
– Simply compress it with tanh so it lies between -1 and 1
– While we're at it, let's toss in an output gate to control how much of the memory is passed forward
[Figure: the LSTM cell across two time steps. Inputs: the current input $x_t$, the previous state $h_{t-1}$ and the previous cell memory $C_{t-1}$. Sigmoid (σ) gates: forget $f_t$, input $i_t$, output $o_t$; a tanh layer produces the candidate memory $\tilde{C}_t$. Outputs: the new memory $C_t$, the new state $h_t$ and the cell output $z_t$. The same cell repeats at time $t+1$.]
Backpropagation through the LSTM cell chains the divergence gradient through both the cell memory $C_t$ and the state $h_t$ (the slides build these expressions up term by term; $\sigma'$ and $\tanh'$ are evaluated at the corresponding gate and candidate pre-activations):

$\nabla_{C_t} Div = \nabla_{h_t} Div \circ \big( o_t \circ \tanh'(\cdot)\,W_{Ch} + \tanh(\cdot) \circ \sigma'(\cdot)\,W_{Co} \big) + \nabla_{C_{t+1}} Div \circ \big( f_{t+1} + C_t \circ \sigma'(\cdot)\,W_{Cf} + \tilde{C}_{t+1} \circ \sigma'(\cdot)\,W_{Ci} \big)$

$\nabla_{h_t} Div = \nabla_{z_t} Div\, \nabla_{h_t} z_t + \nabla_{C_{t+1}} Div \circ \big( C_t \circ \sigma'(\cdot)\,W_{hf} + \tilde{C}_{t+1} \circ \sigma'(\cdot)\,W_{hi} + i_{t+1} \circ \tanh'(\cdot)\,W_{hC} \big) + \nabla_{h_{t+1}} Div \circ \tanh(\cdot) \circ \sigma'(\cdot)\,W_{ho}$

– The $C_t$ gradient collects contributions through the output path ($o_t$ and the tanh applied to the memory) and through the next step's memory update $C_{t+1} = f_{t+1} \circ C_t + i_{t+1} \circ \tilde{C}_{t+1}$
– The $h_t$ gradient collects contributions from the output $z_t$ at time $t$ and from the gates and candidate at time $t+1$, which depend on $h_t$ through the weights $W_{hf}$, $W_{hi}$, $W_{hC}$ and $W_{ho}$
[Figure: the same two-cell diagram, with the gradient flowing backward from $h_{t+1}$, $C_{t+1}$ and $z_t$ into $h_t$ and $C_t$]
– Pointless computation!
– The input gate determines how much of the new information will be let through to the memory cell.
– The forget gate determines what should be thrown away from the memory cell.
– The output gate determines which part of the memory cell will be exposed to the next time step.
[Figure: the repeating module in a standard RNN vs. the LSTM memory cell]
– It's also possible to have MLP feed-forward layers between the hidden layers.
[Figure: a bidirectional RNN. A forward net with initial state hf(-1) runs over X(0) … X(T) left to right, a backward net with initial state hb(∞) runs right to left, and both contribute to the outputs Y(0) … Y(T)]
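A minimal sketch (not from the slides) of the bidirectional idea in the figure above: one recurrent pass runs left to right over the input, another runs right to left, and the output at each time step can see both. The simple tanh cell and all names are illustrative assumptions.

```python
import numpy as np

def simple_rnn_step(x_t, h_prev, W_h, W_x):
    return np.tanh(W_h @ h_prev + W_x @ x_t)

def bidirectional_states(X, W_h, W_x, H):
    T = len(X)
    hf, hb = np.zeros(H), np.zeros(H)        # initial states hf(-1) and hb(T+1)
    forward, backward = [], [None] * T
    for t in range(T):                       # left-to-right pass
        hf = simple_rnn_step(X[t], hf, W_h, W_x)
        forward.append(hf)
    for t in reversed(range(T)):             # right-to-left pass
        hb = simple_rnn_step(X[t], hb, W_h, W_x)
        backward[t] = hb
    # each output position now sees both past and future context
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

rng = np.random.default_rng(0)
H, D, T = 4, 3, 6
states = bidirectional_states(rng.standard_normal((T, D)),
                              rng.standard_normal((H, H)) * 0.5,
                              rng.standard_normal((H, D)) * 0.5, H)
print(len(states), states[0].shape)          # 6 time steps, 8-dimensional combined state
```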
– Often no clear synchrony between input and desired output
– Obvious choices for divergence may not be differentiable (e.g. edit distance)
– All boxes are scalar – Sigmoid activations are commonly used in the hidden layer(s)
– NARX networks are less susceptible to vanishing gradients than conventional RNNs – Training often uses methods other than backprop/gradient descent, e.g. simulated annealing or genetic algorithms
at two meters above ground level and the mean hourly temperature recorded during seven years, from 2002 to 2008
Inputs may use past predicted output values, past true values, or the past error in prediction
Four score and seven years ??? A B R A H A M L I N C O L ??
– Pre-specify a vocabulary of N words in a fixed (e.g. lexical) order
– Represent each word by an N-dimensional vector with N-1 zeros and a single 1 (in the position of the word in the ordered vocabulary)
– English will require about 100 characters, to include both cases, special characters such as commas, hyphens, apostrophes, etc., and the space character
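A small sketch of the one-hot representation described above; the toy vocabulary and the fixed lexical ordering are made up for illustration.

```python
import numpy as np

vocab = sorted(["and", "four", "score", "seven", "years"])   # fixed lexical order
word_to_index = {w: i for i, w in enumerate(vocab)}
N = len(vocab)

def one_hot(word):
    v = np.zeros(N)
    v[word_to_index[word]] = 1.0            # a single 1 in the word's position
    return v

print(one_hot("seven"))                      # -> [0. 0. 0. 1. 0.]
```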
– Given words $W_1 \ldots W_{n-1}$, predict the next word $W_n$
– The inputs $W_1 \ldots W_{n-1}$ and the output $W_n$ are both one-hot (N×1) vectors
– $W_n = f(W_1, \ldots, W_{n-1})$
[Figure: "Four score and seven years ???": the one-hot vectors $W_1, W_2, \ldots, W_{n-1}$ feed a network $f()$ that outputs $W_n$]
– The one-hot representation uses only N of the $2^N$ corners of the unit cube
– Actual volume of space used = 0
– Density of points: $N / 2^N$
[Figure: the one-hot vectors (1,0,0), (0,1,0), (0,0,1) as corners of the unit cube]
– The representation also ignores the relative importance of words:
– All word vectors are the same length
– The distance between every pair of words is the same
[Figure: (1,0,0), (0,1,0), (0,0,1), all equidistant]
– Project the one-hot points down to a lower-dimensional (M-dimensional) subspace
– The volume used is still 0, but the density can go up by many orders of magnitude: $N / 2^M$
– If properly learned, the distances between projected points will capture semantic relations between the words
[Figure: the corners (1,0,0), (0,1,0), (0,0,1) projected onto a lower-dimensional plane]
– Replace every one-hot vector $W_j$ by $P W_j$
– $P$ is an $M \times N$ matrix
– $P W_j$ is now an $M$-dimensional vector
– Learn $P$ using an appropriate objective
– $W_n = f(P W_1, P W_2, \ldots, P W_{n-1})$
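A sketch of the projection idea above: multiplying the M × N matrix P by a one-hot vector simply selects one column of P, so P's columns are the learned M-dimensional word representations. The random P here is purely for illustration; in practice it is learned with the network.

```python
import numpy as np

N, M = 5, 2                                  # vocabulary size, embedding size (assumed)
rng = np.random.default_rng(0)
P = rng.standard_normal((M, N))              # stand-in for the learned projection

w_j = np.zeros(N); w_j[3] = 1.0              # one-hot vector for word number 3
embedding = P @ w_j                          # the M-dimensional representation P W_j
assert np.allclose(embedding, P[:, 3])       # identical to looking up column 3 of P
print(embedding)
```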
– The same projection $P$ is applied to every input word (tied weights): $W_n = f(P W_1, P W_2, \ldots, P W_{n-1})$
[Figure: the prediction network with each one-hot input $W_j$ replaced by its M-dimensional projection $P W_j$; every position shares the same $P$]
– “A neural probabilistic language model”, Bengio et al. 2003 – Hidden layer has Tanh() activation, output is softmax
– In the process, the network learns representations $P W$ of the words
[Figure: projected context words $P W_1 \ldots P W_9$ used to predict the following words $W_5 \ldots W_{10}$]
– The context words can also be combined without considering their specific position
[Figure: mean pooling of the projected context words $P W_j$ (sharing the same $P$) to predict each word, e.g. $W_4$]
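A sketch of the mean-pooling idea above: average the projected context words, with no regard to position, and predict the target word from the pooled vector. The softmax output layer and the random weights are illustrative assumptions.

```python
import numpy as np

N, M = 5, 2                                  # vocabulary size, embedding size (assumed)
rng = np.random.default_rng(0)
P = rng.standard_normal((M, N))              # shared projection (tied weights)
V = rng.standard_normal((N, M))              # output layer scoring each word

def predict_from_context(context_indices):
    context = P[:, context_indices].mean(axis=1)   # mean-pool the context embeddings
    scores = V @ context
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                   # an N-valued probability distribution

print(predict_from_context([0, 2, 3, 4]))    # context words given by vocabulary index
```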
– Color indicates shared parameters
[Figure: the same predictor applied at every position; the projections $P W_1 \ldots P W_9$ predict the words $W_2 \ldots W_{10}$, with the projection parameters shared throughout]
– Feed in an initial sequence of words as one-hot vectors (a seed)
– The network outputs an N-valued probability distribution over words rather than a one-hot vector
– Pick a word from it and set it as the next word in the series
– Feed the predicted word back in and draw the next word from the output probability distribution
– Continue until some termination criterion; in some cases, e.g. generating programs, there may be a natural termination
[Figure: generation by feeding each predicted word ($W_4$, $W_5$, …) back in through the projection $P$ as the next input]
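A sketch of the generation loop described above: repeatedly feed the model's own prediction back in and draw the next word from the output distribution. Here predict_next is only a stand-in for the trained network (it returns a random distribution), and the vocabulary and stop token are made up.

```python
import numpy as np

vocab = ["four", "score", "and", "seven", "years", "ago", "<stop>"]
rng = np.random.default_rng(0)

def predict_next(history):
    """Placeholder for the language model: returns P(next word | history)."""
    probs = rng.random(len(vocab))
    return probs / probs.sum()               # an N-valued probability distribution

generated = ["four", "score", "and", "seven", "years"]   # the seed sequence
for _ in range(20):                          # or stop at a natural terminator
    probs = predict_next(generated)
    next_word = rng.choice(vocab, p=probs)   # draw from the distribution (or take argmax)
    if next_word == "<stop>":
        break
    generated.append(next_word)
print(" ".join(generated))
```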
Trained on Linux source code; it actually uses a character-level model (predicts character sequences)
http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/
[Figure: an RNN running over the input X(t) from t = 0 and emitting a sequence of output symbols]
– Future lecture – Also homework
Translating Videos to Natural Language Using Deep Recurrent Neural Networks Subhashini Venugopalan, Huijun Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko North American Chapter of the Association for Computational Linguistics, Denver, Colorado, June 2015.
– Can use causal (one-direction) or non-causal (bidirectional) context to make predictions – Potentially Turing complete
– Can “hold” memory for arbitrary durations
– Language modelling
– Machine translation – Speech recognition – Time-series prediction – Stock prediction – Many others..