Update Gate and GRU
◮ Update gate: $z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})$
◮ Reset gate: $r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})$
◮ Internal memory content: $\tilde{h}_t = \tanh(W x_t + r_t \circ U h_{t-1})$
◮ Final memory: $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
If $z_t = 0$:
◮ The previous hidden memory $h_{t-1}$ is ignored
◮ The current hidden memory $h_t$ depends only on the new memory content $\tilde{h}_t$
If $z_t = 1$:
◮ The previous hidden memory $h_{t-1}$ is copied
◮ The current input is ignored completely
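The equations above map directly onto code. Below is a minimal sketch (not from the slides) of one GRU step in NumPy, using the slide's naming ($W^{(z)}, U^{(z)}, W^{(r)}, U^{(r)}, W, U$) and its convention that $z_t = 1$ copies the previous state; the dimensions and random parameters are illustrative only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step following the slide's equations (z_t = 1 copies h_{t-1})."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))    # internal memory content
    return z_t * h_prev + (1.0 - z_t) * h_tilde        # final memory

# Toy dimensions: input size 3, hidden size 2 (illustrative values).
rng = np.random.default_rng(0)
d_x, d_h = 3, 2
Wz, Wr, W = (rng.normal(size=(d_h, d_x)) for _ in range(3))
Uz, Ur, U = (rng.normal(size=(d_h, d_h)) for _ in range(3))

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):                  # run over a short input sequence
    h = gru_step(x_t, h, Wz, Uz, Wr, Ur, W, U)
print(h)
```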
What is it for NLP?
◮ Short-term dependencies: older history should be ignored ($r_t \to 0$)
◮ Long-term dependencies: older history should be retained as much as possible ($z_t \to 1$)
Is RNN a special case of GRU?
Can you think of a case in which the RNN is a special case of the GRU?
◮ When $r_t = 1$ and $z_t = 0$:
◮ $\tilde{h}_t = \tanh(W x_t + r_t \circ U h_{t-1}) = \tanh(W x_t + U h_{t-1})$
◮ $h_t = 0 * h_{t-1} + 1 * \tilde{h}_t$
◮ $h_t = \tilde{h}_t$
◮ $W$ plays the role of $W_{xh}$ and $U$ plays the role of $W_{hh}$
◮ Back to the vanilla RNN... Is this a problem?
◮ $r_t = 1$ and $z_t = 0$ would have to hold at every time step, which has a very low chance of happening!
◮ So we are fine...
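As a quick numeric sanity check (a sketch, not part of the slides), pinning the gates to $r_t = 1$ and $z_t = 0$ makes the GRU update collapse to the vanilla RNN update $\tanh(W x_t + U h_{t-1})$:

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_h = 3, 2
W = rng.normal(size=(d_h, d_x))    # plays the role of W_xh
U = rng.normal(size=(d_h, d_h))    # plays the role of W_hh
x_t = rng.normal(size=d_x)
h_prev = rng.normal(size=d_h)

# GRU update with the gates pinned: r_t = 1, z_t = 0
r_t, z_t = np.ones(d_h), np.zeros(d_h)
h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))
h_gru = z_t * h_prev + (1.0 - z_t) * h_tilde

# Vanilla RNN update (bias omitted for simplicity)
h_rnn = np.tanh(W @ x_t + U @ h_prev)

assert np.allclose(h_gru, h_rnn)   # identical, as the derivation predicts
```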
Does GRU fix the vanishing gradient?
There are two ways to analyze this:
◮ What does the hidden state look like in a GRU after a few time steps?
◮ What does the gradient calculation look like?
◮ This requires the multivariate chain rule!
GRU: Hidden state calculation
We will expand $h_{20}$ back through $t = 19$. Again, all our matrices are $1 \times 1$ (scalars) to simplify.
◮ $h_{20} = z_{20} h_{19} + (1 - z_{20}) \tilde{h}_{20}$
◮ $h_{19} = z_{19} h_{18} + (1 - z_{19}) \tilde{h}_{19}$
◮ Therefore, $h_{20} = z_{20} z_{19} h_{18} + z_{20} (1 - z_{19}) \tilde{h}_{19} + (1 - z_{20}) \tilde{h}_{20}$
◮ If $z_{19} \to 0$, then $h_{18}$ is relevant only through $\tilde{h}_{19}$, and only partially (via the reset gate)
◮ If $z_{19} \to 1$, then $\tilde{h}_{19}$ is ignored completely; it is as if the input $x_{19}$ were ignored entirely
◮ In that case $h_{20} = z_{20} h_{18} + (1 - z_{20}) \tilde{h}_{20}$: there is a jump between time steps 18 and 20
◮ It is a shortcut across time
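For reference (not on the slides), the same unrolling can be written for an arbitrary gap between steps $k$ and $t$ in the scalar notation used above; this sketch follows directly from $h_t = z_t h_{t-1} + (1 - z_t)\tilde{h}_t$:

```latex
% General unrolling of h_t = z_t h_{t-1} + (1 - z_t) \tilde{h}_t back to step k (scalar case):
h_t \;=\; \Bigl(\prod_{j=k+1}^{t} z_j\Bigr) h_k
      \;+\; \sum_{j=k+1}^{t} \Bigl(\prod_{i=j+1}^{t} z_i\Bigr) (1 - z_j)\, \tilde{h}_j
% When every z_j is close to 1, the leading product stays close to 1,
% so h_k is carried forward almost unchanged: the "shortcut across time".
```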
Single variable chain rule
Question: If $z$ depends on $y$ and $y$ depends on $x$, what is $\frac{dz}{dx}$?
Answer: $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$
Example
◮ $z = \exp(y)$, $y = x^2$
◮ $\frac{dz}{dy} = \exp(y)$
◮ $\frac{dy}{dx} = 2x$
◮ $\frac{dz}{dx} = \exp(y) \times 2x$
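A quick numeric check of this example (a sketch, not from the slides): compare the analytic derivative $\exp(y) \times 2x = 2x \exp(x^2)$ against a central finite difference.

```python
import numpy as np

def z_of_x(x):
    return np.exp(x**2)        # z = exp(y) with y = x^2

x0, eps = 0.7, 1e-6
numeric = (z_of_x(x0 + eps) - z_of_x(x0 - eps)) / (2 * eps)   # central difference
analytic = np.exp(x0**2) * 2 * x0                             # dz/dy * dy/dx
assert abs(numeric - analytic) < 1e-6                         # they agree closely
```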
Multivariate chain rule (Very Important!)
Multivariate chain rule example
◮ $z = x + y$
◮ $x = t^2$, $y = \exp(t)$
◮ $\frac{dz}{dt} = \frac{\partial z}{\partial y} \frac{dy}{dt} + \frac{\partial z}{\partial x} \frac{dx}{dt}$
◮ $\frac{\partial z}{\partial y} = 1$
◮ $\frac{\partial z}{\partial x} = 1$
◮ $\frac{dy}{dt} = \exp(t)$
◮ $\frac{dx}{dt} = 2t$
◮ Finally, $\frac{dz}{dt} = \exp(t) + 2t$
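As a quick check (a sketch, not part of the slides), SymPy reproduces the same total derivative:

```python
import sympy as sp

t = sp.symbols('t')
x = t**2                # x(t)
y = sp.exp(t)           # y(t)
z = x + y               # z(x, y)

dz_dt = sp.diff(z, t)   # total derivative dz/dt
print(dz_dt)            # 2*t + exp(t)
assert sp.simplify(dz_dt - (sp.exp(t) + 2*t)) == 0
```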
Multivariate chain rule in summary
What is the derivative of $h_3$ with respect to $W_{hh}$ ($\frac{\partial h_3}{\partial W_{hh}}$)?
◮ Enumerate all paths from $W_{hh}$ to $h_3$ and sum the derivatives along them
◮ An arrow between two nodes stands for the derivative between those two nodes
1. $W_{hh} \to h_3$ directly
2. $W_{hh} \to h_2 \to h_3$
3. $W_{hh} \to h_1 \to h_2 \to h_3$
◮ $h_2 \to h_3$ means $\frac{\partial h_3}{\partial h_2}$
What does it mean?
What is the derivative of $h_3$ with respect to $W_{hh}$?
◮ $\frac{\partial h_3}{\partial W_{hh}}$ is the sum of the following terms:
1. $\frac{\partial h_3}{\partial W_{hh}}$ (the direct dependence)
2. $\frac{\partial h_3}{\partial h_2} \frac{\partial h_2}{\partial W_{hh}}$
3. $\frac{\partial h_3}{\partial h_2} \frac{\partial h_2}{\partial h_1} \frac{\partial h_1}{\partial W_{hh}}$
RNN expansion
Equations of the basic RNN:
◮ $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$
◮ $\hat{y}_t = W_{hy} h_t + b_y$
◮ $p_t = \mathrm{softmax}(\hat{y}_t)$
◮ $\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W_{hh}}$ (sum the loss over each time step $t$)
◮ $\frac{\partial L_t}{\partial W_{hh}} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W_{hh}}$ (each $h_t$ is computed using $W_{hh}$; apply the multivariate chain rule)
◮ $\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$ (each $h_j$ is immediately dependent on $h_{j-1}$; apply the single variable chain rule)
◮ $\frac{\partial h_j}{\partial h_{j-1}} = W_{hh} (1 - h_j^2)$ (the derivative of $\tanh(x)$ is $1 - \tanh(x)^2$)
◮ Repeated multiplication by $W_{hh}$ when $t \gg k$ causes vanishing or exploding gradients
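A minimal numeric illustration (a sketch, not from the slides): with scalar weights, the product $\prod_j W_{hh}(1 - h_j^2)$ that appears in $\frac{\partial h_t}{\partial h_k}$ shrinks rapidly as the gap $t - k$ grows (with a large enough $W_{hh}$ and an unsaturated state, the same product can instead grow, i.e. explode). The weight values here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=50)     # a random input sequence of length 50
w_xh, w_hh = 1.0, 0.5        # scalar "matrices", as in the 1x1 simplification

h, grad = 0.0, 1.0           # grad accumulates prod_j w_hh * (1 - h_j^2) = dh_t / dh_k
for x in xs:
    h = np.tanh(w_xh * x + w_hh * h)
    grad *= w_hh * (1.0 - h**2)

print(grad)                  # effectively zero: the gradient from the early steps has vanished
```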
GRU expansion
How does $\frac{\partial h_j}{\partial h_{j-1}}$ look in the case of the GRU?
◮ It is not as simple as in the case of the RNN. Why?
◮ The GRU has two gates that play a role in the computation of the hidden state $h_t$
◮ How does it look?
◮ $\left[ z \left( (1 - z) W^{(z)} (1 - h) + 1 \right) \right] + \left[ r (1 - z)(1 - h^2) \, U \left( 1 + h_{t-1} (1 - r) W^{(r)} \right) \right]$
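Rather than unpacking the closed form, a numeric sketch (assuming scalar weights and the slide's $z/(1-z)$ convention; the parameter values are arbitrary) can probe $\frac{\partial h_t}{\partial h_{t-1}}$ with a finite difference. When the update gate saturates toward 1, the derivative approaches 1, which is what lets gradients survive long gaps:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_h(h_prev, x, wz, uz, wr, ur, w, u):
    """Scalar GRU step in the slide's convention: z_t = 1 copies h_{t-1}."""
    z = sigmoid(wz * x + uz * h_prev)
    r = sigmoid(wr * x + ur * h_prev)
    h_tilde = np.tanh(w * x + r * u * h_prev)
    return z * h_prev + (1.0 - z) * h_tilde

x, h_prev, eps = 0.4, 0.1, 1e-6
for wz in (0.0, 25.0):       # small vs. near-saturating update gate
    f = lambda h: gru_h(h, x, wz, 0.0, 0.3, -0.2, 0.8, 0.5)
    d = (f(h_prev + eps) - f(h_prev - eps)) / (2 * eps)   # dh_t / dh_{t-1}
    print(wz, d)             # as z_t -> 1 (wz = 25), the derivative approaches 1
```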