
INF5820: Language technological applications. Gated RNNs (3:2)
Taraka Rama, University of Oslo, 30 October 2018
Agenda: GRU; LSTM; connections between them; an analysis of why GRUs address the vanishing gradient problem. Break only for 5 minutes; the lecture ends at 3:45.


  1. Update Gate and GRU
  ◮ Update gate: $z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})$
  ◮ Reset gate: $r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})$
  ◮ Internal memory content: $\tilde{h}_t = \tanh(W x_t + r_t \circ U h_{t-1})$
  ◮ Final memory: $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
  ◮ If $z_t = 0$: the previous hidden memory $h_{t-1}$ is ignored, and the current hidden memory $h_t$ depends only on the new memory content $\tilde{h}_t$.
  ◮ If $z_t = 1$: the previous hidden memory $h_{t-1}$ is copied, and the current input is ignored completely.
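
To make the update concrete, here is a minimal NumPy sketch of a single GRU step in the scalar (1×1) case used throughout these slides. The `gru_step` function name and all weight values are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step following the slide equations (scalar weights for simplicity)."""
    z_t = sigmoid(Wz * x_t + Uz * h_prev)          # update gate
    r_t = sigmoid(Wr * x_t + Ur * h_prev)          # reset gate
    h_tilde = np.tanh(W * x_t + r_t * U * h_prev)  # internal memory content
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde     # final memory
    return h_t

# Illustrative values only.
h = 0.0
for x in [0.5, -1.0, 0.3]:
    h = gru_step(x, h, Wz=0.4, Uz=0.3, Wr=0.2, Ur=0.1, W=0.7, U=0.5)
    print(h)
```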

  2. What is it for NLP?
  ◮ Short-term dependencies mean that older history should be ignored ($r_t \to 0$).
  ◮ Long-term dependencies mean that older history should be retained as much as possible ($z_t \to 1$).

  3. Is RNN a special case of GRU?
  Can you think of when the RNN is a special case of the GRU?
  ◮ When $r_t = 1$ and $z_t = 0$:
  ◮ $\tilde{h}_t = \tanh(W x_t + r_t \circ U h_{t-1}) = \tanh(W x_t + U h_{t-1})$
  ◮ $h_t = 0 \cdot h_{t-1} + 1 \cdot \tilde{h}_t$
  ◮ $h_t = \tilde{h}_t$
  ◮ $W$ is $W_{xh}$ and $U$ is $W_{hh}$.
  ◮ Back to the vanilla RNN... Is that a problem?
  ◮ $r_t = 1$ and $z_t = 0$ would have to hold at every time step, which has a very low chance of happening!
  ◮ So we are fine... (a quick check of this reduction follows below)
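
As a sanity check, the following sketch forces $r_t = 1$ and $z_t = 0$ and compares the result with a vanilla RNN step. Function names and values are illustrative assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh):
    return np.tanh(W_xh * x_t + W_hh * h_prev)

def gru_step_forced(x_t, h_prev, W, U, r_t=1.0, z_t=0.0):
    """GRU step with the gates fixed by hand instead of computed."""
    h_tilde = np.tanh(W * x_t + r_t * U * h_prev)
    return z_t * h_prev + (1.0 - z_t) * h_tilde

x, h = 0.5, 0.2
print(rnn_step(x, h, W_xh=0.7, W_hh=0.5))
print(gru_step_forced(x, h, W=0.7, U=0.5))  # identical: the GRU collapses to the RNN
```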

  4. Does the GRU fix the vanishing gradient?
  Two ways to analyze this:
  ◮ What does the hidden state look like in a GRU after a few time steps?
  ◮ What does the gradient calculation look like?
  ◮ The latter requires the multivariate chain rule!

  5. GRU: Hidden state calculation
  We expand $h_t$ from $t = 20$ back to $t = 18$. Again, all our matrices are $1 \times 1$ to simplify.
  ◮ $h_{20} = z_{20} h_{19} + (1 - z_{20}) \tilde{h}_{20}$
  ◮ $h_{19} = z_{19} h_{18} + (1 - z_{19}) \tilde{h}_{19}$
  ◮ Therefore, $h_{20} = z_{20} z_{19} h_{18} + z_{20} (1 - z_{19}) \tilde{h}_{19} + (1 - z_{20}) \tilde{h}_{20}$
  ◮ If $z_{19} \to 0$, only part of $h_{18}$ remains relevant, through $\tilde{h}_{19}$.
  ◮ If $z_{19} \to 1$, $\tilde{h}_{19}$ is ignored completely; it is as if the input $x_{19}$ were ignored.
  ◮ There is then a jump between time steps 18 and 20.
  ◮ It is a shortcut across time. (The expansion is checked numerically in the sketch below.)
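
A tiny numerical check of the expansion, under the same scalar assumption; the gate and memory values below are made up.

```python
import numpy as np

# Arbitrary scalar values for the gates and memory contents (illustrative only).
z20, z19 = 0.8, 0.3
h18, h19_tilde, h20_tilde = 0.1, -0.4, 0.6

# Step-by-step recurrence.
h19 = z19 * h18 + (1 - z19) * h19_tilde
h20 = z20 * h19 + (1 - z20) * h20_tilde

# Expanded form from the slide.
h20_expanded = z20 * z19 * h18 + z20 * (1 - z19) * h19_tilde + (1 - z20) * h20_tilde

print(h20, h20_expanded)          # identical values
assert np.isclose(h20, h20_expanded)
```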

  6. Single variable chain rule
  Question: If $z$ is dependent on $y$ and $y$ is dependent on $x$, what is $\frac{dz}{dx}$?
  Answer: $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$
  Example:
  ◮ $z = \exp(y)$, $y = x^2$
  ◮ $\frac{dz}{dy} = \exp(y)$
  ◮ $\frac{dy}{dx} = 2x$
  ◮ $\frac{dz}{dx} = \exp(y) \times 2x = 2x \exp(x^2)$
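
A quick finite-difference check of this example; the test point and step size are arbitrary choices.

```python
import numpy as np

def z_of_x(x):
    y = x ** 2
    return np.exp(y)

x0, eps = 0.7, 1e-6
numeric = (z_of_x(x0 + eps) - z_of_x(x0 - eps)) / (2 * eps)  # central difference
analytic = np.exp(x0 ** 2) * 2 * x0                           # dz/dx from the chain rule
print(numeric, analytic)  # the two values agree to several decimal places
```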

  7. Multivariate chain rule (Very Important!)
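
The body of this slide did not survive extraction. For reference, the rule it introduces (stated here from standard calculus, not taken from the slide) is, for $z = f(x(t), y(t))$:

$$\frac{dz}{dt} = \frac{\partial z}{\partial x}\,\frac{dx}{dt} + \frac{\partial z}{\partial y}\,\frac{dy}{dt}$$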

  8. Multivariate chain rule example
  ◮ $z = x + y$
  ◮ $x = t^2$, $y = \exp(t)$
  ◮ $\frac{dz}{dt} = \frac{\partial z}{\partial y} \frac{dy}{dt} + \frac{\partial z}{\partial x} \frac{dx}{dt}$
  ◮ $\frac{\partial z}{\partial y} = 1$
  ◮ $\frac{\partial z}{\partial x} = 1$
  ◮ $\frac{dy}{dt} = \exp(t)$
  ◮ $\frac{dx}{dt} = 2t$
  ◮ Finally, $\frac{dz}{dt} = \exp(t) + 2t$
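
The same kind of finite-difference check works here; again the test point and step size are arbitrary.

```python
import numpy as np

def z_of_t(t):
    x = t ** 2
    y = np.exp(t)
    return x + y

t0, eps = 1.3, 1e-6
numeric = (z_of_t(t0 + eps) - z_of_t(t0 - eps)) / (2 * eps)
analytic = np.exp(t0) + 2 * t0   # from the multivariate chain rule
print(numeric, analytic)         # agree to several decimal places
```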

  9. Multivariate chain rule in summary
  What is the derivative of $h_3$ with respect to $W_{hh}$, i.e. $\frac{\partial h_3}{\partial W_{hh}}$?
  ◮ Enumerate all paths from $W_{hh}$ to $h_3$ and sum the derivatives along them.
  ◮ An arrow between two nodes stands for the derivative between those two nodes.
  1. $W_{hh} \to h_3$ directly
  2. $W_{hh} \to h_2 \to h_3$
  3. $W_{hh} \to h_1 \to h_2 \to h_3$
  ◮ $h_2 \to h_3$ means $\frac{\partial h_3}{\partial h_2}$.

  10. What does it mean?
  What is the derivative of $h_3$ with respect to $W_{hh}$?
  ◮ $\frac{\partial h_3}{\partial W_{hh}}$ is the sum of the following (where each partial with respect to $W_{hh}$ is the immediate one, treating earlier hidden states as fixed):
  1. $\frac{\partial h_3}{\partial W_{hh}}$
  2. $\frac{\partial h_3}{\partial h_2} \frac{\partial h_2}{\partial W_{hh}}$
  3. $\frac{\partial h_3}{\partial h_2} \frac{\partial h_2}{\partial h_1} \frac{\partial h_1}{\partial W_{hh}}$
  A numerical check of this decomposition follows below.
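
A minimal sketch of the decomposition for a scalar RNN; the weights, inputs, and the `unroll` helper are illustrative assumptions. It compares the three-path sum with a finite-difference estimate of the full derivative.

```python
import numpy as np

# Scalar RNN, as in the slides: h_t = tanh(W_xh * x_t + W_hh * h_{t-1}).
W_xh, W_hh = 0.6, 0.9
xs = [0.5, -0.2, 0.8]
h0 = 0.1

def unroll(W_hh):
    h = h0
    hs = []
    for x in xs:
        h = np.tanh(W_xh * x + W_hh * h)
        hs.append(h)
    return hs

h1, h2, h3 = unroll(W_hh)

# The three path terms: immediate partials w.r.t. W_hh times the chain of dh_j/dh_{j-1}.
d_h3_direct = (1 - h3 ** 2) * h2              # W_hh -> h3 directly
d_h3_dh2 = (1 - h3 ** 2) * W_hh
d_h2_direct = (1 - h2 ** 2) * h1
d_h2_dh1 = (1 - h2 ** 2) * W_hh
d_h1_direct = (1 - h1 ** 2) * h0
analytic = d_h3_direct + d_h3_dh2 * d_h2_direct + d_h3_dh2 * d_h2_dh1 * d_h1_direct

# Finite-difference estimate of the total derivative dh3/dW_hh.
eps = 1e-6
numeric = (unroll(W_hh + eps)[-1] - unroll(W_hh - eps)[-1]) / (2 * eps)
print(analytic, numeric)   # the two values agree
```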

  11. RNN expansion
  Equations of the basic RNN:
  ◮ $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$
  ◮ $\hat{y}_t = W_{hy} h_t + b_y$
  ◮ $p_t = \mathrm{softmax}(\hat{y}_t)$
  ◮ $\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W_{hh}}$ (sum the loss over each time step $t$)
  ◮ $\frac{\partial L_t}{\partial W_{hh}} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W_{hh}}$ (each $h_t$ is computed using $W_{hh}$; apply the multivariate chain rule)
  ◮ $\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$ (each $h_j$ is immediately dependent on $h_{j-1}$; apply the single variable chain rule)
  ◮ $\frac{\partial h_j}{\partial h_{j-1}} = W_{hh} (1 - h_j^2)$ (the derivative of $\tanh(x)$ is $1 - \tanh(x)^2$)
  ◮ Repeated multiplication by $W_{hh}$ when $t \gg k$ causes vanishing or exploding gradients (illustrated in the sketch below).
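
A rough illustrative sketch of how the product of these factors behaves over 50 steps in the scalar case; the weights and inputs are made up.

```python
import numpy as np

# Product of the factors dh_j/dh_{j-1} = W_hh * (1 - h_j**2) over many time steps
# for a scalar RNN. The point is only the size of the resulting product.
def gradient_product(W_hh, steps=50, input_scale=0.5, seed=0):
    rng = np.random.default_rng(seed)
    h, prod = 0.0, 1.0
    for _ in range(steps):
        h = np.tanh(input_scale * rng.standard_normal() + W_hh * h)
        prod *= W_hh * (1 - h ** 2)   # one factor of the chain-rule product
    return prod

print(gradient_product(W_hh=0.5))                   # extremely small: the gradient vanishes
print(gradient_product(W_hh=1.5, input_scale=0.0))  # h stays at 0, so this is 1.5**50: it explodes
```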

  12. GRU expansion
  How does $\frac{\partial h_j}{\partial h_{j-1}}$ look in the case of the GRU?
  ◮ It is not as simple as in the case of the RNN. Why?
  ◮ The GRU has two gates that both play a role in the computation of the hidden state $h_t$.
  ◮ How does it look? In the scalar ($1 \times 1$) case:
  ◮ $\frac{\partial h_t}{\partial h_{t-1}} = z_t + (h_{t-1} - \tilde{h}_t)\, z_t (1 - z_t)\, U^{(z)} + (1 - z_t)(1 - \tilde{h}_t^2)\, U \left[ r_t + h_{t-1}\, r_t (1 - r_t)\, U^{(r)} \right]$
  ◮ Unlike the RNN, this derivative contains the additive term $z_t$: when the update gate copies the state ($z_t \to 1$), the gradient passes through without being repeatedly multiplied by the same weight matrix, which is the shortcut that mitigates vanishing gradients (see the rough comparison below).
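
As an empirical illustration (not part of the lecture), here is a short PyTorch sketch comparing how much gradient from the final hidden state reaches the first input in a vanilla RNN versus a GRU. The sequence length, sizes, and seed are arbitrary, and untrained random weights only give a rough indication.

```python
import torch

torch.manual_seed(0)
seq_len, batch, dim = 50, 1, 32

def grad_norm_at_first_step(cell):
    """Norm of d(final hidden state)/d(first input) for a recurrent module."""
    x = torch.randn(seq_len, batch, dim, requires_grad=True)
    _, h_n = cell(x)                              # h_n: final hidden state, shape (1, batch, dim)
    grad = torch.autograd.grad(h_n.sum(), x)[0]   # gradient w.r.t. all inputs
    return grad[0].norm().item()                  # portion reaching the first time step

print("RNN:", grad_norm_at_first_step(torch.nn.RNN(dim, dim)))
print("GRU:", grad_norm_at_first_step(torch.nn.GRU(dim, dim)))
# Both shrink over 50 steps, but the RNN value is typically far smaller;
# exact numbers vary with the seed.
```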
