Update Gate and GRU
◮ Update gate: $z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})$
◮ Reset gate: $r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})$
◮ Internal memory content: $\tilde{h}_t = \tanh(W x_t + r_t \circ U h_{t-1})$
◮ Final memory: $h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t$
If $z_t = 0$:
◮ The previous hidden memory $h_{t-1}$ is ignored
◮ The current hidden memory $h_t$ depends only on the new memory content $\tilde{h}_t$
If $z_t = 1$:
◮ The previous hidden memory $h_{t-1}$ is copied
◮ The current input is ignored completely
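The equations above map directly onto code. Below is a minimal sketch (not from the slides) of one GRU step in NumPy, using the slide's naming ($W^{(z)}, U^{(z)}, W^{(r)}, U^{(r)}, W, U$) and its convention that $z_t = 1$ copies the previous state; the dimensions and random parameters are illustrative only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step following the slide's equations (z_t = 1 copies h_{t-1})."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))    # internal memory content
    return z_t * h_prev + (1.0 - z_t) * h_tilde        # final memory

# Toy dimensions: input size 3, hidden size 2 (illustrative values).
rng = np.random.default_rng(0)
d_x, d_h = 3, 2
Wz, Wr, W = (rng.normal(size=(d_h, d_x)) for _ in range(3))
Uz, Ur, U = (rng.normal(size=(d_h, d_h)) for _ in range(3))

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):                  # run over a short input sequence
    h = gru_step(x_t, h, Wz, Uz, Wr, Ur, W, U)
print(h)
```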
What is it for NLP?
◮ Short-term dependencies: older history should be ignored ($r_t \to 0$)
◮ Long-term dependencies: older history should be retained as much as possible ($z_t \to 1$)
Is RNN a special case of GRU?
Can you think of a case in which the RNN is a special case of the GRU?
◮ When $r_t = 1$ and $z_t = 0$:
◮ $\tilde{h}_t = \tanh(W x_t + r_t \circ U h_{t-1}) = \tanh(W x_t + U h_{t-1})$
◮ $h_t = 0 * h_{t-1} + 1 * \tilde{h}_t$
◮ $h_t = \tilde{h}_t$
◮ $W$ plays the role of $W_{xh}$ and $U$ plays the role of $W_{hh}$
◮ Back to the vanilla RNN... Is this a problem?
◮ $r_t = 1$ and $z_t = 0$ would have to hold at every time step, which has a very low chance of happening!
◮ So we are fine...
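As a quick numeric sanity check (a sketch, not part of the slides), pinning the gates to $r_t = 1$ and $z_t = 0$ makes the GRU update collapse to the vanilla RNN update $\tanh(W x_t + U h_{t-1})$:

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_h = 3, 2
W = rng.normal(size=(d_h, d_x))    # plays the role of W_xh
U = rng.normal(size=(d_h, d_h))    # plays the role of W_hh
x_t = rng.normal(size=d_x)
h_prev = rng.normal(size=d_h)

# GRU update with the gates pinned: r_t = 1, z_t = 0
r_t, z_t = np.ones(d_h), np.zeros(d_h)
h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))
h_gru = z_t * h_prev + (1.0 - z_t) * h_tilde

# Vanilla RNN update (bias omitted for simplicity)
h_rnn = np.tanh(W @ x_t + U @ h_prev)

assert np.allclose(h_gru, h_rnn)   # identical, as the derivation predicts
```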
Does GRU fix the vanishing gradient?
There are two ways to analyze this:
◮ What does the hidden state look like in a GRU after a few time steps?
◮ What does the gradient calculation look like?
◮ This requires the multivariate chain rule!
GRU: Hidden state calculation
We will expand $h_{20}$ back through $t = 19$. Again, all our matrices are $1 \times 1$ (scalars) to simplify.
◮ $h_{20} = z_{20} h_{19} + (1 - z_{20}) \tilde{h}_{20}$
◮ $h_{19} = z_{19} h_{18} + (1 - z_{19}) \tilde{h}_{19}$
◮ Therefore, $h_{20} = z_{20} z_{19} h_{18} + z_{20} (1 - z_{19}) \tilde{h}_{19} + (1 - z_{20}) \tilde{h}_{20}$
◮ If $z_{19} \to 0$, then $h_{18}$ is relevant only through $\tilde{h}_{19}$, and only partially (via the reset gate)
◮ If $z_{19} \to 1$, then $\tilde{h}_{19}$ is ignored completely; it is as if the input $x_{19}$ were ignored entirely
◮ In that case $h_{20} = z_{20} h_{18} + (1 - z_{20}) \tilde{h}_{20}$: there is a jump between time steps 18 and 20
◮ It is a shortcut across time
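For reference (not on the slides), the same unrolling can be written for an arbitrary gap between steps $k$ and $t$ in the scalar notation used above; this sketch follows directly from $h_t = z_t h_{t-1} + (1 - z_t)\tilde{h}_t$:

```latex
% General unrolling of h_t = z_t h_{t-1} + (1 - z_t) \tilde{h}_t back to step k (scalar case):
h_t \;=\; \Bigl(\prod_{j=k+1}^{t} z_j\Bigr) h_k
      \;+\; \sum_{j=k+1}^{t} \Bigl(\prod_{i=j+1}^{t} z_i\Bigr) (1 - z_j)\, \tilde{h}_j
% When every z_j is close to 1, the leading product stays close to 1,
% so h_k is carried forward almost unchanged: the "shortcut across time".
```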
Single variable chain rule
Question: If $z$ depends on $y$ and $y$ depends on $x$, what is $\frac{dz}{dx}$?
Answer: $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$
Example
◮ $z = \exp(y)$, $y = x^2$
◮ $\frac{dz}{dy} = \exp(y)$
◮ $\frac{dy}{dx} = 2x$
◮ $\frac{dz}{dx} = \exp(y) \times 2x$
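A quick numeric check of this example (a sketch, not from the slides): compare the analytic derivative $\exp(y) \times 2x = 2x \exp(x^2)$ against a central finite difference.

```python
import numpy as np

def z_of_x(x):
    return np.exp(x**2)        # z = exp(y) with y = x^2

x0, eps = 0.7, 1e-6
numeric = (z_of_x(x0 + eps) - z_of_x(x0 - eps)) / (2 * eps)   # central difference
analytic = np.exp(x0**2) * 2 * x0                             # dz/dy * dy/dx
assert abs(numeric - analytic) < 1e-6                         # they agree closely
```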
Multivariate chain rule (Very Important!)
Multivariate chain rule example
◮ $z = x + y$
◮ $x = t^2$, $y = \exp(t)$
◮ $\frac{dz}{dt} = \frac{\partial z}{\partial y} \frac{dy}{dt} + \frac{\partial z}{\partial x} \frac{dx}{dt}$
◮ $\frac{\partial z}{\partial y} = 1$
◮ $\frac{\partial z}{\partial x} = 1$
◮ $\frac{dy}{dt} = \exp(t)$
◮ $\frac{dx}{dt} = 2t$
◮ Finally, $\frac{dz}{dt} = \exp(t) + 2t$
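As a quick check (a sketch, not part of the slides), SymPy reproduces the same total derivative:

```python
import sympy as sp

t = sp.symbols('t')
x = t**2                # x(t)
y = sp.exp(t)           # y(t)
z = x + y               # z(x, y)

dz_dt = sp.diff(z, t)   # total derivative dz/dt
print(dz_dt)            # 2*t + exp(t)
assert sp.simplify(dz_dt - (sp.exp(t) + 2*t)) == 0
```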
Multivariate chain rule in summary
What is the derivative of $h_3$ with respect to $W_{hh}$ ($\frac{\partial h_3}{\partial W_{hh}}$)?
◮ Enumerate all paths from $W_{hh}$ to $h_3$ and sum the derivatives along them
◮ An arrow between two nodes stands for the derivative between those two nodes
1. $W_{hh} \to h_3$ directly
2. $W_{hh} \to h_2 \to h_3$
3. $W_{hh} \to h_1 \to h_2 \to h_3$
◮ $h_2 \to h_3$ means $\frac{\partial h_3}{\partial h_2}$
What does it mean?
What is the derivative of $h_3$ with respect to $W_{hh}$?
◮ $\frac{\partial h_3}{\partial W_{hh}}$ is the sum of the following terms:
1. $\frac{\partial h_3}{\partial W_{hh}}$ (the direct dependence)
2. $\frac{\partial h_3}{\partial h_2} \frac{\partial h_2}{\partial W_{hh}}$
3. $\frac{\partial h_3}{\partial h_2} \frac{\partial h_2}{\partial h_1} \frac{\partial h_1}{\partial W_{hh}}$
RNN expansion
Equations of the basic RNN:
◮ $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$
◮ $\hat{y}_t = W_{hy} h_t + b_y$
◮ $p_t = \mathrm{softmax}(\hat{y}_t)$
◮ $\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W_{hh}}$ (sum the loss over each time step $t$)
◮ $\frac{\partial L_t}{\partial W_{hh}} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W_{hh}}$ (each $h_t$ is computed using $W_{hh}$; apply the multivariate chain rule)
◮ $\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$ (each $h_j$ is immediately dependent on $h_{j-1}$; apply the single variable chain rule)
◮ $\frac{\partial h_j}{\partial h_{j-1}} = W_{hh} (1 - h_j^2)$ (the derivative of $\tanh(x)$ is $1 - \tanh(x)^2$)
◮ Repeated multiplication by $W_{hh}$ when $t \gg k$ causes vanishing or exploding gradients
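A minimal numeric illustration (a sketch, not from the slides): with scalar weights, the product $\prod_j W_{hh}(1 - h_j^2)$ that appears in $\frac{\partial h_t}{\partial h_k}$ shrinks rapidly as the gap $t - k$ grows (with a large enough $W_{hh}$ and an unsaturated state, the same product can instead grow, i.e. explode). The weight values here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(size=50)     # a random input sequence of length 50
w_xh, w_hh = 1.0, 0.5        # scalar "matrices", as in the 1x1 simplification

h, grad = 0.0, 1.0           # grad accumulates prod_j w_hh * (1 - h_j^2) = dh_t / dh_k
for x in xs:
    h = np.tanh(w_xh * x + w_hh * h)
    grad *= w_hh * (1.0 - h**2)

print(grad)                  # effectively zero: the gradient from the early steps has vanished
```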
GRU expansion
How does $\frac{\partial h_j}{\partial h_{j-1}}$ look in the case of the GRU?
◮ It is not as simple as in the case of the RNN. Why?
◮ The GRU has two gates that play a role in the computation of the hidden state $h_t$
◮ How does it look?
◮ $\left[ z \left( (1 - z) W^{(z)} (1 - h) + 1 \right) \right] + \left[ r (1 - z)(1 - h^2) \, U \left( 1 + h_{t-1} (1 - r) W^{(r)} \right) \right]$
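Rather than unpacking the closed form, a numeric sketch (assuming scalar weights and the slide's $z/(1-z)$ convention; the parameter values are arbitrary) can probe $\frac{\partial h_t}{\partial h_{t-1}}$ with a finite difference. When the update gate saturates toward 1, the derivative approaches 1, which is what lets gradients survive long gaps:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_h(h_prev, x, wz, uz, wr, ur, w, u):
    """Scalar GRU step in the slide's convention: z_t = 1 copies h_{t-1}."""
    z = sigmoid(wz * x + uz * h_prev)
    r = sigmoid(wr * x + ur * h_prev)
    h_tilde = np.tanh(w * x + r * u * h_prev)
    return z * h_prev + (1.0 - z) * h_tilde

x, h_prev, eps = 0.4, 0.1, 1e-6
for wz in (0.0, 25.0):       # small vs. near-saturating update gate
    f = lambda h: gru_h(h, x, wz, 0.0, 0.3, -0.2, 0.8, 0.5)
    d = (f(h_prev + eps) - f(h_prev - eps)) / (2 * eps)   # dh_t / dh_{t-1}
    print(wz, d)             # as z_t -> 1 (wz = 25), the derivative approaches 1
```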