

  1. The vanishing gradient problem revisited: Highway and residual connections (CS 6956: Deep Learning for NLP)

  2. Revisiting the vanishing gradient problem: it stems from the fact that the derivative of the activation is between zero and one... and as the number of steps of gradient computation grows, these derivatives get multiplied together. Not just applicable to LSTMs.

  3. Revisiting the vanishing gradient problem: not just applicable to LSTMs. [Diagram: Inputs → many layers in between → Outputs → Loss]

  4. Revisiting the vanishing gradient problem: not just applicable to LSTMs. The gradient vanishes as the depth grows. [Same diagram: Inputs → many layers in between → Outputs → Loss]

  5. Revisiting the vanishing gradient problem: not just applicable to LSTMs. The gradient vanishes as the depth grows, so the loss is no longer influenced by the inputs for very deep networks! [Same diagram]

  6. Revisiting the vanishing gradient problem: the gradient vanishes as the depth grows, so the loss is no longer influenced by the inputs for very deep networks! Can we use ideas from LSTMs/GRUs to fix this problem? [Same diagram]
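
To make the multiplication argument concrete, here is a tiny back-of-the-envelope sketch (not from the slides; the per-layer derivative value 0.9 is purely illustrative): the factor a gradient picks up across the stack is the product of per-layer activation derivatives, each between zero and one, and that product collapses quickly with depth.

```python
import math

# Product of per-layer activation derivatives, each between zero and one.
# 0.9 per layer is a hypothetical value chosen only for illustration.
for depth in (10, 50, 100):
    print(depth, math.prod([0.9] * depth))
# 10  -> ~0.35
# 50  -> ~0.005
# 100 -> ~2.7e-05
```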

  7. Revisiting the vanishing gradient problem. Intuition: consider a single layer, m_t = h(m_{t-1} W + c), where the (t-1)-th layer is used to calculate the value of the t-th layer.

  8. Revisiting the vanishing gradient problem. Intuition: consider a single layer, m_t = h(m_{t-1} W + c). Instead of a non-linear update that directly calculates the next layer, let us try an update in which the previous layer enters linearly: m_t = m_{t-1} + h(m_{t-1} W + c).

  9. Revisiting the vanishing gradient problem. Intuition: consider a single layer, m_t = h(m_{t-1} W + c). Instead of a non-linear update that directly calculates the next layer, let us try an update in which the previous layer enters linearly: m_t = m_{t-1} + h(m_{t-1} W + c). Now the gradients can be propagated all the way back to the input without attenuation.
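
The claim on slide 9 can be checked empirically. Below is a minimal PyTorch sketch (the choices of tanh for h, fixed random weights, width 32, and depth 100 are assumptions for illustration, not from the slides) comparing the gradient norm that reaches the input under the plain update m_t = h(m_{t-1} W + c) and the additive update m_t = m_{t-1} + h(m_{t-1} W + c).

```python
import torch

torch.manual_seed(0)
dim, depth = 32, 100

def input_grad_norm(additive: bool) -> float:
    """Norm of the gradient at the input after backprop through `depth` layers."""
    x = torch.randn(1, dim, requires_grad=True)
    m = x
    for _ in range(depth):
        W = torch.randn(dim, dim) / dim ** 0.5   # fixed random weights, no training
        c = torch.zeros(dim)
        update = torch.tanh(m @ W + c)           # h(m_{t-1} W + c)
        m = m + update if additive else update   # additive vs. plain update
    m.sum().backward()
    return x.grad.norm().item()

print("plain update:   ", input_grad_norm(additive=False))  # typically vanishingly small
print("additive update:", input_grad_norm(additive=True))   # typically stays usable
```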

  10. Residual networks [He et al., 2015]: each layer is reformulated as m_t = m_{t-1} + h(m_{t-1} W + c). [Diagram, original layer: m_{t-1} → h(m_{t-1} W + c) → m_t]

  11. Residual networks [He et al., 2015]: each layer is reformulated as m_t = m_{t-1} + h(m_{t-1} W + c). [Diagram, original layer vs. residual connection: in the residual version, m_{t-1} is added via a skip connection to h(m_{t-1} W + c) to produce m_t]

  12. Residual networks [He et al., 2015]: each layer is reformulated as m_t = m_{t-1} + h(m_{t-1} W + c). The function h is no longer trained to predict the next layer; it predicts an update to the current layer value instead. That is, it can be seen as a residual function (the difference between the layers).
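
A minimal sketch of such a layer as a PyTorch module (the name ResidualLayer, tanh for h, and a single nn.Linear holding W and c are illustrative choices, not taken from He et al.):

```python
import torch
import torch.nn as nn

class ResidualLayer(nn.Module):
    """Computes m_t = m_{t-1} + h(m_{t-1} W + c): the layer learns an update, not m_t itself."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)  # W and c
        self.h = nn.Tanh()                 # the nonlinearity h

    def forward(self, m_prev: torch.Tensor) -> torch.Tensor:
        # Identity (skip) path plus the learned residual update.
        return m_prev + self.h(self.linear(m_prev))

# Usage: out = ResidualLayer(256)(torch.randn(8, 256))
```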

  13. Highway connections [Srivastava et al., 2015]: extend the idea, using gates to stabilize learning.
  • First, compute a proposed update: D = h(m_{t-1} W + c)
  • Next, compute how much of the proposed update should be retained: U = σ(m_{t-1} W_U + c_U), where σ is the sigmoid
  • Finally, compute the actual value of the next layer: m_t = (1 − U) ⊙ m_{t-1} + U ⊙ D, where ⊙ is element-wise multiplication

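
A matching sketch of one highway layer as a PyTorch module (assumptions for illustration: the name HighwayLayer, tanh for h, and sigmoid for the gate nonlinearity; the negative gate-bias initialization follows the common recommendation to start each layer close to the identity):

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """D = h(m_{t-1} W + c); U = sigmoid(m_{t-1} W_U + c_U);
    m_t = (1 - U) * m_{t-1} + U * D, with element-wise products."""
    def __init__(self, dim: int):
        super().__init__()
        self.update = nn.Linear(dim, dim)  # W, c for the proposed update D
        self.gate = nn.Linear(dim, dim)    # W_U, c_U for the gate U
        # Bias the gate toward "carry" so that, early in training, each layer is
        # close to the identity and gradients flow through very deep stacks.
        nn.init.constant_(self.gate.bias, -2.0)

    def forward(self, m_prev: torch.Tensor) -> torch.Tensor:
        d = torch.tanh(self.update(m_prev))   # proposed update D
        u = torch.sigmoid(self.gate(m_prev))  # gate U, element-wise in (0, 1)
        return (1 - u) * m_prev + u * d       # interpolate between carry and update

# Usage: out = HighwayLayer(256)(torch.randn(8, 256))
```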

  17. Why residual/highway connections?
  • As networks become deeper, or as sequences get longer, we can no longer hope for gradients to be carried through the whole network
  • If we want to capture long-range dependencies within the input, we need a mechanism like this
  • More generally, this is a blueprint of an idea that can be combined with your own neural network model if it gets too deep
