Gradients of Deep Networks
Chris Cremer
March 29, 2017
Neural Net

[Diagram: the input $X$ passes through weights $X_1, X_2, X_3, X_4$ and hidden activations $B_1, B_2, B_3$ to the output $\hat{Z}$.]

$B_t = g(X_t \cdot B_{t-1})$

where $g$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, ...)
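A minimal NumPy sketch of this forward pass, assuming a sigmoid for $g$ and a linear output layer (the shapes, names, and random weights are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, weights, g=sigmoid):
    """Hidden layers: B_t = g(X_t . B_{t-1}), starting from the input, B_0 = X."""
    B = X
    for X_t in weights[:-1]:
        B = g(X_t @ B)
    return weights[-1] @ B      # assumed linear output layer producing Z_hat

# Toy setup: three 5x5 hidden weight matrices X_1..X_3 and an output map X_4.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 5)) for _ in range(3)] + [rng.normal(size=(1, 5))]
X = rng.normal(size=5)
print(forward(X, weights))
```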
Recurrent Neural Net http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Net

[Diagram: an unrolled RNN with $Y_1, Y_2, Y_3, Y_4$ at each timestep, hidden states $B_1, B_2, B_3$ initialized with $B_0 = [0]$, the same weights $X$ at every step, and the output $\hat{Z}$.]

$B_t = g(X_t \cdot B_{t-1})$

where $g$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, ...)

Notice that the weights are the same at every timestep.
Recurrent Neural Network – One Timestep

[Diagram of a single timestep: $B_{t-1}$ is multiplied by $X$ and passed through $g$ to give $B_t$, alongside $Y_t$.]
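A sketch of one such timestep, assuming $g$ is a sigmoid and taking $Y_t$ to be simply a copy of $B_t$ (the slide does not pin down how $Y_t$ is produced, so that part is an assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(B_prev, X, g=sigmoid):
    """One timestep: B_t = g(X . B_{t-1}); Y_t is taken to be a copy of B_t here."""
    B_t = g(X @ B_prev)
    return B_t, B_t  # (new state, per-step Y_t)

# Unroll for a few timesteps with the same weight matrix X and B_0 = [0].
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))
B = np.zeros(4)
for t in range(4):
    B, Y_t = rnn_step(B, X)
```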
Gradient Descent

[Diagram: the feed-forward net $X \to X_1 \to B_1 \to X_2 \to B_2 \to X_3 \to B_3 \to X_4 \to \hat{Z}$ and the unrolled RNN ($Y_1, \dots, Y_4$, shared $X$), each ending in the cost $D(\hat{Z}, Z)$.]

We want $\frac{\partial D}{\partial X_1}, \frac{\partial D}{\partial X_2}, \frac{\partial D}{\partial X_3}, \dots$ so that we can do gradient descent:

$X_{new} = X_{old} - \beta \frac{\partial D}{\partial X_{old}}$

where $D$ is a cost function (squared error, cross-entropy, ...)
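To make the update rule concrete, here is gradient descent on a toy scalar cost $D(X) = (X - 3)^2$ (the cost and the minimizer 3 are made up purely for illustration):

```python
# Repeatedly apply X_new = X_old - beta * dD/dX_old on D(X) = (X - 3)^2.
beta = 0.1
X = 0.0
for step in range(100):
    grad = 2.0 * (X - 3.0)   # dD/dX for the toy squared-error cost
    X = X - beta * grad
print(X)                      # approaches the minimizer, 3.0
```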
Backprop (Chain Rule)

We want $\frac{\partial D}{\partial X_1}, \frac{\partial D}{\partial X_2}, \frac{\partial D}{\partial X_3}, \dots$

Example: with $D(\hat{Z}, Z) = (\hat{Z} - Z)^2$,

$\frac{\partial D}{\partial \hat{Z}} = 2(\hat{Z} - Z)$   (derivative of the cost function)

$\frac{\partial D}{\partial X_4} = \frac{\partial D}{\partial \hat{Z}} \cdot \frac{\partial \hat{Z}}{\partial X_4}$

$\frac{\partial D}{\partial B_2} = \frac{\partial D}{\partial \hat{Z}} \cdot \frac{\partial \hat{Z}}{\partial B_2} = \frac{\partial D}{\partial \hat{Z}} \cdot \frac{\partial \hat{Z}}{\partial B_3} \cdot \frac{\partial B_3}{\partial B_2}$

With a sigmoid activation, $B_t = \tau(X_t \cdot B_{t-1})$ and $\frac{\partial B_t}{\partial B_{t-1}} = B_t (1 - B_t) \, X_t$   (derivative of the activation function)

In general, $\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial g(X_t \cdot B_{t-1})}{\partial B_{t-1}} = g'(X_t \cdot B_{t-1}) \, X_t$

where $g$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, ...) and $D$ = cost function (squared error, cross-entropy, ...)
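The same chain rule worked through numerically, as a sketch assuming scalar weights, sigmoid activations, and the squared-error cost above (all values are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Scalar chain: B_t = sigmoid(X_t * B_{t-1}), Z_hat = X_4 * B_3, D = (Z_hat - Z)^2.
rng = np.random.default_rng(0)
X_w = rng.normal(size=4)            # weights X_1..X_4 (scalars for clarity)
B0, Z = 0.5, 1.0                    # input and target

# Forward pass, keeping the activations for backprop.
B = [B0]
for t in range(3):
    B.append(sigmoid(X_w[t] * B[-1]))
Z_hat = X_w[3] * B[3]
D = (Z_hat - Z) ** 2

# Backward pass (chain rule).
dD_dZhat = 2 * (Z_hat - Z)                              # derivative of the cost
grads = [None, None, None, dD_dZhat * B[3]]             # dD/dX_4, since dZhat/dX_4 = B_3
dD_dB = dD_dZhat * X_w[3]                               # dD/dB_3
for t in (2, 1, 0):
    grads[t] = dD_dB * B[t + 1] * (1 - B[t + 1]) * B[t]    # dD/dX_{t+1}
    dD_dB = dD_dB * B[t + 1] * (1 - B[t + 1]) * X_w[t]     # dD/dB_t
print(grads)
```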
Vanishing/Exploding Gradient

$\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial g(X_t \cdot B_{t-1})}{\partial B_{t-1}} = g'(X_t \cdot B_{t-1}) \, X_t$

$\frac{\partial D}{\partial X_1} = \frac{\partial D}{\partial \hat{Z}} \cdot \frac{\partial \hat{Z}}{\partial B_3} \cdot \frac{\partial B_3}{\partial B_2} \cdot \frac{\partial B_2}{\partial B_1} \cdots = \frac{\partial D}{\partial \hat{Z}} \cdot \frac{\partial \hat{Z}}{\partial B_3} \cdot \big( g'(X_t \cdot B_{t-1}) \, X_t \big)^T$

$T$ = number of layers = number of timesteps
For NNs, $t$ goes from $T$ to $0$
For RNNs, the weight $X$ is the same for every $t$

If $g'(X_t \cdot B_{t-1}) \, X_t > 1$, the product $\big( g'(X_t \cdot B_{t-1}) \, X_t \big)^T$ explodes
If $g'(X_t \cdot B_{t-1}) \, X_t < 1$, it vanishes
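A quick numerical illustration of why the power of $T$ matters, using scalar stand-ins for the factors $g'(X_t \cdot B_{t-1}) \, X_t$:

```python
import numpy as np

# The backprop signal through T layers/timesteps is a product of T factors
# of the form g'(X_t . B_{t-1}) * X_t. With scalar stand-ins for those factors:
T = 50
for factor in (0.9, 1.1):
    print(factor, np.prod(np.full(T, factor)))
# 0.9**50 is about 5e-3 (vanishing); 1.1**50 is about 117 (exploding).
```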
ResNet / Highway Net / GRU / LSTM

• NNs:
  • ResNet (2015)
  • Highway Net (2015)
• RNNs:
  • LSTM (1997)
  • GRU (2014)

K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
Residual Network (ResNet)

[Diagram: the same feed-forward net, now with skip connections adding $B_{t-1}$ around each layer.]

$B_t = g(X_t \cdot B_{t-1}) + B_{t-1}$

Idea:
• If a layer is useless (i.e. it loses information), the network can skip it
• It is easier for the network to drive a layer's weights to zero than to learn an identity mapping
ResNet Gradient

$B_t = g(X_t \cdot B_{t-1}) + B_{t-1}$

$\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial \big( g(X_t \cdot B_{t-1}) + B_{t-1} \big)}{\partial B_{t-1}} = g'(X_t \cdot B_{t-1}) \, X_t + 1$

$\big( g'(X_t \cdot B_{t-1}) \, X_t + 1 \big)^T = (a + 1)^T$, where $a = g'(X_t \cdot B_{t-1}) \, X_t$

$(a + 1)^2 = a^2 + 2a + 1$
$(a + 1)^3 = a^3 + 3a^2 + 3a + 1$
$(a + 1)^4 = a^4 + 4a^3 + 6a^2 + 4a + 1$

• Vanishing gradient problem: the $+1$ terms let the gradient persist through layers
• Exploding gradient problem: handled with weight decay, weight norm, layer norm, batch norm, ...
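A scalar sketch of the same point, assuming a sigmoid $g$: the $+1$ from the skip connection keeps a 50-layer gradient product from vanishing (the weight value 0.01 is made up so that the first term is tiny):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Scalar residual layer B_t = g(X_t * B_{t-1}) + B_{t-1}.
# Its derivative w.r.t. B_{t-1} is g'(X_t * B_{t-1}) * X_t + 1: even when the first
# term is tiny, the "+ 1" keeps the gradient from vanishing across many layers.
def residual_grad(X_t, B_prev):
    B = sigmoid(X_t * B_prev)
    return B * (1 - B) * X_t + 1.0

grad = 1.0
for _ in range(50):                      # 50 stacked residual layers
    grad *= residual_grad(X_t=0.01, B_prev=0.5)
print(grad)                              # stays on the order of 1; without the +1 it would be ~0
```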
Highway Network

[Diagram: the same net, with gate units on the skip connections.]

$B_t = g(X_t \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$

$C = \tau(X_{t2} \cdot B_{t-1})$, where $\tau$ = sigmoid, since its output is in $(0, 1)$
Highway Net Gradient

$B_t = g(X_t \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$, with $C = \tau(X_{t2} \cdot B_{t-1})$

$= g(X_t \cdot B_{t-1}) \cdot \tau(X_{t2} \cdot B_{t-1}) + B_{t-1} \cdot \big( 1 - \tau(X_{t2} \cdot B_{t-1}) \big)$

$= g(X_t \cdot B_{t-1}) \cdot \tau(X_{t2} \cdot B_{t-1}) + B_{t-1} - \tau(X_{t2} \cdot B_{t-1}) \cdot B_{t-1}$

When the gate closes ($C \to 0$), $B_t \to B_{t-1}$ and $\frac{\partial B_t}{\partial B_{t-1}} \to 1$

• Vanishing gradient problem: the gated skip connection lets the gradient persist through layers
• Exploding gradient problem: handled with weight decay, weight norm, layer norm, batch norm, ...
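A finite-difference check on a scalar highway layer, assuming $g = \tanh$ and made-up weights: as the gate weight pushes $C$ toward 0, $\frac{\partial B_t}{\partial B_{t-1}}$ approaches 1:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(B_prev, X_t, X_t2, g=np.tanh):
    """Scalar highway layer: B_t = g(X_t*B_{t-1})*C + B_{t-1}*(1 - C), C = sigmoid(X_t2*B_{t-1})."""
    C = sigmoid(X_t2 * B_prev)
    return g(X_t * B_prev) * C + B_prev * (1 - C)

# Numerically estimate dB_t/dB_{t-1} for different gate weights X_t2:
# the more negative X_t2 is, the closer C gets to 0 and the derivative to 1.
B_prev, X_t, eps = 0.5, 0.8, 1e-6
for X_t2 in (2.0, 0.0, -5.0, -20.0):
    d = (highway(B_prev + eps, X_t, X_t2) - highway(B_prev - eps, X_t, X_t2)) / (2 * eps)
    print(X_t2, round(d, 4))
```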
Back to RNNs

[Diagram: the unrolled RNN with $Y_1, \dots, Y_4$, hidden states $B_1, B_2, B_3$, init $B_0 = [0]$, shared weights $X$, and output $\hat{Z}$.]

$B_t = g(X_t \cdot B_{t-1})$

$\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial g(X_t \cdot B_{t-1})}{\partial B_{t-1}} = g'(X_t \cdot B_{t-1}) \, X_t$   → vanishing/exploding gradient

where $g$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, ...)
Note: the weights are the same at every timestep.
RNN: Gated Recurrent Unit

[Diagram: a gated recurrent cell combining $B_{t-1}$ and the new candidate through the gate $\tau$ and $1 - \tau$ to produce $B_t$, with $Y_t$ at the step.]

$B_t = g(X \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$

$C = \tau(X_2 \cdot B_{t-1})$, where $\tau$ = sigmoid, since its output is in $(0, 1)$
RNN: Another View of GRUs

[Diagram: the gated cell from the previous slide placed at each step of the unrolled RNN ($Y_1, \dots, Y_4$, $B_0, \dots, B_3$, output $\hat{Z}$).]

$B_t = g(X \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$

$C = \tau(X_2 \cdot B_{t-1})$, where $\tau$ = sigmoid, since its output is in $(0, 1)$
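A toy scalar comparison, assuming $g = \tanh$ and made-up weights, of how the gated recurrence holds on to its initial state over a long sequence while the plain recurrence $B_t = g(X \cdot B_{t-1})$ forgets it (a sketch of the gating idea only, not a full GRU, which also has an input term and a reset gate):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def plain_step(B, X=0.9):
    return np.tanh(X * B)

def gated_step(B, X=0.9, X2=-6.0):
    C = sigmoid(X2 * B)                       # update gate, mostly closed here
    return np.tanh(X * B) * C + B * (1 - C)

# Run each recurrence from two different starting states for 200 steps:
# the plain recurrence drives both to the same value (the start is forgotten),
# while the gated recurrence keeps the two trajectories clearly apart.
for step in (plain_step, gated_step):
    B_a, B_b = 0.9, 0.1
    for t in range(200):
        B_a, B_b = step(B_a), step(B_b)
    print(step.__name__, round(B_a - B_b, 4))
```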
GRU/LSTM: More Gates

[Diagrams: the full GRU and LSTM cells.]
Memory Concerns • If T=10000, you need to keep 10000 activations/states in memory
Deep Network Gradients: Conclusion

• The models we saw all use the same idea
• NNs: ResNet (2015), Highway Net (2015); RNNs: LSTM (1997), GRU (2014)
• Neither the ResNet nor the Highway Net paper references GRUs/LSTMs
• "One of the earlier uses of skip connections was in the Nonlinear AutoRegressive with eXogenous inputs method (NARX; Lin et al., 1996), where they improved the RNN's ability to infer finite state machines." - Ilya Sutskever, PhD thesis, 2013
Thanks