Gradients of Deep Networks
Chris Cremer
March 29, 2017
Neural Net

[Diagram: the input $X$ passes through weights $X_1, X_2, X_3, X_4$ and hidden activations $B_1, B_2, B_3$ to the output $\hat{Z}$.]

$B_t = g(X_t \cdot B_{t-1})$

where $g$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, ...)
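A minimal NumPy sketch of this forward pass, assuming a sigmoid for $g$ and a linear output layer (the shapes, names, and random weights are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, weights, g=sigmoid):
    """Hidden layers: B_t = g(X_t . B_{t-1}), starting from the input, B_0 = X."""
    B = X
    for X_t in weights[:-1]:
        B = g(X_t @ B)
    return weights[-1] @ B      # assumed linear output layer producing Z_hat

# Toy setup: three 5x5 hidden weight matrices X_1..X_3 and an output map X_4.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 5)) for _ in range(3)] + [rng.normal(size=(1, 5))]
X = rng.normal(size=5)
print(forward(X, weights))
```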
Recurrent Neural Net http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Net

[Diagram: an unrolled RNN with $Y_1, Y_2, Y_3, Y_4$ at each timestep, hidden states $B_1, B_2, B_3$ initialized with $B_0 = [0]$, the same weights $X$ at every step, and the output $\hat{Z}$.]

$B_t = g(X_t \cdot B_{t-1})$

where $g$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, ...)

Notice that the weights are the same at every timestep.
Recurrent Neural Network – One Timestep

[Diagram of a single timestep: $B_{t-1}$ is multiplied by $X$ and passed through $g$ to give $B_t$, alongside $Y_t$.]
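A sketch of one such timestep, assuming $g$ is a sigmoid and taking $Y_t$ to be simply a copy of $B_t$ (the slide does not pin down how $Y_t$ is produced, so that part is an assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(B_prev, X, g=sigmoid):
    """One timestep: B_t = g(X . B_{t-1}); Y_t is taken to be a copy of B_t here."""
    B_t = g(X @ B_prev)
    return B_t, B_t  # (new state, per-step Y_t)

# Unroll for a few timesteps with the same weight matrix X and B_0 = [0].
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))
B = np.zeros(4)
for t in range(4):
    B, Y_t = rnn_step(B, X)
```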
Gradient Descent

[Diagram: the feed-forward net $X \to X_1 \to B_1 \to X_2 \to B_2 \to X_3 \to B_3 \to X_4 \to \hat{Z}$ and the unrolled RNN ($Y_1, \dots, Y_4$, shared $X$), each ending in the cost $D(\hat{Z}, Z)$.]

We want $\frac{\partial D}{\partial X_1}, \frac{\partial D}{\partial X_2}, \frac{\partial D}{\partial X_3}, \dots$ so that we can do gradient descent:

$X_{new} = X_{old} - \beta \frac{\partial D}{\partial X_{old}}$

where $D$ is a cost function (squared error, cross-entropy, ...)
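To make the update rule concrete, here is gradient descent on a toy scalar cost $D(X) = (X - 3)^2$ (the cost and the minimizer 3 are made up purely for illustration):

```python
# Repeatedly apply X_new = X_old - beta * dD/dX_old on D(X) = (X - 3)^2.
beta = 0.1
X = 0.0
for step in range(100):
    grad = 2.0 * (X - 3.0)   # dD/dX for the toy squared-error cost
    X = X - beta * grad
print(X)                      # approaches the minimizer, 3.0
```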
Backprop (Chain Rule)

We want $\frac{\partial D}{\partial X_1}, \frac{\partial D}{\partial X_2}, \frac{\partial D}{\partial X_3}, \dots$

Example: with $D(\hat{Z}, Z) = (\hat{Z} - Z)^2$,

$\frac{\partial D}{\partial \hat{Z}} = 2(\hat{Z} - Z)$   (derivative of the cost function)

$\frac{\partial D}{\partial X_4} = \frac{\partial D}{\partial \hat{Z}} \cdot \frac{\partial \hat{Z}}{\partial X_4}$

$\frac{\partial D}{\partial B_2} = \frac{\partial D}{\partial \hat{Z}} \cdot \frac{\partial \hat{Z}}{\partial B_2} = \frac{\partial D}{\partial \hat{Z}} \cdot \frac{\partial \hat{Z}}{\partial B_3} \cdot \frac{\partial B_3}{\partial B_2}$

With a sigmoid activation, $B_t = \tau(X_t \cdot B_{t-1})$ and $\frac{\partial B_t}{\partial B_{t-1}} = B_t (1 - B_t) \, X_t$   (derivative of the activation function)

In general, $\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial g(X_t \cdot B_{t-1})}{\partial B_{t-1}} = g'(X_t \cdot B_{t-1}) \, X_t$

where $g$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, ...) and $D$ = cost function (squared error, cross-entropy, ...)
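The same chain rule worked through numerically, as a sketch assuming scalar weights, sigmoid activations, and the squared-error cost above (all values are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Scalar chain: B_t = sigmoid(X_t * B_{t-1}), Z_hat = X_4 * B_3, D = (Z_hat - Z)^2.
rng = np.random.default_rng(0)
X_w = rng.normal(size=4)            # weights X_1..X_4 (scalars for clarity)
B0, Z = 0.5, 1.0                    # input and target

# Forward pass, keeping the activations for backprop.
B = [B0]
for t in range(3):
    B.append(sigmoid(X_w[t] * B[-1]))
Z_hat = X_w[3] * B[3]
D = (Z_hat - Z) ** 2

# Backward pass (chain rule).
dD_dZhat = 2 * (Z_hat - Z)                              # derivative of the cost
grads = [None, None, None, dD_dZhat * B[3]]             # dD/dX_4, since dZhat/dX_4 = B_3
dD_dB = dD_dZhat * X_w[3]                               # dD/dB_3
for t in (2, 1, 0):
    grads[t] = dD_dB * B[t + 1] * (1 - B[t + 1]) * B[t]    # dD/dX_{t+1}
    dD_dB = dD_dB * B[t + 1] * (1 - B[t + 1]) * X_w[t]     # dD/dB_t
print(grads)
```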
Vanishing/Exploding Gradient

$\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial g(X_t \cdot B_{t-1})}{\partial B_{t-1}} = g'(X_t \cdot B_{t-1}) \, X_t$

$\frac{\partial D}{\partial X_1} = \frac{\partial D}{\partial \hat{Z}} \cdot \frac{\partial \hat{Z}}{\partial B_3} \cdot \frac{\partial B_3}{\partial B_2} \cdot \frac{\partial B_2}{\partial B_1} \cdots = \frac{\partial D}{\partial \hat{Z}} \cdot \frac{\partial \hat{Z}}{\partial B_3} \cdot \big( g'(X_t \cdot B_{t-1}) \, X_t \big)^T$

$T$ = number of layers = number of timesteps
For NNs, $t$ goes from $T$ to $0$
For RNNs, the weight $X$ is the same for every $t$

If $g'(X_t \cdot B_{t-1}) \, X_t > 1$, the product $\big( g'(X_t \cdot B_{t-1}) \, X_t \big)^T$ explodes
If $g'(X_t \cdot B_{t-1}) \, X_t < 1$, it vanishes
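A quick numerical illustration of why the power of $T$ matters, using scalar stand-ins for the factors $g'(X_t \cdot B_{t-1}) \, X_t$:

```python
import numpy as np

# The backprop signal through T layers/timesteps is a product of T factors
# of the form g'(X_t . B_{t-1}) * X_t. With scalar stand-ins for those factors:
T = 50
for factor in (0.9, 1.1):
    print(factor, np.prod(np.full(T, factor)))
# 0.9**50 is about 5e-3 (vanishing); 1.1**50 is about 117 (exploding).
```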
ResNet / Highway Net / GRU / LSTM

• NNs:
  • ResNet (2015)
  • Highway Net (2015)
• RNNs:
  • LSTM (1997)
  • GRU (2014)

K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
Residual Network (ResNet)

[Diagram: the same feed-forward net, now with skip connections adding $B_{t-1}$ around each layer.]

$B_t = g(X_t \cdot B_{t-1}) + B_{t-1}$

Idea:
• If a layer is useless (i.e. it loses information), the network can skip it
• It is easier for the network to drive a layer's weights to zero than to learn an identity mapping
ResNet Gradient

$B_t = g(X_t \cdot B_{t-1}) + B_{t-1}$

$\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial \big( g(X_t \cdot B_{t-1}) + B_{t-1} \big)}{\partial B_{t-1}} = g'(X_t \cdot B_{t-1}) \, X_t + 1$

$\big( g'(X_t \cdot B_{t-1}) \, X_t + 1 \big)^T = (a + 1)^T$, where $a = g'(X_t \cdot B_{t-1}) \, X_t$

$(a + 1)^2 = a^2 + 2a + 1$
$(a + 1)^3 = a^3 + 3a^2 + 3a + 1$
$(a + 1)^4 = a^4 + 4a^3 + 6a^2 + 4a + 1$

• Vanishing gradient problem: the $+1$ terms let the gradient persist through layers
• Exploding gradient problem: handled with weight decay, weight norm, layer norm, batch norm, ...
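A scalar sketch of the same point, assuming a sigmoid $g$: the $+1$ from the skip connection keeps a 50-layer gradient product from vanishing (the weight value 0.01 is made up so that the first term is tiny):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Scalar residual layer B_t = g(X_t * B_{t-1}) + B_{t-1}.
# Its derivative w.r.t. B_{t-1} is g'(X_t * B_{t-1}) * X_t + 1: even when the first
# term is tiny, the "+ 1" keeps the gradient from vanishing across many layers.
def residual_grad(X_t, B_prev):
    B = sigmoid(X_t * B_prev)
    return B * (1 - B) * X_t + 1.0

grad = 1.0
for _ in range(50):                      # 50 stacked residual layers
    grad *= residual_grad(X_t=0.01, B_prev=0.5)
print(grad)                              # stays on the order of 1; without the +1 it would be ~0
```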
Highway Network

[Diagram: the same net, with gate units on the skip connections.]

$B_t = g(X_t \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$

$C = \tau(X_{t2} \cdot B_{t-1})$, where $\tau$ = sigmoid, since its output is in $(0, 1)$
Highway Net Gradient

$B_t = g(X_t \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$, with $C = \tau(X_{t2} \cdot B_{t-1})$

$= g(X_t \cdot B_{t-1}) \cdot \tau(X_{t2} \cdot B_{t-1}) + B_{t-1} \cdot \big( 1 - \tau(X_{t2} \cdot B_{t-1}) \big)$

$= g(X_t \cdot B_{t-1}) \cdot \tau(X_{t2} \cdot B_{t-1}) + B_{t-1} - \tau(X_{t2} \cdot B_{t-1}) \cdot B_{t-1}$

When the gate closes ($C \to 0$), $B_t \to B_{t-1}$ and $\frac{\partial B_t}{\partial B_{t-1}} \to 1$

• Vanishing gradient problem: the gated skip connection lets the gradient persist through layers
• Exploding gradient problem: handled with weight decay, weight norm, layer norm, batch norm, ...
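A finite-difference check on a scalar highway layer, assuming $g = \tanh$ and made-up weights: as the gate weight pushes $C$ toward 0, $\frac{\partial B_t}{\partial B_{t-1}}$ approaches 1:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(B_prev, X_t, X_t2, g=np.tanh):
    """Scalar highway layer: B_t = g(X_t*B_{t-1})*C + B_{t-1}*(1 - C), C = sigmoid(X_t2*B_{t-1})."""
    C = sigmoid(X_t2 * B_prev)
    return g(X_t * B_prev) * C + B_prev * (1 - C)

# Numerically estimate dB_t/dB_{t-1} for different gate weights X_t2:
# the more negative X_t2 is, the closer C gets to 0 and the derivative to 1.
B_prev, X_t, eps = 0.5, 0.8, 1e-6
for X_t2 in (2.0, 0.0, -5.0, -20.0):
    d = (highway(B_prev + eps, X_t, X_t2) - highway(B_prev - eps, X_t, X_t2)) / (2 * eps)
    print(X_t2, round(d, 4))
```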
Back to RNNs

[Diagram: the unrolled RNN with $Y_1, \dots, Y_4$, hidden states $B_1, B_2, B_3$, init $B_0 = [0]$, shared weights $X$, and output $\hat{Z}$.]

$B_t = g(X_t \cdot B_{t-1})$

$\frac{\partial B_t}{\partial B_{t-1}} = \frac{\partial g(X_t \cdot B_{t-1})}{\partial B_{t-1}} = g'(X_t \cdot B_{t-1}) \, X_t$   → vanishing/exploding gradient

where $g$ = non-linear activation function (sigmoid, tanh, ReLU, Softplus, Maxout, ...)
Note: the weights are the same at every timestep.
RNN: Gated Recurrent Unit

[Diagram: a gated recurrent cell combining $B_{t-1}$ and the new candidate through the gate $\tau$ and $1 - \tau$ to produce $B_t$, with $Y_t$ at the step.]

$B_t = g(X \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$

$C = \tau(X_2 \cdot B_{t-1})$, where $\tau$ = sigmoid, since its output is in $(0, 1)$
RNN: Another View of GRUs

[Diagram: the gated cell from the previous slide placed at each step of the unrolled RNN ($Y_1, \dots, Y_4$, $B_0, \dots, B_3$, output $\hat{Z}$).]

$B_t = g(X \cdot B_{t-1}) \cdot C + B_{t-1} \cdot (1 - C)$

$C = \tau(X_2 \cdot B_{t-1})$, where $\tau$ = sigmoid, since its output is in $(0, 1)$
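A toy scalar comparison, assuming $g = \tanh$ and made-up weights, of how the gated recurrence holds on to its initial state over a long sequence while the plain recurrence $B_t = g(X \cdot B_{t-1})$ forgets it (a sketch of the gating idea only, not a full GRU, which also has an input term and a reset gate):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def plain_step(B, X=0.9):
    return np.tanh(X * B)

def gated_step(B, X=0.9, X2=-6.0):
    C = sigmoid(X2 * B)                       # update gate, mostly closed here
    return np.tanh(X * B) * C + B * (1 - C)

# Run each recurrence from two different starting states for 200 steps:
# the plain recurrence drives both to the same value (the start is forgotten),
# while the gated recurrence keeps the two trajectories clearly apart.
for step in (plain_step, gated_step):
    B_a, B_b = 0.9, 0.1
    for t in range(200):
        B_a, B_b = step(B_a), step(B_b)
    print(step.__name__, round(B_a - B_b, 4))
```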
GRU/LSTM: More Gates

[Diagrams: the full GRU and LSTM cells.]
Memory Concerns • If T=10000, you need to keep 10000 activations/states in memory
Deep Network Gradients: Conclusion

• The models we saw all use the same idea
• NNs: ResNet (2015), Highway Net (2015); RNNs: LSTM (1997), GRU (2014)
• Neither the ResNet nor the Highway Net paper references GRUs/LSTMs
• "One of the earlier uses of skip connections was in the Nonlinear AutoRegressive with eXogenous inputs method (NARX; Lin et al., 1996), where they improved the RNN's ability to infer finite state machines." - Ilya Sutskever, PhD thesis, 2013
Thanks