Dropout in RNNs Following a VI Interpretation
Yarin Gal
yg279@cam.ac.uk
Unless specified otherwise, photos are either original work or taken from Wikimedia, under Creative Commons license.
Recurrent Neural Networks
Recurrent neural networks (RNNs) are damn useful.
Figure: RNN structure (image source: karpathy.github.io/2015/05/21/rnn-effectiveness)
Recurrent Neural Networks
But these also overfit very quickly...
Figure: Overfitting
This means...
◮ We can’t use large models
◮ We have to use early stopping
◮ We can’t use small data
◮ We have to waste data for validation sets...
Dropout in recurrent neural networks
Let’s use dropout then. But lots of research has claimed that that’s a bad idea:
◮ Pachitariu & Sahani, 2013
  ◮ noise added in the recurrent connections of an RNN leads to model instabilities
◮ Bayer et al., 2013
  ◮ with dropout, the RNN’s dynamics change dramatically
◮ Pham et al., 2014
  ◮ dropout in recurrent layers disrupts the RNN’s ability to model sequences
◮ Zaremba et al., 2014
  ◮ applying dropout to the non-recurrent connections alone results in improved performance
◮ Bluche et al., 2015
  ◮ exploratory analysis of the performance of dropout before, inside, and after the RNN’s
Dropout in recurrent neural networks
→ has settled on using dropout for inputs and outputs alone:
Figure: Naive application of dropout in RNNs (colours = different dropout masks). Inputs $x_{t-1}, x_t, x_{t+1}$; hidden states $h_{t-1}, h_t, h_{t+1}$.
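The naive scheme in the figure can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the tanh cell, the inverted-dropout scaling by 1/(1 - p), and all names (`dropout_mask`, `naive_dropout_rnn`, `W_x`, `W_h`, `W_y`) are hypothetical and not taken from the slides. What it does show is the key property: a fresh mask at every time step on inputs and outputs, and no dropout on the recurrent connection.

```python
import math
import random

def dropout_mask(size, p_drop, rng):
    """Fresh inverted-dropout mask: 0 with prob. p_drop, else 1/(1 - p_drop)."""
    keep = 1.0 - p_drop
    return [(1.0 / keep if rng.random() < keep else 0.0) for _ in range(size)]

def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def naive_dropout_rnn(xs, W_x, W_h, W_y, p_drop, rng):
    """Tanh RNN with dropout on inputs and outputs only (assumed cell type)."""
    h = [0.0] * len(W_h)
    outputs = []
    for x in xs:
        # New input mask at every time step.
        x_drop = [m * v for m, v in zip(dropout_mask(len(x), p_drop, rng), x)]
        pre = [a + b for a, b in zip(matvec(W_x, x_drop), matvec(W_h, h))]
        h = [math.tanh(a) for a in pre]  # recurrent state: no dropout applied
        # New output mask at every time step.
        h_drop = [m * v for m, v in zip(dropout_mask(len(h), p_drop, rng), h)]
        outputs.append(matvec(W_y, h_drop))
    return outputs, h

# Made-up weights and a two-step input sequence.
rng = random.Random(0)
W_x = [[0.5, -0.2], [0.1, 0.3], [0.0, 0.4]]
W_h = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
W_y = [[1.0, 1.0, 1.0]]
xs = [[1.0, 2.0], [0.5, -1.0]]
outs, h_final = naive_dropout_rnn(xs, W_x, W_h, W_y, 0.5, rng)
print(outs, h_final)
```

Because the masks are resampled per step, two time steps of the same sequence generally see different dropped units, which is what the different colours in the figure indicate.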
Dropout in recurrent neural networks
Why not use dropout with recurrent layers?
◮ It doesn’t work
◮ Noise drowns the signal
◮ Because it’s not used correctly?
First, some background on Bayesian modelling and VI in Bayesian neural networks.
Bayesian modelling and inference
◮ Observed inputs $X = \{x_i\}_{i=1}^N$ and outputs $Y = \{y_i\}_{i=1}^N$
◮ Capture stochastic process believed to have generated outputs
◮ Def. $\omega$ model parameters as r.v.
◮ Prior dist. over $\omega$: $p(\omega)$
◮ Likelihood: $p(Y \mid \omega, X)$
◮ Posterior (Bayes’ theorem):
$$p(\omega \mid X, Y) = \frac{p(Y \mid \omega, X)\, p(\omega)}{p(Y \mid X)}$$
◮ Predictive distribution given new input $x^*$:
$$p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, \omega)\, \underbrace{p(\omega \mid X, Y)}_{\text{posterior}}\, d\omega$$
◮ But... $p(\omega \mid X, Y)$ is often intractable
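For intuition, here is a toy case where the posterior and the predictive distribution are tractable. It is a sketch under simplifying assumptions that are not in the slides: a Bernoulli model $p(y = 1 \mid \omega) = \omega$ with no inputs, a uniform prior, and $\omega$ restricted to a discrete grid so that the integrals become sums. The function names and the data are made up.

```python
def grid_posterior(ys, grid, prior):
    """Posterior p(w | Y) on a discrete grid via Bayes' theorem."""
    unnorm = []
    for w, p in zip(grid, prior):
        lik = 1.0
        for y in ys:
            lik *= w if y == 1 else (1.0 - w)  # Bernoulli likelihood p(Y | w)
        unnorm.append(lik * p)                 # likelihood times prior
    evidence = sum(unnorm)                     # p(Y), the normaliser
    return [u / evidence for u in unnorm]

def predictive(grid, posterior):
    """p(y* = 1 | Y) = sum over w of p(y* = 1 | w) p(w | Y)."""
    return sum(w * p for w, p in zip(grid, posterior))

grid = [i / 100 for i in range(1, 100)]  # w in {0.01, ..., 0.99}
prior = [1.0 / len(grid)] * len(grid)    # uniform prior over the grid
ys = [1, 1, 0, 1, 1, 1, 0, 1]            # made-up observations
post = grid_posterior(ys, grid, prior)
print(predictive(grid, post))
```

With a neural network, $\omega$ has far too many dimensions for a grid like this, which is one way to see why the posterior is called intractable.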
Approximate inference
◮ Approximate $p(\omega \mid X, Y)$ with simple dist. $q_\theta(\omega)$
◮ Minimise divergence from posterior w.r.t. $\theta$:
$$\mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega \mid X, Y)\big)$$
◮ Identical to minimising
$$\mathcal{L}_{\text{VI}}(\theta) := -\int q_\theta(\omega)\, \underbrace{\log p(Y \mid X, \omega)}_{\text{likelihood}}\, d\omega + \mathrm{KL}\big(q_\theta(\omega) \,\|\, \underbrace{p(\omega)}_{\text{prior}}\big)$$
◮ We can approximate the predictive distribution
$$q_\theta(y^* \mid x^*) = \int p(y^* \mid x^*, \omega)\, q_\theta(\omega)\, d\omega.$$
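The step from the KL divergence to the VI objective is short enough to spell out. The derivation below only uses Bayes’ theorem for the posterior and the definition of the KL divergence; it is a standard argument rather than something stated on the slides.

```latex
\begin{align*}
\mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega \mid X, Y)\big)
  &= \int q_\theta(\omega) \log \frac{q_\theta(\omega)}{p(\omega \mid X, Y)} \, d\omega \\
  &= \int q_\theta(\omega) \log \frac{q_\theta(\omega)\, p(Y \mid X)}{p(Y \mid X, \omega)\, p(\omega)} \, d\omega \\
  &= -\int q_\theta(\omega) \log p(Y \mid X, \omega) \, d\omega
     + \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big)
     + \log p(Y \mid X).
\end{align*}
```

The last term, $\log p(Y \mid X)$, does not depend on $\theta$, so the two objectives differ by a constant and share the same minimisers.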
Bayesian neural networks
◮ Place prior $p(W_i)$: $W_i \sim \mathcal{N}(0, I)$ for $i \le L$ (and write $\omega := \{W_i\}_{i=1}^L$).
◮ Output is a r.v. $f(x, \omega) = W_L\, \sigma\big(\dots W_2\, \sigma(W_1 x + b_1) \dots\big)$.
◮ Softmax likelihood for class.: $p(y \mid x, \omega) = \mathrm{softmax}\big(f(x, \omega)\big)$,
  or a Gaussian for regression: $p(y \mid x, \omega) = \mathcal{N}\big(y;\, f(x, \omega),\, \tau^{-1} I\big)$.
◮ But difficult to evaluate posterior $p(\omega \mid X, Y)$.
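To see why the output is a random variable, it helps to draw a few weight sets from the prior and push the same input through the network. The sketch below makes several assumptions for illustration: ReLU as the nonlinearity $\sigma$, biases omitted, tiny layer sizes, and hypothetical helper names (`sample_weights`, `forward`, `softmax`).

```python
import math
import random

def sample_weights(shapes, rng):
    """Draw each W_i with i.i.d. N(0, 1) entries, i.e. W_i ~ N(0, I)."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
            for rows, cols in shapes]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(x, weights):
    """f(x, omega) = W_L sigma(... W_2 sigma(W_1 x)); biases omitted here."""
    h = x
    for i, W in enumerate(weights):
        h = [sum(w * v for w, v in zip(row, h)) for row in W]
        if i < len(weights) - 1:
            h = [max(0.0, v) for v in h]  # sigma = ReLU (an assumption)
    return h

rng = random.Random(0)
shapes = [(4, 2), (3, 4)]  # W_1 is 4x2, W_2 is 3x4: 2 inputs, 3 classes
x = [1.0, -0.5]
samples = []
for _ in range(3):
    omega = sample_weights(shapes, rng)       # one draw of omega from the prior
    samples.append(softmax(forward(x, omega)))
    print(samples[-1])
```

Each draw of $\omega$ gives a different class distribution for the same $x$. Replacing the prior draws with draws from the posterior is what the predictive distribution averages over.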
Approximate inference in Bayesian NNs
◮ Def $q_\theta(\omega)$ to approximate posterior $p(\omega \mid X, Y)$
◮ KL divergence to minimise:
$$\mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega \mid X, Y)\big) \propto -\int q_\theta(\omega) \log p(Y \mid X, \omega)\, d\omega + \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big) =: \mathcal{L}(\theta)$$
◮ Approximate the integral with MC integration, $\hat{\omega} \sim q_\theta(\omega)$:
$$\hat{\mathcal{L}}(\theta) := -\log p(Y \mid X, \hat{\omega}) + \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big)$$
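Stochastic approx. inf. in Bayesian NNs
◮ Unbiased estimator:
$$\mathbb{E}_{\hat{\omega} \sim q_\theta(\omega)}\big[\hat{\mathcal{L}}(\theta)\big] = \mathcal{L}(\theta)$$
◮ Converges to the same optima as $\mathcal{L}(\theta)$
◮ For inference, repeat:
  ◮ Sample $\hat{\omega} \sim q_\theta(\omega)$
  ◮ And minimise (one step)
$$\hat{\mathcal{L}}(\theta) = -\log p(Y \mid X, \hat{\omega}) + \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big)$$
    w.r.t. $\theta$.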
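The sample-then-step loop can be tried on a one-parameter model where the exact posterior is known. Everything specific here is an assumption made for illustration, not the setup from the slides: a scalar weight with $y_i = \omega x_i + \text{noise}$ and unit noise variance, prior $\mathcal{N}(0, 1)$, $q_\theta(\omega) = \mathcal{N}(\mu, \sigma^2)$ with $\theta = (\mu, \rho)$ and $\sigma = e^\rho$, gradients taken through the reparameterisation $\hat{\omega} = \mu + \sigma\epsilon$, and a fixed step size. The name `fit_svi` and the data are made up.

```python
import math
import random

def fit_svi(xs, ys, steps=20000, lr=0.002, seed=0):
    """Single-sample stochastic VI for y = w * x + unit-variance Gaussian noise."""
    rng = random.Random(seed)
    mu, rho = 0.0, 0.0                   # theta = (mu, rho), sigma = exp(rho)
    for _ in range(steps):
        sigma = math.exp(rho)
        eps = rng.gauss(0.0, 1.0)
        w = mu + sigma * eps             # sample w_hat ~ q_theta(w)
        # d/dw of -log p(Y | X, w_hat) under the assumed Gaussian likelihood
        g_w = -sum((y - w * x) * x for x, y in zip(xs, ys))
        # Gradients of L_hat = -log p(Y | X, w_hat) + KL(q_theta || N(0, 1)),
        # where KL = 0.5 * (sigma^2 + mu^2 - 1) - rho.
        g_mu = g_w + mu
        g_rho = g_w * eps * sigma + (sigma ** 2 - 1.0)
        mu -= lr * g_mu                  # one minimisation step w.r.t. theta
        rho -= lr * g_rho
    return mu, math.exp(rho)

xs = [0.5, 1.0, 1.5, 2.0]                # made-up inputs
ys = [1.1, 1.9, 3.2, 3.9]                # made-up outputs
mu, sigma = fit_svi(xs, ys)

# For this conjugate toy model the exact posterior is Gaussian.
precision = 1.0 + sum(x * x for x in xs)
exact_mean = sum(x * y for x, y in zip(xs, ys)) / precision
exact_sd = precision ** -0.5
print(mu, sigma, exact_mean, exact_sd)
```

Since each step uses a single noisy sample, the fitted values hover around the exact posterior mean and standard deviation rather than landing on them exactly; a smaller or decaying step size would tighten that.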