Dropout in RNNs Following a VI Interpretation
Yarin Gal
yg279@cam.ac.uk
Unless specified otherwise, photos are either original work or taken from Wikimedia, under Creative Commons license

Recurrent Neural Networks
Recurrent neural networks (RNNs) are damn useful.
Figure: RNN structure (image source: karpathy.github.io/2015/05/21/rnn-effectiveness)

Recurrent Neural Networks
But these also overfit very quickly...
Figure: Overfitting
This means...
◮ We can’t use large models
◮ We have to use early stopping
◮ We can’t use small data
◮ We have to waste data for validation sets...

Dropout in recurrent neural networks
Let’s use dropout then. But lots of research has claimed that that’s a bad idea:
◮ Pachitariu & Sahani, 2013
  ◮ noise added in the recurrent connections of an RNN leads to model instabilities
◮ Bayer et al., 2013
  ◮ with dropout, the RNN’s dynamics change dramatically
◮ Pham et al., 2014
  ◮ dropout in recurrent layers disrupts the RNN’s ability to model sequences
◮ Zaremba et al., 2014
  ◮ applying dropout to the non-recurrent connections alone results in improved performance
◮ Bluche et al., 2015
  ◮ exploratory analysis of the performance of dropout before, inside, and after the RNN’s

Dropout in recurrent neural networks
→ Existing research has settled on using dropout for inputs and outputs alone:
Figure: Naive application of dropout in RNNs (colours = different dropout masks). The diagram shows inputs $x_{t-1}, x_t, x_{t+1}$ and hidden states $h_{t-1}, h_t, h_{t+1}$.
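A minimal sketch of the naive scheme in the figure, in plain Python. Everything here is an assumption made for illustration: the tanh RNN cell, the weight names `W_x`, `W_h`, `W_y`, and the "inverted" rescaling of the mask. The only point it tries to show is that a new mask is drawn at every time step on the input and output sides, while the recurrent connection is left alone.

```python
import math
import random

def dropout_mask(size, p_drop, rng):
    """Sample a fresh Bernoulli mask, rescaled by 1/keep (inverted dropout, assumed)."""
    keep = 1.0 - p_drop
    return [(1.0 / keep if rng.random() < keep else 0.0) for _ in range(size)]

def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def naive_dropout_rnn(xs, W_x, W_h, W_y, p_drop, rng):
    """Run a tanh RNN with dropout on inputs and outputs only."""
    h = [0.0] * len(W_h)
    outputs = []
    for x in xs:
        # A new mask at every time step on the input ...
        x_mask = dropout_mask(len(x), p_drop, rng)
        x_dropped = [m * xi for m, xi in zip(x_mask, x)]
        # ... while the recurrent connection h_{t-1} -> h_t is left untouched.
        pre = [a + b for a, b in zip(matvec(W_x, x_dropped), matvec(W_h, h))]
        h = [math.tanh(a) for a in pre]
        # Another new mask, applied on the output side only.
        h_mask = dropout_mask(len(h), p_drop, rng)
        outputs.append(matvec(W_y, [m * hi for m, hi in zip(h_mask, h)]))
    return outputs

# Hypothetical toy weights and a three-step input sequence.
W_x = [[0.5, -0.2], [0.1, 0.3]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
W_y = [[1.0, -1.0]]
xs = [[1.0, 2.0], [0.5, -1.0], [0.0, 1.0]]
print(naive_dropout_rnn(xs, W_x, W_h, W_y, p_drop=0.5, rng=random.Random(0)))
```

With `p_drop=0.0` every mask is all ones and the network is deterministic, which is a quick way to check that the noise enters only through the two masks.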

Dropout in recurrent neural networks
Why not use dropout with recurrent layers?
◮ It doesn’t work
◮ Noise drowns the signal
◮ Because it’s not used correctly?
First, some background on Bayesian modelling and VI in Bayesian neural networks.

Bayesian modelling and inference
◮ Observed inputs $X = \{x_i\}_{i=1}^{N}$ and outputs $Y = \{y_i\}_{i=1}^{N}$
◮ Capture stochastic process believed to have generated outputs
◮ Def. $\omega$ model parameters as r.v.
◮ Prior dist. over $\omega$: $p(\omega)$
◮ Likelihood: $p(Y \mid \omega, X)$
◮ Posterior (Bayes’ theorem): $p(\omega \mid X, Y) = \frac{p(Y \mid \omega, X)\, p(\omega)}{p(Y \mid X)}$
◮ Predictive distribution given new input $x^*$: $p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, \omega)\, \underbrace{p(\omega \mid X, Y)}_{\text{posterior}}\, d\omega$
◮ But... $p(\omega \mid X, Y)$ is often intractable
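To make the posterior and predictive formulas concrete, here is a small sketch in which the integral becomes a sum because $\omega$ only takes values on a finite grid. The model (a single slope `w` with $y = w x$ plus Gaussian noise), the uniform prior, and the data are all made up for illustration.

```python
import math

def gaussian_pdf(y, mean, std):
    """Density of N(mean, std^2) evaluated at y."""
    return math.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def grid_posterior(X, Y, grid, prior, noise_std):
    """Posterior p(w | X, Y) on a finite grid via Bayes' theorem."""
    joint = []
    for w, p_w in zip(grid, prior):
        lik = 1.0
        for x, y in zip(X, Y):
            lik *= gaussian_pdf(y, w * x, noise_std)  # p(Y | w, X)
        joint.append(lik * p_w)                       # p(Y | w, X) p(w)
    evidence = sum(joint)                             # p(Y | X)
    return [j / evidence for j in joint]

def predictive_mean(x_star, grid, posterior):
    """Mean of p(y* | x*, X, Y): the model's prediction averaged over the posterior."""
    return sum(p * w * x_star for w, p in zip(grid, posterior))

grid = [i / 10 for i in range(-30, 31)]   # candidate slopes from -3 to 3
prior = [1.0 / len(grid)] * len(grid)     # uniform prior (an assumption)
X = [1.0, 2.0, 3.0]
Y = [2.1, 3.9, 6.2]
post = grid_posterior(X, Y, grid, prior, noise_std=0.5)
print(predictive_mean(4.0, grid, post))
```

The grid trick only works for a handful of parameters. For a neural network the sum over all weight settings is hopeless, which is the intractability the last bullet refers to.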

Approximate inference
◮ Approximate $p(\omega \mid X, Y)$ with simple dist. $q_\theta(\omega)$
◮ Minimise divergence from posterior w.r.t. $\theta$: $\mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega \mid X, Y)\big)$
◮ Identical to minimising $\mathcal{L}_{\mathrm{VI}}(\theta) := -\int q_\theta(\omega)\, \underbrace{\log p(Y \mid X, \omega)}_{\text{likelihood}}\, d\omega + \mathrm{KL}\big(q_\theta(\omega) \,\|\, \underbrace{p(\omega)}_{\text{prior}}\big)$
◮ We can approximate the predictive distribution: $q_\theta(y^* \mid x^*) = \int p(y^* \mid x^*, \omega)\, q_\theta(\omega)\, d\omega$
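The step from the KL divergence to the objective is not spelled out on the slide. A short derivation, using only Bayes' theorem for the posterior, shows that the two differ by a term that does not depend on $\theta$:

```latex
\begin{align*}
\mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega \mid X, Y)\big)
  &= \int q_\theta(\omega) \log \frac{q_\theta(\omega)}{p(\omega \mid X, Y)} \, d\omega \\
  &= \int q_\theta(\omega) \log \frac{q_\theta(\omega)\, p(Y \mid X)}{p(Y \mid X, \omega)\, p(\omega)} \, d\omega \\
  &= -\int q_\theta(\omega) \log p(Y \mid X, \omega) \, d\omega
     + \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big)
     + \log p(Y \mid X) \\
  &= \mathcal{L}_{\mathrm{VI}}(\theta) + \log p(Y \mid X).
\end{align*}
```

Since $\log p(Y \mid X)$ is constant in $\theta$, both objectives have the same minimisers, and the second one never touches the intractable posterior.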

Bayesian neural networks
◮ Place prior $p(W_i)$: $W_i \sim \mathcal{N}(0, I)$ for $i \le L$ (and write $\omega := \{W_i\}_{i=1}^{L}$).
◮ Output is a r.v. $f(x, \omega) = W_L \sigma\big(\dots W_2 \sigma(W_1 x + b_1) \dots\big)$.
◮ Softmax likelihood for class.: $p(y \mid x, \omega) = \mathrm{softmax}\big(f(x, \omega)\big)$, or a Gaussian for regression: $p(y \mid x, \omega) = \mathcal{N}\big(y; f(x, \omega), \tau^{-1} I\big)$.
◮ But difficult to evaluate posterior $p(\omega \mid X, Y)$.
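A rough sketch of what "the output is a random variable" means in practice: each draw of the weights from the prior gives a different function $f(x, \omega)$ and therefore a different softmax output. The layer sizes, the choice of tanh for $\sigma$, and the zero biases are assumptions made for this example.

```python
import math
import random

def sample_prior_weights(layer_sizes, rng):
    """Draw each W_i with i.i.d. standard normal entries, i.e. W_i ~ N(0, I)."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, weights, biases):
    """f(x, w) = W_L sigma(... W_2 sigma(W_1 x + b_1) ...), with sigma = tanh (assumed)."""
    h = x
    for i, W in enumerate(weights):
        h = [sum(w * v for w, v in zip(row, h)) for row in W]
        if i < len(weights) - 1:  # hidden layers: add bias, apply the non-linearity
            h = [math.tanh(a + b) for a, b in zip(h, biases[i])]
    return h

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
x = [1.0, -0.5, 0.2]
biases = [[0.0] * 4]                       # one hidden layer with 4 units
for _ in range(3):
    omega = sample_prior_weights([3, 4, 2], rng)
    print(softmax(forward(x, omega, biases)))  # p(y | x, omega) for this draw
```

The three printed distributions differ because each uses a different $\omega$. Averaging such outputs over the posterior rather than the prior would give the predictive distribution, which is exactly the part that is hard to evaluate.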

Approximate inference in Bayesian NNs
◮ Def $q_\theta(\omega)$ to approximate posterior $p(\omega \mid X, Y)$
◮ KL divergence to minimise: $\mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega \mid X, Y)\big) \propto -\int q_\theta(\omega) \log p(Y \mid X, \omega)\, d\omega + \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big) =: \mathcal{L}(\theta)$
◮ Approximate the integral with MC integration, $\hat{\omega} \sim q_\theta(\omega)$: $\hat{\mathcal{L}}(\theta) := -\log p(Y \mid X, \hat{\omega}) + \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big)$
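The MC step can be checked on a toy problem where the integral is also available in closed form. The sketch below assumes a one-parameter model $y_i \sim \mathcal{N}(w x_i, \tau^{-1})$ with prior $w \sim \mathcal{N}(0, 1)$ and $q_\theta(w) = \mathcal{N}(m, s^2)$; none of these choices come from the slides, they are just the simplest case that can be verified by hand.

```python
import math
import random

def neg_log_lik(w, X, Y, tau):
    """-log p(Y | X, w) for y_i ~ N(w * x_i, 1 / tau)."""
    return sum(0.5 * tau * (y - w * x) ** 2 - 0.5 * math.log(tau / (2 * math.pi))
               for x, y in zip(X, Y))

def kl_to_prior(m, s):
    """KL(N(m, s^2) || N(0, 1)) in closed form."""
    return 0.5 * (s * s + m * m - 1.0) - math.log(s)

def loss_estimate(m, s, X, Y, tau, rng):
    """Single-sample estimate: draw w_hat ~ q_theta and plug it into the objective."""
    w_hat = rng.gauss(m, s)
    return neg_log_lik(w_hat, X, Y, tau) + kl_to_prior(m, s)

def loss_exact(m, s, X, Y, tau):
    """The same objective with the integral done analytically,
    using E[(y - w x)^2] = (y - m x)^2 + s^2 x^2 under w ~ N(m, s^2)."""
    expected_nll = sum(0.5 * tau * ((y - m * x) ** 2 + s * s * x * x)
                       - 0.5 * math.log(tau / (2 * math.pi))
                       for x, y in zip(X, Y))
    return expected_nll + kl_to_prior(m, s)

X, Y, tau = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2], 4.0
rng = random.Random(0)
n = 20000
average = sum(loss_estimate(2.0, 0.1, X, Y, tau, rng) for _ in range(n)) / n
print(average, loss_exact(2.0, 0.1, X, Y, tau))
```

Any single estimate is noisy, but the average over many draws settles near the exact value, which is what makes the one-sample version usable inside an optimiser.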

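Stochastic approx. inf. in Bayesian NNs
◮ Unbiased estimator: $\mathbb{E}_{\hat{\omega} \sim q_\theta(\omega)}\big[\hat{\mathcal{L}}(\theta)\big] = \mathcal{L}(\theta)$
◮ Converges to the same optima as $\mathcal{L}(\theta)$
◮ For inference, repeat:
  ◮ Sample $\hat{\omega} \sim q_\theta(\omega)$
  ◮ And minimise (one step) $\hat{\mathcal{L}}(\theta) = -\log p(Y \mid X, \hat{\omega}) + \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big)$ w.r.t. $\theta$.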
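The sample-then-step loop can be sketched on a toy model: $y_i \sim \mathcal{N}(w x_i, \tau^{-1})$, prior $w \sim \mathcal{N}(0, 1)$, and $q_\theta(w) = \mathcal{N}(m, s^2)$. Two things here are assumptions that the slides do not specify: the sample is written as $\hat{w} = m + s\,\epsilon$ so that gradients w.r.t. $\theta$ can pass through it, and $s$ is stored as $\exp(\rho)$ to keep it positive. The function name `fit_q`, the step size, and the data are hypothetical.

```python
import math
import random

def fit_q(X, Y, tau, steps, lr, rng):
    """Stochastic VI for y_i ~ N(w x_i, 1/tau), prior w ~ N(0, 1), q = N(m, s^2)."""
    m, rho = 0.0, 0.0                  # theta = (m, rho), with s = exp(rho) > 0
    for _ in range(steps):
        s = math.exp(rho)
        eps = rng.gauss(0.0, 1.0)
        w_hat = m + s * eps            # sample w_hat ~ q_theta (reparameterised, assumed)
        # Derivative of -log p(Y | X, w) w.r.t. w, evaluated at w_hat.
        g_w = sum(-tau * (y - w_hat * x) * x for x, y in zip(X, Y))
        # Gradients of L_hat = -log p(Y | X, w_hat) + KL(q || p), KL in closed form.
        g_m = g_w + m
        g_s = g_w * eps + s - 1.0 / s
        m -= lr * g_m                  # one minimisation step w.r.t. theta
        rho -= lr * g_s * s            # chain rule: ds/drho = s
    return m, math.exp(rho)

X, Y, tau = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2], 4.0
m, s = fit_q(X, Y, tau, steps=20000, lr=0.0005, rng=random.Random(0))
print(m, s)
```

For this particular toy model the exact posterior is Gaussian with mean $\tau \sum x_i y_i / (1 + \tau \sum x_i^2) = 2$ and variance $1/57$, so the fitted `m` and `s` can be compared against $2$ and $1/\sqrt{57} \approx 0.13$. In a neural network no such check exists, and the noisy loop is all there is.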
