
Open Problems in Deep Learning: A Bayesian Solution – Dmitry P. Vetrov, Research Professor – PowerPoint PPT Presentation



  1. Open Problems in Deep Learning: A Bayesian solution Dmitry P. Vetrov Research professor at HSE, Head of Bayesian methods research group http://bayesgroup.ru

  2. Idea of the talk

  3. Deep Learning • A revolution in machine learning • Deep neural networks approach human intelligence on a number of problems • They can solve quite non-standard problems such as image2caption and artistic style transfer

  4. Open problems in Deep learning • Overfitting Neural networks are prone to catastrophic overfitting on noisy data • Interpretability Nobody knows HOW a neural network makes its decisions – crucial for healthcare and finance, where legislative restrictions are expected • Uncertainty estimation Current neural networks are highly over-confident even when they make mistakes; in many applications (e.g. self-driving cars) it is important to estimate the uncertainty of a prediction • Adversarial examples Neural networks can easily be fooled by barely visible perturbations of the data

  5. Bayesian framework • Treats everything as random variables • Allows us to encode our ignorance in terms of distributions • Makes use of Bayes' theorem (see the formula below)
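For reference, Bayes' theorem for a parameter θ and data D, written in standard notation (not a verbatim reproduction of the slide's formula):

    p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)}
                     \;=\; \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}

The prior p(θ) encodes our ignorance before seeing the data, and the posterior p(θ | D) is the updated state of knowledge.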


  9. Frequentist vs. Bayesian frameworks

  10. Frequentist vs. Bayesian frameworks • It can be shown that Bayesian inference reduces to the frequentist (maximum-likelihood) estimate as the amount of data grows relative to the number of parameters (see the formula below) • In other words, the frequentist framework is a limiting case of the Bayesian one! • The number of tunable parameters d in modern ML models is comparable with the size n of the training data • We have no choice but to be Bayesian!
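A standard way to write the limiting statement referenced above (a reconstruction in common notation, not the slide's exact formula), with d parameters and n observations:

    \lim_{n/d \to \infty} p(\theta \mid x_1, \dots, x_n) \;=\; \delta\big(\theta - \theta_{\mathrm{ML}}\big),
    \qquad
    \theta_{\mathrm{ML}} \;=\; \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta)

When the data vastly outnumber the parameters, the posterior collapses onto the maximum-likelihood point; when d is comparable to n, as in modern deep models, the two frameworks genuinely differ.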

  11. Bayesian Neural networks • In Bayesian DNNs we treat the weights w of the neural network as random variables • First we define a reasonable prior p(w) • Next, given the training data, we perform Bayesian inference and derive the posterior over the weights (sketched below)
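In standard notation (added here for reference, not the slide's exact formulas), the posterior over the weights and the resulting predictive distribution are:

    p(w \mid D) \;=\; \frac{p(D \mid w)\, p(w)}{p(D)},
    \qquad
    p(y \mid x, D) \;=\; \int p(y \mid x, w)\, p(w \mid D)\, dw

The integral over the posterior is what turns a single network into an (infinite) ensemble, which later slides exploit.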

  16. Advantages of Bayesian framework • Regularization Prevents overfitting on the training data because the prior does not allow the parameters to be tuned too aggressively • Extensibility Bayesian inference yields a posterior which can then be used as the prior for the next model • Ensembling The posterior distribution over the weights defines an ensemble of neural networks rather than a single network • Model selection Automatically selects the simplest model that explains the observed data, thus implementing Occam's razor • Scalability Stochastic variational inference allows us to approximate posteriors using deep neural networks (the objective is sketched below)
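The objective behind stochastic variational inference, written in standard form (added for reference): one maximizes the evidence lower bound (ELBO) over an approximate posterior q_φ(w),

    \log p(D) \;\ge\; \mathcal{L}(\phi)
    \;=\; \mathbb{E}_{q_\phi(w)}\big[\log p(D \mid w)\big]
    \;-\; \mathrm{KL}\big(q_\phi(w)\,\|\,p(w)\big)

The expectation is estimated with mini-batches and Monte Carlo samples of w, which is what makes the approach scale to deep networks.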

  17. Dropout • A purely heuristic regularization procedure • Injects either Bernoulli or Gaussian noise into the weights during training • The magnitudes of the noise are set manually (a minimal sketch of the two noise models follows below)
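A minimal sketch of the two noise models mentioned on the slide, in plain NumPy; the function name and the fixed rate p are illustrative, not taken from the talk:

    import numpy as np

    def dropout_noise(w, p=0.5, kind="bernoulli", rng=None):
        # Multiplicative noise injected into weights (or activations) during training.
        # p is the drop probability; for Gaussian ("fast") dropout the noise variance
        # alpha = p / (1 - p) is chosen to match the Bernoulli case.
        rng = rng or np.random.default_rng()
        if kind == "bernoulli":
            mask = rng.binomial(1, 1.0 - p, size=w.shape) / (1.0 - p)  # inverted-dropout scaling
            return w * mask
        return w * rng.normal(1.0, np.sqrt(p / (1.0 - p)), size=w.shape)  # Gaussian dropout

In ordinary dropout p (and hence the noise magnitude) is a hand-tuned hyperparameter, which is exactly the limitation the next slide addresses.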

  18. Bayesian dropout • A theoretically justified procedure • Corresponds to training a Bayesian ensemble under a specific but interpretable prior • Allows dropout rates to be determined automatically (a layer-level sketch is given after this slide)
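One way to make the slide concrete: a linear layer with per-weight Gaussian noise whose variance is learned by maximizing the ELBO from slide 16. This is a sketch assuming PyTorch, following the variational-dropout line of work (local reparameterization and the polynomial KL approximation of Molchanov et al., 2017); it is not the talk's actual code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VariationalDropoutLinear(nn.Module):
        # Linear layer with a learned per-weight Gaussian noise level (illustrative init values).
        def __init__(self, in_features, out_features):
            super().__init__()
            self.weight = nn.Parameter(0.01 * torch.randn(out_features, in_features))
            self.bias = nn.Parameter(torch.zeros(out_features))
            # log of the per-weight noise variance sigma^2; the effective dropout
            # rate alpha = sigma^2 / weight^2 is learned rather than set by hand
            self.log_sigma2 = nn.Parameter(torch.full((out_features, in_features), -10.0))

        def forward(self, x):
            if self.training:
                # local reparameterization: sample noisy pre-activations directly
                mean = F.linear(x, self.weight, self.bias)
                var = F.linear(x * x, self.log_sigma2.exp()) + 1e-8
                return mean + var.sqrt() * torch.randn_like(mean)
            return F.linear(x, self.weight, self.bias)  # posterior mean at test time

        def kl(self):
            # KL(q(w) || log-uniform prior), polynomial approximation from Molchanov et al. (2017)
            log_alpha = self.log_sigma2 - torch.log(self.weight ** 2 + 1e-8)
            k1, k2, k3 = 0.63576, 1.87320, 1.48695
            neg_kl = k1 * torch.sigmoid(k2 + k3 * log_alpha) - 0.5 * F.softplus(-log_alpha) - k1
            return -neg_kl.sum()

Training minimizes the negative data log-likelihood plus the sum of kl() terms over all such layers (the negative ELBO); weights whose learned dropout rate grows large can then be pruned, which is where the compression mentioned on slide 25 comes from.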

  19. Visualization [figures: LeNet-5 fully-connected layer; LeNet-5 convolutional layer (100 × 100 patch)]

  20. Avoiding narrow extrema • [Stochastic variational] Bayesian inference corresponds to the injection of noise into the gradients • The larger the noise, the lower the spatial resolution (see the formula below) • A Bayesian DNN simply DOES NOT SEE narrow local minima
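One way to read the "spatial resolution" remark (an added illustration, assuming zero-mean Gaussian weight noise): the expected loss under the noise is the loss convolved with the noise density,

    \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, \sigma^2 I)}\big[L(w + \varepsilon)\big]
    \;=\; \int L(w + \varepsilon)\, \mathcal{N}(\varepsilon \mid 0, \sigma^2 I)\, d\varepsilon
    \;=\; \big(L * \mathcal{N}(0, \sigma^2 I)\big)(w)

so the effective objective is a blurred loss surface in which minima narrower than σ are averaged away.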

  21. Avoiding catastrophic overfitting • Bayesian model selection procedures effectively apply the well-known Occam's razor • They search for the simplest model capable of explaining the training data • If there are no dependencies between inputs and outputs, a Bayesian DNN will never be able to learn them, since there always exists a simpler NULL model "With all things being equal, the simplest explanation tends to be the right one." – William of Ockham

  22. Ensembles of ML algorithms • If we have several ML algorithms, their average is generally better than applying the single best one • The problem is that we need to train them all and keep them all in memory • Such a technique is not scalable! • Bayesian ensembles are very compact (yet consist of a continuum of elements) – you only need to sample from the posterior (see the sketch below) [figure: accuracy of single algorithms vs. the single best algorithm vs. the ensemble]
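A minimal sketch of how such a compact ensemble is used at test time, assuming a PyTorch classifier whose weight noise stays active in train() mode (e.g. the variational-dropout layer sketched above); the function name and sample count are illustrative:

    import torch

    def ensemble_predict(model, x, n_samples=20):
        # Each stochastic forward pass uses a fresh weight sample from the approximate
        # posterior, so averaging the passes approximates the predictive distribution
        # p(y | x, D) = \int p(y | x, w) p(w | D) dw with n_samples Monte Carlo draws.
        model.train()  # keep the weight noise switched on
        with torch.no_grad():
            probs = torch.stack([torch.softmax(model(x), dim=-1)
                                 for _ in range(n_samples)])
        return probs.mean(dim=0)  # averaged predictive distribution

Only one set of variational parameters is stored, yet each forward pass corresponds to a different member of the (continuum-sized) ensemble.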

  23. Real data example

  24. Robustness to adversarial attacks • Adversarial examples are another problem for DNNs • Single DNNs are very sensitive to adversarial attacks • Ensembles over a continuum of DNNs are almost impossible to fool [figure: a "panda" image misclassified as a "gibbon" after a barely visible perturbation]

  25. Setting desirable properties By selecting a proper prior we may encourage desired properties in a Bayesian DNN: • Sparsity (compression) • Group sparsity (acceleration) • Rich ensembles (improved final accuracy, better uncertainty estimation) • Reliability (robustness to adversarial attacks) • Interpretability (hard attention maps) Techniques expected to become Bayesian soon: • GANs • Normalization algorithms (batchnorm, weightnorm, etc.)

  26. Conclusions • The Bayesian framework is extremely powerful and extends the ML toolbox • We do have scalable algorithms for approximate Bayesian inference • Bayes + Deep Learning = • Even the first attempts at NeuroBayesian inference give impressive results • Summer school on NeuroBayesian methods, August 2018, Moscow, http://deepbayes.ru
