Open Problems in Deep Learning: A Bayesian Solution
Dmitry P. Vetrov
Research professor at HSE, Head of the Bayesian Methods Research Group
http://bayesgroup.ru
Idea of the talk
Deep Learning
• A revolution in machine learning
• Deep neural networks approach human-level performance on a number of problems
• They can solve quite non-standard problems such as image captioning (image2caption) and artistic style transfer
Open problems in Deep Learning
• Overfitting: neural networks are prone to catastrophic overfitting on noisy data
• Interpretability: nobody knows HOW a neural network makes its decisions; this is crucial for healthcare and finance, and legislative restrictions are expected
• Uncertainty estimation: current neural networks are very over-confident even when they make mistakes; in many applications (e.g. self-driving cars) it is important to estimate the uncertainty of a prediction
• Adversarial examples: neural networks can easily be fooled by barely visible perturbations of the data
Bayesian framework
• Treats everything as a random variable
• Allows us to encode our ignorance in terms of distributions
• Makes use of Bayes' theorem
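For reference, Bayes' theorem in the form used throughout the talk, written for parameters θ and observed data D (the notation here is mine, not from the slides):

```latex
p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)},
\qquad
p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta
```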
Frequentist vs. Bayesian frameworks
Frequentist vs. Bayesian frameworks
• It can be shown that the frequentist (maximum-likelihood) estimate is recovered from the Bayesian posterior in the limit n/d → ∞, i.e. when the training-set size n greatly exceeds the number of parameters d
• In other words, the frequentist framework is a limiting case of the Bayesian one!
• In modern ML models the number of tunable parameters d is comparable with the size of the training data n
• We have no choice but to be Bayesian!
Bayesian neural networks
• In Bayesian DNNs we treat the weights ω of the neural network as random variables
• First we define a reasonable prior p(ω)
• Then we perform Bayesian inference given the training data and derive the posterior
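In formulas (standard BNN notation, mine rather than the slide's): the posterior over the weights given training data (X, Y), and the resulting predictive distribution, which integrates over the whole ensemble of networks:

```latex
p(\omega \mid X, Y) \;\propto\; p(Y \mid X, \omega)\, p(\omega),
\qquad
p(y^* \mid x^*, X, Y) \;=\; \int p(y^* \mid x^*, \omega)\, p(\omega \mid X, Y)\, d\omega
```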
Advantages of the Bayesian framework
• Regularization: prevents overfitting on the training data, because the prior does not allow the parameters to be tuned too much
• Extensibility: Bayesian inference yields a posterior, which can then be used as the prior for the next model
• Ensembling: the posterior distribution over the weights defines an ensemble of neural networks rather than a single network
• Model selection: automatically selects the simplest possible model that explains the observed data, thus implementing Occam's razor
• Scalability: stochastic variational inference allows posteriors to be approximated even in deep neural networks (see the sketch below)
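To make the scalability point concrete, here is a minimal sketch of stochastic variational inference for a toy Bayesian logistic-regression "layer", assuming a fully factorised Gaussian posterior, a standard normal prior, and the reparameterisation trick; the data, prior and hyperparameters are illustrative choices of mine, not from the talk.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy data: 1000 points, 20 features, binary labels (illustrative only).
X = torch.randn(1000, 20)
y = (X[:, 0] > 0).float()

# Variational parameters of q(w) = N(mu, sigma^2), one weight per feature.
mu = torch.zeros(20, requires_grad=True)
log_sigma = torch.full((20,), -3.0, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

prior_sigma = 1.0  # prior p(w) = N(0, 1)

for step in range(2000):
    idx = torch.randint(0, X.shape[0], (64,))        # random mini-batch
    eps = torch.randn_like(mu)
    w = mu + torch.exp(log_sigma) * eps              # reparameterisation trick
    logits = X[idx] @ w
    # Negative ELBO = expected NLL (rescaled to the full dataset) + KL(q || p)
    nll = F.binary_cross_entropy_with_logits(logits, y[idx], reduction="mean") * X.shape[0]
    kl = (torch.log(prior_sigma / torch.exp(log_sigma))
          + (torch.exp(2 * log_sigma) + mu ** 2) / (2 * prior_sigma ** 2) - 0.5).sum()
    loss = nll + kl
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Each step uses one weight sample and one mini-batch; this single-sample, mini-batch estimate of the ELBO gradient is exactly the "stochastic" part that lets the approach scale to deep networks.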
Dropout
• A purely heuristic regularization procedure
• Injects either Bernoulli or Gaussian noise into the weights during training
• The magnitude of the noise is set manually
Bayesian dropout
• A theoretically justified procedure
• Corresponds to training a Bayesian ensemble under a specific but interpretable prior
• Allows the dropout rates to be set automatically
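A hedged sketch of the mechanism: multiplicative Gaussian noise on the activations whose rate is a learnable parameter rather than a hand-set constant. The module below is illustrative; in the full method the rates are trained through an extra KL term in the variational objective, which is what makes them "automatic".

```python
import torch

class GaussianDropout(torch.nn.Module):
    """Multiplicative Gaussian noise with a learnable rate (illustrative sketch)."""
    def __init__(self, size):
        super().__init__()
        # log_alpha parameterises the noise variance alpha = exp(log_alpha);
        # in classical dropout alpha = p / (1 - p) is fixed by hand.
        self.log_alpha = torch.nn.Parameter(torch.full((size,), -3.0))

    def forward(self, x):
        if not self.training:
            return x                                   # noise-free at test time
        alpha = torch.exp(self.log_alpha)
        noise = 1 + alpha.sqrt() * torch.randn_like(x)  # xi ~ N(1, alpha)
        return x * noise
```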
Visualization
[Figure: learned weight visualizations for LeNet-5, fully-connected layer and convolutional layer (100 × 100 patch)]
Avoiding narrow extrema
• Stochastic variational Bayesian inference corresponds to injecting noise into the gradients
• The larger the noise, the lower the spatial resolution
• A Bayesian DNN simply DOES NOT SEE narrow local minima
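One way to state this (my formulation, not the slide's): variational inference minimises the loss averaged over the weight posterior, so any minimum narrower than the posterior width is averaged away and effectively invisible to the optimiser:

```latex
\min_{q(\omega)} \;\; \mathbb{E}_{\omega \sim q(\omega)}\, L(\omega)
\;+\; \mathrm{KL}\!\left(q(\omega)\,\|\,p(\omega)\right)
```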
Avoiding catastrophic overfitting
• Bayesian model selection procedures effectively apply the well-known Occam's razor
• They search for the simplest model capable of explaining the training data
• If there are no dependencies between inputs and outputs, a Bayesian DNN will never "learn" them, since there always exists a simpler NULL model
"With all things being equal, the simplest explanation tends to be the right one." (William of Ockham)
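The quantity behind this model-selection behaviour is the marginal likelihood (evidence); a model that spreads its prior over many unnecessary weight configurations receives a lower evidence, which is exactly Occam's razor (standard formula, not spelled out on the slide):

```latex
p(Y \mid X, M) \;=\; \int p(Y \mid X, \omega, M)\, p(\omega \mid M)\, d\omega
```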
Ensembles of ML algorithms
• If we have several ML algorithms, their average is generally better than the single best one
• The problem: we need to train all of them and keep them all in memory
• Such a technique is not scalable!
• Bayesian ensembles are very compact (yet consist of a continuum of elements): you only need to sample from the posterior (see the sketch below)
[Figure: accuracy of single algorithms vs. the single best algorithm vs. their ensemble]
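A minimal sketch of how such a "compact ensemble" is used at prediction time: draw weight samples from the approximate posterior and average the class probabilities. Here `net` and `sample_weights` are hypothetical stand-ins for a trained model and its variational posterior; only the posterior parameters need to be stored, so memory stays comparable to a single network.

```python
import torch

def predict(net, sample_weights, x, n_samples=20):
    """Monte Carlo prediction: average class probabilities over posterior samples."""
    probs = []
    for _ in range(n_samples):
        w = sample_weights()          # draw one weight configuration from q(w)
        logits = net(x, w)            # forward pass with the sampled weights
        probs.append(torch.softmax(logits, dim=-1))
    return torch.stack(probs).mean(dim=0)
```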
Real data example
Robustness to adversarial attacks
• Adversarial examples are another problem for DNNs
• Single DNNs are very sensitive to adversarial attacks
• Ensembles of a continuum of DNNs almost cannot be fooled
[Figure: the classic adversarial example, a "panda" misclassified as a "gibbon" after a barely visible perturbation]
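For concreteness, a sketch of the fast gradient sign method behind the "panda"/"gibbon" example; `net` here is any differentiable classifier, and the epsilon value is illustrative. Averaging predictions over many posterior weight samples, as in the previous sketch, removes the single fixed gradient that such attacks rely on, which is the intuition behind the robustness claim.

```python
import torch
import torch.nn.functional as F

def fgsm(net, x, y, eps=0.01):
    """Fast Gradient Sign Method: a barely visible perturbation of the input."""
    x = x.detach().clone().requires_grad_(True)
    loss = F.cross_entropy(net(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()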
Setting desirable properties
By selecting a proper prior we can encourage desired properties in a Bayesian DNN:
• Sparsity (compression)
• Group sparsity (acceleration)
• Rich ensembles (improved final accuracy, better uncertainty estimation)
• Reliability (robustness to adversarial attacks)
• Interpretability (hard attention maps)
Techniques to become Bayesian soon:
• GANs
• Normalization algorithms (batchnorm, weightnorm, etc.)
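As one concrete example of "selecting the proper prior" (used in the sparse variational dropout line of work), a log-uniform prior over the weights pushes most of them towards exact zeros, which gives the sparsity and compression effects listed above; this is my illustration rather than something spelled out on the slide:

```latex
p(\log |w|) \;\propto\; \text{const}
\quad\Longleftrightarrow\quad
p(|w|) \;\propto\; \frac{1}{|w|}
```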
Conclusions
• The Bayesian framework is extremely powerful and extends the ML toolbox
• We do have scalable algorithms for approximate Bayesian inference
• Bayes + Deep Learning =
• Even the first attempts at NeuroBayesian inference give impressive results
• Summer school on NeuroBayesian methods, August 2018, Moscow: http://deepbayes.ru