How Good is the Bayes Posterior in Deep Neural Networks Really?



  1. How Good is the Bayes Posterior in Deep Neural Networks Really? Florian Wenzel (Google Research Berlin), 15 June 2020. Joint first authors: Kevin Roth, Bas Veeling; and: Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, Sebastian Nowozin. Code: github.com/google-research/google-research/tree/master/cold_posterior_bnn

  2. Bayesian Deep Learning Goal: enable Bayesian inference for deep networks to improve robustness of predictions! Active research field where most work focuses on improving approximate inference to get closer to the Bayes posterior

  3. But is the Bayes posterior actually good?

  4. Bayesian Neural Networks (BNNs) [Figure: feed-forward neural network with input, hidden, and output layers] Neural Network: p(D | θ) = ∏_i p(y_i | x_i, θ). Different models obtained by different θ

  5. Bayesian Neural Networks (BNNs) [Figure: Bayesian neural network with input, hidden, and output layers] Bayesian Neural Network: p(θ, D) = ∏_i p(y_i | x_i, θ) p(θ). Posterior: distribution over likely models given the data, p(θ | D)
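
To make the factorization concrete, here is a minimal Python sketch of the unnormalized log joint log p(θ, D) = Σ_i log p(y_i | x_i, θ) + log p(θ). The toy logistic "network" and the standard normal prior are illustrative placeholders, not the architectures or priors studied in the paper.

    import numpy as np

    def log_prior(theta):
        # log p(theta) for a standard normal prior, up to an additive constant
        return -0.5 * np.sum(theta ** 2)

    def log_likelihood(theta, X, y):
        # sum_i log p(y_i | x_i, theta) for a toy logistic "network" (y_i in {0, 1})
        logits = X @ theta
        return np.sum(y * logits - np.logaddexp(0.0, logits))

    def log_joint(theta, X, y):
        # log p(theta, D) = log likelihood + log prior; the posterior p(theta | D)
        # is proportional to exp(log_joint(theta, X, y))
        return log_likelihood(theta, X, y) + log_prior(theta)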

  6. BNNs: Predictions In standard deep learning we optimize a single point estimate θ_SGD (MAP). BNNs use samples from the posterior (an ensemble of models): θ_1, θ_2, θ_3, ... ∼ p(θ | D) ∝ exp(−U(θ))
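
A minimal sketch of one way such samples can be drawn: (full-batch) Langevin dynamics on the energy U(θ) = −log p(θ, D). The paper uses stochastic-gradient MCMC with minibatches and preconditioning; the step size, sample count, and the grad_U argument below are illustrative assumptions.

    import numpy as np

    def langevin_samples(grad_U, theta_init, step=1e-3, n_samples=100, seed=0):
        # Iterate theta <- theta - step * grad U(theta) + sqrt(2 * step) * N(0, I);
        # for small step sizes the iterates approximate draws from
        # p(theta | D) proportional to exp(-U(theta)).
        rng = np.random.default_rng(seed)
        theta = theta_init.copy()
        samples = []
        for _ in range(n_samples):
            noise = rng.normal(size=theta.shape)
            theta = theta - step * grad_U(theta) + np.sqrt(2.0 * step) * noise
            samples.append(theta.copy())
        return samples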

  7. BNNs: Predictions Predict by using an average of models: p(y | x, D) ≈ (1/S) ∑_s p(y | x, θ_s), with θ_1, θ_2, θ_3, ... ∼ p(θ | D), instead of plugging in the single point estimate θ_SGD (MAP). In this talk: a model is good if it predicts well (e.g. low cross-entropy loss)
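
A minimal sketch of the ensemble prediction and of the "predicts well" criterion. Here predict_probs is a hypothetical function returning the per-class probabilities p(y | x, θ_s) of a single sampled model; it stands in for a forward pass of the actual network.

    import numpy as np

    def posterior_predictive(predict_probs, samples, x):
        # p(y | x, D) is approximated by averaging the predictive distributions
        # of the sampled models: (1/S) * sum_s p(y | x, theta_s)
        probs = np.stack([predict_probs(theta, x) for theta in samples])
        return probs.mean(axis=0)

    def cross_entropy(probs, y_true):
        # test cross entropy of the averaged prediction (lower is better)
        return -np.log(probs[y_true])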

  8. Bayesian Neural Networks (BNNs) Promises of BNNs*: • Robustness in generalization • Better uncertainty quantification (calibration) • Enables new deep learning applications (continual learning, sequential decision making, …) *[e.g., Neal 1995, Gal et al. 2016, Wilson 2019, Ovadia et al. 2019]

  9. Bayesian Neural Networks (BNNs) But in practice BNNs are rarely used!

  10. Bayesian Neural Networks (BNNs) In practice: • Often, the Bayes posterior is worse than SGD point estimates • But Bayes predictions can be improved by the use of the Cold Posterior* p_T(θ | D) ∝ exp(−U(θ)/T). For temperature T < 1: we sharpen the posterior (over-count evidence) *Explicitly (or implicitly) used by most recent Bayesian DL papers [e.g., Li et al. 2016, Zhang et al. 2020, Ashukha et al. 2020]

  11. Bayesian Neural Networks (BNNs) [Figure: density over θ, with the cold posterior shown as a sharpened version of the Bayes posterior] Cold Posterior: for temperature T < 1 we sharpen the posterior (over-count evidence)
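
Operationally, "cooling" only rescales the energy: the sampler targets exp(−U(θ)/T) instead of exp(−U(θ)). A minimal Langevin-style update mirroring the earlier sketch; the exact SG-MCMC scheme of the paper differs, so this is illustrative only.

    import numpy as np

    def cold_langevin_step(theta, grad_U, step, T, rng):
        # Targets p_T(theta | D) proportional to exp(-U(theta) / T):
        # T = 1 is the Bayes posterior, T < 1 sharpens it (over-counts the evidence).
        noise = rng.normal(size=theta.shape)
        return theta - step * grad_U(theta) / T + np.sqrt(2.0 * step) * noise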

  12. [Figure: test performance as a function of temperature T for ResNet-20 / CIFAR-10 and CNN-LSTM / IMDB; the optimal cold posterior (T < 1) outperforms the true Bayes posterior (T = 1)]
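
That figure summarizes a temperature sweep. A minimal sketch of the experimental loop is below; both sample_at_temperature and evaluate_test_metric are hypothetical placeholders for the paper's actual training and evaluation pipeline.

    def temperature_sweep(sample_at_temperature, evaluate_test_metric, temperatures):
        # For each T, sample from the tempered posterior exp(-U(theta)/T),
        # form the ensemble prediction, and record a test metric (e.g. cross entropy).
        results = {T: evaluate_test_metric(sample_at_temperature(T)) for T in temperatures}
        best_T = min(results, key=results.get)  # optimal cold posterior temperature
        return results, best_T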

  13. The cold posterior sharply deviates from the Bayesian paradigm. What is the use of more accurate posterior approximations if the posterior is poor?

  14. Our paper: Hypotheses for the origin of the improved performance of cold posteriors • Inference: Inaccurate SDE simulation? Bias of SG-MCMC? Minibatch noise (which is not Gaussian)? Bias-variance tradeoff induced by the cold posterior? • Likelihood: Dirty likelihoods (batch normalization, dropout, data augmentation)? • Prior: Current priors used for BNN parameters are poor? The effect becomes stronger with increasing model depth and capacity?

