Bayesian Neural Networks from a Gaussian Process Perspective
Andrew Gordon Wilson
https://cims.nyu.edu/~andrewgw
Courant Institute of Mathematical Sciences, Center for Data Science
New York University
Gaussian Process Summer School, September 16, 2020
Last Time... Machine Learning for Econometrics (The Start of My Journey...)
Autoregressive Conditional Heteroscedasticity (ARCH), 2003 Nobel Prize in Economics:
y(t) = N(y(t); 0, a_0 + a_1 y(t−1)^2)
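As a minimal illustration of the ARCH(1) model above, the sketch below simulates a series whose conditional variance depends on the previous observation. The coefficient values a_0 = 0.2 and a_1 = 0.7 are arbitrary choices for illustration, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# ARCH(1): y(t) ~ N(0, a0 + a1 * y(t-1)^2)
a0, a1 = 0.2, 0.7   # illustrative coefficients (assumed, not from the slides)
T = 500
y = np.zeros(T)
for t in range(1, T):
    var_t = a0 + a1 * y[t - 1] ** 2   # conditional variance from the previous value
    y[t] = rng.normal(0.0, np.sqrt(var_t))

print("sample variance:", y.var())   # the series exhibits volatility clustering
```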
Autoregressive Conditional Heteroscedasticity (ARCH), 2003 Nobel Prize in Economics:
y(t) = N(y(t); 0, a_0 + a_1 y(t−1)^2)
Gaussian Copula Process Volatility (GCPV) (My First PhD Project):
y(x) = N(y(x); 0, f(x)^2), f(x) ∼ GP(m(x), k(x, x′))
◮ Can approximate a much greater range of variance functions
◮ Operates on continuous inputs x
◮ Can effortlessly handle missing data
◮ Can effortlessly accommodate multivariate inputs x (covariates other than time)
◮ Observation: performance is extremely sensitive to even small changes in kernel hyperparameters
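A minimal sketch of drawing a sample from the GCPV-style prior above: draw f from a GP with an RBF kernel on a grid of inputs, then draw y(x) ~ N(0, f(x)^2). The kernel choice and lengthscale are illustrative assumptions, not the settings used in the original GCPV work.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_kernel(x, lengthscale=1.0, variance=1.0):
    # k(x, x') = variance * exp(-0.5 * (x - x')^2 / lengthscale^2)
    d = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

x = np.linspace(0, 10, 200)
K = rbf_kernel(x, lengthscale=1.5) + 1e-8 * np.eye(len(x))  # jitter for stability

# f(x) ~ GP(0, k), then y(x) | f(x) ~ N(0, f(x)^2)
f = rng.multivariate_normal(np.zeros(len(x)), K)
y = rng.normal(0.0, np.abs(f))   # standard deviation |f(x)|

print(y[:5])
```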
Heteroscedasticity revisited...
Which of these models do you prefer, and why?
Choice 1: y(x) | f(x), g(x) ∼ N(y(x); f(x), g(x)^2), with f(x) ∼ GP, g(x) ∼ GP
Choice 2: y(x) | f(x), g(x) ∼ N(y(x); f(x) g(x), g(x)^2), with f(x) ∼ GP, g(x) ∼ GP
Some conclusions...
◮ Flexibility isn’t the whole story; inductive biases are at least as important.
◮ Degenerate model specification can be helpful, rather than something to necessarily avoid.
◮ Asymptotic results often mean very little. Rates of convergence, or even intuitions about non-asymptotic behaviour, are more meaningful.
◮ Infinite models (models with unbounded capacity) are almost always desirable, but the details matter.
◮ Releasing good code is crucial.
◮ Try to keep the approach as simple as possible.
◮ Empirical results often provide the most effective argument.
Model Selection
[Figure: monthly airline passenger counts (in thousands), 1949–1961.]
Which model should we choose?
(1): f_1(x) = w_0 + w_1 x
(2): f_2(x) = Σ_{j=0}^{3} w_j x^j
(3): f_3(x) = Σ_{j=0}^{10^4} w_j x^j
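One Bayesian way to answer this question is to compare marginal likelihoods. The sketch below computes the log evidence of polynomial regressors of different orders under a Gaussian prior on the weights and Gaussian observation noise; the prior scale, noise level, toy data, and the orders compared are illustrative assumptions, not the airline experiment itself.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence(x, y, order, prior_var=1.0, noise_var=0.1):
    """Log marginal likelihood of y under f(x) = sum_j w_j x^j, w_j ~ N(0, prior_var)."""
    Phi = np.vander(x, N=order + 1, increasing=True)        # design matrix [1, x, x^2, ...]
    cov = prior_var * Phi @ Phi.T + noise_var * np.eye(len(x))
    return multivariate_normal(mean=np.zeros(len(x)), cov=cov).logpdf(y)

# Toy data with a smooth trend (a stand-in, not the airline series)
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = 1.0 + 0.5 * x + 0.3 * x**3 + 0.1 * rng.normal(size=x.shape)

for order in [1, 3, 10]:
    print(f"order {order:2d}: log evidence = {log_evidence(x, y, order):.2f}")
```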
A Function-Space View
Consider the simple linear model,
f(x) = w_0 + w_1 x,          (1)
w_0, w_1 ∼ N(0, 1).          (2)
[Figure: sample functions f(x) drawn from this prior, plotted over inputs x ∈ [−10, 10].]
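A minimal sketch of the figure above: draw weights from the prior in Eq. (2) and plot the corresponding functions from Eq. (1).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 200)

# Each draw of (w0, w1) ~ N(0, 1) gives one function f(x) = w0 + w1 * x
for _ in range(10):
    w0, w1 = rng.normal(size=2)
    plt.plot(x, w0 + w1 * x, alpha=0.6)

plt.xlabel("Input, x")
plt.ylabel("Output, f(x)")
plt.show()
```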
Model Construction and Generalization
[Figure: marginal likelihood p(D|M) across datasets (Corrupted CIFAR-10, MNIST, CIFAR-10, structured image datasets) for three models: a well-specified model with calibrated inductive biases (e.g., a CNN), a simple model with poor inductive biases (e.g., a linear function), and a complex model with poor inductive biases (e.g., an MLP).]
How do we learn?
◮ The ability of a system to learn is determined by its support (which solutions are a priori possible) and its inductive biases (which solutions are a priori likely).
◮ We should not conflate flexibility and complexity.
◮ An influx of new massive datasets provides great opportunities to automatically learn rich statistical structure, leading to new scientific discoveries.
Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson and Izmailov, 2020. arXiv:2002.08791.
What is Bayesian learning?
◮ The key distinguishing property of a Bayesian approach is marginalization instead of optimization.
◮ Rather than use a single setting of parameters w, use all settings weighted by their posterior probabilities in a Bayesian model average.
Why Bayesian Deep Learning?
Recall the Bayesian model average (BMA):
p(y | x*, D) = ∫ p(y | x*, w) p(w | D) dw.          (3)
◮ Think of each setting of w as a different model. Eq. (3) is a Bayesian model average over models weighted by their posterior probabilities.
◮ Represents epistemic uncertainty over which f(x, w) fits the data.
◮ Can view classical training as using an approximate posterior q(w | y, X) = δ(w = w_MAP).
◮ The posterior p(w | D) (or loss L = − log p(w | D)) for neural networks is extraordinarily complex, containing many complementary solutions, which is why the BMA is especially significant in deep learning.
◮ Understanding the structure of neural network loss landscapes is crucial for better estimating the BMA.
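In practice the integral in Eq. (3) is approximated by Monte Carlo: sample weights from an (approximate) posterior and average the predictive distributions. A minimal sketch of the mechanics, where the toy model, the "MAP" weights, and the stand-in posterior samples are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    """Toy 'network': logistic regression probabilities for a 2-class problem."""
    logits = x @ w
    p1 = 1.0 / (1.0 + np.exp(-logits))
    return np.stack([1.0 - p1, p1], axis=-1)

# Stand-in for posterior samples w_m ~ p(w | D): draws from a Gaussian centred
# at a hypothetical MAP solution (illustrative, not a real posterior).
w_map = np.array([1.5, -0.7])
posterior_samples = w_map + 0.3 * rng.normal(size=(100, 2))

x_star = np.array([[0.2, 1.0]])

# Monte Carlo BMA: p(y | x*, D) ≈ (1/M) Σ_m p(y | x*, w_m)
bma = np.mean([f(x_star, w) for w in posterior_samples], axis=0)
map_pred = f(x_star, w_map)

print("MAP prediction:", map_pred)
print("BMA prediction:", bma)   # typically less confident than the MAP prediction
```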
Mode Connectivity
Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, A. G. Wilson. NeurIPS 2018.
Loss landscape figures in collaboration with Javier Ideami (losslandscape.com).
Better Marginalization
p(y | x*, D) = ∫ p(y | x*, w) p(w | D) dw.          (4)
◮ MultiSWAG forms a Gaussian mixture posterior from multiple independent SWAG solutions.
◮ Like deep ensembles, MultiSWAG incorporates multiple basins of attraction in the model average, but it additionally marginalizes within basins of attraction for a better approximation to the BMA.
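A minimal sketch of the idea, not the authors' implementation: each "SWAG" run below keeps only a diagonal Gaussian over SGD iterates (full SWAG also keeps a low-rank covariance), and MultiSWAG-style prediction averages over weight samples from several independent runs. The data, architecture, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector, vector_to_parameters

torch.manual_seed(0)

# Synthetic two-class data (a stand-in for a real dataset)
X = torch.randn(512, 2)
y = (X[:, 0] * X[:, 1] > 0).long()

def make_net():
    return nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))

def swag_diag(seed, epochs=200, collect_from=100, lr=0.05):
    """One diagonal-SWAG run: SGD with a constant learning rate, collecting first
    and second moments of the iterates to form a Gaussian over the weights."""
    torch.manual_seed(seed)
    net = make_net()
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    mean = sq_mean = None
    n = 0
    for epoch in range(epochs):
        opt.zero_grad()
        loss_fn(net(X), y).backward()
        opt.step()
        if epoch >= collect_from:
            w = parameters_to_vector(net.parameters()).detach()
            mean = w.clone() if mean is None else mean + w
            sq_mean = w**2 if sq_mean is None else sq_mean + w**2
            n += 1
    mean, sq_mean = mean / n, sq_mean / n
    var = (sq_mean - mean**2).clamp(min=1e-6)
    return net, mean, var

def multiswag_predict(x, runs, samples_per_run=20):
    """Average predictions over weight samples from each SWAG Gaussian:
    a mixture-of-Gaussians approximation to the BMA over several basins."""
    probs = []
    for net, mean, var in runs:
        for _ in range(samples_per_run):
            w = mean + var.sqrt() * torch.randn_like(mean)
            vector_to_parameters(w, net.parameters())
            with torch.no_grad():
                probs.append(torch.softmax(net(x), dim=-1))
    return torch.stack(probs).mean(0)

runs = [swag_diag(seed) for seed in range(3)]      # three independent basins
x_test = torch.tensor([[1.0, 1.0], [1.0, -1.0]])
print(multiswag_predict(x_test, runs))
```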
Better Marginalization: MultiSWAG
[1] Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. Ovadia et al., 2019.
[2] Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson and Izmailov, 2020.
Double Descent
[Figure reproduced from Belkin et al. (2018).]
Reconciling modern machine learning practice and the bias-variance trade-off. Belkin et al., 2018.
Double Descent
Should a Bayesian model experience double descent?
Bayesian Model Averaging Alleviates Double Descent
[Figure: test error (%) vs. ResNet-18 width on CIFAR-100 with 20% label corruption, comparing SGD, SWAG, and MultiSWAG.]
Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.
Neural Network Priors
A parameter prior p(w) = N(0, α^2 I) combined with a neural network architecture f(x, w) induces a structured distribution over functions p(f(x)).
Deep Image Prior
◮ Randomly initialized CNNs without training provide excellent performance for image denoising, super-resolution, and inpainting: a sample function from p(f(x)) captures low-level image statistics, before any training.
Random Network Features
◮ Pre-processing CIFAR-10 with a randomly initialized, untrained CNN dramatically improves the test performance of a Gaussian kernel on pixels, from 54% accuracy to 71%, with an additional 2% from ℓ2 regularization.
[1] Deep Image Prior. Ulyanov, D., Vedaldi, A., Lempitsky, V. CVPR 2018.
[2] Understanding Deep Learning Requires Rethinking Generalization. Zhang et al., ICLR 2017.
[3] Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.
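A rough sketch of the random-network-features idea, assuming a small untrained CNN and a heavy subsample of CIFAR-10 so it runs quickly. The architecture and sample sizes are illustrative, and the accuracies will not match the full-scale 54% and 71% numbers quoted above.

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision import datasets, transforms
from sklearn.svm import SVC

torch.manual_seed(0)

# Randomly initialized, untrained CNN used purely as a feature map
random_cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(2), nn.Flatten(),
)

data = datasets.CIFAR10(root="./data", train=True, download=True,
                        transform=transforms.ToTensor())
# Small subsample so the sketch runs quickly
idx = np.random.default_rng(0).choice(len(data), 2000, replace=False)
X = torch.stack([data[i][0] for i in idx])
y = np.array([data[i][1] for i in idx])

with torch.no_grad():
    feats = random_cnn(X).numpy()
pixels = X.reshape(len(X), -1).numpy()

for name, inputs in [("raw pixels", pixels), ("random CNN features", feats)]:
    clf = SVC(kernel="rbf").fit(inputs[:1500], y[:1500])
    print(name, "accuracy:", clf.score(inputs[1500:], y[1500:]))
```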
Tempered Posteriors
In Bayesian deep learning it is typical to consider the tempered posterior
p_T(w | D) = (1 / Z(T)) p(D | w)^{1/T} p(w),          (5)
where T is a temperature parameter, and Z(T) is the normalizing constant corresponding to temperature T. The temperature parameter controls how the prior and likelihood interact in the posterior:
◮ T < 1 corresponds to cold posteriors, where the posterior distribution is more concentrated around solutions with high likelihood.
◮ T = 1 corresponds to the standard Bayesian posterior distribution.
◮ T > 1 corresponds to warm posteriors, where the prior effect is stronger and the posterior concentrates more slowly.
E.g.: The safe Bayesian. Grünwald, P. COLT 2012.
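A minimal sketch of what tempering does in a case where everything is available in closed form: a Gaussian likelihood with known noise and a Gaussian prior on the mean. Raising the likelihood to the power 1/T is equivalent to scaling the noise variance by T, so the tempered posterior stays Gaussian, and we can see directly how T < 1 sharpens it and T > 1 flattens it. The numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: y_i ~ N(w_true, sigma^2); prior on the mean: w ~ N(0, alpha^2)
w_true, sigma, alpha, n = 1.0, 1.0, 1.0, 20
y = rng.normal(w_true, sigma, size=n)

def tempered_posterior(y, T, sigma=1.0, alpha=1.0):
    """Tempered posterior p_T(w | D) ∝ p(D | w)^{1/T} p(w).
    For a Gaussian likelihood, tempering is equivalent to inflating the noise
    variance from sigma^2 to T * sigma^2, so the posterior is still Gaussian."""
    precision = len(y) / (T * sigma**2) + 1.0 / alpha**2
    mean = (y.sum() / (T * sigma**2)) / precision
    return mean, 1.0 / precision   # posterior mean and variance

for T in [0.1, 1.0, 10.0]:
    mean, var = tempered_posterior(y, T)
    print(f"T = {T:5.1f}: posterior mean = {mean:.3f}, posterior std = {np.sqrt(var):.3f}")
```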
Cold Posteriors
Wenzel et al. (2020) highlight the result that, for p(w) = N(0, I), cold posteriors with T < 1 often provide improved performance.
How good is the Bayes posterior in deep neural networks really? Wenzel et al., ICML 2020.
Prior Misspecification?
They suggest the result is due to prior misspecification, showing that sample functions from p(f(x)) tend to assign almost all CIFAR-10 images to a single class.
Changing the prior variance scale α
Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.
The effect of data on the posterior
[Figure: predictive class probabilities over 10 classes for (a) the prior (α = √10), and the posterior after observing (b) 10, (c) 100, and (d) 1000 datapoints.]
Neural Networks from a Gaussian Process Perspective
From a Gaussian process perspective, what properties of the prior over functions induced by a Bayesian neural network might you check to see if it seems reasonable?
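One concrete check, sketched below under illustrative assumptions (a small tanh MLP, 1D inputs, a few prior scales α): sample weights from p(w) = N(0, α²I), look at the induced sample functions, and estimate the implied covariance between function values at pairs of inputs, much as one would inspect a GP prior's kernel and output scale.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)

def sample_prior_functions(alpha, n_samples=500, width=100):
    """Draw functions f(x, w) with w ~ N(0, alpha^2 I) on a 1D grid of inputs."""
    x = torch.linspace(-3, 3, 50).unsqueeze(-1)
    fs = []
    with torch.no_grad():
        for _ in range(n_samples):
            net = nn.Sequential(nn.Linear(1, width), nn.Tanh(), nn.Linear(width, 1))
            for p in net.parameters():
                p.normal_(0.0, alpha)            # prior p(w) = N(0, alpha^2 I)
            fs.append(net(x).squeeze(-1))
    return x.squeeze(-1), torch.stack(fs)        # grid, (n_samples x grid) functions

for alpha in [0.1, 1.0, 3.0]:
    x, F = sample_prior_functions(alpha)
    K = np.cov(F.numpy(), rowvar=False)          # empirical prior covariance over f(x)
    # Correlation between f at two nearby inputs vs. two distant inputs
    near = K[24, 25] / np.sqrt(K[24, 24] * K[25, 25])
    far = K[0, 49] / np.sqrt(K[0, 0] * K[49, 49])
    print(f"alpha = {alpha}: sd of f(0) = {np.sqrt(K[25, 25]):.2f}, "
          f"corr(near) = {near:.2f}, corr(far) = {far:.2f}")
```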