Function Space Priors in Bayesian Deep Learning
Roger Grosse
Motivation
• Today, Bayesian deep learning is most often tested on:
  • regularization (Bayesian Occam's razor, description-length regularization)
  • smoothing the predictions
  • calibration and confidence intervals
  • novelty and out-of-distribution detection
  • noise to encourage exploration in RL
• But all of these have competitive non-Bayesian approaches.
The Three X's
• Explanation
• Exploration
• Extrapolation
Compositional GP Kernels
• Gaussian processes are distributions over functions, specified by kernels.
• Primitive kernels: SE, Per, Lin, RQ
• Composite kernels: Lin × Lin, SE × Per, Lin + Per, Lin × Per
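As a concrete illustration (not from the talk), here is a minimal NumPy sketch of the four primitive kernels and the two composition rules; the hyperparameter values are arbitrary placeholders.

```python
import numpy as np

def se(x, y, lengthscale=1.0):
    """Squared exponential (SE): smooth local variation."""
    return np.exp(-0.5 * (x - y) ** 2 / lengthscale ** 2)

def per(x, y, period=1.0, lengthscale=1.0):
    """Periodic (Per): repeating structure."""
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x - y) / period) ** 2
                  / lengthscale ** 2)

def lin(x, y, c=0.0):
    """Linear (Lin): linear trends."""
    return (x - c) * (y - c)

def rq(x, y, lengthscale=1.0, alpha=1.0):
    """Rational quadratic (RQ): variation at multiple scales."""
    return (1.0 + 0.5 * (x - y) ** 2 / (alpha * lengthscale ** 2)) ** (-alpha)

# Sums and products of kernels are kernels, so structures compose:
lin_times_per = lambda x, y: lin(x, y) * per(x, y)  # periodicity with growing amplitude
lin_plus_per  = lambda x, y: lin(x, y) + per(x, y)  # periodicity around a linear trend
```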
Automatic Statistician - Duvenaud et al., 2013, “Structure discovery in nonparametric regression through compositional kernel search”
- Lloyd et al., 2014, “Automatic construction and natural-language description of nonparametric regression models”
Structured Priors and Deep Learning
• This demonstrates the power and flexibility of function space priors.
• Problems:
  • Requires a discrete search over the space of kernel structures (tries thousands of candidate models to analyze a single dataset); see the sketch after this list.
  • Need to re-fit the kernel hyperparameters for each candidate structure.
• Can Bayesian deep learning discover and exploit structured function space priors?
  • Discover: the Neural Kernel Network learns compositional kernels.
  • Exploit: the functional variational BNN performs variational inference in function space.
• Caveat: we haven't yet figured out how to do both simultaneously.
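To make the cost of the discrete search concrete, here is a hedged sketch of the greedy structure search in the style of Duvenaud et al. (2013); this is not the authors' code, and `fit_hypers` and `bic` are hypothetical helpers standing in for GP marginal-likelihood fitting and a BIC-style model score.

```python
BASE = ["SE", "Per", "Lin", "RQ"]

def expand(structure):
    """All one-step expansions: add or multiply in a base kernel."""
    for b in BASE:
        yield ("+", structure, b)
        yield ("*", structure, b)

def greedy_search(data, depth=3):
    # Start from the best-scoring primitive kernel.
    best = min(BASE, key=lambda s: bic(fit_hypers(s, data), data))
    for _ in range(depth):
        candidates = list(expand(best))
        # Every candidate's hyperparameters must be re-fit from scratch --
        # this is why the search evaluates thousands of models per dataset.
        best = min(candidates + [best],
                   key=lambda s: bic(fit_hypers(s, data), data))
    return best
```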
Differentiable Compositional Kernel Learning for Gaussian Processes
Shengyang Sun, Guodong Zhang, Chaoqi Wang, Wenyuan Zeng, Jiaman Li
ICML 2018
Neural Kernel Network
• Neural Kernel Network (NKN): a neural net architecture that takes two input locations and computes the kernel between them.
• Layers are defined using the composition rules, so every unit corresponds to a valid kernel.
• It can represent the same compositional structures as the Automatic Statistician, but is end-to-end differentiable; a sketch of the layer structure follows below.
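The following is a minimal sketch of the NKN idea (Sun et al., 2018), not the paper's implementation. Each unit holds a kernel value k(x, x′) for one input pair; Linear layers take nonnegative combinations (sums of kernels with nonnegative weights are kernels) and Product layers multiply pairs of units (products of kernels are kernels), so every unit stays a valid kernel. The even-width assumption and the weight clipping are simplifications; the paper enforces positivity with a smooth transform.

```python
import numpy as np

def nkn_forward(primitive_vals, linear_weights):
    """primitive_vals: shape (P,), the primitive kernel values k_i(x, x')
    for one input pair. linear_weights: list of weight matrices, each with
    an even number of output rows so the product layer can pair units."""
    h = primitive_vals
    for W in linear_weights:
        h = np.maximum(W, 0.0) @ h   # nonnegative mixing: units remain kernels
        h = h[0::2] * h[1::2]        # product layer: pairwise products of units
    return h.sum()                   # final unit: the learned composite kernel value
```

Because every operation above is differentiable in the weights, the kernel structure can be trained by gradient ascent on the GP marginal likelihood instead of discrete search.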
Learning Flexible GP Kernels
• Extrapolates time series datasets similarly to the Automatic Statistician.
• Runs in minutes rather than hours (runtime in seconds):

                         Airline   Mauna   Solar
  Automatic Statistician    6147   51065   37716
  NKN                        201     576     962
Learning Flexible GP Kernels
• Extrapolating 2-D patterns.
[Figure: ground truth, observation, and predictions from the spectral mixture kernel (10 components) and the NKN.]
Structured Kernels for Bayes Opt
• Structured kernels can help BayesOpt search much faster.
• E.g., if a function is additive, i.e. $f(x_1, \ldots, x_N) = f_1(x_1) + \cdots + f_N(x_N)$, then the search is linear rather than exponential in the number of dimensions (e.g. Kandasamy et al., 2015).
• BayesOpt with an NKN kernel can learn to make use of additive structure when it exists; see the sketch below.
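A brief sketch of why additivity shows up as kernel structure: a sum of kernels, each acting on one coordinate, is exactly the covariance of a sum of independent one-dimensional GPs. The per-coordinate SE form and lengthscales here are illustrative choices, not the talk's.

```python
import numpy as np

def additive_kernel(x, y, lengthscales):
    """x, y: 1-D arrays of length N; one SE kernel per coordinate.
    Corresponds to f(x) = f_1(x_1) + ... + f_N(x_N) with independent GP parts."""
    return sum(np.exp(-0.5 * (x[i] - y[i]) ** 2 / ell ** 2)
               for i, ell in enumerate(lengthscales))
```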
Structured Kernels for Bayes Opt
• Note: Bayesian neural nets don't achieve this by default.
• Even though they're good at representing additive functions, they don't seem to have the corresponding inductive bias.
[Figure: BayesOpt on a 10-D benchmark function; function value vs. cost for several methods.]
Functional Variational BNNs
Guodong Zhang, Jiaxin Shi, Shengyang Sun
ICLR 2019
Functional variational BNNs
• Define a stochastic process prior (e.g. a GP).
• Goal: train a generator network to produce functions as close as possible to the stochastic process posterior; a simplified sketch of the objective follows below.
• The stochastic weights and units are shared between all input locations. Hence, even the stochastic units represent epistemic, not aleatoric, uncertainty.
[Figure: generator network mapping inputs x_1, x_2 to outputs y.]
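The following is a heavily simplified sketch of the functional ELBO of Sun et al. (2019): sample "measurement points", draw function values from the generator, and compare them against the GP prior at those points. Several pieces are assumptions for illustration: `sample_f` (a draw of function values from the generator) and `gp_prior_cov` (the prior covariance matrix at given inputs) are hypothetical callables, the measurement sampling range is arbitrary, and the intractable KL term is moment-matched to a Gaussian here, whereas the paper uses a spectral Stein gradient estimator. Constants and data-size scaling are dropped.

```python
import numpy as np

def functional_elbo(sample_f, gp_prior_cov, x_train, y_train, noise_var=0.1,
                    n_measure=20, n_samples=64, rng=np.random.default_rng(0)):
    # Measurement set: training inputs plus random points (an assumption
    # about the sampling scheme; the paper's set includes both kinds).
    x_m = np.concatenate([x_train, rng.uniform(-5, 5, n_measure)])
    F = np.stack([sample_f(x_m) for _ in range(n_samples)])  # (S, M) function draws

    # Expected log-likelihood on the training slice (Gaussian noise, up to constants).
    f_train = F[:, :len(x_train)]
    ell = -0.5 * np.mean((f_train - y_train) ** 2) / noise_var

    # KL(q(f_X) || p(f_X)) with q moment-matched to a Gaussian -- a stated
    # simplification; assumes a zero-mean GP prior.
    jitter = 1e-6 * np.eye(len(x_m))
    mu = F.mean(0)
    Sq = np.cov(F, rowvar=False) + jitter
    Sp = gp_prior_cov(x_m) + jitter
    Sp_inv = np.linalg.inv(Sp)
    logdet_p = np.linalg.slogdet(Sp)[1]
    logdet_q = np.linalg.slogdet(Sq)[1]
    kl = 0.5 * (np.trace(Sp_inv @ Sq) + mu @ Sp_inv @ mu
                - len(x_m) + logdet_p - logdet_q)
    return ell - kl
```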