Function Space Priors in Bayesian Deep Learning
Roger Grosse
Motivation
• Today, Bayesian deep learning is most often tested on:
  • regularization (Bayesian Occam's razor, description-length regularization)
  • smoothing the predictions
  • calibration and confidence intervals
  • novelty and out-of-distribution detection
  • noise to encourage exploration in RL
• But all of these have competitive non-Bayesian approaches.
The Three X's
• Explanation
• Exploration
• Extrapolation
Compositional GP Kernels
• Gaussian processes are distributions over functions, specified by kernels.
• Primitive kernels: SE, Per, Lin, RQ
• Composite kernels: Lin × Lin, SE × Per, Lin + Per, Lin × Per
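As a concrete illustration (not from the talk), here is a minimal NumPy sketch of the four primitive kernels and the two composition rules; the hyperparameter values are arbitrary placeholders.

```python
import numpy as np

def se(x, y, lengthscale=1.0):
    """Squared exponential (SE): smooth local variation."""
    return np.exp(-0.5 * (x - y) ** 2 / lengthscale ** 2)

def per(x, y, period=1.0, lengthscale=1.0):
    """Periodic (Per): repeating structure."""
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x - y) / period) ** 2
                  / lengthscale ** 2)

def lin(x, y, c=0.0):
    """Linear (Lin): linear trends."""
    return (x - c) * (y - c)

def rq(x, y, lengthscale=1.0, alpha=1.0):
    """Rational quadratic (RQ): variation at multiple scales."""
    return (1.0 + 0.5 * (x - y) ** 2 / (alpha * lengthscale ** 2)) ** (-alpha)

# Sums and products of kernels are kernels, so structures compose:
lin_times_per = lambda x, y: lin(x, y) * per(x, y)  # periodicity with growing amplitude
lin_plus_per  = lambda x, y: lin(x, y) + per(x, y)  # periodicity around a linear trend
```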
Automatic Statistician - Duvenaud et al., 2013, “Structure discovery in nonparametric regression through compositional kernel search”
- Lloyd et al., 2014, “Automatic construction and natural-language description of nonparametric regression models”
Structured Priors and Deep Learning
• This demonstrates the power and flexibility of function space priors.
• Problems:
  • Requires a discrete search over the space of kernel structures (tries thousands of candidate models to analyze a single dataset); see the sketch after this list.
  • Need to re-fit the kernel hyperparameters for each candidate structure.
• Can Bayesian deep learning discover and exploit structured function space priors?
  • Discover: the Neural Kernel Network learns compositional kernels.
  • Exploit: the functional variational BNN performs variational inference in function space.
• Caveat: we haven't yet figured out how to do both simultaneously.
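To make the cost of the discrete search concrete, here is a hedged sketch of the greedy structure search in the style of Duvenaud et al. (2013); this is not the authors' code, and `fit_hypers` and `bic` are hypothetical helpers standing in for GP marginal-likelihood fitting and a BIC-style model score.

```python
BASE = ["SE", "Per", "Lin", "RQ"]

def expand(structure):
    """All one-step expansions: add or multiply in a base kernel."""
    for b in BASE:
        yield ("+", structure, b)
        yield ("*", structure, b)

def greedy_search(data, depth=3):
    # Start from the best-scoring primitive kernel.
    best = min(BASE, key=lambda s: bic(fit_hypers(s, data), data))
    for _ in range(depth):
        candidates = list(expand(best))
        # Every candidate's hyperparameters must be re-fit from scratch --
        # this is why the search evaluates thousands of models per dataset.
        best = min(candidates + [best],
                   key=lambda s: bic(fit_hypers(s, data), data))
    return best
```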
Differentiable Compositional Kernel Learning for Gaussian Processes
Shengyang Sun, Guodong Zhang, Chaoqi Wang, Wenyuan Zeng, Jiaman Li
ICML 2018
Neural Kernel Network
• Neural Kernel Network (NKN): a neural net architecture that takes two input locations and computes the kernel between them.
• Layers are defined using the composition rules, so every unit corresponds to a valid kernel.
• It can represent the same compositional structures as the Automatic Statistician, but is end-to-end differentiable; a sketch of the layer structure follows below.
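The following is a minimal sketch of the NKN idea (Sun et al., 2018), not the paper's implementation. Each unit holds a kernel value k(x, x′) for one input pair; Linear layers take nonnegative combinations (sums of kernels with nonnegative weights are kernels) and Product layers multiply pairs of units (products of kernels are kernels), so every unit stays a valid kernel. The even-width assumption and the weight clipping are simplifications; the paper enforces positivity with a smooth transform.

```python
import numpy as np

def nkn_forward(primitive_vals, linear_weights):
    """primitive_vals: shape (P,), the primitive kernel values k_i(x, x')
    for one input pair. linear_weights: list of weight matrices, each with
    an even number of output rows so the product layer can pair units."""
    h = primitive_vals
    for W in linear_weights:
        h = np.maximum(W, 0.0) @ h   # nonnegative mixing: units remain kernels
        h = h[0::2] * h[1::2]        # product layer: pairwise products of units
    return h.sum()                   # final unit: the learned composite kernel value
```

Because every operation above is differentiable in the weights, the kernel structure can be trained by gradient ascent on the GP marginal likelihood instead of discrete search.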
Learning Flexible GP Kernels
• Extrapolates time series datasets similarly to the Automatic Statistician.
• Runs in minutes rather than hours (runtime in seconds):

                         Airline   Mauna   Solar
  Automatic Statistician    6147   51065   37716
  NKN                        201     576     962
Learning Flexible GP Kernels
• Extrapolating 2-D patterns.
[Figure: ground truth, observation, and predictions from the spectral mixture kernel (10 components) and the NKN.]
Structured Kernels for Bayes Opt
• Structured kernels can help BayesOpt search much faster.
• E.g., if a function is additive, i.e. $f(x_1, \ldots, x_N) = f_1(x_1) + \cdots + f_N(x_N)$, then the search is linear rather than exponential in the number of dimensions (e.g. Kandasamy et al., 2015).
• BayesOpt with an NKN kernel can learn to make use of additive structure when it exists; see the sketch below.
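A brief sketch of why additivity shows up as kernel structure: a sum of kernels, each acting on one coordinate, is exactly the covariance of a sum of independent one-dimensional GPs. The per-coordinate SE form and lengthscales here are illustrative choices, not the talk's.

```python
import numpy as np

def additive_kernel(x, y, lengthscales):
    """x, y: 1-D arrays of length N; one SE kernel per coordinate.
    Corresponds to f(x) = f_1(x_1) + ... + f_N(x_N) with independent GP parts."""
    return sum(np.exp(-0.5 * (x[i] - y[i]) ** 2 / ell ** 2)
               for i, ell in enumerate(lengthscales))
```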
Structured Kernels for Bayes Opt
• Note: Bayesian neural nets don't achieve this by default.
• Even though they're good at representing additive functions, they don't seem to have the corresponding inductive bias.
[Figure: BayesOpt on a 10-D benchmark function; function value vs. cost for several methods.]
Functional Variational BNNs
Guodong Zhang, Jiaxin Shi, Shengyang Sun
ICLR 2019
Functional variational BNNs
• Define a stochastic process prior (e.g. a GP).
• Goal: train a generator network to produce functions as close as possible to the stochastic process posterior; a simplified sketch of the objective follows below.
• The stochastic weights and units are shared between all input locations. Hence, even the stochastic units represent epistemic, not aleatoric, uncertainty.
[Figure: generator network mapping inputs x_1, x_2 to outputs y.]
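The following is a heavily simplified sketch of the functional ELBO of Sun et al. (2019): sample "measurement points", draw function values from the generator, and compare them against the GP prior at those points. Several pieces are assumptions for illustration: `sample_f` (a draw of function values from the generator) and `gp_prior_cov` (the prior covariance matrix at given inputs) are hypothetical callables, the measurement sampling range is arbitrary, and the intractable KL term is moment-matched to a Gaussian here, whereas the paper uses a spectral Stein gradient estimator. Constants and data-size scaling are dropped.

```python
import numpy as np

def functional_elbo(sample_f, gp_prior_cov, x_train, y_train, noise_var=0.1,
                    n_measure=20, n_samples=64, rng=np.random.default_rng(0)):
    # Measurement set: training inputs plus random points (an assumption
    # about the sampling scheme; the paper's set includes both kinds).
    x_m = np.concatenate([x_train, rng.uniform(-5, 5, n_measure)])
    F = np.stack([sample_f(x_m) for _ in range(n_samples)])  # (S, M) function draws

    # Expected log-likelihood on the training slice (Gaussian noise, up to constants).
    f_train = F[:, :len(x_train)]
    ell = -0.5 * np.mean((f_train - y_train) ** 2) / noise_var

    # KL(q(f_X) || p(f_X)) with q moment-matched to a Gaussian -- a stated
    # simplification; assumes a zero-mean GP prior.
    jitter = 1e-6 * np.eye(len(x_m))
    mu = F.mean(0)
    Sq = np.cov(F, rowvar=False) + jitter
    Sp = gp_prior_cov(x_m) + jitter
    Sp_inv = np.linalg.inv(Sp)
    logdet_p = np.linalg.slogdet(Sp)[1]
    logdet_q = np.linalg.slogdet(Sq)[1]
    kl = 0.5 * (np.trace(Sp_inv @ Sq) + mu @ Sp_inv @ mu
                - len(x_m) + logdet_p - logdet_q)
    return ell - kl
```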