INTEGRATION OVER HYPERPARAMETERS AND ESTIMATION OF PREDICTIVE PERFORMANCE
Aki Vehtari
Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Finland
aki.vehtari@aalto.fi
Priors and integration for GP hyperparameters Vehtari
Outline
◮ GP hyperparameter inference
◮ Priors on GP hyperparameters
◮ Benefits of integration vs. point estimate
◮ MCMC, CCD
Gaussian processes and hyperparameters
◮ Gaussian processes are priors on function space
◮ GPs are usually constructed with a parametric covariance function
  ◮ we need to think about priors on those parameters
◮ If we have “big data” and a small number of hyperparameters
  ◮ priors and integration over the posterior are less important
  ◮ even more so when sparse approximations, which limit the complexity of the models, are used
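As an illustration of a parametric covariance function, here is a minimal sketch of a squared exponential covariance with two hyperparameters and a few draws from the resulting GP prior. The choice of covariance function, the function names, and the hyperparameter values are assumptions made for this example, not taken from the talk.

```python
import numpy as np

def sq_exp_cov(x1, x2, magnitude, lengthscale):
    """Squared exponential covariance between 1D input vectors x1 and x2."""
    d = x1[:, None] - x2[None, :]
    return magnitude**2 * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_gp_prior(x, magnitude, lengthscale, n_draws, rng, jitter=1e-6):
    """Draw n_draws functions from a zero-mean GP prior evaluated at x."""
    # A small jitter on the diagonal keeps the Cholesky factorization stable.
    K = sq_exp_cov(x, x, magnitude, lengthscale) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    z = rng.standard_normal((len(x), n_draws))
    return (L @ z).T  # shape (n_draws, len(x))

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)

# Hypothetical hyperparameter values: a short and a long lengthscale.
wiggly = sample_gp_prior(x, magnitude=1.0, lengthscale=0.1, n_draws=3, rng=rng)
smooth = sample_gp_prior(x, magnitude=1.0, lengthscale=10.0, n_draws=3, rng=rng)
```

The hyperparameters control what kind of functions the prior favours, which is why a prior on them is itself a modelling choice.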
1D demo
◮ 1D demo originally by Michael Betancourt
1D demo summary
◮ The likelihood for lengthscales beyond the data scale is flat and non-identifiable, because the functions all look the same
  ◮ add a prior making large lengthscales less likely
◮ Without repeated measurements, signal magnitude and noise magnitude are non-identifiable when the lengthscale is short
  ◮ add a prior making short lengthscales less likely
  ◮ add a prior on the measurement noise
  ◮ make repeated measurements
◮ Non-identifiability between lengthscale and magnitude
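The flat likelihood for long lengthscales can be checked numerically. The sketch below evaluates the GP log marginal likelihood on a grid of lengthscales for a small simulated data set. The data, the squared exponential covariance, and the fixed magnitude and noise values are all assumptions made for this example, not the setup of the original demo.

```python
import numpy as np

def log_marginal_likelihood(x, y, magnitude, lengthscale, noise):
    """Log marginal likelihood of a zero-mean GP with Gaussian noise."""
    d = x[:, None] - x[None, :]
    K = magnitude**2 * np.exp(-0.5 * (d / lengthscale) ** 2)
    Ky = K + noise**2 * np.eye(len(x))
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Ky^{-1} y
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()          # -0.5 * log det(Ky)
            - 0.5 * len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)                # data scale is about 1
y = x + 0.1 * rng.standard_normal(10)        # hypothetical near-linear data

lengthscales = np.array([0.1, 1.0, 10.0, 100.0, 1000.0])
lml = np.array([log_marginal_likelihood(x, y, 1.0, ell, 0.1)
                for ell in lengthscales])

# Differences between neighbouring grid points shrink as the lengthscale
# grows past the data scale: the data cannot distinguish 100 from 1000.
diffs = np.abs(np.diff(lml))
```

With a flat likelihood in that region, the posterior for the lengthscale there is determined almost entirely by the prior.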
Non-Gaussian likelihoods
◮ Poisson
  ◮ variance is equal to mean, and thus can’t overfit
  ◮ except if data is not conditionally Poisson distributed
◮ Binary classification (logit/probit)
  ◮ unbounded likelihood if separable
  ◮ separable if the lengthscale is short enough
Sparse approximations
◮ Sparse approximations limit the complexity
◮ FITC type models work only with large lengthscale
Higher dimensions
◮ Separate lengthscale for each dimension, aka ARD
  ◮ lengthscale is related to non-linearity
Toy example
◮ $f(x) = f_1(x_1) + \cdots + f_8(x_8)$, $y \sim \mathrm{N}(f, 0.3^2)$
◮ $\mathrm{Var}[f_j] = 1$ for all $j$ ⇒ all inputs equally relevant
◮ ARD-value: $\mathrm{ARD}(j) = 1/\ell_j$
[Figure: the eight component functions $f_1(x_1), \dots, f_8(x_8)$ on $x_j \in [-1, 1]$, and a plot of true relevance and optimized ARD-values against input index (averaged over 100 data realizations, n = 200)]
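To make the ARD-value concrete, the following sketch builds a squared exponential covariance with one lengthscale per input dimension and computes ARD(j) = 1/ℓ_j from the lengthscales. The function names and the random inputs are assumptions made for this example. The lengthscales here are simply set by hand, whereas in the toy example they are optimized.

```python
import numpy as np

def ard_sq_exp_cov(X1, X2, magnitude, lengthscales):
    """Squared exponential covariance with a separate lengthscale per input."""
    A = X1 / lengthscales          # rescale each input dimension
    B = X2 / lengthscales
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return magnitude**2 * np.exp(-0.5 * sq_dist)

def ard_values(lengthscales):
    """ARD-value of each input, defined as the inverse lengthscale."""
    return 1.0 / np.asarray(lengthscales, dtype=float)

rng = np.random.default_rng(2)
X = rng.uniform(-1.0, 1.0, size=(200, 8))    # n = 200 points, 8 inputs
lengthscales = np.array([0.3, 0.5, 1.0, 1.0, 2.0, 2.0, 5.0, 5.0])  # hypothetical
K = ard_sq_exp_cov(X, X, magnitude=1.0, lengthscales=lengthscales)
relevance = ard_values(lengthscales)
```

Because a long lengthscale makes the function nearly linear in that input rather than irrelevant, a small ARD-value indicates weak non-linearity and does not by itself show that the input is unimportant.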
Bayesian optimization
◮ GPs have been used too much as black boxes
◮ Bonus: use shape constrained GPs (see, e.g., Siivola et al., 2017)
Periodic covariance function
◮ If you know the period, fix it
◮ If you don’t know it, there can be serious identifiability problems unless informative priors are used
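For reference, one common form of the periodic covariance function is sketched below, with the period as an explicit hyperparameter. The exact parameterization used in the talk is not shown on the slide, so this form and the example values are assumptions.

```python
import numpy as np

def periodic_cov(x1, x2, magnitude, lengthscale, period):
    """Periodic covariance: magnitude^2 * exp(-2 sin^2(pi |d| / period) / lengthscale^2)."""
    d = np.abs(x1[:, None] - x2[None, :])
    return magnitude**2 * np.exp(
        -2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale**2
    )

# Hypothetical inputs: the period is fixed to a known value of 1.0.
x = np.array([0.0, 0.25, 1.0, 2.0])
K = periodic_cov(x, x, magnitude=1.5, lengthscale=1.0, period=1.0)
```

Points separated by a whole number of periods are perfectly correlated under this covariance, so fixing the period to a known value removes one hyperparameter from the inference.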
Parametric model plus GP
◮ For example, linear model plus GP
  ◮ with a long lengthscale the GP is like a linear model, which causes non-identifiability and problems in interpretation
◮ Same for other parametric model + GP combinations
  ◮ need more informative priors
GP plus GP
[Figure: relative number of births, 1970 to 1988, decomposed into four panels: trends (slow trend, fast non-periodic component, mean); day of week effect; seasonal effect; day of year effect, with labeled special days (New year, Valentine’s day, Leap day, April 1st, Memorial day, Independence day, Labor day, Halloween, Thanksgiving, Christmas)]
GP plus GP
◮ Identifiability problems, as different components explain the same features in the data
  ◮ use priors which “encourage” specialization of the components
Summary on priors and benefits of integration
◮ Specific prior recommendations for length scale
  ◮ inverse gamma has a sharp left tail that puts negligible mass on small length-scales, but a generous right tail, allowing for large length-scales (but still reducing non-identifiability)
  ◮ generalized inverse Gaussian has an inverse gamma left tail (if p ≤ 0) and a Gaussian right tail (avoids the identifiability issue when combined with a linear model)
◮ Specific weakly informative prior recommendations for signal and noise magnitude
  ◮ half-normals are often enough if the length-scale has an informative prior
  ◮ if information about measurement accuracy is available, use an informative prior such as gamma or scaled inverse chi-squared ($\chi^2$) for the variance
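The tail behaviour of these priors can be inspected directly from their log densities. The sketch below writes out the inverse gamma and half-normal log densities; the shape, scale, and standard deviation values are hypothetical and chosen only to show the asymmetry of the inverse gamma tails.

```python
import math
import numpy as np

def inv_gamma_logpdf(x, shape, scale):
    """Log density of the inverse gamma distribution for x > 0."""
    x = np.asarray(x, dtype=float)
    return (shape * math.log(scale) - math.lgamma(shape)
            - (shape + 1.0) * np.log(x) - scale / x)

def half_normal_logpdf(x, sd):
    """Log density of the half-normal distribution for x >= 0."""
    x = np.asarray(x, dtype=float)
    return (math.log(2.0) - 0.5 * math.log(2.0 * math.pi)
            - math.log(sd) - 0.5 * (x / sd) ** 2)

# Hypothetical lengthscale prior: inverse gamma with shape 5 and scale 5.
lengthscale_grid = np.array([0.1, 1.0, 10.0])
lengthscale_logp = inv_gamma_logpdf(lengthscale_grid, shape=5.0, scale=5.0)

# Hypothetical magnitude prior: half-normal with standard deviation 1.
magnitude_logp = half_normal_logpdf(np.array([0.0, 1.0, 3.0]), sd=1.0)
```

With these values the log density at a lengthscale of 0.1 is far lower than at 10, which is the sharp left tail and generous right tail described above.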
GPs in Stan
◮ Stan manual 2.16.0 (and later), chapter 16: http://mc-stan.org/users/documentation/index.html
  ◮ code and documentation by Rob Trangucci
  ◮ prior recommendations by Rob Trangucci, Michael Betancourt, Aki Vehtari
◮ Code examples: https://github.com/rtrangucci/gps_in_stan
  ◮ by Rob Trangucci
Hamiltonian Monte Carlo + NUTS
◮ Uses gradient information for more efficient sampling
◮ Alternating dynamic simulation and sampling of the energy level
◮ Parameters
  ◮ step size, number of steps in each simulated trajectory
◮ No-U-Turn Sampling
  ◮ adaptively selects the number of steps to improve robustness and efficiency
◮ Adaptation in Stan
  ◮ step size and step size adjustment (mass matrix) are estimated during the initial adaptation phase
◮ Demo
  ◮ https://chi-feng.github.io/mcmc-demo/app.html#RandomWalkMH,donut
  ◮ note that HMC/NUTS in this demo is not exactly the same as in Stan
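The alternation between dynamic simulation and resampling of the energy level can be sketched with plain HMC. This is a minimal illustration with a fixed step size, a fixed number of leapfrog steps, and a unit mass matrix. It is neither NUTS nor Stan's implementation, and the target distribution and tuning values are assumptions made for this example.

```python
import numpy as np

def leapfrog(q, p, grad_log_prob, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    q, p = q.copy(), p.copy()
    p += 0.5 * step_size * grad_log_prob(q)       # initial half step
    for _ in range(n_steps - 1):
        q += step_size * p
        p += step_size * grad_log_prob(q)
    q += step_size * p
    p += 0.5 * step_size * grad_log_prob(q)       # final half step
    return q, p

def hmc_sample(log_prob, grad_log_prob, q0, step_size, n_steps, n_samples, rng):
    """Plain HMC with a unit mass matrix and a Metropolis accept/reject step."""
    q = np.asarray(q0, dtype=float)
    samples = np.empty((n_samples, q.size))
    for i in range(n_samples):
        p0 = rng.standard_normal(q.size)          # resample momentum (energy level)
        q_new, p_new = leapfrog(q, p0, grad_log_prob, step_size, n_steps)
        h_old = -log_prob(q) + 0.5 * p0 @ p0
        h_new = -log_prob(q_new) + 0.5 * p_new @ p_new
        if np.log(rng.uniform()) < h_old - h_new:  # accept with prob min(1, exp(-dH))
            q = q_new
        samples[i] = q
    return samples

# Hypothetical target: a 2D standard normal distribution.
log_prob = lambda q: -0.5 * q @ q
grad_log_prob = lambda q: -q

rng = np.random.default_rng(3)
samples = hmc_sample(log_prob, grad_log_prob, q0=np.zeros(2),
                     step_size=0.2, n_steps=8, n_samples=2000, rng=rng)
```

Here the step size and number of steps were picked by hand for this particular target; choosing them automatically is what NUTS and the adaptation phase are for.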
CCD (central composite design)
◮ Deterministic placement of integration points