Primer on Bayesian Inference and Gaussian Processes

Guido Sanguinetti
School of Informatics, University of Edinburgh

Dagstuhl, March 2018
Talk outline

1. Bayesian regression
2. Gaussian Processes
3. Bayesian prediction with GPs
4. Bayesian Optimisation
The Bayesian Way

- ALL model ingredients are random variables
- A statistical framework for quantifying uncertainty when only some variables are observed
- We place prior distributions on the unobserved variables, and model the dependence of the observations on the unobserved variables
- This allows us to make inferences about the unobserved variables
Bayesian inference and predictions

- Models consist of joint probability distributions (with a structure) over observed and unobserved (latent) variables
- Unobserved variables θ have a prior distribution p(θ)
- The conditional probability of the observations y given the latents, p(y | θ), is called the likelihood
- The revised belief about the latents is the posterior

  p(θ | y) = (1/Z) p(y | θ) p(θ)

- The predictive distribution for new observations is

  p(y_new | y_old) = ∫ dθ p(y_new | θ) p(θ | y_old)

- The difficulty lies in computing the integrals
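To make the posterior and the predictive integral concrete, here is a minimal NumPy sketch for a toy conjugate model (an unknown mean θ with a Gaussian prior and Gaussian noise of known variance); the model and all numerical values are illustrative choices, not taken from the slides. In this special case the integrals are available in closed form, and the Monte Carlo estimate approximates the same predictive distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (illustrative): theta ~ N(0, tau2), y_i | theta ~ N(theta, sigma2)
tau2, sigma2 = 4.0, 1.0
theta_true = 1.5
y_old = theta_true + np.sqrt(sigma2) * rng.standard_normal(10)

# Closed-form posterior p(theta | y_old) from the conjugate Gaussian update
post_var = 1.0 / (1.0 / tau2 + len(y_old) / sigma2)
post_mean = post_var * y_old.sum() / sigma2

# Predictive p(y_new | y_old): integrate the likelihood against the posterior.
# Analytically this is N(post_mean, post_var + sigma2); the Monte Carlo
# estimate below performs the same integral by sampling.
theta_samples = post_mean + np.sqrt(post_var) * rng.standard_normal(100_000)
y_new_samples = theta_samples + np.sqrt(sigma2) * rng.standard_normal(theta_samples.size)

print("predictive mean:", y_new_samples.mean(), "vs analytic", post_mean)
print("predictive var: ", y_new_samples.var(), "vs analytic", post_var + sigma2)
```

For most models of interest no such closed form exists, which is what makes the integrals the hard part.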
Bayesian supervised (discriminative) learning

- We focus on the (discriminative) supervised learning scenario: data are input-output pairs (x, y)
- Standard assumption: inputs are noise-free, outputs are noisy observations of a function f(x):

  y ∼ P  s.t.  E[y] = f(x)

- The function f is a random function
- Simplest example: f = Σ_i w_i φ_i(x), with φ_i fixed basis functions and w_i random weights
- Consequently, the variables f(x_i) at the input points are (correlated) random variables
Important exercise

Let φ_1(x), ..., φ_N(x) be a fixed set of functions, and let f(x) = Σ_i w_i φ_i(x). If w ∼ N(0, I), compute:

1. The single-point marginal distribution of f(x)
2. The two-point marginal distribution of (f(x_1), f(x_2))
Solution (sketch)

- Obviously the distributions are Gaussians
- Obviously both distributions have mean zero
- To compute the (co)variance, take products and expectations and remember that ⟨w_i w_j⟩ = δ_ij
- Defining φ(x) = (φ_1(x), ..., φ_N(x)), we get that ⟨f(x_i) f(x_j)⟩ = φ(x_i)^T φ(x_j)
The Gram matrix

- Generalising the exercise to more than two points, we get that any finite-dimensional marginal of this process is multivariate Gaussian
- The covariance matrix of this Gaussian is obtained by evaluating a function of two variables at all possible pairs of input points
- That function is determined by the set of basis functions: k(x_i, x_j) = φ(x_i)^T φ(x_j)
- The covariance matrix is often called the Gram matrix and is (necessarily) symmetric and positive definite
- Bayesian prediction in regression is then essentially the same as computing conditionals of Gaussians (more later)
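A minimal NumPy sketch of this weight-space construction (the Gaussian-bump basis functions and the grid of input points are illustrative choices): sample f(x) = Σ_i w_i φ_i(x) with w ∼ N(0, I) many times and check that the empirical covariance of the function values matches the Gram matrix k(x_i, x_j) = φ(x_i)^T φ(x_j).

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative basis: Gaussian bumps with centres on a grid
centres = np.linspace(-3, 3, 20)

def phi(x):
    """Feature map phi(x) = (phi_1(x), ..., phi_N(x)), one row per input point."""
    x = np.atleast_1d(x)
    return np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2)

x = np.linspace(-2, 2, 7)      # input points
Phi = phi(x)                   # design matrix, shape (7, 20)
gram = Phi @ Phi.T             # K_ij = phi(x_i)^T phi(x_j)

# Sample many functions f(x) = sum_i w_i phi_i(x) with w ~ N(0, I) and
# compare their empirical covariance with the Gram matrix.
W = rng.standard_normal((100_000, len(centres)))
F = W @ Phi.T                  # each row is (f(x_1), ..., f(x_7)) for one draw of w
emp_cov = np.cov(F, rowvar=False)

print("max |empirical cov - Gram|:", np.abs(emp_cov - gram).max())
print("symmetric:", np.allclose(gram, gram.T),
      "min eigenvalue >= 0:", np.linalg.eigvalsh(gram).min() >= -1e-10)
```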
Stationary variance

- We have seen that the variance of a random combination of basis functions depends on space as Σ_i φ_i(x)²
- Given any compact set (e.g. a hypercube centred at the origin), we can find a finite set of basis functions s.t. Σ_i φ_i(x)² = const (a partition of unity, e.g. triangulations or smoother alternatives)
- We can construct a sequence of such sets which covers the whole of R^D in the limit
- Therefore, we can construct a sequence of priors which all have constant prior variance across the whole space
- Covariances would still be computed by evaluating a Gram matrix (and need not be constant)
Function space view

- The argument above shows that we can put a prior over an infinite-dimensional space of functions s.t. all finite-dimensional marginals are multivariate Gaussian
- This constructive argument, often referred to as the weight-space view, is useful for intuition but impractical
- It does demonstrate the existence of truly infinite-dimensional Gaussian processes
- Once we accept that Gaussian processes exist, we are better off proceeding along a more abstract line
GP definition

- A Gaussian Process (GP) is a stochastic process indexed by a continuous variable x s.t. all finite-dimensional marginals are multivariate Gaussian
- A GP is uniquely defined by its mean and covariance functions, denoted μ(x) and k(x, x'): for any finite set of inputs x_1, ..., x_N,

  f ∼ GP(μ, k)  ⟺  f = (f(x_1), ..., f(x_N)) ∼ N(μ, K),  with μ = (μ(x_1), ..., μ(x_N)) and K = (k(x_i, x_j))_{i,j}

- The covariance function must satisfy some conditions (Mercer's theorem): essentially, it must evaluate to a symmetric positive definite matrix for every set of input points
Covariance functions

- The covariance function encapsulates the basis functions used → it determines the type of functions which can be sampled
- The radial basis function (RBF, or squared exponential) covariance

  k(x_i, x_j) = α² exp( −(x_i − x_j)² / (2λ²) )

  corresponds to Gaussian-bump basis functions and yields smooth, bumpy samples
- The Ornstein-Uhlenbeck (OU) covariance

  k(x_i, x_j) = α² exp( −|x_i − x_j| / λ )

  yields rough paths which are nowhere differentiable
- Both RBF and OU are stationary and encode exponentially decaying correlations
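A short NumPy sketch that draws prior samples under the two covariances above; the amplitude α, lengthscale λ, grid and diagonal jitter are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, lam = 1.0, 0.5              # illustrative amplitude and lengthscale
x = np.linspace(0, 5, 200)
d = x[:, None] - x[None, :]        # pairwise differences

K_rbf = alpha**2 * np.exp(-d**2 / (2 * lam**2))   # smooth samples
K_ou  = alpha**2 * np.exp(-np.abs(d) / lam)       # rough, nowhere-differentiable samples

def sample_gp(K, n_samples=3, jitter=1e-6):
    """Draw samples from N(0, K) via a Cholesky factor of the jittered covariance."""
    L = np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
    return L @ rng.standard_normal((K.shape[0], n_samples))

smooth_paths = sample_gp(K_rbf)    # columns are functions evaluated on the grid x
rough_paths = sample_gp(K_ou)
print(smooth_paths.shape, rough_paths.shape)   # (200, 3) each
```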
More on covariance functions

- Recent years have seen much work on designing/selecting covariance functions
- One line of thought builds on the fact that convex combinations and products of covariance functions still yield a covariance function
- The Automatic Statistician project (Z. Ghahramani) combines these operations with a heuristic search to automatically select a covariance
- Another line of research constructs covariance functions out of steady-state autocorrelations of stochastic process models (work primarily by Särkkä and collaborators)
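A quick numerical illustration of these closure properties (a sketch on an arbitrary grid, reusing the RBF and OU covariances from the previous slide): the convex combination and the elementwise product of two Gram matrices remain symmetric and positive semidefinite, up to floating-point error.

```python
import numpy as np

x = np.linspace(0, 5, 50)
d = x[:, None] - x[None, :]
K_rbf = np.exp(-d**2 / (2 * 0.5**2))
K_ou  = np.exp(-np.abs(d) / 0.5)

K_sum  = 0.3 * K_rbf + 0.7 * K_ou   # convex combination of covariance functions
K_prod = K_rbf * K_ou               # elementwise (Schur) product of covariance functions

for name, K in [("sum", K_sum), ("product", K_prod)]:
    eigmin = np.linalg.eigvalsh(K).min()
    print(f"{name}: symmetric={np.allclose(K, K.T)}, min eigenvalue={eigmin:.2e}")
```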
Observing GPs

- In a regression case, we assume we have observed the function values at some input values with i.i.d. Gaussian noise of variance σ²
- What is the effect of observation noise?
- Suppose we have a Gaussian vector f ∼ N(μ, Σ), and observations y | f ∼ N(f, σ² I)
- Exercise: compute the marginal distribution of y
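As a numerical check on this exercise (spoiler: marginalising out f adds the noise variance to the covariance, y ∼ N(μ, Σ + σ² I)), here is a short NumPy sketch with arbitrary illustrative values for μ, Σ and σ².

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2 = 4, 0.25

# Arbitrary illustrative mean and covariance for f
mu = np.array([0.0, 1.0, -0.5, 2.0])
A = rng.standard_normal((n, n))
Sigma = A @ A.T + 0.1 * np.eye(n)   # symmetric positive definite

# Sample f ~ N(mu, Sigma), then y | f ~ N(f, sigma2 * I)
L = np.linalg.cholesky(Sigma)
F = mu + (L @ rng.standard_normal((n, 200_000))).T
Y = F + np.sqrt(sigma2) * rng.standard_normal(F.shape)

print("max |mean(y) - mu|:", np.abs(Y.mean(axis=0) - mu).max())
print("max |cov(y) - (Sigma + sigma2*I)|:",
      np.abs(np.cov(Y, rowvar=False) - (Sigma + sigma2 * np.eye(n))).max())
```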
Predicting with GPs

- Suppose we have noisy observations y of a function value at inputs x, and want to predict the value at a new input x_new
- The joint prior probability of the function values at the observed and new input points is multivariate Gaussian
- By Bayes' theorem, we have

  p(f_new | y) ∝ ∫ df(x) p(f_new, f(x)) p(y | f(x))    (1)

  where f(x) is the vector of true function values at the input points
- For Gaussian observation noise, the integral can be computed analytically
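Carrying out the integral for Gaussian noise gives the standard closed-form predictive; with a zero prior mean, writing K = k(x, x), k_* = k(x, x_new) and k_** = k(x_new, x_new), the predictive mean is k_*^T (K + σ² I)^{-1} y and the predictive covariance is k_** − k_*^T (K + σ² I)^{-1} k_*. The NumPy sketch below implements these formulas, with an RBF kernel and synthetic data as illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def rbf(a, b, alpha=1.0, lam=0.5):
    """RBF covariance between two sets of 1-D inputs."""
    return alpha**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * lam**2))

# Synthetic noisy observations of an (unknown) function
sigma2 = 0.05
x = np.linspace(0, 5, 15)
y = np.sin(x) + np.sqrt(sigma2) * rng.standard_normal(x.size)

x_new = np.linspace(0, 5, 100)

K = rbf(x, x) + sigma2 * np.eye(x.size)   # covariance of the noisy observations
K_star = rbf(x, x_new)                    # cross-covariance, shape (15, 100)
K_star_star = rbf(x_new, x_new)

# Predictive mean and covariance via a Cholesky factorisation of K
L = np.linalg.cholesky(K)
alpha_vec = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma2 I)^{-1} y
v = np.linalg.solve(L, K_star)

pred_mean = K_star.T @ alpha_vec
pred_cov = K_star_star - v.T @ v          # posterior covariance of f(x_new)
pred_std = np.sqrt(np.clip(np.diag(pred_cov), 0.0, None))

print(pred_mean.shape, pred_std.shape)    # (100,), (100,)
```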