Introduction to Gaussian Processes

  1. Introduction to Gaussian Processes. Iain Murray, School of Informatics, University of Edinburgh.

  2. The problem
  Learn a scalar function of vector values, $f(\mathbf{x})$.
  [Figure: a function $f(x)$ with noisy observations $y_i$, and a surface $f$ over inputs $x_1, x_2$.]
  We have (possibly noisy) observations $\{\mathbf{x}_i, y_i\}_{i=1}^{n}$.

  3. Example Applications
  Real-valued regression:
  - Robotics: target state → required torque
  - Process engineering: predicting yield
  - Surrogate surfaces for optimization or simulation
  Many problems are not regression: classification, rating/ranking, discovery, embedding, clustering, . . .
  But unknown functions may be part of a larger model.

  4. Model complexity
  The world is often complicated.
  [Figure: three panels fitting the same data: simple fit, complex fit, truth.]
  Problems:
  - Don’t want to underfit, and be too certain
  - Don’t want to overfit, and generalize poorly
  - Bayesian model comparison is often hard

  5. Predicting yield
  Factory settings $\mathbf{x}_1$ → profit of 32 ± 5 monetary units.
  Factory settings $\mathbf{x}_2$ → profit of 100 ± 200 monetary units.
  Which settings are better, $\mathbf{x}_1$ or $\mathbf{x}_2$?
  Knowing the error bars can be important.

  6. Optimization
  In high dimensions it takes many function evaluations to be certain everywhere. Costly if experiments are involved.
  [Figure: a 1D example function over the input range.]
  Error bars are needed to see if a region is still promising.

  7. Bayesian modelling
  If we come up with a parametric family of functions, $f(\mathbf{x}; \theta)$, and define a prior over $\theta$, probability theory tells us how to make predictions given data.
  For flexible models, this usually involves intractable integrals over $\theta$. We’re really good at integrating Gaussians though.
  [Figure: samples from a 2D Gaussian distribution.]
  Can we really solve significant machine learning problems with a simple multivariate Gaussian distribution?

  8. Gaussian distributions
  Completely described by parameters $\mu$ and $\Sigma$:
  $p(\mathbf{f} \,|\, \Sigma, \mu) = |2\pi\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(\mathbf{f}-\mu)^\top \Sigma^{-1} (\mathbf{f}-\mu)\right)$
  $\mu$ and $\Sigma$ are the mean and covariance:
  $\mu_i = \mathbb{E}[f_i], \qquad \Sigma_{ij} = \mathbb{E}[f_i f_j] - \mu_i \mu_j$
  If we know a distribution is Gaussian and know its mean and covariances, we know its density function.
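
A minimal NumPy sketch (not part of the slides) that evaluates the density formula above directly; the function name gaussian_density and the example mu, Sigma, f values are illustrative choices.

```python
import numpy as np

def gaussian_density(f, mu, Sigma):
    """Evaluate p(f | mu, Sigma) = |2*pi*Sigma|^(-1/2) exp(-1/2 (f-mu)^T Sigma^{-1} (f-mu))."""
    d = f - mu
    # slogdet is more numerically stable than computing the determinant directly
    _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)
    quad = d @ np.linalg.solve(Sigma, d)   # (f-mu)^T Sigma^{-1} (f-mu)
    return np.exp(-0.5 * (logdet + quad))

# Illustrative 2D example
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
print(gaussian_density(np.array([0.5, -0.2]), mu, Sigma))
```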

  9. Marginal of Gaussian
  The marginal of a Gaussian distribution is Gaussian.
  $p(\mathbf{f}, \mathbf{g}) = \mathcal{N}\!\left(\begin{bmatrix}\mathbf{f}\\ \mathbf{g}\end{bmatrix};\ \begin{bmatrix}\mathbf{a}\\ \mathbf{b}\end{bmatrix},\ \begin{bmatrix}A & C\\ C^\top & B\end{bmatrix}\right)$
  As soon as you convince yourself that the marginal $p(\mathbf{f}) = \int p(\mathbf{f}, \mathbf{g})\, \mathrm{d}\mathbf{g}$ is Gaussian, you already know the means and covariances:
  $p(\mathbf{f}) = \mathcal{N}(\mathbf{f};\ \mathbf{a}, A)$

  10. Conditional of Gaussian
  Any conditional of a Gaussian distribution is also Gaussian:
  $p(\mathbf{f}, \mathbf{g}) = \mathcal{N}\!\left(\begin{bmatrix}\mathbf{f}\\ \mathbf{g}\end{bmatrix};\ \begin{bmatrix}\mathbf{a}\\ \mathbf{b}\end{bmatrix},\ \begin{bmatrix}A & C\\ C^\top & B\end{bmatrix}\right)$
  $p(\mathbf{f} \,|\, \mathbf{g}) = \mathcal{N}\!\left(\mathbf{f};\ \mathbf{a} + C B^{-1}(\mathbf{g} - \mathbf{b}),\ A - C B^{-1} C^\top\right)$
  Showing this result requires some grunt work. But it is standard, and easily looked up.
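
A small sketch (my own, not from the slides) of the conditioning rule above; the helper name condition and the tiny 2D example numbers are illustrative.

```python
import numpy as np

def condition(a, b, A, B, C, g):
    """Given the joint N([a; b], [[A, C], [C^T, B]]) over (f, g),
    return the mean and covariance of p(f | g)."""
    mean = a + C @ np.linalg.solve(B, g - b)   # a + C B^{-1} (g - b)
    cov = A - C @ np.linalg.solve(B, C.T)      # A - C B^{-1} C^T
    return mean, cov

# Illustrative example: condition one scalar on another correlated scalar
a, b = np.array([0.0]), np.array([0.0])
A = np.array([[1.0]]); B = np.array([[1.0]]); C = np.array([[0.9]])
print(condition(a, b, A, B, C, g=np.array([1.0])))   # mean 0.9, cov 0.19
```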

  11. Noisy observations
  Previously we inferred $\mathbf{f}$ given $\mathbf{g}$. What if we only saw a noisy observation, $\mathbf{y} \sim \mathcal{N}(\mathbf{g}, S)$?
  $p(\mathbf{f}, \mathbf{g}, \mathbf{y}) = p(\mathbf{f}, \mathbf{g})\, p(\mathbf{y} \,|\, \mathbf{g})$ is Gaussian distributed: after multiplying the two Gaussians there is still a quadratic form inside the exponential.
  The posterior over $\mathbf{f}$ is still Gaussian:
  $p(\mathbf{f} \,|\, \mathbf{y}) \propto \int p(\mathbf{f}, \mathbf{g}, \mathbf{y})\, \mathrm{d}\mathbf{g}$
  The right-hand side is Gaussian after marginalizing, so there is still a quadratic form in $\mathbf{f}$ inside an exponential.

  12. Laying out Gaussians
  A way of visualizing draws from a 2D Gaussian: plot each draw $(f_1, f_2)$ as a pair of values $f$ at two positions $x_1$ and $x_2$, rather than as a point in the $(f_1, f_2)$ plane.
  [Figure: scatter of draws in the $(f_1, f_2)$ plane ⇔ the same draws shown as values $f$ over positions $x_1, x_2$.]
  Now it’s easy to show three draws from a 6D Gaussian:
  [Figure: three draws shown as values $f$ over positions $x_1, \ldots, x_6$.]

  13. Building large Gaussians
  [Figure: three draws from a 25D Gaussian, plotted as values $f$ against $x$.]
  To produce this, we needed a mean: I used zeros(25,1).
  The covariances were set using a kernel function: $\Sigma_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$.
  The $\mathbf{x}$’s are the positions where I planted the tick marks on the axis.
  Later we’ll find $k$’s that ensure $\Sigma$ is always positive semi-definite.
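
A sketch of this construction in NumPy, assuming the squared-exponential kernel that the slides introduce later; the 25 tick positions on [0, 1], the length-scale and the jitter term are my own illustrative choices.

```python
import numpy as np

def se_kernel(x1, x2, sigma_f=1.0, lengthscale=0.1):
    """Squared-exponential kernel between 1D input vectors (assumed choice)."""
    return sigma_f**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

x = np.linspace(0, 1, 25)                  # the tick positions on the axis
mu = np.zeros(25)                          # the zeros(25,1) mean from the slide
K = se_kernel(x, x) + 1e-10 * np.eye(25)   # tiny jitter keeps K numerically positive semi-definite

samples = np.random.default_rng(0).multivariate_normal(mu, K, size=3)
print(samples.shape)                       # (3, 25): three draws from a 25D Gaussian
```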

  14. GP regression model
  [Figure: two panels of example functions on the interval [0, 1].]
  Noisy observations:
  $f \sim \mathcal{GP}$, i.e. $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, K)$ with $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, where $f_i = f(\mathbf{x}_i)$
  $y_i \,|\, f_i \sim \mathcal{N}(f_i, \sigma_n^2)$
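
A short sketch of this generative model; the SE kernel, the eight random inputs and sigma_n = 0.1 are assumed, illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=8)                                    # training inputs x_i
K = np.exp(-0.5 * (X[:, None] - X[None, :])**2 / 0.1**2)         # K_ij = k(x_i, x_j), assumed SE kernel
f = rng.multivariate_normal(np.zeros(8), K + 1e-10 * np.eye(8))  # f ~ N(0, K)
sigma_n = 0.1
y = f + sigma_n * rng.normal(size=8)                             # y_i | f_i ~ N(f_i, sigma_n^2)
print(np.c_[X, f, y])
```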

  15. GP Posterior
  Our prior over observations and targets is Gaussian:
  $p\!\left(\begin{bmatrix}\mathbf{y}\\ \mathbf{f}_*\end{bmatrix}\right) = \mathcal{N}\!\left(\begin{bmatrix}\mathbf{y}\\ \mathbf{f}_*\end{bmatrix};\ \mathbf{0},\ \begin{bmatrix}K(X,X) + \sigma_n^2 I & K(X,X_*)\\ K(X_*,X) & K(X_*,X_*)\end{bmatrix}\right)$
  Using the rule for conditionals, $p(\mathbf{f}_* \,|\, \mathbf{y})$ is Gaussian with:
  mean: $\bar{\mathbf{f}}_* = K(X_*,X)\left(K(X,X) + \sigma_n^2 I\right)^{-1} \mathbf{y}$
  covariance: $\mathrm{cov}(\mathbf{f}_*) = K(X_*,X_*) - K(X_*,X)\left(K(X,X) + \sigma_n^2 I\right)^{-1} K(X,X_*)$
  The posterior over functions is a Gaussian Process.
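
A minimal sketch of these posterior equations; the 1D SE kernel, the three training points and five test points are illustrative, and the explicit matrix inverse is replaced by linear solves.

```python
import numpy as np

def se_kernel(A, B, lengthscale=0.2):
    """Assumed squared-exponential kernel for 1D inputs."""
    return np.exp(-0.5 * (A[:, None] - B[None, :])**2 / lengthscale**2)

X = np.array([0.1, 0.4, 0.7])      # training inputs
y = np.array([0.5, -0.3, 0.2])     # noisy targets
Xs = np.linspace(0, 1, 5)          # test inputs X_*
sigma_n = 0.1

Kxx = se_kernel(X, X) + sigma_n**2 * np.eye(len(X))   # K(X,X) + sigma_n^2 I
Ksx = se_kernel(Xs, X)                                # K(X_*, X)
Kss = se_kernel(Xs, Xs)                               # K(X_*, X_*)

mean = Ksx @ np.linalg.solve(Kxx, y)                  # K(X_*,X) (K + sigma_n^2 I)^{-1} y
cov = Kss - Ksx @ np.linalg.solve(Kxx, Ksx.T)         # posterior covariance of f_*
print(mean, np.diag(cov))
```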

  16. GP Posterior
  Two incomplete ways of visualizing what we know:
  [Figure: left, draws ∼ $p(f \,|\, \text{data})$; right, the posterior mean with error bars.]

  17. Point predictions
  The conditional at one point $\mathbf{x}_*$ is a simple Gaussian:
  $p(f(\mathbf{x}_*) \,|\, \text{data}) = \mathcal{N}(f;\ m, s^2)$
  Need covariances:
  $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j), \qquad (\mathbf{k}_*)_i = k(\mathbf{x}_*, \mathbf{x}_i)$
  Special case of the joint posterior:
  $M = K + \sigma_n^2 I$
  $m = \mathbf{k}_*^\top M^{-1} \mathbf{y}$
  $s^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \underbrace{\mathbf{k}_*^\top M^{-1} \mathbf{k}_*}_{\text{positive}}$
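
A sketch of the single-point formulas above, using a Cholesky factorization of M for the solves (a standard but here assumed implementation choice); the kernel, the data and the query point x_* are illustrative.

```python
import numpy as np
from numpy.linalg import cholesky, solve

def k(a, b, ell=0.2):
    """Assumed SE kernel for 1D inputs; ell chosen arbitrarily."""
    return np.exp(-0.5 * (a - b)**2 / ell**2)

X = np.array([0.1, 0.4, 0.7]); y = np.array([0.5, -0.3, 0.2]); sigma_n = 0.1
x_star = 0.55

K = k(X[:, None], X[None, :])
M = K + sigma_n**2 * np.eye(len(X))     # M = K + sigma_n^2 I
k_star = k(x_star, X)                   # (k_*)_i = k(x_*, x_i)

L = cholesky(M)
Minv_y = solve(L.T, solve(L, y))        # M^{-1} y via two triangular solves
Minv_k = solve(L.T, solve(L, k_star))   # M^{-1} k_*

m = k_star @ Minv_y                               # posterior mean at x_*
s2 = k(x_star, x_star) - k_star @ Minv_k          # posterior variance of f(x_*)
s2_y = s2 + sigma_n**2                            # predictive variance for y_* (next slide)
print(m, s2, s2_y)
```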

  18. Discovery or prediction?
  [Figure: true $f$, posterior mean and observations, with ±2σ bands for both $p(f_* \,|\, \text{data})$ and the wider $p(y_* \,|\, \text{data})$.]
  $p(f_* \,|\, \text{data}) = \mathcal{N}(f_*;\ m, s^2)$ says what we know about the noiseless function.
  $p(y_* \,|\, \text{data}) = \mathcal{N}(y_*;\ m, s^2 + \sigma_n^2)$ predicts what we’ll see next.

  19. Review so far
  We can represent a function as a big vector $\mathbf{f}$.
  We assume that this unknown vector was drawn from a big correlated Gaussian distribution, a Gaussian process. (This might upset some mathematicians, but for all practical machine learning and statistical problems, this is fine.)
  Observing elements of the vector (optionally corrupted by Gaussian noise) creates a Gaussian posterior distribution. The posterior over functions is still a Gaussian process.
  Marginalization in Gaussians is trivial: just ignore all of the positions $\mathbf{x}_i$ that are neither observed nor queried.

  20. Covariance functions
  The main part that has been missing so far is where the covariance function $k(\mathbf{x}_i, \mathbf{x}_j)$ comes from.
  What else can it say, other than that nearby points are similar?

  21. Covariance functions
  We can construct covariance functions from parametric models.
  Simplest example, Bayesian linear regression:
  $f(\mathbf{x}_i) = \mathbf{w}^\top \mathbf{x}_i + b, \qquad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma_w^2 I), \quad b \sim \mathcal{N}(0, \sigma_b^2)$
  $\mathrm{cov}(f_i, f_j) = \mathbb{E}[f_i f_j] - \mathbb{E}[f_i]\,\mathbb{E}[f_j]$   (both means are zero)
  $\qquad = \mathbb{E}\!\left[(\mathbf{w}^\top \mathbf{x}_i + b)(\mathbf{w}^\top \mathbf{x}_j + b)\right] = \sigma_w^2\, \mathbf{x}_i^\top \mathbf{x}_j + \sigma_b^2 = k(\mathbf{x}_i, \mathbf{x}_j)$
  Kernel parameters $\sigma_w^2$ and $\sigma_b^2$ are hyper-parameters in the Bayesian hierarchical model.
  More interesting kernels come from models with a large or infinite feature space: $k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_w^2\, \Phi(\mathbf{x}_i)^\top \Phi(\mathbf{x}_j) + \sigma_b^2$, the ‘kernel trick’.
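
A quick Monte Carlo check (mine, not in the slides) that the linear-regression covariance comes out as $\sigma_w^2\, \mathbf{x}_i^\top \mathbf{x}_j + \sigma_b^2$; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_w, sigma_b = 1.5, 0.5
xi = np.array([0.3, -1.0]); xj = np.array([0.8, 0.2])

# Sample many (w, b) from the prior and evaluate f at x_i and x_j
w = rng.normal(0, sigma_w, size=(200000, 2))
b = rng.normal(0, sigma_b, size=200000)
fi, fj = w @ xi + b, w @ xj + b

print(np.mean(fi * fj))                      # empirical E[f_i f_j] (means are zero)
print(sigma_w**2 * xi @ xj + sigma_b**2)     # kernel value; should be close
```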

  22. What’s a valid kernel?
  We could ‘make up’ a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$. But any ‘Gram matrix’ must be positive semi-definite:
  $K = \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & \cdots & k(\mathbf{x}_1, \mathbf{x}_N) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_N, \mathbf{x}_1) & \cdots & k(\mathbf{x}_N, \mathbf{x}_N) \end{bmatrix}, \qquad \mathbf{z}^\top K \mathbf{z} \ge 0 \ \text{for all}\ \mathbf{z}$
  This is achieved by a positive semi-definite kernel, or Mercer kernel.
  $K$ has positive eigenvalues ⇒ $K^{-1}$ has positive eigenvalues ⇒ the Gaussian is normalizable.
  Mercer kernels give inner products of some feature vectors $\Phi(\mathbf{x})$. But these $\Phi(\mathbf{x})$ vectors may be infinite-dimensional.
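
A sketch of checking positive semi-definiteness numerically: build a Gram matrix from a candidate kernel and inspect its eigenvalues. The candidate kernel and the ten input points are illustrative choices.

```python
import numpy as np

def candidate_kernel(xi, xj):
    # A 1D squared-exponential kernel, a known-valid choice used here as the candidate
    return np.exp(-0.5 * (xi - xj)**2)

x = np.linspace(0, 1, 10)
K = candidate_kernel(x[:, None], x[None, :])   # Gram matrix K_ij = k(x_i, x_j)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)                 # True: no significantly negative eigenvalues
```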

  23. Squared-exponential kernel
  An ∞ number of radial-basis functions can give
  $k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\!\left(-\tfrac{1}{2} \sum_{d=1}^{D} (x_{d,i} - x_{d,j})^2 / \ell_d^2\right),$
  the most commonly-used kernel in machine learning.
  It looks like an (unnormalized) Gaussian, so it is sometimes called the Gaussian kernel. A Gaussian process need not use the “Gaussian” kernel. In fact, other choices will often be better.
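
A sketch of this kernel for D-dimensional inputs with one length-scale per dimension; the function name se_kernel, the parameter values and the random inputs are illustrative.

```python
import numpy as np

def se_kernel(Xi, Xj, sigma_f=1.0, ell=None):
    """Xi: (N, D), Xj: (M, D); ell: per-dimension length-scales, shape (D,)."""
    if ell is None:
        ell = np.ones(Xi.shape[1])
    diff = (Xi[:, None, :] - Xj[None, :, :]) / ell          # scaled differences, (N, M, D)
    return sigma_f**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

X = np.random.default_rng(0).uniform(size=(5, 2))           # five 2D inputs
K = se_kernel(X, X, sigma_f=1.0, ell=np.array([0.3, 1.0]))
print(np.diag(K))   # equals sigma_f^2 everywhere, as the next slide notes
```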

  24. Meaning of hyper-parameters
  Many kernels have similar types of parameters:
  $k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\!\left(-\tfrac{1}{2} \sum_{d=1}^{D} (x_{d,i} - x_{d,j})^2 / \ell_d^2\right)$
  Consider $\mathbf{x}_i = \mathbf{x}_j$ ⇒ the marginal function variance is $\sigma_f^2$.
  [Figure: sample functions drawn with $\sigma_f = 2$ and $\sigma_f = 10$.]

  25. Meaning of hyper-parameters
  The $\ell_d$ parameters give the length-scale in dimension $d$:
  $k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\!\left(-\tfrac{1}{2} \sum_{d=1}^{D} (x_{d,i} - x_{d,j})^2 / \ell_d^2\right)$
  Typical distance between peaks ≈ $\ell$.
  [Figure: sample functions drawn with $\ell = 0.05$ and $\ell = 0.5$.]

  26. Effect of hyper-parameters
  Different (SE) kernel parameters give different explanations of the data:
  [Figure: two posterior fits to the same data, with $\ell = 0.5$, $\sigma_n = 0.05$ on the left and $\ell = 1.5$, $\sigma_n = 0.15$ on the right.]

  27. Other kernels
  The SE kernel produces very smooth and ‘boring’ functions. Kernels are available for rough data, periodic data, strings, graphs, images, models, . . .
  Different kernels can be combined, as in the sketch below:
  $k(\mathbf{x}_i, \mathbf{x}_j) = \alpha k_1(\mathbf{x}_i, \mathbf{x}_j) + \beta k_2(\mathbf{x}_i, \mathbf{x}_j)$
  This is positive semi-definite if $k_1$ and $k_2$ are (and $\alpha, \beta \ge 0$).
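
A sketch of combining two kernels as above; the choice of an SE kernel plus a standard periodic kernel, and the weights alpha and beta, are my illustrative assumptions. The eigenvalue check mirrors slide 22.

```python
import numpy as np

def k_se(xi, xj, ell=0.3):
    return np.exp(-0.5 * (xi - xj)**2 / ell**2)

def k_periodic(xi, xj, period=0.25, ell=0.5):
    # Standard periodic kernel: exp(-2 sin^2(pi * (x - x') / period) / ell^2)
    return np.exp(-2 * np.sin(np.pi * (xi - xj) / period)**2 / ell**2)

def k_combined(xi, xj, alpha=0.7, beta=0.3):
    # Weighted sum of two valid kernels with non-negative weights
    return alpha * k_se(xi, xj) + beta * k_periodic(xi, xj)

x = np.linspace(0, 1, 20)
K = k_combined(x[:, None], x[None, :])
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # still positive semi-definite
```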
