

  1. Advances in Gaussian Processes
     Tutorial at NIPS 2006 in Vancouver
     Carl Edward Rasmussen
     Max Planck Institute for Biological Cybernetics, Tübingen
     December 4th, 2006

  2.–5. The Prediction Problem
     [Figure, repeated over four slides: atmospheric CO2 concentration (ppm) plotted against the year, 1960–2020; a question mark marks the future values to be predicted.]

  6. The Prediction Problem
     Ubiquitous questions:
     • Model fitting
       • how do I fit the parameters?
       • what about overfitting?
     • Model selection
       • how do I find out which model to use?
       • how sure can I be?
     • Interpretation
       • what is the accuracy of the predictions?
       • can I trust the predictions, even if
         • ... I am not sure about the parameters?
         • ... I am not sure of the model structure?
     Gaussian processes solve some of the above, and provide a practical framework to address the remaining issues.

  7. Outline
     Part I: foundations
     • What is a Gaussian process
       • from distribution to process
       • distribution over functions
       • the marginalization property
     • Inference
       • Bayesian inference
       • posterior over functions
       • predictive distribution
       • marginal likelihood
       • Occam's Razor
       • automatic complexity penalty
     Part II: advanced topics
     • Example
       • priors over functions
       • hierarchical priors using hyperparameters
       • learning the covariance function
     • Approximate methods for classification
     • Gaussian Process latent variable models
     • Sparse methods

  8. The Gaussian Distribution
     The Gaussian distribution is given by
       $$ p(\mathbf{x} \mid \mu, \Sigma) = \mathcal{N}(\mu, \Sigma) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\big(-\tfrac{1}{2} (\mathbf{x} - \mu)^\top \Sigma^{-1} (\mathbf{x} - \mu)\big) $$
     where µ is the mean vector and Σ the covariance matrix.
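     As a concrete illustration of the density formula on this slide, here is a minimal NumPy sketch (not part of the original tutorial; the example mean and covariance are made up). A production version would work with the Cholesky factor of Σ for numerical stability.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(mu, Sigma) at x,
    transcribing the formula on the slide directly."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# example: a 2-D Gaussian evaluated at its own mean
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(gaussian_density(mu, mu, Sigma))   # equals (2*pi)^{-1} * det(Sigma)^{-1/2}
```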

  9. Conditionals and Marginals of a Gaussian
     [Figure: two panels of a joint Gaussian density; one panel highlights a conditional, the other a marginal.]
     Both the conditionals and the marginals of a joint Gaussian are again Gaussian.

  10. What is a Gaussian Process?
     A Gaussian process is a generalization of a multivariate Gaussian distribution to infinitely many variables.
     Informally: infinitely long vector ≃ function
     Definition: a Gaussian process is a collection of random variables, any finite number of which have (consistent) Gaussian distributions.
     A Gaussian distribution is fully specified by a mean vector µ and covariance matrix Σ:
       $$ \mathbf{f} = (f_1, \ldots, f_n)^\top \sim \mathcal{N}(\mu, \Sigma), \quad \text{indexes } i = 1, \ldots, n $$
     A Gaussian process is fully specified by a mean function m(x) and covariance function k(x, x′):
       $$ f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big), \quad \text{indexes: } x $$

  11. The marginalization property
     Thinking of a GP as a Gaussian distribution with an infinitely long mean vector and an infinite-by-infinite covariance matrix may seem impractical...
     ... luckily we are saved by the marginalization property:
     Recall:
       $$ p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{y})\, d\mathbf{y}. $$
     For Gaussians:
       $$ p(\mathbf{x}, \mathbf{y}) = \mathcal{N}\left( \begin{bmatrix} a \\ b \end{bmatrix},\; \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix} \right) \;\Longrightarrow\; p(\mathbf{x}) = \mathcal{N}(a, A) $$
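     A quick numerical check of the marginalization property, as a NumPy sketch (the block values a, b, A, B, C below are invented for illustration): simply dropping the y-coordinates of samples from the joint leaves samples whose mean and covariance match N(a, A).

```python
import numpy as np

rng = np.random.default_rng(0)

# joint Gaussian over (x, y) with block mean (a, b) and covariance [[A, B], [B^T, C]]
a, b = np.array([1.0]), np.array([-2.0])
A, B, C = np.array([[2.0]]), np.array([[0.8]]), np.array([[1.5]])

mean = np.concatenate([a, b])
cov = np.block([[A, B], [B.T, C]])

samples = rng.multivariate_normal(mean, cov, size=200_000)
x = samples[:, :1]          # keep only x, i.e. marginalize out y

print(x.mean(axis=0))       # ≈ a
print(np.cov(x.T))          # ≈ A
```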

  12. Random functions from a Gaussian Process
     Example one-dimensional Gaussian process:
       $$ p(f(x)) \sim \mathcal{GP}\big(m(x) = 0,\; k(x, x') = \exp(-\tfrac{1}{2}(x - x')^2)\big). $$
     To get an indication of what this distribution over functions looks like, focus on a finite subset of function values $\mathbf{f} = (f(x_1), f(x_2), \ldots, f(x_n))^\top$, for which
       $$ \mathbf{f} \sim \mathcal{N}(0, \Sigma), \quad \text{where } \Sigma_{ij} = k(x_i, x_j). $$
     Then plot the coordinates of $\mathbf{f}$ as a function of the corresponding x values.
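     The recipe on this slide translates almost line for line into NumPy; the sketch below (grid size, jitter and seed are arbitrary choices, not from the slides) draws three random functions from the zero-mean GP with the squared-exponential covariance above.

```python
import numpy as np

def k(x1, x2):
    """Squared-exponential covariance from the slide: exp(-0.5 (x - x')^2)."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2)

x = np.linspace(-5, 5, 201)                  # finite subset of inputs
Sigma = k(x, x) + 1e-10 * np.eye(len(x))     # small jitter for numerical stability

rng = np.random.default_rng(1)
f = rng.multivariate_normal(np.zeros(len(x)), Sigma, size=3)   # three random functions

# plotting each row of f against x reproduces figures like the one on the next slide,
# e.g. with matplotlib: plt.plot(x, f.T)
```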

  13. Some values of the random function
     [Figure: sampled function values f(x) plotted against inputs x ∈ [−5, 5], with outputs roughly in the range −1.5 to 1.5.]

  14. Sequential Generation
     Factorize the joint distribution
       $$ p(f_1, \ldots, f_n \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} p(f_i \mid f_{i-1}, \ldots, f_1, x_i, \ldots, x_1), $$
     and generate function values sequentially. What do the individual terms look like? For Gaussians:
       $$ p(\mathbf{x}, \mathbf{y}) = \mathcal{N}\left( \begin{bmatrix} a \\ b \end{bmatrix},\; \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix} \right) \;\Longrightarrow\; p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}\big(a + BC^{-1}(\mathbf{y} - b),\; A - BC^{-1}B^\top\big) $$
     Do try this at home!
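     In the spirit of "do try this at home", here is a hedged NumPy sketch of the sequential construction: each new function value is drawn from the Gaussian conditional above, given all previously generated values (the grid, jitter and seed are arbitrary illustration choices).

```python
import numpy as np

def k(x1, x2):
    """Squared-exponential covariance, as on slide 12."""
    return np.exp(-0.5 * (np.asarray(x1)[:, None] - np.asarray(x2)[None, :]) ** 2)

rng = np.random.default_rng(2)
xs = np.linspace(-5, 5, 30)
f = []

for i, xi in enumerate(xs):
    if i == 0:
        mean, var = 0.0, k([xi], [xi])[0, 0]
    else:
        xp = xs[:i]                               # previously generated inputs
        Kpp = k(xp, xp) + 1e-8 * np.eye(i)
        kpi = k(xp, [xi])[:, 0]
        w = np.linalg.solve(Kpp, kpi)             # C^{-1} B^T in the slide's notation
        mean = w @ np.array(f)                    # a + B C^{-1}(y - b), with zero means
        var = k([xi], [xi])[0, 0] - w @ kpi       # A - B C^{-1} B^T
    f.append(mean + np.sqrt(max(var, 0.0)) * rng.standard_normal())
```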

  15. Function drawn at random from a Gaussian Process with Gaussian covariance
     [Figure: a surface plot of a random function of two inputs drawn from the GP, over inputs roughly in [−6, 6] × [−6, 6].]

  16. Maximum likelihood, parametric model
     Supervised parametric learning:
     • data: x, y
     • model: y = f_w(x) + ε
     Gaussian likelihood:
       $$ p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, M_i) \;\propto\; \prod_c \exp\big(-\tfrac{1}{2}(y_c - f_{\mathbf{w}}(x_c))^2 / \sigma^2_{\text{noise}}\big). $$
     Maximize the likelihood:
       $$ \mathbf{w}_{\text{ML}} = \operatorname*{argmax}_{\mathbf{w}}\; p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, M_i). $$
     Make predictions by plugging in the ML estimate:
       $$ p(y_* \mid x_*, \mathbf{w}_{\text{ML}}, M_i) $$
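     The slides leave f_w unspecified; as a minimal sketch, assume a straight-line model f_w(x) = w₀ + w₁x with Gaussian noise (a hypothetical choice, not from the tutorial). Maximizing the Gaussian likelihood in w is then ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 30)
y = 1.0 + 0.5 * x + 0.3 * rng.standard_normal(len(x))   # synthetic toy data

Phi = np.column_stack([np.ones_like(x), x])              # design matrix for f_w(x) = w0 + w1*x

# maximizing the Gaussian likelihood in w is equivalent to least squares
w_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# plug-in prediction at a test input x*
x_star = 12.0
y_star = np.array([1.0, x_star]) @ w_ml
```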

  17. Bayesian Inference, parametric model
     Supervised parametric learning:
     • data: x, y
     • model: y = f_w(x) + ε
     Gaussian likelihood:
       $$ p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, M_i) \;\propto\; \prod_c \exp\big(-\tfrac{1}{2}(y_c - f_{\mathbf{w}}(x_c))^2 / \sigma^2_{\text{noise}}\big). $$
     Parameter prior:
       $$ p(\mathbf{w} \mid M_i) $$
     Posterior parameter distribution by Bayes' rule, p(a|b) = p(b|a)p(a)/p(b):
       $$ p(\mathbf{w} \mid \mathbf{x}, \mathbf{y}, M_i) = \frac{p(\mathbf{w} \mid M_i)\, p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, M_i)}{p(\mathbf{y} \mid \mathbf{x}, M_i)} $$
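     For the same hypothetical straight-line model with a Gaussian prior on the weights, Bayes' rule can be carried out in closed form (conjugacy); this is a sketch of that special case, not the general procedure on the slide:

```python
import numpy as np

sigma_w, sigma_noise = 1.0, 0.3                 # prior scale and noise level (assumed)

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 30)
y = 1.0 + 0.5 * x + sigma_noise * rng.standard_normal(len(x))
Phi = np.column_stack([np.ones_like(x), x])     # straight-line features, as before

# prior p(w | M_i) = N(0, sigma_w^2 I); likelihood Gaussian => Gaussian posterior
A = Phi.T @ Phi / sigma_noise**2 + np.eye(2) / sigma_w**2
S_post = np.linalg.inv(A)                       # posterior covariance over w
w_post = S_post @ Phi.T @ y / sigma_noise**2    # posterior mean over w
```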

  18. Bayesian Inference, parametric model, cont.
     Making predictions:
       $$ p(y_* \mid x_*, \mathbf{x}, \mathbf{y}, M_i) = \int p(y_* \mid \mathbf{w}, x_*, M_i)\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{y}, M_i)\, d\mathbf{w} $$
     Marginal likelihood:
       $$ p(\mathbf{y} \mid \mathbf{x}, M_i) = \int p(\mathbf{w} \mid M_i)\, p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, M_i)\, d\mathbf{w}. $$
     Model probability:
       $$ p(M_i \mid \mathbf{x}, \mathbf{y}) = \frac{p(M_i)\, p(\mathbf{y} \mid \mathbf{x}, M_i)}{p(\mathbf{y} \mid \mathbf{x})} $$
     Problem: the integrals are intractable for most interesting models!
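     The slide's point stands in general, but for the linear-Gaussian toy model used in the sketches above the marginal likelihood integral happens to be tractable: integrating out w leaves a Gaussian over y. A small NumPy illustration of that special case (same hypothetical setup):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_w, sigma_noise = 1.0, 0.3
x = np.linspace(0, 10, 30)
y = 1.0 + 0.5 * x + sigma_noise * rng.standard_normal(len(x))
Phi = np.column_stack([np.ones_like(x), x])

# p(y | x, M_i) = N(0, sigma_w^2 Phi Phi^T + sigma_noise^2 I): the w-integral done analytically
K = sigma_w**2 * Phi @ Phi.T + sigma_noise**2 * np.eye(len(y))
_, logdet = np.linalg.slogdet(K)
log_marginal = -0.5 * (y @ np.linalg.solve(K, y) + logdet + len(y) * np.log(2 * np.pi))
```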

  19. Non-parametric Gaussian process models
     In our non-parametric model, the "parameter" is the function itself!
     Gaussian likelihood:
       $$ \mathbf{y} \mid \mathbf{x}, f(x), M_i \sim \mathcal{N}(\mathbf{f},\, \sigma^2_{\text{noise}} I) $$
     (Zero mean) Gaussian process prior:
       $$ f(x) \mid M_i \sim \mathcal{GP}\big(m(x) \equiv 0,\; k(x, x')\big) $$
     Leads to a Gaussian process posterior:
       $$ f(x) \mid \mathbf{x}, \mathbf{y}, M_i \sim \mathcal{GP}\big(m_{\text{post}}(x),\; k_{\text{post}}(x, x')\big), $$
       $$ m_{\text{post}}(x) = k(x, \mathbf{x})\,[K(\mathbf{x}, \mathbf{x}) + \sigma^2_{\text{noise}} I]^{-1}\, \mathbf{y}, $$
       $$ k_{\text{post}}(x, x') = k(x, x') - k(x, \mathbf{x})\,[K(\mathbf{x}, \mathbf{x}) + \sigma^2_{\text{noise}} I]^{-1}\, k(\mathbf{x}, x'). $$
     And a Gaussian predictive distribution:
       $$ y_* \mid x_*, \mathbf{x}, \mathbf{y}, M_i \sim \mathcal{N}\big(k(x_*, \mathbf{x})^\top [K + \sigma^2_{\text{noise}} I]^{-1} \mathbf{y},\;\; k(x_*, x_*) + \sigma^2_{\text{noise}} - k(x_*, \mathbf{x})^\top [K + \sigma^2_{\text{noise}} I]^{-1} k(x_*, \mathbf{x})\big) $$
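     The predictive mean and variance on this slide take only a few lines of NumPy; the sketch below uses made-up training points and an assumed noise level purely for illustration, with the squared-exponential covariance from slide 12.

```python
import numpy as np

def k(x1, x2):
    """Squared-exponential covariance, as in the earlier slides."""
    return np.exp(-0.5 * (np.asarray(x1)[:, None] - np.asarray(x2)[None, :]) ** 2)

# toy training data (hypothetical) and assumed noise level
x_train = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
y_train = np.array([-1.0, 0.5, 1.0, 0.7, -0.5])
sigma_noise = 0.1

K = k(x_train, x_train) + sigma_noise**2 * np.eye(len(x_train))   # K + sigma^2 I
x_star = np.linspace(-5, 5, 101)
K_star = k(x_train, x_star)                                        # k(x, x*) for all test points

alpha = np.linalg.solve(K, y_train)                                # [K + sigma^2 I]^{-1} y
pred_mean = K_star.T @ alpha                                       # k(x*, x)^T [K + sigma^2 I]^{-1} y
v = np.linalg.solve(K, K_star)                                     # [K + sigma^2 I]^{-1} k(x, x*)
pred_var = (k(x_star, x_star).diagonal() + sigma_noise**2
            - np.einsum('ij,ij->j', K_star, v))                    # predictive variance per test point
```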

  20. Prior and Posterior
     [Figure: two panels of sample functions f(x) over inputs x ∈ [−5, 5]: draws from the GP prior (left) and from the GP posterior (right).]
     Predictive distribution:
       $$ p(y_* \mid x_*, \mathbf{x}, \mathbf{y}) \sim \mathcal{N}\big(k(x_*, \mathbf{x})^\top [K + \sigma^2_{\text{noise}} I]^{-1} \mathbf{y},\;\; k(x_*, x_*) + \sigma^2_{\text{noise}} - k(x_*, \mathbf{x})^\top [K + \sigma^2_{\text{noise}} I]^{-1} k(x_*, \mathbf{x})\big) $$
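     Figures like the one described above can be reproduced by drawing whole sample functions from the prior and from the posterior of slide 19; a minimal sketch, again with invented training points and an assumed noise level:

```python
import numpy as np

def k(x1, x2):
    return np.exp(-0.5 * (np.asarray(x1)[:, None] - np.asarray(x2)[None, :]) ** 2)

rng = np.random.default_rng(4)
x_train = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
y_train = np.array([-1.0, 0.5, 1.0, 0.7, -0.5])
sigma_noise = 0.1
xs = np.linspace(-5, 5, 201)
jitter = 1e-8 * np.eye(len(xs))

# prior samples: zero mean, covariance k(xs, xs)
prior = rng.multivariate_normal(np.zeros(len(xs)), k(xs, xs) + jitter, size=3)

# posterior samples: use m_post and k_post from slide 19
K_inv = np.linalg.inv(k(x_train, x_train) + sigma_noise**2 * np.eye(len(x_train)))
Ks = k(x_train, xs)
m_post = Ks.T @ K_inv @ y_train
k_post = k(xs, xs) - Ks.T @ K_inv @ Ks
posterior = rng.multivariate_normal(m_post, k_post + jitter, size=3)

# plotting the rows of `prior` and `posterior` against xs gives the two panels
```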
