Advanced Introduction to Machine Learning CMU-10715: Gaussian Processes


  1. Advanced Introduction to Machine Learning CMU-10715 Gaussian Processes Barnabás Póczos

  2. Introduction

  3. http://www.gaussianprocess.org/ Some of these slides in the intro are taken from D. Lizotte, R. Parr, and C. Guestrin 3

  4. Contents: • Introduction • Regression • Properties of Multivariate Gaussian Distributions • Ridge Regression • Gaussian Processes: weight space view (Bayesian ridge regression + kernel trick), function space view (prior distribution over functions + calculation of posterior distributions) 4

  5. Regression

  6. Why GPs for Regression? Regression methods: linear regression, ridge regression, support vector regression, kNN regression, etc. Motivation 1: All of the above regression methods give point estimates; we would like a method that also provides a confidence measure with its estimates. Motivation 2: Let us kernelize linear ridge regression and see what we get…

  7. Why GPs for Regression? GPs can answer the following questions • Here’s where the function will most likely be. (expected function) • Here are some examples of what it might look like. (sampling from the posterior distribution) • Here is a prediction of what you’ll see if you evaluate your function at x’, with confidence 7

  8. Properties of Multivariate Gaussian Distributions 8

  9. 1D Gaussian Distribution  Parameters: • Mean, $\mu$ • Variance, $\sigma^2$ 9

  10. Multivariate Gaussian 10
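
For reference (the formula itself did not survive extraction), the standard density of a $d$-dimensional Gaussian with mean $\mu$ and covariance matrix $\Sigma$ is:

```latex
p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
  \exp\!\Big(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\Big).
```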

  11. Multivariate Gaussian  A 2-dimensional Gaussian is defined by • a mean vector $\mu = [\mu_1, \mu_2]$ • a covariance matrix $\Sigma = \begin{pmatrix} \sigma^2_{1,1} & \sigma^2_{1,2} \\ \sigma^2_{2,1} & \sigma^2_{2,2} \end{pmatrix}$, where $\sigma^2_{i,j} = E[(x_i - \mu_i)(x_j - \mu_j)]$ is the (co)variance.  Note: $\Sigma$ is symmetric and positive semi-definite: $\forall x:\ x^{\top}\Sigma x \ge 0$ 11

  12. Multivariate Gaussian examples  $\mu = (0,0)$, $\Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$ 12

  13. Multivariate Gaussian examples  $\mu = (0,0)$, $\Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$ 13
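
The two example slides show plots for this Gaussian. A minimal NumPy sketch for reproducing such a plot, using the $\mu=(0,0)$ and $0.8$ covariance entries from the slide text (the plotting choices are my own):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])  # covariance from the example slides

# Draw samples and scatter-plot them; the cloud is an ellipse tilted by the 0.8 correlation.
samples = rng.multivariate_normal(mu, Sigma, size=1000)
plt.scatter(samples[:, 0], samples[:, 1], s=5)
plt.axis("equal")
plt.title("Samples from N((0,0), [[1, 0.8], [0.8, 1]])")
plt.show()
```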

  14. Useful Properties of Gaussians  Marginal distributions of Gaussians are Gaussian  Given: $(x_a, x_b) \sim \mathcal{N}(\mu, \Sigma)$ with $\mu = \begin{pmatrix}\mu_a \\ \mu_b\end{pmatrix}$, $\Sigma = \begin{pmatrix}\Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb}\end{pmatrix}$  Marginal distribution: $x_a \sim \mathcal{N}(\mu_a, \Sigma_{aa})$ 14

  15. Marginal distributions of Gaussians are Gaussian 15
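
One way to see this property (not necessarily the derivation used on the slide): the marginal of $x_a$ is the image of the joint under a linear projection, and linear maps of Gaussians stay Gaussian:

```latex
x_a = P\,x, \quad P = (I \;\; 0), \qquad
x \sim \mathcal{N}(\mu, \Sigma) \;\Rightarrow\;
P x \sim \mathcal{N}(P\mu,\, P\Sigma P^{\top}) = \mathcal{N}(\mu_a, \Sigma_{aa}).
```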

  16. Block Matrix Inversion Theorem Definition: Schur complements 16
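
The slide's exact statement is not reproduced here, but the standard definitions and identity are: for a partitioned matrix $M$, the Schur complement of $D$ is $M/D = A - BD^{-1}C$ (and $M/A = D - CA^{-1}B$), and the block inverse reads

```latex
M = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \qquad
M^{-1} = \begin{pmatrix}
(M/D)^{-1} & -(M/D)^{-1} B D^{-1} \\[2pt]
-D^{-1} C (M/D)^{-1} & \; D^{-1} + D^{-1} C (M/D)^{-1} B D^{-1}
\end{pmatrix}.
```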

  17. Useful Properties of Gaussians  Conditional distributions of Gaussians are Gaussian  Notation: $\Sigma = \begin{pmatrix}\Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb}\end{pmatrix}$, $\Sigma^{-1} = \begin{pmatrix}\tilde{\Sigma}_{aa} & \tilde{\Sigma}_{ab} \\ \tilde{\Sigma}_{ba} & \tilde{\Sigma}_{bb}\end{pmatrix}$  Conditional distribution: see below 17
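
The conditional referred to here is the standard Gaussian conditioning formula:

```latex
x_a \mid x_b \;\sim\; \mathcal{N}\!\Big(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\;
\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\Big).
```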

  18. Higher Dimensions  Visualizing > 3 dimensions is… difficult  Means and marginals are practical, but then we don't see correlations between those variables  Marginals are Gaussian, e.g., $f(6) \sim \mathcal{N}(\mu(6), \sigma^2(6))$  [Figure: an 8-dimensional Gaussian $f$ visualized by its means $\mu(i)$ and variances $\sigma^2(i)$ over indices $1, \dots, 8$] 18

  19. Yet Higher Dimensions Why stop there? Don’t panic: It’s just a function 19

  20. Getting Ridiculous Why stop there? 20

  21. Gaussian Process Definition :  Probability distribution indexed by an arbitrary set (integer, real, finite dimensional vector, etc)  Each element gets a Gaussian distribution over the reals with mean µ(x)  These distributions are dependent/correlated as defined by k (x,z)  Any finite subset of indices defines a multivariate Gaussian distribution 21
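
In symbols, writing $f \sim \mathcal{GP}(\mu, k)$ for a GP with mean function $\mu$ and covariance function $k$, the last bullet is the defining property:

```latex
\big(f(x_1), \dots, f(x_n)\big) \sim
\mathcal{N}\!\Big(\big(\mu(x_1), \dots, \mu(x_n)\big),\, K\Big),
\qquad K_{ij} = k(x_i, x_j),
```

for every finite collection of indices $x_1, \dots, x_n$.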

  22. Gaussian Process  Distribution over functions…. Yayyy! If our regression model is a GP, then it won’t be a point estimate anymore! It can provide regression estimates with confidence  Domain (index set) of the functions can be pretty much whatever • Reals • Real vectors • Graphs • Strings • Sets • … 22

  23. Bayesian Updates for GPs • How can we do regression and learn the GP from data? • We will be Bayesians today: • Start with GP prior • Get some data • Compute a posterior 23

  24. Samples from the prior distribution 24 Picture is taken from Rasmussen and Williams

  25. Samples from the posterior distribution 25 Picture is taken from Rasmussen and Williams
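
A minimal sketch of how figures like these can be produced, assuming a squared-exponential kernel with unit hyperparameters and a few made-up observations (the slides' actual kernel, data, and settings are not specified in the text):

```python
import numpy as np
import matplotlib.pyplot as plt

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance k(x, z) = exp(-(x - z)^2 / (2 l^2)) for 1-D inputs."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

rng = np.random.default_rng(0)
xs = np.linspace(-5, 5, 200)
K = rbf_kernel(xs, xs) + 1e-8 * np.eye(len(xs))  # jitter for numerical stability

# Samples from the zero-mean GP prior.
prior_samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)

# Condition on a few noisy observations and sample from the posterior.
x_train = np.array([-4.0, -1.0, 2.0])          # illustrative inputs
y_train = np.array([0.5, -1.0, 1.5])           # illustrative targets
noise_var = 0.1**2
K_tt = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
K_ts = rbf_kernel(x_train, xs)
mean_post = K_ts.T @ np.linalg.solve(K_tt, y_train)
cov_post = K - K_ts.T @ np.linalg.solve(K_tt, K_ts) + 1e-8 * np.eye(len(xs))
post_samples = rng.multivariate_normal(mean_post, cov_post, size=3)

for s in prior_samples:
    plt.plot(xs, s, color="tab:blue", alpha=0.5)
for s in post_samples:
    plt.plot(xs, s, color="tab:red", alpha=0.5)
plt.scatter(x_train, y_train, color="black", zorder=3)
plt.show()
```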

  26. Prior 26

  27. Data 27

  28. Posterior 28

  29. Contents: • Introduction • Ridge Regression • Gaussian Processes: weight space view (Bayesian ridge regression + kernel trick), function space view (prior distribution over functions + calculation of posterior distributions) 29

  30. Ridge Regression  Linear regression:  Ridge regression:  The Gaussian process is a Bayesian generalization of kernelized ridge regression 30
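
The two objectives the slide contrasts, in their usual form for training data $\{(x_i, y_i)\}_{i=1}^{n}$:

```latex
\text{Linear regression:}\quad
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \big(y_i - w^{\top}x_i\big)^2,
\qquad
\text{Ridge regression:}\quad
\hat{w} = \arg\min_{w} \sum_{i=1}^{n} \big(y_i - w^{\top}x_i\big)^2 + \lambda\,\|w\|_2^2 .
```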

  31. Contents: • Introduction • Ridge Regression • Gaussian Processes: weight space view (Bayesian ridge regression + kernel trick), function space view (prior distribution over functions + calculation of posterior distributions) 31

  32. Weight Space View GP = Bayesian ridge regression in feature space + Kernel trick to carry out computations The training data 32

  33. Bayesian Analysis of Linear Regression with Gaussian noise 33

  34. Bayesian Analysis of Linear Regression with Gaussian noise The likelihood: 34

  35. Bayesian Analysis of Linear Regression with Gaussian noise The prior: Now, we can calculate the posterior: 35

  36. Bayesian Analysis of Linear Regression with Gaussian noise Ridge Regression After “completing the square” MAP estimation 36
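
In the notation of Rasmussen and Williams (model $y = X^{\top}w + \varepsilon$ with $X$ the $d \times n$ matrix of training inputs, noise $\varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)$, prior $w \sim \mathcal{N}(0, \Sigma_p)$; the slide's notation may differ slightly), completing the square gives the Gaussian posterior

```latex
w \mid X, y \;\sim\; \mathcal{N}\big(\bar{w},\, A^{-1}\big), \qquad
\bar{w} = \sigma_n^{-2} A^{-1} X y, \qquad
A = \sigma_n^{-2} X X^{\top} + \Sigma_p^{-1}.
```

The MAP estimate is the posterior mean $\bar{w} = (XX^{\top} + \sigma_n^2\Sigma_p^{-1})^{-1}Xy$; for an isotropic prior with $\Sigma_p^{-1} = (\lambda/\sigma_n^2) I$ this is exactly the ridge regression solution, which is the connection the slide draws.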

  37. Bayesian Analysis of Linear Regression with Gaussian noise  This posterior covariance matrix doesn't depend on the observations y, which is a somewhat surprising property of Gaussian processes 37

  38. Projections of Inputs into Feature Space  The Bayesian linear regression reviewed so far suffers from limited expressiveness  To overcome the problem, go to a feature space and do linear regression there: (a) explicit features, (b) implicit features (kernels) 38

  39. Explicit Features Linear regression in the feature space 39

  40. Explicit Features The predictive distribution after feature map: 40

  41. Explicit Features Shorthands: The predictive distribution after feature map: 41

  42. Explicit Features The predictive distribution after feature map: (*) A problem with (*) is that it needs an NxN matrix inversion... Theorem: (*) can be rewritten: 42
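
Following Rasmussen and Williams (Chapter 2; the slide's notation may differ slightly), with $\Phi$ the matrix of training features, $\phi_* = \phi(x_*)$, and $K = \Phi^{\top}\Sigma_p\Phi$, the rewritten predictive distribution is

```latex
f_* \mid x_*, X, y \;\sim\; \mathcal{N}\Big(
\phi_*^{\top}\Sigma_p\Phi\,\big(K + \sigma_n^2 I\big)^{-1} y,\;\;
\phi_*^{\top}\Sigma_p\phi_*
- \phi_*^{\top}\Sigma_p\Phi\,\big(K + \sigma_n^2 I\big)^{-1}\Phi^{\top}\Sigma_p\phi_*
\Big),
```

which replaces the inversion of the $N \times N$ matrix ($N$ = feature-space dimension) with the inversion of the $n \times n$ matrix $K + \sigma_n^2 I$ ($n$ = number of training points).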

  43. Proofs • Mean expression. We need: Lemma: • Variance expression. We need: Matrix inversion Lemma: 43
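
The matrix inversion lemma (Sherman-Morrison-Woodbury) in the form usually quoted for this derivation:

```latex
\big(Z + UWV^{\top}\big)^{-1}
= Z^{-1} - Z^{-1}U\big(W^{-1} + V^{\top}Z^{-1}U\big)^{-1}V^{\top}Z^{-1}.
```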

  44. From Explicit to Implicit Features Reminder : This was the original formulation: 44

  45. From Explicit to Implicit Features The feature space always enters in the form of: No need to know the explicit N dimensional features. Their inner product is enough. Lemma: 45
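
In the Rasmussen and Williams treatment, the quantity in question is the prior-weighted inner product of feature vectors, which is what gets defined as the kernel:

```latex
k(x, x') = \phi(x)^{\top}\,\Sigma_p\,\phi(x').
```

Defining $k$ directly, the N-dimensional feature map never has to be computed explicitly (the kernel trick).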

  46. Results 46

  47. Results using Netlab, Sin function 47

  48. Results using Netlab, Sin function Increased # of training points 48

  49. Results using Netlab, Sin function Increased noise 49

  50. Results using Netlab, Sinc function 50

  51. Thanks for the Attention!  51

  52. Extra Material 52

  53. Contents: • Introduction • Ridge Regression • Gaussian Processes: weight space view (Bayesian ridge regression + kernel trick), function space view (prior distribution over functions + calculation of posterior distributions) 53

  54. Function Space View  An alternative way to get the previous results  Inference directly in function space Definition: (Gaussian Processes) GP is a collection of random variables, s.t. any finite number of them have a joint Gaussian distribution 54

  55. Function Space View Notations: 55

  56. Function Space View Gaussian Processes: 56

  57. Function Space View  Bayesian linear regression is an example of a GP 57

  58. Function Space View Special case 58

  59. Function Space View 59 Picture is taken from Rasmussen and Williams

  60. Function Space View Observation Explanation 60

  61. Prediction with noise free observations 61

  62. Prediction with noise free observations Goal: 62

  63. Prediction with noise free observations Lemma: Proofs: a bit of calculation using the joint (n+m) dim density Remarks: 63
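
The lemma in its standard form, obtained by conditioning the joint $(n+m)$-dimensional Gaussian of training outputs $\mathbf{f}$ (at inputs $X$) and test outputs $\mathbf{f}_*$ (at inputs $X_*$):

```latex
\mathbf{f}_* \mid X_*, X, \mathbf{f} \;\sim\; \mathcal{N}\Big(
K(X_*, X)\,K(X, X)^{-1}\mathbf{f},\;\;
K(X_*, X_*) - K(X_*, X)\,K(X, X)^{-1}K(X, X_*)
\Big).
```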

  64. Prediction with noise free observations 64 Picture is taken from Rasmussen and Williams

  65. Prediction using noisy observations The joint distribution: 65

  66. Prediction using noisy observations The posterior for the noisy observations: where In the weight space view we had: 66
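
The standard form of this posterior (consistent with the weight-space result, writing $K = K(X, X)$ and noise variance $\sigma_n^2$):

```latex
\mathbf{f}_* \mid X, y, X_* \;\sim\; \mathcal{N}\big(\bar{\mathbf{f}}_*,\, \operatorname{cov}(\mathbf{f}_*)\big),
\qquad
\bar{\mathbf{f}}_* = K(X_*, X)\big[K + \sigma_n^2 I\big]^{-1} y,
\qquad
\operatorname{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\big[K + \sigma_n^2 I\big]^{-1}K(X, X_*).
```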

  67. Prediction using noisy observations Short notations: 67

  68. Prediction using noisy observations Two ways to look at it: • Linear predictor • Manifestation of the Representer Theorem 68

  69. Prediction using noisy observations Remarks: 69

  70. GP pseudo code Inputs: 70

  71. GP pseudo code (continued) Outputs: 71
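
A sketch of the standard GP regression pseudo-code (Cholesky-based, as in Rasmussen and Williams, Algorithm 2.1) in NumPy. The kernel, hyperparameters, and variable names below are my own choices for illustration, not necessarily those used on the slides:

```python
import numpy as np

def gp_regression(X, y, k, sigma_n, X_star):
    """GP regression via the Cholesky factorization.

    X: (n, d) training inputs      y: (n,) training targets
    k: kernel, k(A, B) -> Gram matrix of shape (len(A), len(B))
    sigma_n: observation noise standard deviation
    X_star: (m, d) test inputs

    Returns predictive means, predictive variances, and the log marginal likelihood.
    """
    n = len(X)
    L = np.linalg.cholesky(k(X, X) + sigma_n**2 * np.eye(n))   # K + sigma_n^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))        # (K + sigma_n^2 I)^{-1} y

    K_star = k(X, X_star)                                      # (n, m) cross-covariances
    mean = K_star.T @ alpha                                    # predictive mean
    v = np.linalg.solve(L, K_star)
    var = np.diag(k(X_star, X_star)) - np.sum(v**2, axis=0)    # predictive variances

    log_marginal = (-0.5 * y @ alpha
                    - np.sum(np.log(np.diag(L)))
                    - 0.5 * n * np.log(2.0 * np.pi))
    return mean, var, log_marginal

# Example usage with a squared-exponential kernel (hyperparameters chosen arbitrarily).
def sqexp(A, B, ell=1.0):
    d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
    return np.exp(-0.5 * d2 / ell**2)

X = np.array([[-4.0], [-1.0], [2.0]])
y = np.array([0.5, -1.0, 1.5])
X_star = np.linspace(-5.0, 5.0, 100)[:, None]
mean, var, lml = gp_regression(X, y, sqexp, sigma_n=0.1, X_star=X_star)
```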

  72. Thanks for the Attention!  72
