
Sparse Gaussian Process Approximations
Dr. Richard E. Turner (ret26@cam.ac.uk)
Computational and Biological Learning Lab, Department of Engineering, University of Cambridge

Motivating application 1: Audio modelling


  1-4. A Brief History of Gaussian Process Approximations
  Sparse GP methods split along two axes: those that approximate the generative model and then employ exact inference (FITC, PITC, DTC), and those that keep the exact generative model and employ approximate inference (VFE, EP, PP); all of them introduce pseudo-data.
  Unifying treatments: Quinonero-Candela & Rasmussen, 2005, "A Unifying View of Sparse Approximate Gaussian Process Regression" (FITC, PITC, DTC); Bui, Yan and Turner, 2016, "A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation" (VFE, EP, FITC, PITC, ...).
  FITC: Snelson et al., "Sparse Gaussian Processes using Pseudo-inputs". PITC: Snelson et al., "Local and global sparse Gaussian process approximations". EP: Csato and Opper 2002 / Qi et al., "Sparse-posterior Gaussian Processes for general likelihoods". VFE: Titsias, "Variational Learning of Inducing Variables in Sparse Gaussian Processes". DTC / PP: Seeger et al., "Fast Forward Selection to Speed Up Sparse Gaussian Process Regression".

  5-7. Factor Graphs: introduction / reminder
  Factor graph examples: what is the minimal factor graph for this multivariate Gaussian? (A 4-dimensional example and its solution.)
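
The slides' specific 4-dimensional Gaussian is not reproduced here, but the rule behind the solution can be sketched: for a zero-mean Gaussian the minimal factor graph is read off the sparsity of the precision matrix, with a pairwise factor linking x_i and x_j exactly when the precision entry Λ_ij is non-zero. For instance, a chain-structured (tridiagonal-precision) 4-dimensional Gaussian factorises as
\[
p(\mathbf{x}) \propto \exp\!\Big(-\tfrac{1}{2}\,\mathbf{x}^\top \Lambda\, \mathbf{x}\Big)
= \prod_{i=1}^{4}\phi_i(x_i)\;\phi_{12}(x_1,x_2)\,\phi_{23}(x_2,x_3)\,\phi_{34}(x_3,x_4),
\qquad \Lambda_{ij}=0 \ \text{for}\ |i-j|>1 .
\]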

  8-13. Fully independent training conditional (FITC) approximation
  Strategy: construct a new generative model (with pseudo-data) in which exact learning and inference are cheaper to perform, calibrated to the original model; this is an indirect posterior approximation.
  1. augment the model with M < T pseudo-data
  2. remove some of the dependencies between function values, which results in a simpler model whose training conditional factorises across data points
  3. calibrate the new model (e.g. using a KL divergence; many choices are possible) so that its factors are equal to the exact conditionals (sketched below)
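
In standard notation (a sketch; the symbols u for the pseudo-data function values and z for their input locations are assumptions, since the slides' equations are not captured), the resulting FITC generative model is
\[
p(\mathbf{u}) = \mathcal{N}(\mathbf{u};\, \mathbf{0},\, K_{uu}), \qquad
p(\mathbf{f} \mid \mathbf{u}) = \prod_{t=1}^{T} \mathcal{N}\!\big(f_t;\; K_{f_t u} K_{uu}^{-1} \mathbf{u},\; K_{f_t f_t} - K_{f_t u} K_{uu}^{-1} K_{u f_t}\big), \qquad
p(y_t \mid f_t) = \mathcal{N}(y_t;\, f_t,\, \sigma_y^2),
\]
i.e. each training-point conditional is kept equal to the exact conditional, but the dependencies between the f_t given u are dropped.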

  14-25. Fully independent training conditional (FITC) approximation
  The new generative model (with pseudo-data) is cheaper to perform exact learning and inference in, and is calibrated to the original: an indirect posterior approximation.
  How do we make predictions? By exact inference in the new model.
  The cost of computing the likelihood is O(TM^2), rather than the O(T^3) of exact GP regression (see the numerical sketch below).
  The original variances appear along the diagonal of the likelihood covariance: this stops the variances collapsing.
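
To make the cost concrete, here is a minimal numpy sketch of the FITC log marginal likelihood, log N(y; 0, Qff + diag(Kff - Qff) + noise*I), evaluated in O(TM^2) via the matrix-inversion and determinant lemmas. The kernel choice, the helper names (rbf, fitc_log_marginal) and the toy data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def rbf(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between two sets of 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def fitc_log_marginal(y, x, z, lengthscale=1.0, variance=1.0, noise=0.1):
    """FITC log marginal likelihood with pseudo-inputs z, computed in O(T M^2)."""
    T, M = len(x), len(z)
    Kuu = rbf(z, z, lengthscale, variance) + 1e-8 * np.eye(M)  # jitter for stability
    Kuf = rbf(z, x, lengthscale, variance)
    L = np.linalg.cholesky(Kuu)
    A = np.linalg.solve(L, Kuf)                    # A^T A = Qff; M x T, costs O(T M^2)
    d = variance - np.sum(A ** 2, axis=0) + noise  # diag(Kff - Qff) + noise
    B = np.eye(M) + (A / d) @ A.T                  # M x M matrix from the Woodbury identity
    LB = np.linalg.cholesky(B)
    c = np.linalg.solve(LB, (A / d) @ y)
    logdet = 2 * np.sum(np.log(np.diag(LB))) + np.sum(np.log(d))  # log|Qff + D|
    quad = y @ (y / d) - c @ c                     # y^T (Qff + D)^{-1} y
    return -0.5 * (logdet + quad + T * np.log(2 * np.pi))

# Toy usage: T = 500 noisy sine observations, M = 20 pseudo-inputs.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 500)
y = np.sin(x) + 0.1 * rng.standard_normal(500)
z = np.linspace(0.0, 10.0, 20)
print(fitc_log_marginal(y, x, z, lengthscale=1.0, variance=1.0, noise=0.01))
```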

  26-27. FITC: Demo (Snelson)

  28. Fully independent training conditional (FITC) approximation
  FITC is parametric (although cleverly so): if I see more data, should I add extra pseudo-data?
  ◮ unnatural from a generative modelling perspective
  ◮ natural from a prediction perspective (the posterior gets more complex)
  ⇒ we have lost the elegant separation of model, inference and approximation; FITC is an example of a prior approximation.
  Extensions: inter-domain GPs (pseudo-data in a different space); partially independent training conditional and tree-structured approximations.

  29-35. Variational free-energy method (VFE)
  Lower-bound the likelihood; the slack in the bound is a KL divergence between stochastic processes (the approximate and true posteriors over functions). Assume an approximate posterior that factorises with a special form, in which the prior conditional over the remaining function values is kept exact.

  36-38. Variational free-energy method (VFE)
  The approximate posterior has the same form as the prediction from GP-regression. Its variational parameters are the input locations of the 'pseudo' data and the outputs and covariance of the 'pseudo' data; the variational free-energy is optimised with respect to these parameters.
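
In the notation of Titsias's formulation (a sketch; the symbols m, S and z are assumptions, as the slides' equations are not captured), the approximate posterior and its marginals are
\[
q(f) = p(f_{\neq u} \mid \mathbf{u})\, q(\mathbf{u}), \qquad q(\mathbf{u}) = \mathcal{N}(\mathbf{u};\, \mathbf{m},\, S),
\]
\[
q(f_*) = \mathcal{N}\!\big(f_*;\;\; K_{*u} K_{uu}^{-1} \mathbf{m},\;\; K_{**} - K_{*u} K_{uu}^{-1}\,(K_{uu} - S)\, K_{uu}^{-1} K_{u*}\big),
\]
which is exactly the form of a GP-regression predictive; the variational parameters are the pseudo-input locations z (entering through the kernel matrices) together with m and S.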

  39-42. Variational free-energy method (VFE)
  Plug the assumed approximate posterior (the special factorised form above, whose marginals are GP-regression predictives) into the free-energy.
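
A sketch in standard notation (not taken verbatim from the slides): substituting q(f) = p(f_{≠u} | u) q(u) and the Gaussian observation model into the free-energy gives
\[
\mathcal{F}(q) \;=\; \sum_{t=1}^{T} \mathbb{E}_{q(f_t)}\!\big[\log \mathcal{N}(y_t;\, f_t,\, \sigma_y^2)\big] \;-\; \mathrm{KL}\!\big(q(\mathbf{u})\,\big\|\,p(\mathbf{u})\big) \;\le\; \log p(\mathbf{y}).
\]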

  43-47. Variational free-energy method (VFE)
  In this bound the expected log-likelihood term is an average of a quadratic form, and the regularisation term is a KL between two multivariate Gaussians. Make the bound as tight as possible: the result is a DTC-like term plus an uncertainty-based correction.
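
Explicitly (the standard collapsed form of the bound, stated here as a sketch since the slides' expression is not captured), with Q_ff = K_fu K_uu^{-1} K_uf:
\[
\mathcal{F} \;=\; \underbrace{\log \mathcal{N}\!\big(\mathbf{y};\, \mathbf{0},\, Q_{ff} + \sigma_y^2 I\big)}_{\text{DTC-like term}}
\;-\; \underbrace{\frac{1}{2\sigma_y^2}\,\mathrm{tr}\!\big(K_{ff} - Q_{ff}\big)}_{\text{uncertainty-based correction}} .
\]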

  48. Summary of VFE method
  Optimisation of pseudo-point inputs: VFE has better guarantees than FITC.
  Variational methods are known to underfit (and have other biases).
  No augmentation is required: the target is the posterior over functions, which includes the inducing variables.
  ◮ pseudo-input locations are pure variational parameters (they do not parameterise the generative model as they do in FITC)
  ◮ coherent way of adding pseudo-data: more complex posteriors require more computational resources (more pseudo-points)
  Rule of thumb: VFE returns better mean estimates; FITC returns better error-bar estimates.
  How should we select M = the number of pseudo-points?

  49-65. How do we select M = the number of pseudo-data?
  [Figure: a 1-D regression dataset (y against x, x from 0 to 2000); the exact GP fit and the VFE fit are overlaid as pseudo-data (marked at their input locations) are added, alongside a plot of SMSE against compute time in seconds on logarithmic axes.]

  66. Power Expectation Propagation and Gaussian Processes

  67. A Brief History of Gaussian Process Approximations (recap of the taxonomy and references in item 1-4 above)

  68-75. EP pseudo-point approximation
  The true posterior is the prior multiplied by the likelihood of each observation, normalised by the marginal likelihood. The approximate posterior replaces each true likelihood with a pseudo-observation likelihood attached to the pseudo-data, so that it is proportional to the exact joint of a new GP regression model whose parameters are the input locations of the 'pseudo' data and the outputs and covariance of the 'pseudo' data.
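
As a sketch in the notation of Bui, Yan and Turner (2016) (the site factors written here are assumptions, not shown in the extracted slide text):
\[
p(f \mid \mathbf{y}) \;\propto\; p(f)\prod_{t=1}^{T} p(y_t \mid f_t)
\qquad\text{is approximated by}\qquad
q(f) \;\propto\; p(f)\prod_{t=1}^{T} \tilde t_t(\mathbf{u}),
\]
where each site \(\tilde t_t(\mathbf{u})\) is an unnormalised Gaussian in the pseudo-data values, so that \(p(f)\prod_t \tilde t_t(\mathbf{u})\) is the (unnormalised) exact joint of a new GP regression model with M pseudo-observations.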

  76-81. EP algorithm
  1. remove one pseudo-observation likelihood ("take out one"), giving the cavity distribution
  2. include the corresponding true observation likelihood ("add in one"), giving the tilted distribution
  3. project the tilted distribution onto the approximating family by minimising a KL divergence between unnormalised stochastic processes
     ◮ at the minimum, moments are matched at the pseudo-inputs
     ◮ for Gaussian regression, this matches moments everywhere
  4. update the pseudo-observation likelihood
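
In symbols (a sketch; the cavity and site notation is an assumption consistent with the four steps above):
\[
\text{cavity: } q^{\setminus t}(f) \propto \frac{q(f)}{\tilde t_t(\mathbf{u})}, \qquad
\text{tilted: } \tilde p_t(f) \propto q^{\setminus t}(f)\, p(y_t \mid f_t),
\]
\[
\text{project: } q^{\mathrm{new}}(f) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\!\big(\tilde p_t(f)\,\big\|\,q(f)\big), \qquad
\text{update: } \tilde t_t(\mathbf{u}) \propto \frac{q^{\mathrm{new}}(f)}{q^{\setminus t}(f)} .
\]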
