Modern Gaussian Processes: Scalable Inference and Novel Applications (Part II-b): Approximate Inference
Edwin V. Bonilla and Maurizio Filippone
CSIRO’s Data61, Sydney, Australia and EURECOM, Sophia Antipolis, France
July 14th, 2019
Challenges in Bayesian Reasoning with Gaussian Process Priors
p(f): prior over geology and rock properties
p(y | f): observation model’s likelihood
p(f | y): posterior geological model:
  p(f | y, θ) = p(f | θ) p(y | f) / ∫ p(f | θ) p(y | f) df,   where the normalising integral is the hard bit
Challenges:
◮ Non-linear likelihood models
◮ Large datasets
(Illustration: a $20 Million geothermal well; geological surveys and explorations)
Automated Probabilistic Reasoning
• Approximate inference: deterministic and stochastic methods
• Goal: build generic yet practical VI inference tools for practitioners and researchers
• Other dimensions:
◮ Accuracy
◮ Convergence
(Figure: approximate-inference methods, e.g. VI and MCMC, placed along axes of computational efficiency and automation)
Outline
1 Latent Gaussian Process Models (LGPMs)
2 Variational Inference
3 Scalability through Inducing Variables and Stochastic Variational Inference (SVI)
Latent Gaussian Process Models (LGPMs)
Latent Gaussian Process Models (LGPMs)
Supervised learning: D = {x_n, y_n}_{n=1}^N
• Factorised GP priors over Q latent functions:
  f_j(x) ∼ GP(0, κ_j(x, x′; θ)),   p(F | X, θ) = ∏_{j=1}^Q N(F_{·j}; 0, K_j)
• Factorised likelihood over observations:
  p(Y | X, F, φ) = ∏_{n=1}^N p(Y_{n·} | F_{n·}, φ)
What can we model within this framework?
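Not part of the original slides: a minimal NumPy sketch of this factorised prior, assuming a shared squared-exponential kernel for all κ_j; the function names and hyperparameter values are illustrative only.

```python
import numpy as np

def rbf_kernel(X, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel; one possible choice for each kappa_j."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def sample_latent_functions(X, Q, jitter=1e-6, seed=0):
    """Draw F in R^{N x Q} from p(F | X, theta) = prod_j N(F_:j; 0, K_j)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    F = np.zeros((N, Q))
    for j in range(Q):
        K_j = rbf_kernel(X, X) + jitter * np.eye(N)   # here all K_j share hyperparameters
        L = np.linalg.cholesky(K_j)
        F[:, j] = L @ rng.standard_normal(N)          # F_:j ~ N(0, K_j)
    return F

X = np.linspace(-3, 3, 50)[:, None]
F = sample_latent_functions(X, Q=3)
```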
Examples of LGPMs (1)
• Multi-output regression
• Multi-class classification
◮ P = Q classes
◮ softmax likelihood
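As a concrete illustration of the softmax likelihood mentioned above (not from the slides), one observation's log-likelihood can be evaluated as follows; the helper name is hypothetical.

```python
import numpy as np

def softmax_log_likelihood(F_n, y_n):
    """log p(y_n | F_n.) for multi-class classification with P = Q classes.
    F_n: latent values f_1(x_n), ..., f_Q(x_n); y_n: integer class label."""
    logits = F_n - np.max(F_n)                       # stabilise the exponentials
    return logits[y_n] - np.log(np.sum(np.exp(logits)))
```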
Examples of LGPMs (2)
• Inversion problems
Examples of LGPMs (3)
• Log Gaussian Cox processes (LGCPs)
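A minimal sketch of an LGCP likelihood under a common discretisation, where counts in each bin are Poisson with rate exp(f) times the bin area; this discretised form is an assumption for illustration, not something stated on the slide.

```python
import numpy as np
from scipy.special import gammaln

def lgcp_log_likelihood(f, counts, bin_area):
    """log p(y | f) for a discretised log Gaussian Cox process:
    y_i ~ Poisson(exp(f_i) * bin_area)."""
    rate = np.exp(f) * bin_area
    return np.sum(counts * np.log(rate) - rate - gammaln(counts + 1.0))
```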
Inference in LGPMs
We only require access to ‘black-box’ likelihoods.
How can we carry out inference in these general models?
Variational Inference
Variational Inference (VI): Optimise Rather than Integrate
Recall our posterior estimation problem:
  p(F | Y) = (1 / p(Y)) · p(Y | F) · p(F)
  (posterior = conditional likelihood × prior / marginal likelihood)
• Estimating p(Y) = ∫ p(F) p(Y | F) dF is hard
• Instead, approximate p(F | Y) with q(F | λ) chosen to minimise
  KL[q(F | λ) ‖ p(F | Y)] := E_{q(F | λ)}[ log( q(F | λ) / p(F | Y) ) ]
Properties: KL[q ‖ p] ≥ 0, and KL[q ‖ p] = 0 iff q = p.
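A small numerical illustration (not in the slides): estimating KL[q ‖ p] by Monte Carlo for univariate Gaussians and checking it against the closed form, which also demonstrates the non-negativity property. All values are illustrative.

```python
import numpy as np

def kl_mc(mu_q, var_q, mu_p, var_p, n_samples=100_000, seed=0):
    """Monte Carlo estimate of KL[q || p] = E_q[log q(f) - log p(f)] for univariate Gaussians."""
    rng = np.random.default_rng(seed)
    f = rng.normal(mu_q, np.sqrt(var_q), n_samples)
    log_q = -0.5 * (np.log(2 * np.pi * var_q) + (f - mu_q) ** 2 / var_q)
    log_p = -0.5 * (np.log(2 * np.pi * var_p) + (f - mu_p) ** 2 / var_p)
    return np.mean(log_q - log_p)

def kl_closed_form(mu_q, var_q, mu_p, var_p):
    """Closed-form KL between univariate Gaussians, for checking the estimate."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

print(kl_mc(0.0, 1.0, 1.0, 2.0), kl_closed_form(0.0, 1.0, 1.0, 2.0))
```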
Decomposition of the Marginal Likelihood
  log p(Y) = KL[q(F | λ) ‖ p(F | Y)] + L_ELBO(λ)
(Figure: log p(Y) decomposed into KL[q ‖ p] and L_ELBO(λ); reproduced from Bishop (2006))
• L_ELBO(λ) is a lower bound on the log marginal likelihood
• The optimum is achieved when q = p
• Maximising L_ELBO(λ) ≡ minimising KL[q(F | λ) ‖ p(F | Y)]
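A minimal sketch (not in the slides) checking this decomposition on a toy conjugate Gaussian model, f ∼ N(0, 1) and y | f ∼ N(f, s²), where every term is available in closed form; the model and all numbers are illustrative assumptions.

```python
import numpy as np

def check_decomposition(y=0.7, s2=0.5, mu_q=0.3, var_q=0.8):
    """Return (log p(y), KL[q || posterior] + ELBO); the two values should coincide."""
    var_post = 1.0 / (1.0 + 1.0 / s2)                 # exact Gaussian posterior
    mu_post = var_post * y / s2
    log_py = -0.5 * (np.log(2 * np.pi * (1.0 + s2)) + y**2 / (1.0 + s2))

    def kl(mq, vq, mp, vp):
        return 0.5 * (np.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)

    # ELBO = E_q[log p(y | f)] - KL[q || prior]; the expectation is closed form here
    ell = -0.5 * (np.log(2 * np.pi * s2) + (var_q + (mu_q - y) ** 2) / s2)
    elbo = ell - kl(mu_q, var_q, 0.0, 1.0)
    return log_py, elbo + kl(mu_q, var_q, mu_post, var_post)

print(check_decomposition())
```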
Variational Inference Strategy
• The evidence lower bound L_ELBO(λ) can be written as:
  L_ELBO(λ) := E_{q(F | λ)}[ log p(Y | F) ] − KL[ q(F | λ) ‖ p(F) ]
  (the first term is the expected log likelihood (ELL); the second is KL(approx. posterior ‖ prior))
• ELL is a model-fit term and KL is a penalty term
• What family of distributions?
◮ As flexible as possible
◮ Tractability is the main constraint
◮ No risk of over-fitting
(Figure from Bishop (2006))
We want to maximise L_ELBO(λ) wrt the variational parameters λ
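When q(F | λ) is Gaussian, the penalty term against a zero-mean GP prior has a closed form; a minimal sketch for a single latent function (not from the slides, function name illustrative):

```python
import numpy as np

def kl_gaussian(m, S, K):
    """Closed-form KL[N(m, S) || N(0, K)]: the KL penalty for one Gaussian factor of q."""
    N = m.shape[0]
    Kinv_S = np.linalg.solve(K, S)
    maha = m @ np.linalg.solve(K, m)
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_S = np.linalg.slogdet(S)
    return 0.5 * (np.trace(Kinv_S) + maha - N + logdet_K - logdet_S)
```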
Automated VI for LGPMs (Nguyen and Bonilla, NeurIPS, 2014)
Goal: Approximate the intractable posterior p(F | Y) with the variational distribution
  q(F | λ) = ∑_{k=1}^K π_k q_k(F | λ_k) = ∑_{k=1}^K π_k ∏_{j=1}^Q N(F_{·j}; m_kj, S_kj)
with variational parameters λ = {m_kj, S_kj}.
Recall L_ELBO(λ) = −KL + ELL:
• The KL term can be bounded using Jensen’s inequality
◮ Exact gradients wrt the variational parameters
• The ELL and its gradients can be estimated efficiently
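A minimal sketch (not from the slides) of drawing one sample from this mixture-of-Gaussians variational posterior, assuming Cholesky factors of the S_kj are available; the names and data layout are illustrative.

```python
import numpy as np

def sample_q(pi, means, chols, rng=None):
    """Draw one F in R^{N x Q} from q(F | lambda) = sum_k pi_k prod_j N(F_:j; m_kj, S_kj).
    means[k][j]: m_kj (length N); chols[k][j]: Cholesky factor of S_kj."""
    rng = rng or np.random.default_rng()
    k = rng.choice(len(pi), p=pi)                  # pick a mixture component
    cols = [means[k][j] + chols[k][j] @ rng.standard_normal(len(means[k][j]))
            for j in range(len(means[k]))]
    return np.stack(cols, axis=1)                  # columns are the Q latent functions
```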
Expected Log Likelihood Term
Th. 1: Efficient estimation
The ELL and its gradients can be estimated using expectations over univariate Gaussian distributions. With q_{k(n)} := q_{k(n)}(F_{n·} | λ_{k(n)}):
  E_{q_k} log p(Y | F) = ∑_{n=1}^N E_{q_{k(n)}} log p(Y_{n·} | F_{n·})
  ∇_{λ_{k(n)}} E_{q_{k(n)}} log p(Y_{n·} | F_{n·}) = E_{q_{k(n)}}[ ∇_{λ_{k(n)}} log q_{k(n)}(F_{n·} | λ_{k(n)}) · log p(Y_{n·} | F_{n·}) ]
Practical consequences (sketched below):
• Can use unbiased Monte Carlo estimates
• Gradients of the likelihood are not required (only likelihood evaluations)
• Holds for all Q ≥ 1
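A minimal sketch of the score-function (REINFORCE-style) Monte Carlo estimator implied by Th. 1 for a single univariate Gaussian factor q_{k(n)}: only log-likelihood evaluations are needed, never its gradients. This is an illustrative reconstruction, not the authors' code.

```python
import numpy as np

def ell_and_grad(log_lik, mean, var, n_samples=1000, seed=0):
    """Monte Carlo estimates of E_q[log p(y_n | f_n)] and its gradients wrt (mean, var)
    for q(f_n) = N(mean, var), via grad = E_q[grad_lambda log q(f_n) * log p(y_n | f_n)]."""
    rng = np.random.default_rng(seed)
    f = rng.normal(mean, np.sqrt(var), n_samples)
    ll = np.array([log_lik(fi) for fi in f])                  # black-box likelihood values
    score_mean = (f - mean) / var                             # d log q / d mean
    score_var = 0.5 * ((f - mean) ** 2 / var**2 - 1.0 / var)  # d log q / d var
    return ll.mean(), (score_mean * ll).mean(), (score_var * ll).mean()

# Example with a 'black-box' Bernoulli-logit likelihood and observation y = 1
ell, g_mean, g_var = ell_and_grad(lambda f: -np.log1p(np.exp(-f)), 0.0, 1.0)
```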
Scalability through Inducing Variables and Stochastic Variational Inference (SVI)