Advances in using GPs with derivative observations
Gaussian Process approximations 2017 workshop, by Eero Siivola¹, joint work with Aki Vehtari¹, Juho Piironen¹, Javier González², Jarno Vanhatalo³ and Olli-Pekka Koistinen¹
¹ Aalto University, Finland · ² Amazon, Cambridge, UK · ³ University of Helsinki, Finland
Contents of this talk
- Theory behind GPs + derivatives
- GP-NEB
- Automatic monotonicity detection with GPs
- Bayesian optimization with derivative sign information
Theory: GP + derivative observations
How to use (partial) derivatives with GPs? We need to consider the following:
- Covariance function
- Likelihood function
- Posterior -> inference method
Covariance function
Nice property (see e.g. Papoulis [1991, ch. 10]):
$$\operatorname{cov}\left(\frac{\partial f^{(1)}}{\partial x^{(1)}_g},\, f^{(2)}\right) = \frac{\partial}{\partial x^{(1)}_g}\operatorname{cov}\left(f^{(1)}, f^{(2)}\right) = \frac{\partial k\left(x^{(1)}, x^{(2)}\right)}{\partial x^{(1)}_g}$$
and:
$$\operatorname{cov}\left(\frac{\partial f^{(1)}}{\partial x^{(1)}_g},\, \frac{\partial f^{(2)}}{\partial x^{(2)}_h}\right) = \frac{\partial^2}{\partial x^{(1)}_g \partial x^{(2)}_h}\operatorname{cov}\left(f^{(1)}, f^{(2)}\right) = \frac{\partial^2 k\left(x^{(1)}, x^{(2)}\right)}{\partial x^{(1)}_g \partial x^{(2)}_h}$$
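To make the first identity concrete, here is a minimal numpy sketch (my own illustration, not code from the talk) that checks the analytic partial derivative of a squared-exponential kernel against a finite-difference approximation:

```python
import numpy as np

def rbf(x1, x2, ell=1.0, sigma2=1.0):
    """Squared-exponential kernel k(x1, x2)."""
    d = x1 - x2
    return sigma2 * np.exp(-0.5 * np.dot(d, d) / ell**2)

def drbf_dx1(x1, x2, g, ell=1.0, sigma2=1.0):
    """Analytic partial derivative dk/dx1_g of the squared-exponential kernel."""
    return -(x1[g] - x2[g]) / ell**2 * rbf(x1, x2, ell, sigma2)

x1 = np.array([0.3, -0.7])
x2 = np.array([1.1, 0.4])
g, eps = 0, 1e-6

# Central finite difference in coordinate g of the first argument
e_g = np.eye(2)[g]
fd = (rbf(x1 + eps * e_g, x2) - rbf(x1 - eps * e_g, x2)) / (2 * eps)
print(drbf_dx1(x1, x2, g), fd)  # the two numbers should agree closely
```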
Let $X = \left[x^{(1)}, \ldots, x^{(n)}\right]^T$ and $\tilde{X} = \left[\tilde{x}^{(1)}, \ldots, \tilde{x}^{(m)}\right]^T$ be the points where we observe function values and partial derivative values, respectively. The covariance between the latent function values $\mathbf{f}_X = \left[f^{(1)}, \ldots, f^{(n)}\right]^T$ and the latent function derivative values $\tilde{\mathbf{f}}'_{\tilde{X}} = \left[\frac{\partial \tilde{f}^{(1)}}{\partial \tilde{x}^{(1)}_g}, \ldots, \frac{\partial \tilde{f}^{(m)}}{\partial \tilde{x}^{(m)}_g}\right]^T$ is:
$$K_{X,\tilde{X}} = K_{\tilde{X},X}^T = \begin{bmatrix} \operatorname{cov}\left(f^{(1)}, \frac{\partial \tilde{f}^{(1)}}{\partial \tilde{x}^{(1)}_g}\right) & \cdots & \operatorname{cov}\left(f^{(1)}, \frac{\partial \tilde{f}^{(m)}}{\partial \tilde{x}^{(m)}_g}\right) \\ \vdots & \ddots & \vdots \\ \operatorname{cov}\left(f^{(n)}, \frac{\partial \tilde{f}^{(1)}}{\partial \tilde{x}^{(1)}_g}\right) & \cdots & \operatorname{cov}\left(f^{(n)}, \frac{\partial \tilde{f}^{(m)}}{\partial \tilde{x}^{(m)}_g}\right) \end{bmatrix}$$
And between the latent function derivative values $\tilde{\mathbf{f}}'_{\tilde{X}}$ and $\tilde{\mathbf{f}}'_{\tilde{X}}$:
$$K_{\tilde{X},\tilde{X}} = \begin{bmatrix} \frac{\partial^2}{\partial \tilde{x}^{(1)}_g \partial \tilde{x}^{(1)}_g}\operatorname{cov}\left(\tilde{f}^{(1)}, \tilde{f}^{(1)}\right) & \cdots & \frac{\partial^2}{\partial \tilde{x}^{(1)}_g \partial \tilde{x}^{(m)}_g}\operatorname{cov}\left(\tilde{f}^{(1)}, \tilde{f}^{(m)}\right) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial \tilde{x}^{(m)}_g \partial \tilde{x}^{(1)}_g}\operatorname{cov}\left(\tilde{f}^{(m)}, \tilde{f}^{(1)}\right) & \cdots & \frac{\partial^2}{\partial \tilde{x}^{(m)}_g \partial \tilde{x}^{(m)}_g}\operatorname{cov}\left(\tilde{f}^{(m)}, \tilde{f}^{(m)}\right) \end{bmatrix}$$
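The block structure is straightforward to assemble in practice. A minimal 1D sketch with the squared-exponential kernel (my own naming and toy points, not from the slides) that builds the joint covariance of [f_X, f'_X̃] and checks that it is positive semi-definite:

```python
import numpy as np

def k(x1, x2, ell=1.0, s2=1.0):
    return s2 * np.exp(-0.5 * (x1 - x2)**2 / ell**2)

def k_dx2(x1, x2, ell=1.0, s2=1.0):          # dk/dx2
    return (x1 - x2) / ell**2 * k(x1, x2, ell, s2)

def k_dx1_dx2(x1, x2, ell=1.0, s2=1.0):      # d^2 k / (dx1 dx2)
    return (1.0 / ell**2 - (x1 - x2)**2 / ell**4) * k(x1, x2, ell, s2)

X  = np.array([0.0, 1.0, 2.5])   # points with function-value observations
Xt = np.array([0.5, 2.0])        # points with derivative observations

K_ff = k(X[:, None], X[None, :])             # cov(f, f)
K_fd = k_dx2(X[:, None], Xt[None, :])        # cov(f, df/dx)
K_dd = k_dx1_dx2(Xt[:, None], Xt[None, :])   # cov(df/dx, df/dx)

# Joint covariance of [f_X, f'_Xt]: a valid covariance, so PSD up to round-off
K_joint = np.block([[K_ff, K_fd], [K_fd.T, K_dd]])
print(np.linalg.eigvalsh(K_joint).min())     # smallest eigenvalue ~ >= -1e-10
```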
Likelihood function
Observations are assumed independent given the latent function values:
$$p\left(\mathbf{y}, \tilde{\mathbf{y}}' \mid \mathbf{f}_X, \tilde{\mathbf{f}}'_{\tilde{X}}\right) = \prod_{i=1}^{n} p\left(y^{(i)} \mid f^{(i)}\right) \prod_{i=1}^{m} p\left(\frac{\partial \tilde{y}^{(i)}}{\partial \tilde{x}^{(i)}_g} \,\middle|\, \frac{\partial \tilde{f}^{(i)}}{\partial \tilde{x}^{(i)}_g}\right)$$
How to select the likelihood of the derivatives?
- If direct derivative values can be observed: Gaussian likelihood
- If we only have a hint about the direction: probit likelihood with a tuning parameter ν (Riihimäki and Vehtari, 2010)
$$p\left(\frac{\partial \tilde{y}^{(i)}}{\partial \tilde{x}^{(i)}_g} \,\middle|\, \frac{\partial \tilde{f}^{(i)}}{\partial \tilde{x}^{(i)}_g}\right) = \Phi\left(\frac{\partial \tilde{f}^{(i)}}{\partial \tilde{x}^{(i)}_g} \cdot \frac{1}{\nu}\right), \quad \text{where } \Phi(a) = \int_{-\infty}^{a} N(x \mid 0, 1)\, dx$$
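A small sketch of this probit link (my own illustration): as ν shrinks the likelihood approaches a hard step, i.e. an almost deterministic sign constraint on the latent derivative, while a larger ν softens the constraint.

```python
import numpy as np
from scipy.stats import norm

def probit_sign_likelihood(df, nu=1e-4):
    """P(derivative sign observation is positive | latent derivative value df)."""
    return norm.cdf(df / nu)

df = np.linspace(-3, 3, 7)
print(probit_sign_likelihood(df, nu=1e-4))  # nearly a 0/1 step at df = 0
print(probit_sign_likelihood(df, nu=1.0))   # a smooth, gradual transition
```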
Figure: Probit likelihood with ν = 1×10⁻⁴ (left) and ν = 1 (right); the y-axis is the likelihood value, the x-axis the latent derivative value.
Posterior distribution
Posterior distribution of the joint values:
$$p\left(\mathbf{f}, \tilde{\mathbf{f}}' \mid \mathbf{y}, \tilde{\mathbf{y}}', X, \tilde{X}\right) = \frac{p\left(\mathbf{f}, \tilde{\mathbf{f}}' \mid X, \tilde{X}\right) \prod_{i=1}^{n} p\left(y^{(i)} \mid f^{(i)}\right) \prod_{i=1}^{m} p\left(\frac{\partial \tilde{y}^{(i)}}{\partial \tilde{x}^{(i)}_g} \,\middle|\, \frac{\partial \tilde{f}^{(i)}}{\partial \tilde{x}^{(i)}_g}\right)}{Z}$$
Different parts:
- $p(\mathbf{f}, \tilde{\mathbf{f}}' \mid X, \tilde{X})$ is Gaussian
- $p(y^{(i)} \mid f^{(i)})$ are Gaussian
- $p\left(\frac{\partial \tilde{y}^{(i)}}{\partial \tilde{x}^{(i)}_g} \mid \frac{\partial \tilde{f}^{(i)}}{\partial \tilde{x}^{(i)}_g}\right)$ is Gaussian or probit
The posterior distribution is thus either Gaussian or of the same form as in classification problems:
- We might need posterior approximation methods
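When both likelihoods are Gaussian the posterior is available in closed form. A minimal 1D sketch (my own toy example, not from the slides) that conditions jointly on noisy function values and noisy derivative values and predicts the function at test points:

```python
import numpy as np

ell, s2, noise_f, noise_d = 1.0, 1.0, 1e-2, 1e-2

def k(a, b):       return s2 * np.exp(-0.5 * (a - b)**2 / ell**2)
def k_db(a, b):    return (a - b) / ell**2 * k(a, b)             # dk/db
def k_da_db(a, b): return (1/ell**2 - (a - b)**2 / ell**4) * k(a, b)

X  = np.array([-2.0, 0.0, 2.0])   # function-value observation locations
Xd = np.array([-1.0, 1.0])        # derivative observation locations
y  = np.sin(X)                    # toy observations of f(x) = sin(x)
yd = np.cos(Xd)                   # toy observations of f'(x)

# Joint covariance of the observed vector [y, y'] plus observation noise
K_ff = k(X[:, None], X[None, :])
K_fd = k_db(X[:, None], Xd[None, :])
K_dd = k_da_db(Xd[:, None], Xd[None, :])
K = np.block([[K_ff, K_fd], [K_fd.T, K_dd]])
K += np.diag(np.r_[np.full(len(X), noise_f), np.full(len(Xd), noise_d)])

# Cross-covariance between test function values and all observations
xs = np.linspace(-3, 3, 5)
K_star = np.hstack([k(xs[:, None], X[None, :]), k_db(xs[:, None], Xd[None, :])])

# Posterior mean of f at the test points (standard Gaussian conditioning)
mean = K_star @ np.linalg.solve(K, np.r_[y, yd])
print(np.round(mean, 3), np.round(np.sin(xs), 3))  # should roughly track sin
```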
Saddle point search using GPs + derivative observations
- The properties of the system can be described by an energy surface
- Finding a minimum energy path and the saddle point between two states is useful when determining properties of transitions
Nudged elastic band (NEB)
- Starting from an initial guess, the idea is to move the images downwards on the energy surface but keep them evenly spaced
- The images are moved along a force vector, which is the resultant of two components (see the sketch after this list):
  - the (negative) energy gradient component perpendicular to the path
  - a spring force parallel to the path, which tends to keep the images evenly spaced
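A minimal numpy sketch of that force for one interior image, using the simple central-difference tangent (actual NEB implementations use an improved tangent estimate; the names here are my own):

```python
import numpy as np

def neb_force(r_prev, r_i, r_next, true_force, k_spring=1.0):
    """NEB force on an interior image r_i given its two neighbouring images.

    true_force is the negative energy gradient -dE/dR evaluated at r_i.
    """
    tau = r_next - r_prev
    tau = tau / np.linalg.norm(tau)                  # unit tangent along the path
    # component of the true force perpendicular to the path
    f_perp = true_force - np.dot(true_force, tau) * tau
    # spring force along the tangent, keeping the images evenly spaced
    f_spring = k_spring * (np.linalg.norm(r_next - r_i)
                           - np.linalg.norm(r_i - r_prev)) * tau
    return f_perp + f_spring

# Toy 2D example on E(x, y) = x**2 + 2*y**2, so -dE/dR at (1, 1) is [-2, -4]
r_prev, r_i, r_next = np.array([0., 0.]), np.array([1., 1.]), np.array([2., 0.])
print(neb_force(r_prev, r_i, r_next, true_force=np.array([-2., -4.])))
```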
- The convergence of NEB may require hundreds or thousands of iterations
- Each iteration requires evaluation of the energy gradient for all images, which is often a time-consuming operation
Speedup of NEB
- Repeat until convergence:
  1. Evaluate the energy (and forces) at the images of the current path
  2. If the path has not converged, approximate the energy surface using machine learning based on the observations so far
  3. Find the predicted minimum energy path on the approximate surface and go to 1
- Details are in the paper by Peterson (2016)
Speedup of NEB with GP and derivatives
- Evaluate the energy (and forces) only at the image with the highest uncertainty
- Re-approximate the energy surface and find a new MEP guess after each image evaluation
- Convergence check:
  - If the magnitude of the force (accurate or approximated) is below the convergence limit for all images, we do not move the path but keep evaluating further images, until either the convergence limit is exceeded or all images have been evaluated
  - If we manage to evaluate all images without moving the path, we know for sure that the path has converged
- Details are in the paper by Koistinen, Maras, Vehtari and Jónsson (2016)
- When evaluating the transition rates, the Hessian at the minimum points needs to be evaluated at some phase
- This information can be used to improve the GP approximations, especially in the beginning, when there is little information
Comparison of methods in the heptamer case study
Automatic monotonicity detection
- Derivative sign information can be used to find monotonic input-output directions
- The basic idea:
  - Add derivative sign observations to the GP model
  - Check whether the additions affect the probability of the data
  - If they do not, the dimension is monotonic
- Details are in the paper by Siivola, Piironen and Vehtari (2016)
Theoretical background
Energy comparison:
$$E(\mathbf{y}, \tilde{\mathbf{y}}' \mid X, \tilde{X}_m) = -\log p(\mathbf{y}, \tilde{\mathbf{y}}' \mid X, \tilde{X}_m) = -\log\Big( p(\mathbf{y} \mid X)\, \underbrace{p(\tilde{\mathbf{y}}' \mid \mathbf{y}, X, \tilde{X}_m)}_{\approx\, 1} \Big) \approx E(\mathbf{y} \mid X).$$
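In the paper the virtual derivative sign observations use the probit likelihood and approximate inference. Purely as an illustration (my own toy data and naming), with a Gaussian likelihood on y the factor p(ỹ' | y, X, X̃_m) can be estimated by Monte Carlo from the exact Gaussian conditional of the latent derivatives, so the two energies can be compared directly:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal as mvn

ell, s2, noise, nu = 1.0, 1.0, 1e-2, 1e-4
rng = np.random.default_rng(0)

def k(a, b):       return s2 * np.exp(-0.5 * (a - b)**2 / ell**2)
def k_db(a, b):    return (a - b) / ell**2 * k(a, b)
def k_da_db(a, b): return (1/ell**2 - (a - b)**2 / ell**4) * k(a, b)

X  = np.linspace(0, 3, 8)
y  = np.log1p(X) + 0.05 * rng.standard_normal(len(X))   # monotone toy data
Xd = np.array([0.5, 1.5, 2.5])                           # virtual derivative points

# Energy of the data under the plain GP: E(y | X) = -log p(y | X)
K_ff = k(X[:, None], X[None, :]) + noise * np.eye(len(X))
E_plain = -mvn(np.zeros(len(X)), K_ff).logpdf(y)

# p(f' at Xd | y) is Gaussian; estimate p(all derivative signs positive | y) by MC
K_fd = k_db(X[:, None], Xd[None, :])
K_dd = k_da_db(Xd[:, None], Xd[None, :])
A = np.linalg.solve(K_ff, K_fd).T                        # K_df K_ff^{-1}
mu_d  = A @ y
cov_d = K_dd - A @ K_fd + 1e-10 * np.eye(len(Xd))        # jitter for stability
samples = rng.multivariate_normal(mu_d, cov_d, size=20000)
p_signs = np.mean(np.prod(norm.cdf(samples / nu), axis=1))

E_mono = E_plain - np.log(p_signs)
print(E_plain, E_mono)   # nearly equal energies -> the dimension looks monotonic
```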
Figure: Change in the energy of the data as a function of the number N of virtual derivative sign observations, for a GP with the monotonicity assumption compared to a regular GP (baseline energy E₀).
Using automatic monotonicity detection in modelling
- Monotonic dimensions can be detected from the data and used in modelling
- The method improves the modelling results especially near the borders of the data
Experiment
- Six different functions of varying monotonicity
- Different amounts of noise added to the training samples (signal-to-noise ratio (SNR) between 0 and 1)
- Measure the log predictive posterior density of samples from a hold-out set that resemble the 20% bordermost samples in the training data (a sketch of the Gaussian-predictive case follows this list):
$$\text{lppd} = \sum_{i=1}^{L} \log \int p(y_i \mid f)\, p_{\text{post}}(f \mid x_i)\, df$$
- Repeat this 200 times for three different models:
  - Use fixed monotonicity
  - Use monotonicity only if it does not change the energy (adaptive monotonicity)
  - Use a model without derivative observations
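When the posterior predictive at each input is Gaussian, the integral inside lppd has the closed form log N(y_i | μ_i, σ_i² + σ_noise²). A minimal sketch with toy numbers (my own naming, not from the talk):

```python
import numpy as np
from scipy.stats import norm

def lppd_gaussian(y, mu, var_f, noise_var):
    """lppd = sum_i log ∫ p(y_i | f) N(f | mu_i, var_f_i) df for a Gaussian likelihood."""
    return np.sum(norm.logpdf(y, loc=mu, scale=np.sqrt(var_f + noise_var)))

# Toy hold-out targets and GP posterior predictive moments at the same inputs
y_holdout = np.array([0.1, 0.8, 1.4])
mu_post   = np.array([0.0, 0.7, 1.5])
var_post  = np.array([0.05, 0.02, 0.10])
print(lppd_gaussian(y_holdout, mu_post, var_post, noise_var=0.01))
```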