Gaussian Process Summer School: Kernel Design
Nicolas Durrande – PROWLER.io (nicolas@prowler.io)
Sheffield, September 2017
Outline:
  Introduction
  What is a kernel?
  Choosing the appropriate kernel
  Making new from old
  Effect of linear operators
  Application: Periodicity detection
  Conclusion
Introduction
We have seen during the introduction lectures that the distribution of a GP $Z$ depends on two functions:

  the mean $m(x) = \mathrm{E}(Z(x))$
  the covariance $k(x, x') = \mathrm{cov}(Z(x), Z(x'))$

In this talk, we will focus on the covariance function, which is often called the kernel.
We assume we have observed a function $f$ for a limited number of time points $x_1, \dots, x_n$:

[Figure: the observed values of $f(x)$ plotted against $x$]

The observations are denoted by $f_i = f(x_i)$ (or $F = f(X)$).
Since $f$ is unknown, we make the general assumption that it is a sample path of a Gaussian process $Z$:

[Figure: sample paths of $Z(x)$]
Combining these two sources of information means keeping only the samples that interpolate the data points:

[Figure: samples of $Z(x) \,|\, Z(X) = F$ passing through the observations]
The conditional distribution is still Gaussian, with moments:

$m(x) = \mathrm{E}(Z(x) \,|\, Z(X) = F) = k(x, X)\, k(X, X)^{-1} F$

$c(x, x') = \mathrm{cov}(Z(x), Z(x') \,|\, Z(X) = F) = k(x, x') - k(x, X)\, k(X, X)^{-1} k(X, x')$

It can be represented as a mean function with confidence intervals.

[Figure: conditional mean of $Z(x) \,|\, Z(X) = F$ with confidence bands]
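These two formulas translate directly into a few lines of linear algebra. Below is a minimal NumPy sketch (not the code behind the slides); the kernel choice, toy data, and jitter value are illustrative assumptions.

```python
import numpy as np

def rbf(x, y, variance=1.0, lengthscale=0.2):
    # Squared-exponential kernel evaluated between two column vectors.
    return variance * np.exp(-0.5 * (x - y.T) ** 2 / lengthscale ** 2)

# Toy observations F = f(X) at four input points (illustrative values)
X = np.array([[0.1], [0.3], [0.6], [0.9]])
F = np.sin(6 * X)

# Prediction grid
x = np.linspace(0, 1, 100)[:, None]

# m(x)    = k(x, X) k(X, X)^{-1} F
# c(x,x') = k(x, x') - k(x, X) k(X, X)^{-1} k(X, x')
K = rbf(X, X) + 1e-10 * np.eye(len(X))  # small jitter for numerical stability
kxX = rbf(x, X)
m = kxX @ np.linalg.solve(K, F)
c = rbf(x, x) - kxX @ np.linalg.solve(K, kxX.T)

# Pointwise 95% confidence intervals around the conditional mean
sd = np.sqrt(np.maximum(np.diag(c), 0.0))
lower, upper = m.ravel() - 1.96 * sd, m.ravel() + 1.96 * sd
```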
What is a kernel?
Let $Z$ be a random process with kernel $k$. Some properties of kernels can be obtained directly from their definition.

Example:

$k(x, x) = \mathrm{cov}(Z(x), Z(x)) = \mathrm{var}(Z(x)) \ge 0$, so $k(x, x)$ is non-negative.
$k(x, y) = \mathrm{cov}(Z(x), Z(y)) = \mathrm{cov}(Z(y), Z(x)) = k(y, x)$, so $k$ is symmetric.

We can obtain a finer result...
We introduce the random variable $T = \sum_{i=1}^n a_i Z(x_i)$, where $n$, the $a_i$ and the $x_i$ are arbitrary. Computing the variance of $T$ gives:

$\mathrm{var}(T) = \mathrm{cov}\Big(\sum_i a_i Z(x_i), \sum_j a_j Z(x_j)\Big) = \sum_i \sum_j a_i a_j \, \mathrm{cov}(Z(x_i), Z(x_j)) = \sum_i \sum_j a_i a_j \, k(x_i, x_j)$

Since a variance is non-negative, we have

$\sum_i \sum_j a_i a_j \, k(x_i, x_j) \ge 0$

for any arbitrary $n$, $a_i$ and $x_i$.

Definition: The functions satisfying the above inequality for all $n \in \mathbb{N}$, all $x_i \in D$ and all $a_i \in \mathbb{R}$ are called positive semi-definite functions.
We have just seen:

$k$ is a covariance $\Rightarrow$ $k$ is a positive semi-definite function.

The converse is also true:

Theorem (Loeve): $k$ corresponds to the covariance of a GP $\Leftrightarrow$ $k$ is a symmetric positive semi-definite function.
Proving that a function is psd is often difficult. However, many functions have already been proven to be psd:

squared exp.   $k(x, y) = \sigma^2 \exp\left(-\frac{(x - y)^2}{2\theta^2}\right)$
Matérn 5/2     $k(x, y) = \sigma^2 \left(1 + \frac{\sqrt{5}|x - y|}{\theta} + \frac{5|x - y|^2}{3\theta^2}\right) \exp\left(-\frac{\sqrt{5}|x - y|}{\theta}\right)$
Matérn 3/2     $k(x, y) = \sigma^2 \left(1 + \frac{\sqrt{3}|x - y|}{\theta}\right) \exp\left(-\frac{\sqrt{3}|x - y|}{\theta}\right)$
exponential    $k(x, y) = \sigma^2 \exp\left(-\frac{|x - y|}{\theta}\right)$
Brownian       $k(x, y) = \sigma^2 \min(x, y)$
white noise    $k(x, y) = \sigma^2 \delta_{x, y}$
constant       $k(x, y) = \sigma^2$
linear         $k(x, y) = \sigma^2 x y$

When $k$ is a function of $x - y$, the kernel is called stationary. $\sigma^2$ is called the variance and $\theta$ the lengthscale.
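For concreteness, here is how a few entries of this list might be written in NumPy. This is an illustrative sketch rather than a reference implementation, and the default parameter values are arbitrary.

```python
import numpy as np

def sq_exp(x, y, sigma2=1.0, theta=0.2):
    # k(x, y) = sigma^2 exp(-(x - y)^2 / (2 theta^2))
    return sigma2 * np.exp(-((x - y) ** 2) / (2 * theta ** 2))

def matern32(x, y, sigma2=1.0, theta=0.2):
    # k(x, y) = sigma^2 (1 + sqrt(3)|x - y|/theta) exp(-sqrt(3)|x - y|/theta)
    d = np.abs(x - y)
    return sigma2 * (1 + np.sqrt(3) * d / theta) * np.exp(-np.sqrt(3) * d / theta)

def matern52(x, y, sigma2=1.0, theta=0.2):
    # k(x, y) = sigma^2 (1 + sqrt(5)d/theta + 5d^2/(3 theta^2)) exp(-sqrt(5)d/theta)
    d = np.abs(x - y)
    return sigma2 * (1 + np.sqrt(5) * d / theta + 5 * d ** 2 / (3 * theta ** 2)) \
        * np.exp(-np.sqrt(5) * d / theta)

def brownian(x, y, sigma2=1.0):
    # k(x, y) = sigma^2 min(x, y)
    return sigma2 * np.minimum(x, y)
```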
For a few kernels, it is possible to prove they are psd directly from the definition:

$k(x, y) = \delta_{x, y}$
$k(x, y) = 1$

For most of them, a direct proof from the definition is not possible. The following theorem is helpful for stationary kernels:

Theorem (Bochner): A continuous stationary function $k(x, y) = \tilde{k}(|x - y|)$ is positive definite if and only if $\tilde{k}$ is the Fourier transform of a finite positive measure:

$\tilde{k}(t) = \int_{\mathbb{R}} e^{-i\omega t} \, \mathrm{d}\mu(\omega)$
Example: We consider the following measure:

[Figure: the measure $\mu(\omega)$]

Its Fourier transform gives $\tilde{k}(t) = \frac{\sin(t)}{t}$:

[Figure: the function $\tilde{k}(t) = \sin(t)/t$]

As a consequence, $k(x, y) = \frac{\sin(x - y)}{x - y}$ is a valid covariance function.
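One can also check this claim numerically: the Gram matrix of a psd kernel has no negative eigenvalues. A small sanity check of this kind (an illustrative sketch, not from the slides) might look as follows.

```python
import numpy as np

# Gram matrix of the sinc kernel on 50 random inputs
x = np.random.uniform(0, 10, 50)
d = x[:, None] - x[None, :]
K = np.sinc(d / np.pi)  # np.sinc(t) = sin(pi t)/(pi t), so this is sin(d)/d

# All eigenvalues of the Gram matrix of a psd kernel are non-negative
# (up to floating-point rounding).
print(np.linalg.eigvalsh(K).min())  # expected: no smaller than about -1e-12
```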
Usual kernels

Bochner's theorem can be used to prove the positive definiteness of many usual stationary kernels:

The Gaussian is the Fourier transform of itself, so it is psd.
Matérn kernels are the Fourier transforms of $\frac{1}{(1 + \omega^2)^p}$, so they are psd.
Unusual kernels

The inverse Fourier transform of a (symmetrised) sum of Gaussians gives (A. Wilson, ICML 2013):

[Figure: a spectral density $\mu(\omega)$ and its Fourier transform $\tilde{k}(t)$]

The obtained kernel is parametrised by its spectrum.
Unusual kernels

The sample paths have the following shape:

[Figure: sample paths drawn from the spectral-mixture kernel]
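The closed form of this construction is the spectral-mixture kernel of Wilson's paper: a mixture of Gaussians $N(\mu_q, v_q)$ in the frequency domain transforms back to a sum of cosine-modulated squared-exponential terms. A sketch under that assumption (all parameter values below are arbitrary illustrations):

```python
import numpy as np

def spectral_mixture(t, weights, means, variances):
    # k~(t) = sum_q w_q exp(-2 pi^2 t^2 v_q) cos(2 pi t mu_q):
    # the inverse Fourier transform of a symmetrised mixture of
    # Gaussians N(mu_q, v_q) with weights w_q in the frequency domain.
    t = np.asarray(t, dtype=float)[..., None]
    return np.sum(np.asarray(weights)
                  * np.exp(-2 * np.pi ** 2 * t ** 2 * np.asarray(variances))
                  * np.cos(2 * np.pi * t * np.asarray(means)), axis=-1)

# Two spectral peaks give quasi-periodic samples mixing two frequencies
tau = np.linspace(0.0, 5.0, 500)
k_tau = spectral_mixture(tau, weights=[1.0, 0.5],
                         means=[1.0, 3.0], variances=[0.05, 0.05])
```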
Choosing the appropriate kernel
Changing the kernel has a huge impact on the model:

[Figure: model with a Gaussian kernel]
[Figure: model with an exponential kernel]
This is because changing the kernel implies changing the prior:

[Figure: prior samples with a Gaussian kernel]
[Figure: prior samples with an exponential kernel]
In order to choose a kernel, one should gather all available information about the function to approximate...

Is it stationary?
Is it differentiable? What is its regularity?
Do we expect particular trends?
Do we expect particular patterns (periodicity, cycles, additivity)?

Kernels often include rescaling parameters: $\theta$ for the $x$ axis (length-scale) and $\sigma$ for the $y$ axis ($\sigma^2$ often corresponds to the GP variance). They can be tuned by:

maximizing the likelihood (a minimal sketch follows below)
minimizing the prediction error
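Here is one way the likelihood-maximization option might look in practice: minimize the negative log marginal likelihood over log-transformed parameters. This is an illustrative sketch with a squared-exponential kernel and synthetic data, not the code from the lecture.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(log_params, X, F):
    # Negative log marginal likelihood of a zero-mean GP with a
    # squared-exponential kernel; parameters live in log space so the
    # optimiser cannot make them negative.
    sigma2, theta = np.exp(log_params)
    K = sigma2 * np.exp(-((X - X.T) ** 2) / (2 * theta ** 2)) + 1e-8 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, F))
    return (0.5 * F.T @ alpha).item() + np.sum(np.log(np.diag(L))) \
           + 0.5 * len(X) * np.log(2 * np.pi)

# Synthetic observations of a noisy sine
X = np.linspace(0, 1, 20)[:, None]
F = np.sin(6 * X) + 0.1 * np.random.randn(20, 1)

res = minimize(neg_log_lik, x0=np.log([1.0, 0.2]), args=(X, F))
sigma2_hat, theta_hat = np.exp(res.x)
```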
It is common to try various kernels and to assess the model accuracy. The idea is to compare some model predictions against actual values:

on a test set
using leave-one-out (see the sketch below)

Two (ideally three) things should be checked:

Is the mean accurate (MSE, $Q^2$)?
Do the confidence intervals make sense?
Are the predicted covariances right?

Furthermore, it is often interesting to try some input remapping such as $x \to \log(x)$, $x \to \exp(x)$, ...
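For the leave-one-out option, the residuals of a zero-mean GP with fixed hyperparameters have a closed form (Dubrule's formula), so the model never needs to be refitted $n$ times. A sketch under those assumptions; the function name is mine:

```python
import numpy as np

def loo_q2(K, F):
    # Closed-form leave-one-out residuals for a zero-mean GP:
    # e_i = [K^{-1} F]_i / [K^{-1}]_{ii}
    Kinv = np.linalg.inv(K)
    residuals = (Kinv @ F).ravel() / np.diag(Kinv)
    # Q2 = 1 - (sum of squared LOO errors) / (total variance of the data)
    return 1.0 - np.sum(residuals ** 2) / np.sum((F - F.mean()) ** 2)
```

A $Q^2$ close to 1 indicates accurate LOO predictions; values near 0 mean the model does no better than predicting the data mean.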
Making new from old
Making new from old: kernels can be:

Summed together
  on the same space: $k(x, y) = k_1(x, y) + k_2(x, y)$
  on the tensor space: $k(x, y) = k_1(x_1, y_1) + k_2(x_2, y_2)$
Multiplied together
  on the same space: $k(x, y) = k_1(x, y) \times k_2(x, y)$
  on the tensor space: $k(x, y) = k_1(x_1, y_1) \times k_2(x_2, y_2)$
Composed with a function
  $k(x, y) = k_1(f(x), f(y))$

All these operations preserve positive definiteness (see the sketch below). How can this be useful?
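These operations compose naturally as higher-order functions. A minimal sketch of the four combinators, with hypothetical names of my choosing:

```python
def k_sum(k1, k2):
    # Sum on the same space: k(x, y) = k1(x, y) + k2(x, y)
    return lambda x, y: k1(x, y) + k2(x, y)

def k_prod(k1, k2):
    # Product on the same space: k(x, y) = k1(x, y) * k2(x, y)
    return lambda x, y: k1(x, y) * k2(x, y)

def k_tensor_sum(k1, k2):
    # Sum over the tensor space: k(x, y) = k1(x1, y1) + k2(x2, y2),
    # with x = (x1, x2) and y = (y1, y2)
    return lambda x, y: k1(x[0], y[0]) + k2(x[1], y[1])

def k_warped(k1, f):
    # Composition with a function: k(x, y) = k1(f(x), f(y))
    return lambda x, y: k1(f(x), f(y))

# Example: an additive two-dimensional kernel built from 1d pieces,
# reusing the sq_exp function sketched earlier in these notes
# k2d = k_tensor_sum(sq_exp, sq_exp)
```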
Sum of kernels over the same space

Example (the Mauna Loa observatory dataset): this famous dataset compiles the monthly CO2 concentration in Hawaii since 1958.

[Figure: monthly CO2 concentration, 1958 to present]

Let's try to predict the concentration for the next 20 years.
Sum of kernels over the same space

We first consider a squared-exponential kernel:

$k(x, y) = \sigma^2 \exp\left(-\frac{(x - y)^2}{2\theta^2}\right)$

[Figures: two models based on squared-exponential kernels with different lengthscales]

The results are terrible!
Sum of kernels over the same space

What happens if we sum both kernels?

$k(x, y) = k_{rbf1}(x, y) + k_{rbf2}(x, y)$

[Figure: predictions with the sum of the two squared-exponential kernels]

The model is drastically improved!
Sum of kernels over the same space

We can try the following kernel:

$k(x, y) = \sigma_0^2 \, x^2 y^2 + k_{rbf1}(x, y) + k_{rbf2}(x, y) + k_{per}(x, y)$

[Figure: predictions with the composite kernel]

Once again, the model is significantly improved.
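A sketch of what such a composite kernel could look like in code. All hyperparameter values are placeholders to be tuned by maximum likelihood as discussed earlier, the inputs are assumed to be in years, and the periodic term uses the standard exp-sine-squared form rather than necessarily the exact kernel behind the slides.

```python
import numpy as np

def rbf(x, y, sigma2, theta):
    return sigma2 * np.exp(-((x - y) ** 2) / (2 * theta ** 2))

def periodic(x, y, sigma2, theta, period):
    # A standard periodic kernel: a squared exponential applied to
    # the sine of the scaled distance.
    return sigma2 * np.exp(-2 * np.sin(np.pi * (x - y) / period) ** 2 / theta ** 2)

def mauna_loa_kernel(x, y):
    # sigma0^2 x^2 y^2 + k_rbf1 + k_rbf2 + k_per
    return (0.01 * x ** 2 * y ** 2                        # quadratic trend term
            + rbf(x, y, sigma2=50.0, theta=30.0)          # slow variations
            + rbf(x, y, sigma2=2.0, theta=2.0)            # medium-term variations
            + periodic(x, y, sigma2=1.0, theta=1.0, period=1.0))  # annual cycle
```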