  1. Gaussian Process Summer School Kernel Design Nicolas Durrande – PROWLER.io (nicolas@prowler.io) Sheffield, September 2018

  2. Introduction

  3. We have seen during the introduction lectures that the distribution of a GP Z depends on two functions: the mean m(x) = E(Z(x)) and the covariance k(x, x′) = cov(Z(x), Z(x′)). In this talk, we will focus on the covariance function, which is often called the kernel.

  4. Given some data, the conditional distribution is still Gaussian:
     m(x) = E(Z(x) | Z(X) + ε = F) = k(x, X)(k(X, X) + τ²I)⁻¹F
     c(x, x′) = cov(Z(x), Z(x′) | Z(X) + ε = F) = k(x, x′) − k(x, X)(k(X, X) + τ²I)⁻¹k(X, x′)
     It can be represented as a mean function with confidence intervals.
     [Figure: conditional GP Z(x) | Z(X) = F, mean and confidence intervals for x ∈ [0, 1]]
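
These conditioning formulas translate directly into a few lines of linear algebra. Below is a minimal NumPy sketch (mine, not from the slides), assuming a squared-exponential kernel and illustrative values for the hyperparameters and the noise variance τ²:

```python
import numpy as np

def rbf(x, y, variance=1.0, lengthscale=0.2):
    """Squared-exponential kernel evaluated between all pairs of points."""
    d = x[:, None] - y[None, :]
    return variance * np.exp(-d**2 / (2 * lengthscale**2))

# Toy observations F at inputs X (placeholders for real data).
X = np.array([0.1, 0.3, 0.6, 0.9])
F = np.sin(6 * X)
tau2 = 1e-4                                   # noise variance τ²

x = np.linspace(0, 1, 100)                    # prediction points

K_xX = rbf(x, X)
K_XX = rbf(X, X) + tau2 * np.eye(len(X))
K_inv = np.linalg.inv(K_XX)                   # fine for tiny examples; prefer a Cholesky solve in practice

m = K_xX @ K_inv @ F                          # conditional mean m(x)
c = rbf(x, x) - K_xX @ K_inv @ K_xX.T         # conditional covariance c(x, x')
band = 1.96 * np.sqrt(np.diag(c))             # 95% confidence intervals: m ± band
```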

  5. What is a kernel?

  6. Let Z be a random process with kernel k. Some properties of kernels can be obtained directly from their definition.
     Example:
     k(x, x) = cov(Z(x), Z(x)) = var(Z(x)) ≥ 0 ⇒ k(x, x) is non-negative.
     k(x, y) = cov(Z(x), Z(y)) = cov(Z(y), Z(x)) = k(y, x) ⇒ k is symmetric.
     We can obtain a stronger result...

  7. We introduce the random variable T = Σᵢ₌₁ⁿ aᵢZ(xᵢ), where n, the aᵢ and the xᵢ are arbitrary. Computing the variance of T gives:
     var(T) = cov(Σᵢ aᵢZ(xᵢ), Σⱼ aⱼZ(xⱼ)) = Σᵢ Σⱼ aᵢaⱼ cov(Z(xᵢ), Z(xⱼ)) = Σᵢ Σⱼ aᵢaⱼ k(xᵢ, xⱼ) ≥ 0.
     Since a variance is non-negative, this inequality holds for any n, aᵢ and xᵢ.
     Definition: The functions k satisfying Σᵢ Σⱼ aᵢaⱼ k(xᵢ, xⱼ) ≥ 0 for all n ∈ ℕ, all xᵢ ∈ D and all aᵢ ∈ ℝ are called positive semi-definite functions.
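
In practice, this means that for any finite set of points the Gram matrix Kᵢⱼ = k(xᵢ, xⱼ) must have non-negative eigenvalues. A small numerical check (my own sketch, using a squared-exponential kernel):

```python
import numpy as np

def rbf(x, y, variance=1.0, lengthscale=0.5):
    return variance * np.exp(-(x[:, None] - y[None, :])**2 / (2 * lengthscale**2))

x = np.random.uniform(-3, 3, size=50)   # arbitrary input locations
K = rbf(x, x)                           # Gram matrix K_ij = k(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)         # K is symmetric, so eigvalsh applies
print(eigvals.min())                    # non-negative, up to round-off of order 1e-12
```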

  8. We have just seen: k is a covariance ⇒ k is a symmetric positive semi-definite function. The reverse is also true:
     Theorem (Loève): k corresponds to the covariance of a GP ⇔ k is a symmetric positive semi-definite function.

  9. Proving that a function is psd is often difficult. However, many functions have already been proven to be psd:
     squared exp.: k(x, y) = σ² exp(−(x − y)²/(2θ²))
     Matérn 5/2: k(x, y) = σ² (1 + √5|x − y|/θ + 5|x − y|²/(3θ²)) exp(−√5|x − y|/θ)
     Matérn 3/2: k(x, y) = σ² (1 + √3|x − y|/θ) exp(−√3|x − y|/θ)
     exponential: k(x, y) = σ² exp(−|x − y|/θ)
     Brownian: k(x, y) = σ² min(x, y)
     white noise: k(x, y) = σ² δ_{x,y}
     constant: k(x, y) = σ²
     linear: k(x, y) = σ² xy
     When k is a function of x − y, the kernel is called stationary. σ² is called the variance and θ the lengthscale.
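
As a concrete illustration (a sketch of mine, not slide material), the Matérn 5/2 expression above can be coded directly; the other stationary kernels follow the same pattern of pairwise distances plus an elementwise formula:

```python
import numpy as np

def matern52(x, y, variance=1.0, lengthscale=1.0):
    """Matérn 5/2: σ²(1 + √5·r/θ + 5r²/(3θ²)) exp(−√5·r/θ) with r = |x − y|."""
    r = np.abs(x[:, None] - y[None, :])
    s = np.sqrt(5.0) * r / lengthscale
    return variance * (1.0 + s + s**2 / 3.0) * np.exp(-s)
```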

  10. Examples of kernels in gpflow:
      [Figure: k(x, 0) for the Matern12, Matern32, Matern52, RBF, RationalQuadratic, Constant, White, Cosine, Periodic and ArcCosine kernels, and k(x, 1) for the Linear and Polynomial kernels, plotted over x ∈ [−3, 3]]
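
A figure of this kind could be produced along the following lines with GPflow 2 (a sketch; default parameter values and minor API details may differ between versions):

```python
import numpy as np
import matplotlib.pyplot as plt
import gpflow

x = np.linspace(-3, 3, 200).reshape(-1, 1)
zero = np.zeros((1, 1))

kernels = {
    "Matern12": gpflow.kernels.Matern12(),
    "Matern32": gpflow.kernels.Matern32(),
    "Matern52": gpflow.kernels.Matern52(),
    "RBF": gpflow.kernels.SquaredExponential(),
}

fig, axes = plt.subplots(1, len(kernels), figsize=(12, 3))
for ax, (name, k) in zip(axes, kernels.items()):
    ax.plot(x, k(x, zero).numpy())   # k(x, 0), as in the panels above
    ax.set_title(name)
plt.show()
```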

  11. Associated samples:
      [Figure: GP sample paths for each of the kernels above (Matern12, Matern32, Matern52, RBF, RationalQuadratic, Constant, White, Cosine, Periodic, Linear, Polynomial, ArcCosine), plotted over x ∈ [−3, 3]]
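
Such samples can be drawn by evaluating the kernel on a fine grid and sampling the corresponding multivariate normal; a minimal sketch (my own, with a Matérn 3/2 kernel as the example):

```python
import numpy as np

def matern32(x, y, variance=1.0, lengthscale=1.0):
    r = np.abs(x[:, None] - y[None, :])
    s = np.sqrt(3.0) * r / lengthscale
    return variance * (1.0 + s) * np.exp(-s)

x = np.linspace(-3, 3, 300)
K = matern32(x, x) + 1e-10 * np.eye(len(x))   # small jitter for numerical stability

# Three independent sample paths from the zero-mean GP prior N(0, K).
samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
```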

  12. For a few kernels, it is possible to prove they are psd directly from the definition, for example k(x, y) = δ_{x,y} and k(x, y) = 1. For most of them a direct proof from the definition is not possible. The following theorem is helpful for stationary kernels:
      Theorem (Bochner): A continuous stationary function k(x, y) = k̃(|x − y|) is positive definite if and only if k̃ is the Fourier transform of a finite positive measure µ: k̃(t) = ∫_ℝ e^(−iωt) dµ(ω).

  13. Example: we consider a uniform measure on a bounded interval of frequencies (a box spectrum). Its Fourier transform gives k̃(t) = sin(t)/t. As a consequence, k(x, y) = sin(x − y)/(x − y) is a valid covariance function.
      [Figure: the spectral measure µ(ω) and the resulting kernel k̃(t) = sin(t)/t]

  14. Usual kernels: Bochner's theorem can be used to prove the positive definiteness of many usual stationary kernels.
      The Gaussian is the Fourier transform of itself ⇒ it is psd.
      Matérn kernels are the Fourier transforms of 1/(1 + ω²)^p ⇒ they are psd.

  15. Unusual kernels: taking the inverse Fourier transform of a (symmetrised) sum of Gaussians gives the spectral mixture kernel (A. Wilson, ICML 2013): the spectral density µ(ω) is mapped by the transform F to a kernel k̃(t). The obtained kernel is parametrised by its spectrum.
      [Figure: a sum-of-Gaussians spectral density µ(ω) and the corresponding kernel k̃(t)]
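
For a single symmetrised Gaussian component of the spectrum (weight w, mean frequency µ, frequency variance v), the resulting kernel has the closed form k(t) = w·exp(−2π²vt²)·cos(2πµt) under the Fourier convention used by Wilson & Adams (2013). A sketch with illustrative parameter values of my own choosing:

```python
import numpy as np

def spectral_mixture_1c(t, weight=1.0, mean_freq=0.5, freq_var=0.05):
    """One-component spectral mixture kernel: the inverse Fourier transform
    of a symmetrised Gaussian spectral density."""
    return weight * np.exp(-2 * np.pi**2 * freq_var * t**2) * np.cos(2 * np.pi * mean_freq * t)

t = np.linspace(-5, 5, 500)
k_t = spectral_mixture_1c(t)   # a decaying cosine: oscillations with a finite correlation length
```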

  16. Unusual kernels: the sample paths have the following shape:
      [Figure: sample paths from a GP with the spectral mixture kernel above]

  17. Choosing the appropriate kernel

  18. Changing the kernel has a huge impact on the model:
      [Figure: GP models fitted to the same data with a Gaussian kernel and with an exponential kernel]

  19. This is because changing the kernel implies changing the prior:
      [Figure: prior samples for a Gaussian kernel and for an exponential kernel]

  20. In order to choose a kernel, one should gather all possible information about the function to approximate:
      Is it stationary?
      Is it differentiable? What is its regularity?
      Do we expect particular trends?
      Do we expect particular patterns (periodicity, cycles, additivity)?
      Kernels often include rescaling parameters: θ for the x axis (the length-scale) and σ for the y axis (σ² often corresponds to the GP variance). They can be tuned by maximizing the likelihood or by minimizing the prediction error.
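
In GPflow 2, for instance, maximising the likelihood amounts to a few lines; a minimal sketch with toy data (names and values of my own choosing; API details may vary across versions):

```python
import numpy as np
import gpflow

# Toy 1D data standing in for real observations.
X = np.random.uniform(0, 1, (30, 1))
Y = np.sin(10 * X) + 0.1 * np.random.randn(30, 1)

kernel = gpflow.kernels.Matern52()            # σ² and θ are trainable parameters
model = gpflow.models.GPR((X, Y), kernel=kernel)

# Maximise the log marginal likelihood w.r.t. σ², θ and the noise variance τ².
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

print(kernel.lengthscales.numpy(), kernel.variance.numpy())
```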

  21. It is common to try various kernels and to assess the model accuracy. The idea is to compare some model predictions against actual values, either on a test set or using leave-one-out. Two (ideally three) things should be checked:
      Is the mean accurate (MSE, Q²)?
      Do the confidence intervals make sense?
      Are the predicted covariances right?
      Furthermore, it is often interesting to try some input remappings such as x → log(x), x → exp(x), ...
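
The first two checks are easy to script. A sketch of test-set validation, assuming a fitted GPflow model `model` as in the previous snippet and held-out arrays `X_test`, `Y_test` (hypothetical names):

```python
import numpy as np

mean, var = model.predict_y(X_test)            # predictive mean and variance at the test inputs
mean, var = mean.numpy(), var.numpy()

# 1. Accuracy of the mean: MSE and Q² (fraction of variance explained).
mse = np.mean((Y_test - mean)**2)
q2 = 1.0 - mse / np.var(Y_test)

# 2. Calibration of the confidence intervals: empirical coverage of the 95% band.
half_width = 1.96 * np.sqrt(var)
coverage = np.mean(np.abs(Y_test - mean) <= half_width)   # should be close to 0.95

print(mse, q2, coverage)
```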

  22. Making new from old

  23. Making new from old. Kernels can be:
      Summed together, either on the same space, k(x, y) = k1(x, y) + k2(x, y), or on the tensor space, k(x, y) = k1(x1, y1) + k2(x2, y2).
      Multiplied together, either on the same space, k(x, y) = k1(x, y) × k2(x, y), or on the tensor space, k(x, y) = k1(x1, y1) × k2(x2, y2).
      Composed with a function, k(x, y) = k1(f(x), f(y)).
      All these operations preserve positive definiteness. How can this be useful?
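
In GPflow these combinations can be written directly with + and *; a small sketch (the active dimensions and the periodic base kernel are chosen purely for illustration):

```python
import gpflow

# Sum and product on the same input space.
k_sum = gpflow.kernels.Matern12() + gpflow.kernels.Linear()
k_prod = gpflow.kernels.SquaredExponential() * gpflow.kernels.Periodic(gpflow.kernels.SquaredExponential())

# Sum on the tensor space: each term only sees one input dimension.
k_tensor = (gpflow.kernels.SquaredExponential(active_dims=[0])
            + gpflow.kernels.Matern32(active_dims=[1]))
```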

  24. Sum of kernels over the same input space.
      Property: k(x, y) = k1(x, y) + k2(x, y) is a valid covariance structure. This can be proved directly from the p.s.d. definition.
      [Figure: a Matern12 kernel and a Linear kernel, each shown as k(x, 0.03), together with their sum]

  25. Sum of kernels over the same input space.
      Z ∼ N(0, k1 + k2) can be seen as Z = Z1 + Z2, where Z1 and Z2 are independent with Z1 ∼ N(0, k1) and Z2 ∼ N(0, k2).
      [Figure: sample paths of Z1(x) and Z2(x), and of their sum Z(x)]

  26. Sum of kernels over the same space.
      Example (the Mauna Loa observatory dataset): this famous dataset compiles the monthly CO₂ concentration in Hawaii since 1958.
      [Figure: monthly CO₂ concentration (ppm), 1958 to the present]
      Let's try to predict the concentration for the next 20 years.

  27. Sum of kernels over the same space.
      We first consider a squared-exponential kernel: k(x, y) = σ² exp(−(x − y)²/(2θ²)).
      [Figure: model predictions over 1950–2040 with this kernel]
      The results are terrible!

  28. Sum of kernels over the same space.
      What happens if we sum two squared-exponential kernels? k(x, y) = k_rbf1(x, y) + k_rbf2(x, y)
      [Figure: model predictions over 1950–2040 with the summed kernel]
      The model is drastically improved!

  29. Sum of kernels over the same space.
      We can try the following kernel: k(x, y) = σ₀²x²y² + k_rbf1(x, y) + k_rbf2(x, y) + k_per(x, y)
      [Figure: model predictions over 1950–2040 with this composite kernel]
      Once again, the model is significantly improved.
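
A composite kernel of this flavour can be assembled in GPflow along the following lines (a sketch only: the product of two linear kernels stands in for the σ₀²x²y² term, period=1.0 assumes inputs expressed in years, and all parameters would still need to be trained on the data):

```python
import gpflow

k_quadratic = gpflow.kernels.Linear() * gpflow.kernels.Linear()     # ∝ x²y², the polynomial trend term
k_rbf1 = gpflow.kernels.SquaredExponential()                        # long-term smooth trend
k_rbf2 = gpflow.kernels.SquaredExponential()                        # medium-scale variations
k_per = gpflow.kernels.Periodic(gpflow.kernels.SquaredExponential(), period=1.0)  # yearly cycle

kernel = k_quadratic + k_rbf1 + k_rbf2 + k_per
# model = gpflow.models.GPR((X, Y), kernel=kernel)   # X: dates in years, Y: CO2 concentration
```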
