
Fitting Covariance and Multioutput Gaussian Processes
Neil D. Lawrence
GPMC, 6th February 2017

Outline: Constructing Covariance, GP Limitations, Kalman Filter


  1.–2. Learning Covariance Parameters. Can we determine covariance parameters from the data?
      N(y | 0, K) = (2π)^(−n/2) |K|^(−1/2) exp(−½ y⊤K⁻¹y)
      The parameters are inside the covariance function (matrix): kᵢ,ⱼ = k(xᵢ, xⱼ; θ).

  3. Learning Covariance Parameters. Can we determine covariance parameters from the data?
      log N(y | 0, K) = −½ log|K| − ½ y⊤K⁻¹y − (n/2) log 2π
      The parameters are inside the covariance function (matrix): kᵢ,ⱼ = k(xᵢ, xⱼ; θ).

  4. Learning Covariance Parameters. Can we determine covariance parameters from the data? Dropping the constant, the negative log likelihood gives the objective
      E(θ) = ½ log|K| + ½ y⊤K⁻¹y
      The parameters are inside the covariance function (matrix): kᵢ,ⱼ = k(xᵢ, xⱼ; θ).
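
  As a rough illustration (not part of the original slides), a minimal NumPy sketch of this objective for an exponentiated quadratic covariance. The function names exp_quad_cov and objective, and the toy data, are invented for the example.

      import numpy as np

      # Minimal sketch: E(theta) = 0.5*log|K| + 0.5*y' K^{-1} y for an
      # exponentiated quadratic covariance with noise on the diagonal.
      def exp_quad_cov(X, lengthscale=1.0, variance=1.0, noise=1e-6):
          sq_dist = (X[:, None] - X[None, :]) ** 2      # pairwise squared distances
          K = variance * np.exp(-0.5 * sq_dist / lengthscale**2)
          return K + noise * np.eye(len(X))

      def objective(y, K):
          _, logdet = np.linalg.slogdet(K)              # stable log-determinant
          return 0.5 * logdet + 0.5 * y @ np.linalg.solve(K, y)

      X = np.linspace(-2.0, 2.0, 10)
      y = np.sin(X)
      print(objective(y, exp_quad_cov(X, lengthscale=1.0)))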

  5. Eigendecomposition of Covariance. A useful decomposition for understanding the objective function:
      K = RΛ²R⊤, where Λ is a diagonal matrix and R⊤R = I.
      The diagonal of Λ represents distance along the axes; R gives a rotation of these axes.
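
  A small numerical check of this decomposition (added here, not on the slide), using NumPy's eigh on an arbitrary positive definite matrix; the matrix itself is made up for the example.

      import numpy as np

      # Eigendecomposition of a positive definite covariance: K = R Lambda^2 R^T.
      rng = np.random.default_rng(0)
      A = rng.standard_normal((3, 3))
      K = A @ A.T + 1e-3 * np.eye(3)            # arbitrary symmetric positive definite matrix

      eigvals, R = np.linalg.eigh(K)            # columns of R are orthonormal: R^T R = I
      Lambda = np.diag(np.sqrt(eigvals))        # Lambda^2 holds the eigenvalues of K

      print(np.allclose(K, R @ Lambda @ Lambda @ R.T))   # True
      print(np.allclose(R.T @ R, np.eye(3)))             # True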

  6.–11. Capacity control: log|K|. With Λ = diag(λ₁, λ₂), the determinant is |Λ| = λ₁λ₂. [Figures: λ₁ and λ₂ shown as distances along the axes.]

  12.–13. Capacity control: log|K|. With Λ = diag(λ₁, λ₂, λ₃), the determinant is |Λ| = λ₁λ₂λ₃.

  14. Capacity control: log|K|. Back to two dimensions: Λ = diag(λ₁, λ₂), |Λ| = λ₁λ₂.

  15. Capacity control: log|K|. Applying a rotation R (with entries w₁,₁, w₁,₂, w₂,₁, w₂,₂) leaves the determinant unchanged: |RΛ| = λ₁λ₂.
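
  A quick numerical check of the capacity-control term (an addition, not from the slides): the determinant is the product of the diagonal entries of Λ, and multiplying by a rotation leaves it unchanged. The values of λ₁, λ₂ and the rotation angle are arbitrary.

      import numpy as np

      # |Lambda| is the product of the lambdas; a rotation preserves it: |R Lambda| = |Lambda|.
      lam = np.array([2.0, 0.5])
      Lambda = np.diag(lam)

      theta = 0.3
      R = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])

      print(np.isclose(np.linalg.det(Lambda), lam.prod()))       # True
      print(np.isclose(np.linalg.det(R @ Lambda), lam.prod()))   # True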

  16.–18. Data Fit: ½ y⊤K⁻¹y. [Figures: the data (y₁, y₂) plotted on axes from −6 to 6 against covariance ellipses whose principal axes are λ₁ and λ₂.]
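
  To see how this data-fit term trades off against log|K|, a small sketch with made-up data (added here as an aside): scaling the covariance up shrinks y⊤K⁻¹y while growing log|K|.

      import numpy as np

      # Scaling the covariance up shrinks the data-fit term but grows the log-determinant.
      rng = np.random.default_rng(1)
      y = rng.standard_normal(5)
      K = np.eye(5)                               # placeholder covariance

      for scale in (0.5, 1.0, 2.0, 4.0):
          Ks = scale * K
          data_fit = y @ np.linalg.solve(Ks, y)
          _, logdet = np.linalg.slogdet(Ks)
          print(f"scale={scale}: data fit={data_fit:.2f}, log|K|={logdet:.2f}")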

  19.–27. Learning Covariance Parameters. Can we determine length scales and noise levels from the data? [Figures: samples of y(x) for x in [−2, 2], alongside the objective plotted against the length scale ℓ on a log axis from 10⁻¹ to 10¹.]
      E(θ) = ½ log|K| + ½ y⊤K⁻¹y
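
  A hedged sketch of fitting the length scale and noise variance by minimising E(θ) with scipy.optimize.minimize. The data, initial values, and the helper name neg_log_likelihood are invented, and (as the contour plots a few slides on suggest) this objective can have multiple local optima, so the answer depends on the starting point.

      import numpy as np
      from scipy.optimize import minimize

      def neg_log_likelihood(log_params, X, y):
          # E(theta) as a function of log length scale and log noise variance.
          lengthscale, noise = np.exp(log_params)
          sq_dist = (X[:, None] - X[None, :]) ** 2
          K = np.exp(-0.5 * sq_dist / lengthscale**2) + noise * np.eye(len(X))
          _, logdet = np.linalg.slogdet(K)
          return 0.5 * logdet + 0.5 * y @ np.linalg.solve(K, y)

      X = np.linspace(-2.0, 2.0, 30)
      y = np.sin(3 * X) + 0.1 * np.random.default_rng(2).standard_normal(30)

      result = minimize(neg_log_likelihood, x0=np.log([1.0, 0.1]), args=(X, y))
      print("length scale:", np.exp(result.x[0]), "noise variance:", np.exp(result.x[1]))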

  28. Gene Expression Example ◮ Given expression levels in the form of a time series from Della Gatta et al. (2008). ◮ We want to detect whether a gene is expressed or not, so we fit a GP to each gene (Kalaitzis and Lawrence, 2011).

  29. Kalaitzis and Lawrence, BMC Bioinformatics 2011, 12:180. http://www.biomedcentral.com/1471-2105/12/180 RESEARCH ARTICLE Open Access A Simple Approach to Ranking Differentially Expressed Gene Expression Time Courses through Gaussian Process Regression. Alfredo A Kalaitzis and Neil D Lawrence. Abstract. Background: The analysis of gene expression from time series underpins many biological studies. Two basic forms of analysis recur for data of this type: removing inactive (quiet) genes from the study and determining which genes are differentially expressed. Often these analysis stages are applied disregarding the fact that the data is drawn from a time series. In this paper we propose a simple model for accounting for the underlying temporal nature of the data based on a Gaussian process. Results: We review Gaussian process (GP) regression for estimating the continuous trajectories underlying gene expression time series. We present a simple approach which can be used to filter quiet genes or, for the case of time series in the form of expression ratios, to quantify differential expression. We assess via ROC curves the rankings produced by our regression framework and compare them to a recently proposed hierarchical Bayesian model for the analysis of gene expression time series (BATS). We compare on both simulated and experimental data, showing that the proposed approach considerably outperforms the current state of the art.

  30. [Figure: contour plot of Gaussian process likelihood, with log₁₀ length scale on the horizontal axis (1 to 3.5) and log₁₀ SNR on the vertical axis (−2.5 to 1).]

  31. [Figures: contour plot of the likelihood and the corresponding fit y(x) for x from 0 to 300.] Optimum: length scale 1.2221, log₁₀ SNR 1.9654; log likelihood −0.22317.

  32. [Figures: contour plot of the likelihood and the corresponding fit y(x) for x from 0 to 300.] Optimum: length scale 1.5162, log₁₀ SNR 0.21306; log likelihood −0.23604.

  33. [Figures: contour plot of the likelihood and the corresponding fit y(x) for x from 0 to 300.] Optimum: length scale 2.9886, log₁₀ SNR −4.506; log likelihood −2.1056.

  34. Outline Constructing Covariance GP Limitations Kalman Filter

  35. Limitations of Gaussian Processes ◮ Inference is O(n³) due to the matrix inverse (in practice, use the Cholesky decomposition). ◮ Gaussian processes don't deal well with discontinuities (financial crises, phosphorylation, collisions, edges in images). ◮ The widely used exponentiated quadratic covariance (RBF) can be too smooth in practice (but there are many alternatives!).
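
  On the first point, a sketch of the standard Cholesky route for the O(n³) computations: one factorisation K = LL⊤ gives both K⁻¹y and log|K| without forming an explicit inverse. The covariance and data here are placeholders, not from the slides.

      import numpy as np
      from scipy.linalg import cho_factor, cho_solve

      # One O(n^3) Cholesky factorisation serves both the solve and the log-determinant.
      rng = np.random.default_rng(3)
      X = rng.standard_normal(200)
      K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-2 * np.eye(200)
      y = rng.standard_normal(200)

      L, lower = cho_factor(K, lower=True)
      alpha = cho_solve((L, lower), y)              # K^{-1} y
      logdet = 2.0 * np.sum(np.log(np.diag(L)))     # log|K| from the Cholesky diagonal
      print(0.5 * logdet + 0.5 * y @ alpha)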

  36. Outline Constructing Covariance GP Limitations Kalman Filter

  37. Simple Markov Chain ◮ Assume a 1-d latent state, a vector over time, x = [x₁ … x_T]. ◮ Markov property: xᵢ = xᵢ₋₁ + εᵢ, with εᵢ ~ N(0, α), so xᵢ ~ N(xᵢ₋₁, α). ◮ Initial state: x₀ ~ N(0, α₀). ◮ If x₀ ~ N(0, α) we have a Markov chain for the latent states. ◮ The Markov chain is specified by an initial distribution (Gaussian) and a transition distribution (Gaussian).
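
  A minimal simulation of this chain (the specific noise draws on the following slides came from the author's own random seed; this sketch just reproduces the process).

      import numpy as np

      # Sample x_i = x_{i-1} + eps_i with eps_i ~ N(0, alpha), starting from x_0 = 0.
      rng = np.random.default_rng(4)
      alpha = 1.0
      T = 9

      x = np.zeros(T + 1)                           # x[0] = 0 is the initial state
      for i in range(1, T + 1):
          x[i] = x[i - 1] + rng.normal(0.0, np.sqrt(alpha))
      print(np.round(x, 3))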

  38.–46. Gauss Markov Chain. [Figure: the sampled trajectory x plotted against t for t = 0, …, 9.] With x₀ = 0 and εᵢ ~ N(0, 1), one draw of the chain proceeds step by step:
      x₁ = 0.000 − 2.24 = −2.24
      x₂ = −2.24 + 0.457 = −1.78
      x₃ = −1.78 + 0.178 = −1.6
      x₄ = −1.6 − 0.292 = −1.89
      x₅ = −1.89 − 0.501 = −2.39
      x₆ = −2.39 + 1.32 = −1.08
      x₇ = −1.08 + 0.989 = −0.0881
      x₈ = −0.0881 − 0.842 = −0.93
      x₉ = −0.93 − 0.410 = −1.34

  47. Multivariate Gaussian Properties: Reminder. If z ~ N(μ, C) and x = Wz + b, then x ~ N(Wμ + b, WCW⊤).

  48. Multivariate Gaussian Properties: Reminder. Simplified: if z ~ N(0, σ²I) and x = Wz, then x ~ N(0, σ²WW⊤).
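
  An empirical check of the affine property (added here), with arbitrary μ, C, W and b: sample z, map it through x = Wz + b, and compare the sample mean and covariance with Wμ + b and WCW⊤.

      import numpy as np

      # If z ~ N(mu, C) and x = W z + b, then x ~ N(W mu + b, W C W^T): verify by sampling.
      rng = np.random.default_rng(5)
      mu = np.array([1.0, -1.0])
      C = np.array([[2.0, 0.3],
                    [0.3, 0.5]])
      W = np.array([[1.0, 2.0],
                    [0.0, 1.0],
                    [3.0, -1.0]])
      b = np.array([0.5, 0.0, -0.5])

      z = rng.multivariate_normal(mu, C, size=200_000)
      x = z @ W.T + b

      print(np.allclose(x.mean(axis=0), W @ mu + b, atol=0.05))   # approximately True
      print(np.allclose(np.cov(x.T), W @ C @ W.T, atol=0.3))      # approximately True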

  49.–53. Matrix Representation of Latent Variables. Stacking the chain:
      x₁     1 0 0 0 0     ε₁
      x₂     1 1 0 0 0     ε₂
      x₃  =  1 1 1 0 0  ×  ε₃
      x₄     1 1 1 1 0     ε₄
      x₅     1 1 1 1 1     ε₅
      so that x₁ = ε₁, x₂ = ε₁ + ε₂, x₃ = ε₁ + ε₂ + ε₃, x₄ = ε₁ + ε₂ + ε₃ + ε₄, and x₅ = ε₁ + ε₂ + ε₃ + ε₄ + ε₅.

  54. Matrix Representation of Latent Variables. In matrix form: x = L₁ε, with L₁ the lower-triangular matrix of ones shown above.
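
  A sketch of the same construction in NumPy (an addition, not from the slides): L₁ is the lower-triangular matrix of ones, so multiplying by it is a cumulative sum of the noise.

      import numpy as np

      # x = L1 eps, where L1 is lower-triangular ones, i.e. a cumulative-sum operator.
      T = 5
      L1 = np.tril(np.ones((T, T)))

      eps = np.random.default_rng(6).standard_normal(T)
      x = L1 @ eps

      print(np.allclose(x, np.cumsum(eps)))   # True: x_i = eps_1 + ... + eps_i
      print(L1)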

  55. Multivariate Process ◮ Since x is linearly related to ε, we know x is also a Gaussian process. ◮ Simply invoke our properties of multivariate Gaussian densities.

  56.–59. Latent Process
      x = L₁ε,   ε ~ N(0, αI)   ⇒   x ~ N(0, αL₁L₁⊤)
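
  Following on, a check of the implied covariance αL₁L₁⊤. Its (i, j) entry works out to α·min(i, j) (a Brownian-motion-like covariance); that closed form is an observation added here rather than something stated on the slide.

      import numpy as np

      # Covariance implied by x = L1 eps with eps ~ N(0, alpha I): cov(x) = alpha * L1 L1^T.
      # Its (i, j) entry equals alpha * min(i, j) with 1-based indexing (added observation).
      alpha = 1.0
      T = 5
      L1 = np.tril(np.ones((T, T)))
      K = alpha * (L1 @ L1.T)

      i = np.arange(1, T + 1)
      print(np.allclose(K, alpha * np.minimum.outer(i, i)))   # True
      print(K)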
