Kernel partial least squares for stationary data
  1. Kernel partial least squares for stationary data. Tatyana Krivobokova, Marco Singer, Axel Munk (Georg-August-Universität Göttingen); Bert de Groot (Max Planck Institute for Biophysical Chemistry). Van Dantzig Seminar, 06 April 2017

  2. Motivating example. Proteins
  • are large biological molecules
  • their function often requires dynamics
  • their configuration space is high-dimensional
  The group of Bert de Groot seeks to identify a relationship between the collective atomic motions of a protein and a specific (biological) function of that protein.

  3. Motivating example. The data from Molecular Dynamics (MD) simulations:
  • $Y_t \in \mathbb{R}$ is a functional quantity of interest at time $t$, $t = 1, \dots, n$
  • $X_t \in \mathbb{R}^{3N}$ are the Euclidean coordinates of $N$ atoms at time $t$
  Stylized facts:
  • $d = 3N$ is typically high, but $d \ll n$
  • $\{X_t\}_t$, $\{Y_t\}_t$ are (non-)stationary time series
  • some (large) atom movements might be unrelated to $Y_t$
  The functional quantity $Y_t$ is to be modelled as a function of $X_t$.

  4. Yeast aquaporin (AQY1)
  • a gated water channel
  • $Y_t$ is the opening diameter (red line)
  • 783 backbone atoms
  • $n = 20{,}000$ observations on a 100 ns timeframe

  5. AQY1 time series. Movements of the first atom and the diameter of the channel opening. [Figure: two panels over time in ns; left axis "Coordinate", right axis "Diameter in nm".]

  6. Model. Assume
  $$Y_t = f(X_t) + \epsilon_t, \qquad t = 1, \dots, n,$$
  where
  • $\{X_t\}_t$ is a $d$-dimensional stationary time series
  • $\{\epsilon_t\}_t$ is an i.i.d. zero-mean sequence independent of $\{X_t\}_t$
  • $f \in L^2(P_{\tilde X})$, where $\tilde X$ is independent of $\{X_t\}_t$ and $\{\epsilon_t\}_t$ and $P_{\tilde X} = P_{X_1}$
  The closeness of an estimator $\hat f$ of $f$ is measured by
  $$\|\hat f - f\|_2^2 = E\bigl\{\hat f(\tilde X) - f(\tilde X)\bigr\}^2.$$

  7. Simple linear case. Hub, J. S. and de Groot, B. L. (2009) assumed a linear model
  $$Y_i = X_i^T\beta + \epsilon_i, \qquad i = 1, \dots, n, \quad X_i \in \mathbb{R}^d,$$
  or in matrix form $Y = X\beta + \epsilon$, ignored the dependence in the data and tried to regularise the estimator using PCA.

  8. Motivating example. PC regression with 50 components. [Figure: left panel, time series over time in ns; right panel, correlation vs. number of components.]

  9. Motivating example. Partial Least Squares (PLS) leads to superior results. [Figure: correlation vs. number of components for PLS and PCR.]

  10. Regularisation with PCR and PLS. Consider a linear regression model with fixed design $Y = X\beta + \epsilon$. In the following let $A = X^T X$ and $b = X^T Y$. PCR and PLS regularise $\beta$ with a transformation $H \in \mathbb{R}^{d \times s}$ such that
  $$\hat\beta_s = H \arg\min_{\alpha \in \mathbb{R}^s} \frac{1}{n}\|Y - XH\alpha\|^2 = H(H^T A H)^{-1} H^T b,$$
  where $s \le d$ plays the role of a regularisation parameter. In PCR the matrix $H$ consists of the first $s$ eigenvectors of $A = X^T X$.
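A minimal numerical sketch of this projection estimator in plain numpy; the function name and the use of `numpy.linalg.eigh` are illustrative choices, not taken from the talk:

```python
import numpy as np

def pcr_estimator(X, Y, s):
    """PCR as a projection estimator: H = first s eigenvectors of A = X^T X,
    beta_s = H (H^T A H)^{-1} H^T b with b = X^T Y."""
    A = X.T @ X
    b = X.T @ Y
    eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    H = eigvecs[:, ::-1][:, :s]            # first s principal eigenvectors
    return H @ np.linalg.solve(H.T @ A @ H, H.T @ b)
```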

  11. Regularisation with PLS. In PLS one derives $H = (h_1, \dots, h_s)$, $h_i \in \mathbb{R}^d$, as follows:
  1. Find $h_1 = \arg\max_{h \in \mathbb{R}^d,\ \|h\|=1} \operatorname{cov}(Xh, Y)^2 \propto X^T Y = b$
  2. Project $Y$ orthogonally onto $Xh_1$: $X\hat\beta_1 = Xh_1(h_1^T A h_1)^{-1} h_1^T X^T Y$
  3. Iterate the procedure according to $h_i = \arg\max_{h \in \mathbb{R}^d,\ \|h\|=1} \operatorname{cov}(Xh, Y - X\hat\beta_{i-1})^2$, $i = 2, \dots, s$
  Apparently, $\hat\beta_s$ is highly non-linear in $Y$.
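A hedged sketch of these three steps; this is an illustration of the greedy construction above (each direction built from the current residual, then the projection estimator of the previous slide recomputed), not the authors' implementation:

```python
import numpy as np

def pls_estimator(X, Y, s):
    """Greedy PLS as on the slide: each new direction h_i is proportional to
    X^T (current residual); beta is the projection estimator built from all
    directions found so far (cf. the H-formula of the previous slide)."""
    A = X.T @ X
    b = X.T @ Y
    directions = []
    residual = Y.copy()
    beta = np.zeros(X.shape[1])
    for _ in range(s):
        h = X.T @ residual
        h = h / np.linalg.norm(h)          # normalise the covariance direction
        directions.append(h)
        H = np.column_stack(directions)
        beta = H @ np.linalg.solve(H.T @ A @ H, H.T @ b)
        residual = Y - X @ beta            # deflate Y for the next direction
    return beta
```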

  12. Regularisation with PLS. For PLS it is known that $h_i \in \mathcal{K}_i(A, b)$, $i = 1, \dots, s$, where
  $$\mathcal{K}_i(A, b) = \operatorname{span}\{b, Ab, \dots, A^{i-1}b\}$$
  is a Krylov space of order $i$. With this, the alternative definition of PLS is
  $$\hat\beta_s = \arg\min_{\beta \in \mathcal{K}_s(A, b)} \|Y - X\beta\|^2.$$
  Note that any $\beta_s \in \mathcal{K}_s(A, b)$ can be represented as
  $$\beta_s = P_s(A)\,b = P_s(X^T X)\,X^T Y = X^T P_s(XX^T)\,Y,$$
  where $P_s$ is a polynomial of degree at most $s - 1$.
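The Krylov-space characterisation translates directly into a (numerically naive) sketch; real implementations orthogonalise the basis, which is omitted here for clarity:

```python
import numpy as np

def pls_via_krylov(X, Y, s):
    """Minimise ||Y - X beta||^2 over beta in K_s(A, b), A = X^T X, b = X^T Y,
    by writing beta = V c with the explicit Krylov basis V = [b, Ab, ..., A^{s-1} b]."""
    A = X.T @ X
    b = X.T @ Y
    V = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(s)])
    c, *_ = np.linalg.lstsq(X @ V, Y, rcond=None)   # least squares in the s coefficients
    return V @ c
```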

  13. Regularisation with PLS. For the implementation and the proofs, the residual polynomials $R_s(x) = 1 - xP_s(x)$ are of interest. The polynomials $R_s$
  • are orthogonal w.r.t. an appropriate inner product
  • satisfy a recurrence relation $R_{s+1}(x) = a_s x R_s(x) + b_s R_s(x) + c_s R_{s-1}(x)$
  • are convex on $[0, r_s]$, where $r_s$ is the first root of $R_s(x)$, and satisfy $R_s(0) = 1$.

  14. PLS and conjugate gradient. PLS is closely related to the conjugate gradient (CG) algorithm for
  $$A\beta = X^T X\beta = X^T Y = b.$$
  The solution of this linear equation by CG is defined by
  $$\hat\beta_s^{CG} = \arg\min_{\beta \in \mathcal{K}_s(A, b)} \|b - A\beta\|^2 = \arg\min_{\beta \in \mathcal{K}_s(A, b)} \|X^T(Y - X\beta)\|^2.$$
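For comparison, a textbook CG iteration on the normal equations, stopped after $s$ steps so that the iterate lies in $\mathcal{K}_s(A, b)$; this is a generic sketch, not code from the talk:

```python
import numpy as np

def cg_normal_equations(X, Y, s, tol=1e-12):
    """Conjugate gradient for A beta = b with A = X^T X, b = X^T Y."""
    A = X.T @ X
    b = X.T @ Y
    beta = np.zeros(A.shape[0])
    r = b - A @ beta                     # residual b - A beta
    p = r.copy()                         # search direction
    for _ in range(s):
        if np.linalg.norm(r) < tol:      # already converged
            break
        Ap = A @ p
        step = (r @ r) / (p @ Ap)
        beta = beta + step * p
        r_new = r - step * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return beta
```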

  15. CG in deterministic setting. The CG algorithm has been studied in Nemirovskii (1986) as follows:
  • Consider $\bar A\beta = \bar b$ for a linear bounded operator $\bar A: H \to H$
  • Assume that only approximations $A$ of $\bar A$ and $b$ of $\bar b$ are given
  • Set $\hat\beta_s^{CG} = \arg\min_{\beta \in \mathcal{K}_s(A, b)} \|b - A\beta\|_H^2$

  16. CG in deterministic setting. Assume
  (A1) $\max\{\|\bar A\|_{op}, \|A\|_{op}\} \le L$, $\|\bar A - A\|_{op} \le \epsilon$ and $\|\bar b - b\|_H \le \delta$
  (A2) The stopping index $\hat s$ satisfies the discrepancy principle $\hat s = \min\{s > 0 : \|b - A\hat\beta_s\|_H < \tau(\delta\|\hat\beta_s\|_H + \epsilon)\}$, $\tau > 0$
  (A3) $\beta = \bar A^\mu u$ for $\|u\|_H \le R$, $\mu, R > 0$ (source condition).
  Theorem (Nemirovskii, 1986). Let (A1)–(A3) hold and $\hat s < \infty$. Then for any $\theta \in [0, 1]$
  $$\|\bar A^\theta(\hat\beta_{\hat s} - \beta)\|_H^2 \le C(\mu, \tau)\, R^{\frac{2(1-\theta)}{1+\mu}} \bigl(\epsilon + \delta R L^\mu\bigr)^{\frac{2(\theta+\mu)}{1+\mu}}.$$
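A small sketch of the discrepancy-principle stopping rule (A2); the argument `iterates` is assumed to be the sequence $\hat\beta_1, \hat\beta_2, \dots$ produced by CG on the perturbed problem, and the function name is illustrative:

```python
import numpy as np

def discrepancy_stop(A, b, iterates, tau, delta, eps):
    """Return the first index s with ||b - A beta_s|| < tau*(delta*||beta_s|| + eps),
    following the form of the stopping rule on the slide."""
    for s, beta_s in enumerate(iterates, start=1):
        if np.linalg.norm(b - A @ beta_s) < tau * (delta * np.linalg.norm(beta_s) + eps):
            return s, beta_s
    return None, None  # discrepancy level not reached within the supplied iterates
```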

  17. Kernel regression. A nonparametric model
  $$Y_i = f(X_i) + \epsilon_i, \qquad i = 1, \dots, n, \quad X_i \in \mathbb{R}^d,$$
  is handled in the reproducing kernel Hilbert space (RKHS) framework. Let $H$ be an RKHS, that is,
  • $(H, \langle\cdot,\cdot\rangle_H)$ is a Hilbert space of functions $f: \mathbb{R}^d \to \mathbb{R}$ with
  • a kernel function $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, s.t. $k(\cdot, x) \in H$ and $f(x) = \langle f, k(\cdot, x)\rangle_H$, $x \in \mathbb{R}^d$, $f \in H$.
  The unknown $f$ is estimated by $\hat f = \sum_{i=1}^n \hat\alpha_i k(\cdot, X_i)$.
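To make the representer form concrete, here is a small sketch with a Gaussian kernel; the kernel choice and the bandwidth `gamma` are illustrative assumptions, not part of the talk:

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    """A bounded, measurable kernel (it also satisfies (C2) on the later slides)."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def f_hat(x, alpha, X_train, kernel=gaussian_kernel):
    """Evaluate the estimator f_hat(x) = sum_i alpha_i k(x, X_i)."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, X_train))
```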

  18. Kernel regression. Define the following operators:
  • Sample evaluation operator (analogue of $X$): $T_n: f \in H \mapsto \{f(X_1), \dots, f(X_n)\}^T \in \mathbb{R}^n$
  • Sample kernel integral operator (analogue of $X^T/n$): $T_n^*: u \in \mathbb{R}^n \mapsto n^{-1}\sum_{i=1}^n k(\cdot, X_i)u_i \in H$
  • Sample kernel covariance operator (analogue of $X^T X/n$): $S_n = T_n^* T_n: f \in H \mapsto n^{-1}\sum_{i=1}^n f(X_i)k(\cdot, X_i) \in H$
  • Sample kernel (analogue of $XX^T/n$): $K_n = T_n T_n^* = n^{-1}\{k(X_i, X_j)\}_{i,j=1}^n$
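These sample operators translate into simple numpy objects; a sketch under the same illustrative kernel as above (function names are hypothetical):

```python
import numpy as np

def sample_kernel_matrix(X, kernel):
    """K_n = n^{-1} {k(X_i, X_j)}_{i,j=1}^n, the analogue of X X^T / n."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]) / n

def T_n(f, X):
    """Sample evaluation operator: f in H -> (f(X_1), ..., f(X_n))^T."""
    return np.array([f(x) for x in X])

def T_n_star(u, X, kernel):
    """Sample kernel integral operator: u in R^n -> n^{-1} sum_i k(., X_i) u_i."""
    n = len(X)
    return lambda x: sum(u[i] * kernel(x, X[i]) for i in range(n)) / n
```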

  19. Kernel PLS and kernel CG. Now we can define the kernel PLS estimator as
  $$\hat\alpha_s = \arg\min_{\alpha \in \mathcal{K}_s(K_n, Y)} \|Y - K_n\alpha\|^2 = \arg\min_{\alpha \in \mathcal{K}_s(T_n T_n^*, Y)} \|Y - T_n T_n^*\alpha\|^2,$$
  or, equivalently, for $f = T_n^*\alpha$,
  $$\hat f_s = \arg\min_{f \in \mathcal{K}_s(S_n, T_n^* Y)} \|Y - T_n f\|^2, \qquad s = 1, \dots, n.$$
  The kernel CG estimator is then defined as
  $$\hat f_s^{CG} = \arg\min_{f \in \mathcal{K}_s(S_n, T_n^* Y)} \|T_n^*(Y - T_n f)\|_H^2.$$
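Combining the pieces, a naive sketch of the kernel PLS estimator via an explicit Krylov basis in $\mathcal{K}_s(K_n, Y)$; practical implementations would instead exploit the residual-polynomial recurrences of slide 13:

```python
import numpy as np

def kernel_pls(K_n, Y, s):
    """Minimise ||Y - K_n alpha||^2 over alpha in K_s(K_n, Y); the fitted function
    is then f_hat = T_n^* alpha, i.e. x -> n^{-1} sum_i alpha_i k(x, X_i)."""
    V = np.column_stack([np.linalg.matrix_power(K_n, j) @ Y for j in range(s)])
    c, *_ = np.linalg.lstsq(K_n @ V, Y, rcond=None)
    return V @ c
```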

  20. Results for kernel CG and PLS. Blanchard and Krämer (2010)
  • used a stochastic setting with i.i.d. data $(Y_i, X_i)$
  • proved convergence rates for kernel CG using ideas from Nemirovskii (1986), Hanke (1995) and Caponnetto & de Vito (2007)
  • argued that the proofs for kernel CG cannot be directly transferred to kernel PLS
  In this work we
  • use a stochastic setting with dependent data
  • prove convergence rates for kernel PLS, building on Hanke (1995) and Blanchard and Krämer (2010).

  21. Kernel PLS: assumptions. Consider now the model specified for the protein data
  $$Y_t = f(X_t) + \epsilon_t, \qquad t = 1, \dots, n.$$
  Let $H$ be an RKHS with kernel $k$ and assume
  (C1) $H$ is separable;
  (C2) $\exists\,\kappa > 0$ s.t. $|k(x, y)| \le \kappa$, $\forall x, y \in \mathbb{R}^d$, and $k$ is measurable.
  Under (C1) the Hilbert-Schmidt norm of operators from $H$ to $H$ is well-defined, and (C2) implies that all functions in $H$ are bounded.

  22. Kernel PLS: assumptions. Let $T$ and $T^*$ be the population versions of $T_n$ and $T_n^*$:
  $$T: f \in H \mapsto f \in L^2(P_{\tilde X}),$$
  $$T^*: f \in L^2(P_{\tilde X}) \mapsto \int f(x)\,k(\cdot, x)\,dP_{\tilde X}(x) \in H.$$
  These give the population versions of $S_n$ and $K_n$: $S = T^*T$ and $K = TT^*$. The operators $T$ and $T^*$ are adjoint, and $S$, $K$ are self-adjoint.

  23. Kernel PLS: assumptions. As in Nemirovskii (1986), we use the source condition as an assumption on the regularity of $f$:
  (SC) $\exists\, r \ge 0$, $R > 0$ and $u \in L^2(P_{\tilde X})$ s.t. $f = K^r u$ and $\|u\|_2 \le R$.
  If $r \ge 1/2$, then $f \in L^2(P_{\tilde X})$ coincides a.s. with $f_H \in H$ ($f = Tf_H$). The setting with $r < 1/2$ is referred to as the outer case.

  24. Kernel PLS: assumptions. Under suitable regularity conditions, due to Mercer's theorem,
  $$k(x, y) = \sum_{i=1}^{\infty} \eta_i \phi_i(x)\phi_i(y)$$
  for an orthonormal basis $\{\phi_i\}_{i=1}^{\infty}$ of $L^2(P_{\tilde X})$ and $\eta_1 \ge \eta_2 \ge \dots$. Hence,
  $$H = \Bigl\{f : f = \sum_i \theta_i\phi_i(x) \in L^2(P_{\tilde X}) \text{ and } \sum_i \frac{\theta_i^2}{\eta_i} < \infty\Bigr\}.$$
  The source condition corresponds to $f \in H^r$, where
  $$H^r = \Bigl\{f : f = \sum_i \theta_i\phi_i(x) \in L^2(P_{\tilde X}) \text{ and } \sum_i \frac{\theta_i^2}{\eta_i^{2r}} \le R^2\Bigr\}.$$
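Numerically, the Mercer eigenvalues and eigenfunctions are often approximated through the spectral decomposition of the sample kernel matrix $K_n$; this empirical stand-in is an added illustration under that assumption, not a statement from the talk:

```python
import numpy as np

def empirical_mercer(K_n):
    """Eigenpairs of the sample kernel matrix K_n, sorted so that eta_1 >= eta_2 >= ...;
    they approximate the Mercer eigenvalues and the eigenfunctions phi_i evaluated
    at the sample points (up to scaling)."""
    eta, phi = np.linalg.eigh(K_n)
    order = np.argsort(eta)[::-1]
    return eta[order], phi[:, order]
```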
