Kernel partial least squares for stationary data
  1. Kernel partial least squares for stationary data. Tatyana Krivobokova, Marco Singer, Axel Munk (Georg-August-Universität Göttingen); Bert de Groot (Max Planck Institute for Biophysical Chemistry). Van Dantzig Seminar, 06 April 2017

  2. Motivating example. Proteins
  • are large biological molecules
  • their function often requires dynamics
  • their configuration space is high-dimensional
  The group of Bert de Groot seeks to identify a relationship between the collective atomic motions of a protein and a specific (biological) function of that protein.

  3. Motivating example. The data from Molecular Dynamics (MD) simulations:
  • $Y_t \in \mathbb{R}$ is a functional quantity of interest at time $t$, $t = 1, \dots, n$
  • $X_t \in \mathbb{R}^{3N}$ are the Euclidean coordinates of $N$ atoms at time $t$
  Stylized facts:
  • $d = 3N$ is typically high, but $d \ll n$
  • $\{X_t\}_t$, $\{Y_t\}_t$ are (non-)stationary time series
  • some (large) atom movements might be unrelated to $Y_t$
  The functional quantity $Y_t$ is to be modelled as a function of $X_t$.

  4. Yeast aquaporin (AQY1)
  • a gated water channel
  • $Y_t$ is the opening diameter (red line)
  • 783 backbone atoms
  • $n = 20{,}000$ observations on a 100 ns timeframe

  5. AQY1 time series. Movements of the first atom and the diameter of the channel opening. [Figure: two panels over time in ns; left axis "Coordinate", right axis "Diameter in nm".]

  6. Model. Assume
  $$Y_t = f(X_t) + \epsilon_t, \qquad t = 1, \dots, n,$$
  where
  • $\{X_t\}_t$ is a $d$-dimensional stationary time series
  • $\{\epsilon_t\}_t$ is an i.i.d. zero-mean sequence independent of $\{X_t\}_t$
  • $f \in L^2(P_{\tilde X})$, where $\tilde X$ is independent of $\{X_t\}_t$ and $\{\epsilon_t\}_t$ and $P_{\tilde X} = P_{X_1}$
  The closeness of an estimator $\hat f$ of $f$ is measured by
  $$\|\hat f - f\|_2^2 = E\bigl\{\hat f(\tilde X) - f(\tilde X)\bigr\}^2.$$

  7. Simple linear case. Hub, J. S. and de Groot, B. L. (2009) assumed a linear model
  $$Y_i = X_i^T\beta + \epsilon_i, \qquad i = 1, \dots, n, \quad X_i \in \mathbb{R}^d,$$
  or in matrix form $Y = X\beta + \epsilon$, ignored the dependence in the data and tried to regularise the estimator using PCA.

  8. Motivating example. PC regression with 50 components. [Figure: left panel, time series over time in ns; right panel, correlation vs. number of components.]

  9. Motivating example. Partial Least Squares (PLS) leads to superior results. [Figure: correlation vs. number of components for PLS and PCR.]

  10. Regularisation with PCR and PLS. Consider a linear regression model with fixed design $Y = X\beta + \epsilon$. In the following let $A = X^T X$ and $b = X^T Y$. PCR and PLS regularise $\beta$ with a transformation $H \in \mathbb{R}^{d \times s}$ such that
  $$\hat\beta_s = H \arg\min_{\alpha \in \mathbb{R}^s} \frac{1}{n}\|Y - XH\alpha\|^2 = H(H^T A H)^{-1} H^T b,$$
  where $s \le d$ plays the role of a regularisation parameter. In PCR the matrix $H$ consists of the first $s$ eigenvectors of $A = X^T X$.
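A minimal numerical sketch of this projection estimator in plain numpy; the function name and the use of `numpy.linalg.eigh` are illustrative choices, not taken from the talk:

```python
import numpy as np

def pcr_estimator(X, Y, s):
    """PCR as a projection estimator: H = first s eigenvectors of A = X^T X,
    beta_s = H (H^T A H)^{-1} H^T b with b = X^T Y."""
    A = X.T @ X
    b = X.T @ Y
    eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    H = eigvecs[:, ::-1][:, :s]            # first s principal eigenvectors
    return H @ np.linalg.solve(H.T @ A @ H, H.T @ b)
```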

  11. Regularisation with PLS. In PLS one derives $H = (h_1, \dots, h_s)$, $h_i \in \mathbb{R}^d$, as follows:
  1. Find $h_1 = \arg\max_{h \in \mathbb{R}^d,\ \|h\|=1} \operatorname{cov}(Xh, Y)^2 \propto X^T Y = b$
  2. Project $Y$ orthogonally onto $Xh_1$: $X\hat\beta_1 = Xh_1(h_1^T A h_1)^{-1} h_1^T X^T Y$
  3. Iterate the procedure according to $h_i = \arg\max_{h \in \mathbb{R}^d,\ \|h\|=1} \operatorname{cov}(Xh, Y - X\hat\beta_{i-1})^2$, $i = 2, \dots, s$
  Apparently, $\hat\beta_s$ is highly non-linear in $Y$.
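A hedged sketch of these three steps; this is an illustration of the greedy construction above (each direction built from the current residual, then the projection estimator of the previous slide recomputed), not the authors' implementation:

```python
import numpy as np

def pls_estimator(X, Y, s):
    """Greedy PLS as on the slide: each new direction h_i is proportional to
    X^T (current residual); beta is the projection estimator built from all
    directions found so far (cf. the H-formula of the previous slide)."""
    A = X.T @ X
    b = X.T @ Y
    directions = []
    residual = Y.copy()
    beta = np.zeros(X.shape[1])
    for _ in range(s):
        h = X.T @ residual
        h = h / np.linalg.norm(h)          # normalise the covariance direction
        directions.append(h)
        H = np.column_stack(directions)
        beta = H @ np.linalg.solve(H.T @ A @ H, H.T @ b)
        residual = Y - X @ beta            # deflate Y for the next direction
    return beta
```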

  12. Regularisation with PLS. For PLS it is known that $h_i \in \mathcal{K}_i(A, b)$, $i = 1, \dots, s$, where
  $$\mathcal{K}_i(A, b) = \operatorname{span}\{b, Ab, \dots, A^{i-1}b\}$$
  is a Krylov space of order $i$. With this, the alternative definition of PLS is
  $$\hat\beta_s = \arg\min_{\beta \in \mathcal{K}_s(A, b)} \|Y - X\beta\|^2.$$
  Note that any $\beta_s \in \mathcal{K}_s(A, b)$ can be represented as
  $$\beta_s = P_s(A)\,b = P_s(X^T X)\,X^T Y = X^T P_s(XX^T)\,Y,$$
  where $P_s$ is a polynomial of degree at most $s - 1$.
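The Krylov-space characterisation translates directly into a (numerically naive) sketch; real implementations orthogonalise the basis, which is omitted here for clarity:

```python
import numpy as np

def pls_via_krylov(X, Y, s):
    """Minimise ||Y - X beta||^2 over beta in K_s(A, b), A = X^T X, b = X^T Y,
    by writing beta = V c with the explicit Krylov basis V = [b, Ab, ..., A^{s-1} b]."""
    A = X.T @ X
    b = X.T @ Y
    V = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(s)])
    c, *_ = np.linalg.lstsq(X @ V, Y, rcond=None)   # least squares in the s coefficients
    return V @ c
```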

  13. Regularisation with PLS. For the implementation and the proofs, the residual polynomials $R_s(x) = 1 - xP_s(x)$ are of interest. The polynomials $R_s$
  • are orthogonal w.r.t. an appropriate inner product
  • satisfy a recurrence relation $R_{s+1}(x) = a_s x R_s(x) + b_s R_s(x) + c_s R_{s-1}(x)$
  • are convex on $[0, r_s]$, where $r_s$ is the first root of $R_s(x)$, and satisfy $R_s(0) = 1$.

  14. PLS and conjugate gradient. PLS is closely related to the conjugate gradient (CG) algorithm for
  $$A\beta = X^T X\beta = X^T Y = b.$$
  The solution of this linear equation by CG is defined by
  $$\hat\beta_s^{CG} = \arg\min_{\beta \in \mathcal{K}_s(A, b)} \|b - A\beta\|^2 = \arg\min_{\beta \in \mathcal{K}_s(A, b)} \|X^T(Y - X\beta)\|^2.$$
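For comparison, a textbook CG iteration on the normal equations, stopped after $s$ steps so that the iterate lies in $\mathcal{K}_s(A, b)$; this is a generic sketch, not code from the talk:

```python
import numpy as np

def cg_normal_equations(X, Y, s, tol=1e-12):
    """Conjugate gradient for A beta = b with A = X^T X, b = X^T Y."""
    A = X.T @ X
    b = X.T @ Y
    beta = np.zeros(A.shape[0])
    r = b - A @ beta                     # residual b - A beta
    p = r.copy()                         # search direction
    for _ in range(s):
        if np.linalg.norm(r) < tol:      # already converged
            break
        Ap = A @ p
        step = (r @ r) / (p @ Ap)
        beta = beta + step * p
        r_new = r - step * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return beta
```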

  15. CG in deterministic setting. The CG algorithm has been studied in Nemirovskii (1986) as follows:
  • Consider $\bar A\beta = \bar b$ for a linear bounded operator $\bar A: H \to H$
  • Assume that only approximations $A$ of $\bar A$ and $b$ of $\bar b$ are given
  • Set $\hat\beta_s^{CG} = \arg\min_{\beta \in \mathcal{K}_s(A, b)} \|b - A\beta\|_H^2$

  16. CG in deterministic setting. Assume
  (A1) $\max\{\|\bar A\|_{op}, \|A\|_{op}\} \le L$, $\|\bar A - A\|_{op} \le \epsilon$ and $\|\bar b - b\|_H \le \delta$
  (A2) The stopping index $\hat s$ satisfies the discrepancy principle $\hat s = \min\{s > 0 : \|b - A\hat\beta_s\|_H < \tau(\delta\|\hat\beta_s\|_H + \epsilon)\}$, $\tau > 0$
  (A3) $\beta = \bar A^\mu u$ for $\|u\|_H \le R$, $\mu, R > 0$ (source condition).
  Theorem (Nemirovskii, 1986). Let (A1)–(A3) hold and $\hat s < \infty$. Then for any $\theta \in [0, 1]$
  $$\|\bar A^\theta(\hat\beta_{\hat s} - \beta)\|_H^2 \le C(\mu, \tau)\, R^{\frac{2(1-\theta)}{1+\mu}} \bigl(\epsilon + \delta R L^\mu\bigr)^{\frac{2(\theta+\mu)}{1+\mu}}.$$
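A small sketch of the discrepancy-principle stopping rule (A2); the argument `iterates` is assumed to be the sequence $\hat\beta_1, \hat\beta_2, \dots$ produced by CG on the perturbed problem, and the function name is illustrative:

```python
import numpy as np

def discrepancy_stop(A, b, iterates, tau, delta, eps):
    """Return the first index s with ||b - A beta_s|| < tau*(delta*||beta_s|| + eps),
    following the form of the stopping rule on the slide."""
    for s, beta_s in enumerate(iterates, start=1):
        if np.linalg.norm(b - A @ beta_s) < tau * (delta * np.linalg.norm(beta_s) + eps):
            return s, beta_s
    return None, None  # discrepancy level not reached within the supplied iterates
```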

  17. Kernel regression. A nonparametric model
  $$Y_i = f(X_i) + \epsilon_i, \qquad i = 1, \dots, n, \quad X_i \in \mathbb{R}^d,$$
  is handled in the reproducing kernel Hilbert space (RKHS) framework. Let $H$ be an RKHS, that is,
  • $(H, \langle\cdot,\cdot\rangle_H)$ is a Hilbert space of functions $f: \mathbb{R}^d \to \mathbb{R}$ with
  • a kernel function $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, s.t. $k(\cdot, x) \in H$ and $f(x) = \langle f, k(\cdot, x)\rangle_H$, $x \in \mathbb{R}^d$, $f \in H$.
  The unknown $f$ is estimated by $\hat f = \sum_{i=1}^n \hat\alpha_i k(\cdot, X_i)$.
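To make the representer form concrete, here is a small sketch with a Gaussian kernel; the kernel choice and the bandwidth `gamma` are illustrative assumptions, not part of the talk:

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    """A bounded, measurable kernel (it also satisfies (C2) on the later slides)."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def f_hat(x, alpha, X_train, kernel=gaussian_kernel):
    """Evaluate the estimator f_hat(x) = sum_i alpha_i k(x, X_i)."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, X_train))
```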

  18. Kernel regression. Define the following operators:
  • Sample evaluation operator (analogue of $X$): $T_n: f \in H \mapsto \{f(X_1), \dots, f(X_n)\}^T \in \mathbb{R}^n$
  • Sample kernel integral operator (analogue of $X^T/n$): $T_n^*: u \in \mathbb{R}^n \mapsto n^{-1}\sum_{i=1}^n k(\cdot, X_i)u_i \in H$
  • Sample kernel covariance operator (analogue of $X^T X/n$): $S_n = T_n^* T_n: f \in H \mapsto n^{-1}\sum_{i=1}^n f(X_i)k(\cdot, X_i) \in H$
  • Sample kernel (analogue of $XX^T/n$): $K_n = T_n T_n^* = n^{-1}\{k(X_i, X_j)\}_{i,j=1}^n$
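These sample operators translate into simple numpy objects; a sketch under the same illustrative kernel as above (function names are hypothetical):

```python
import numpy as np

def sample_kernel_matrix(X, kernel):
    """K_n = n^{-1} {k(X_i, X_j)}_{i,j=1}^n, the analogue of X X^T / n."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]) / n

def T_n(f, X):
    """Sample evaluation operator: f in H -> (f(X_1), ..., f(X_n))^T."""
    return np.array([f(x) for x in X])

def T_n_star(u, X, kernel):
    """Sample kernel integral operator: u in R^n -> n^{-1} sum_i k(., X_i) u_i."""
    n = len(X)
    return lambda x: sum(u[i] * kernel(x, X[i]) for i in range(n)) / n
```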

  19. Kernel PLS and kernel CG. Now we can define the kernel PLS estimator as
  $$\hat\alpha_s = \arg\min_{\alpha \in \mathcal{K}_s(K_n, Y)} \|Y - K_n\alpha\|^2 = \arg\min_{\alpha \in \mathcal{K}_s(T_n T_n^*, Y)} \|Y - T_n T_n^*\alpha\|^2,$$
  or, equivalently, for $f = T_n^*\alpha$,
  $$\hat f_s = \arg\min_{f \in \mathcal{K}_s(S_n, T_n^* Y)} \|Y - T_n f\|^2, \qquad s = 1, \dots, n.$$
  The kernel CG estimator is then defined as
  $$\hat f_s^{CG} = \arg\min_{f \in \mathcal{K}_s(S_n, T_n^* Y)} \|T_n^*(Y - T_n f)\|_H^2.$$
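Combining the pieces, a naive sketch of the kernel PLS estimator via an explicit Krylov basis in $\mathcal{K}_s(K_n, Y)$; practical implementations would instead exploit the residual-polynomial recurrences of slide 13:

```python
import numpy as np

def kernel_pls(K_n, Y, s):
    """Minimise ||Y - K_n alpha||^2 over alpha in K_s(K_n, Y); the fitted function
    is then f_hat = T_n^* alpha, i.e. x -> n^{-1} sum_i alpha_i k(x, X_i)."""
    V = np.column_stack([np.linalg.matrix_power(K_n, j) @ Y for j in range(s)])
    c, *_ = np.linalg.lstsq(K_n @ V, Y, rcond=None)
    return V @ c
```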

  20. Results for kernel CG and PLS. Blanchard and Krämer (2010)
  • used a stochastic setting with i.i.d. data $(Y_i, X_i)$
  • proved convergence rates for kernel CG using ideas from Nemirovskii (1986), Hanke (1995) and Caponnetto & de Vito (2007)
  • argued that the proofs for kernel CG cannot be directly transferred to kernel PLS
  In this work we
  • use a stochastic setting with dependent data
  • prove convergence rates for kernel PLS, building on Hanke (1995) and Blanchard and Krämer (2010).

  21. Kernel PLS: assumptions. Consider now the model specified for the protein data
  $$Y_t = f(X_t) + \epsilon_t, \qquad t = 1, \dots, n.$$
  Let $H$ be an RKHS with kernel $k$ and assume
  (C1) $H$ is separable;
  (C2) $\exists\,\kappa > 0$ s.t. $|k(x, y)| \le \kappa$, $\forall x, y \in \mathbb{R}^d$, and $k$ is measurable.
  Under (C1) the Hilbert-Schmidt norm of operators from $H$ to $H$ is well-defined, and (C2) implies that all functions in $H$ are bounded.

  22. Kernel PLS: assumptions. Let $T$ and $T^*$ be the population versions of $T_n$ and $T_n^*$:
  $$T: f \in H \mapsto f \in L^2(P_{\tilde X}),$$
  $$T^*: f \in L^2(P_{\tilde X}) \mapsto \int f(x)\,k(\cdot, x)\,dP_{\tilde X}(x) \in H.$$
  These give the population versions of $S_n$ and $K_n$: $S = T^*T$ and $K = TT^*$. The operators $T$ and $T^*$ are adjoint, and $S$, $K$ are self-adjoint.

  23. Kernel PLS: assumptions. As in Nemirovskii (1986), we use the source condition as an assumption on the regularity of $f$:
  (SC) $\exists\, r \ge 0$, $R > 0$ and $u \in L^2(P_{\tilde X})$ s.t. $f = K^r u$ and $\|u\|_2 \le R$.
  If $r \ge 1/2$, then $f \in L^2(P_{\tilde X})$ coincides a.s. with $f_H \in H$ ($f = Tf_H$). The setting with $r < 1/2$ is referred to as the outer case.

  24. Kernel PLS: assumptions. Under suitable regularity conditions, due to Mercer's theorem,
  $$k(x, y) = \sum_{i=1}^{\infty} \eta_i \phi_i(x)\phi_i(y)$$
  for an orthonormal basis $\{\phi_i\}_{i=1}^{\infty}$ of $L^2(P_{\tilde X})$ and $\eta_1 \ge \eta_2 \ge \dots$. Hence,
  $$H = \Bigl\{f : f = \sum_i \theta_i\phi_i(x) \in L^2(P_{\tilde X}) \text{ and } \sum_i \frac{\theta_i^2}{\eta_i} < \infty\Bigr\}.$$
  The source condition corresponds to $f \in H^r$, where
  $$H^r = \Bigl\{f : f = \sum_i \theta_i\phi_i(x) \in L^2(P_{\tilde X}) \text{ and } \sum_i \frac{\theta_i^2}{\eta_i^{2r}} \le R^2\Bigr\}.$$
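Numerically, the Mercer eigenvalues and eigenfunctions are often approximated through the spectral decomposition of the sample kernel matrix $K_n$; this empirical stand-in is an added illustration under that assumption, not a statement from the talk:

```python
import numpy as np

def empirical_mercer(K_n):
    """Eigenpairs of the sample kernel matrix K_n, sorted so that eta_1 >= eta_2 >= ...;
    they approximate the Mercer eigenvalues and the eigenfunctions phi_i evaluated
    at the sample points (up to scaling)."""
    eta, phi = np.linalg.eigh(K_n)
    order = np.argsort(eta)[::-1]
    return eta[order], phi[:, order]
```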
