Kernel machine methods in genomics


  1. Kernel machine methods in genomics
  Debashis Ghosh, Department of Statistics, Penn State University
  ghoshd@psu.edu
  May 22, 2009 / Rao Prize Conference Seminar

  2. Outline
  1. Introduction (Background)
  2. Support vector machines
  3. RKHS
  4. Bayesian Approach (Numerical examples)
  5. Semiparametric model
  6. SVM and splines (SVMs and BLUPs)
  7. Simulation studies
  8. Conclusions

  3. Scientific Context
  • High-dimensional genomic data are now very commonplace in the medical and scientific literature
  • Examples: gene expression microarrays, single nucleotide polymorphisms, next-generation sequencing
  • Scientific goals: discovering new biology as well as targets for intervention

  4. “Large p, small n” problems
  • Scientific studies with small sample sizes and high-dimensional measurements are increasingly common
  • Examples:
    ◦ Spectroscopy
    ◦ Bioinformatics
  • Two goals: clustering and classification

  5. Support vector machines
  • One technique that has received a lot of attention: support vector machines (SVMs)
  • Claimed by Vapnik to “avoid overfitting”
  • Applications of SVMs:
    ◦ Microarray data (Brown et al., 2000, PNAS; Mukherjee et al., 2001, Bioinformatics)
    ◦ Protein folds (Hua and Sun, 2001, Journal of Molecular Biology)
    ◦ PubMed search: 1443 hits

  6. Support vector machines
  • Suppose that we have two groups of observations
  • Intuition behind SVMs: find the separating hyperplane that maximizes the margin between the two groups and perfectly classifies the observations
  • Margin: the distance between the hyperplane and the closest points
  • Sometimes a mapping to a higher-dimensional space is required to achieve perfect classification; this is achieved through the use of a kernel function

  7. SVM optimization problem
  • y = 1 / −1 (cancer / noncancer); z = gene expression profile
  • Use the gene expression profile to classify cancer/noncancer status
  • SVM classification problem formulation (separable case):
  $$\max_{\omega_0,\,\omega} \frac{1}{\|\omega\|} \quad \text{s.t.} \quad y_i(\omega_0 + z_i^T \omega) \ge 1, \quad i = 1, \ldots, n$$
  • Classification rule:
  $$\omega_0 + z^T \omega > 0 \;\Rightarrow\; \hat{y} = 1; \qquad \omega_0 + z^T \omega < 0 \;\Rightarrow\; \hat{y} = -1$$
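A minimal sketch, not from the talk, of the separable-case SVM above using scikit-learn's linear SVC on synthetic data standing in for gene expression profiles; the data, the large cost value C used to approximate the hard-margin problem, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, p = 40, 100                                   # "large p, small n" setting
z = rng.normal(size=(n, p))                      # synthetic gene expression profiles
y = np.where(z[:, 0] + z[:, 1] > 0, 1, -1)       # cancer (+1) / noncancer (-1) labels

# A very large C approximates the hard-margin (separable-case) formulation
clf = SVC(kernel="linear", C=1e6).fit(z, y)

# Classification rule: y_hat = sign(omega_0 + z^T omega)
omega, omega0 = clf.coef_.ravel(), clf.intercept_[0]
y_hat = np.where(omega0 + z @ omega > 0, 1, -1)
print("training accuracy:", (y_hat == y).mean())
```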

  8. SVM: 2-D representation (figure)

  9. Research goals
  Develop a formal inferential and statistical framework for SVMs and more general machine learning methods.
  Advantages:
  1. Probabilistic measures of predictiveness
  2. Avoid reliance on computationally intensive cross-validation
  3. Generalizations to nonlinear models

  10. Aside: Reproducing Kernel Hilbert Spaces (RKHS)
  • Let T be a general index set
  • RKHS: a Hilbert space of real-valued functions h on T with the property that for each t ∈ T there exists an M = M_t such that |h(t)| ≤ M‖h‖_H
  • There is a 1-1 correspondence between positive definite functions K defined on T × T and RKHSs of real-valued functions on T with K as the reproducing kernel (denoted H_K)

  11. RKHS (cont'd.)
  • If f(x) = β₀ + h(x), then the estimate of h is obtained by minimizing
  $$g(y, f(x)) + \lambda \|h\|_{H_K}^2,$$
  where g(·) is a loss function and λ > 0 is the smoothing parameter.
  • RKHS theory guarantees that the minimizer has the form
  $$f_\lambda(x) = \beta_0 + \sum_{i=1}^{n} \beta_i K(x, x_i); \qquad \|h\|_{H_K}^2 \equiv \sum_{i,j=1}^{n} \beta_i \beta_j K(x_i, x_j).$$
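A minimal sketch of the representer form above under two simplifying assumptions not made in the talk: squared-error loss for g and no intercept β₀, in which case the coefficients solve the kernel ridge system (K + λI)β = y. The polynomial kernel and toy data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

K = (X @ X.T + 1.0) ** 2              # Gram matrix K(x_i, x_j) for a polynomial kernel
lam = 0.1                             # smoothing parameter lambda
beta = np.linalg.solve(K + lam * np.eye(len(X)), y)

f_fit = K @ beta                      # f_lambda(x_i) = sum_j beta_j K(x_i, x_j)
penalty = beta @ K @ beta             # ||h||^2_{H_K} = sum_{i,j} beta_i beta_j K(x_i, x_j)
```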

  12. RKHS → Likelihood
  • Minimizing this objective function can equivalently be viewed as maximizing a penalized log-likelihood
  $$-\,g(y, f(x)) - \lambda \|h\|_{H_K}^2,$$
  where g(·) is such that exp[−g(·)] is proportional to the likelihood function, λ is the smoothing parameter, and ‖h‖²_{H_K} is the penalty function.
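A minimal sketch of this penalized-likelihood view with a logistic g (labels coded 0/1 here), the representer form f = Kβ, and a general-purpose optimizer; the linear kernel, data, and smoothing parameter are illustrative assumptions, and this is not the estimation method used later in the talk.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = (X[:, 0] > 0).astype(float)       # 0/1 labels for the logistic loss below
K = X @ X.T                           # linear kernel Gram matrix (illustrative choice)
lam = 0.5                             # smoothing parameter lambda

def objective(beta):
    f = K @ beta
    neg_loglik = np.sum(np.log1p(np.exp(f)) - y * f)   # negative log-likelihood, i.e. g(y, f)
    return neg_loglik + lam * beta @ K @ beta           # plus lambda * ||h||^2_{H_K}

beta_hat = minimize(objective, np.zeros(len(X))).x      # penalized likelihood estimate
```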

  13. Bayesian RKHS
  • First level: p(y_i | z_i) ∝ exp{−g(y_i | z_i)}, i = 1, …, n, where the y_i are conditionally independent given the z_i
  • z_i = f(x_i) + ε_i, where the ε_i are iid N(0, σ²) random variables
  • Key difference from other Bayesian methods: the introduction of the ε_i
  • f ∈ H_K ⇒ f(x_i) = β₀ + Σ_{j=1}^{n} β_j K(x_i, x_j | θ)
  • K′_i = (1, K(x_i, x_1 | θ), …, K(x_i, x_n | θ)), i = 1, …, n; β = (β₀, …, β_n)

  14. Hierarchical model
  $$p(y_i \mid z_i) \propto \exp\{-g(y_i \mid z_i)\} \qquad (1)$$
  $$z_i \mid \beta, \theta, \sigma^2 \overset{\text{ind}}{\sim} N_1(z_i \mid K_i'\beta,\ \sigma^2) \qquad (2)$$
  $$\beta \mid \sigma^2 \sim N_{n+1}(\beta \mid 0,\ \sigma^2 D_*^{-1}), \qquad \sigma^2 \sim IG(\sigma^2 \mid \gamma_1, \gamma_2),$$
  $$\theta \sim \prod_{q=1}^{p} U(a_{q1}, a_{q2}), \qquad \lambda \sim \mathrm{Gamma}(m, c),$$
  where $D_* \equiv \mathrm{Diag}(\lambda_1, \lambda, \ldots, \lambda)$ is an (n + 1) × (n + 1) diagonal matrix.
  • Can extend to have multiple smoothing parameters
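A minimal sketch of one draw from the prior side of this hierarchy on a toy kernel matrix; the gamma and inverse-gamma parameter values, the uniform bounds for θ, and the Gaussian-kernel bandwidth convention are assumptions for illustration, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 4
X = rng.normal(size=(n, p))

theta = rng.uniform(0.5, 2.0)                    # theta ~ U(a_1, a_2) (assumed bounds)
lam = rng.gamma(shape=2.0, scale=1.0)            # lambda ~ Gamma(m, c) (assumed m, c)
lam1 = rng.gamma(shape=2.0, scale=1.0)           # separate smoothing parameter for beta_0
sigma2 = 1.0 / rng.gamma(shape=3.0, scale=0.5)   # sigma^2 ~ IG(gamma_1, gamma_2) via 1/Gamma

# Rows K'_i = (1, K(x_i, x_1 | theta), ..., K(x_i, x_n | theta)) stacked into an n x (n+1) matrix
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
Kmat = np.column_stack([np.ones(n), np.exp(-d2 / theta)])

# beta | sigma^2 ~ N_{n+1}(0, sigma^2 D_*^{-1}),  D_* = Diag(lambda_1, lambda, ..., lambda)
Dstar_inv = np.diag([1.0 / lam1] + [1.0 / lam] * n)
beta = rng.multivariate_normal(np.zeros(n + 1), sigma2 * Dstar_inv)

# z_i | beta, theta, sigma^2 ~ N(K'_i beta, sigma^2)
z = Kmat @ beta + np.sqrt(sigma2) * rng.normal(size=n)
```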

  15. Candidate likelihoods
  • Logistic model:
  $$g(y \mid z) = yz - \log\{1 + \exp(z)\}$$
  • SVM likelihood:
  $$g(y \mid z) = \frac{1}{1 + \exp(-2yz)} \ \ \text{for } |z| \le 1; \qquad g(y \mid z) = \frac{1}{1 + \exp[-y\{z + \mathrm{sgn}(z)\}]} \ \ \text{otherwise},$$
  where sgn(u) = 1, 0, or −1 according as u is greater than, equal to, or less than 0.
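A minimal sketch of the two candidate likelihoods as Python functions, following the formulas as reconstructed above (logistic with y coded 0/1, SVM with y coded ±1); whether the slide's g denotes the log-likelihood or the likelihood itself should be checked against the source paper.

```python
import numpy as np

def logistic_g(y, z):
    # y*z - log(1 + exp(z)): the logistic log-likelihood with y coded 0/1
    return y * z - np.log1p(np.exp(z))

def svm_lik(y, z):
    # 1/(1 + exp(-2yz)) for |z| <= 1; 1/(1 + exp(-y(z + sgn(z)))) otherwise, y in {-1, +1}
    inner = np.where(np.abs(z) <= 1, 2 * y * z, y * (z + np.sign(z)))
    return 1.0 / (1.0 + np.exp(-inner))
```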

  16. Hierarchical model (cont'd.)
  • Choices for K:
    (i) Gaussian kernel: K(x_i, x_j) = exp{−‖x_i − x_j‖² / θ}
    (ii) Polynomial kernel: K(x_i, x_j) = (x_i · x_j + 1)^θ,
  where a · b denotes the inner product of the two vectors a and b.
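A minimal sketch of the two kernel choices as functions of a data matrix; treating θ as a bandwidth divisor in the Gaussian kernel is an assumption about the notation above. The last lines check the positive semidefiniteness of the resulting Gram matrices, the property the RKHS construction relies on.

```python
import numpy as np

def gaussian_kernel(X1, X2, theta=1.0):
    # K(x_i, x_j) = exp{-||x_i - x_j||^2 / theta}
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / theta)

def polynomial_kernel(X1, X2, theta=2):
    # K(x_i, x_j) = (x_i . x_j + 1)^theta
    return (X1 @ X2.T + 1.0) ** theta

X = np.random.default_rng(4).normal(size=(15, 3))
for K in (gaussian_kernel(X, X), polynomial_kernel(X, X)):
    print(np.linalg.eigvalsh(K).min() >= -1e-8)      # True: Gram matrices are positive semidefinite
```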

  17. Bayesian Analysis
  • The introduction of ε₁, …, ε_n facilitates the use of MCMC methods
  • Iterate through the steps:
    (i) update z (Metropolis-Hastings);
    (ii) update K, β, σ² (Metropolis-Hastings for K; standard conjugate updates for β and σ²);
    (iii) update λ (standard)
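A minimal sketch of step (i): a random-walk Metropolis-Hastings update for a single latent z_i, whose full conditional is proportional to p(y_i | z_i) · N(z_i | K′_iβ, σ²). The proposal scale and the particular log-likelihood passed in are illustrative assumptions, not the samplers used in the talk.

```python
import numpy as np

def update_z_i(z_i, y_i, mean_i, sigma2, log_lik, rng, step=0.5):
    """One random-walk MH update; log_lik(y, z) returns log p(y | z)."""
    def log_target(z):
        # log of p(y_i | z) * N(z | K'_i beta, sigma^2), dropping constants
        return log_lik(y_i, z) - 0.5 * (z - mean_i) ** 2 / sigma2
    z_prop = z_i + step * rng.normal()
    if np.log(rng.uniform()) < log_target(z_prop) - log_target(z_i):
        return z_prop
    return z_i

# Example call with a logistic-type log-likelihood for a +1/-1 label (illustrative only)
rng = np.random.default_rng(5)
z_new = update_z_i(0.0, 1, mean_i=0.3, sigma2=1.0,
                   log_lik=lambda y, z: -np.log1p(np.exp(-y * z)), rng=rng)
```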

  18. Prediction and Model Choice
  For a new sample with gene expression x_new, the posterior predictive probability that its tissue type, denoted by y_new, is cancerous is given by
  $$p(y_{\text{new}} = 1 \mid x_{\text{new}}, y) = \int p(y_{\text{new}} = 1 \mid x_{\text{new}}, \phi)\, p(\phi \mid y)\, d\phi, \qquad (3)$$
  where φ is the vector of all the model parameters. The integral in (3) can be approximated by its Monte Carlo estimate
  $$\frac{1}{M} \sum_{i=1}^{M} p(y_{\text{new}} = 1 \mid x_{\text{new}}, \phi^{(i)}), \qquad (4)$$
  where φ^(1), …, φ^(M) are retained posterior draws.
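A minimal sketch of the Monte Carlo estimate (4); posterior_draws and predict_prob are hypothetical placeholders for the retained MCMC draws φ^(i) and the per-draw predictive probability p(y_new = 1 | x_new, φ^(i)).

```python
import numpy as np

def posterior_predictive(x_new, posterior_draws, predict_prob):
    # Average p(y_new = 1 | x_new, phi^(i)) over the M retained draws, as in (4)
    return np.mean([predict_prob(x_new, phi) for phi in posterior_draws])

# Classify the new sample as cancerous when the posterior predictive probability exceeds 1/2:
# y_hat = 1 if posterior_predictive(x_new, draws, predict_prob) > 0.5 else -1
```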

  19. Prediction and Model Choice (cont'd.)
  • To select among the different models, we will generally use misclassification error.
  • If a test set is available, build the model on the training set and use it to classify the test samples.
  • If no test set is available, use the method of Gelfand (1996) for estimating the cross-validation predictive density.
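A minimal sketch of model choice by test-set misclassification error, assuming +1/−1 labels and posterior predictive probabilities computed as above; the Gelfand (1996) cross-validation predictive density is not implemented here.

```python
import numpy as np

def misclassification_error(y_test, prob_cancer, cutoff=0.5):
    # Threshold the posterior predictive probabilities and compare with the true +1/-1 labels
    y_hat = np.where(np.asarray(prob_cancer) > cutoff, 1, -1)
    return float(np.mean(y_hat != np.asarray(y_test)))
```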
