 
Feature selection in high dimension for precision medicine
Chloé-Agathe Azencott
Center for Computational Biology (CBIO), Mines ParisTech – Institut Curie – INSERM U900, PSL Research University, Paris, France
March 21, 2017 – MACARON Workshop
http://cazencott.info · chloe-agathe.azencott@mines-paristech.fr · @cazencott
Precision Medicine
◮ Treatment adapted to the (genetic) specificities of the patient. E.g. Trastuzumab for HER2+ breast cancer.
◮ Data-driven biology/medicine: identify similarities between patients that exhibit similar susceptibilities / prognoses / responses to treatment.
Sequencing costs
Big data!
GWAS: Genome-Wide Association Studies
Which genomic features explain the phenotype?
p = $10^5$ – $10^7$ Single Nucleotide Polymorphisms (SNPs)
n = $10^2$ – $10^4$ samples
◮ High-dimensional (large p)
◮ Low sample size (small n)
Google Flu Trends
D. Lazer, R. Kennedy, G. King and A. Vespignani. The Parable of Google Flu: Traps in Big Data Analysis. Science 2014
◮ p = 50 million search terms
◮ n = 1152 data points
◮ Predictive search terms include keywords related to high school basketball.
Is extracting information from this data doomed from the start?
GWAS successes
[Figure panels: multiple sclerosis, HaemGen consortium, ankylosing spondylitis]
P. Visscher, M. Brown, M. McCarthy, J. Yang. Five years of GWAS discovery. AJHG 2012.
Missing heritability
GWAS fail to explain most of the heritable variability of complex traits. Many possible reasons:
– non-genetic / non-SNP factors
– heterogeneity of the phenotype
– rare SNPs
– weak effect sizes
– few samples in high dimension (p ≫ n)
– joint effects of multiple SNPs.
Integrating prior knowledge
Use additional data and prior knowledge to constrain the feature selection procedure.
– Consistent with previously established knowledge
– More easily interpretable
– Increased statistical power.
Prior knowledge can be represented as structure:
– Linear structure of DNA
– Groups: e.g. pathways
– Networks (molecular, 3D structure).
Regularized relevance
Set $\mathcal{V}$ of $p$ variables.
◮ Relevance score $R : 2^{\mathcal{V}} \to \mathbb{R}$. Quantifies the importance of any subset of variables for the question under consideration. Ex: correlation, HSIC, statistical test of association.
◮ Structured regularizer $\Omega : 2^{\mathcal{V}} \to \mathbb{R}$. Promotes a sparsity pattern that is compatible with the constraint on the feature space. Ex: cardinality $\Omega : \mathcal{S} \mapsto |\mathcal{S}|$.
◮ Regularized relevance
$$\arg\max_{\mathcal{S} \subseteq \mathcal{V}} \; R(\mathcal{S}) - \lambda\, \Omega(\mathcal{S})$$
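To make the regularized relevance criterion concrete, here is a minimal sketch that maximizes it by exhaustive search on a toy problem. It assumes an additive relevance score (the sum of per-variable scores) and the cardinality regularizer; the function name and the score values are hypothetical, and the search is exponential in $p$, so it only illustrates the definition.

```python
from itertools import combinations

def regularized_relevance_argmax(scores, lam):
    """Exhaustively maximize R(S) - lam * Omega(S) over all subsets S.

    Assumptions (not from the slides): R(S) is the sum of per-variable
    scores and Omega(S) = |S|. Only feasible for a handful of variables.
    """
    variables = range(len(scores))
    best_subset, best_value = frozenset(), 0.0  # the empty set scores 0
    for k in range(1, len(scores) + 1):
        for subset in combinations(variables, k):
            value = sum(scores[i] for i in subset) - lam * k
            if value > best_value:
                best_subset, best_value = frozenset(subset), value
    return best_subset, best_value

scores = [0.9, 0.1, 0.4, 0.05]  # hypothetical per-variable relevance scores
selected, value = regularized_relevance_argmax(scores, lam=0.3)
print(sorted(selected), round(value, 2))
```

With an additive score and a cardinality penalty, the optimum simply keeps every variable whose score exceeds $\lambda$; structured regularizers are what make the problem interesting.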
Network-guided multi-locus GWAS
Goal: Find a set of explanatory SNPs compatible with a given network structure.
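Network-guided GWAS
◮ Additive test of association: SKAT [Wu et al. 2011]
$$R(\mathcal{S}) = \sum_{i \in \mathcal{S}} c_i, \qquad c_i = \left(G^\top (y - \mu)\right)_i^2$$
◮ Sparse Laplacian regularization
$$\Omega : \mathcal{S} \mapsto \sum_{i \in \mathcal{S}} \sum_{j \notin \mathcal{S}} W_{ij} + \alpha |\mathcal{S}|$$
◮ Regularized maximization of $R$
$$\arg\max_{\mathcal{S} \subseteq \mathcal{V}} \; \underbrace{\sum_{i \in \mathcal{S}} c_i}_{\text{association}} - \underbrace{\eta |\mathcal{S}|}_{\text{sparsity}} - \underbrace{\lambda \sum_{i \in \mathcal{S}} \sum_{j \notin \mathcal{S}} W_{ij}}_{\text{connectivity}} \qquad (*)$$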
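The three terms of this objective can be evaluated directly for any candidate set of SNPs. The sketch below is illustrative only: it assumes that $\mu$ is the mean phenotype (the slide does not define it), stores $G$ as a list of sample rows, and uses hypothetical function names and toy data.

```python
def skat_scores(G, y):
    """Per-SNP scores c_i = (G^T (y - mu))_i^2 for an n x p genotype matrix G.

    Assumption: mu is the sample mean of the phenotype y.
    """
    n, p = len(G), len(G[0])
    mu = sum(y) / n
    resid = [yk - mu for yk in y]
    return [sum(G[k][i] * resid[k] for k in range(n)) ** 2 for i in range(p)]

def objective(S, c, W, eta, lam):
    """Association - sparsity - connectivity for a candidate SNP set S."""
    association = sum(c[i] for i in S)
    # Total weight of network edges leaving S (i in S, j not in S).
    cut = sum(W[i][j] for i in S for j in range(len(c)) if j not in S)
    return association - eta * len(S) - lam * cut

# Toy data: 4 samples, 3 SNPs, chain network 0 - 1 - 2.
G = [[0, 1, 0], [1, 1, 0], [2, 0, 1], [2, 0, 0]]
y = [0.0, 1.0, 2.0, 3.0]
W = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

c = skat_scores(G, y)
value = objective({0, 1}, c, W, eta=1.0, lam=2.0)
print(c, value)
```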
Minimum cut reformulation
The graph-regularized maximization of the score $Q$ $(*)$ is equivalent to an $s/t$-min-cut for a graph with adjacency matrix $A$ and two additional nodes $s$ and $t$, where $A_{ij} = \lambda W_{ij}$ for $1 \le i, j \le p$ and the weights of the edges adjacent to nodes $s$ and $t$ are defined as
$$A_{si} = \begin{cases} c_i - \eta & \text{if } c_i > \eta \\ 0 & \text{otherwise} \end{cases} \qquad \text{and} \qquad A_{it} = \begin{cases} \eta - c_i & \text{if } c_i < \eta \\ 0 & \text{otherwise.} \end{cases}$$
SConES: Selecting Connected Explanatory SNPs.
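The reduction can be tried on a toy network. The sketch below builds the $s/t$ graph described on this slide and solves it with a plain Edmonds-Karp max-flow on a dense capacity matrix; this is not the solver used in the released SConES code, the function name is hypothetical, and the selected SNPs are read off as the nodes still reachable from $s$ in the residual graph.

```python
from collections import deque

def scones_min_cut(c, W, eta, lam):
    """Select SNPs by solving the s/t min-cut of the slide's construction."""
    p = len(c)
    s, t = p, p + 1
    # Residual capacities: A_ij = lam * W_ij, plus source and sink edges.
    cap = [[0.0] * (p + 2) for _ in range(p + 2)]
    for i in range(p):
        for j in range(p):
            cap[i][j] = lam * W[i][j]
        if c[i] > eta:
            cap[s][i] = c[i] - eta
        elif c[i] < eta:
            cap[i][t] = eta - c[i]

    def bfs_parents():
        """Breadth-first search from s over edges with residual capacity."""
        parent = [-1] * (p + 2)
        parent[s] = s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(p + 2):
                if parent[v] == -1 and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = bfs_parents()
        if parent[t] == -1:
            break  # no augmenting path left: the flow is maximal
        # Bottleneck capacity along the path found by the search.
        flow, v = float("inf"), t
        while v != s:
            flow = min(flow, cap[parent[v]][v])
            v = parent[v]
        # Push the flow and update residual capacities.
        v = t
        while v != s:
            u = parent[v]
            cap[u][v] -= flow
            cap[v][u] += flow
            v = u

    # Source side of the min cut = SNPs reachable from s in the last search.
    return {i for i in range(p) if parent[i] != -1}

# Chain network 0 - 1 - 2 - 3 - 4 with unit weights (toy example).
c = [3.0, 0.5, 2.5, 0.2, 0.1]
W = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(5)] for i in range(5)]
selected = scones_min_cut(c, W, eta=1.0, lam=0.6)
print(sorted(selected))  # [0, 1, 2]
```

SNP 1 has a score below $\eta$, yet it is selected because it connects SNPs 0 and 2: paying its sparsity cost is cheaper than cutting the two edges around it.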
Comparison partners
◮ Univariate linear regression (phenotype of sample $k$ regressed on SNP $i$): $y_k = \alpha_0 + \beta G_{ik}$
◮ Lasso
$$\arg\min_{\beta \in \mathbb{R}^p} \; \underbrace{\tfrac{1}{2} \| y - G\beta \|_2^2}_{\text{loss}} + \underbrace{\eta \|\beta\|_1}_{\text{sparsity}}$$
◮ Feature selection with sparsity and connectivity constraints
$$\arg\min_{\beta \in \mathbb{R}^p} \; \underbrace{L(y, G\beta)}_{\text{loss}} + \underbrace{\eta \|\beta\|_1}_{\text{sparsity}} + \underbrace{\lambda\, \Omega(\beta)}_{\text{connectivity}}$$
– ncLasso: network-constrained Lasso [Li and Li, Bioinformatics 2008]
– Overlapping group Lasso [Jacob et al., ICML 2009]
   – groupLasso: e.g. SNPs near the same gene grouped together
   – graphLasso: 1 edge = 1 group.
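As a point of reference for the Lasso baseline, the sketch below minimizes $\tfrac{1}{2}\|y - G\beta\|_2^2 + \eta\|\beta\|_1$ by cyclic coordinate descent with soft-thresholding. It is a generic textbook solver written for illustration, not the implementation used in the experiments; the function names, the fixed number of sweeps and the toy data are assumptions.

```python
def soft_threshold(rho, eta):
    """Shrink rho toward zero by eta; values within [-eta, eta] become 0."""
    if rho > eta:
        return rho - eta
    if rho < -eta:
        return rho + eta
    return 0.0

def lasso_coordinate_descent(G, y, eta, n_sweeps=200):
    """Minimize 0.5 * ||y - G beta||_2^2 + eta * ||beta||_1 (G: list of rows)."""
    n, p = len(G), len(G[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - G beta for beta = 0
    col_sq = [sum(G[k][j] ** 2 for k in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant-zero feature can never be selected
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(G[k][j] * resid[k] for k in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, eta) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for k in range(n):
                    resid[k] -= G[k][j] * delta
                beta[j] = new
    return beta

# Orthonormal toy design: the solution is the soft-thresholding of G^T y.
G = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.5, -2.0]
beta = lasso_coordinate_descent(G, y, eta=1.0)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print(beta, selected)
```

The selected features are the nonzero coordinates of $\beta$; unlike the network-guided variants, nothing in this penalty encourages them to be connected.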
Runtime
[Figure: CPU runtime in seconds (log scale) versus number of SNPs (log scale, $10^2$ to $10^6$) for graphLasso, ncLasso, ncLasso (accelerated), SConES and linear regression. n = 200, exponential random network (2% density).]
Experiments: Performance on simulated data
◮ Arabidopsis thaliana genotypes: n = 500 samples, p = 1 000 SNPs; TAIR Protein-Protein Interaction data, $\sim 50 \cdot 10^6$ edges
◮ Higher power and lower FDR than comparison partners, except for groupLasso when groups = causal structure
◮ Fairly robust to missing edges
◮ Fails if network is random.
Arabidopsis thaliana flowering time
17 flowering time phenotypes [Atwell et al., Nature, 2010]
p ∼ 170 000 SNPs (after MAF filtering)
n ∼ 150 samples
165 candidate genes [Segura et al., Nat Genet 2012]
Correction for population structure: regress out PCs.
doi:10.1242/jcs.096941
Arabidopsis thaliana flowering time
[Figure: bar plots of "# candidate genes hit" and "# selected SNPs" per method.]
◮ SConES selects about as many SNPs as other network-guided approaches but detects more candidates.
Arabidopsis thaliana flowering time
Predictivity of selected SNPs
[Figure: $R^2$ of Lasso, ncLasso, groupLasso and SConES on phenotypes 0W and LN22.]
SConES: Selecting Connected Explanatory SNPs
◮ selects connected, explanatory SNPs;
◮ incorporates large networks into GWAS;
◮ is efficient, effective and robust.
C.-A. Azencott, D. Grimm, M. Sugiyama, Y. Kawahara and K. Borgwardt (2013) Efficient network-guided multi-locus association mapping with graph cuts, Bioinformatics 29 (13), i171–i179 doi:10.1093/bioinformatics/btt238
https://github.com/chagaz/scones
https://github.com/chagaz/sfan
https://github.com/dominikgrimm/easyGWASCore
Multi-trait GWAS
Increase sample size by jointly performing GWAS for multiple related phenotypes.
Toxicogenetics / Pharmacogenomics
Tasks (phenotypes) = chemical compounds
F. Eduati, L. Mangravite, et al. (2015) Prediction of human population responses to toxic compounds by a collaborative competition. Nature Biotechnology, 33 (9), 933–940 doi:10.1038/nbt.3299
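Multi-SConES
$T$ related phenotypes.
◮ Goal: obtain similar sets of features on related tasks.
$$\arg\max_{\mathcal{S}_1, \dots, \mathcal{S}_T \subseteq \mathcal{V}} \; \sum_{t=1}^{T} \left[ \sum_{i \in \mathcal{S}} c_i - \eta |\mathcal{S}| - \lambda \sum_{i \in \mathcal{S}} \sum_{j \notin \mathcal{S}} W_{ij} - \underbrace{\mu\, |\mathcal{S}_{t-1} \,\Delta\, \mathcal{S}_t|}_{\text{task sharing}} \right]$$
Inside the sum, $\mathcal{S}$ stands for the task-specific set $\mathcal{S}_t$, and $\mathcal{S} \,\Delta\, \mathcal{S}' = (\mathcal{S} \cup \mathcal{S}') \setminus (\mathcal{S} \cap \mathcal{S}')$ is the symmetric difference.
◮ Can be reduced to single-task by building a meta-network.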
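The effect of the task-sharing term can be checked by evaluating the multi-task objective on candidate selections. The sketch below makes several assumptions not stated on the slide: each task has its own score vector, all tasks share one network $W$, and the symmetric-difference penalty is applied to consecutive tasks starting at $t = 2$. Function names and numbers are hypothetical.

```python
def single_task_score(S, c, W, eta, lam):
    """Association - sparsity - connectivity for one task."""
    cut = sum(W[i][j] for i in S for j in range(len(c)) if j not in S)
    return sum(c[i] for i in S) - eta * len(S) - lam * cut

def multi_scones_objective(subsets, scores, W, eta, lam, mu):
    """Sum of per-task scores minus mu * |S_{t-1} symmetric-difference S_t|."""
    total = sum(single_task_score(S, c, W, eta, lam)
                for S, c in zip(subsets, scores))
    # Python's ^ operator on sets is the symmetric difference.
    sharing = sum(len(subsets[t - 1] ^ subsets[t])
                  for t in range(1, len(subsets)))
    return total - mu * sharing

# Two tasks on a chain network 0 - 1 - 2 - 3 (toy example).
W = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(4)] for i in range(4)]
scores = [[3.0, 2.0, 0.25, 0.0],   # task 1: SNPs 0 and 1 are strong
          [3.0, 0.5, 0.25, 0.0]]   # task 2: only SNP 0 is strong

shared = multi_scones_objective([{0, 1}, {0, 1}], scores, W, 1.0, 0.5, mu=1.0)
separate = multi_scones_objective([{0, 1}, {0}], scores, W, 1.0, 0.5, mu=1.0)
print(shared, separate)
```

In this toy case, selecting SNP 1 for both tasks scores higher than the task-specific selection once $\mu = 1$, which is the intended behaviour of the sharing penalty.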
Multi-SConES: Multiple related tasks
Simulations: retrieving causal features
[Figure: MCC (0 to 1.0) for Models 1–4 in the single-task, two-task, three-task and four-task settings; methods CR, LA, EN, GL, GR, AG, SC in the single-task setting and CR, LA, GR, SC in the multi-task settings.]
M. Sugiyama, C.-A. Azencott, D. Grimm, Y. Kawahara and K. Borgwardt (2014) Multi-task feature selection on multiple networks via maximum flows, SIAM ICDM, 199–207 doi:10.1137/1.9781611973440.23
https://github.com/mahito-sugiyama/Multi-SConES
https://github.com/chagaz/sfan
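Leveraging similarity between tasks
Use prior knowledge about the relationship between the tasks: $\Omega \in \mathbb{R}^{T \times T}$
$$\arg\max_{\mathcal{S}_1, \dots, \mathcal{S}_T \subseteq \mathcal{V}} \; \sum_{t=1}^{T} \left[ \sum_{i \in \mathcal{S}} c_i - \eta |\mathcal{S}| - \lambda \sum_{i \in \mathcal{S}} \sum_{j \notin \mathcal{S}} W_{ij} - \underbrace{\mu \sum_{u=1}^{T} \sum_{i \in \mathcal{S}_t \cap \mathcal{S}_u} \Omega^{-1}_{tu}}_{\text{task sharing}} \right]$$
Inside the sum, $\mathcal{S}$ stands for the task-specific set $\mathcal{S}_t$.
Can also be mapped to a meta-network.
Code: http://github.com/chagaz/sfan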