High-dimensional omics data analysis using a variable screening - PowerPoint PPT Presentation

High-dimensional omics data analysis using a variable screening protocol with prior knowledge integration (SKI). Cong Liu, Oct. 3rd, 2016, International Conference on Genome Informatics (GIW).


  1. High-dimensional omics data analysis using a variable screening protocol with prior knowledge integration (SKI). Cong Liu, Oct. 3rd, 2016, International Conference on Genome Informatics (GIW)

  2. High-throughput Omics Data

  3. 'P >> N' Paradigm. Reduce P: Feature Selection. To avoid overfitting and improve model performance; to provide faster and more cost-effective models; to gain a deeper insight into the underlying biological processes.

  4. Previous Feature Selection Methods. Univariate filter methods: correlation criteria; FDR correction (q-value); rank-sum test; mutual information. Multivariate filter methods: sequential search (forward search, backward search); heuristic algorithms (genetic algorithm). Penalized regression: ridge regression, LASSO, adaptive LASSO, SCAD, elastic net. Screen + regression.

  5. Aim: Develop a rank-based feature selection protocol with knowledge integration. Background & Motivation; Statistical framework; Results; Discussion.

  6. Sure Independence Screening, (i)SIS. Fan and Lv (2008) proposed a two-stage method: (1) Select significant predictors by sorting the corresponding marginal likelihood (correlation in the linear model), thus quickly reducing the ultra-high dimensionality $p$ to a relatively large scale $d$ (e.g. $o(n)$). (2) Use a lower-dimensional model selection method such as SCAD, lasso, or adaptive lasso to further reduce the model size from $d$ to $d'$. When too many predictors are involved, basic sure screening might miss some important variables because of collinearity. Fan and Lv therefore also developed an iterative version of SIS (iSIS) that uses the joint information of the covariates rather than only marginal information.
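
The first (screening) stage described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper names `pearson` and `sis_screen`, the toy data, and the screening size `d = n / log(n)` are all assumptions made for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no marginal signal
    return sxy / math.sqrt(sxx * syy)


def sis_screen(columns, y, d):
    """Return indices of the d predictors with the largest |marginal correlation|."""
    scores = [abs(pearson(col, y)) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: -scores[j])
    return order[:d]


# Toy data (assumed): n = 50 samples, p = 200 predictors, only 0 and 1 matter.
random.seed(0)
n, p = 50, 200
columns = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [2.0 * columns[0][i] - 2.0 * columns[1][i] + random.gauss(0, 0.5)
     for i in range(n)]

d = int(n / math.log(n))  # assumed screening size for this sketch
kept = sis_screen(columns, y, d)
```

The second stage (SCAD, lasso, or adaptive lasso on the `kept` columns) is not shown here.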

  7. Intuition of Our Method: Screening with Prior Knowledge Integration (SKI). Target Sequencing; Variants Filtering; Downstream Validation.

  8. General Idea of the SKI protocol. Each variable $j$ receives two ranks $R$: one based on external knowledge, the other based on correlation with the response (or with the residuals). The tuning parameter $\alpha$, which weights the two ranks, can be user-defined or determined from the data.
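
As a rough sketch of how two ranks might be combined, the snippet below uses a weighted geometric mean, which is how a later slide describes SKI. The exact form `r_ext ** alpha * r_int ** (1 - alpha)`, the choice that the tuning parameter weights the external rank, the function names, and the scores are assumptions made for illustration.

```python
def ranks(scores):
    """Rank 1 = largest score; ties are broken by index order."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    r = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        r[j] = position
    return r


def ski_rank(rank_external, rank_internal, alpha):
    """Weighted geometric mean of two ranks; alpha weights the external rank."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [r_e ** alpha * r_i ** (1.0 - alpha)
            for r_e, r_i in zip(rank_external, rank_internal)]


# Hypothetical scores for five variables (larger = more relevant).
external_scores = [0.9, 0.1, 0.8, 0.2, 0.3]  # e.g. prior-knowledge evidence
internal_scores = [0.2, 0.7, 0.6, 0.1, 0.5]  # e.g. |marginal correlation|

r_ext = ranks(external_scores)
r_int = ranks(internal_scores)

combined = ski_rank(r_ext, r_int, alpha=0.5)
# Smaller combined value = higher priority; keep the best two variables.
top2 = sorted(range(5), key=lambda j: combined[j])[:2]
```

With `alpha = 0` the combined rank reduces to the data-driven rank alone, and with `alpha = 1` it reduces to the knowledge-based rank alone.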

  9. Estimation of $\alpha$. Cross-validation would require us to further split the sample into training and testing sets, which can make the ultra-high dimensionality issue worse. Instead, we compare the dev.ratio across different values of $\alpha$ and select the $\alpha$ that yields the largest dev.ratio as the final $\alpha$:

     $$\mathrm{dev.ratio}(\alpha) = 1 - 2 \times \frac{\mathrm{loglike}_{\mathrm{sat}} - \mathrm{loglike}_{\alpha}}{\mathrm{loglike}_{\mathrm{sat}} - \mathrm{loglike}_{\mathrm{null}}}$$

     The rationale is that the more biologically meaningful a set of variables is, the better it should fit a ridge regression model.
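
The selection rule on this slide can be sketched as a grid search. This is an illustrative sketch under several assumptions, not the authors' implementation: the ridge penalty `lam` is fixed, the number of retained variables `d` is fixed, the fit quality is measured by the Gaussian deviance ratio `1 - RSS/TSS` (any fixed positive multiple of the ratio term gives the same ordering of the candidate values), and all function names and toy data are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def centre(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ridge_dev_ratio(cols, y, lam):
    """Gaussian deviance ratio, 1 - RSS/TSS, of a ridge fit with penalty lam."""
    xc, yc = [centre(c) for c in cols], centre(y)
    gram = [[dot(ci, cj) + (lam if i == j else 0.0) for j, cj in enumerate(xc)]
            for i, ci in enumerate(xc)]
    coef = solve(gram, [dot(c, yc) for c in xc])
    fitted = [sum(b * c[i] for b, c in zip(coef, xc)) for i in range(len(yc))]
    rss = sum((a - f) ** 2 for a, f in zip(yc, fitted))
    return 1.0 - rss / dot(yc, yc)


def choose_alpha(cols, y, r_ext, r_int, alphas, d, lam=1.0):
    """Return (alpha, dev_ratio) for the alpha whose top-d variable set fits best."""
    best = None
    for alpha in alphas:
        combined = [e ** alpha * i ** (1.0 - alpha) for e, i in zip(r_ext, r_int)]
        top = sorted(range(len(cols)), key=lambda j: combined[j])[:d]
        score = ridge_dev_ratio([cols[j] for j in top], y, lam)
        if best is None or score > best[1]:
            best = (alpha, score)
    return best


def ranks(scores):
    """Rank 1 = largest score."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    out = [0] * len(scores)
    for pos, j in enumerate(order, start=1):
        out[j] = pos
    return out


# Toy data (assumed): 40 samples, 30 predictors, predictors 0-2 carry signal.
random.seed(1)
n, p = 40, 30
cols = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [cols[0][i] + cols[1][i] + cols[2][i] + random.gauss(0, 1) for i in range(n)]

r_int = ranks([abs(dot(centre(c), centre(y))) for c in cols])  # data-driven rank
r_ext = list(range(1, p + 1))  # hypothetical prior rank: low index = strong evidence
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
best_alpha, best_score = choose_alpha(cols, y, r_ext, r_int, alphas, d=5)
```

No sample splitting is involved: every candidate value is scored on the full data set, which is the point the slide makes against cross-validation.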

  10. How to get the knowledge-based rank $R$? Examples: Domain Knowledge; Database; Text Mining; Other Data Sources.

  11. Simulation Study.

      Experiment Dataset: $n_I = 200$ samples $(X, Y_I)$ were simulated, with gene number $p = 10{,}000$. 200 clusters were simulated independently, and 50 genes in each cluster were simulated from a multivariate normal distribution with $\mu = 0$, $\sigma^2 = 1$ and AR(1) correlation structure $\rho = 0.6$. In each cluster, the coefficients $\beta$ of the first ten genes were simulated from a uniform distribution (0.5, 1). All other $\beta$'s were set to zero. Continuous responses were generated from linear regression models with $\sigma^2_I = 1$ (or 3).

      Knowledge Dataset: $n_R = 200$ samples $(Z, Y_R)$ with gene number $p = 10{,}000$. Gene expressions and responses were simulated from the same structure as described for the experiment dataset. Non-zero coefficients $\beta$ were simulated to have 0%, 50%, and 100% overlap with the non-zero $\beta$ in the internal settings.
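
A scaled-down generator for this kind of design might look as follows. It is a sketch under assumptions, not the authors' simulation code: the function names are hypothetical, and `signal_clusters` is a parameter introduced here so the caller decides which clusters carry non-zero coefficients (passing every cluster index matches the slide's wording).

```python
import math
import random


def ar1_cluster(size, rho, rng):
    """One cluster of `size` N(0, 1) variables with corr(i, j) = rho ** |i - j|."""
    values = [rng.gauss(0.0, 1.0)]
    for _ in range(size - 1):
        innovation = rng.gauss(0.0, 1.0)
        values.append(rho * values[-1] + math.sqrt(1.0 - rho ** 2) * innovation)
    return values


def simulate(n, n_clusters, cluster_size, rho, n_nonzero, signal_clusters,
             sigma2, seed=0):
    """Return (X, y, beta) for a linear model with clustered AR(1) predictors.

    In each cluster listed in signal_clusters, the first n_nonzero coefficients
    are drawn from Uniform(0.5, 1); all other coefficients are zero.
    """
    rng = random.Random(seed)
    p = n_clusters * cluster_size
    beta = [0.0] * p
    for c in signal_clusters:
        for k in range(n_nonzero):
            beta[c * cluster_size + k] = rng.uniform(0.5, 1.0)
    X, y = [], []
    for _ in range(n):
        row = []
        for _ in range(n_clusters):
            row.extend(ar1_cluster(cluster_size, rho, rng))
        noise = rng.gauss(0.0, math.sqrt(sigma2))  # response noise, variance sigma2
        X.append(row)
        y.append(sum(b * x for b, x in zip(beta, row)) + noise)
    return X, y, beta


# Scaled-down run; the slide uses n = 200 and 200 clusters of 50 genes.
X, y, beta = simulate(n=20, n_clusters=4, cluster_size=5, rho=0.6,
                      n_nonzero=2, signal_clusters=[0, 1], sigma2=1.0)
```

A knowledge dataset with a chosen overlap could be produced by calling `simulate` again with a different seed and a `signal_clusters` list that shares the desired fraction of entries with the first call.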

  12. The number of true positives among different methods.

      | Overlap [5] | $\sigma^2_I$ [6] | $\sigma^2_R$ [7] | $\hat\alpha$ [8] | Top 1% [4]: SIS [1] | Top 1%: SKI [2] | Top 1%: Pool [3] | Top 5%: SIS | Top 5%: SKI | Top 5%: Pool | Top 10%: SIS | Top 10%: SKI | Top 10%: Pool |
      |---|---|---|---|---|---|---|---|---|---|---|---|---|
      | 0.0 | 1 | 1 | 0.075 | 38.96 | 38.94 | 36.36 | 45.78 | 45.72 | 43.63 | 47.66 | 47.63 | 45.63 |
      | 0.5 | 1 | 1 | 0.275 | 38.53 | 43.06 | 45.22 | 45.66 | 47.65 | 48.54 | 47.53 | 48.85 | 49.13 |
      | 1.0 | 1 | 1 | 0.384 | 38.5 | 46.34 | 47.99 | 45.65 | 48.9 | 49.58 | 47.49 | 49.51 | 49.83 |
      | 0.0 | 1 | 3 | 0.090 | 39.10 | 38.97 | 35.01 | 45.81 | 45.80 | 42.94 | 47.71 | 47.72 | 44.03 |
      | 0.5 | 1 | 3 | 0.249 | 38.92 | 42.55 | 43.85 | 45.80 | 47.31 | 48.28 | 47.57 | 48.55 | 49.10 |
      | 1.0 | 1 | 3 | 0.368 | 39.04 | 45.81 | 47.58 | 45.88 | 48.60 | 49.44 | 47.65 | 49.21 | 49.73 |
      | 0.0 | 3 | 1 | 0.113 | 36.84 | 36.43 | 35.77 | 44.61 | 44.01 | 43.37 | 46.69 | 46.57 | 46.19 |
      | 0.5 | 3 | 1 | 0.261 | 37.27 | 42.16 | 44.90 | 45.15 | 47.36 | 48.34 | 47.07 | 48.56 | 49.03 |
      | 1.0 | 3 | 1 | 0.374 | 36.91 | 46.01 | 48.89 | 44.76 | 49.42 | 49.51 | 47.12 | 49.86 | 49.90 |
      | 0.0 | 3 | 3 | 0.104 | 37.84 | 37.48 | 35.19 | 45.73 | 45.43 | 44.07 | 47.63 | 47.53 | 45.93 |
      | 0.5 | 3 | 3 | 0.264 | 37.26 | 42.52 | 44.48 | 45.03 | 47.35 | 48.26 | 47.19 | 48.58 | 49.00 |
      | 1.0 | 3 | 3 | 0.355 | 37.05 | 45.20 | 47.37 | 45.1 | 48.6 | 49.39 | 47.05 | 49.36 | 49.76 |

      [1] SIS: variables were sorted by marginal correlation using only the internal dataset.
      [2] SKI: variables were sorted by the weighted geometric mean of two marginal-correlation-based ranks using both datasets.
      [3] Pool: the two datasets were pooled together and treated as a single dataset, and then variables were sorted by marginal correlation.
      [4] The top 1%, 5% and 10% of variables were selected, respectively, under the different settings.
      [5] The percentage of non-zero $\beta$'s that overlap between the two datasets.
      [6] $\sigma^2_I$: the variance added in the internal dataset to generate response $Y_I$.
      [7] $\sigma^2_R$: the variance added in the external dataset to generate response $Y_R$.
      [8] $\hat\alpha$: the estimated value of $\alpha$, which controls the weight of the two ranks in the geometric mean.

  13. The number of true positives using iterative and non-iterative approaches when the top 1% of variables were selected.

      | Overlap | $\rho$ | $\hat\alpha$ | SIS | SKI | iSIS | iSKI |
      |---|---|---|---|---|---|---|
      | 0 | 0.3 | 0.061 | 23.32 | 23.12 | 25.22 | 22.53 |
      | 0.5 | 0.3 | 0.342 | 24.83 | 33.20 | 26.13 | 34.43 |
      | 1 | 0.3 | 0.443 | 23.14 | 34.41 | 26.33 | 38.85 |
      | 0 | 0.6 | 0.044 | 37.35 | 36.34 | 41.11 | 36.17 |
      | 0.5 | 0.6 | 0.392 | 36.47 | 41.67 | 39.67 | 44.83 |
      | 1 | 0.6 | 0.453 | 37.12 | 45.83 | 40.44 | 49.40 |

      Notes as given on the slide:
      1. SIS: variables were sorted by marginal correlation using only the internal dataset.
      2. SKI: variables were sorted by the weighted geometric mean of two marginal-correlation-based ranks using both datasets.
      3. Pool: the two datasets were pooled together and treated as a single dataset, and then variables were sorted by marginal correlation.
      4. The top 1%, 5% and 10% of variables were selected, respectively, under the different settings.
      5. The percentage of non-zero $\beta$'s that overlap between the two datasets.
      6. $\sigma^2_I$: the variance added in the internal dataset to generate response $Y_I$.
      7. $\sigma^2_R$: the variance added in the external dataset to generate response $Y_R$.
      8. $\hat\alpha$: the estimated value of $\alpha$, which controls the weight of the two ranks in the geometric mean.

  14. Real Application: Drug Response Analysis. Selumetinib (AZD6244) is a drug used to treat various types of cancer, such as non-small cell lung cancer (NSCLC). We applied the SKI procedure to identify potential biomarkers of response to selumetinib using the CCLE dataset. The CCLE dataset includes drug response data (i.e. Active Area) together with baseline omics measurements: gene expression, mutation data, and copy numbers. In total, 489 cell lines and 41872 genomic features were measured. We then searched the Drug2Gene database [25] to acquire prior knowledge of associations between selumetinib and genes. In total, 383 genes were identified as having associations with selumetinib.

  15. 18 variables selected by the SKI procedure, when the top 100 variables were selected, whose association with selumetinib could be found in the database.

      | Gene Symbol | Probe ID | Type | $R_{SIS}$ | $R_{SKI}$ |
      |---|---|---|---|---|
      | BRAF | NA | Mut | 4 | 1 |
      | ADCK3 | 56997_at | Exp | 172 | 5 |
      | TESK1 | 7016_at | Exp | 194 | 6 |
      | DCLK2 | 166614_at | Exp | 196 | 8 |
      | TNIK | 23043_at | Exp | 206 | 9 |
      | NUAK2 | 81788_at | Exp | 209 | 10 |
      | ERBB3 | 2065_at | Exp | 328 | 14 |
      | PRKCD | 5580_at | Exp | 338 | 15 |
      | MYLK | 4638_at | Exp | 479 | 20 |
      | MAP3K1 | 4214_at | Exp | 502 | 21 |
      | ULK3 | 25989_at | Exp | 519 | 23 |
      | FGFR1 | 2260_at | Exp | 556 | 25 |
      | SNRK | 54861_at | Exp | 582 | 26 |
      | RPS6KA3 | 6197_at | Exp | 623 | 29 |
      | STK10 | 6793_at | Exp | 691 | 31 |
      | MAPK9 | 5601_at | Exp | 756 | 34 |
      | TAOK3 | 51347_at | Exp | 761 | 35 |
      | PIK3CB | 5291_at | Exp | 764 | 36 |

      [Figure: Boxplot of squared error for selumetinib response prediction]

  16. Discussion. The proposed approach is general and is not limited to any specific type of prior knowledge, as long as the variables can be ranked by some external criterion. Bergersen et al. proposed a weighted LASSO (wLASSO) procedure with data integration, which shares a similar idea with our approach. As a screening-based method, SKI can also be extended to more general settings (generalized linear models, additive models, Cox models, and model-free screening). Li et al. proposed a variant, the robust rank correlation screening (RRCS) method, which is based on the Kendall τ correlation coefficient between response and predictor variables rather than the Pearson correlation used in SIS.
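
To illustrate why a rank correlation can make screening more robust, here is a small sketch of screening by Kendall τ. It is not the RRCS implementation of Li et al.: the function names are hypothetical, the τ-a variant (no tie correction) is an assumption chosen for brevity, and the data are made up.

```python
import math


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant pairs) / total number of pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def rank_screen(columns, y, d):
    """Return indices of the d predictors with the largest |Kendall tau|."""
    scores = [abs(kendall_tau(col, y)) for col in columns]
    return sorted(range(len(columns)), key=lambda j: -scores[j])[:d]


x_signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x_noise = [3.0, 1.0, 6.0, 2.0, 5.0, 4.0]
y = [0.1, 0.4, 0.3, 0.9, 1.2, 50.0]  # the last response is an extreme outlier

kept = rank_screen([x_noise, x_signal], y, d=1)

# Kendall tau depends only on orderings, so a monotone transform of the
# response (here a log) leaves the screening statistic unchanged.
tau_raw = kendall_tau(x_signal, y)
tau_log = kendall_tau(x_signal, [math.log(v) for v in y])
```

Because only the ordering of values matters, the outlier in `y` has no more influence than any other largest observation, which is the property that motivates replacing the Pearson correlation.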
