
Personalized Regression Enables Sample-Specific Pan-Cancer Analysis - PowerPoint PPT Presentation



  1. Personalized Regression Enables Sample-Specific Pan-Cancer Analysis. Benjamin J. Lengerich, Bryon Aragam, Eric P. Xing. {blengeri, naragam, epxing}@cs.cmu.edu. @ben_lengerich, @itsrainingdata

  2. Cancer is Complex
  • Different mutations can cause similar phenotypes.
  • There are many possible driver mutations.
  • Do we need to build a single model that works for all cancers?
  • Could we build a different model for each type of cancer?
  • But cancer “type” may not correspond to any single clinical covariate.
  Ben Lengerich | ISMB 2018

  3. The Extreme: Sample-Specific Models
  • What if we try to understand tumors one at a time?
  • Could we use simple models that each work for a single patient?
  • This enables new types of questions to be asked: “How does this tumor’s model differ from the cohort’s?”

  4. Our Goal
  Sample-specific, pan-cancer models.
  [Figure: mapping from samples to model parameters]

  5. Why Sample-Specific Models?
  [Diagram: modeling approaches arranged by universal vs. personal effects and complicated vs. simple effects. Deep learning yields complicated, universal effects (“self-driving cars”); mixtures, mixed effects, and varying-coefficient models lie in between; sample-specific models yield simple, personal effects (“This tumor is due to a mutation in gene TP53”).]

  6. Why Pan-Cancer Models?
  • Share information between rare and common cancer types.
  • Uncover molecular subtypes.
  • If we can handle clinical covariates well, tissue type can simply be treated as another covariate.
  [Figure: Number of Samples by Tissue Type in TCGA [1]]
  1. cancergenome.nih.gov

  7. Related Work
  [Table comparing methods on three axes: General framework? Sample-specific models? Unknown covariate effects?]
  • Varying-Coefficient [1]
  • Known Structure [2,3,4]
  • Sample-Specific Network Estimation [5,6]
  • Personalized Regression (this work)
  1. Hastie and Tibshirani, Journal of the Royal Statistical Society 1993. 2. Song et al., NIPS 2009. 3. Kolar et al., NIPS 2009. 4. Parikh et al., ISMB 2011. 5. Kuijjer et al., arXiv 2015. 6. Liu et al., Nucleic Acids Research 2016.

  8. Personalized Regression
  • From estimating a single model: Y = Xβ^T + ε
  • To estimating sample-specific models: Y^(i) = X^(i) β^(i)T + ε^(i), with a separate parameter vector β^(1), β^(2), …, β^(N) for each sample.
  • Overparameterized, but not hopeless!
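To make the shift from one shared β to per-sample β^(i) concrete, here is a minimal numpy sketch; the toy sizes and random data are mine, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 5  # toy numbers of samples and features

X = rng.normal(size=(N, P))     # covariate row X^(i) for each sample
beta = rng.normal(size=(N, P))  # parameter vector beta^(i) for each sample

# Single model: one shared beta for all samples.
beta_shared = beta.mean(axis=0)
y_shared = X @ beta_shared                   # shape (N,)

# Sample-specific models: Y^(i) = X^(i) beta^(i)^T, one dot product per sample.
y_personal = np.einsum("ip,ip->i", X, beta)  # shape (N,)
```

The parameter count grows from P to N·P, which is why the slide calls the problem overparameterized.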

  9. Personalized Regression
  • Define the sample-specific loss functional to be minimized:
  ℒ(β; d_β, d_U) ∝ Σ_{i=1}^{N} ℒ^(i)(β^(i); d_β, d_U)
  ℒ^(i)(β^(i); d_β, d_U) ∝ f(X^(i), Y^(i), β^(i)) + ρ λ(β^(i)) + ϱ^(i) γ(d_β, d_U)
  where f is the prediction loss, λ the regularizer, and γ the distance-matching penalty.
  • Overparameterized, but not hopeless!

  10. Distance-Matching Regularization
  • Main idea: the distance between sample parameters should be similar to the distance between sample covariates.
  • Define a regularization loss functional to be minimized:
  γ(d_β, d_U) = γ Σ_{j≠i} ( d_β(β^(i), β^(j)) − d_U(U^(i), U^(j)) )²
  where d_β is the parameter distance and d_U the covariate distance, matched over pairwise distances between all samples.
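A minimal numpy sketch of this penalty, assuming plain Euclidean distances in place of the learned metrics d_β and d_U (the function name is mine):

```python
import numpy as np

def distance_matching_penalty(beta, U, gamma=1.0):
    """Sum of squared mismatches between pairwise parameter distances
    and pairwise covariate distances, over all pairs j != i."""
    d_beta = np.linalg.norm(beta[:, None, :] - beta[None, :, :], axis=-1)
    d_U = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1)
    diff = d_beta - d_U
    np.fill_diagonal(diff, 0.0)  # exclude the j == i terms
    return gamma * np.sum(diff ** 2)
```

When parameter distances track covariate distances exactly, the penalty is zero; any mismatch is penalized quadratically.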

  11. Distance Metrics Can Be Learned From Data
  • Define distance metrics as linear combinations of feature-wise distance metrics:
  d_β(x, y) = [ |x_1 − y_1|, …, |x_P − y_P| ] ϕ_β^T
  d_U(x, y) = [ d_{U_1}(x_1, y_1), …, d_{U_K}(x_K, y_K) ] ϕ_U^T
  • After optimization, we can inspect the values in ϕ_β, ϕ_U to understand contributions to personalization.
  • The user must supply covariate-specific distance metrics, which may be complicated.
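A sketch of the two metric forms, with the ϕ loadings passed in explicitly (the helper names are mine):

```python
import numpy as np

def d_beta(x, y, phi_beta):
    # Weighted sum of per-feature absolute differences: |x_p - y_p| dotted with phi_beta.
    return np.abs(x - y) @ phi_beta

def d_U(x, y, phi_U, feature_metrics):
    # feature_metrics[k] is the user-supplied metric for covariate k,
    # e.g. absolute difference for age, 0/1 mismatch for tissue type.
    per_feature = np.array([m(x[k], y[k]) for k, m in enumerate(feature_metrics)])
    return per_feature @ phi_U
```

Because both metrics are linear in their loadings, a large entry of ϕ_U after optimization marks a covariate that drove personalization.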

  12. When is Personalized Regression Useful?
  • We are seeking a model for inference, not necessarily the most accurate predictive model.
  • We are seeking relatively simple personalized effects, not complex universal effects.
  • We have covariate data that are informative about each sample.

  13. Experiments

  14. TCGA Pan-Cancer Analysis
  • Model: logistic regression with lasso regularization
  • Task: predict case/control status
  • Data: 28 primary sites; 9663 samples (8944 case, 719 control); 4123 RNA-Seq features; 14 clinical covariates
  [Figure: Number of Samples by Tissue Type in TCGA [1]]
  1. cancergenome.nih.gov

  15. Clinical Covariates
  • 14 clinical covariates:
  • Tissue features: disease type, primary site, days to collection
  • Sample molecular biomarkers: pct. tumor cells, pct. normal cells, pct. tumor nuclei, pct. lymphocyte infiltration, pct. stromal cells, pct. monocyte infiltration, pct. neutrophil infiltration
  • Patient demographic features: age at diagnosis, year of birth, gender, race
  • Traditional methods expect these data encoded as one-hot vectors, which expands dimensionality 5X!

  16. Personalized Models Are More Efficient with Variable Selection
  [Figure: two panels showing that personalized models select fewer genes per sample and use each gene in fewer samples; most genes are selected for fewer than 500 samples. Red lines indicate the number of variables selected by tissue-specific models.]

  17. Personalized Regression Gives More Weight to Known Oncogenes [1]
  • Many methods effectively identify common oncogenes; few methods effectively identify rare oncogenes.
  1. Oncogenes as annotated in COSMIC (Forbes et al., Nucleic Acids Research 2014)

  18. Personalized Regression Produces Sample-Specific Pan-Cancer Models
  [Figure: heatmap of samples by genes; red lines indicate oncogenes]

  19. Personalized Models Reveal Molecular Subtypes Which Span Tissues
  • Over-represented for the GO biological process term “Modulation of Chemical Synaptic Transmission” (p < 0.05, FDR-corrected).
  • Includes genes ATP1A2, SLC6A4, ASIC1, GRM3, and SLC8A3, which code for ion-transport processes.
  • Ion-transport processes have long been seen in vivo as an important system in thyroid cancer [1] and in vitro in leukemic cells [2], but only recently as a functional marker across different cancer types [3].
  [Figure: samples-by-genes heatmap]
  1. Filetti et al., European Journal of Endocrinology 1999. 2. Morgan et al., Cancer Research 1986. 3. Scafoglio et al., PNAS 2015.

  20. Personalized Models Form Clusters with Distinct Signatures
  [Figure: clusters labeled Extracellular Processes - Antigen, Extracellular Processes - Membrane, and Cellular Metabolism]

  21. Personalized Regression Learns Clinical Distance Metrics

  22. Conclusions
  • Sample-specific models can give us a new perspective, unlocking bottom-up in addition to traditional top-down analyses.
  • Personalized regression with distance-matching regularization effectively learns sample-specific models.
  • Personalized regression reveals patterns in pan-cancer transcriptomic data that are overlooked by traditional analyses.

  23. Future Work
  • Biological questions: sample-specific processes?
  • More complex personalized models
  • Personalized regression for single-cell data, election modeling, stock prediction

  24. Thank You
  Code available at: github.com/blengerich/personalized_regression
  Collaborators: Bryon Aragam, Eric P. Xing
  Contact: {blengeri, epxing}@cs.cmu.edu
  Travel to ISMB generously supported by ISCB. Research supported by NIH.

  25. The Gory Details

  26. Personalized Regression: Optimization
  • Define pairwise distance vectors by:
  Δ_β^(i,j) = [ d_{β_1}(β_1^(i), β_1^(j)), …, d_{β_P}(β_P^(i), β_P^(j)) ]
  Δ_U^(i,j) = [ d_{U_1}(U_1^(i), U_1^(j)), …, d_{U_K}(U_K^(i), U_K^(j)) ]
  • Construction of the covariate distance tensor can be amortized.
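The amortization can be sketched as precomputing the N×N×K tensor of feature-wise covariate distances once, since the covariates never change during optimization (the function name is mine):

```python
import numpy as np

def covariate_distance_tensor(U, feature_metrics):
    """Delta_U[i, j, k] = d_{U_k}(U_k^(i), U_k^(j)), computed once up front;
    feature_metrics[k] must accept broadcast numpy arrays."""
    N, K = U.shape
    Delta = np.empty((N, N, K))
    for k, metric in enumerate(feature_metrics):
        col = U[:, k]
        Delta[:, :, k] = metric(col[:, None], col[None, :])  # all pairs at once
    return Delta
```

With this tensor in hand, d_U(U^(i), U^(j)) reduces to the dot product Delta[i, j] @ phi_U at every iteration.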

  27. Avoiding Degenerate Solutions
  • Add priors to the distance metrics.
  • From: γ(d_β, d_U) = γ Σ_{j≠i} ( d_β(β^(i), β^(j)) − d_U(U^(i), U^(j)) )²
  • To: γ(d_β, d_U) = γ Σ_{j≠i} ( d_β(β^(i), β^(j)) − d_U(U^(i), U^(j)) )² + ψ_α(d_β) + ψ_υ(d_U)

  28. Avoiding Degenerate Solutions
  • Add priors to the distance metrics:
  γ(d_β, d_U) = γ Σ_{j≠i} ( d_β(β^(i), β^(j)) − d_U(U^(i), U^(j)) )² + ψ_α(d_β) + ψ_υ(d_U)
  • where ψ_α(d_β) = α ‖ϕ_β − ϕ_β^0‖² and ψ_υ(d_U) = υ ‖ϕ_U − ϕ_U^0‖²
  • and we project the loadings onto the non-negative reals.
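A sketch of the two prior terms and the projection step, assuming the loadings ϕ are plain numpy vectors (the function names are mine):

```python
import numpy as np

def metric_prior_penalty(phi, phi0, strength):
    # psi(phi) = strength * ||phi - phi0||^2: anchor the learned loadings
    # to a prior phi0 so the metric cannot collapse to a degenerate solution.
    return strength * np.sum((phi - phi0) ** 2)

def project_nonnegative(phi):
    # Projecting onto the non-negative reals keeps the learned linear
    # combination of feature-wise metrics a valid distance.
    return np.maximum(phi, 0.0)
```

The same penalty form serves both ψ_α (with strength α, prior ϕ_β^0) and ψ_υ (with strength υ, prior ϕ_U^0).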

  29. Personalized Regression
  • Initialize at the population solution.
  • Allow each personalized model to “fine-tune” away from the central population solution (block coordinate descent).
  • Distance-matching regularization ensures the personalized models respect covariate structure.
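The procedure above can be sketched as follows, with squared-error loss, Euclidean distances standing in for the learned metrics, and plain gradient steps standing in for block coordinate descent (all of these simplifications are mine):

```python
import numpy as np

def personalize(X, y, U, beta_pop, lr=0.01, gamma=0.01, steps=100):
    """Start every beta^(i) at the population solution, then let each
    fine-tune under prediction loss plus distance matching."""
    N, P = X.shape
    beta = np.tile(beta_pop, (N, 1))  # initialize at the population solution
    for _ in range(steps):
        # Gradient of the per-sample squared prediction error.
        resid = np.einsum("ip,ip->i", X, beta) - y
        grad_pred = resid[:, None] * X
        # Gradient of the pairwise distance-matching penalty.
        d_beta = np.linalg.norm(beta[:, None] - beta[None, :], axis=-1) + 1e-12
        d_U = np.linalg.norm(U[:, None] - U[None, :], axis=-1)
        w = (d_beta - d_U) / d_beta
        grad_dist = 4 * gamma * np.einsum("ij,ijp->ip",
                                          w, beta[:, None] - beta[None, :])
        beta -= lr * (grad_pred + grad_dist)
    return beta
```

The distance-matching gradient pulls each β^(i) toward (or away from) its peers until parameter distances roughly match covariate distances, which is what keeps the N·P parameters from fitting arbitrarily.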

  30. Inference Procedure
  • Conveniently, we have already learned distance metrics to use for predictions.
  • On test data, we identify the closest neighbors and use their sample-specific models.
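A sketch of this nearest-neighbor inference, assuming the learned covariate metric is available as a plain Python function (the names are mine):

```python
import numpy as np

def predict_test(X_test, U_test, U_train, beta_train, d_U):
    """For each test sample, find its closest training sample under the
    learned covariate metric and reuse that sample-specific model."""
    preds = np.empty(len(X_test))
    for i in range(len(X_test)):
        dists = np.array([d_U(U_test[i], u) for u in U_train])
        nn = int(np.argmin(dists))             # closest training neighbor
        preds[i] = X_test[i] @ beta_train[nn]  # apply its personalized model
    return preds
```

This sketch borrows a single neighbor's model; averaging the models of several close neighbors would be a natural variant.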
