
Personalized Regression Enables Sample-Specific Pan-Cancer Analysis - PowerPoint PPT Presentation



  1. Personalized Regression Enables Sample-Specific Pan-Cancer Analysis. Benjamin J. Lengerich, Bryon Aragam, Eric P. Xing. {blengeri, naragam, epxing}@cs.cmu.edu. @ben_lengerich, @itsrainingdata

  2. Cancer is Complex
  • Different mutations can cause similar phenotypes.
  • There are many possible driver mutations.
  • Do we need to build a single model that works for all cancers?
  • Could we build a different model for each type of cancer?
  • But cancer “type” may not correspond to any single clinical covariate.
  Ben Lengerich | ISMB 2018

  3. The Extreme: Sample-Specific Models
  • What if we try to understand tumors one at a time?
  • Could we use simple models that each work for a single patient?
  • This enables new types of questions to be asked: “How does this tumor’s model differ from the cohort’s?”

  4. Our Goal
  Sample-specific, pan-cancer models.
  [Figure: mapping from samples to model parameters]

  5. Why Sample-Specific Models?
  [Diagram: modeling approaches arranged by universal vs. personal effects and complicated vs. simple effects. Deep learning yields complicated, universal effects (“self-driving cars”); mixtures, mixed effects, and varying-coefficient models lie in between; sample-specific models yield simple, personal effects (“This tumor is due to a mutation in gene TP53”).]

  6. Why Pan-Cancer Models?
  • Share information between rare and common cancer types.
  • Uncover molecular subtypes.
  • If we can handle clinical covariates well, tissue type can simply be treated as another covariate.
  [Figure: Number of Samples by Tissue Type in TCGA [1]]
  1. cancergenome.nih.gov

  7. Related Work
  [Table comparing methods on three axes: General framework? Sample-specific models? Unknown covariate effects?]
  • Varying-Coefficient [1]
  • Known Structure [2,3,4]
  • Sample-Specific Network Estimation [5,6]
  • Personalized Regression (this work)
  1. Hastie and Tibshirani, Journal of the Royal Statistical Society 1993. 2. Song et al., NIPS 2009. 3. Kolar et al., NIPS 2009. 4. Parikh et al., ISMB 2011. 5. Kuijjer et al., arXiv 2015. 6. Liu et al., Nucleic Acids Research 2016.

  8. Personalized Regression
  • From estimating a single model: Y = Xβ^T + ε
  • To estimating sample-specific models: Y^(i) = X^(i) β^(i)T + ε^(i), with a separate parameter vector β^(1), β^(2), …, β^(N) for each sample.
  • Overparameterized, but not hopeless!
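To make the shift from one shared β to per-sample β^(i) concrete, here is a minimal numpy sketch; the toy sizes and random data are mine, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 5  # toy numbers of samples and features

X = rng.normal(size=(N, P))     # covariate row X^(i) for each sample
beta = rng.normal(size=(N, P))  # parameter vector beta^(i) for each sample

# Single model: one shared beta for all samples.
beta_shared = beta.mean(axis=0)
y_shared = X @ beta_shared                   # shape (N,)

# Sample-specific models: Y^(i) = X^(i) beta^(i)^T, one dot product per sample.
y_personal = np.einsum("ip,ip->i", X, beta)  # shape (N,)
```

The parameter count grows from P to N·P, which is why the slide calls the problem overparameterized.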

  9. Personalized Regression
  • Define the sample-specific loss functional to be minimized:
  ℒ(β; d_β, d_U) ∝ Σ_{i=1}^{N} ℒ^(i)(β^(i); d_β, d_U)
  ℒ^(i)(β^(i); d_β, d_U) ∝ f(X^(i), Y^(i), β^(i)) + ρ λ(β^(i)) + ϱ^(i) γ(d_β, d_U)
  where f is the prediction loss, λ the regularizer, and γ the distance-matching penalty.
  • Overparameterized, but not hopeless!

  10. Distance-Matching Regularization
  • Main idea: the distance between sample parameters should be similar to the distance between sample covariates.
  • Define a regularization loss functional to be minimized:
  γ(d_β, d_U) = γ Σ_{j≠i} ( d_β(β^(i), β^(j)) − d_U(U^(i), U^(j)) )²
  where d_β is the parameter distance and d_U the covariate distance, matched over pairwise distances between all samples.
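A minimal numpy sketch of this penalty, assuming plain Euclidean distances in place of the learned metrics d_β and d_U (the function name is mine):

```python
import numpy as np

def distance_matching_penalty(beta, U, gamma=1.0):
    """Sum of squared mismatches between pairwise parameter distances
    and pairwise covariate distances, over all pairs j != i."""
    d_beta = np.linalg.norm(beta[:, None, :] - beta[None, :, :], axis=-1)
    d_U = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1)
    diff = d_beta - d_U
    np.fill_diagonal(diff, 0.0)  # exclude the j == i terms
    return gamma * np.sum(diff ** 2)
```

When parameter distances track covariate distances exactly, the penalty is zero; any mismatch is penalized quadratically.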

  11. Distance Metrics Can Be Learned From Data
  • Define distance metrics as linear combinations of feature-wise distance metrics:
  d_β(x, y) = [ |x_1 − y_1|, …, |x_P − y_P| ] ϕ_β^T
  d_U(x, y) = [ d_{U_1}(x_1, y_1), …, d_{U_K}(x_K, y_K) ] ϕ_U^T
  • After optimization, we can inspect the values in ϕ_β, ϕ_U to understand contributions to personalization.
  • The user must supply covariate-specific distance metrics, which may be complicated.
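A sketch of the two metric forms, with the ϕ loadings passed in explicitly (the helper names are mine):

```python
import numpy as np

def d_beta(x, y, phi_beta):
    # Weighted sum of per-feature absolute differences: |x_p - y_p| dotted with phi_beta.
    return np.abs(x - y) @ phi_beta

def d_U(x, y, phi_U, feature_metrics):
    # feature_metrics[k] is the user-supplied metric for covariate k,
    # e.g. absolute difference for age, 0/1 mismatch for tissue type.
    per_feature = np.array([m(x[k], y[k]) for k, m in enumerate(feature_metrics)])
    return per_feature @ phi_U
```

Because both metrics are linear in their loadings, a large entry of ϕ_U after optimization marks a covariate that drove personalization.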

  12. When is Personalized Regression Useful?
  • We are seeking a model for inference, not necessarily the most accurate predictive model.
  • We are seeking relatively simple personalized effects, not complex universal effects.
  • We have covariate data that are informative about each sample.

  13. Experiments

  14. TCGA Pan-Cancer Analysis
  • Model: logistic regression with lasso regularization
  • Task: predict case/control status
  • Data: 28 primary sites; 9663 samples (8944 case, 719 control); 4123 RNA-Seq features; 14 clinical covariates
  [Figure: Number of Samples by Tissue Type in TCGA [1]]
  1. cancergenome.nih.gov

  15. Clinical Covariates
  • 14 clinical covariates:
  • Tissue features: disease type, primary site, days to collection
  • Sample molecular biomarkers: pct. tumor cells, pct. normal cells, pct. tumor nuclei, pct. lymphocyte infiltration, pct. stromal cells, pct. monocyte infiltration, pct. neutrophil infiltration
  • Patient demographic features: age at diagnosis, year of birth, gender, race
  • Traditional methods expect these data encoded as one-hot vectors, which expands dimensionality 5X!

  16. Personalized Models Are More Efficient with Variable Selection
  [Figure: two panels showing that personalized models select fewer genes per sample and use each gene in fewer samples; most genes are selected for fewer than 500 samples. Red lines indicate the number of variables selected by tissue-specific models.]

  17. Personalized Regression Gives More Weight to Known Oncogenes [1]
  • Many methods effectively identify common oncogenes; few methods effectively identify rare oncogenes.
  1. Oncogenes as annotated in COSMIC (Forbes et al., Nucleic Acids Research 2014)

  18. Personalized Regression Produces Sample-Specific Pan-Cancer Models
  [Figure: heatmap of samples by genes; red lines indicate oncogenes]

  19. Personalized Models Reveal Molecular Subtypes Which Span Tissues
  • Over-represented for the GO biological process term “Modulation of Chemical Synaptic Transmission” (p < 0.05, FDR-corrected).
  • Includes genes ATP1A2, SLC6A4, ASIC1, GRM3, and SLC8A3, which code for ion-transport processes.
  • Ion-transport processes have long been seen in vivo as an important system in thyroid cancer [1] and in vitro in leukemic cells [2], but only recently as a functional marker across different cancer types [3].
  [Figure: samples-by-genes heatmap]
  1. Filetti et al., European Journal of Endocrinology 1999. 2. Morgan et al., Cancer Research 1986. 3. Scafoglio et al., PNAS 2015.

  20. Personalized Models Form Clusters with Distinct Signatures
  [Figure: clusters labeled Extracellular Processes - Antigen, Extracellular Processes - Membrane, and Cellular Metabolism]

  21. Personalized Regression Learns Clinical Distance Metrics

  22. Conclusions
  • Sample-specific models can give us a new perspective, unlocking bottom-up in addition to traditional top-down analyses.
  • Personalized regression with distance-matching regularization effectively learns sample-specific models.
  • Personalized regression reveals patterns in pan-cancer transcriptomic data that are overlooked by traditional analyses.

  23. Future Work
  • Biological questions: sample-specific processes?
  • More complex personalized models
  • Personalized regression for single-cell data, election modeling, stock prediction

  24. Thank You
  Code available at: github.com/blengerich/personalized_regression
  Collaborators: Bryon Aragam, Eric P. Xing
  Contact: {blengeri, epxing}@cs.cmu.edu
  Travel to ISMB generously supported by ISCB. Research supported by NIH.

  25. The Gory Details

  26. Personalized Regression: Optimization
  • Define pairwise distance vectors by:
  Δ_β^(i,j) = [ d_{β_1}(β_1^(i), β_1^(j)), …, d_{β_P}(β_P^(i), β_P^(j)) ]
  Δ_U^(i,j) = [ d_{U_1}(U_1^(i), U_1^(j)), …, d_{U_K}(U_K^(i), U_K^(j)) ]
  • Construction of the covariate distance tensor can be amortized.
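The amortization can be sketched as precomputing the N×N×K tensor of feature-wise covariate distances once, since the covariates never change during optimization (the function name is mine):

```python
import numpy as np

def covariate_distance_tensor(U, feature_metrics):
    """Delta_U[i, j, k] = d_{U_k}(U_k^(i), U_k^(j)), computed once up front;
    feature_metrics[k] must accept broadcast numpy arrays."""
    N, K = U.shape
    Delta = np.empty((N, N, K))
    for k, metric in enumerate(feature_metrics):
        col = U[:, k]
        Delta[:, :, k] = metric(col[:, None], col[None, :])  # all pairs at once
    return Delta
```

With this tensor in hand, d_U(U^(i), U^(j)) reduces to the dot product Delta[i, j] @ phi_U at every iteration.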

  27. Avoiding Degenerate Solutions
  • Add priors to the distance metrics.
  • From: γ(d_β, d_U) = γ Σ_{j≠i} ( d_β(β^(i), β^(j)) − d_U(U^(i), U^(j)) )²
  • To: γ(d_β, d_U) = γ Σ_{j≠i} ( d_β(β^(i), β^(j)) − d_U(U^(i), U^(j)) )² + ψ_α(d_β) + ψ_υ(d_U)

  28. Avoiding Degenerate Solutions
  • Add priors to the distance metrics:
  γ(d_β, d_U) = γ Σ_{j≠i} ( d_β(β^(i), β^(j)) − d_U(U^(i), U^(j)) )² + ψ_α(d_β) + ψ_υ(d_U)
  • where ψ_α(d_β) = α ‖ϕ_β − ϕ_β^0‖² and ψ_υ(d_U) = υ ‖ϕ_U − ϕ_U^0‖²
  • and we project the loadings onto the non-negative reals.
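A sketch of the two prior terms and the projection step, assuming the loadings ϕ are plain numpy vectors (the function names are mine):

```python
import numpy as np

def metric_prior_penalty(phi, phi0, strength):
    # psi(phi) = strength * ||phi - phi0||^2: anchor the learned loadings
    # to a prior phi0 so the metric cannot collapse to a degenerate solution.
    return strength * np.sum((phi - phi0) ** 2)

def project_nonnegative(phi):
    # Projecting onto the non-negative reals keeps the learned linear
    # combination of feature-wise metrics a valid distance.
    return np.maximum(phi, 0.0)
```

The same penalty form serves both ψ_α (with strength α, prior ϕ_β^0) and ψ_υ (with strength υ, prior ϕ_U^0).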

  29. Personalized Regression
  • Initialize at the population solution.
  • Allow each personalized model to “fine-tune” away from the central population solution (block coordinate descent).
  • Distance-matching regularization ensures the personalized models respect covariate structure.
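The procedure above can be sketched as follows, with squared-error loss, Euclidean distances standing in for the learned metrics, and plain gradient steps standing in for block coordinate descent (all of these simplifications are mine):

```python
import numpy as np

def personalize(X, y, U, beta_pop, lr=0.01, gamma=0.01, steps=100):
    """Start every beta^(i) at the population solution, then let each
    fine-tune under prediction loss plus distance matching."""
    N, P = X.shape
    beta = np.tile(beta_pop, (N, 1))  # initialize at the population solution
    for _ in range(steps):
        # Gradient of the per-sample squared prediction error.
        resid = np.einsum("ip,ip->i", X, beta) - y
        grad_pred = resid[:, None] * X
        # Gradient of the pairwise distance-matching penalty.
        d_beta = np.linalg.norm(beta[:, None] - beta[None, :], axis=-1) + 1e-12
        d_U = np.linalg.norm(U[:, None] - U[None, :], axis=-1)
        w = (d_beta - d_U) / d_beta
        grad_dist = 4 * gamma * np.einsum("ij,ijp->ip",
                                          w, beta[:, None] - beta[None, :])
        beta -= lr * (grad_pred + grad_dist)
    return beta
```

The distance-matching gradient pulls each β^(i) toward (or away from) its peers until parameter distances roughly match covariate distances, which is what keeps the N·P parameters from fitting arbitrarily.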

  30. Inference Procedure
  • Conveniently, we have already learned distance metrics to use for predictions.
  • On test data, we identify the closest neighbors and use their sample-specific models.
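A sketch of this nearest-neighbor inference, assuming the learned covariate metric is available as a plain Python function (the names are mine):

```python
import numpy as np

def predict_test(X_test, U_test, U_train, beta_train, d_U):
    """For each test sample, find its closest training sample under the
    learned covariate metric and reuse that sample-specific model."""
    preds = np.empty(len(X_test))
    for i in range(len(X_test)):
        dists = np.array([d_U(U_test[i], u) for u in U_train])
        nn = int(np.argmin(dists))             # closest training neighbor
        preds[i] = X_test[i] @ beta_train[nn]  # apply its personalized model
    return preds
```

This sketch borrows a single neighbor's model; averaging the models of several close neighbors would be a natural variant.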
