CAMDA ’03: Weakest Link Models for Detecting Small Groups of Genes to Predict Lung Cancer Survival Presenter: Thomas J. Richards, Ph.D. November 13, 2003
Affiliation: Dorothy P. & Richard P. Simmons Center for Interstitial Lung Diseases in the Division of Pulmonary, Allergy, and Critical Care Medicine University of Pittsburgh
In Collaboration with: Roger S. Day, Sc.D. University of Pittsburgh Department of Biostatistics and University of Pittsburgh Cancer Institute
Weakest Link Models • Make sense in biology; • Can be applied to gene expression data; • May identify novel gene interactions.
Response: Plant Growth. 5 Necessary factors: • Water; • Sunlight; • P; • K; • Ca. How do factors combine to affect plant growth?
They don’t work together like this…
They may work together like this…
Figure: contour plots of E(Y | X) over Water and Sun. Panels: "Traditional Models" and "Weakest link model". Labels: "Reality?", excess sun, excess water, “Curve of Optimal Use (COU)”.
They may work together like this…
Or like this…
Source: H. Frederik Nijhout, American Scientist (2003)
The Weakest Link Idea $E(Y_i) = \min_j \{\phi_j(x_{ij}; \theta_j) : j = 1, \ldots, m\}$ • Usually, $\phi_j = \phi$ for all $j$; • Weakest link gene minimizes $\phi$; • Each patient has his/her own weakest link;
WL Model for Binary Response Data: $\phi_j(x_{ij}; \theta_j) = \operatorname{logit}^{-1}(\alpha_j + \beta_j x_{ij})$, $E(Y_i) = \min_j \{\operatorname{logit}^{-1}(\alpha_j + \beta_j x_{ij}) : j = 1, \ldots, m\}$, and $\theta_j = (\alpha_j, \beta_j)$. Parametric Weakest Link (PWL) Model
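A minimal Python sketch of the PWL mean for one patient, assuming the slide's formula E(Y_i) = min_j logit^-1(α_j + β_j x_ij). The function names and the numeric values below are hypothetical and chosen only for illustration; they are not taken from the talk.

```python
import math


def inv_logit(z):
    """Inverse logit (logistic function)."""
    return 1.0 / (1.0 + math.exp(-z))


def pwl_mean(x, alpha, beta):
    """Return (E(Y_i), index of the weakest-link gene) for one patient.

    x, alpha and beta are equal-length sequences over genes j = 1..m.
    """
    phis = [inv_logit(a + b * xj) for xj, a, b in zip(x, alpha, beta)]
    weakest = min(range(len(phis)), key=phis.__getitem__)
    return phis[weakest], weakest


# Hypothetical expression values and parameters for m = 3 genes.
x = [0.5, -1.0, 2.0]
alpha = [0.0, 0.0, 0.0]
beta = [1.0, 1.0, 1.0]
mean, weakest = pwl_mean(x, alpha, beta)
print(weakest, round(mean, 4))
```

In this toy case the second gene (index 1) has the smallest value of the link function, so it alone determines the patient's expected response.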
Parametric Weakest Link Model For Survival Data $\lambda(t; x_{ij}) = \lambda_0(t)\exp[\min_j \{\phi_j(x_{ij}; \theta_j)\}]$; with linear $\phi_j$: $\lambda(t; x_{ij}) = \lambda_0(t)\exp[\min_j \{\beta_j x_{ij}\}]$
Quantile-Matching Weakest Link (QWL) Model:
Curve of Optimal Use: $f_\Delta(p) = 1 - F(F^{-1}(1 - p) - \Delta)$, $F$ a CDF (Normal or Logistic); $f_\Delta^{-1}(p) = f_{-\Delta}(p) = 1 - F(F^{-1}(1 - p) + \Delta)$; $f_{\Delta_1}(f_{\Delta_2}(p)) = f_{\Delta_1 + \Delta_2}(p)$.
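The two properties on this slide can be checked numerically. The sketch below assumes the reading f_Δ(p) = 1 - F(F^-1(1 - p) - Δ) with F the standard Normal CDF (one of the two choices named on the slide). The function name `f_delta` and the test values are hypothetical.

```python
from statistics import NormalDist

_F = NormalDist()  # standard Normal; assumed choice of F for this sketch


def f_delta(p, delta):
    """f_delta(p) = 1 - F(F^{-1}(1 - p) - delta), for 0 < p < 1."""
    return 1.0 - _F.cdf(_F.inv_cdf(1.0 - p) - delta)


p, d1, d2 = 0.3, 0.4, -1.1

# Inverse property: applying f with -delta undoes f with +delta.
back = f_delta(f_delta(p, d1), -d1)

# Composition property: shifts add on the quantile scale.
composed = f_delta(f_delta(p, d2), d1)
direct = f_delta(p, d1 + d2)

print(abs(back - p) < 1e-9, abs(composed - direct) < 1e-9)
```

Because F is symmetric, f_Δ is a shift by Δ on the quantile scale, which is why the inverse is f with -Δ and compositions add.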
Data Pre-processing: Simplify! Simplify the process, minimize data handling: • Affy: • Run RMA, then generate ratios. • cDNA arrays: use ratios. • Focus on known genes only; • 2000 LocusLink IDs in all 4 data sets;
Approach to Data Analysis Gene Selection: Based on substantive hypotheses; • Use DAVID at NIAID to get gene classes: • Not optimal, but necessary in this case;
Approach to Data Analysis Groups of genes, from DAVID: • Cell Cycle (CELL, 24 genes); • Apoptosis (AP, 12 genes); • Extracellular Matrix (ECM, 18 genes); • Matrix Metalloproteinases (MMPs, 10 genes); • WNT Pathway (11 genes).
Approach to Data Analysis Form dyads of genes, for testing: • CELL.AP (288), CELL.ECM (432), … • AP.ECM (216), AP.MMP (120), … Etc. • Pair up all of the above genes with 45 genes from the Beer supplemental data.
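The dyad counts follow from crossing every gene in one class with every gene in another. A small Python sketch, using the group sizes stated on the slides (24, 12, 18, 10, 11, and 45 Beer genes); the gene identifiers generated here are placeholders, not real LocusLink IDs.

```python
from itertools import product

# Group sizes as listed on the slides.
group_sizes = {"CELL": 24, "AP": 12, "ECM": 18, "MMP": 10, "WNT": 11, "BEER": 45}

# Placeholder gene identifiers, for illustration only.
genes = {name: [f"{name}_{k}" for k in range(n)] for name, n in group_sizes.items()}


def dyads(genes_a, genes_b):
    """All (gene from A, gene from B) pairs."""
    return list(product(genes_a, genes_b))


pairs = [("CELL", "AP"), ("CELL", "ECM"), ("AP", "ECM"),
         ("AP", "MMP"), ("ECM", "MMP"), ("ECM", "BEER"), ("WNT", "BEER")]
counts = {f"{a}.{b}": len(dyads(genes[a], genes[b])) for a, b in pairs}
print(counts)
```

The resulting counts match the numbers quoted on the slides, for example 24 × 12 = 288 for CELL.AP and 18 × 45 = 810 for ECM.BEER.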
Approach to Data Analysis Use profile likelihood to estimate a COU for each pair of genes; Use Bonferroni-by-4 on the p-values; For the direction, take the smallest of the four p-values.
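One plausible reading of the Bonferroni-by-4 step, sketched in Python: each gene pair yields four p-values (one per direction), the smallest is multiplied by 4 and capped at 1, and its direction is reported. The function name and the example p-values are hypothetical; the slides do not give the implementation.

```python
# Direction labels as used in the results slides.
DIRECTIONS = ("minp1p2", "maxp1p2", "maxp1q2", "minp1q2")


def bonferroni_by_4(p_values):
    """p_values maps each direction to its raw p-value.

    Returns (chosen direction, Bonferroni-adjusted p-value).
    """
    direction = min(DIRECTIONS, key=lambda d: p_values[d])
    adjusted = min(1.0, len(DIRECTIONS) * p_values[direction])
    return direction, adjusted


# Hypothetical raw p-values for a single gene pair.
example = {"minp1p2": 0.20, "maxp1p2": 0.004, "maxp1q2": 0.60, "minp1q2": 0.03}
direction, adj_p = bonferroni_by_4(example)
print(direction, adj_p)
```

Under this reading, a pair counts as significant when the adjusted value is below 0.05, that is, when its smallest raw p-value is below 0.0125.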
Selected Results CELL.AP: 60 of 288 had adjusted p < 0.05. ECM.MMP: 37 of 180 had adjusted p < 0.05. ECM.BEER: 299 of 810 had adjusted p < 0.05. WNT.BEER: 152 of 495 had adjusted p < 0.05.
Selected Results CELL.AP, 60 significant pairs: 5 minp1p2; 17 maxp1p2; 13 maxp1q2; 25 minp1q2. ECM.MMP, 37 significant pairs: 2 minp1p2; 6 maxp1p2; 11 maxp1q2; 18 minp1q2. ECM.BEER, 299 significant pairs: 60 minp1p2; 65 maxp1p2; 100 maxp1q2; 74 minp1q2. WNT.BEER, 152 significant pairs: 32 minp1p2; 19 maxp1p2; 56 maxp1q2; 45 minp1q2.
Selected Results LocusLink ID = 4175, a Cell Cycle component, MCM6, minichromosome maintenance deficient 6 (S. cerevisiae), involved in initiating replication. Biological interaction with 7 LocusLink IDs in the apoptosis class (5 in same direction): 2 minp1p2: TRAF1, TNFRSF1B; 3 maxp1p2: SFRS2IP, MCL1, TRADD; 1 maxp1q2: CRADD (good prognosis); 1 minp1q2: BCL2L2
MCM6 • MCMs 2-7 bind to DNA after mitosis and enable DNA replication. • MCM2 is a biomarker of proliferating cells and a marker for premalignant lung cells. • MCM6 is in a chromosomal region that is amplified in lung cancer, and its mRNA level is also increased (Kaminski, Dehan, unpublished data)
Selected Results II Can we find unexpected interactions? Biological interactions between Beer & ECM? ECM genes show up in every cancer dataset. Fibronectin is a predictor of melanoma invasiveness.
PAI-1 (Plasminogen Activator Inhibitor 1) • Is a known marker of bad prognosis • Interacts significantly with at least 4 ECM genes • Vitronectin maxp1p2 (Good Prognosis!) • Collagen 1A2 maxp1q2 • Collagen 9A2 minp1q1 • Collagen 5A1 minp1q1
Does it make sense? • Elevated PAI-1 activities are associated with coronary thrombosis and with a poor prognosis in many cancers • Vitronectin binding extends the lifetime of active PAI-1, which controls hemostasis and has also been implicated in angiogenesis. • The PAI-1 effects on cell adhesion and motility depend on vitronectin binding…
Conclusions Weakest Link Models: • Make sense in biology; • Can be applied to gene expression data; • May identify novel gene interactions.
Next Steps • Validation on independent data set; • Extend from dyads to triads; • Use triads to explore pathways; • Extend to arbitrary number of genes.
Acknowledgements: Naftali Kaminski, M.D. Director, Dorothy P. & Richard P. Simmons Center for Interstitial Lung Diseases Public Defenders’ Association
Supplementary Slides
Potential Problems with Linear Models • Mechanistic model, not just predictive. • Several covariates impact a response. – Example: immune response in Melanoma. • Each covariate is “necessary.” – Necessary = “Necessary to impact response probability.” • Logistic Model is unrealistic : 18-Nov-03 Introduction: Motivation for Model 43
– Increasing a covariate always has an effect. – One covariate can be traded off for another. • Example: Branch, Bryant, et al. (1997): N-acetyltransferase Metabolic Activity and Bladder Cancer. – Goal: determine role of N-acetyltransferase slow acetylator phenotype in susceptibility to occupationally related aggressive bladder cancer. – Problem: possible interaction without main effect.
Interaction without Main Effects • For categorical data, not a new idea: – “Synergism” in BFH (1975). – 2 x 2 x 2 contingency table. – BFH cite Worcester [1971] model, for thromboembolism data. • My adaptation of BFH… – (To SWP3.0)
Est. RR (controlling for age, sex, alcohol, tobacco), by occupational exposure
Acetylator phenotype    Unexposed    Exposed
Fast                    1.0          1.0
Slow                    1.1          8.0
(1.9, 3.4) = 95% CI; p < 0.01
Is there “synergy”, or “synergism”, here?
$\pi_i = E[Y_i \mid X_1, X_2]$; $\operatorname{logit}(\pi_i) = \alpha + \beta \rho_i$, where $\rho_i$ is one of $\min\{p_1, f_\Delta^{-1}(p_2)\}$; or $\max\{p_1, f_\Delta^{-1}(p_2)\}$; or $\max\{p_1, 1 - f_\Delta^{-1}(p_2)\}$; or $\min\{p_1, 1 - f_\Delta^{-1}(p_2)\}$; and $f_\Delta : [0,1] \to [0,1]$ is defined by $f_\Delta(p) = 1 - F(F^{-1}(1 - p) - \Delta)$, where $F$ is a symmetric distribution function.
The Quantile-Matching Weakest Link (QWL) Model In $p_1$-$p_2$ space, the unit square, define a new covariate, one of: $\rho = \min\{p_1, p_2\}$ (minp1p2); $\rho = \max\{p_1, p_2\}$ (maxp1p2); $\rho = \max\{p_1, 1 - p_2\}$ (maxp1q2); $\rho = \min\{p_1, 1 - p_2\}$ (minp1q2)
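The four candidate covariates can be computed directly from a point in the unit square. A short Python sketch, assuming p1 and p2 are already values in [0, 1]; the function name `qwl_covariate` and the example point are hypothetical.

```python
def qwl_covariate(p1, p2, direction):
    """Return rho for one of the four QWL directions named on the slide."""
    q2 = 1.0 - p2  # the "q2" in the labels maxp1q2 / minp1q2
    if direction == "minp1p2":
        return min(p1, p2)
    if direction == "maxp1p2":
        return max(p1, p2)
    if direction == "maxp1q2":
        return max(p1, q2)
    if direction == "minp1q2":
        return min(p1, q2)
    raise ValueError(f"unknown direction: {direction}")


# Hypothetical point in the unit square.
p1, p2 = 0.7, 0.4
rhos = {d: qwl_covariate(p1, p2, d)
        for d in ("minp1p2", "maxp1p2", "maxp1q2", "minp1q2")}
print(rhos)
```

The resulting rho then enters the model as a single covariate, so each direction can be fitted with standard binary-response or survival software.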
QWL Model For binary response data: $E[Y_i \mid X_1, X_2] = \alpha + \beta \rho_i$. For survival data: $\lambda(t; x_i) = \lambda_0(t)\exp(\beta \rho_i)$. Fitting this QWL Model: Done.