  1. Unsupervised joint analysis of arrayCGH, gene expression data and supplementary features
     Christine Steinhoff (1), Matteo Pardo (1,2), Martin Vingron (1)
     (1) Max Planck Institute for Molecular Genetics, Berlin, Germany
     (2) Sensor Lab, INFM-CNR, Brescia, Italy

  2. Outline
     o The data and how they have been analyzed until now
     o MCASV: Multiple Correspondence Analysis (MCA) with Supplementary Variables
     o Results
     Bits09, Genova - 2 - Matteo Pardo

  3. Array comparative genomic hybridization (aCGH)
     o aCGH measures the (mean) number of copies of DNA stretches along the genome in order to detect copy number aberrations (CNAs)
     o Values are reported as log2(sample/control)
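As a small illustration of the log2(sample/control) quantity, the sketch below computes it for a few probes. The intensity values are hypothetical and chosen only to show a balanced probe, a single-copy gain, a single-copy loss and a two-fold amplification relative to a diploid control.

```python
import numpy as np

def log2_ratio(sample, control):
    """Per-probe log2(sample/control) intensity ratio."""
    sample = np.asarray(sample, dtype=float)
    control = np.asarray(control, dtype=float)
    return np.log2(sample / control)

# Hypothetical copy numbers: 2 vs 2, 3 vs 2, 1 vs 2, 4 vs 2.
ratios = log2_ratio([2.0, 3.0, 1.0, 4.0], [2.0, 2.0, 2.0, 2.0])
```

A value of 0 means no change; positive values indicate gain and negative values indicate loss.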

  4. aCGH for the study of cancer
     o It is well established that cancer progresses through accumulation of genomic and epigenomic aberrations
     o Advantages of DNA over expression analysis:
        • Genomic DNA is more stable than mRNA
        • CNAs define key genetic events driving tumorigenesis
     o aCGH results:
        • Identification of regions of frequent loss and gain
        • Correlation of copy-number aberrations with prognosis in a variety of cancer histologies, including breast cancer and lymphoma
        • Pinpointed new cancer genes, for example PPM1D in breast cancer and MITF in melanoma

  5. Expression and aCGH together in breast cancer
     o Few studies (~10) have measured genomic and transcriptomic profiles on the same patient cohort
     o The causal relation between CNA and gene expression is intuitive: more gene copies → more mRNA
     o Typical result: the expression level of about 60% of the genes within highly amplified regions is at least moderately elevated
     o The other way around: disease subtypes are first derived from expression arrays, and distinct CNA patterns are subsequently found for them

  6. Data Integration: what is there
     o The central point is the level at which the fusion (integration, joining) actually happens:
        1. raw data level;
        2. feature level;
        3. decision level ('decision' means as much as 'after analysis', when a decision could be taken).
     o In genomics, what has really been performed is 'decision fusion': each data source is processed separately and the outputs are then integrated.
     o For expression + aCGH: e.g. first determine regions with CNAs (possibly tissue- or patient-specific) and then look for differentially expressed (onco)genes inside these regions
     o Natural reason for pushing integration to a later stage: strong heterogeneity does not allow sensible alignment of the source data.
     o But: interaction effects are lost

  7. Joint analysis of expression and aCGH: what is there
     o Berger et al., 2006: unsupervised analysis with generalized singular value decomposition on the fused expression-aCGH matrix:
        • Convenient feature (of any vector space decomposition approach): visualization is intuitive
        • Does not take into account biomedical covariates (grade, ER and p53 status, …)
        • Does not distinguish between gene states
     o Lee et al., 2008:
        • Calculate correlations between all pairs of genes, between the aCGH and expression matrices
        • Biclustering on the correlation matrix: find modules, then study enrichment
        • No summary plot ("one point one gene")
        • No consideration of medical covariates

  8. Joint analysis of expression and aCGH: what we do
     o Berger et al., 2006: unsupervised analysis with generalized singular value decomposition on the fused expression-aCGH matrix:
        • Convenient feature (of any vector space decomposition approach): visualization is intuitive
        • Does not take into account biomedical covariates (grade, ER and p53 status, …)
        • Does not distinguish between gene states
     o MCASV has been applied in the social sciences but, to our knowledge, not to the analysis of biological high-throughput data.
     o Quite common in France ('French school': Analyse Géométrique des Données!)

  9. Correspondence Analysis
     o In a few words: PCA for discrete data
     o In some more words:
        • Applies to contingency tables (cross tabulation of two discrete variables)
        • Investigates departure from independence (as the chi-square test does)
        • Investigates similarity between profile vectors (row vectors divided by the row marginals, i.e. the components of each profile sum to 1)
        • The same applies to the analysis of the columns
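The two quantities named on this slide, row profiles and the departure from independence, can be sketched in a few lines of NumPy. The contingency table below is made up for illustration; dividing the chi-square statistic by the grand total gives the total inertia used in correspondence analysis.

```python
import numpy as np

def row_profiles(table):
    """Divide each row by its marginal so that every profile sums to 1."""
    table = np.asarray(table, dtype=float)
    return table / table.sum(axis=1, keepdims=True)

def chi_square_statistic(table):
    """Pearson chi-square: squared departure from the independence model."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return ((table - expected) ** 2 / expected).sum()

# Hypothetical 3 x 3 cross tabulation of two discrete variables.
table = np.array([[20, 5, 5],
                  [10, 10, 10],
                  [5, 5, 30]])

profiles = row_profiles(table)
chi2 = chi_square_statistic(table)
total_inertia = chi2 / table.sum()
```

A table whose rows all share the same profile is exactly independent and has zero inertia.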

  10. Correspondence Analysis ctd.
     o As in PCA: find low-dimensional projections which maximize a criterion of preserved "information"
     o In PCA: the criterion is total variance (= sum of squared distances from the mean inside the reduced-dimensional space). The metric is Euclidean.
     o In CA: the criterion is total inertia (= weighted sum of squared distances from the barycentre inside the reduced-dimensional space). The metric is chi-square:
        • The weight is given by the row marginal, i.e. profiles associated with more objects count more
        • The chi-square norm weights each profile component by the inverse column marginal, i.e. a category which is less represented counts more
     o Once the projection is found, supplementary points can be projected onto it
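One common way to compute such a projection is a singular value decomposition of the standardized residuals of the table. The sketch below follows that textbook formulation; it is not taken from the slides, and the table and the supplementary row are hypothetical. A supplementary row is placed by multiplying its profile with the standard column coordinates.

```python
import numpy as np

def correspondence_analysis(table, n_components=2):
    """Simple CA via SVD of the standardized residuals."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()
    r = P.sum(axis=1)                      # row masses (weights)
    c = P.sum(axis=0)                      # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    row_coords = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]   # principal coordinates
    col_std = Vt[:k].T / np.sqrt(c)[:, None]                 # standard coordinates
    return row_coords, col_std, sv ** 2                      # sv**2: inertia per axis

def project_supplementary_row(counts, col_std):
    """Place a row that took no part in the fit: profile times standard coordinates."""
    counts = np.asarray(counts, dtype=float)
    return (counts / counts.sum()) @ col_std

table = np.array([[20, 5, 5],
                  [10, 10, 10],
                  [5, 5, 30]])
row_coords, col_std, inertias = correspondence_analysis(table, n_components=2)

# Hypothetical extra row, projected without influencing the axes.
extra = project_supplementary_row([8, 4, 3], col_std)
```

The inertias sum to the chi-square statistic divided by the grand total, and projecting an active row as if it were supplementary returns its own coordinates.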

  11. Multiple CA
     o Extension to more than two discrete variables: not straightforward
     o Standard MCA: CA on the indicator matrix
     o Amounts to maximizing the mean projected inertia, where the mean is taken over all cross tabulations of two variables. This includes each variable's self-tabulation (which has maximal inertia)
     o Empirical corrections exist for excluding the contributions of the self-tabulations
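To make the indicator matrix and the self-tabulations concrete, the sketch below one-hot codes a small discrete matrix and forms the table of all pairwise cross tabulations (the Burt matrix). The data and the function name `indicator_matrix` are illustrative assumptions.

```python
import numpy as np

def indicator_matrix(data, categories=(-1, 0, 1)):
    """One-hot code every column; result has shape (n, p * len(categories))."""
    data = np.asarray(data)
    cats = np.asarray(categories)
    blocks = [(data[:, [j]] == cats).astype(int) for j in range(data.shape[1])]
    return np.hstack(blocks)

# Hypothetical data: 4 objects, 2 discrete variables with states -1, 0, 1.
data = np.array([[-1, 0],
                 [0, 0],
                 [1, 1],
                 [1, -1]])

Z = indicator_matrix(data)
burt = Z.T @ Z   # all pairwise cross tabulations, including self-tabulations

# The diagonal blocks are the self-tabulations: diagonal matrices of category counts.
self_tab_first_variable = burt[:3, :3]
```

The diagonal blocks carry no information about associations between variables, which is why corrections for them exist.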

  12. Integrate data with different distributions
     o Clinical covariates are discrete categories, e.g.

        grade | stage | died
        2     | 1     | yes
        4     | 3     | no
        2     | 2     | yes

     o [Figure: two distribution plots, labelled "after appropriate normalization: approx. symmetric" and "not symmetric: skewed, lognormal"]
     → Discretize all

  13. Pipeline
     [Figure: analysis pipeline, five steps]
     (1) Data input: aCGH matrix A, expression matrix E, clinical covariates P
     (2) Discretization: CBS for A, FC for E
     (3) Filtering: Var or Corr, giving $A, E \in \{-1, 0, 1\}^{n \times p}$ and categorical P
     (4) Indicator coding: $I_E, I_A \in \{0, 1\}^{n \times 3p}$, and $I_P$ for the covariate categories
     (5) MCASV: $B = [I_E \; I_A]^t [I_E \; I_A]$, $B^* = I_P^t [I_E \; I_A]$

  14. Discretization + Filtering
     o Input: $A \in \mathbb{R}^{N \times p}$ (aCGH), $E \in \mathbb{R}^{N \times p}$ (expression)
     o Discretization:
        • aCGH: Circular Binary Segmentation (R package DNAcopy)
        • Expression: two-fold change
     o Filtering:
        • Genes with the highest variance across patients
        • Genes with the highest correlation between aCGH and expression
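The expression side of this step and the two filters can be sketched as below. Several details are assumptions rather than statements from the slides: expression is taken as log2 ratios against a reference, the cutoff is symmetric and inclusive, and the correlation is Pearson. Circular Binary Segmentation is not reproduced here; the aCGH values are assumed to be already available per gene.

```python
import numpy as np

def discretize_fold_change(log2_expr, fold=2.0):
    """Map log2 ratios to -1 (down), 0 (unchanged), +1 (up) at a symmetric cutoff."""
    x = np.asarray(log2_expr, dtype=float)
    cutoff = np.log2(fold)
    return np.where(x >= cutoff, 1, np.where(x <= -cutoff, -1, 0))

def top_variance_genes(X, k):
    """Indices of the k columns (genes) with the highest variance across patients."""
    X = np.asarray(X, dtype=float)
    return np.argsort(X.var(axis=0))[::-1][:k]

def top_correlated_genes(A, E, k):
    """Indices of the k genes with the highest Pearson correlation between A and E."""
    A = np.asarray(A, dtype=float)
    E = np.asarray(E, dtype=float)
    A_c = A - A.mean(axis=0)
    E_c = E - E.mean(axis=0)
    denom = np.sqrt((A_c ** 2).sum(axis=0) * (E_c ** 2).sum(axis=0))
    # Constant columns have zero variance; their correlation is set to 0.
    corr = np.divide((A_c * E_c).sum(axis=0), denom,
                     out=np.zeros_like(denom), where=denom > 0)
    return np.argsort(corr)[::-1][:k]

# Hypothetical data: 4 patients x 3 genes.
A = [[0, 1, 0], [1, 1, 1], [2, 1, 2], [3, 1, 3]]
E = [[0, 5, 3], [2, 5, 2], [4, 5, 1], [6, 5, 0]]

states = discretize_fold_change([-1.5, -0.2, 0.0, 0.9, 1.0, 2.3])
by_variance = top_variance_genes(E, 2)
by_correlation = top_correlated_genes(A, E, 1)
```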

  15. MCASV
     Nenadic, O. and Greenacre, M. (2006) Multiple Correspondence Analysis and Related Methods. Chapman & Hall/CRC, London
     o Burt matrix $B$: super-table of all contingency tables (between pairs of genes)
     o MCA: find the plane maximizing inertia
     o Project the covariates on the plane: $I_P^t [I_E \; I_A]$
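The three steps on this slide can be put together in a minimal sketch: build the Burt matrix from the gene-state indicators, run CA on it, and project the covariate categories as supplementary rows. This is one plausible reading of the slide, not the authors' code. The function names, the random data, the single covariate (tumor grade) and the dropping of unobserved gene states are all assumptions, and no correction for self-tabulations is applied.

```python
import numpy as np

def one_hot(data, categories):
    """Indicator coding of every column of a discrete matrix."""
    data = np.asarray(data)
    cats = np.asarray(categories)
    return np.hstack([(data[:, [j]] == cats).astype(float)
                      for j in range(data.shape[1])])

def mcasv(E, A, P, p_categories, n_components=2):
    """MCA on the Burt matrix of gene states; covariates as supplementary rows."""
    Z = np.hstack([one_hot(E, (-1, 0, 1)), one_hot(A, (-1, 0, 1))])
    keep = Z.sum(axis=0) > 0               # assumption: drop states never observed
    Z = Z[:, keep]
    I_P = one_hot(P, p_categories)         # assumes every category occurs at least once
    B = Z.T @ Z                            # Burt matrix
    B_sup = I_P.T @ Z                      # covariate category x gene state counts
    Pm = B / B.sum()
    c = Pm.sum(axis=0)                     # masses (B is symmetric)
    S = (Pm - np.outer(c, c)) / np.sqrt(np.outer(c, c))
    _, sv, Vt = np.linalg.svd(S)
    k = n_components
    std = Vt[:k].T / np.sqrt(c)[:, None]   # standard coordinates of gene states
    gene_coords = std * sv[:k]             # principal coordinates of gene states
    profiles = B_sup / B_sup.sum(axis=1, keepdims=True)
    cov_coords = profiles @ std            # supplementary projection
    return gene_coords, cov_coords, keep

# Hypothetical data: 30 patients, 4 genes, one covariate (grade 1, 2 or 3).
rng = np.random.default_rng(0)
E = rng.integers(-1, 2, size=(30, 4))
A = rng.integers(-1, 2, size=(30, 4))
grade = np.tile([1, 2, 3], 10).reshape(-1, 1)

gene_coords, cov_coords, keep = mcasv(E, A, grade, (1, 2, 3))
```

Each row of `cov_coords` is the projected mean pattern of the patients in that category, so the category points, weighted by their patient counts, balance around the origin.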

  16. Data
     Show results for the correlation filter, 100 genes

  17. MCA plot: Genes

  18. MCA plot: clinical covariates

  19. MCA plot: Supplementary Variables
     • The plot is centered on the genes' mean profile.
     • Genes and covariate states which lie near the origin are less informative.

  20. MCA plot: Supplementary Variables
     • Each covariate state is plotted at the center of the gene patterns (= patients) having that state.
     • E.g., the (projection of the) mean gene pattern of the patients having tumor grade 1 is represented by the point Grade.1.

  21. MCA plot: Supplementary Variables
     Tumor grades 1, 2 and 3 separate (only) along the first component → the gene pattern of a patient is determined primarily by the tumor grade.

  22. MCA plot: Supplementary Variables
     • ER and p53 status also display considerable variation along the first component
     • p53 mutant and ER– lie on the side of higher tumor grade.
     • ER– has the highest score on the 1st component → strongest negative indicator?

  23. MCA plot: Supplementary Variables
     • Tumor stages separate clearly from each other but show no order.
     • This hints at heterogeneity of gene patterns inside each state → lack of genomic support for this classification?

  24. MCA plot: Supplementary Variables
     • Node status has no projection on the first component → independent of tumor grade progression.
     • It explains part of the remaining information in the data.
     • Node– has the biggest value on the 2nd MCA component

  25. Selected known genes
     MYC and ERBB2 are well known to be coordinately amplified and overexpressed in breast cancers with bad prognosis

  26. Genes related to clinical state ER–
     GO category enrichment
