
Tiered Computation: Raymond Ng, CIO, Proof Centre; Acting Head, Department of Computer Science, UBC



  1. Tiered Computation. Raymond Ng, CIO, Proof Centre; Acting Head, Department of Computer Science, UBC

  2. BiT Biomarker Discovery Strategy: “Omics” Tools and Approaches
     Samples: blood, urine, and tissue, held in the Biomarker Biolibrary.
     Transcriptomics: PAXgene whole blood; RNA extraction; Affymetrix microarray analysis at the Microarray Core Laboratory, Children’s Hospital, LA, CA.
     Proteomics: plasma; plasma depletion; ABI 4800 iTRAQ analysis at the UVic-Genome BC Proteomics Platform, Victoria, BC.
     Metabolomics: serum and urine; NMR and mass spec analysis at the U of Alberta Metabolomics Platform, Edmonton, AB.
     Plasma fractions shown: nascent plasma, depleted plasma, albumin bound to column.
     QA/QC: all sample collection and processing is done to SOP.
     5/4/2011

  3. Importance of Data Cleansing and Pre-processing
     A. Clinical: “Detecting potential labeling errors in microarrays by data perturbation,” Bioinformatics 2006 (Malossini, Blanzieri)
     B. mRNA: “MDQC: a new quality assessment method for microarrays based on quality control reports,” Bioinformatics 2007 (Cohen-Freue, Hollander et al.)
     C. DNA: “Modelling Recurrent DNA Copy Number Alterations in array CGH Data,” Bioinformatics 2006, 2007 (Shah, Murphy, Lam)

  4. Microarray Quality Control Assessment Tool
     [Figure: three panels, Chip Quality, Sample Quality, and Sample RNA Quality, each plotted against Sample (x-axis 0 to 200). Labeled points include 21-4, 17-6, 302-7, 25-5, 13-3, 320-1, 13-4, 13-6, 13-2, 19-1, 13-5, and 317-10.]

  5. Finding “Needles in a Haystack”
     Starting point: 54,000 probe sets; 2,000 proteins/metabolites.
     I. Pre-filtering: remove features with small variations across all samples (rejection or otherwise).
     After pre-filtering: < 10,000 probe sets; < 100 proteins/metabolites.

  6. “Needles in the Haystack” (cont)
     Input from pre-filtering: < 10,000 probe sets; < 100 proteins/metabolites (down from 54,000 probe sets and 2,000 proteins/metabolites).
     Ranking and Filtering:
     II. (Univariate) Rank each individual feature on how well it discriminates the rejection samples from non-rejection samples.
     III. (Multivariate) Rank groups of features together on their joint discrimination power, taking correlation into account.
     After ranking and filtering: ~100-500 genes/proteins/metabolites/clinical variables.

  7. “Needles in the Haystack” (cont)
     Funnel so far: 54,000 probe sets and 2,000 proteins/metabolites; then < 10,000 probe sets and < 100 proteins/metabolites; then ~100-500 genes/proteins/metabolites/clinical variables.
     IV. Panel Selection, Model Building: select features to be included in the panel, possibly assigning different weights to different features.
     Output: biomarker panel, then internally validated biomarker panel.

  8. Rich Space for Choices
     Pre-filtering (remove probe-sets with low variability): 1) k samples above absolute threshold; 2) first half using inter-quartile range; 3) first half using empirical central mass range.
     Uni-variate ranking (FDR-based; per probe-set): 1) maximum of LIMMA, robust LIMMA and SAM; 2) LIMMA; 3) robust LIMMA.
     Uni-variate filtering (per probe set): 1) FDR cut-off (FDR < 0.01); 2) size cut-off: top 50 probe-sets; 3) combination rule: FDR < 0.05 but at least 50 and at most 500 probe sets.
     Multi-variate ranking (optional): 1) Stepwise Discriminant Analysis; 2) SVM-based ranking (one step); 3) Recursive Feature Elimination (multi-step); 4) Elastic Net-based (coefficients).
     Multi-variate filtering (optional): 1) significance of improvement cut-off; 2) top 50 (as returned by multi-variate ranking); 3) non-zero coefficients (Elastic Net).
     Classifier generation: 1) Linear Discriminant Analysis; 2) Support Vector Machine; 3) Random Forest; 4) Elastic Net; 5) logistic regression.
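
Two of the choices listed on slide 8 can be sketched in a few lines of Python: the inter-quartile-range pre-filter and the uni-variate "combination rule". This is only an illustrative sketch, not the speaker's implementation. All function names are hypothetical, "first half using inter-quartile range" is read here as keeping the half of the features with the largest IQR, the FDR is assumed to be computed with the Benjamini-Hochberg procedure (the slides do not say which procedure was used), and the raw p-values are assumed to come from an upstream test such as LIMMA that is not reproduced here.

```python
from statistics import quantiles


def iqr(values):
    """Inter-quartile range of one feature's values across samples."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def prefilter_top_half_by_iqr(data):
    """Keep the half of the features with the largest IQR (assumed reading).

    `data` maps a feature name to its list of values across all samples.
    """
    ranked = sorted(data, key=lambda name: iqr(data[name]), reverse=True)
    keep = max(1, len(ranked) // 2)
    return ranked[:keep]


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def combination_rule(pvalues, fdr=0.05, min_size=50, max_size=500):
    """Select features with FDR < `fdr`, but at least `min_size` and at most
    `max_size` of them (or all features, if fewer are available).

    `pvalues` maps a feature name to its raw p-value (assumed input).
    Returns the selected names ordered by increasing p-value.
    """
    names = sorted(pvalues, key=pvalues.get)
    adjusted = bh_adjust([pvalues[name] for name in names])
    n_pass = sum(1 for q in adjusted if q < fdr)
    size = min(max(n_pass, min_size), max_size, len(names))
    return names[:size]
```

With the default arguments, `combination_rule` returns between 50 and 500 probe sets whenever at least 50 are supplied, which matches the bounds stated on the slide; the size limits are parameters so the same rule can be tried on the much smaller protein and metabolite sets.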
