The Analysis of Biomedical Data:
Caveats and Challenges
Ray L. Somorjai
Head, Biomedical Informatics Group
Institute for Biodiagnostics
National Research Council Canada
Winnipeg, MB, Canada
The Prime Caveat:
“There Are No Panaceas in Data Analysis”
P. J. Huber, Annals of Statistics (1985)
Two Goals of Biomedical Data Classification:
1. Develop Robust Classifiers
   - Capable of Reliably Classifying Unknown Patterns
2. Identify Fewest Maximally Discriminatory Features
   (genes, proteins, chemical compounds)
   - Find Biologically Relevant, Interpretable Features
Not All Classifiers Satisfy Both Requirements
The Two Realities of Biomedical Data
{Microarrays (Genomics), Mass Spectra (Proteomics),
Magnetic Resonance, Raman & Infrared Spectra}:
The Clinical Reality:
Few Samples, K = O(10) – O(100)
The “Acquisitional” Reality:
Many Features (genes, M/Z values, spectral data points),
N = O(1 000) – O(10 000)
Contrast
Classical Statistics – The Art of Asymptotics: N → ∞
with
Modern “Statistics” – Methods Applicable when N → 0?
Two Realities ⇒ Two Curses:
The Curse of Dimensionality:
Penalty for Too Many Features
The Curse of Dataset Sparsity:
Penalty for Too Few Samples
The Curse of Dimensionality,
or Penalty for Too Many Features:
A Robust Classifier Needs a
Sample-to-Feature Ratio (SFR) ≥ 10
For Biomedical Data, SFR ~ 1/20 – 1/200
(e.g. K = 100 samples with N = 5 000 features gives SFR = 1/50)
The Curse of Dataset Sparsity:
If Too Few Samples,
Trivial to Classify Them Perfectly
More Samples ⇒ More Realistic Assessment of
Intrinsic Class Overlap (Bayes Error)
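A minimal sketch (assuming Python with NumPy and scikit-learn; not part of the original talk) of why sparsity is insidious: with K = 20 pure-noise samples in N = 1 000 dimensions, a linear classifier separates even randomly assigned labels perfectly.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    K, N = 20, 1000                     # few samples, many features (SFR = 1/50)
    X = rng.standard_normal((K, N))     # pure noise: no real class structure
    y = rng.integers(0, 2, size=K)      # labels assigned at random

    clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
    print(clf.score(X, y))              # typically 1.0: perfect, yet meaningless

The perfect training score reflects only the geometry of sparse high-dimensional data, not any real class difference.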
Consequences of the Curses:
1. Curse of Dimensionality (SFR low)
   - Danger of Overfitting
   - Conclusions Are Suspect
   - No Discriminatory Features Identified
2. Curse of Dataset Sparsity
   Insidious:
   - Practically Anything Seems to Work!
   - Several Equally Good Solutions:
     Uniqueness Problematic
   - Classifier Robustness Is Suspect
Steps of Classifier Development:
1. Partition Dataset into Training & Validation Sets
2. Create Optimal Classifier Using Training Set Only
   - Important to Use External Crossvalidation (sketched below)
3. Whenever Possible or Feasible, Validate Classifier
   with Independent Validation Set, Not Involved in
   Developing Classifier
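A minimal sketch, assuming scikit-learn (not the author's code), of steps 1-3: the feature selector sits inside the pipeline, so external crossvalidation re-fits it in every fold and the held-out fold never influences which features are chosen; the validation set stays untouched until the end. Data shapes are illustrative.

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score, train_test_split

    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 2000))   # stand-in for real spectra
    y = rng.integers(0, 2, size=100)

    # Step 1: set aside an independent validation set before any modelling
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                                random_state=0)

    # Step 2: selection + classifier as one unit, re-fit in every CV fold
    pipe = Pipeline([("select", SelectKBest(f_classif, k=10)),
                     ("clf", LinearDiscriminantAnalysis())])
    print(cross_val_score(pipe, X_tr, y_tr, cv=5).mean())   # external CV estimate

    # Step 3: single final check on the untouched validation set
    pipe.fit(X_tr, y_tr)
    print(pipe.score(X_val, y_val))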
A Classifier Is Claimed Robust if
Training and Validation Set Results Are of
Comparable Accuracy
Fallacious when Curses Are “Active”!
Developed
Statistical Classification Strategy - SCS
Divide and Conquer:
Four-Stage, Multivariate, Robust
1. Visualization of High-Dimensional Data
2. Preprocessing/Feature Extraction (GA_ORS)
3. Robust Classifier (“Bootstrap” Aggregation)
4. Classifier Fusion (e.g. Stacked Generalization)
Very Successful!
Stage 1 - Visualization (later)
Stage 2 - Preprocessing
a) Normalization (alignment, common area)
b) Transformation (derivatives, rank ordering)
   (a and b sketched below)
c) “Feature Space Reduction”:
   ⇒ Optimal Feature Selector Critical
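A minimal sketch (assumed implementations, with hypothetical function names) of steps a) and b) for a matrix of spectra of shape (K, N):

    import numpy as np
    from scipy.stats import rankdata

    def normalize_area(spectra):
        # a) scale each spectrum so its total area is 1
        return spectra / spectra.sum(axis=1, keepdims=True)

    def first_derivative(spectra):
        # b) derivative along the spectral axis (emphasizes peak shape)
        return np.gradient(spectra, axis=1)

    def rank_order(spectra):
        # b) replace intensities by within-spectrum ranks (robust to scaling)
        return np.apply_along_axis(rankdata, 1, spectra)

Alignment in a) would additionally shift spectra to a common peak reference before area normalization; that step is omitted here.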
For Biomedical Spectra,
Optimal Feature Selector ⇒
Optimal Region Selector (ORS_GA)
Characteristics of ORS_GA:
a) Retains Spectral Identity
b) Feature Is Some Function of Adjacent Data Points
   (e.g. Average or Variance; sketched below)
c) Genetic Algorithm (GA)-Driven
   ⇒ M < K << N Attributes
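A minimal sketch of characteristic b): each feature is an average (or variance) over a contiguous spectral region, so the attributes remain interpretable as pieces of the spectrum. The GA search of characteristic c) is omitted; the hand-picked regions below stand in for boundaries that ORS_GA would evolve.

    import numpy as np

    def region_features(spectra, regions, func=np.mean):
        # spectra: (K, N); regions: list of (start, stop) index pairs
        # returns a (K, M) matrix with M = len(regions), M < K << N
        return np.column_stack([func(spectra[:, a:b], axis=1)
                                for a, b in regions])

    # e.g. three illustrative regions, averaged (use func=np.var for variance):
    # feats = region_features(spectra, [(100, 120), (340, 370), (800, 815)])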
Stage 3 - Robust Classifier Development
How Do We “Robustify”?
1. Already Completed: Feature Selection [ORS] to
   Satisfy (Sample / Feature Ratio) K / N ~ 5 - 10
2. “Bootstrap-Inspired Classifier Aggregation”:
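A minimal sketch, assuming scikit-learn, of stage 3 and of the stage 4 fusion named earlier: many classifiers trained on bootstrap resamples are aggregated, and heterogeneous classifiers are then fused by training a meta-learner on their crossvalidated outputs (stacked generalization). LDA and logistic regression are illustrative choices, not prescribed by the talk.

    from sklearn.ensemble import BaggingClassifier, StackingClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression

    # Stage 3: aggregate 50 classifiers, each fit on a bootstrap resample
    bagged_lda = BaggingClassifier(estimator=LinearDiscriminantAnalysis(),
                                   n_estimators=50, random_state=0)

    # Stage 4: stacked generalization fuses the ensemble with other learners;
    # the meta-learner sees only crossvalidated base predictions
    fused = StackingClassifier(
        estimators=[("bag_lda", bagged_lda),
                    ("logreg", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(),
        cv=5)
    # fused.fit(X_tr, y_tr); fused.score(X_val, y_val)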