The Analysis of Biomedical Data
Caveats and Challenges
Ray L. Somorjai
Head, Biomedical Informatics Group
Institute for Biodiagnostics
National Research Council Canada
Winnipeg, MB
Canada

  The Prime Caveat:
"There Are No Panaceas in Data Analysis"
P. J. Huber, Annals of Statistics (1985)

  Two Goals of Biomedical Data Classification:
1. Develop Robust Classifiers
- Capable of Reliably Classifying Unknown Patterns
2. Identify Fewest Maximally Discriminatory Features
(genes, proteins, chemical compounds)
- Find Biologically Relevant, Interpretable Features
Not All Classifiers Satisfy Both Requirements

  The Two Realities of Biomedical Data
{Microarrays (Genomics), Mass Spectra (Proteomics)
Magnetic Resonance, Raman & Infrared Spectra}:
The Clinical Reality:
Few Samples, K = O(10) – O(100)
The "Acquisitional" Reality:
Many Features (genes, M/Z values, spectral data points),
N = O(1 000) – O(10 000)

  Contrast
Classical Statistics –
N → ∞: The Art of Asymptotics
(here N is the number of samples, written K on the other slides)
with
Modern "Statistics" –
Methods Applicable when N → 0?

  Two Realities ⇒ Two Curses:
The Curse of Dimensionality:
Penalty for Too Many Features
The Curse of Dataset Sparsity:
Penalty for Too Few Samples

  The Curse of Dimensionality
or
Penalty for Too Many Features:
A Robust Classifier Needs a
Sample to Feature Ratio (SFR) ≥ 10
For Biomedical Data SFR ~ 1/20 – 1/200
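
To make the ratio concrete, the arithmetic can be sketched as follows. This is only an illustration: the function names are hypothetical, and the values K = 50 and N = 5000 are assumed examples chosen from within the ranges quoted earlier, not data from any study.

```python
def sample_to_feature_ratio(n_samples: int, n_features: int) -> float:
    """Return the sample-to-feature ratio (SFR) K / N."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return n_samples / n_features


def max_features_for_sfr(n_samples: int, target_sfr: float = 10.0) -> int:
    """Largest feature count that still satisfies SFR >= target_sfr."""
    return int(n_samples // target_sfr)


# Assumed, illustrative values (K samples, N features).
K, N = 50, 5000
print(sample_to_feature_ratio(K, N))   # 0.01, i.e. 1/100
print(max_features_for_sfr(K))         # 5
```

Under these assumptions, a 50-sample dataset would have to be reduced from 5000 features to about 5 before the SFR ≥ 10 guideline is met.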

  The Curse of Dataset Sparsity:
If Too Few Samples,
Trivial to Classify Them Perfectly
More Samples, More Realistic Assessment of
Intrinsic Class Overlap (Bayes Error)
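
A small simulation may help show why perfect classification of very few samples is trivial. The sketch below is an assumption-laden toy, not the author's procedure: it generates pure-noise features (no class information at all) and counts how many of them happen to separate two balanced classes perfectly with a single threshold. The function name, sample sizes, and seed are hypothetical.

```python
import random


def count_perfect_noise_features(n_samples, n_features, seed=0):
    """Count pure-noise features that separate two balanced classes perfectly.

    A feature "separates" the classes if a single threshold puts every
    class-0 value on one side and every class-1 value on the other.
    """
    rng = random.Random(seed)
    half = n_samples // 2
    labels = [0] * half + [1] * (n_samples - half)
    perfect = 0
    for _ in range(n_features):
        # Gaussian noise drawn independently of the labels: no real signal.
        values = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        class0 = [v for v, y in zip(values, labels) if y == 0]
        class1 = [v for v, y in zip(values, labels) if y == 1]
        if max(class0) < min(class1) or max(class1) < min(class0):
            perfect += 1
    return perfect


few = count_perfect_noise_features(n_samples=6, n_features=2000)
more = count_perfect_noise_features(n_samples=40, n_features=2000)
print(few, more)
```

With 3 samples per class, a given noise feature separates the classes with probability 2/20 = 10%, so many "perfect" features are expected among 2000. With 20 samples per class the probability is so small that essentially none should survive, which is the sense in which more samples give a more realistic picture of class overlap.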

  Consequences of the Curses:
1. Curse of Dimensionality (SFR low)
- Danger of Overfitting
- Conclusions Are Suspect
- No Discriminatory Features Identified
2. Curse of Dataset Sparsity
Insidious:
- Practically Anything Seems to Work!
- Several Equally Good Solutions:
Uniqueness Problematic
- Classifier Robustness Is Suspect

  Steps of Classifier Development:
1. Partition Dataset into Training & Validation Sets
2. Create Optimal Classifier Using Training Set Only
- Important to Use External Crossvalidation
3. Whenever Possible or Feasible, Validate Classifier
with Independent Validation Set, Not Involved in
Developing Classifier
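
The steps above can be sketched in code. The example below is a minimal illustration of external crossvalidation, assuming that "external" means feature selection is repeated inside every fold using training rows only. The mean-difference feature ranking and nearest-centroid classifier are simple stand-ins chosen for brevity, not the methods used in the talk, and all function names are hypothetical.

```python
import random
from statistics import mean


def select_features(X, y, n_keep):
    """Rank features by absolute difference of class means; keep the top n_keep."""
    def score(j):
        m0 = mean(row[j] for row, lab in zip(X, y) if lab == 0)
        m1 = mean(row[j] for row, lab in zip(X, y) if lab == 1)
        return abs(m0 - m1)
    return sorted(range(len(X[0])), key=score, reverse=True)[:n_keep]


def nearest_centroid_predict(X_train, y_train, X_test, features):
    """Assign each test row to the class with the closer centroid."""
    centroids = {}
    for lab in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == lab]
        centroids[lab] = [mean(r[j] for r in rows) for j in features]
    preds = []
    for row in X_test:
        dist = {lab: sum((row[j] - c) ** 2 for j, c in zip(features, cen))
                for lab, cen in centroids.items()}
        preds.append(min(dist, key=dist.get))
    return preds


def external_cv_accuracy(X, y, n_folds=5, n_keep=5, seed=0):
    """Cross-validate with feature selection redone inside every fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Selection sees training rows only; held-out rows never influence it.
        feats = select_features(X_tr, y_tr, n_keep)
        preds = nearest_centroid_predict(X_tr, y_tr, [X[i] for i in test_idx], feats)
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(X)


# Assumed toy data: 30 samples, 200 pure-noise features, alternating labels.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(30)]
y = [i % 2 for i in range(30)]
print(external_cv_accuracy(X, y))
```

Because the toy features carry no class signal, an honest estimate should be near chance. Selecting features on the full dataset before crossvalidating would typically report a much higher, misleading accuracy on the same data.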

  A Classifier is Claimed Robust if
Training and Validation Set Results Are of
Comparable Accuracy
Fallacious when Curses Are "Active"!

  Developed
Statistical Classification Strategy - SCS
Divide and Conquer:
Four-Stage, Multivariate, Robust
1. Visualization of High-Dimensional Data
2. Preprocessing/Feature Extraction (ORS_GA)
3. Robust Classifier ("Bootstrap" Aggregation)
4. Classifier Fusion (e.g. Stacked Generalization)
Very Successful!

  Stage 1- Visualization (later)
Stage 2- Preprocessing
a) Normalization (alignment, common area)
b) Transformation (derivatives, rank ordering)
c) "Feature Space Reduction":
⇒ Optimal Feature Selector Critical
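
The preprocessing options listed above can be illustrated with a short sketch. It assumes that "common area" means scaling each spectrum to unit total area, that a derivative is approximated by a first difference between adjacent points, and that rank ordering replaces intensities by their ranks. Alignment is not shown, and the function names are hypothetical.

```python
def normalize_area(spectrum):
    """Scale a spectrum so its absolute intensities sum to 1."""
    total = sum(abs(v) for v in spectrum)
    if total == 0:
        raise ValueError("cannot normalize an all-zero spectrum")
    return [v / total for v in spectrum]


def first_derivative(spectrum):
    """First difference between adjacent data points."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


def rank_order(spectrum):
    """Replace each intensity by its rank (1 = smallest); ties broken by position."""
    order = sorted(range(len(spectrum)), key=lambda i: spectrum[i])
    ranks = [0] * len(spectrum)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


spectrum = [2.0, 6.0, 4.0, 8.0]    # made-up intensities for illustration
print(normalize_area(spectrum))    # [0.1, 0.3, 0.2, 0.4]
print(first_derivative(spectrum))  # [4.0, -2.0, 4.0]
print(rank_order(spectrum))        # [1, 3, 2, 4]
```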

  For Biomedical Spectra
Optimal Feature Selector =
Optimal Region Selector (ORS_GA)
Characteristics of ORS_GA:
a) Retains Spectral Identity
b) Feature is Some Function of Adjacent Data Points
(e.g. Average or Variance)
c) Genetic Algorithm (GA)- Driven
⇒ M < K << N Attributes
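
Characteristic (b) can be sketched as follows: each feature is a function (average or variance) of a run of adjacent data points, so N spectral points collapse to M region features while each feature still maps back to a spectral region. The sketch covers only the region-to-feature step. The genetic-algorithm search that would propose the regions is not implemented, and the function name, region boundaries, and intensities are hypothetical.

```python
from statistics import mean, pvariance


def region_features(spectrum, regions, func=mean):
    """Collapse each (start, stop) region of adjacent points into one feature.

    `regions` uses half-open index ranges, as in Python slicing.
    """
    features = []
    for start, stop in regions:
        if not 0 <= start < stop <= len(spectrum):
            raise ValueError(f"invalid region ({start}, {stop})")
        features.append(func(spectrum[start:stop]))
    return features


spectrum = [1.0, 3.0, 5.0, 7.0, 2.0, 2.0, 9.0, 11.0]   # N = 8 made-up points
regions = [(0, 4), (4, 6), (6, 8)]   # hypothetical; a GA would propose these
print(region_features(spectrum, regions))              # [4.0, 2.0, 10.0]
print(region_features(spectrum, regions, pvariance))   # [5.0, 0.0, 1.0]
```

Here M = 3 features replace N = 8 points. In a GA-driven selector, candidate region sets like `regions` would presumably be scored by classification performance on the training set.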

  Stage 3- Robust Classifier Development
How Do We "Robustify"?
1. Already Completed: Feature Selection [ORS] to
Satisfy (Sample / Feature Ratio) K / M ~ 5 - 10
2. "Bootstrap-Inspired Classifier Aggregation":

