Capturing Best Practice for Microarray Gene Expression Data Analysis
Gregory Piatetsky-Shapiro, Tom Khabaza, Sridhar Ramaswamy
Presented briefly by Joey Mudd
What is Microarray Data?
•Microarray devices measure RNA expression levels for thousands of genes in a biological sample
•The resulting data can be used for a variety of medical purposes: diagnosis, predicting treatment outcome, etc.
•The data produced are typically large and complex, which makes them a natural target for data mining
Standardizing the Data Mining Process
•CRISP-DM: the Cross-Industry Standard Process for Data Mining
•CRISP-DM standardizes the steps of a data mining project with a high-level structure and common terminology
•Useful for describing best practice
Microarray Data Analysis Issues
•The typical number of records is small (<100) due to the difficulty of collecting samples
•The typical number of attributes (genes) is large (many thousands)
•This can lead to false positives (correlations due to chance) and over-fitting
•The paper suggests reducing the number of genes examined (feature reduction)
Data Cleaning and Preparation
•Thresholding: clip values to an appropriate range (the authors used min 100, max 16,000 for Affymetrix arrays)
•Normalization: required for clustering (the authors used mean 0, standard deviation 1)
•Filtering: remove genes that do not vary enough across samples, e.g. remove G if MaxValue(G) - MinValue(G) < 500 or MaxValue(G) / MinValue(G) < 5 (see the sketch below)
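A minimal preprocessing sketch in Python, assuming NumPy and a samples-by-genes matrix; the cutoffs are the values quoted above and would need tuning for other platforms:

import numpy as np

def preprocess(X, floor=100, ceiling=16000, min_diff=500, min_ratio=5):
    """Threshold, filter, and normalize an expression matrix.

    X: array of shape (n_samples, n_genes) with raw expression values.
    """
    # Thresholding: clip raw values into the accepted range.
    X = np.clip(X, floor, ceiling)

    # Filtering: keep only genes that vary enough across samples.
    gmax, gmin = X.max(axis=0), X.min(axis=0)
    keep = (gmax - gmin >= min_diff) & (gmax / gmin >= min_ratio)
    X = X[:, keep]

    # Normalization (needed for clustering): mean 0, std 1 per gene.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return X, keep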
Feature Selection
•Because of the large number of attributes and small number of samples, feature selection is important
•Use statistical measures to determine the “best genes” for each class
•To avoid under-representing some classes, apply the heuristic of selecting an equal number of genes from each class (sketched below)
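The equal-genes-per-class heuristic might look like the following sketch; score_fn is a hypothetical placeholder for whichever per-gene statistic is used (t-value, signal-to-noise, ...):

import numpy as np

def select_genes_per_class(X, y, score_fn, k):
    """Pick the top-k genes for each class (one-vs-rest) and pool them."""
    selected = set()
    for c in np.unique(y):
        # Score every gene for "class c vs. the rest".
        scores = score_fn(X[y == c], X[y != c])
        # Keep the k highest-scoring genes for this class.
        selected.update(np.argsort(scores)[::-1][:k].tolist())
    return sorted(selected)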
Building Classification Models
•For this data, decision trees work poorly and neural nets work well
•Feature reduction alone is not sufficient
•Test models using a varying number of genes from each class (see the sketch below)
•Five-fold cross-validation is sufficient in practice; leave-one-out cross-validation is considered the most accurate
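A sketch of the model-selection loop, using scikit-learn's MLPClassifier as an assumed stand-in for the paper's neural nets and leave-one-out cross-validation as the scorer:

import numpy as np
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.neural_network import MLPClassifier

def best_subset_size(X, y, ranked_genes, sizes=(5, 10, 20, 50)):
    """Estimate accuracy for several gene-subset sizes; return the best size.

    ranked_genes: column indices sorted from most to least relevant,
    as produced by the feature-selection step.
    """
    results = {}
    for k in sizes:
        Xk = X[:, ranked_genes[:k]]  # keep only the top-k genes
        clf = MLPClassifier(max_iter=2000, random_state=0)
        results[k] = cross_val_score(clf, Xk, y, cv=LeaveOneOut()).mean()
    return max(results, key=results.get), results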
Case Study 1
•Leukemia data, 2 classes (AML, ALL); 38 training samples, 34 test samples (kept separate)
•Filter to reduce the number of genes, then select the top 100 by T-value (sketched below)
•Build neural net models; 10 genes turned out to be the best subset size
•97% accuracy (33/34 test records correctly classified)
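The T-value ranking can be sketched as a per-gene two-sample t-statistic; the Welch form below is an assumption, since the slide does not spell out the exact variant:

import numpy as np

def t_values(X, y, positive_class):
    """Per-gene t-statistic between one class and the rest."""
    a, b = X[y == positive_class], X[y != positive_class]
    diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return diff / se

Selecting the top 100 genes would then be np.argsort(np.abs(t_values(X, y, "AML")))[::-1][:100].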
Case Study 2
•Brain data, 5 classes, 42 samples (no separate test set)
•Same preprocessing as Case Study 1
•Select the top genes by a signal-to-noise measure, with an equal number of genes per class (sketched below)
•Build neural net models; 12 genes per class (60 total) gave the best results
•The lowest average error rate was 15%
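A sketch of the signal-to-noise score, assuming the common Golub-style definition (mean difference divided by the sum of standard deviations); it plugs in as the score_fn placeholder above:

import numpy as np

def signal_to_noise(X, y, positive_class):
    """Golub-style signal-to-noise per gene: (mu1 - mu2) / (sigma1 + sigma2)."""
    a, b = X[y == positive_class], X[y != positive_class]
    return (a.mean(axis=0) - b.mean(axis=0)) / (
        a.std(axis=0, ddof=1) + b.std(axis=0, ddof=1))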
Case Study 3
•Cluster analysis, with the goal of discovering natural classes
•Leukemia data with 3 classes: AML, plus ALL split into ALL-T and ALL-B
•Same preprocessing as before, plus value normalization for clustering
•Used two clustering methods from the Clementine package; both discovered the natural classes in the data, to the authors' satisfaction (a stand-in sketch follows)
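The slide does not name Clementine's two methods, so as an assumed stand-in here is a k-means sketch on the normalized matrix, with a simple cluster-vs-class tabulation to check whether the natural classes were recovered:

import numpy as np
from sklearn.cluster import KMeans

def cluster_and_compare(X_normalized, y_true, n_clusters=3):
    """Cluster samples, then report how clusters line up with known classes."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X_normalized)
    for c in range(n_clusters):
        members, counts = np.unique(y_true[labels == c], return_counts=True)
        print(f"cluster {c}:",
              ", ".join(f"{m} x{n}" for m, n in zip(members, counts)))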
Conclusions
•The ideas presented could apply to other domains with a similar balance between attributes and samples (e.g. cheminformatics or drug design)
•Future work could evaluate cost-sensitive classification, which minimizes errors according to the cost they inflict
•A principled methodology can lead to good results