

  1. Cancer Classification Using Informative Gene Profiles
  Xue-wen Chen
  Bioinformatics and Computational Life-Sciences Laboratory
  The University of Kansas
  Interface 2004, Baltimore

  2. OUTLINE
  • Introduction
  • Microarray Data Analyses
  • Bootstrapped GA/Margin Methods
  • Experiment Results
  • Discussions

  3. INTRODUCTION
  • Traditional biology: one (or few) genes in one experiment; hard to capture the “whole picture” of gene function
  • Microarray: monitor thousands of genes on a single chip simultaneously; provides a better understanding of the interactions among genes; helps explore the underlying genetic causes of many human diseases.

  4. MICROARRAY: CANCER CLASSIFICATION
  • Microarray has been successfully applied to cancer classification problems
  • According to Dudoit, Fridlyand, and Speed, there are three main problems related to microarray-based cancer classification:
  – Cancer discovery (clustering)
  – Cancer classification into known classes (supervised learning)
  – Identification of gene “markers” (gene selection)

  5. OUTLINE
  • Introduction
  • Microarray Data Analyses
  • Bootstrapped GA/Margin Methods
  • Experiment Results
  • Discussions

  6. UNSUPERVISED METHODS: CLUSTERING
  Partition genes (or samples) into homogeneous groups in order to explore the similarity among genes
  • Hierarchical Clustering (Eisen et al., Proc. Natl. Acad. Sci., 1998)
  • SOMs (Tamayo et al., Proc. Natl. Acad. Sci., 1999)
  • K-means (Tavazoie et al., Nature Genetics, 1999)
  • More
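As a rough illustration of the clustering idea (a minimal plain k-means, not the specific algorithms cited above), partitioning a few toy expression profiles might look like this; all data values are invented:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Recompute each centroid as the mean of its cluster (keep the old one if empty).
        centroids = [
            [sum(dim) / len(cl) for dim in zip(*cl)] if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return clusters

# Toy "expression profiles" over 3 samples: two genes high in samples 1-2, two in sample 3.
profiles = [[9.0, 8.5, 1.0], [8.7, 9.2, 0.8], [1.1, 0.9, 7.5], [0.7, 1.3, 8.1]]
clusters = kmeans(profiles, k=2)
```

With such well-separated profiles the two co-expressed pairs end up in the same cluster regardless of initialization.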

  7. SUPERVISED LEARNING
  • Learning (Training) Task
  – Given: Expressed gene profiles of cells and their class labels
  – Learn: Models distinguishing cells of one class from cells in other classes (genes are features)
  • Classification (Test) Task
  – Given: Expression profile of a cell whose class is unknown
  – Test: Predict the class to which this cell belongs
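The train/test split above can be sketched with a nearest-centroid rule (just one of many possible classifiers; the profiles, class names, and values below are invented for illustration):

```python
def train(profiles, labels):
    """Learning task: fit one mean expression profile (centroid) per class."""
    by_class = {}
    for p, y in zip(profiles, labels):
        by_class.setdefault(y, []).append(p)
    return {y: [sum(d) / len(ps) for d in zip(*ps)] for y, ps in by_class.items()}

def classify(model, profile):
    """Classification task: predict the class whose centroid is closest."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(profile, model[y])))

# Invented training data: expression of 3 genes in labelled tumour samples.
X = [[5.1, 0.2, 3.3], [4.8, 0.4, 3.0], [0.3, 6.1, 1.2], [0.5, 5.8, 1.0]]
y = ["classA", "classB", "classB", "classB"][:0] or ["classA", "classA", "classB", "classB"]
model = train(X, y)
prediction = classify(model, [5.0, 0.3, 3.1])  # unseen profile
```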

  8. SUPERVISED LEARNING METHODS
  • Neural Networks (Mateos et al. 2002)
  • K-nearest Neighbors (Theilhaber et al. 2002)
  • Support Vector Machines (Brown et al. 2000)
  • Fisher Discriminant Analysis (Dudoit et al. 2002)
  • Decision Trees (Dubitzky et al. 2000)
  • And more

  9. CHALLENGES IN LEARNING MICROARRAY DATA
  • High dimensionality: in microarray data analysis, the number of features (genes) is normally much larger than the # of training samples.
  • Often noisy and not normally distributed (Hunter et al. 2001, Bioinformatics)
  • Too many features are not desirable in learning: poor generalization is expected (or overfitting).
  • Essential to reduce the # of genes to use

  10. GENE SELECTION (MARKER IDENTIFICATION)
  • Feature selection is essential to reduce the test errors in microarray data classification.
  • Given such a huge amount of data, we need to remove genes irrelevant to the learning problems
  • For diagnostics or identification of therapeutic targets, a small subset of discriminant genes is needed

  11. GENE SELECTION
  • Golub et al. (1999): [mean(+) − mean(−)] / [std(+) + std(−)]
  • Xing et al. (2001): information gain to rank genes
  • Long et al. (2001): t-test with a Gaussian model
  • Furey et al. (2000): the Fisher score
  • Newton et al. (2001): a Gamma-Gamma-Bernoulli model
  • Kerr et al. (2000): ANOVA F-statistics
  • Dudoit et al. (2002): a nonparametric t-test
  • Bo and Jonassen (2002), Inza et al. (2002): forward selection
  • Khan et al. (2001): PCA
  • Li et al. (2001): GA/KNN
  • more …
  Univariate vs. multivariate
  Filter vs. wrapper
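The first statistic in the list, Golub et al.'s [mean(+) − mean(−)] / [std(+) + std(−)], can be computed directly to rank genes; this sketch uses invented expression values (and Python's sample standard deviation, where the original formulation may differ in the std estimator):

```python
from statistics import mean, stdev

def golub_score(pos, neg):
    """Golub et al. (1999) ranking statistic: [mean(+) - mean(-)] / [std(+) + std(-)]."""
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))

# Invented per-gene expression values in the (+) and (-) classes.
genes = {
    "gene_a": ([8.1, 7.9, 8.4], [2.0, 2.3, 1.8]),  # strongly up in class (+)
    "gene_b": ([4.0, 5.2, 3.1], [4.4, 3.9, 5.0]),  # uninformative
}
scores = {g: golub_score(p, n) for g, (p, n) in genes.items()}
ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
```

Genes with large |score| separate the two classes well relative to their within-class spread, so they rank first.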

  12. IN THIS PAPER
  • A method for:
  – Cancer classification and gene identification
  – Simultaneously
  • Wrapper methods

  13. OUTLINE
  • Introduction
  • Microarray Data Analyses
  • Bootstrapped GA/Margin Methods
  • Experiment Results
  • Discussions

  14. Gene Selection: General Idea
  Feature Selection = Criterion Function + Search Algorithm
  • Criterion function: should generalize (predict) well (wrapper); particularly important in microarray data classifications, since very limited training samples are available.
  • Search algorithms: efficient for very high-d data (e.g., # genes ~ 2000) in terms of both computation time and solutions
  • Margin: ability to generalize; used as the criterion function
  • GAs: better performance than SFS, much faster than exhaustive search; used as the search algorithm
  • Bootstrapping: because of limited training samples
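The "criterion function + search algorithm" recipe can be caricatured with a tiny genetic algorithm over gene subsets. This is a loose sketch of the generic wrapper idea only, not the authors' actual bootstrapped GA/margin algorithm: the fitness function here is a transparent stand-in (overlap with a known target set) so the code stays self-contained, where the slide's method would instead use a classifier margin estimated over bootstrap samples; every parameter and value is invented.

```python
import random

rng = random.Random(1)
N_GENES, POP, GENERATIONS = 20, 30, 40
TARGET = {2, 7, 11}  # stand-in for the truly informative genes (unknown in practice)

def fitness(chrom):
    """Placeholder criterion function: reward overlap with TARGET, penalise subset size.
    In the slide's method this role is played by a bootstrapped margin estimate."""
    chosen = {i for i, bit in enumerate(chrom) if bit}
    return len(chosen & TARGET) - 0.1 * len(chosen)

def crossover(a, b):
    cut = rng.randrange(1, N_GENES)  # single-point crossover
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.05):
    return tuple(bit ^ (rng.random() < rate) for bit in chrom)

# Random initial population of gene-subset "chromosomes" (bit i = gene i selected).
pop = [tuple(rng.random() < 0.2 for _ in range(N_GENES)) for _ in range(POP)]
best0 = max(map(fitness, pop))  # best initial fitness, for comparison

for _ in range(GENERATIONS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]  # elitist selection: keep the top half
    children = [mutate(crossover(*rng.sample(parents, 2)))
                for _ in range(POP - len(parents))]
    pop = parents + children

best = {i for i, bit in enumerate(max(pop, key=fitness)) if bit}
```

Because the top half of each generation survives unchanged, the best fitness can only improve over the run.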

  15. MAXIMUM MARGIN
  [Figure: two classes (y = −1 and y = +1) separated by a hyperplane]
  Maximizing the margin (the minimum distance between a hyperplane that separates two classes and the closest training samples to the decision surface).
  Motivation: obtain the tightest possible bounds for generalization; capable of avoiding overfitting

  16. MARGIN
  [Figure: separating hyperplane H with margin hyperplanes H1 and H2 at distances d+ and d−]
  • Define the hyperplane H such that:
  – w · x_i + b ≥ +1 when y_i = +1
  – w · x_i + b ≤ −1 when y_i = −1
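With the hyperplane written as on the slide, the gap between H1 (w · x + b = +1) and H2 (w · x + b = −1) has width 2/||w||, which is why maximizing the margin amounts to minimizing ||w||. A quick numeric check with an invented 2-D separator:

```python
from math import isclose, sqrt

def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1 is 2/||w||."""
    return 2.0 / sqrt(sum(wi * wi for wi in w))

def side(w, b, x):
    """Sign of w.x + b: which side of the separating hyperplane H a sample falls on."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

# Invented separator with ||w|| = sqrt(3^2 + 4^2) = 5, so the margin width is 2/5.
w, b = [3.0, 4.0], -1.0
```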
