Supervised classification and outliers detection in gene expression data (PowerPoint PPT presentation)


  1. Supervised classification and outliers detection in gene expression data
  Laurent Bréhélin and François Major
  LIRMM, Montpellier, France; LBIT, Montréal, Québec
  Outline: 1. Gene expression data and classification; 2. Outliers detection; 3. Results.

  2. Gene expression data
  The data form an n x p matrix: rows 1, ..., n are samples, columns 1, ..., p are genes, with expression measurements x11, x12, x13, ..., x21, x22, ...
  Characteristics:
  • huge number of genes;
  • low number of samples;
  • high level of noise;
  • missing values;
  • few discriminant genes.

  3. Applications
  • Cancer diagnosis:
    – annotation (tumor vs. normal);
    – detection (early detection for better treatment);
    – distinction (between cancers with the same clinical symptoms);
    – prediction (prognosis).
  • Biological interest:
    – what are the relevant genes?
    – what are the classification rules?
    – etc.

  4. Gene selection
  Why?
  • eliminate noise;
  • reduce computing time;
  • understand better.
  Selection scheme:
  1. Gene scoring, e.g. the g-score: s_g = |m_g0 − m_g1| / (s_g0 + s_g1).
  2. Selection of the k best genes.
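The selection scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper names (`g_score`, `select_genes`) and the dict-based data layout are assumptions made for the example.

```python
import statistics

def g_score(values0, values1):
    # g-score from the slide: |m_g0 - m_g1| / (s_g0 + s_g1),
    # i.e. the gap between the class means scaled by the summed
    # class standard deviations (larger = more discriminant gene).
    m0, m1 = statistics.mean(values0), statistics.mean(values1)
    s0, s1 = statistics.stdev(values0), statistics.stdev(values1)
    return abs(m0 - m1) / (s0 + s1)

def select_genes(data0, data1, k):
    # data0 / data1: dicts mapping gene name -> expression values
    # measured in class 0 / class 1. Score every gene, keep the k best.
    scores = {g: g_score(data0[g], data1[g]) for g in data0}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

For example, a gene whose two class samples barely overlap gets a much higher score than one with identical class means, so it is ranked first.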

  5. Learning a classifier - 1
  • G: a set of selected genes.
  • x = (x_1, ..., x_p): a new sample.
  Probabilistic approach (MAP):
  c_MAP = argmax_{c ∈ {0,1}} P(c | x_G)
        = argmax_{c ∈ {0,1}} P(x_G | c) P(c) / P(x_G)
        = argmax_{c ∈ {0,1}} P(x_G | c) P(c).
  • Estimating the priors P(c):
    P(0) = (# examples of class 0) / (# examples);
    P(1) = (# examples of class 1) / (# examples).
  • The problem is more difficult for P(x_G | c).

  6. Learning a classifier - 2
  The naive Bayes approach: gene expression levels are conditionally independent given the class:
  P(x_G | c) = ∏_{g ∈ G} P(x_g | c).
  Normal assumption: P(x_g | c) ~ N(x_g; µ_gc, σ²_gc),
  with the estimates:
  • µ̂_gc = m_gc (class sample mean);
  • σ̂²_gc = s²_gc (class sample variance).
  [Figure: fitted normal density for one gene and one class.]

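The MAP rule with the Gaussian naive Bayes model above can be sketched as follows. This is a hedged illustration under the slides' assumptions (normal class-conditional densities, empirical priors); the function names and data layout are invented for the example, and log-probabilities are used for numerical stability.

```python
import math
import statistics

def fit_naive_bayes(samples_by_class):
    # samples_by_class: {class label: list of expression vectors}.
    # Estimates the prior P(c) and, per gene, the normal parameters
    # (mean m_gc, variance s2_gc) from the class samples.
    total = sum(len(v) for v in samples_by_class.values())
    model = {}
    for c, samples in samples_by_class.items():
        prior = len(samples) / total
        genes = list(zip(*samples))  # transpose: one value list per gene
        params = [(statistics.mean(g), statistics.variance(g)) for g in genes]
        model[c] = (prior, params)
    return model

def classify(model, x):
    # MAP rule: argmax_c log P(c) + sum_g log N(x_g; m_gc, s2_gc).
    def log_posterior(c):
        prior, params = model[c]
        lp = math.log(prior)
        for xg, (mu, var) in zip(x, params):
            lp += -0.5 * math.log(2 * math.pi * var) - (xg - mu) ** 2 / (2 * var)
        return lp
    return max(model, key=log_posterior)
```

The conditional-independence assumption is what keeps this tractable: the joint density factorizes into one cheap univariate term per selected gene.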

  8. Evaluating the classifier
  Low number of samples → cross-validation.
  Leave-one-out procedure:
  Data: X, the complete set of samples.
  foreach x ∈ X do
      Learn using X − x;
      Classify x;
  Return the error rate;

  9. Evaluating the classifier
  Low number of samples → cross-validation.
  Leave-one-out procedure:
  Data: X, the complete set of samples.
  Gene selection (using all of X);
  foreach x ∈ X do
      Learn using X − x;
      Classify x;
  Return the error rate;

  10. Non-biased Leave-one-out
  Data: X, the complete set of samples.
  foreach x ∈ X do
      Gene selection using X − x;
      Learn using X − x;
      Classify x;
  Return the error rate;

  11. Non-biased Leave-one-out
  Data: X, the complete set of samples.
  foreach x ∈ X do
      Gene selection using X − x;
      Learn using X − x;
      Classify x;
  Return the error rate;
  [Figure: biased leave-one-out error rate vs. number of selected genes (0-100). Breast cancer SAGE data.]

  12. Non-biased Leave-one-out
  Data: X, the complete set of samples.
  foreach x ∈ X do
      Gene selection using X − x;
      Learn using X − x;
      Classify x;
  Return the error rate;
  [Figure: biased vs. non-biased leave-one-out error rates vs. number of selected genes (0-100). Breast cancer SAGE data.]
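The non-biased procedure can be written generically: the key point is that gene selection happens inside the loop, on X − x only. The callables `select_genes`, `fit` and `predict` are hypothetical stand-ins for the deck's scoring, training and MAP-classification steps.

```python
def loo_error(samples, labels, select_genes, fit, predict):
    # Non-biased leave-one-out: gene selection AND learning are redone
    # on X - x at every iteration, so the held-out sample x never
    # influences which genes are kept or the fitted parameters.
    errors = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        genes = select_genes(train_x, train_y)   # selection without x
        model = fit([[s[g] for g in genes] for s in train_x], train_y)
        if predict(model, [x[g] for g in genes]) != y:
            errors += 1
    return errors / len(samples)
```

Moving the `select_genes` call outside the loop reproduces the biased scheme of slide 9, which is exactly the difference the two curves in the figure illustrate.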

  13. Outliers
  Outlier: a gene expression measurement that differs surprisingly from the other measurements obtained for the same gene on other samples of the same class.
  Outliers bias:
  • the estimates of the model parameters (µ_gc and σ²_gc);
  • the gene score, e.g. the g-score: s_g = |m_g0 − m_g1| / (s_g0 + s_g1).

  14. Origins
  • Intrinsic factors: the surprising measurement is actually the true measure; it results from rare but not impossible biological phenomena.
  • Extrinsic factors: measurement errors, due to
    – material reasons;
    – human reasons;
    – inherent limits of the measurement method;
    – . . .

  15. Outlier detection - 1
  Principle:
  • assume that the data, with the possible exception of any outlier, form a sample of a given distribution (here the normal distribution);
  • use a reasonable statistical test to decide whether or not the suspect measurement is an outlier.
  The Thompson statistic: T_gc = |x*_gc − m_gc| / s_gc.
  The greater T_gc, the more unlikely x*_gc is.

  16. Outlier detection - 2
  The rule: if T_gc ≥ τ_αc then x*_gc is an outlier.
  How can we set τ_αc?
  • Compare with what is expected under the null hypothesis H_0 that there is no spurious observation (i.e. all points belong to the same normal distribution).
  • Find τ_αc so that P(T_gc > τ_αc | H_0) = α (e.g. α = 10^−5).
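The calibration P(T > τ | H_0) = α can be illustrated by simulation, though for the deck's tiny α (e.g. 10^−5) one would use a closed-form critical value (the Thompson/Grubbs tables, based on the t distribution) rather than Monte Carlo. This sketch, with an invented function name, just makes the definition concrete:

```python
import random
import statistics

def tau_threshold(n, alpha, trials=20000, seed=0):
    # Monte Carlo sketch of the slide's calibration: simulate n points
    # from a single normal distribution (H_0: no outlier), compute the
    # Thompson statistic T = |x* - m| / s for the most extreme point,
    # and return the (1 - alpha) quantile of T, so that
    # P(T > tau | H_0) is approximately alpha.
    rng = random.Random(seed)
    stats = []
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m, s = statistics.mean(xs), statistics.stdev(xs)
        stats.append(max(abs(x - m) for x in xs) / s)
    stats.sort()
    return stats[int((1 - alpha) * trials) - 1]
```

Note that the threshold depends on the class sample size n, which is why the deck computes a separate τ_α0 and τ_α1 for the two classes.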

  17. Leave-one-out & outliers
  Data: X and α
  foreach x ∈ X do
      Select a set of genes using X − x;
      Estimate parameters using X − x;
      Classify x;
  Return the error rate;

  18. Leave-one-out & outliers
  Data: X and α
  foreach x ∈ X do
      Compute τ_α0 and τ_α1 from α;
      Select a set of genes using X − x;
      Estimate parameters using X − x;
      Classify x;
  Return the error rate;

  19. Leave-one-out & outliers
  Data: X and α
  foreach x ∈ X do
      Compute τ_α0 and τ_α1 from α;
      foreach gene g do
          Compute T_g0 and T_g1 using X − x;
      Select a set of genes using X − x;
      Estimate parameters using X − x;
      Classify x;
  Return the error rate;

  20. Leave-one-out & outliers
  Data: X and α
  foreach x ∈ X do
      Compute τ_α0 and τ_α1 from α;
      foreach gene g do
          Compute T_g0 and T_g1 using X − x;
          if T_g0 > τ_α0 then remove x*_g0;
      Select a set of genes using X − x;
      Estimate parameters using X − x;
      Classify x;
  Return the error rate;

  21. Leave-one-out & outliers
  Data: X and α
  foreach x ∈ X do
      Compute τ_α0 and τ_α1 from α;
      foreach gene g do
          Compute T_g0 and T_g1 using X − x;
          if T_g0 > τ_α0 then remove x*_g0;
          if T_g1 > τ_α1 then remove x*_g1;
      Select a set of genes using X − x;
      Estimate parameters using X − x;
      Classify x;
  Return the error rate;
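The inner step of the procedure above (test the suspect measurement, drop it, then estimate the class parameters) can be sketched for one gene and one class. The function name is invented; a pre-computed threshold τ is assumed, as on slide 16:

```python
import statistics

def robust_params(values, tau):
    # One gene, one class: compute the Thompson statistic for the most
    # extreme measurement and remove it when T > tau, then estimate
    # the naive Bayes parameters (mean, variance) on what remains.
    m, s = statistics.mean(values), statistics.stdev(values)
    extreme = max(values, key=lambda v: abs(v - m))
    if abs(extreme - m) / s > tau:
        values = [v for v in values if v != extreme]
    return statistics.mean(values), statistics.variance(values)
```

A single spurious value can drag the estimated mean far from the bulk of the class and inflate the variance; removing it restores estimates close to the clean sample.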

  22. Gene selection & outliers
  Use outlier detection in the gene selection procedure.


  24. Gene selection & outliers
  Use outlier detection in the gene selection procedure.
  Data: X, α′ and x
  Compute τ_α′0 and τ_α′1;
  foreach gene g do
      Compute T′_g0 and T′_g1;
      if T′_g0 > τ_α′0 and T′_g1 > τ_α′1 then
          Reject gene g;
      else
          Compute the score of g;
  Return the best genes;
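The per-gene decision above can be sketched as follows, reading the slide literally: a gene is rejected when both of its class samples contain a suspect measurement (T′ above the τ′ thresholds), and is scored with the g-score otherwise. Function names are invented for the example.

```python
import statistics

def thompson(values):
    # T' statistic for one class sample: distance of the most extreme
    # measurement from the class mean, in standard deviations.
    m, s = statistics.mean(values), statistics.stdev(values)
    return max(abs(v - m) for v in values) / s

def score_or_reject(values0, values1, tau0, tau1):
    # Slide 24's selection step for one gene g: reject when both class
    # samples contain a suspect measurement; otherwise return the
    # g-score |m_g0 - m_g1| / (s_g0 + s_g1). Returns None on rejection.
    if thompson(values0) > tau0 and thompson(values1) > tau1:
        return None  # rejected
    m0, m1 = statistics.mean(values0), statistics.mean(values1)
    return abs(m0 - m1) / (statistics.stdev(values0) + statistics.stdev(values1))
```

This guards the selection itself: a gene that only looks discriminant because of spurious extreme values in each class never enters the score ranking.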

  25. Experiments
  • α′ = 10^−2;
  • α = 10^−2, 10^−5, 10^−10, 10^−15, 10^−20.
  [Figures: leave-one-out error rates of NB, NB+OD and KNN vs. number of selected genes (0-300), on both datasets.]
  Breast cancer: 78 samples (44 vs. 34), ~24000 genes.
  Lymphoma: 58 samples (32 vs. 26), ~7000 genes.

  26. Conclusions
  • Outlier detection can improve the performance of the naive Bayes classifier.
  • Naive Bayes classifier + outlier detection:
    – simple approach;
    – low computing time;
    – can achieve better results than more sophisticated methods.
  Several open questions:
  • interest of outlier detection combined with other approaches: KNNs, SVMs, weighted voting, . . .
  • comparison with more robust estimators (e.g. median vs. mean);
  • outlier origins: intrinsic or extrinsic factors?
