
Machine learning, Jacques van Helden (ORCID 0000-0002-8799-8584)



  1. DUBii - Module - Statistics with R. Machine learning. Jacques van Helden (ORCID 0000-0002-8799-8584), Institut Français de Bioinformatique (IFB), French node of the European ELIXIR bioinformatics infrastructure; Aix-Marseille Université (AMU), Lab. Theory and Approaches of Genomic Complexity (TAGC).

  2. Brain-learning exercise: assign individuals to groups based on their features

  3. Conceptual illustration with two predictor variables. The next slides provide higher-resolution versions of the plots, each of which represents a study case. Exercise: intuitively assign each individual (black dot) to one of the two groups (A, B). At each step, ask yourself the following questions:
     - Which criterion did you use to assign an individual to a group?
     - How confident do you feel about each of your predictions?
     - What is the effect of the respective means?
     - What is the effect of the respective standard deviations?
     - What is the effect of the correlations between the two variables?

  4. Conceptual illustration with two variables – Study case 1. Inspect the distribution of points for the two groups of individuals (pink, blue) in the 2-dimensional feature space. [Figure: scatter plot; axes X1 (feature 1) and X2 (feature 2).]

  5. Conceptual illustration with two variables – Study case 2. Effect of the group centre location. [Figure: scatter plot; axes X1 (feature 1) and X2 (feature 2).]

  6. Conceptual illustration with two variables – Study case 3. Effect of the group variance. [Figure: scatter plot; axes X1 (feature 1) and X2 (feature 2).]

  7. Conceptual illustration with two variables – Study case 4. Effect of the group variance. [Figure: scatter plot; axes X1 (feature 1) and X2 (feature 2).]

  8. Conceptual illustration with two variables – Study case 5. Impact of the group-specific variances (heteroscedasticity of the data). [Figure: scatter plot; axes X1 (feature 1) and X2 (feature 2).]

  9. Conceptual illustration with two variables – Study case 6. Impact of the group-specific variances (heteroscedasticity of the data). [Figure: scatter plot; axes X1 (feature 1) and X2 (feature 2).]

  10. Conceptual illustration with two variables – Study case 7. Effect of the covariance between features. [Figure: scatter plot; axes X1 (feature 1) and X2 (feature 2).]

  11. Conceptual illustration with two variables – Study case 8. Effect of the covariance between features. [Figure: scatter plot; axes X1 (feature 1) and X2 (feature 2).]

  12. Conceptual illustration with two variables – Study case 9. Group-specific covariances between features.
      - The two groups have different covariance matrices: the clouds of points are elongated in different directions.
      - How does this difference affect group assignments?

      [Figure: scatter plot; axes X1 (feature 1) and X2 (feature 2).]
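
The questions raised by the study cases (effect of the means, of the variances and of the covariances on group assignment) can be explored numerically. The sketch below is only an illustration under stated assumptions: the two groups are modelled as bivariate normal distributions, and the centres, covariance matrices and test point are hypothetical values chosen for the example, not taken from the slides. It compares two assignment criteria: the nearest group centre (Euclidean distance) and the highest Gaussian density, which takes the group-specific covariance into account.

```python
import math

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance of a 2D point to a group (mean, 2x2 covariance)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def log_density(point, mean, cov):
    """Log-density of a bivariate normal distribution at the given point."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    return (-0.5 * mahalanobis_sq(point, mean, cov)
            - 0.5 * math.log(det) - math.log(2 * math.pi))

def assign_by_density(point, groups):
    """Assign the point to the group with the highest Gaussian log-density."""
    return max(groups, key=lambda g: log_density(point, *groups[g]))

def assign_by_centre(point, groups):
    """Assign the point to the group with the nearest centre (Euclidean distance)."""
    return min(groups, key=lambda g: math.dist(point, groups[g][0]))

# Hypothetical groups: same variances, opposite covariances
# (clouds of points elongated in different directions).
groups = {
    "A": ((0.0, 0.0), ((1.0, 0.8), (0.8, 1.0))),
    "B": ((2.0, 2.0), ((1.0, -0.8), (-0.8, 1.0))),
}

individual = (1.2, 1.2)
print(assign_by_centre(individual, groups))   # B: the centre of B is closer
print(assign_by_density(individual, groups))  # A: the point lies along the long axis of A
```

With these values the two criteria disagree: the individual is closer to the centre of B, but it lies in the direction along which group A is elongated, so the density-based criterion assigns it to A.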

  13. Statistics Applied to Bioinformatics. Multivariate analysis: Introduction. Jacques van Helden (ORCID 0000-0002-8799-8584), Institut Français de Bioinformatique (IFB), French node of the European ELIXIR bioinformatics infrastructure; Aix-Marseille Université (AMU), Lab. Theory and Approaches of Genomic Complexity (TAGC).

  14. Multivariate data
      - Each row represents one object (also called unit).
      - Each column represents one variable.

      | | variable 1 | variable 2 | ... | variable p |
      |---|---|---|---|---|
      | individual 1 | x_11 | x_21 | ... | x_p1 |
      | individual 2 | x_12 | x_22 | ... | x_p2 |
      | individual 3 | x_13 | x_23 | ... | x_p3 |
      | individual 4 | x_14 | x_24 | ... | x_p4 |
      | individual 5 | x_15 | x_25 | ... | x_p5 |
      | individual 6 | x_16 | x_26 | ... | x_p6 |
      | individual 7 | x_17 | x_27 | ... | x_p7 |
      | individual 8 | x_18 | x_28 | ... | x_p8 |
      | ... | ... | ... | ... | ... |
      | individual n | x_1n | x_2n | ... | x_pn |

  15. Multivariate data with an outcome variable. The outcome variable (also called criterion variable) can be
      - qualitative (nominal): classes (e.g. cancer type);
      - quantitative (e.g. survival expectation for a cancer patient).

      Variables 1 to p are the predictor variables; variable p+1 is the outcome variable.

      | | variable 1 | variable 2 | ... | variable p | variable p+1 (outcome) |
      |---|---|---|---|---|---|
      | individual 1 | x_11 | x_21 | ... | x_p1 | y_1 |
      | individual 2 | x_12 | x_22 | ... | x_p2 | y_2 |
      | individual 3 | x_13 | x_23 | ... | x_p3 | y_3 |
      | individual 4 | x_14 | x_24 | ... | x_p4 | y_4 |
      | individual 5 | x_15 | x_25 | ... | x_p5 | y_5 |
      | individual 6 | x_16 | x_26 | ... | x_p6 | y_6 |
      | individual 7 | x_17 | x_27 | ... | x_p7 | y_7 |
      | individual 8 | x_18 | x_28 | ... | x_p8 | y_8 |
      | ... | ... | ... | ... | ... | ... |
      | individual n | x_1n | x_2n | ... | x_pn | y_n |

  16. Predictive approaches - Training set
      - The training set is used to build a predictive function.
      - This function is used to predict the value of the outcome variable for new objects.

      Training set (variables 1 to p: predictor variables; variable p+1: outcome variable)

      | | variable 1 | variable 2 | ... | variable p | variable p+1 (outcome) |
      |---|---|---|---|---|---|
      | individual 1 | x_11 | x_21 | ... | x_p1 | y_1 |
      | individual 2 | x_12 | x_22 | ... | x_p2 | y_2 |
      | individual 3 | x_13 | x_23 | ... | x_p3 | y_3 |
      | ... | ... | ... | ... | ... | ... |
      | individual N_train | x_1n | x_2n | ... | x_pn | y_n |

      Set to predict (the outcome is unknown)

      | | variable 1 | variable 2 | ... | variable p | variable p+1 (outcome) |
      |---|---|---|---|---|---|
      | individual 1 | x_11 | x_21 | ... | x_p1 | ? |
      | individual 2 | x_12 | x_22 | ... | x_p2 | ? |
      | individual 3 | x_13 | x_23 | ... | x_p3 | ? |
      | ... | ... | ... | ... | ... | ... |
      | individual N_pred | x_1n | x_2n | ... | x_pn | ? |

  17. Evaluation of prediction with a testing set

      Training set (variables 1 to p: predictor variables; variable p+1: outcome variable)

      | | variable 1 | variable 2 | ... | variable p | variable p+1 (outcome) |
      |---|---|---|---|---|---|
      | individual 1 | x_11 | x_12 | ... | x_1p | y_1 |
      | individual 2 | x_21 | x_22 | ... | x_2p | y_2 |
      | individual 3 | x_31 | x_32 | ... | x_3p | y_3 |
      | ... | ... | ... | ... | ... | ... |
      | individual ntrain | x_n1 | x_n2 | ... | x_np | y_n |

      Testing set (the known value of the outcome is compared with the predicted value)

      | | variable 1 | variable 2 | ... | variable p | variable p+1 (known value) | variable p+1 (predicted) |
      |---|---|---|---|---|---|---|
      | individual 1 | x_11 | x_12 | ... | x_1p | y_1 | y'_1 |
      | individual 2 | x_21 | x_22 | ... | x_2p | y_2 | y'_2 |
      | individual 3 | x_31 | x_32 | ... | x_3p | y_3 | y'_3 |
      | ... | ... | ... | ... | ... | ... | ... |
      | individual ntest | x_n1 | x_n2 | ... | x_np | y_ntest | y'_ntest |

      Set to predict (the outcome is unknown)

      | | variable 1 | variable 2 | ... | variable p | variable p+1 (outcome) |
      |---|---|---|---|---|---|
      | individual 1 | x_11 | x_12 | ... | x_1p | ? |
      | individual 2 | x_21 | x_22 | ... | x_2p | ? |
      | individual 3 | x_31 | x_32 | ... | x_3p | ? |
      | ... | ... | ... | ... | ... | ... |
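
The training/testing scheme of the two previous slides can be sketched in a few lines. This is a minimal illustration, not the method used in the course: the data are synthetic (two hypothetical groups drawn from normal distributions), the predictive function is a simple nearest-centroid rule chosen for brevity, and the 30% testing fraction is an arbitrary assumption.

```python
import math
import random

def make_dataset(n_per_group, rng):
    """Synthetic individuals: two predictor variables and a nominal outcome (A or B)."""
    centres = {"A": (0.0, 0.0), "B": (5.0, 5.0)}  # hypothetical group centres
    data = []
    for label, (m1, m2) in centres.items():
        for _ in range(n_per_group):
            data.append(((rng.gauss(m1, 1.0), rng.gauss(m2, 1.0)), label))
    return data

def split(data, test_fraction, rng):
    """Randomly split the individuals into a training set and a testing set."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train(training_set):
    """Build the predictive function: here, the centroid of each class."""
    sums = {}
    for (x1, x2), y in training_set:
        s = sums.setdefault(y, [0.0, 0.0, 0])
        s[0] += x1
        s[1] += x2
        s[2] += 1
    return {y: (s[0] / s[2], s[1] / s[2]) for y, s in sums.items()}

def predict(centroids, x):
    """Predicted class y' = class whose centroid is nearest to x."""
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))

def hit_rate(centroids, testing_set):
    """Proportion of testing individuals whose predicted class equals the known class."""
    hits = sum(predict(centroids, x) == y for x, y in testing_set)
    return hits / len(testing_set)

rng = random.Random(42)  # fixed seed so that the sketch is reproducible
data = make_dataset(100, rng)
training_set, testing_set = split(data, 0.3, rng)
centroids = train(training_set)
accuracy = hit_rate(centroids, testing_set)
print(len(training_set), len(testing_set), accuracy)
```

The testing individuals play no role in `train()`: their known outcome is only used afterwards, to compare y with y'.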

  18. Flowchart of the approaches in multivariate analysis
      - Input: multidimensional (multivariate) table X.
      - Transformations of X: distance matrix; scaling; reduction of dimensions (variable selection, principal component analysis).
      - Outcome variable Y?
        - None: exploratory analysis.
          - Visualisation: graphical representations.
          - Cluster analysis: discovered classes + individual assignment.
        - Quantitative: regression analysis, which gives the predicted value of a quantitative variable, y_est = f(x).
        - Nominal: supervised classification, which gives the assignment of individuals to predefined classes, g = f(x).
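
The decision step of the flowchart (which family of approaches applies, depending on the outcome variable) can be written as a small function. This is only one possible reading of the flowchart; the function names and the rule used to detect the type of the outcome are assumptions made for the illustration.

```python
def outcome_type(y):
    """Return None (no outcome variable), 'quantitative' or 'nominal'.

    Assumption: an outcome made only of numbers is treated as quantitative.
    Classes coded as integers would therefore be misread and should be
    converted to labels first.
    """
    if y is None:
        return None
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in y):
        return "quantitative"
    return "nominal"

# Family of approaches for each type of outcome variable, as in the flowchart.
APPROACHES = {
    None: "exploratory analysis",              # visualisation, cluster analysis
    "quantitative": "regression analysis",     # y_est = f(x)
    "nominal": "supervised classification",    # g = f(x)
}

def choose_approach(y):
    """Pick the family of approaches from the outcome variable (or its absence)."""
    return APPROACHES[outcome_type(y)]

print(choose_approach(None))
print(choose_approach([12.5, 30.1, 7.8]))
print(choose_approach(["ALL", "AML", "ALL"]))
```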

  19. Quiz. Check your understanding of the concepts presented in the previous slides by applying them to your own data.
      1. Describe in one sentence a typical case of multidimensional data that is handled in your domain.
      2. Explain how you would organise this dataset into a multivariate structure.
         - What would correspond to the individuals?
         - What would correspond to the variables?
         - How many individuals (n) would you have?
         - How many variables (p) would you have?
         - Do you have one or several outcome variable(s)?
         - If so, are these quantitative, qualitative or both?
      3. Based on the conceptual framework defined above, which kinds of approaches would you envisage, and which kind of relevant information would each extract from this data? Note that several approaches can be combined to address different questions.

  20. Historical (vintage) examples

  21. Historical example of clustering heat map: Spellman et al. (1998)
      - Systematic detection of genes regulated in a periodic way during the cell cycle.
      - Several experiments were combined, with various methods of synchronization (elutriation, cdc mutants, …).
      - ~800 genes showing a periodic pattern of expression were selected (by Fourier analysis).

      [Figure: time profiles of yeast cells followed during the cell cycle.]

      Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. & Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9, 3273-97.

  22. Stress response in yeast
      - Gasch et al. (2000) tested the transcriptional response of the yeast genome to
        - various stress conditions (heat shock, osmotic shock, …);
        - drugs;
        - alternative carbon sources;
        - …
      - The heatmap shows clusters of genes having similar profiles of response to the different types of stress.

      Gasch, A. P., Spellman, P. T., Kao, C. M., Carmel-Harel, O., Eisen, M. B., Storz, G., Botstein, D. & Brown, P. O. (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11, 4241-57.

  23. Cancer types (Golub et al., 1999)
      - Compared the expression profiles of ~7000 human genes in patients suffering from two different cancer types: ALL or AML, respectively.
      - Selected the 50 genes most correlated with the cancer type.
      - Goal: use these genes as molecular signatures for the diagnosis of new patients.

      Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7.
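
The idea of selecting the genes most correlated with the cancer type can be illustrated on a toy example. Everything below is hypothetical: the gene names and expression values are invented for the sketch, and the score (absolute Pearson correlation between the expression profile and a 0/1 class indicator) is an assumption that may differ from the measure used in the original study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def top_genes(expression, classes, k):
    """Return the k genes whose profiles are most correlated (in absolute value) with the class."""
    indicator = [0.0 if c == "ALL" else 1.0 for c in classes]
    scores = {gene: abs(pearson(profile, indicator))
              for gene, profile in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Invented toy data: 6 patients (3 ALL, 3 AML) and 4 hypothetical genes.
classes = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "gene_a": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # higher in AML
    "gene_b": [2.0, 2.1, 1.9, 2.0, 2.1, 1.9],  # unrelated to the class
    "gene_c": [5.0, 4.8, 5.1, 1.0, 1.3, 0.9],  # higher in ALL
    "gene_d": [1.0, 3.0, 2.0, 2.5, 1.5, 3.5],  # weakly related to the class
}

signature = top_genes(expression, classes, k=2)
print(signature)  # ['gene_c', 'gene_a']
```

If such a signature is later used to classify patients, the selection has to be done on the training set only. Otherwise the testing set has already influenced the predictive function and the evaluation becomes too optimistic.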
