
Statistical analysis of meta-omics data (Sandra Plancade, INRA)


  1. Statistical analysis of meta-omics data. Sandra Plancade, INRA (French Institute of Research in Agriculture), 24 February 2016.

  2. Outline:
     1. Presentation of meta-omics
     2. Sequencing of metagenomics data
     3. Statistical analysis of metagenomics data
     4. Some of my topics of interest

  4. Microbial ecosystems
     - Microbial ecosystem = population of bacteria that interact in a given environment (→ examples: soil, sea water, gut).
     - A varying proportion of these bacteria are neither genotyped nor cultivable.
     - Before metagenomics: analysis of bacterial cultures.
     - Metagenomics = analysis of bacterial genes in a given biological sample (≠ genomics = analysis of the genome of a given organism).
     - Metagenomics was made possible by technological advances (→ NGS, next-generation sequencing).

  5. Meta-omics data
     Meta-omics data = omics data measured on a population of bacteria in a given environment.
     - Metagenomics data = DNA of bacteria. Two types of measurement:
       - 16S gene only, characteristic of the species
       - all genes (Whole Genome Sequencing)
       → Widely studied.
     - Meta-transcriptomics data = RNA of bacteria
     - Meta-proteomics data = proteins of bacteria
       → New.
     [Diagram: DNA → RNA → proteins → function; corresponding fields: genomics, transcriptomics, proteomics, metabolomics]

  7. Metagenomics WGS (Whole Genome Sequencing) or shotgun
     [Figure: biological sample (population of bacteria) → next-generation sequencing: genes are cut into small sequences that are "read" by the machine → list of 30-100 million reads (AGGCTGCCA, GCCATTCAGTCA, GCAGGCTA, ...)]

  8. Construction of a catalogue from a large number of samples
     [Figure, in four steps:
     - Reads from sample 1, ..., sample n (AGGCTGCCA, GCCATTCAGTCA, GTACGTAAG, AGCCTAGTCT, ...) are merged into a pool of reads.
     - The pool is assembled with a de Bruijn graph (e.g. CGCAAT and GCAATCG give CGCAATCG) into a long sequence of nucleotides (CGCATTTGAGCTAGCCTAGCATCGAGG).
     - Genes are delimited using sequences characteristic of the beginning/end of a gene.
     - Result: the metagenomics catalogue (CGCATTTG, AGCTAGCCTA, GCATCGAGGC, CTTA).]
     → In the gut, the MetaHIT catalogue = 10 million genes.
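The assembly step can be illustrated with a minimal sketch of a de Bruijn graph, applied to the two toy reads shown on the slide. The function names, the choice k = 4 and the greedy walk are assumptions made for illustration; real assemblers also handle sequencing errors, branching paths and reverse complements.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:  # stop on cycles
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Toy reads taken from the slide; k = 4 is an arbitrary illustrative choice.
reads = ["CGCAAT", "GCAATCG"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "CGC")
print(contig)
```

With these two overlapping reads the walk recovers the merged sequence CGCAATCG shown on the slide.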

  9. Compute metagenomic abundances in a biological sample
     [Figure: reads from a biological sample (AGGCTGCCA, GCCATTCAGTCA, GCAGGCTA, ...) → reads "mapped" on the catalogue (10 M genes) → gene counts = # reads mapped → matrix of abundances $A_{i,j}$ (n samples × 10 M genes)]
     $$\text{Abundance of gene } g = \frac{\text{counts of gene } g}{(\text{length of gene } g) \times (\#\,\text{reads mapped})}$$
     Characteristics of the data:
     - High technical variability
     - Very large dimension: log(p) > n
     - In the gut, 200-500,000 genes present in each sample: high sparsity
     Dimension reduction:
     - Grouping of genes based on sequence (similarity between proteins translated in silico): COG (Cluster of Orthologous Genes) → functional grouping.
     - MGS (MetaGenomics Species): grouping by covariance of abundances.
     - Gene annotation (KEGG): bank of genes whose function has been identified → limited to known bacterial genes.
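As a minimal sketch, the abundance formula on this slide can be read as the count of gene g divided by (length of gene g × number of mapped reads). The function name and the toy numbers below are hypothetical, and the zero-read guard is an added assumption.

```python
def gene_abundances(counts, gene_lengths):
    """Normalise raw gene counts by gene length and by sequencing depth.

    counts[g]       : number of reads mapped on gene g in one sample
    gene_lengths[g] : length of gene g in nucleotides
    """
    total_mapped = sum(counts)
    if total_mapped == 0:  # assumption: a sample with no mapped read gets all-zero abundances
        return [0.0 for _ in counts]
    return [c / (length * total_mapped) for c, length in zip(counts, gene_lengths)]


# Hypothetical sample with three genes.
counts = [30, 0, 10]
lengths = [1500, 900, 500]
abundances = gene_abundances(counts, lengths)
print(abundances)
```

Here the first and third genes end up with the same abundance: the first has three times more reads but is also three times longer.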

  10. 16S metagenomics data
     - 16S: gene characteristic of the species.
     - Data: matrix of abundances of bacterial species (100/1000 variables).
     - Phylogenetic tree: tree that represents evolutionary relationships between species.
       → Built from distances between the nucleotide sequences of 16S genes.
       → Structure in the variables.

  11. Comparison 16S/WGS
     16S:
     - Less expensive
     - More widely used (⇒ more specific statistical methods)
     - Less technical variability
     - Ecology issues: species present/absent in given conditions, co-presence...
     WGS:
     - Large number of variables
     - High technical variability
     - Functional analysis
     Controversy: phylogenetic grouping corresponds approximately to functional grouping.

  12. To sum up, metagenomics data are:
     - of large/very large dimension
     - (very) noisy
     - highly correlated
     - sparse
     - potentially structured

  13. Other meta-omics data
     Meta-transcriptomics: similar to metagenomics.
     Meta-proteomics and metabolomics: technologies similar to omics (GC-MS, MS-MS):
     - Fragmentation of molecules (metabolites/proteins) into fragments (ions/peptides)
     - Identification of fragments by their m/z spectra, compared to a bank of peptides/ions
     - Recovery of molecule abundances
     Difficulty: identification requires alignment, which is more difficult for molecules present in few biological samples.

  15. General biological issues
     Ecology: description of the species present in the environment.
     - Differences between conditions (e.g. comparison of soil samples from different geographic areas)
     - Co-presence of species
     Functionality: how does the microbiota work?
     - Interactions between bacteria
     - Link between the microbiota and phenotypes/omics data
     → Related statistical questions may be imprecise.

  16. Usual statistical approaches
     - Multiple testing (differential analysis):
       - zero-inflated parametric models
       - permutation tests [White et al., PLoS Comput. Biol. 2009]
     - Mixed models (multiple time points) [Le Cao et al 2015]:
       $$X_i^j(t) = \underbrace{f^j(t)}_{\text{time effect: splines}} + \underbrace{\alpha_i^j + \beta_i^j t}_{\text{random individual effect}} + \varepsilon_{i,j}(t)$$
     - Adaptation of multivariate analysis methods:
       - Centered Log-Ratio transformation + methods based on correlation (PLS...)
       - Variance decomposition (multi-site measurements)
       - Methods based on distance matrices
       - Penalisation constraining structure based on phylogenetic trees [Chen 2012]
     - Variable selection by sparse multivariate methods
     - Bi-clustering: Non-negative Matrix Factorization
     - Network inference: GGM
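The Centered Log-Ratio transformation listed above can be sketched as follows: each value is log-transformed and the mean of the logs (the log of the geometric mean) is subtracted. The pseudocount added to handle zero counts is an assumption, not something stated on the slide, and the function name is illustrative.

```python
import math


def clr(abundances, pseudocount=0.5):
    """Centered log-ratio transform of one sample.

    `pseudocount` (an assumed convention) is added to every value so that
    zero abundances, frequent in sparse metagenomic data, have a finite log.
    """
    logs = [math.log(x + pseudocount) for x in abundances]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


sample = [120, 0, 30, 850]  # hypothetical counts for four species
transformed = clr(sample)
print(transformed)
```

By construction the transformed values of a sample sum to zero, so standard correlation-based methods such as PLS are applied to relative rather than absolute abundances.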

  17. Example of analysis based on distance matrices
     Goal: test the effect of breed on the rumen microbiota of cows.
     Data:
     - $(X_{u,k})$, $u = 1, \dots, N$, $k = 1, \dots, p$: 16S measurements of the abundances of p bacterial species for N cows
     - $Y_u \in \{1, \dots, a\}$: breeds
     - "ANOVA" notation $X_{i,j,k}$:
       - $i = 1, \dots, a$: category (breed)
       - $j = 1, \dots, n$: repetition (cow)
       - $k = 1, \dots, p$: variable (species)

  19. Example of analysis based on distance matrices (continued)
     UniFrac distance, based on phylogeny, between two 16S samples.
     [Figure: phylogenetic tree with nodes marked o (sample 1) or x (sample 2); legend: shared edges, unshared edges]
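A minimal sketch of the unweighted UniFrac idea on a toy tree: the distance is the branch length of the unshared edges (leading to species of only one sample) divided by the branch length of all edges covered by either sample. The tree encoding, the function names and the choice to count edges up to the root of the toy tree are assumptions for illustration, not the speaker's implementation.

```python
def edges_to_root(taxa, parent):
    """Collect the edges (each named by its child node) linking `taxa` to the root."""
    edges = set()
    for node in taxa:
        # Climb until the root, or until an edge already collected
        # (its ancestors were then collected as well).
        while node in parent and node not in edges:
            edges.add(node)
            node = parent[node][0]
    return edges


def unweighted_unifrac(sample1, sample2, parent):
    """Unshared branch length divided by the branch length covered by either sample."""
    e1 = edges_to_root(sample1, parent)
    e2 = edges_to_root(sample2, parent)

    def total_length(edges):
        return sum(parent[e][1] for e in edges)

    covered = total_length(e1 | e2)
    if covered == 0:
        return 0.0
    return total_length(e1 ^ e2) / covered


# Hypothetical toy tree: child -> (parent, branch length).
toy_tree = {
    "A": ("root", 1.0), "B": ("root", 1.0),
    "s1": ("A", 0.5), "s2": ("A", 0.5),
    "s3": ("B", 1.0), "s4": ("B", 1.0),
}
distance = unweighted_unifrac({"s1", "s2", "s3"}, {"s2", "s3", "s4"}, toy_tree)
print(distance)
```

In this toy case only the edges leading to s1 and s4 are unshared (length 1.5 out of a covered length of 5.0). Two identical samples give 0, and two samples with no edge in common give 1.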
