Diversity in vivo, Multicore in silico : How to link metagenomics and community ecology Alain Franc & al. Agreenium, INRA and Université de Bordeaux, Toulouse Bordeaux, France October 2014
Plan
• Diversity and ecology
• Molecular systematics
• Discrete mathematics for molecular systematics
• Tools for discrete mathematics for molecular systematics
• Case study: Amazonian trees and dimensionality reduction
• Case study: diatoms and inventories through NGS
• Next future
DIVERSITY AND ECOLOGY
Some examples: Biodiversity and Applied Mathematics
Molecular imprint of evolution: discrete mathematics and statistical modelling
- Global alignment (very hard problem)
- Inferring large phylogenies (very hard problem)
- Coalescence models (technical, rich domain)
- Genetic distances and evolutionary distances
Donoghue & al., 2009
Ecological modelling: dynamical systems
- Community assembly
- Diffuse coevolution (geographical mosaic …)
A challenge: how to link and assemble those two modelling domains?
MOLECULAR SYSTEMATICS
Evolution
Few individuals, many traits: genome-wide cover. Many individuals, few DNA regions of interest.
DISCRETE MATHEMATICS FOR MOLECULAR SYSTEMATICS
Taxonomy on Edit distance Definition: The edit distance between two strings is the minimum number of edits needed to transform one string into the other, where the allowed edit operations are insertion, deletion, or substitution of a single character. kitten → sitten (substitution of 'k' with 's'); sitten → sittin (substitution of 'e' with 'i'); sittin → sitting (insertion of 'g' at the end).
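The definition above can be turned into a short dynamic-programming routine. This is a minimal sketch of the classical unit-cost (Levenshtein) recurrence, not necessarily the implementation used in the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    # previous[j] is the distance between the prefix of `a` processed so far
    # and the first j characters of `b`; it starts with the empty prefix.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[len(b)]


print(edit_distance("kitten", "sitting"))  # 3, the three edits of the slide
```

The cost is proportional to the product of the two string lengths, and only two rows of the table are kept in memory.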
A pint of methods … What we have: genetic distances between sequences. What we want: an evolutionary distance as branch length on a phylogenetic tree (time?). Tool for linking both: a graph for visualizing the network of pairwise distances according to a threshold
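One way to read "a graph according to a threshold" is to link two sequences whenever their pairwise distance does not exceed the threshold. The sketch below assumes that reading; the function name `threshold_graph` and the toy distance matrix are hypothetical.

```python
def threshold_graph(distances, threshold):
    """Adjacency dict linking i and j whenever distances[i][j] <= threshold."""
    n = len(distances)
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if distances[i][j] <= threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph


# Hypothetical pairwise distances between four sequences.
toy = [
    [0, 1, 2, 9],
    [1, 0, 1, 8],
    [2, 1, 0, 9],
    [9, 8, 9, 0],
]
graph = threshold_graph(toy, threshold=1)
print(graph)
```

With this toy matrix and a threshold of 1, sequence 3 stays isolated while sequences 0, 1 and 2 form a chain.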
Ultrametric distances A taxon is … a disc … a clique a clade
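A distance is ultrametric when d(x, z) ≤ max(d(x, y), d(y, z)) for every triple. Under that condition, all sequences within a given radius of each other form a disc, a clique of the threshold graph, and a clade at once. The sketch below only checks the inequality on a small matrix; the function name and both toy matrices are hypothetical.

```python
from itertools import permutations


def is_ultrametric(d, tol=1e-12):
    """True if d[x][z] <= max(d[x][y], d[y][z]) holds for every triple."""
    n = len(d)
    return all(
        d[x][z] <= max(d[x][y], d[y][z]) + tol
        for x, y, z in permutations(range(n), 3)
    )


# Distances read on a hypothetical tree ((A, B), C): A and B are close,
# and both are equally far from C.
tree_like = [
    [0, 2, 6],
    [2, 0, 6],
    [6, 6, 0],
]
# Same A-B distance, but A and C are too far apart given B.
not_tree_like = [
    [0, 2, 6],
    [2, 0, 3],
    [6, 3, 0],
]
print(is_ultrametric(tree_like), is_ultrametric(not_tree_like))
```

The check is cubic in the number of sequences, so it is meant for illustration on small matrices only.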
Graphs … From Wikipedia, graphs http://fr.wikipedia.org/wiki/Th%C3%A9orie_des_graphes
Two useful notions on graphs
- Clique: basis for phylogenetics; ultrametrics; finding: hard (NP complete)
- Connected component: basis for BLAST; finding: easy
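The contrast between the two notions shows up directly in code. The sketch below finds connected components with a linear-time traversal and enumerates maximal cliques with the Bron-Kerbosch recursion (without pivoting), which can take exponential time. The toy graph and function names are hypothetical, and these are textbook algorithms rather than the tools used in the talk.

```python
def connected_components(graph):
    """Depth-first traversal; visits every node and edge once."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def maximal_cliques(graph):
    """Bron-Kerbosch enumeration without pivoting (exponential worst case)."""
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(clique)  # nothing can extend it: it is maximal
            return
        for node in list(candidates):
            expand(clique | {node}, candidates & graph[node], excluded & graph[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(graph), set())
    return cliques


# Hypothetical graph: a triangle 0-1-2, a pendant node 3, an isolated node 4.
toy_graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}, 4: set()}
print(connected_components(toy_graph))
print(maximal_cliques(toy_graph))
```

On the toy graph, nodes 0 to 3 form a single connected component, but that component splits into two maximal cliques, {0, 1, 2} and {2, 3}.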
TOOLS FOR DISCRETE MATHEMATICS FOR MOLECULAR SYSTEMATICS
A couple of success stories … HPC and turbulence: astrophysics, climate change, Earth sciences, bioinformatics, etc. From HPC to distributed computing. Computational science. Tighter links with services. Science as a service for communities
What about Biology? Discrete mathematics on words and strings: from sequences to proteins. Genetics and random processes: population genetics, inferring phylogenies, … Ecology, systems biology, and ([very] large) dynamical systems. Bioinformatics. Integrative biology. Medicine. Agriculture. Environment. Response to climate change
An experiment with Turing (IDRIS)? Scaling: millions of reads, exact calculation, no heuristics (local alignment). Flows of several tens of TB per job. Towards diagnostics for community ecology
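"Exact calculation, no heuristics" for local alignment usually means a Smith-Waterman style dynamic program. The sketch below computes only the best local alignment score with a linear gap penalty. The scoring values (`match`, `mismatch`, `gap`) are assumptions for illustration, not the parameters used on Turing.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6: the shared "ACG"
```

Each pair costs time proportional to the product of the two read lengths, and the number of pairs grows quadratically with the number of reads, which gives an idea of the scale involved with millions of reads.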
Towards a shared, virtual Biodiversity Lab
Distinguish, mobilize, and unite three types of knowledge and skills
- Evolutionary biology and ecology
- Applied mathematics and statistical modelling
- Computer science and High Performance Computing
How does it work? Modularity and networking in workflows: one module, then a network of modules. Galaxy servers make it possible to implement this as soon as a command line launches a module
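The idea of a network of modules can be sketched as a dependency graph executed in topological order. This is only an illustration of the principle and does not use any Galaxy API; the module names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each module maps to the modules whose output it needs.
workflow = {
    "pairwise_distances": set(),
    "build_graph": {"pairwise_distances"},
    "connected_components": {"build_graph"},
    "statistics": {"connected_components"},
    "visualisation": {"build_graph", "statistics"},
}


def execution_order(dependencies):
    """Order in which modules can run so that every input is ready."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(workflow)
print(order)
```

In a real workflow each name would stand for a command line, and independent branches could run in parallel on different resources.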
Galaxy Workflows
Where is it possible to compute?
• Local Galaxy server
• Mesocentre (Tier 2): Avakas, 1000 cores
• Tier 1 (IDRIS, one pipeline, not via Galaxy)
• EGI GRID, France-Grille
• Cloud (ongoing, with UPV Valencia)
From a unique portal: the Galaxy instance
Where from? From any computer connected to the internet. Currently available from French Guiana (IP Cayenne works with it)
CASE STUDY: AMAZONIAN TREES AND DIMENSIONALITY REDUCTION
CASE STUDY: DIATOMS AND INVENTORIES THROUGH NGS
An example of taxonomic annotation from NGS: pairwise distances from local alignment → distance matrix → building a graph → selection of a barcoding gap → computing connected components and cliques → statistics on taxa and characters → visualisation
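The slide does not say how the barcoding gap is selected. One simple heuristic, assumed here purely for illustration, is to look for the widest gap in the sorted pairwise distances and place the threshold in its middle. The function name and toy matrix are hypothetical.

```python
def largest_gap_threshold(distances):
    """Midpoint of the widest gap between consecutive sorted pairwise distances."""
    n = len(distances)
    values = sorted(distances[i][j] for i in range(n) for j in range(i + 1, n))
    if len(values) < 2:
        raise ValueError("need at least two pairwise distances")
    gaps = [(high - low, low, high) for low, high in zip(values, values[1:])]
    _, low, high = max(gaps)
    return (low + high) / 2


# Hypothetical distances: sequences {0, 1} and {2, 3} form two tight groups.
toy = [
    [0, 1, 9, 10],
    [1, 0, 9, 10],
    [9, 9, 0, 2],
    [10, 10, 2, 0],
]
print(largest_gap_threshold(toy))  # 5.5, between the values 2 and 9
```

The resulting threshold can then be used to build the graph on which connected components and cliques are computed.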
Metabarcoding + NGS on diatom communities: cross validation, false-positives, false-negatives, abundances, taxonomic inventory, quality indices
Taxonomic inventories: microscopy vs. metabarcoding on a mock community (Kermarrec et al 2013)
- Microscopy: mock community
- Metabarcoding, rbcL / 454 / RSYST DB: 100% homology, 40 000 reads; 19/21 species found; false-negatives = 2 sp under 0.6%; 2 false-positives = taxonomy problem (Gomphonema sp complex)
- Metabarcoding, rbcL / PGM / RSYST DB: 99% homology, 54 000 reads; 17/21 species found; false-negatives: 3 sp under 1%, 1 sp at 1.9%; 3 false-positives = 1 to 5 reads
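The counts of false-negatives and false-positives come from comparing two species lists. A minimal sketch with set operations, using hypothetical species labels rather than the taxa of the mock community:

```python
def compare_inventories(reference, observed):
    """Compare a reference inventory with an observed one."""
    reference, observed = set(reference), set(observed)
    return {
        "recovered": sorted(reference & observed),
        "false_negatives": sorted(reference - observed),  # present but missed
        "false_positives": sorted(observed - reference),  # reported but absent
    }


# Hypothetical labels standing in for microscopy and metabarcoding lists.
microscopy = {"sp_A", "sp_B", "sp_C", "sp_D"}
metabarcoding = {"sp_A", "sp_B", "sp_E"}
result = compare_inventories(microscopy, metabarcoding)
print(result)
```

Here two of the four reference species are recovered, two are false-negatives, and one reported species is a false-positive.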
Taxonomic inventories: Lake Geneva, seasonal dynamics of benthic diatoms
- Monthly samplings during 1 year: 10 environmental samples (April 2012 to March 2013)
- Diatoms: scraped from 5 stones, 50 cm depth
- Microscopy
- Metabarcoding: rbcL / PGM / RSYST DB, 250 000 reads
NEXT FUTURE …
Molecular based taxonomy and systematics: An open route for (new) methods
Sequences known by pairwise distances: distance geometry, pattern recognition, machine learning
- Clustering
- Multidimensional Scaling, linear and nonlinear (e.g. Sammon, 1969)
- Manifold learning: IsoMap, EigenMap, etc.
- Graph based methods: spectral clustering
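Nonlinear multidimensional scaling in the sense of Sammon looks for low-dimensional points whose mutual distances stay close to the original ones, measured by a stress E = (1 / Σ d*ᵢⱼ) Σ (d*ᵢⱼ − dᵢⱼ)² / d*ᵢⱼ over pairs i < j, where d* is the original distance and d the distance in the embedding. The sketch below only evaluates this stress for a given embedding and does not perform the optimisation. Skipping pairs at distance zero is an assumption made here, since the formula is undefined for them.

```python
from math import dist


def sammon_stress(distances, points):
    """Sammon stress of an embedding `points` against a distance matrix."""
    n = len(distances)
    total = 0.0
    error = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            original = distances[i][j]
            if original == 0:
                continue  # assumption: pairs at distance zero are ignored
            embedded = dist(points[i], points[j])
            total += original
            error += (original - embedded) ** 2 / original
    if total == 0:
        raise ValueError("all pairwise distances are zero")
    return error / total


# Hypothetical distances that fit exactly on a line.
toy = [
    [0, 3, 7],
    [3, 0, 4],
    [7, 4, 0],
]
print(sammon_stress(toy, [(0,), (3,), (7,)]))  # 0.0: a perfect embedding
```

A stress of 0 means the embedding reproduces every distance; collapsing all points onto one location gives a stress of 1.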
Continuum of population differentiation: complete independence, modest connectivity, substantial connectivity, panmixia (subpopulations are completely congruent). Pattern recognition … After Waples and Gaggiotti, 2006, Molecular Ecology
Pattern and functions Biodiversity from populations to biomes +
Speculation: Assemblage and Scaling
Item: Number
- Atoms: 92
- Molecules: 10^6 ?
- Cell types: 3 × 10^2
- Organisms: 10^7
- Communities
Living systems: diversity …, assembly of heterogeneous parts, distributed systems
Distributed computing for distributed systems?
http://www.fractalforums.com/images-showcase-%28rate-my-fractal%29/the-lego-molecule/?PHPSESSID=00a24d7f4234586a8e5ba4dd9c82541b
One modelling goal: how to visualize/simulate large associations of small/large numbers of types with modular structures
Thanks to
Team: Yec’han Laizet, Jean-Marc Frigerion, Philippe Chaumeil
HPC: Pierre Gay (MCIA Bordeaux), Sylvie Thérond (IDRIS), Michel Daydé (e-Biothon), Vincent Breton (idGC, GIS FG)
(Meta)barcoding:
- LMGE, Clermont: Didier Debroas, Gisèle Bronner
- Carrtel, Thonon: Agnès Bouchez, Frédéric Rimet, Isabelle Domaizon
- AMAP: Jean-François Molino, Daniel Sabatier
- IP Cayenne: Benoit de Thoisy, Anne Lavergne, Sourakhata Tirera