Discovery in heterogeneous biological networks: the AdaLab experience. Céline Rouveirol, LIPN
The Reproducibility Crisis ■ One of the most important current issues in biology is ‘The reproducibility crisis’ - Billions of euros wasted. ■ ‘There is growing alarm about results that cannot be reproduced. Explanations include increased levels of scrutiny, complexity of experiments and statistics, and pressures on researchers.’ Nature 2016. ■ We require Automation to ensure reproducibility. Credit : Ross King
The Concept of a Robot Scientist Computer systems capable of originating their own experiments, physically executing them, interpreting the results, and then repeating the cycle. [Figure labels: Background Knowledge, Analysis, Hypothesis Formation, Experiment selection, Robot, Results Interpretation, Final Theory] Journées BIOSS-IA - 22 June 2017 Credit : Ross King
Eve: Robot Scientist Credit : Ross King
Scientific Goals ■ To make scientific research more efficient: cheaper, faster, better. ■ Our vision is that within 10 years many scientific discoveries will be made by teams of human and robot scientists. ■ This collaboration will produce scientific knowledge more efficiently than either could alone. Credit : Ross King
Scientific Goals ■ A framework for semi-automated and automated knowledge discovery by teams of human and robot scientists. ■ Integrating advances in knowledge representation, ontology engineering, semantic technologies, machine learning, bioinformatics, and automated experimentation. Credit : Ross King
The Diauxic Shift ■ Yeast (S. cerevisiae). ■ First turn sugar into ethanol. ■ Then turn ethanol into CO2. ■ Cancer ■ Ageing Credit : Ross King
The Diauxic Shift [Figure: typical culture-density profile of a fermentative batch culture of S. cerevisiae] Diauxic shift: when yeast (Saccharomyces cerevisiae) is grown on glucose with oxygen it first produces ethanol, and when the glucose is exhausted it reorganises (shifts) its metabolism to grow using the ethanol it previously produced (Dickinson, 1999). Credit : Daniel Trejo
Key Challenges ■ The AdaLab system needs to be: – autonomous and perceptive to human requirements (its scientific collaborators). – able to continuously learn, adapt and improve in the “real world” complex environment of scientific research. – capable of continuous cycles of scientific hypothesis formation and experimentation that will improve its scientific knowledge (models) . Credit : Ross King
AdaLab Structure Credit : Ross King
Machine Learning objectives ■ Context – Knowledge intensive – Scarce data – Limited experiment panel (gene knock out, growth curve) ■ Learning probabilistic graphical models from scarce data ■ Model revision from partially observed data ■ Experiment design
Objectives Data collection ■ Collection of bioinformatic data about the yeast diauxic shift. ■ Development of an integrated metabolic and gene signalling network Simulation ■ Development of simulation tools, including both regulatory and metabolic model simulation ■ Phenotype predictions from genotype using the integrated model.
Diauxic Shift model Geistlinger et al. A comprehensive gene regulatory network for the diauxic shift in Saccharomyces cerevisiae. Nucleic Acids Res. 2013 • 100 genes, of which 68 are transcription factors • 322 interactions in total, of which 212 are proven • 1133 annotations from 410 articles Credit : Daniel Trejo
Metabolic network models Credit : Daniel Trejo
Gene expression data ■ M3D (Faith et al. 2008): 247 experiments and 5520 probes. ■ DeRisi et al. (1997), diauxic shift, 7 time points. Samples were taken at 0 hr, 9.5 hr, 11.5 hr, 13.5 hr, 15.5 hr, 18.5 hr and 20.5 hr. ■ Brauer et al. (2005), diauxic shift, 14 time points, 1 chemostat (steady state). Samples taken at 7.25 hr, 7.5 hr, 7.75 hr, 8 hr, 8.25 hr, 8.5 hr, 8.75 hr, 9 hr, 9.25 hr, 9.5 hr, 9.75 hr, 10 hr. Credit : Daniel Trejo
Network completion from GE data ■ A « core » state-of-the-art network model (regulation + metabolic) is already available – Zimmer is not perfect (false negatives) – Need to get a confidence over Zimmer edges and quantify them – Need to identify new likely edges/nodes (some known influential genes do not occur in this model) ■ (Few) public dynamic GE datasets are available – Noisy, high #genes/#observations ratio (« fat » data)
Variable selection : Influential Transcription Factors [Figure: density plot; x-axis: Target gene expression; y-axis: Density; legend: Activated targets, Repressed targets; labels: Active, Inactive] $\dfrac{\bar{X}_a - \bar{X}_r}{\sqrt{\dfrac{s_a^2}{n_a} + \dfrac{s_r^2}{n_r}}}$ where $\bar{X}$: mean, $s^2$: variance, $n$: size, $a$: activated, $r$: repressed. Tool availability: CoRegNet. Nicolle et al., CoRegNet: reconstruction and integrated analysis of co-regulatory networks, Bioinformatics, 2015. Elati et al., LICORN: learning co-operative regulation networks from expression data, Bioinformatics, 23:2407-2414, 2007
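The legend on this slide (mean, variance and size of the activated and repressed target sets) reads like Welch's t statistic comparing the two groups of targets of one transcription factor. A minimal sketch under that assumption follows; the function name `influence` and the input lists are illustrative and are not taken from CoRegNet.

```python
from statistics import mean, variance


def influence(activated, repressed):
    """Welch-style t statistic between activated and repressed target expression.

    Both arguments are lists of expression values (at least two each) measured
    in one sample. Assumption: this mirrors the slide's formula; it is not the
    CoRegNet implementation.
    """
    numerator = mean(activated) - mean(repressed)
    # variance() is the sample variance s^2 (n - 1 in the denominator).
    denominator = (variance(activated) / len(activated)
                   + variance(repressed) / len(repressed)) ** 0.5
    return numerator / denominator


score = influence([2.0, 4.0], [1.0, 3.0])
```

A large positive score suggests the regulator is active in that sample (activated targets high, repressed targets low); a large negative score suggests it is inactive. The sketch raises `ZeroDivisionError` when both groups have zero variance.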
Learning algorithm for "fat" data (LIPN) ■ Proposal: model averaging over multiple spanning arborescences – Biased components, to offset "data fragmentation" of fat data – Introduction of model diversity to get better results • Data perturbation + Edge sampling – On regulatory structure only [Figure labels: Dataset, Resample, Weighted Digraphs, Partial Arborescences, Edge ranking (A->B 1, B->C 1, A->C 0.5, …), Threshold, Consensual model]
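The resample / rank / threshold idea on this slide can be sketched as follows. This is only an illustration: absolute Pearson correlation stands in for the edge score, and a greedy Prim-style growth from a fixed root stands in for the arborescence learner. Both are assumptions, as are all function and parameter names (`consensus_edges`, `sample_ratio`, `edge_keep`).

```python
import random
from collections import Counter


def pearson(x, y):
    """Pearson correlation; returns 0.0 when one series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def partial_arborescence(data, genes, root, edge_keep, rng):
    """Grow a directed tree from root over a random subset of candidate edges."""
    in_tree, remaining, edges = [root], [g for g in genes if g != root], []
    while remaining:
        # Edge sampling: each candidate edge is kept with probability edge_keep.
        candidates = [(abs(pearson(data[p], data[c])), p, c)
                      for p in in_tree for c in remaining
                      if rng.random() < edge_keep]
        if not candidates:
            break  # nothing sampled this round: the arborescence stays partial
        _, parent, child = max(candidates)
        edges.append((parent, child))
        in_tree.append(child)
        remaining.remove(child)
    return edges


def consensus_edges(data, root, n_models=200, sample_ratio=0.8,
                    edge_keep=0.7, threshold=0.5, seed=0):
    """Average many perturbed models; keep edges seen in >= threshold of them."""
    rng = random.Random(seed)
    genes = sorted(data)
    n_obs = len(data[root])
    size = max(2, int(sample_ratio * n_obs))
    counts = Counter()
    for _ in range(n_models):
        # Data perturbation: resample observations with replacement.
        idx = [rng.randrange(n_obs) for _ in range(size)]
        resampled = {g: [data[g][i] for i in idx] for g in genes}
        counts.update(partial_arborescence(resampled, genes, root, edge_keep, rng))
    ranking = sorted(((n / n_models, e) for e, n in counts.items()), reverse=True)
    return [(edge, weight) for weight, edge in ranking if weight >= threshold]
```

Here `data` maps each gene name to its list of expression values. In this sketch the two sampling ratios are the diversity parameters: lowering `sample_ratio` or `edge_keep` makes the individual models differ more from one another.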
Learning algorithm for "fat" gene expression data ■ Results: Tested on DREAM 8 Challenge, with promising results (top 3) ■ Digest: high impact of diversity parameters (sampling ratios) Coutant et al, Jobim 2017 and MLSB 2017
AdaLab results ■ Network inference over subsets of the yeast genes – Subset of genes given by Evry university, from CoRegNet results – Learning on different node sets and different data settings – With or without prior information (e.g. Zimmer edges)
Simulation in AdaLab ■ GRN model: Gaussian linear Bayesian network
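In a Gaussian linear Bayesian network, each gene is normally distributed around a linear function of its parents, so the network can be simulated by sampling genes in topological order. The sketch below assumes that form. The network encoding, the gene names and the choice to model a knock-out by clamping the gene to 0 are illustrative assumptions, not the AdaLab simulator.

```python
import random


def simulate(network, order, knockouts=(), rng=None):
    """Forward-sample one state of a Gaussian linear Bayesian network.

    network: gene -> (intercept, {parent: weight}, noise_sd)
    order:   genes in topological order (parents before children)
    knockouts: genes clamped to 0.0 (assumed model of a gene KO)
    """
    rng = rng or random.Random(0)
    state = {}
    for gene in order:
        if gene in knockouts:
            state[gene] = 0.0
            continue
        intercept, weights, noise_sd = network[gene]
        mean = intercept + sum(w * state[p] for p, w in weights.items())
        state[gene] = rng.gauss(mean, noise_sd)
    return state


# Hypothetical two-gene network: TF activates T; noise is set to 0 for clarity.
toy = {"TF": (1.0, {}, 0.0), "T": (0.5, {"TF": 2.0}, 0.0)}
wild_type = simulate(toy, ["TF", "T"])
knocked_out = simulate(toy, ["TF", "T"], knockouts={"TF"})
```

With the noise set to zero the toy network is deterministic, which makes the effect of the knock-out on the downstream gene easy to read off.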
Simulation in AdaLab ■ Possibility to make predictions from the model – Exporting to AdaLab simulator format – Directly interfacing with CoRegFlux from Evry university [Figure panels: Glucose, Ethanol, Biomass]
Ongoing work
Model revision § GRN only § Experiments : gene KO -> growth curves § Gene states over time are not observed : rely on simulation / inference - Infer partial gene states consistent with observed growth curves (backward simulation) - Forward GRN simulation given KO - A gene is inconsistent if its forward and backward simulated states « disagree »
Model revision § Ranking nodes with respect to their observed inconsistency (taking into account each node's neighborhood in the model) § Candidate revisions : modifying the Markov blanket of highly ranked nodes (classic rewires : adding a link, deleting a link, inverting a link) § Simulating these updated models for KO experiments § Select the KO experiment for which those models most disagree
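The last step, picking the KO experiment on which the candidate models disagree most, can be sketched as follows. The disagreement measure (variance across models, summed over time points) and all names are assumptions made for illustration; the slides do not specify the measure used.

```python
from statistics import pvariance


def disagreement(curves):
    """Sum over time points of the variance of the predictions across models."""
    return sum(pvariance(point) for point in zip(*curves))


def select_knockout(models, knockouts):
    """Return the knock-out whose predicted growth curves differ most.

    models: callables mapping a knock-out name to a predicted growth curve
            (a list of values, same length for every model).
    """
    return max(knockouts,
               key=lambda ko: disagreement([model(ko) for model in models]))


# Two hypothetical candidate models that agree on g1 and disagree on g2.
model_a = {"g1": [1, 2, 3], "g2": [1, 2, 3]}.get
model_b = {"g1": [1, 2, 3], "g2": [1, 1, 1]}.get
best = select_knockout([model_a, model_b], ["g1", "g2"])
```

An experiment on which all candidate models agree cannot discriminate between them, so it scores zero and is never preferred over one with any disagreement.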
Growth -> Metabolic Genes : workflow [Figure labels: Original growth data; Metabolites over time; FBA with known growth; reactions bounds; (incomplete) values of GPR_i; GPR_i max / min expressions; Constraint & value propagation; FBA program; Partial metabolic gene inference] GPR_1 = max(gene1, min(gene2, gene3), … ) GPR_2 = min(gene4, gene2, gene1) …
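The max / min reading of GPR expressions lends itself to simple interval propagation: if min(...) is known to lie in [lo, hi], every argument is at least lo; if max(...) lies in [lo, hi], every argument is at most hi. The sketch below illustrates this on the two expressions from the slide, with the elided terms of GPR_1 left out. The tuple encoding, the function names and the numeric values are assumptions, not the AdaLab code.

```python
INF = float("inf")


def evaluate(expr, levels):
    """Forward: expr is a gene name or ("max" | "min", sub-expression, ...)."""
    if isinstance(expr, str):
        return levels[expr]
    op, *args = expr
    values = [evaluate(arg, levels) for arg in args]
    return max(values) if op == "max" else min(values)


def propagate(expr, lo, hi, bounds):
    """Backward: push the interval [lo, hi] known for expr down to the genes."""
    if isinstance(expr, str):
        cur_lo, cur_hi = bounds.get(expr, (-INF, INF))
        bounds[expr] = (max(cur_lo, lo), min(cur_hi, hi))
        return bounds
    op, *args = expr
    for arg in args:
        if op == "min":
            propagate(arg, lo, INF, bounds)   # every argument >= lo
        else:
            propagate(arg, -INF, hi, bounds)  # every argument <= hi
    return bounds


gpr_1 = ("max", "gene1", ("min", "gene2", "gene3"))
gpr_2 = ("min", "gene4", "gene2", "gene1")

# Hypothetical GPR values, e.g. as derived from flux bounds.
bounds = {}
propagate(gpr_1, 0.5, 0.5, bounds)
propagate(gpr_2, 0.2, 0.2, bounds)
```

Only some genes end up with finite bounds on both sides (here `gene1`), which is consistent with the slide describing the inference as partial.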
Growth curves -> Metabolic Genes : current results [Figure panel: Partial metabolic gene inference. Red : WITHOUT metabolite data; Blue : WITH metabolite data] [Figure panel: Growth curve reconstruction quality. Green & Blue : original curves; Red : inferred curve]
AdaLab participants ■ Université Paris Nord : Dominique Bouthinon Coutant, Guillaume Santini, Henry Soldano ■ Université d’Evry : Mohamed Elati, Daniel Trejo ■ Katholieke Universiteit Leuven / INRIA Lille : Jan Ramon ■ Manchester University : Martin Carpenter, Ross King, Katherine Ropper ■ Brunel University : Jacek Grzebyta, Larisa Soldatova