Sophisticated models in Bio++
Julien Dutheil, Bastien Boussau
Birc, Aarhus; LBBE, Lyon
Friday, December 19th 2008
J. Dutheil, B. Boussau (Birc; LBBE) Models in Bio++ 19/12/08 1 / 13
Models of sequence evolution
[Figure: a tree with leaves a, b, c, and a model of substitution]
Models of substitution in Bio++
• for proteins and nucleic acids (codons: soon!)
• with a gamma law to account for evolutionary rate heterogeneities between sites
• possibility for a class of invariant sites
• possibility for covarion (heterotachous) models:
  • on-off models (Tuffley and Steel 1998)
  • change between rates of evolution (Galtier 2001)
Homogeneous and branch-heterogeneous models in Bio++
[Figure: two trees with leaves a, b, c; left: homogeneous model, right: heterogeneous model]
A simple model of substitution: Tamura's (1992)
• κ: transition/transversion ratio
• θ: equilibrium G+C content
Galtier and Gouy, Mol. Biol. Evol. 1998.
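The two parameters above are enough to write down the whole rate matrix of Tamura's model: the equilibrium frequencies are θ/2 for G and C and (1 − θ)/2 for A and T, and transitions occur κ times faster than transversions. The following is a minimal sketch in plain Python, not Bio++ code; the function name is hypothetical, and the matrix is left unnormalized (implementations typically rescale it so that branch lengths are in expected substitutions per site).

```python
def t92_rate_matrix(kappa, theta):
    """Build an unnormalized Tamura (1992) rate matrix.

    kappa: transition/transversion ratio
    theta: equilibrium G+C content, between 0 and 1
    Returns (q, freq), where q[i][j] is the rate from base i to base j
    and freq[j] is the equilibrium frequency of base j.
    """
    bases = "ACGT"
    freq = {
        "A": (1.0 - theta) / 2.0,
        "C": theta / 2.0,
        "G": theta / 2.0,
        "T": (1.0 - theta) / 2.0,
    }
    purines = {"A", "G"}
    q = {}
    for i in bases:
        row = {}
        for j in bases:
            if i == j:
                continue
            # A<->G and C<->T are transitions; everything else is a transversion.
            is_transition = (i in purines) == (j in purines)
            row[j] = (kappa if is_transition else 1.0) * freq[j]
        # Diagonal entry makes each row sum to zero.
        row[i] = -sum(row.values())
        q[i] = row
    return q, freq


# Illustrative values only: kappa = 2, G+C content = 60%.
q, freq = t92_rate_matrix(kappa=2.0, theta=0.6)
```

Raising θ raises the rates towards G and C and lowers the rates towards A and T, which is why a branch-specific θ is a natural handle on G+C-biased processes.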
Galtier and Gouy model of sequence evolution (1998)
[Figure: tree with leaves a, b, c, annotated with models and parameters]
• 1 model per branch
• each model is characterized by an equilibrium G+C content
Models in Bio++
General non-homogeneous model of substitution. In the homogeneous case, θ and κ are constant over the tree (case 'a'). In Galtier and Gouy's 1998 model, κ is constant over the tree and one distinct θ is allowed per branch (case 'b'). Between these two extremes lie models in which some branches, but not all, share a common value of θ (case 'c'). In the most general case 'd', there are two sets of parameters, one for κ and another for θ, each shared among the branches of the tree.
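One way to picture the four cases is as two maps from branches to parameter labels, one for κ and one for θ; branches with the same label share the same parameter value. The sketch below is plain Python on a hypothetical rooted tree with leaves a, b, c (four branches); the branch names, labels, and groupings are illustrative assumptions, and other parameters such as root frequencies are not counted.

```python
# Hypothetical rooted tree ((a, b), c): four branches, each named after
# the node it leads to ("ab" is the ancestor of a and b).
BRANCHES = ["a", "b", "c", "ab"]


def shared(label):
    """Every branch uses the same parameter."""
    return {branch: label for branch in BRANCHES}


def per_branch(prefix):
    """Every branch has its own parameter."""
    return {branch: f"{prefix}_{branch}" for branch in BRANCHES}


def count_parameters(kappa_of, theta_of):
    """Number of distinct kappa and theta parameters to estimate."""
    return len(set(kappa_of.values())) + len(set(theta_of.values()))


cases = {
    # 'a': homogeneous, one kappa and one theta for the whole tree.
    "a": (shared("kappa"), shared("theta")),
    # 'b': Galtier and Gouy (1998), one kappa, one theta per branch.
    "b": (shared("kappa"), per_branch("theta")),
    # 'c': one kappa, some branches share a theta (grouping is arbitrary here).
    "c": (shared("kappa"),
          {"a": "theta_1", "b": "theta_1", "ab": "theta_1", "c": "theta_2"}),
    # 'd': kappa and theta are each grouped over branches, independently.
    "d": ({"a": "kappa_1", "b": "kappa_2", "ab": "kappa_1", "c": "kappa_1"},
          {"a": "theta_1", "b": "theta_1", "ab": "theta_2", "c": "theta_2"}),
}

parameter_counts = {name: count_parameters(k, t) for name, (k, t) in cases.items()}
```

The difference in parameter counts between two nested cases is what feeds the degrees of freedom of a likelihood ratio test between them.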
Associating models to branches
Bio++ and BppSuite
BppSuite is a set of programs implementing various methods for the evolutionary study of sequences:
• BppDist: distance estimation and tree reconstruction
• BppPars: parsimony analyses
• BppML: ML reconstruction of phylogenetic trees, including using non-homogeneous models
• BppSeqGen: sequence simulation, including using non-homogeneous models
• BppAncestor: ancestral sequence reconstruction, including using non-homogeneous models
• BppSeqMan: sequence and alignment manipulation
• BppConsense: building of consensus trees
• BppPhySamp: select sequences according to a tree or a distance matrix
• BppReRoot: automatic re-rooting of trees
Specifying options of BppSuite programs
Launching an analysis with bppml
Example: bppml param=fichier.opt
fichier.opt:
    alphabet = DNA
    sequence.file = sequences.fasta
    sequence.format = Fasta
    sequence.sites_to_use = complete
    tree.file = tree.dnd
    etc...
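An option file like the one above is a list of key = value lines, so it is easy to inspect or generate from a script. The following is a minimal sketch in plain Python, not part of BppSuite; the function name is hypothetical, and it assumes that "#" starts a comment and that keys are written with dots and underscores as in sequence.sites_to_use.

```python
def parse_options(text):
    """Parse 'key = value' lines into a dict, skipping blanks and comments."""
    options = {}
    for raw_line in text.splitlines():
        # Assumption: anything after '#' is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"not a key = value line: {raw_line!r}")
        options[key.strip()] = value.strip()
    return options


EXAMPLE = """
alphabet = DNA
sequence.file = sequences.fasta
sequence.format = Fasta
sequence.sites_to_use = complete
tree.file = tree.dnd
"""

options = parse_options(EXAMPLE)
```

Keeping one such file per hypothesis makes it straightforward to check that two runs differ only in the model options being compared.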
Associating models to branches in BppSuite
Exercise
• THE DATA: A well-known scientist is working on a family of homologous genes (file "sequences.fasta"). These sequences come from closely related species and have been named according to their species of origin: S vulg, S con, S dio, S lat, S dic. In species S dio, S lat, and S dic, the gene is found on the sex chromosomes. For each of these species, the alignment thus contains two sequences, one from the X chromosome (X is appended to the name) and one from the Y chromosome (Y is appended to the name). The famous scientist has built a rooted phylogenetic tree relating all sequences in his dataset (file "tree.dnd").
• THE PROBLEM: The scientist suspects that Biased Gene Conversion (BGC) may have occurred on the branch leading to the group containing sequences S dioY, S latY, and S dicY. BGC is expected to increase the number of substitutions towards bases G and C. Your aim is to test for the presence of BGC on this branch.
Exercise
• THE AIMS:
  • Using BppML, devise a test to see whether the data reject BGC on this branch.
  • Meanwhile, try to accurately characterize the evolution in this dataset. Is there significant rate heterogeneity? Covarion-like evolution? How important has process heterogeneity been in the evolution of this dataset?
• THE METHOD:
  • Option files have been partially filled. You need to complete them to build a proper model for each hypothesis:
    • hypothesis 0 (model 0): there was no heterogeneity in the evolution of the dataset;
    • hypothesis 1 (model 1): there has been one significant change in the evolutionary process on one particular branch;
    • hypothesis 2 (model 2): the evolution has been globally heterogeneous, with different processes on different branches.
  • Play with the options to better characterize sequence evolution.
  • Use likelihood ratio tests to compare hypotheses.
• BONUS QUESTION:
  • Think of another way to test whether the evolutionary process has been particular on the branch of interest. BppSuite may be useful once again; you may need to do a little bit of programming.
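The likelihood ratio tests asked for above can be computed by hand from the log-likelihoods reported for two nested models. The following is a minimal sketch in plain Python, assuming the usual chi-square approximation for nested models; the function names are hypothetical and the log-likelihood values are made up for illustration, not results from this dataset.

```python
import math


def chi2_sf(stat, df):
    """Upper tail probability of a chi-square distribution, integer df >= 1."""
    if stat <= 0.0:
        return 1.0
    x = stat / 2.0
    # Closed-form starting point: df = 1 (odd case) or df = 2 (even case).
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(x))
    else:
        a, q = 1.0, math.exp(-x)
    # Recurrence on the regularized incomplete gamma function:
    # Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1).
    while a < df / 2.0:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (statistic, p-value) for a null model nested in an alternative.

    df is the number of additional free parameters in the alternative model.
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, df)


# Made-up log-likelihoods: model 0 (homogeneous) against model 1,
# assumed here to differ by a single extra parameter.
stat, p_value = likelihood_ratio_test(lnl_null=-1000.0, lnl_alt=-997.5, df=1)
```

The degrees of freedom depend on how the two models are actually parameterized in the option files, so they should be counted from the models being compared rather than taken from this example.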