Implementing phylogenetic workflows for comparative genomics using BioPerl
Jason Stajich, University of California, Berkeley, USA, jason_stajich@berkeley.edu
Albert Vilella, European Bioinformatics Institute, Hinxton, UK, avilella@gmail.com
July 21, 2007
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Introduction to phylogenetics workflows
Introduction
[Figure: workflow overview. Genomes of Species A, B, and C feed an all-vs-all sequence similarity search, which yields shared gene families and orthologous genes; gene trees built from each support analysis of lineage-specific expansions/contractions and selection analysis.]
Outline
• Research questions in Comparative Genomics
  – Automated orthologous and paralogous gene identification
  – Sequence evolution: adaptive, constrained, and neutral
  – Gene family evolution: lineage-specific changes
• Tools for comparative genomics
  – Sequence similarity & gene family clustering
  – Multiple sequence alignment
  – Phylogenetics
  – Molecular evolution
• BioPerl for building pipelines
  – Data conversion
  – Running external applications
  – Processing results
Comparative Genomics
• Comparisons to study the evolutionary history of genomes
• Identify commonalities and differences between genomes
• Orthologous and unique genes among species
• Paralogous gene families
• Use similarity search and alignment tools to identify homologs
• Use phylogenetic approaches to reconstruct evolutionary history
Principles of molecular evolution
• Sequences that share significant similarity are likely homologous
• Homologous sequences often have the same function
• Identification of sequence differences and similarities can suggest regions with new or conserved functions
• Models of sequence evolution allow inference of rates of evolution
• Comparison of multiple genes and genomes can identify sequences evolving at significantly different rates
• Sequences or regions with different rates may be under different selective constraints, which can suggest innovation or relaxation of selective pressure
Detecting selection between species
• For aligned orthologous genes
• Using codon-based methods, identify where the rate of nonsynonymous change (K_A) is faster than the rate of synonymous change (K_S).
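As a rough illustration of the distinction between synonymous and nonsynonymous change, the sketch below (plain Python, standard genetic code) counts codon differences of each kind between two aligned, in-frame coding sequences. It is only a raw count under stated assumptions: real K_A and K_S estimates also normalize by the number of synonymous and nonsynonymous sites and correct for multiple substitutions, which dedicated codon-based programs handle. The function name `count_codon_differences` and the example sequences are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_codon_differences(seq1, seq2):
    """Count differing codons that keep (synonymous) or change
    (nonsynonymous) the encoded amino acid.

    Both sequences must be aligned, in frame, and of equal length.
    Codons containing gaps or ambiguous bases are skipped.
    """
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    synonymous = nonsynonymous = 0
    for pos in range(0, len(seq1), 3):
        codon1 = seq1[pos:pos + 3].upper()
        codon2 = seq2[pos:pos + 3].upper()
        if codon1 == codon2:
            continue
        if codon1 not in CODON_TABLE or codon2 not in CODON_TABLE:
            continue  # gap or ambiguous base: not counted
        if CODON_TABLE[codon1] == CODON_TABLE[codon2]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Hypothetical aligned coding sequences (four codons each).
syn, nonsyn = count_codon_differences("ATGAAACCCGAT", "ATGAAGCCAGAA")
print(syn, nonsyn)
```

A whole codon is classified as one difference here, even when more than one of its bases differs; methods that estimate rates treat such codons by considering the possible mutational pathways.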
Gene family evolution
• Changes in family content can be powerful for understanding species differences
  – 6% difference in gene complement between humans and chimpanzees (Demuth et al, PLoS One 2006)
  – Hydrophobin expansion in basidiomycete mushrooms
  – C. elegans chemoreceptor family expansions (Chen et al, PNAS 2006)
  – Purine salvage enzyme HPRT1 family in vertebrates (Keebaugh et al, Genomics 2007)
  – Odorant receptor loss associated with gain of trichromatic vision in primates (Gilad et al, PLoS Biology 2004)
Local expansion of chemoreceptor genes in C. elegans
[Figure: map of chemoreceptor gene clusters on C. elegans Chromosome V aligned against C. briggsae supercontig cb25.fpc4263. Chen et al, PNAS 2006; 102(1):146-151.]
Tree of Hydrophobins in 3 fungi
[Figure: phylogenetic tree of hydrophobin genes labeled with umay, ccin, and pchr gene identifiers; several ccin and pchr clades are marked as local duplications. Scale bar: 0.1.]
Hydrophobin expansion driven by local duplications
[Figure: two panels, P. chrysosporium and C. cinereus.]
Definitions for sequence relationships
• Homology - Similar sequences that share a common ancestor.
• Orthology - Similar sequences that descended from a common ancestor through speciation events.
• Paralogy - Similar sequences that arose through a duplication event within a species lineage.
• Sequences are generally considered similar if they share at least 30% identity at the amino acid level.
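The 30% identity rule of thumb can be checked on a pairwise alignment with a few lines of code. The sketch below, in plain Python, assumes two already-aligned amino acid strings of equal length and computes identity over columns where neither sequence has a gap. Other denominators (full alignment length, shorter sequence length) are also in common use and give different values, so the choice here is an assumption. The function name `percent_identity` and the example sequences are hypothetical.

```python
def percent_identity(aligned1, aligned2, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (x, y) for x, y in zip(aligned1, aligned2) if x != gap and y != gap
    ]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Hypothetical aligned peptides: 4 identical residues in 5 ungapped columns.
identity = percent_identity("MKT-LV", "MRTALV")
print(identity, identity >= 30.0)
```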
Species Tree and Gene Tree
[Figure: species tree compared with gene tree, from Li C, Orti G, Zhang G, Lu G. BMC Evol Biology 2007; 7:44.]
Gene tree/Species tree reconciliation
• Parsimony
  – For each node in the tree, identify whether it arose via duplication or speciation, minimizing the number of duplication events.
• Maximum Likelihood and Bayesian frameworks
  – Maximize the likelihood of the data given the gene tree and species tree, inserting branches on the gene tree to represent losses and gains.
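The parsimony idea can be sketched with the classic last-common-ancestor (LCA) mapping: each gene tree node is mapped to the smallest species tree clade containing all species found beneath it, and a node is called a duplication when it maps to the same clade as one of its children. The Python sketch below is a minimal illustration, not the algorithm of any specific tool listed in this tutorial; it assumes trees written as nested tuples, gene names of the hypothetical form `Species_N`, and it does not infer gene losses.

```python
def species_clades(tree, out):
    """Collect every clade (as a frozenset of species) of a nested-tuple tree."""
    if isinstance(tree, str):
        clade = frozenset([tree])
    else:
        clade = frozenset().union(*(species_clades(child, out) for child in tree))
    out.append(clade)
    return clade


def lca(clades, species):
    """Smallest species tree clade that contains all the given species."""
    return min((c for c in clades if species <= c), key=len)


def reconcile(gene_tree, clades, species_of, events):
    """Map a gene tree node to a species clade and label internal nodes.

    Appends (node, "duplication" | "speciation") to `events` in post-order.
    """
    if isinstance(gene_tree, str):
        return frozenset([species_of(gene_tree)])
    child_maps = [reconcile(c, clades, species_of, events) for c in gene_tree]
    mapped = lca(clades, frozenset().union(*child_maps))
    # A node that maps to the same clade as a child implies a duplication.
    kind = "duplication" if mapped in child_maps else "speciation"
    events.append((gene_tree, kind))
    return mapped


# Hypothetical species tree and gene tree.
species_tree = (("Hsap", "Mmus"), "Ggal")
gene_tree = ((("Hsap_1", "Mmus_1"), ("Hsap_2", "Mmus_2")), "Ggal_1")

clades = []
species_clades(species_tree, clades)
events = []
reconcile(gene_tree, clades, lambda gene: gene.split("_")[0], events)
for node, kind in events:
    print(kind, node)
```

In this example the node joining the two human/mouse pairs is labeled a duplication, because both of its children already span the human/mouse clade; the remaining internal nodes are speciations.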
Orthology and Paralogy types
[Figure: reconciled gene tree for Homo sapiens (Hsap) and Mus musculus (Mmus) genes within Euarchontoglires, with duplication nodes and speciation nodes marked. Relationship types labeled on the tree:
• ortholog_one2one - Hsap1:Mmus1
• ortholog_one2many - Hsap3:Mmus3, Hsap3:Mmus3'
• ortholog_many2many - Hsap2:Mmus2, Hsap2:Mmus2', Hsap2':Mmus2, Hsap2':Mmus2'
• within_species_paralog - Hsap2:Hsap2', Mmus2:Mmus2', Mmus2':Mmus3'
• between_species_paralog - Mmus1:Hsap2
Pipeline stages shown: Multiple Sequence Alignment, Reconciled gene tree, Homology Inference.]
Paralogous family creation through duplication
• Duplication may be a substrate for novel function (Ohno)
• Mechanisms of duplication
  – Unequal crossing-over during recombination
  – Retrotransposition
  – Translocations of large regions
• Different mechanisms will create different patterns of duplication
  – Members of a family are local and physically clustered
  – Family members are dispersed
  – Duplicated blocks of genes
Paralogous gene relationship and inference
[Figure: gene trees for Hsap, Mmus, and Rnor genes with duplication and speciation nodes marked. One tree shows ortholog_one2one relationships (Mmus:Rnor, Hsap:Mmus, Hsap:Rnor). Another shows a dubious duplication (species_intersection_score=0) followed by gene losses (Rnor', Hsap', Mmus'), giving apparent_ortholog_one2one relationships (Mmus:Rnor, Hsap:Rnor) alongside ortholog_one2one Hsap:Mmus.]
Software, Tools, & Data sources
Software and Tools
Software, Tools, and Data sources
• Inferring orthologous and paralogous genes
• Aligning sequences
• Phylogenetic inference and building trees
• Testing for selection
• Evaluating gene family size changes
• Data sources
Orthology Determination
• Best reciprocal hits (or Best Bi-Directional Hits)
• Refinements of BRH
  – InParanoid
  – OrthoMCL
• Tree-based
  – SDI & RIO (Zmasek and Eddy) [Parsimony]
  – Softparsmap (Berglund et al) [Parsimony]
  – Notung (Vernot, Goldman, and Durand) [ML]
  – RAP (Dufayard, Duret, and Rechenmann) [ML]
  – primetv (Arvestad, Berglund, Lagergren, and Sennblad) [Bayesian]
  – NJTREE (Li et al) [Parsimony/soft constraining]
Best Reciprocal hits
[Figure: best reciprocal hits between genes of two genomes.]
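A best reciprocal hit pair is a gene in genome A and a gene in genome B that are each other's top-scoring match. The sketch below, in plain Python, shows the idea on hit lists of the form (query, subject, score), such as could be parsed from BLAST or FASTA reports. The function names, gene names, and scores are hypothetical, and ties are resolved by keeping the first best hit seen, which is an assumption rather than a rule from any specific tool.

```python
def best_hits(hits):
    """Return {query: subject} keeping the highest-scoring subject per query."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical search results: (query, subject, bit score).
a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0), ("a2", "b2", 180.0),
          ("a3", "b2", 120.0)]
b_vs_a = [("b1", "a1", 245.0), ("b2", "a2", 175.0), ("b2", "a3", 110.0)]

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here `a3` has `b2` as its best hit, but `b2` prefers `a2`, so `a3` is left without an inferred ortholog.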
Gene Family Building
Using pairwise similarities from tools like BLAST and FASTA we can build gene family clusters.
• Single-Linkage - if A → B and B → C, then a cluster would be formed of A, B, C.
• Jaccard clustering - used at TIGR and Celera. Essentially single-linkage, but with an additional ability to prune things that are too far away.
• MCL (TRIBE) - map sequence similarity into distances on a graph and manipulate the graph to find stable clusters of genes in a family.
• hcluster_sg (TreeFam) - hierarchical clustering software for sparse graphs. Hierarchical clustering under mean distance.
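Single-linkage clustering amounts to finding connected components in the graph of pairwise similarities. The sketch below, in plain Python, uses a small union-find structure for this; it assumes the input pairs have already passed whatever similarity cutoff is in use, and the function name and gene names are hypothetical.

```python
def single_linkage(pairs):
    """Group genes into clusters: any chain of similar pairs joins one cluster."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path compression
            gene = parent[gene]
        return gene

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for gene in list(parent):
        clusters.setdefault(find(gene), set()).add(gene)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical similarity pairs: A-B and B-C chain into one family.
print(single_linkage([("A", "B"), ("B", "C"), ("D", "E")]))
```

Because a single link is enough to merge two groups, one spurious or multi-domain hit can fuse unrelated families, which is the weakness that pruning or graph-based methods such as those listed above aim to reduce.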
Multiple Alignments
Given clusters of homologous sequences, one can examine their evolutionary history through construction of a multiple sequence alignment.
• ClustalW - progressive multiple aligner
• MUSCLE - progressive multiple aligner with log-expectation score
• T-Coffee - progressive multiple aligner with high accuracy
• ProbCons - probabilistic consistency-based aligner
• MAFFT - aligner that uses the Fast Fourier Transform
Phylogenetic inference and Building Trees (1)
• Parsimony
  – PAUP* (aa or nt)
  – protpars, dnapars in PHYLIP (aa or nt)
  – LVB (nt)
  – TNT* (aa or nt)
• Distance based
  – protdist or dnadist + neighbor in PHYLIP (aa or nt)
  – BioNJ (aa or nt)
  – PAUP* (aa or nt)
  – NJTree (aa, codon, or nt)
* - Not freely available