Genomic Exploration of the Hemiascomycetous Yeasts



  1. Genomic Exploration of the Hemiascomycetous Yeasts. 3rd Workshop on Algorithms in Bioinformatics, Moscow, 2008-10-08. David J. Sherman, U. Bordeaux, France; LaBRI CNRS & INRIA team “MAGNOME”

  2. Comparative genomics is certainly about comparison, but is also about the genomes. Which ones do we sequence? And what do we do after that?

  3. A caricature. [Diagram: Data → Push button → Results, shown from a Biological and an Algorithmic point of view; the labels are “A solved problem”, “The hard part” and “otherwise not interesting”.]

  4. Hemiascomycetous yeasts
     • Eukaryotic genomes
     • Small and compact
     • Experimental model
     • Biotechnological interest: beer, wine, bread; assimilate hydrocarbons, tannin extracts; hormones and vaccines
     • Medical interest
     • Biodiversity
     • Systems
     Understand mechanisms of molecular evolution
     • Genome redundancy
     • Ortho-/paralog divergence
     • Expansion and contraction of universal families
     • Tandem duplications
     • Block duplication and rearrangement
     • Conservation of synteny

  5. Comparison of evolutionary range of Hemiascomycetes and Chordates. [Figure: two parallel trees on a common scale from 50 to 100. Hemiascomycetes: Yarrowia lipolytica, Debaryomyces hansenii, Kluyveromyces lactis, Candida glabrata, and Saccharomyces sensu stricto (Saccharomyces uvarum, Saccharomyces paradoxus, Saccharomyces cerevisiae). Chordates: Urochordates (Ciona intestinalis), Fishes (Takifugu rubripes, Tetraodon nigroviridis), Birds (Gallus gallus), Mammals (Mus musculus, Homo sapiens).] Scale: average % of amino-acid identity between complete sets of orthologous proteins. Dujon (2006) Trends in Genetics 22: 375-387

  6. Génolevures Sequencing Projects
     • Génolevures 1: 13 species, partial 0.2-0.4X; Souciet et al 2000 [21 papers] FEBS Letters 487
     • Génolevures 2: 4 species complete 12X; Dujon, Sherman et al 2004 Nature 430; Sherman et al 2006 NAR 34
     • Génolevures 3: 3 species complete 12X; 2 species complete 7-12X
     • Génolevures 4: 4 + 5 + 5 close species, NGS

  7. Nb of chromosomes and genome size (Mb)
     • Saccharomyces cerevisiae: 16 chromosomes, 12.1 Mb
     • Candida glabrata: 13 chromosomes, 12.3 Mb
     • Kluyveromyces waltii: 8 chromosomes, 10.7 Mb
     • Kluyveromyces lactis: 6 chromosomes, 10.6 Mb
     • Ashbya gossypii: 7 chromosomes, 9.2 Mb
     • Debaryomyces hansenii: 7 chromosomes, 12.2 Mb
     • Candida albicans: 8 chromosomes, 14.9 Mb
     • Yarrowia lipolytica: 6 chromosomes, 20.5 Mb
     [Figure: phylogenetic tree annotated with evolutionary events and retroelements: whole genome duplication; post-duplication gene loss; expansion of sugar-utilisation genes; loss of active Ty5; loss of GAL genes; loss of sex; loss of all active type I retroposons; triplicated mating-type cassettes; HO endonuclease; short centromeres; loss of class II transposons; degradation of HO; loss of HO; Ty5 and non-LTR retroposons; non universal genetic code; high rate of intron loss; expansion of gene families encoding lipases, extracellular proteases, allantoin and allantoate transporters etc.; Ty1 / 2, Ty3, Ty4, Tca2.]
     Dujon (2006) Trends in Genetics 22: 375-387

  8. Genomic data for complete genomes. Complete genomes sequenced by the Génoscope.
     What is complete?
     • Sequence subtelomere to subtelomere
     • Fully assembled chromosomes
     • Careful manual annotation
     What can you do with a complete sequence?
     • Track chromosomal rearrangements
     • Analyze species- or clade-specific gain or loss
     • Measure expansion and contraction of protein families
     • Look for long-range correlations

  9. What’s next? And what do we do after that?
     Genome annotation
     • Magus annotation system
     • Simultaneous annotation of putative homologs
     Classification into protein families
     • Consensus ensemble clustering
     Comparative maps
     • Discovering synteny
     • Identifying orthologs

  10. Let’s avoid teleology
     • Genomes are thrown together from bits and pieces of things that worked, once
     • Genome annotation has to reflect reality and not expectations
     • It is hard, painstaking work
     • It is not fully automatic
     • Good tools help
     R. Greaves

  11. The Annotation Process. [Flow diagram. Legend: Génolevures technique, Génolevures result, external technology, external data source. Elements: Genomic DNA; algorithmic sequence analysis; predictive methods; gene models; RNA genes and other elements; transcript sequencing; simultaneous gene annotation; protein-coding genes; classification; homolog groups; systematic comparison and consensus; protein families; integration; curation updates; curated genes; complementary analyses; annotated genome; Magus.]

  12. The “big iron”. [Architecture diagram. Recoverable labels:]
     • Dinkum-thinkum; Production
     • Components: alignments, rule checker, in silico predictions, U.I. components
     • 74 cores, 4 Gbyte; redundant, high disp.; DB search; rules; users
     • IBM, Dell servers, x86_64; Rocks + bio roll; 3 web, 1 database; mini-cluster
     • Web Service Bus (shown twice)
     • Software: HMMER, NCBI BLAST, ClustalW, EMBOSS, Glimmer, Fasta, MrBayes, Phylip, T_Coffee, MPI-Blast, GROMACS
     • Storage: 11 Tbyte RAID
     • Compute: GenCore 6
     • Genome Browser; fast browser database
     • Databases: genomes, results, KB

  13. Browsing a genome region

  14. Viewing a Locus on a Genome

  15. Validating a Gene Model

  16. Annotating Homolog Groups

  17. Protein families
     • Multi-species groups of related proteins
     • Phylogenetic relationship → functional similarity
     • Diversity of in silico results
     • Need to calibrate or train methods for different phylogenetic groups
     • New algorithm for consensus clustering that is efficient in practice

  18. What’s the goal? [Diagram: complete genomes are compared with Blast and Smith-Waterman, using E-val thresholds and homeomorphy; each setting yields a partition ∏1, ∏2, ∏3, ∏4, …, ∏n; the target is protein families.]
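One way to picture how a single setting (one alignment method, one E-value threshold) turns into a partition is single-linkage grouping of the hits. The sketch below is only an illustration under that assumption, not the method used in the talk; the function name `partition_by_threshold` and the `(protein_a, protein_b, evalue)` hit format are hypothetical.

```python
from collections import defaultdict


def partition_by_threshold(proteins, hits, max_evalue):
    """Group proteins linked, directly or transitively, by hits with E-value <= max_evalue.

    proteins: iterable of protein identifiers
    hits: iterable of (protein_a, protein_b, evalue) tuples (hypothetical format)
    Returns a list of frozensets, one per group, in a deterministic order.
    """
    parent = {p: p for p in proteins}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b, evalue in hits:
        if evalue <= max_evalue:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for p in parent:
        groups[find(p)].add(p)
    return sorted((frozenset(g) for g in groups.values()), key=lambda g: sorted(g))
```

Running the same hits with a stricter or looser `max_evalue` gives different partitions, which is the situation the following slides set out to reconcile.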

  19. Reconciling different in silico predictions. [Diagram: proteomes → Blast & SW sequence alignments → homeomorphic and nonhomeomorphic alignments → several partitions.]
     Agreement between partitions
     • Confusion matrix
     • Distance between partitions, that is, a shortest path in a graph of fusions/fissions: NP-complete
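As an illustration of the two notions above, the following sketch builds the confusion matrix of two partitions and derives a simple pair-counting distance from it. The pair-counting distance is a swapped-in, easy-to-compute measure and is not the fusion/fission shortest-path distance named on the slide; the function names are hypothetical, and both partitions are assumed to cover the same proteins.

```python
from collections import Counter


def confusion_matrix(part_a, part_b):
    """Count the proteins shared by cluster i of part_a and cluster j of part_b.

    Only non-empty intersections are stored, keyed by (i, j).
    """
    counts = Counter()
    for i, cluster_a in enumerate(part_a):
        for j, cluster_b in enumerate(part_b):
            shared = len(set(cluster_a) & set(cluster_b))
            if shared:
                counts[(i, j)] = shared
    return counts


def pair_distance(part_a, part_b):
    """Number of protein pairs grouped together in one partition but not in the other."""

    def pairs(n):
        return n * (n - 1) // 2

    together_a = sum(pairs(len(c)) for c in part_a)
    together_b = sum(pairs(len(c)) for c in part_b)
    together_both = sum(pairs(n) for n in confusion_matrix(part_a, part_b).values())
    return together_a + together_b - 2 * together_both
```

A distance of zero means the two partitions are identical; larger values mean more pairs of proteins on which the two predictions disagree.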

  20. Median partitions by consensus clustering. [Diagram: proteomes → Blast & SW sequence alignments → homeomorphic and nonhomeomorphic alignments → partitions ∏1, ∏2, ∏3, ∏4, …, ∏n.] Compute a consensus: a median partition ∏, minimizing the total distance to the partitions ∏1 … ∏n
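To make the objective concrete, here is a minimal sketch that evaluates the total distance of each input partition to all the others and returns the best one. This is a restricted "set median" (the answer must be one of the inputs) and it uses pair disagreements as the distance; both choices are assumptions made for illustration, and a true median partition need not be one of the inputs. All names are hypothetical.

```python
def co_clustered_pairs(partition):
    """Return the set of unordered protein pairs that share a cluster."""
    pairs = set()
    for cluster in partition:
        members = sorted(cluster)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                pairs.add((a, b))
    return pairs


def disagreements(part_a, part_b):
    """Pairs grouped together in exactly one of the two partitions."""
    return len(co_clustered_pairs(part_a) ^ co_clustered_pairs(part_b))


def set_median(partitions):
    """Index of the input partition with the smallest total distance, plus all totals."""
    totals = [sum(disagreements(p, q) for q in partitions) for p in partitions]
    best = min(range(len(partitions)), key=totals.__getitem__)
    return best, totals
```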

  21. Construction and algorithm
     • Define a similarity measure based on the components c_i
     • FRel_{i,j}: encodes the confusion matrix
     • R_k: maximal conflict regions
     • Select c_i in each R_k by MDC (min. disjoint cover): NP-complete

  22. Efficient heuristic. [Figures: conflict regions, conflict graph.]
     • Relaxation: admit inexact cover (not all proteins are in families)
     • Resolve conflicts by election + policy
     • For each component C:
       • for each c_i ∈ C, compute S_i and D_i
       • each protein p votes for the c_i in order of D_i ↑ and S_i ↓
       • take the winning c_i in order, so as to cover the most proteins p

  23. subgroups family

  24. 24

  25. Correlated gain and loss in networks and metabolic pathways

  26. Construct a PSSM for each family
     • Construction: proteomes and GL2 family fasta → PSI blast → PSSM
     • Validation: PSI blast with the PSSM, then comparison: TP, TN, FP, FN and worst E-val
     • 4384 families, as follows:
       • 4240 where FN = 0: FP med 0.0, avg 3.7, max 302; Ev med 6e-78, max 9e-6
       • 144 where FN > 0: FP med 4.5, avg 33, max 307
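The validation counts above can be sketched as follows, assuming the search results for one family are available as a mapping from protein identifier to E-value. The function name, the input format, and the reading of "worst E-val" as the largest E-value among true positives are all assumptions made for this illustration.

```python
def validate_family(members, hits, proteome_size):
    """Compare search hits for one family against its known members.

    members: set of protein ids assigned to the family
    hits: dict mapping each protein id found by the search to its E-value
    proteome_size: total number of proteins searched (needed for TN)
    """
    found = set(hits)
    true_pos = found & members
    tp = len(true_pos)
    fp = len(found - members)
    fn = len(members - found)
    tn = proteome_size - tp - fp - fn
    # Assumed meaning of "worst E-val": the weakest hit among the true positives.
    worst = max((hits[p] for p in true_pos), default=None)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn, "worst_evalue": worst}
```

A family with FN = 0 is one whose PSSM recovers every member; the FP count then shows how many extra proteins it also attracts.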

  27. Build a PSSM for each family and use to improve gene prediction. [Diagram: GL2 family fasta → PSI blast → PSSM*; ORF translations → PSI blast with the PSSM* → candidates → filtering by per-family size and E-value criteria → loci assigned to families.] *PSSM: position-specific scoring matrix for PSIBLAST

  28. Comparison with KOGs
     • Project families on S. cerevisiae
     • Select intersection and compare: 3625 proteins (~2500 families)
       • identities: 1901
       • split: 159 (4 GLS, 42 GLR, 113 GLC)
       • merge: 117 (6 GLS, 70 GLR, 79 GLC)
       • messy: 25 (2 GLS, 13 GLR, 23 GLC)
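The slide does not define the four categories, so the sketch below uses one plausible reading, stated here as an assumption: a family is "identical" if it equals a reference group, "split" if the reference divides it into several groups that it fully contains, "merge" if a single reference group strictly contains it, and "messy" otherwise. Both classifications are assumed to be restricted to the same set of proteins; all names are hypothetical.

```python
def compare_classifications(families, reference):
    """Label each family by how it maps onto the reference groups.

    families, reference: dicts mapping a group name to a set of protein ids
    """
    labels = {}
    for name, family in families.items():
        overlapping = [group for group in reference.values() if family & group]
        if len(overlapping) == 1 and overlapping[0] == family:
            labels[name] = "identical"
        elif len(overlapping) > 1 and all(group <= family for group in overlapping):
            labels[name] = "split"
        elif len(overlapping) == 1 and family < overlapping[0]:
            labels[name] = "merge"
        else:
            labels[name] = "messy"
    return labels
```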

  29. Comparison with KOGs

  30. Comparison of GLR.3294 with PIRSF (UniProt) 017196 and 017667

  31. Comparison of GLR.3292 with PIRSF 017297 and 016767

  32. Comparative maps. Despite similarity in size, gene content and ecological niche, yeast genomes are highly rearranged. In general, synteny is poorly conserved. In part:
     • evolutionary distance
     • artifact of WGD

  33. Comparative maps. But, if we focus on the Saccharomycetaceae that did not undergo a recent whole genome duplication: protoploid genomes
     • homogeneous
     • low redundancy
     • less reshuffling

  34. Syntenic homologs are orthologs
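A minimal sketch of the idea, under simplifying assumptions: each genome is a single ordered list of genes, and a homolog pair counts as syntenic when at least one other homolog pair lies within a small window of it on both genomes. The window size, the support threshold and the function name are hypothetical choices for illustration, not the criteria used in the talk.

```python
def syntenic_pairs(order_a, order_b, homologs, window=2, min_support=1):
    """Return the homolog pairs supported by nearby homolog pairs on both genomes.

    order_a, order_b: gene ids in chromosomal order (one chromosome each, for simplicity)
    homologs: iterable of (gene_a, gene_b) pairs
    """
    pos_a = {gene: i for i, gene in enumerate(order_a)}
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    pairs = {(a, b) for a, b in homologs if a in pos_a and b in pos_b}

    result = []
    for a, b in sorted(pairs):
        # Count other homolog pairs whose genes are neighbours of a and of b.
        support = sum(
            1
            for x, y in pairs
            if 0 < abs(pos_a[x] - pos_a[a]) <= window
            and 0 < abs(pos_b[y] - pos_b[b]) <= window
        )
        if support >= min_support:
            result.append((a, b))
    return result
```

In this reading, a gene with several homologs in the other genome keeps only the pairs whose neighbourhood is conserved, which is what lets synteny separate orthologs from other homologs.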

  35. So, in conclusion
     Comparative genomics works if you pay attention to the data
     • High-quality, complete genomes
     • Chosen from interesting phylogenetic groups
     Building tools and analyses works if you have a plan
     • Genome annotation
     • Protein families and subgroups
     • Syntenic blocks and common markers
     Many opportunities for further work
     http://genolevures.org/ ≡ http://cbi.labri.fr/Genolevures/
