
Comparative Genomics: Computational Challenges - Bernard M.E. Moret



1. Comparative Genomics: Computational Challenges. Bernard M.E. Moret, Laboratory for Computational Biology and Bioinformatics, EPFL. Nantes, 6/8/09.

2. Overview

- Comparative approaches
- The genome and its evolution
- High-throughput data and computation
- What do we want to know?
- Comparing two genomes
- Comparing multiple genomes
- Ancestral reconstruction
- Challenges


4. Comparative Approaches

"Nothing makes sense in biology except in the light of evolution." Th. Dobzhansky, The American Biology Teacher, 1973.

- an evolutionary perspective requires models of evolution
- a model of evolution requires data that reveals evolution
- data that reveals evolution must come from several organisms or tissues
- hence working in the light of evolution requires comparative approaches

From the point of view of experimentalists and medical researchers:

- some organisms, incl. humans, are difficult to study in a lab setting
- some experiments cannot be performed on some organisms, incl. humans, for practical or ethical reasons
- hence learning about these organisms is best done by studying others

5. Characteristics of Comparative Approaches

Comparative approaches:

- rely on identification of conserved patterns
- develop evolutionary models for observed changes
- use conserved patterns for datamining and evolutionary models for analysis of mined data

Translated to comparative genomics:

- data: whole-genome sequences
- conserved patterns: subsequences, distribution statistics, or combinations
- models: duplication/loss/rearrangement at the genome level, mutation/indel for nucleotides
- datamining: important components, anchors, clusters, syntenic blocks

6. Comparative Genomics Vocabulary

- syntenic block: conserved pattern (subject to microrearrangements) used to denote a conserved block of genes (10 Kbp to 1 Mbp); originally used in genetics to denote colocation on the same chromosome
- genomic alignment: sequence-level alignment of complete genomes, with block-level rearrangements, duplications, and losses
- positive (Darwinian) selection: selects for favorable traits; accelerates observed change in affected regions
- negative (filtering) selection: selects against changes of most kinds; slows down observed change in affected regions
- ancestral reconstruction: inference of the putative contents and arrangement of the genome of a common ancestor
- genomic signature: originally, the distribution of dinucleotide frequencies; now a characterization of common patterns in a group of genomes
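The original sense of a genomic signature, the distribution of dinucleotide frequencies, is easy to compute directly. A minimal sketch (the function name and toy sequence are illustrative, not from the talk):

```python
from collections import Counter

def dinucleotide_signature(seq):
    """Relative frequencies of overlapping dinucleotides in a DNA sequence."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if set(p) <= set("ACGT")]  # skip ambiguous bases
    counts = Counter(pairs)
    total = sum(counts.values())
    return {d: counts[d] / total for d in sorted(counts)}

sig = dinucleotide_signature("ATGCGCGCATTA")
```

Comparing such frequency vectors between genomes (e.g., by Euclidean or rank distance) is how the signature was originally used to characterize species.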


8. Evolution of the Genome

What evolutionary events affect the genome?

- nucleotide-level: "classical" sequence evolution (mutations and indels)
- genomic rearrangements: inversions, transpositions, translocations, and chromosomal fusion and fission
- duplication: gene retrotransposition, tandem duplication, segmental duplication, whole-genome duplication
- loss: point mutation, segmental deletion, nonfunctionalization
- recombination: meiotic recombination, hybridization, lateral gene transfer
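Genomic rearrangements such as inversions are commonly modeled as operations on a signed permutation of gene identifiers, where the sign records each gene's strand. A minimal sketch of one inversion under that standard representation (not code from the talk):

```python
def apply_inversion(order, i, j):
    """Reverse the segment order[i..j] (inclusive) and flip each gene's
    sign, modeling a genomic inversion on a signed gene order."""
    segment = [-g for g in reversed(order[i:j + 1])]
    return order[:i] + segment + order[j + 1:]

genome = [1, 2, 3, 4, 5]
print(apply_inversion(genome, 1, 3))  # -> [1, -4, -3, -2, 5]
```

Distance measures between gene orders (e.g., inversion distance) count the minimum number of such operations needed to transform one genome into another.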

9. Evolutionary Models

And how well understood are they?

- nucleotide-level: well-established models with good statistics
- genomic rearrangements: enormous work in the last 10 years, but still parameter-poor
- duplication/loss: established work in lineage sorting (divergent gene evolution due to paralogs), much attention to whole-genome duplication, just starting on segmental duplications
- recombination: established work in population genetics, much work on identifying lateral gene transfer, detailed work on recombination just starting
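Among the well-established nucleotide-level models is Jukes-Cantor (JC69), whose distance correction converts an observed fraction of differing sites into an estimated number of substitutions per site. A sketch, assuming JC69's premises of equal base frequencies and equal substitution rates (the example sequences are illustrative):

```python
import math

def jukes_cantor_distance(s1, s2):
    """JC69 evolutionary distance: d = -(3/4) * ln(1 - 4p/3),
    where p is the fraction of sites at which the sequences differ."""
    assert len(s1) == len(s2), "sequences must be aligned and equal-length"
    diffs = sum(a != b for a, b in zip(s1, s2))
    p = diffs / len(s1)
    return -0.75 * math.log(1 - 4 * p / 3)

d = jukes_cantor_distance("ACGTACGT", "ACGTACGA")  # p = 1/8
```

The correction accounts for multiple hits at the same site, which is why d exceeds p as divergence grows.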


11. High-throughput data

Slowly pervading laboratory-based biology:

- sequencing: the original high-throughput data source, now easily the most economical high-tech lab instrument
- gene expression: microarrays and their ilk, now inexpensive and in very widespread use
- transcription profiling: ChIP-chip, ChIP-seq, and future products, soon to replace microarrays
- mass spectrometry: for protein analysis and sequencing; now also for mixed samples (metaproteomics)
- SNP assays: for precise genotyping of humans
- other domains: cell signalling, metabolomics (e.g., fluxes), 3D imaging, time series, etc.

12. High-throughput sequencing

Developed for the human genome project, now also used for:

- de novo sequencing: still the most challenging
- resequencing: verify base calls, test assemblies, etc.
- deep sequencing: dense sampling or high coverage
- metagenomics: random sampling of microbial communities

Current technologies (454, Illumina) generate around 4 Gbp per half-day run. Next-gen technologies may yield over 20 Gbp per hour-long run, at better than 50x coverage, with very short (20 bp) fragments.
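The throughput figures above imply simple coverage arithmetic. A back-of-the-envelope sketch, taking the projected 20 Gbp-per-hour yield and 20 bp reads from the slide and assuming a human-sized 3 Gbp target genome:

```python
genome_size = 3e9    # assumed human-sized genome, ~3 Gbp
run_yield = 20e9     # projected next-gen yield: ~20 Gbp per hour-long run
read_length = 20     # very short fragments, ~20 bp

coverage_per_run = run_yield / genome_size   # depth from a single run
runs_for_50x = 50 * genome_size / run_yield  # runs needed to reach 50x
reads_per_run = run_yield / read_length      # number of reads to assemble
```

At these rates a single run already covers the genome several times over, but the billion-read count per run shows why assembly from 20 bp fragments is the computational bottleneck.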

13. Can computation keep up?

Computer power still follows Moore's law, doubling every 12–15 months. However:

- That power is getting harder to use (parallelism is hard to exploit).
- Data accumulates faster than Moore's law (sequence data alone doubles every year).
- High-performance computing does not help much: the fastest machines can provide only a 10^3–10^4 speedup; this comparison presupposes a linear relationship, but most genomic analysis algorithms are much slower.
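The compute-versus-data gap can be made concrete with the doubling periods above (12 months for sequence data, 12–15 months for compute power). An illustrative sketch, assuming simple exponential growth at those rates:

```python
def capability_ratio(years, compute_doubling_months=15, data_doubling_months=12):
    """Ratio of compute power to data volume after `years`, each quantity
    doubling at its own rate; a value below 1 means data has outpaced compute."""
    compute = 2 ** (12 * years / compute_doubling_months)
    data = 2 ** (12 * years / data_doubling_months)
    return compute / data

print(capability_ratio(5))  # -> 0.5: compute has fallen behind twofold
```

Even this understates the problem: if an analysis scales superlinearly in input size, the effective gap widens faster still.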

14. Genome-scale computing

Running time is only one facet of the problem. Comparing several genomes of a few Gbp each requires a lot of memory: at least 128 GB per node.

- Available and not too expensive: a 16-core compute node with 128 GB memory and 500 GB disk can be had for $20K.
- Still rare: most compute clusters have "thin" nodes (2–8 GB of memory), unsuited to whole-genome analysis.
- 2010 architectures will pack 64–128 cores per node, with 0.5–1 TB memory: great for comparing a few genomes, but by then we will have 100s of vertebrate genomes...


16. Basic annotation per genome

We want to identify:

- all coding genes (or exons)
- all noncoding genes
- gene families
- SINEs, LINEs, and other repeat elements
- regions under positive, neutral, or negative selection

Beyond this basic level, we want to identify:

- gene clusters, operons, alternative splicing scenarios, etc.
- gene function

17. Comparative annotation

Using pairwise comparative approaches, we can ask for:

- all pairwise homologies
- orthology and paralogy relationships within gene families (within limits due to lack of phylogenetic information)
- syntenic blocks
- mapping of the syntenic blocks between the two genomes (simple translocations and transpositions, with inversions)
- translation of functional annotations from each genome into the other
