
Comparative Genomics: Computational Challenges - Bernard M.E. Moret



1. Comparative Genomics: Computational Challenges. Bernard M.E. Moret, Laboratory for Computational Biology and Bioinformatics, EPFL. Nantes, 6/8/09.

2. Overview

- Comparative approaches
- The genome and its evolution
- High-throughput data and computation
- What do we want to know?
- Comparing two genomes
- Comparing multiple genomes
- Ancestral reconstruction
- Challenges


4. Comparative Approaches

"Nothing makes sense in biology except in the light of evolution." Th. Dobzhansky, The American Biology Teacher, 1973.

- an evolutionary perspective requires models of evolution
- a model of evolution requires data that reveals evolution
- data that reveals evolution must come from several organisms or tissues
- hence working in the light of evolution requires comparative approaches

From the point of view of experimentalists and medical researchers:

- some organisms, incl. humans, are difficult to study in a lab setting
- some experiments cannot be performed on some organisms, incl. humans, for practical or ethical reasons
- hence learning about these organisms is best done by studying others

5. Characteristics of Comparative Approaches

Comparative approaches:

- rely on identification of conserved patterns
- develop evolutionary models for observed changes
- use conserved patterns for datamining and evolutionary models for analysis of mined data

Translated to comparative genomics:

- data: whole-genome sequences
- conserved patterns: subsequences, distribution statistics, or combinations
- models: duplication/loss/rearrangement at the genome level, mutation/indel for nucleotides
- datamining: important components, anchors, clusters, syntenic blocks

6. Comparative Genomics Vocabulary

- syntenic block: conserved pattern (subject to microrearrangements) used to denote a conserved block of genes (10 Kbp to 1 Mbp); originally used in genetics to denote colocation on the same chromosome
- genomic alignment: sequence-level alignment of complete genomes, with block-level rearrangements, duplications, and losses
- positive (Darwinian) selection: selects for favorable traits; accelerates observed change in affected regions
- negative (filtering) selection: selects against changes of most kinds; slows down observed change in affected regions
- ancestral reconstruction: inference of the putative contents and arrangement of the genome of a common ancestor
- genomic signature: originally, the distribution of dinucleotide frequencies; now a characterization of common patterns in a group of genomes
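The original sense of a genomic signature, the distribution of dinucleotide frequencies, is easy to compute directly. A minimal sketch (the function name and toy sequence are illustrative, not from the talk):

```python
from collections import Counter

def dinucleotide_signature(seq):
    """Relative frequencies of overlapping dinucleotides in a DNA sequence."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if set(p) <= set("ACGT")]  # skip ambiguous bases
    counts = Counter(pairs)
    total = sum(counts.values())
    return {d: counts[d] / total for d in sorted(counts)}

sig = dinucleotide_signature("ATGCGCGCATTA")
```

Comparing such frequency vectors between genomes (e.g., by Euclidean or rank distance) is how the signature was originally used to characterize species.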


8. Evolution of the Genome

What evolutionary events affect the genome?

- nucleotide-level: "classical" sequence evolution (mutations and indels)
- genomic rearrangements: inversions, transpositions, translocations, and chromosomal fusion and fission
- duplication: gene retrotransposition, tandem duplication, segmental duplication, whole-genome duplication
- loss: point mutation, segmental deletion, nonfunctionalization
- recombination: meiotic recombination, hybridization, lateral gene transfer
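Genomic rearrangements such as inversions are commonly modeled as operations on a signed permutation of gene identifiers, where the sign records each gene's strand. A minimal sketch of one inversion under that standard representation (not code from the talk):

```python
def apply_inversion(order, i, j):
    """Reverse the segment order[i..j] (inclusive) and flip each gene's
    sign, modeling a genomic inversion on a signed gene order."""
    segment = [-g for g in reversed(order[i:j + 1])]
    return order[:i] + segment + order[j + 1:]

genome = [1, 2, 3, 4, 5]
print(apply_inversion(genome, 1, 3))  # -> [1, -4, -3, -2, 5]
```

Distance measures between gene orders (e.g., inversion distance) count the minimum number of such operations needed to transform one genome into another.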

9. Evolutionary Models

And how well understood are they?

- nucleotide-level: well-established models with good statistics
- genomic rearrangements: enormous work in the last 10 years, but still parameter-poor
- duplication/loss: established work in lineage sorting (divergent gene evolution due to paralogs), much attention to whole-genome duplication, just starting on segmental duplications
- recombination: established work in population genetics, much work on identifying lateral gene transfer, detailed work on recombination just starting
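Among the well-established nucleotide-level models is Jukes-Cantor (JC69), whose distance correction converts an observed fraction of differing sites into an estimated number of substitutions per site. A sketch, assuming JC69's premises of equal base frequencies and equal substitution rates (the example sequences are illustrative):

```python
import math

def jukes_cantor_distance(s1, s2):
    """JC69 evolutionary distance: d = -(3/4) * ln(1 - 4p/3),
    where p is the fraction of sites at which the sequences differ."""
    assert len(s1) == len(s2), "sequences must be aligned and equal-length"
    diffs = sum(a != b for a, b in zip(s1, s2))
    p = diffs / len(s1)
    return -0.75 * math.log(1 - 4 * p / 3)

d = jukes_cantor_distance("ACGTACGT", "ACGTACGA")  # p = 1/8
```

The correction accounts for multiple hits at the same site, which is why d exceeds p as divergence grows.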


11. High-throughput data

Slowly pervading laboratory-based biology:

- sequencing: the original high-throughput data source, now easily the most economical high-tech lab instrument
- gene expression: microarrays and their ilk, now inexpensive and in very widespread use
- transcription profiling: ChIP-chip, ChIP-seq, and future products, soon to replace microarrays
- mass spectrometry: for protein analysis and sequencing; now also for mixed samples (metaproteomics)
- SNP assays: for precise genotyping of humans
- other domains: cell signalling, metabolomics (e.g., fluxes), 3D imaging, time series, etc.

12. High-throughput sequencing

Developed for the human genome project, now also used for:

- de novo sequencing: still the most challenging
- resequencing: verify base calls, test assemblies, etc.
- deep sequencing: dense sampling or high coverage
- metagenomics: random sampling of microbial communities

Current technologies (454, Illumina) generate around 4 Gbp per half-day run. Next-gen technologies may yield over 20 Gbp per hour-long run, at better than 50x coverage, with very short (20 bp) fragments.
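The throughput figures above imply simple coverage arithmetic. A back-of-the-envelope sketch, taking the projected 20 Gbp-per-hour yield and 20 bp reads from the slide and assuming a human-sized 3 Gbp target genome:

```python
genome_size = 3e9    # assumed human-sized genome, ~3 Gbp
run_yield = 20e9     # projected next-gen yield: ~20 Gbp per hour-long run
read_length = 20     # very short fragments, ~20 bp

coverage_per_run = run_yield / genome_size   # depth from a single run
runs_for_50x = 50 * genome_size / run_yield  # runs needed to reach 50x
reads_per_run = run_yield / read_length      # number of reads to assemble
```

At these rates a single run already covers the genome several times over, but the billion-read count per run shows why assembly from 20 bp fragments is the computational bottleneck.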

13. Can computation keep up?

Computer power still follows Moore's law, doubling every 12–15 months. However:

- That power is getting harder to use (parallelism is hard to exploit).
- Data accumulates faster than Moore's law (sequence data alone doubles every year).
- High-performance computing does not help much: the fastest machines can provide only a 10^3–10^4 speedup; this comparison presupposes a linear relationship, but most genomic analysis algorithms are much slower.
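The compute-versus-data gap can be made concrete with the doubling periods above (12 months for sequence data, 12–15 months for compute power). An illustrative sketch, assuming simple exponential growth at those rates:

```python
def capability_ratio(years, compute_doubling_months=15, data_doubling_months=12):
    """Ratio of compute power to data volume after `years`, each quantity
    doubling at its own rate; a value below 1 means data has outpaced compute."""
    compute = 2 ** (12 * years / compute_doubling_months)
    data = 2 ** (12 * years / data_doubling_months)
    return compute / data

print(capability_ratio(5))  # -> 0.5: compute has fallen behind twofold
```

Even this understates the problem: if an analysis scales superlinearly in input size, the effective gap widens faster still.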

14. Genome-scale computing

Running time is only one facet of the problem. Comparing several genomes of a few Gbp each requires a lot of memory: at least 128 GB per node.

- Available and not too expensive: a 16-core compute node with 128 GB memory and 500 GB disk can be had for $20K.
- Still rare: most compute clusters have "thin" nodes (2–8 GB of memory), unsuited to whole-genome analysis.
- 2010 architectures will pack 64–128 cores per node, with 0.5–1 TB memory: great for comparing a few genomes, but by then we will have 100s of vertebrate genomes...


16. Basic annotation per genome

We want to identify:

- all coding genes (or exons)
- all noncoding genes
- gene families
- SINEs, LINEs, and other repeat elements
- regions under positive, neutral, or negative selection

Beyond this basic level, we want to identify:

- gene clusters, operons, alternative splicing scenarios, etc.
- gene function

17. Comparative annotation

Using pairwise comparative approaches, we can ask for:

- all pairwise homologies
- orthology and paralogy relationships within gene families (within limits due to lack of phylogenetic information)
- syntenic blocks
- mapping of the syntenic blocks between the two genomes (simple translocations and transpositions, with inversions)
- translation of functional annotations from each genome into the other
