Sequencing and decoding genomes • C. Victor Jongeneel, PhD • Ludwig Institute for Cancer Research; Swiss Institute of Bioinformatics; Center for Integrative Genomics, U. of Lausanne • Victor.Jongeneel@licr.org
The Computational Biology Challenge "In principle, the string of genetic bits holds long-sought secrets of human development, physiology and medicine. In practice, our ability to transform such information into understanding remains woefully inadequate." International Human Genome Sequencing Consortium, "Initial sequencing and analysis of the human genome," Nature 409: 860-921 (2001) [Emphasis added]
Outline of the talk • All about genomes • Sequencing technologies, old and new • From a test tube full of DNA to a genome sequence • And now, where are the genes? • Presenting the results: genome browsers
What is a genome? • Set of genes transmitted between generations • Made of deoxyribonucleic acid (DNA) • A long string in a four-letter alphabet (A,C,G,T) • Present in every cell of the body • Contains the master plan for building an organism • Contains all of the instructions to keep a cell alive and to allow it to divide • Copies itself at every cell division
Genome complexity • Sizes: • viruses: 10^3 to 10^5 nt • bacteria: 10^5 to 10^7 nt • Baker’s yeast: 1.35 x 10^7 nt • mammals: 3-4 x 10^9 nt • plants: 10^8 to 10^11 nt • Numbers of genes: • viruses: 3 to 100 • bacteria: 1000 to 5000 • Baker’s yeast: ~6’000 • mammals: 20’000 - 30’000
Information carried by DNA [Figure labels: centromere; telomere; exons of genes; locus control region; gene regulatory elements; repetitive sequences] (NB: this drawing is not to scale)
Structure of a typical vertebrate gene [Figure labels: CpG islands; 5’-UTR; Stop; 3’-UTR]
The human genome • Size: 3 x 10^9 nt for a single copy (haploid) • Highly repetitive sequences (>1000 copies): 25% • Middle repetitive sequences: 25-30% (>50% repetitive in total) • Sizes of genes: from 900 to >2’000’000 nt (including introns) • Proportion encoding proteins: 5-7% • Number of chromosomes: 22 autosomes, 2 sex-linked chromosomes (X and Y) • Sizes of chromosomes: 5 x 10^7 to 5 x 10^8 nt
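As a rough illustration of what "a long string in a four-letter alphabet" means in storage terms, the sketch below assumes a plain 2-bits-per-nucleotide encoding (no compression, no quality data). The function name is hypothetical and only the 3 x 10^9 nt figure comes from the slide.

```python
import math

ALPHABET = "ACGT"
# A four-letter alphabet carries log2(4) = 2 bits per nucleotide.
BITS_PER_NT = math.log2(len(ALPHABET))


def haploid_size_megabytes(n_nt: float) -> float:
    """Storage for n_nt nucleotides at 2 bits each, in megabytes (10^6 bytes)."""
    return n_nt * BITS_PER_NT / 8 / 1e6


# Haploid human genome size taken from the slide: 3 x 10^9 nt.
human_mb = haploid_size_megabytes(3e9)
print(f"Haploid human genome: about {human_mb:.0f} MB at 2 bits per nucleotide")
```

Under this assumption a single copy of the human genome fits in well under one gigabyte; the difficulty lies in interpreting the string, not in storing it.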
Current sequencing technology • Sequencing method developed by Fred Sanger in the 1970s • Principle: randomly terminate synthesis of a DNA strand using a nucleotide analogue, then separate the products by size by electrophoresis in a polyacrylamide gel • Many technological improvements: • Use of fluorescently labeled nucleotides • Use of pre-cast gels in capillaries • Automation of sample handling • Use of microfluidics technologies
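The chain-termination principle can be sketched as a toy simulation. This is a simplified model, not a description of any real instrument: it assumes every possible terminated product is present, that electrophoresis resolves products to a single nucleotide, and it ignores the template/complement distinction. The function names are hypothetical.

```python
def sanger_fragments(new_strand: str) -> dict[str, list[int]]:
    """For each terminating base, list the lengths of the strands ending in it.

    Each of the four reactions (or dye colours) stops synthesis at random
    occurrences of one base, so together they yield one product per position.
    """
    lanes: dict[str, list[int]] = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read the sequence by sorting products by length, shortest first."""
    base_at_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(base_at_length[length] for length in sorted(base_at_length))


lanes = sanger_fragments("GATTACA")
print(lanes)
print(read_gel(lanes))
```

Sorting by fragment length plays the role of the gel: the order in which the labelled products appear spells out the sequence.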
Sequencing machines
Sequencing machines – a current model
Sequencing centre
Raw data
Typical output of current sequencing technology • One machine handles 500 samples per day • Each sample produces 700 nt of raw sequence • A typical centre will host a hundred sequencing machines → A typical centre can produce 30 million nucleotides' worth of sequence every working day • BUT this requires a very large investment and is labor intensive • There are fewer than 100 such centres worldwide
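The arithmetic behind these figures can be checked in a few lines. The product of the three slide figures is 35 million raw nucleotides per day; the slide quotes 30 million, and the sketch assumes (this is not stated on the slide) that the difference reflects failed or discarded samples. The 8-fold coverage used at the end is likewise an illustrative assumption.

```python
import math

# Figures from the slide.
SAMPLES_PER_MACHINE_PER_DAY = 500
NT_PER_SAMPLE = 700
MACHINES_PER_CENTRE = 100
QUOTED_NT_PER_DAY = 30_000_000

# Upper bound if every sample yields a full-length read.
raw_nt_per_day = SAMPLES_PER_MACHINE_PER_DAY * NT_PER_SAMPLE * MACHINES_PER_CENTRE


def working_days(genome_nt: int, coverage: int, nt_per_day: int) -> int:
    """Working days needed to sequence genome_nt nucleotides `coverage` times."""
    return math.ceil(genome_nt * coverage / nt_per_day)


# Assumption for illustration: 8-fold coverage of a 3 x 10^9 nt genome.
days = working_days(3_000_000_000, 8, QUOTED_NT_PER_DAY)
print(f"Raw upper bound: {raw_nt_per_day:,} nt per day")
print(f"One centre, 8x human genome: about {days} working days")
```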
A real sequencing centre: DOE JGI

Date(s)                       Total Q20* Bases   Total Lanes**   % Passed†   Ave. Read Length‡
2/26/06: ABI3730              42.672 Million     65,952          92.77%      696
12/16/04: MegaBACE4000        3.512 Million      6,528           88.31%      606
2/26/06: MegaBACE4500         15.386 Million     23,424          83.27%      787
Current month (2/06)          2.147 Billion      2,047,968       94.31%      717
Last month (1/06)             2.807 Billion      4,187,132       93.69%      714
FY to Date (10/05-2/26/06)    13.056 Billion     20,517,314      95%         625
Total (3/99-2/26/06)          109.947 Billion    185,133,995     92%         653
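The footnote defining "Q20" did not survive in this copy of the table. Assuming it follows the usual Phred convention, a base with quality Q has an estimated error probability of 10^(-Q/10), so a Q20 base is one called with at most a 1% chance of error. A minimal sketch of that conversion, with hypothetical function names:

```python
import math


def error_probability(quality: float) -> float:
    """Estimated probability that a base call with this Phred quality is wrong."""
    return 10 ** (-quality / 10)


def phred_quality(p_error: float) -> float:
    """Phred quality corresponding to an error probability."""
    return -10 * math.log10(p_error)


for q in (10, 20, 30):
    print(f"Q{q}: error probability {error_probability(q):.4f}")
```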
New sequencing technologies • Current leaders: 454 Life Sciences, Solexa • Main technological advances: • Single molecule amplification on solid supports: microbeads in emulsion (454), surfaces with DNA primers (Solexa) • Bulk sequencing technologies: pyrosequencing (454), sequencing by synthesis (Solexa) • Sophisticated signal acquisition and processing • Throughput from a single machine: • 200,000 sequences of 100 nt each in 4 hours (454) • 2 million sequences of 25 nt each in 8 hours (Solexa) • Both technologies provide single-machine throughput similar to that of an entire genome sequencing centre!
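The comparison with a whole sequencing centre can be made concrete with the figures quoted in the talk. The sketch below counts raw nucleotides only; it says nothing about read length or error rates, which matter a great deal for assembly. The helper name is hypothetical.

```python
def nt_per_hour(reads: int, read_length_nt: int, run_hours: float) -> float:
    """Raw nucleotides produced per hour of machine time."""
    return reads * read_length_nt / run_hours


# Figures from the slide.
rate_454 = nt_per_hour(200_000, 100, 4)      # 20 million nt per 4-hour run
rate_solexa = nt_per_hour(2_000_000, 25, 8)  # 50 million nt per 8-hour run

# Daily output of a Sanger-based centre as quoted earlier in the talk.
CENTRE_NT_PER_DAY = 30_000_000

hours_454 = CENTRE_NT_PER_DAY / rate_454
hours_solexa = CENTRE_NT_PER_DAY / rate_solexa
print(f"454:    {rate_454:,.0f} nt/h, matches a centre-day in {hours_454:.1f} h")
print(f"Solexa: {rate_solexa:,.0f} nt/h, matches a centre-day in {hours_solexa:.1f} h")
```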
454 sequencing
A 454 sequencing machine (Roche Diagnostics)
Some applications of “ultra-low cost” sequencing • Rapid sequencing of bacterial genomes • Sampling of “environmental” genomes • Sequencing of individual human genomes as a component of preventative medicine • Rapid hypothesis testing for genotype–phenotype associations • In vitro and in situ gene-expression profiling at all stages in the development of a multicellular organism • Cancer research: for example, determining comprehensive mutation sets for individual clones, carrying out loss-of-heterozygosity analysis, and profiling tumour sub-types for diagnosis and prognosis
Making sequence information public • Sequence databases: EMBL (Europe), GenBank (US), DDBJ (Japan) • Repositories of genome and transcriptome data • “The EMBL Nucleotide Sequence Database was frozen to make Release 85 on 30-NOV-2005. The release contains 64,739,883 sequence entries comprising 116,106,677,726 nucleotides, of which 12,088,383 entries (59,629,958,692 nucleotides) are WGS (whole genome shotgun) data.” • Trace file repositories (raw data from sequencers) at all major sequencing centres
Growth of the public sequence repository [Figure: growth of the EMBL sequence database; y-axis: nucleotides on a log scale (1.00E+05 to 1.00E+12), x-axis: date (Feb-82 to Oct-06), with the human genome marked; regression to an exponential, R=0.995] Doubling time: ~5 months
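A doubling time can be read off an exponential regression by fitting a straight line to the logarithm of the database size. The sketch below shows the calculation on synthetic data generated with a known 12-month doubling time; these points are not the EMBL figures, which are not reproduced in this transcript, and the function name is hypothetical.

```python
import math


def doubling_time(times: list[float], values: list[float]) -> float:
    """Least-squares fit of log2(value) = a + b * t; returns 1 / b.

    The result is in the same unit as `times`.
    """
    logs = [math.log2(v) for v in values]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    covariance = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    variance = sum((t - mean_t) ** 2 for t in times)
    return variance / covariance


# Synthetic example only: a database that doubles every 12 months.
months = [0.0, 12.0, 24.0, 36.0]
sizes = [1e6 * 2 ** (m / 12) for m in months]
print(f"Estimated doubling time: {doubling_time(months, sizes):.1f} months")
```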
Exponential growth in computing and sequencing From Shendure et al, Nature Reviews Genetics 5:335 (2004)
Challenges in genome sequencing • Using current technology, pieces of the genome have to be individually cloned and amplified (in bacterial vectors) before sequencing • Genome sizes are in the hundreds of millions of nucleotides, sequence reads in the hundreds • Millions of reads will be required to obtain several-fold coverage • These reads will have to be assembled based on overlaps • A sizable proportion of many genomes consists of repeated sequences • A measured scaffold will need to be built to guide the assembly process • This scaffold will be based on a physical map of the genome
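The "millions of reads" estimate follows from a simple relation: reads needed = coverage x genome size / read length. The sketch below also uses the standard Poisson approximation (Lander-Waterman) for the fraction of the genome left uncovered by randomly placed reads, e^(-coverage). The 8-fold coverage is an illustrative assumption; the genome size and read length are taken from earlier slides.

```python
import math


def reads_needed(genome_nt: int, read_length_nt: int, coverage: float) -> int:
    """Number of reads required to reach the given average coverage."""
    return math.ceil(coverage * genome_nt / read_length_nt)


def uncovered_fraction(coverage: float) -> float:
    """Expected fraction of the genome hit by no read (Poisson approximation)."""
    return math.exp(-coverage)


# 3 x 10^9 nt genome, 700 nt reads, assumed 8-fold coverage.
n_reads = reads_needed(3_000_000_000, 700, 8)
print(f"Reads needed: {n_reads:,}")
print(f"Expected uncovered fraction at 8x: {uncovered_fraction(8):.5f}")
```

Even at several-fold coverage a small fraction of the genome is missed by chance, which is one reason finishing and gap closure are separate steps.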
Shotgun sequencing and assembly Figures from the U. of Maryland Center for Bioinformatics and Computational Biology
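Assembly from overlaps can be illustrated with a toy greedy assembler: repeatedly merge the two sequences with the longest suffix-prefix overlap. This is a minimal sketch assuming error-free reads from one strand and no repeats; it is not how production assemblers work, and the function names are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # remaining contigs do not overlap
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGTAC", "GCGTACGTT", "TACGTTAGC"]
print(greedy_assemble(reads))
```

With a repeated sequence longer than the overlaps, the same greedy choice can join the wrong pair of reads, which is the kind of mis-assembly the paired-read and BAC-map scaffolds are meant to prevent.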
Mis-assembly caused by a repeated sequence Figures from the U. of Maryland Center for Bioinformatics and Computational Biology
Scaffolds based on paired reads or BAC maps • Paired reads: the distance separating ends of clones is known (within limits) • BAC map: the genome is divided into pieces of 100-200 kb, which are mapped relative to each other. A minimal set is then sequenced • Figures from the U. of Maryland Center for Bioinformatics and Computational Biology
Genome assembly software • The assembly of larger and larger sequence contigs is a difficult problem • Some of the most sophisticated software used in the life sciences addresses this issue • Graph theory is used extensively (Hamiltonian or Eulerian paths) in contig assembly • Examples: • Phrap (Phil Green, U. of Washington) • Celera assembler (Gene Myers, UC Berkeley) • Arachne (David Jaffe and Eric Lander, MIT)
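To illustrate the Eulerian-path formulation, the sketch below builds a de Bruijn graph from the k-mers of a set of reads and walks an Eulerian path through it with Hierholzer's algorithm. It is a toy under strong assumptions (error-free reads, one strand, each k-mer occurring once in the genome) and is not a description of how Phrap, the Celera assembler or Arachne are implemented. All names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads: list[str], k: int) -> dict[str, list[str]]:
    """Each distinct k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph: dict[str, list[str]] = defaultdict(list)
    seen: set[str] = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph: dict[str, list[str]]) -> list[str]:
    """Hierholzer's algorithm: visit every edge exactly once."""
    in_degree: dict[str, int] = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (node for node in graph if len(graph[node]) - in_degree[node] == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path: list[str]) -> str:
    """Reconstruct the sequence from consecutive overlapping nodes."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGTAC", "GCGTACGTT", "TACGTTAGC"]
print(spell(eulerian_path(de_bruijn(reads, k=5))))
```

In the alternative overlap-graph formulation each read is a node and the assembly corresponds to a Hamiltonian path, which is computationally much harder to find in general than an Eulerian path.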
Anatomy of a genome sequencing project (ca 2005) [Flowchart labels: Genomic DNA; BAC library; Plasmid library; New sequencing technologies; Mapping; End sequencing; Shotgun sequencing; Sequence assembly software; Shotgun sequencing; Contigs; Super-contigs; Finishing, gap closure; Finished sequence]
From sequence to code • A finished genome sequence is nothing but a set of long and uninformative strings (chromosomes) of ACGT • A major task in any genome sequencing project is the annotation • The annotation process tries to assign functions to substrings of the chromosome • Annotation is just a representation of current knowledge about how a genome works
Gene prediction methods • Two different approaches: • Extrinsic methods: based on similarity to known nucleotide or protein sequences, or on similarity between genomes (comparative genomics) • Intrinsic or ab initio methods: based on the properties of the sequence (statistical regularities), or on the presence of known signals
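The simplest "known signals" an intrinsic method can use are start and stop codons. The sketch below scans the three forward reading frames for open reading frames (ATG followed by an in-frame stop). It is a toy: it ignores the reverse strand and introns, so it is closer to what works on bacterial sequence than on vertebrate genes, and real gene finders combine many more signals with statistical models. The function name and the minimum length are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) of forward-strand ORFs, end exclusive.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon
    and must contain at least min_codons codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


sequence = "CCATGGCTGCAAAATAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```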