
1 Traditional Genome Sequencing Based on the protocol used at JGI - PDF document



  1. BITS 2009, Mar 20th 2009
     Next-Generation Sequencing (NGS): An Overview
     Francesca D. Ciccarelli

     NGS in the Literature
     [Figure; labels: Next-Next, SOLiD, Solexa, 454, NIH Grant]
     Keywords: Next generation sequencing; Massive parallel sequencing; Ultra-deep sequencing; Pyro-sequencing

     The $1000 Genome Project
     NIH Genome Centers spend > $120 million/year on genome sequences
     • "$100,000 Genome": Raw measure of human genetic variation
     • "$10,000 Genome":
       - Sequencing of tumor genome collections
       - SNP and disease-associated mutations
       - Signs of natural selection within a population
     • "$1,000 Genome": Personal genome
     Feb 2004: NHGRI launched a grant application to develop next generation sequencing technologies
     Nat Rev Genet 5 (2004), pp. 335–344. Curr Opin Genet Dev. 2006 16(6):545-52.

  2. Traditional Genome Sequencing
     Based on the protocol used at JGI (http://www.jgi.doe.gov/)
     I. Library Preparation
        1. Shearing of DNA
        2. Insertion of Fragments into a Plasmid
        3. Transformation
        4. Subcloning of Sheared Fragment
        5. Colony Picking
     II. Sequencing
        6. Cell Lysing
        7. Rolling-Circle Amplification
        8. Capillary Sequencing
     III. Assembly and QA
        9. Assembly
        10. Quality Assessment

     Limitations of Traditional Sequencing
     PROBLEM #1: in vivo cloning
     Clonal Bias and Unclonable DNA
     • Hard Stops (hairpins, triple helices, stem loops, high GC content)
     • Polymerase (long stretches of PolyA)
     PROBLEM #2: timing and workload
     • 10,000 instrument days/human genome
     • Only affordable by genome centers
     PROBLEM #3: costs
     • Human Genome Reference Sequence (2001; 2003): $1 billion (99.995% accurate; 99% complete)
     • Estimated Cost for Individual Genome: $10 million (1 year using >30 instruments)

     Sequencing Approaches
     [Figure] Journal of Experimental Biology 210, 1518-1525 (2007)

  3. NGS on the Market
     Resolve inherent biases of in vivo cloning
     454 (Roche)
     - emulsion PCR
     - pyrosequencing
     - read lengths ca. 400 bp
     - first instrument in Oct 2005
     Solexa (Illumina)
     - PCR on solid support
     - reversible terminator sequencing
     - read lengths ca. 35 bp
     - first instrument in July 2006
     SOLiD (ABI)
     - emulsion PCR
     - sequencing by ligation
     - read lengths ca. 35 bp
     - first instrument in June 2007

     Library Preparation
     [Figures: 454, Solexa, SOLiD]
     Margulies et al. (2005) Nature 437 (376-380)
     Shendure et al. (2005) Science 309 (1728-1732)
     http://solid.appliedbiosystems.com

     Clonal Amplification
     454
     - Anneal sstDNA
     - Emulsion in water-in-oil microreactors
     - Clonal Amplification
     - Enrichment for DNA-positive beads
     Solexa
     - DNA attached to the surface
     - Bridge Amplification
     - Double Strand Denaturation
     - Repeat Cycles

  4. Clonal Amplification
     SOLiD
     - Emulsion PCR
     - Clonal Amplification
     - Enrichment for DNA-positive beads
     - Transfer on solid array

     Sequencing by Synthesis
     454 (Pyrosequencing)
     Reaction with:
     • DNA polymerase
     • ATP sulfurylase
     • luciferase
     • apyrase
     • APS
     • luciferin
     Addition of dNTPs one at a time
     Sequence Read
     Solexa (Reversible Terminators)
     Reaction with:
     • DNA polymerase
     • primers
     • 4 labelled reversible terminators
     Determine first base using laser light
     Wash off and repeat for all sequence

     Sequencing by Ligation
     SOLiD
     Reaction with:
     • Universal Primers
     • Ligase
     • Probes
     1st Cycle Sequencing
     - Probe Annealing
     - Ligation
     - Visualization
     - Washing off
     - Cleavage
     [x 5 times]
     5 base read
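The pyrosequencing slide above adds dNTPs one at a time. The light emitted in each flow is commonly described as roughly proportional to the number of bases incorporated, so a homopolymer shows up as one stronger signal rather than several separate ones. A minimal sketch of that idea, assuming an idealized noise-free signal and a hypothetical flow order `TACG` (neither is stated on the slide; the function names are illustrative only):

```python
def simulate_flowgram(read, flow_order="TACG"):
    """Return a list of (nucleotide, signal) pairs for an idealized run.

    `read` is the strand being synthesized. In each flow one nucleotide
    is offered; the signal is the number of consecutive bases
    incorporated (0 if the next base does not match).
    """
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not covered by the flow order: {sorted(unknown)}")
    flows = []
    pos = 0
    flow_index = 0
    while pos < len(read):
        nucleotide = flow_order[flow_index % len(flow_order)]
        run_length = 0
        # A homopolymer is incorporated in a single flow.
        while pos < len(read) and read[pos] == nucleotide:
            run_length += 1
            pos += 1
        flows.append((nucleotide, run_length))
        flow_index += 1
    return flows


def call_bases(flows):
    """Turn (nucleotide, signal) pairs back into a sequence.

    Signals are rounded to the nearest integer, which is where long
    homopolymers become error-prone on a real, noisy instrument.
    """
    return "".join(nucleotide * round(signal) for nucleotide, signal in flows)


if __name__ == "__main__":
    flows = simulate_flowgram("TTACGG")
    print(flows)
    print(call_bases(flows))
```

Because a run of identical bases is read from a single intensity value, small intensity errors translate into insertions or deletions in long homopolymers, which fits the lower accuracy the deck reports for 454 reads with homopolymers.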

  5. Sequencing by Ligation
     SOLiD
     Following Cycles
     - Reset
     - Annealing; ligation; washing; visualization; cleavage [x 5 times]
     - All of the above [x 5 times]

     Two Bases Encoding
     Capillary Electrophoresis: 1be
     Sequencing by Ligation: 2be
     • A single color does not indicate one single base
     • Each read contains information for 2 bases
     • To decode the bases you have to know one of them
     [Figure: Deconvolution matrix]
     [Figure: SNP Detection; panels: Real SNP, Miscall]

     Massive Parallelization
     Values are given as: present (in 6 months/1 year)
     454
     Sequencing Reaction within the PicoTiterPlate Device
     • 1.6 million wells/plate
     • ~420 kread/run (1.2 Mread/run)
     Solexa
     Sequencing Reaction on planar, optically transparent surface
     • > 10 million clusters
     • ~50 Mread/run (220 Mread/run)
     SOLiD
     • ~20 million beads (1 µm diameter)
     • ~95 Mread/run (220 Mread/run)
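The two-base encoding bullets above can be made concrete with a short sketch. It assumes the commonly described SOLiD color scheme, in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3); the slide's own deconvolution matrix is not reproduced in the text, so this mapping and the function names are assumptions:

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3 (assumed mapping)


def encode(sequence):
    """Return (first_base, colors); each color covers two adjacent bases."""
    codes = [BASES.index(base) for base in sequence]
    colors = [left ^ right for left, right in zip(codes, codes[1:])]
    return sequence[0], colors


def decode(first_base, colors):
    """Rebuild the bases; this needs one known base, as the slide states."""
    code = BASES.index(first_base)
    decoded = [first_base]
    for color in colors:
        code ^= color  # next base = previous base combined with the color
        decoded.append(BASES[code])
    return "".join(decoded)


def changed_positions(colors_a, colors_b):
    """Indices at which two equally long color reads differ."""
    return [i for i, (a, b) in enumerate(zip(colors_a, colors_b)) if a != b]


if __name__ == "__main__":
    _, reference_colors = encode("ATGGCA")
    _, snp_colors = encode("ATGCCA")  # single-base change G -> C
    # A real SNP changes two adjacent colors.
    print(changed_positions(reference_colors, snp_colors))
    # A single changed color (miscall) alters every base after it.
    miscalled = list(reference_colors)
    miscalled[3] = 2
    print(decode("A", miscalled))
```

Under this scheme a real single-base change alters two adjacent colors, whereas an isolated color change is more likely a measurement error. That is the distinction the "Real SNP" versus "Miscall" panels refer to.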

  6. Timing and Throughput
     Values are given as: today (in 6 months-1 year)

     Platform     Throughput                     Timing
     Sanger       96 cap: 76.8 kbp/run           10 runs/day
                  384 cap: 0.3 Mbp/run
     454 [1]      100 Mbp/run (450 Mbp-1 Gbp)    21 h
     Solexa [2]   1.5 Gbp/run (5-10 Gbp)         4 days (single fragment)
                  3 Gbp/run (10-20 Gbp)          4.5 days (paired-end)
     SOLiD [3]    6 Gbp/run (17 Gbp)             3 days (single fragment)
                  10 Gbp/run (26 Gbp)            6 days (paired-end)

     [1] Roche-Italy, pers. comm.
     [2] Illumina, pers. comm.
     [3] AppliedBS, pers. comm.
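The per-run figures on the Timing and Throughput slide are hard to compare directly because run times differ. A rough sketch that normalizes the "today" values to base pairs per day, assuming runs are started back to back with no setup time and that the Sanger rate of 10 runs/day applies to both capillary formats (both are simplifications, not statements from the slide):

```python
# (bp per run, runs per day), taken from the "today" values on the slide.
RUNS = {
    "Sanger 96 cap": (76.8e3, 10.0),
    "Sanger 384 cap": (0.3e6, 10.0),
    "454": (100e6, 24.0 / 21.0),              # one run takes 21 h
    "Solexa single fragment": (1.5e9, 1.0 / 4.0),
    "Solexa paired-end": (3e9, 1.0 / 4.5),
    "SOLiD single fragment": (6e9, 1.0 / 3.0),
    "SOLiD paired-end": (10e9, 1.0 / 6.0),
}


def bp_per_day(bp_per_run, runs_per_day):
    """Daily output of one instrument under the back-to-back assumption."""
    return bp_per_run * runs_per_day


def daily_throughput(runs=RUNS):
    """Map each platform name to its estimated base pairs per day."""
    return {name: bp_per_day(*values) for name, values in runs.items()}


if __name__ == "__main__":
    for name, value in daily_throughput().items():
        print(f"{name:24s} {value / 1e6:10.2f} Mbp/day")
```

On these assumptions a 96-capillary Sanger instrument produces well under 1 Mbp per day, while each of the three NGS platforms produces more than 100 Mbp per day.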

  7. Sequencing Costs

     Platform     Costs/run           Costs/kbp
     Sanger       x96, x384           $1 (raw kbp)
                                      $7 (consensus kbp, error = 4x10^-6) [2]
     454 [1]      €7,195 (100 Mb)     ~ €0.07/kbp
                                      ~ $9 (consensus kbp, error = 4x10^-5) [2]
     Solexa [3]   €3,000 (1.5 Gbp)    ~ €0.002/kbp
     SOLiD [4]    €2,478 (6 Gbp)      ~ €0.0004/kbp

     [1] Roche-Italy, pers. comm.
     [2] G. Church, Nat Biotec 24, 139 (2006)
     [3] Illumina, pers. comm.
     [4] AppliedBS, pers. comm.

     Limitations of NGS
     PROBLEM #1: length of sequencing reads
     • Much shorter than Sanger
     PROBLEM #2: (huge) amount of data production
     • Difficult data handling and analysis
     PROBLEM #3: sequencing accuracy
     • No reliable standard available yet
     • Difficult to compare different methods

     Length of Sequencing Reads

     Platform     Read Length (bp)
     Sanger       450-850
     454          250 (450)
     Solexa       35 (75)
     SOLiD        25-35 (50)

     - De novo Sequencing: difficult assembly
     - Resequencing (454): overlapping amplicons needed
     - Metagenomics: difficult assignment
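The raw cost-per-kbp figures on the Sequencing Costs slide follow from dividing the cost of a run by its yield. A small sketch of that arithmetic, using only the per-run costs and yields quoted on the slide (the function and dictionary names are illustrative):

```python
# (cost per run in EUR, yield per run in bp), as quoted on the slide.
RUN_COSTS = {
    "454": (7195.0, 100e6),
    "Solexa": (3000.0, 1.5e9),
    "SOLiD": (2478.0, 6e9),
}


def cost_per_kbp(cost_per_run, bp_per_run):
    """Raw cost of 1,000 sequenced bases, in the currency of cost_per_run."""
    return cost_per_run / (bp_per_run / 1e3)


if __name__ == "__main__":
    for platform, (cost, yield_bp) in RUN_COSTS.items():
        print(f"{platform:8s} ~ EUR {cost_per_kbp(cost, yield_bp):.4f}/kbp")
```

These are raw costs per run only; the consensus costs on the slide additionally depend on the coverage needed to reach a given error rate.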

  8. Paired-End Sequencing
     [Figure: tag1 - insert - tag2]

     Platform     Insert Length
     454          1.5-3 kbp (16 kbp)
     Solexa       200-300 bp (2 kbp)
     SOLiD        600-10,000 bp

     - Increase the Read Length
     - Help in Assembly Reconstruction
     - Find Structural Variation (CNV, Rearrangements, etc.)

     (Huge) Amount of Data Production
     454          1 Gb (10-15 Gb after sequencing)
     Solexa       10-20 Gb (0.5-2 Tb after sequencing)
     SOLiD        40 Gb (7-8 Tb after sequencing)
     Need ad hoc tool development for data analysis

     Sequencing Accuracy
     Difficult to compare because based on different technologies

     Platform     Raw                          Consensus
     Sanger       99.5%                        99.995% (10x)
     454 [1]      99.5% (no homopolymers)      99.99%
                  97.0% (with homopolymers)    99.96%
     Solexa [3]   98.5%                        99.8% - 99.999% (15x)
     SOLiD [4]    99.94%

     Homopolymers: n>7; 0.7% of the human genome [2]
     SOLiD: in principle, outstanding accuracy for SNP detection

     [1] Roche-Italy, pers. comm.
     [2] Nat Med 12, 852-855 (2006)
     [3] Illumina, pers. comm.
     [4] AppliedBS, pers. comm.
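The Paired-End Sequencing slide lists structural variation as one use of paired tags. The usual idea is to map both tags to a reference and compare their distance with the expected insert length. A minimal sketch, assuming hypothetical mapped coordinates and the Solexa insert range of 200-300 bp from the slide (the `MatePair` type, the labels, and the thresholds are illustrative, not part of any specific tool):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatePair:
    """Mapped positions of the two tags of one paired-end read."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


def classify(pair, min_insert, max_insert):
    """Label a pair by comparing its mapped span with the expected insert."""
    if pair.chrom1 != pair.chrom2:
        return "different_chromosome"  # candidate rearrangement
    span = abs(pair.pos2 - pair.pos1)
    if span < min_insert:
        return "too_close"  # candidate insertion in the sample
    if span > max_insert:
        return "too_far"    # candidate deletion in the sample
    return "concordant"


if __name__ == "__main__":
    pairs = [
        MatePair("chr1", 1000, "chr1", 1250),
        MatePair("chr1", 1000, "chr1", 5000),
        MatePair("chr1", 1000, "chr7", 1200),
    ]
    for pair in pairs:
        print(pair, classify(pair, 200, 300))
```

A single discordant pair is weak evidence; in practice several independent pairs pointing to the same location would be expected before a variant is called.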
