DNA Assembly and Finishing
Latin American Course on Bioinformatics for Tropical Disease Research
São Paulo – February 17th to March 2nd 2002
Arthur Gruber
Faculty of Veterinary Medicine and Zootechny
University of São Paulo, BRAZIL
AG-FMVZ-USP

Why assemble?
• Current DNA sequencing methods generate reads of 500-700 bp – resolution limit of electrophoresis
• Whole genomes or large clones need to be fragmented - clone library
• Short fragments are randomly sequenced (shotgun approach) – reads are assembled to form the final consensus sequence
[Diagram] Whole genome or BAC/cosmid clone → DNA fragmentation (sonic disruption, nebulization) → small fragments, 1.0 - 2.0 kb → clone library (pUC18) → DNA sequencing of random clones → partial assembly (contigs) → finishing (quality, coverage of both strands, gap filling) → final consensus sequence

Shotgun Sequencing I – random phase
[Figure] BAC clone: 100-200 kb → Sheared DNA: 1.0-2.0 kb → Sequencing Templates → Random Reads
Modified from BCM-HGSC

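The random phase can be illustrated with a small simulation. This is only a sketch under stated assumptions: reads are placed uniformly at random along the clone, every read has the same length, and the numbers used (a 150 kb clone, 600 bp reads, the read counts, the seed) are illustrative values chosen within the ranges given on the slides, not values from the course. The function names are hypothetical.

```python
import random


def simulate_shotgun(clone_len, n_reads, read_len, seed=0):
    """Place reads at uniformly random start positions; return per-base depth."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    depth = [0] * clone_len
    for _ in range(n_reads):
        # Every read lies entirely inside the clone (a simplification).
        start = rng.randrange(0, clone_len - read_len + 1)
        for pos in range(start, start + read_len):
            depth[pos] += 1
    return depth


def count_gaps(depth):
    """Count maximal runs of consecutive positions with zero depth."""
    gaps = 0
    in_gap = False
    for d in depth:
        if d == 0 and not in_gap:
            gaps += 1
        in_gap = (d == 0)
    return gaps


if __name__ == "__main__":
    clone_len = 150_000  # assumed BAC size, within the 100-200 kb range
    read_len = 600       # assumed read length, within the 500-700 bp range
    for n_reads in (250, 1000, 2000):
        depth = simulate_shotgun(clone_len, n_reads, read_len)
        mean_depth = sum(depth) / clone_len
        print(f"reads={n_reads} mean depth={mean_depth:.1f} "
              f"uncovered bases={depth.count(0)} gaps={count_gaps(depth)}")
```

Because placement is random, some positions may stay uncovered even at several-fold mean depth, which is why a directed finishing phase follows the random phase.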
Shotgun Sequencing II - assembly
[Figure labels] Consensus Sequence; Low Base Quality; Single Stranded Region; Mis-Assembly (Inverted); Gap
Modified from BCM-HGSC

Shotgun Sequencing III - finishing
[Figure, shown in five successive steps; labels remaining at each step]
1. Consensus Sequence; Low Base Quality; Single Stranded Region; Mis-Assembly (Inverted); Gap
2. Consensus Sequence; Single Stranded Region; Mis-Assembly (Inverted); Gap
3. Consensus Sequence; Mis-Assembly (Inverted); Gap
4. Consensus Sequence; Gap
5. Consensus - High Accuracy Sequence: < 1 error / 10,000 bases
Modified from BCM-HGSC

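Three of the problem classes named on the assembly and finishing slides (gap, low base quality, single stranded region) can be located from where the reads sit on the consensus. The sketch below is an illustration only, not how Consed or Autofinish work internally. It assumes each read is described by a hypothetical tuple (start, end, strand, quality) with a single quality value per read, and a hypothetical threshold of 20. Mis-assemblies are not detected here.

```python
def finishing_targets(length, reads, min_quality=20):
    """Label each consensus position as 'gap', 'low quality',
    'single stranded' or 'ok'.

    reads: list of (start, end, strand, quality); end is exclusive,
    strand is '+' or '-', quality is one value for the whole read
    (a simplification: real reads carry one quality value per base).
    """
    fwd = [0] * length   # forward-strand depth
    rev = [0] * length   # reverse-strand depth
    best = [0] * length  # best quality seen at each position
    for start, end, strand, quality in reads:
        for pos in range(start, end):
            if strand == "+":
                fwd[pos] += 1
            else:
                rev[pos] += 1
            best[pos] = max(best[pos], quality)

    labels = []
    for pos in range(length):
        if fwd[pos] + rev[pos] == 0:
            labels.append("gap")
        elif best[pos] < min_quality:
            labels.append("low quality")
        elif fwd[pos] == 0 or rev[pos] == 0:
            labels.append("single stranded")
        else:
            labels.append("ok")
    return labels


if __name__ == "__main__":
    # Three made-up reads on a 10-base consensus.
    example = [(0, 4, "+", 30), (2, 6, "-", 30), (7, 10, "+", 10)]
    for pos, label in enumerate(finishing_targets(10, example)):
        print(pos, label)
```

Each label corresponds to a different finishing action: gaps need new sequence, low quality regions need better reads, and single stranded regions need a read from the opposite strand.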
How to deal with the enormous number of reads generated by the high throughput DNA sequencers?
[Photo] Sanger Centre

Phred/Phrap/Consed Package
Phred/Phrap/Consed is a worldwide distributed package for:
a. Trace file (chromatogram) reading;
b. Quality (confidence) assignment to each individual base;
c. Vector and repeat sequence identification and masking;
d. Sequence assembly and error probability assignment to the consensus sequence;
e. Assembly viewing and editing;
f. Automatic finishing.

Phred/Phrap/Consed Pipeline
Input: chromatogram files
Quality (confidence) values assignment: Phred
  phd files - *.phd
Conversion - phd to fasta: phd2fasta.pl
  nucleotide sequences - seq.fasta
  quality values - seq.fasta.qual
Vector screening and masking: Cross_Match (local alignment program) x vector.seq
  screened/masked file - seq.fasta.screen
  quality values - seq.fasta.screen.qual
Assembly: Phrap
  assembled contigs - seq.fasta.screen.contigs
  assembly file - seq.fasta.screen.ace#
Assembly viewing/editing: Consed
Finishing: Autofinish + manual finishing
Directories: Chromat_dir, Phd_dir, Edit_dir

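The pipeline carries sequences and quality values in paired files (seq.fasta and seq.fasta.qual). The sketch below shows one way such a pair could be read back together. It assumes the usual layout: both files are FASTA-style, the first word after ">" is the read name, and the quality file holds whitespace-separated integers, one per base. The function names and the example read are hypothetical.

```python
def parse_fasta_like(text):
    """Parse FASTA-style text into {name: [body lines]}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""  # first word is the read name
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return records


def load_reads(fasta_text, qual_text):
    """Pair each sequence with its quality values, checking the lengths."""
    seqs = {n: "".join(body) for n, body in parse_fasta_like(fasta_text).items()}
    quals = {
        n: [int(q) for line in body for q in line.split()]
        for n, body in parse_fasta_like(qual_text).items()
    }
    reads = {}
    for name, seq in seqs.items():
        q = quals.get(name)
        if q is None or len(q) != len(seq):
            raise ValueError(f"sequence and quality values disagree for {name!r}")
        reads[name] = (seq, q)
    return reads


if __name__ == "__main__":
    # Made-up read; real files would be read from disk instead.
    fasta = ">read1 example\nACGTAC\nGT\n"
    qual = ">read1 example\n10 20 30 40 40 30\n20 10\n"
    for name, (seq, q) in load_reads(fasta, qual).items():
        print(name, seq, q)
```

Checking that every base has exactly one quality value is a cheap way to catch a sequence file and a quality file that have drifted out of step between pipeline stages.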
Phred
Genome Research 8: 175-185, 1998
Genome Research 8: 186-194, 1998

Phred
Phred is a program that performs several tasks:
a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases – attributes a base to each identified peak with a lower error rate than the standard base calling programs.
c. Assigns quality values to the bases – a "Phred value" based on an error rate estimation calculated for each individual base.
d. Creates output files – base calls and quality values are written to output files.

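The link between a Phred value and the estimated error rate is the standard definition Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The sketch below only restates that formula; the function names are hypothetical.

```python
import math


def error_probability_to_phred(p):
    """Phred value for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def phred_to_error_probability(q):
    """Estimated error probability for a Phred value q."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        p = phred_to_error_probability(q)
        print(f"Q{q}: about 1 error in {round(1 / p)} bases")
```

On this scale, an error rate of 1 in 10,000 bases corresponds to a Phred value of 40.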
Trace File
High quality read:
- no ambiguities (Ns)
- no noise
- peaks very well spaced

Trace File
Good quality read:
- no ambiguities (Ns)
- some noise (notice baseline)
- peaks very well spaced

Trace File
Poor quality read:
- some ambiguities (Ns)
- bad noise (notice baseline)
- overlapping peaks
- can be caused by bad quality template, bad matrix, low signal-to-noise ratio

Trace File
Poor quality read:
- many ambiguities (Ns)
- noise
- caused by homopolymeric region / polymerase slippage

Trace File
Sudden drop artifact:
- good quality region is followed by a sudden drop of signal
- caused by secondary structure

Trace File
High quality region:
- no ambiguities (Ns)
- no noise
- peaks very well spaced

Trace File
Medium quality region:
- some ambiguities (Ns)
- no noise
- peaks very well spaced
- some homopolymeric stretches are not well resolved

Trace File
Poor quality region - diffusion effects and decrease in the relative mass difference between the sequence products:
- overlapping peaks, peaks not evenly spaced
- low resolution
- low confidence in base assignment

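Once each base has a quality value, the high, medium and poor quality regions of a read can be told apart numerically instead of by eye. The sketch below finds the longest run of consecutive bases at or above a threshold. It is a deliberately simple illustration, not the trimming method used by Phred or Phrap, and both the threshold of 20 and the example values are assumptions.

```python
def longest_high_quality_run(quals, threshold=20):
    """Return (start, end) of the longest run with quality >= threshold.

    end is exclusive; (0, 0) is returned when no base passes the threshold.
    """
    best = (0, 0)
    start = None  # start of the current run, or None when outside a run
    for i, q in enumerate(quals):
        if q >= threshold:
            if start is None:
                start = i
            if i + 1 - start > best[1] - best[0]:
                best = (start, i + 1)
        else:
            start = None
    return best


if __name__ == "__main__":
    # Made-up quality values: poor start, good middle, poor end.
    quals = [5, 8, 25, 30, 40, 42, 38, 12, 22, 9, 6]
    start, end = longest_high_quality_run(quals)
    print(f"high quality region: bases {start} to {end - 1}")
```

A single low value ends a run in this version; a windowed average would tolerate isolated dips, at the cost of a slightly longer function.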