DNA Assembly and Finishing
Latin American Course on Bioinformatics for Tropical Disease Research
São Paulo – February 17th to March 2nd 2002
Arthur Gruber
Faculty of Veterinary Medicine and Zootechny
University of São Paulo
BRAZIL
AG-FMVZ-USP
Why assemble?
• Current DNA sequencing methods generate reads of 500-700 bp – resolution limit of electrophoresis
• Whole genomes or large clones need to be fragmented - clone library
• Short fragments are randomly sequenced (shotgun approach) – reads are assembled to form the final consensus
Workflow (diagram):
• Whole genome or BAC/cosmid clone
• DNA fragmentation (sonic disruption, nebulization): small fragments of 1.0-2.0 kb
• Clone library (pUC18)
• DNA sequencing of random clones
• Partial assembly: contigs
• Finishing: quality, coverage of both strands, gap filling
• Final consensus sequence
Shotgun Sequencing I – random phase
BAC clone: 100-200 kb
Sheared DNA: 1.0-2.0 kb
Random sequencing of templates yields reads
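The random phase can be illustrated with a small simulation. The sketch below is a deliberate simplification: it assumes reads start uniformly at random along the clone and all have the same length. The clone length, read length, and number of reads are hypothetical values chosen within the ranges quoted in these slides.

```python
import random

def simulate_shotgun(clone_len, read_len, n_reads, seed=0):
    """Place reads uniformly at random on a clone; return per-base coverage."""
    rng = random.Random(seed)
    coverage = [0] * clone_len
    for _ in range(n_reads):
        # Every read lies entirely inside the clone (simplifying assumption).
        start = rng.randrange(0, clone_len - read_len + 1)
        for pos in range(start, start + read_len):
            coverage[pos] += 1
    return coverage

clone_len = 150_000  # hypothetical BAC clone, within the 100-200 kb range
read_len = 600       # hypothetical read length, within 500-700 bp
n_reads = 2_000      # hypothetical number of reads

coverage = simulate_shotgun(clone_len, read_len, n_reads)
mean_cov = sum(coverage) / clone_len            # n_reads * read_len / clone_len
uncovered = sum(1 for c in coverage if c == 0)  # bases no read happened to hit
print(f"mean coverage: {mean_cov:.1f}x, uncovered bases: {uncovered}")
```

The mean coverage is fixed by the arithmetic (reads × read length / clone length), but where the reads land is random. That is why some regions can remain poorly covered even at a high average, and why a finishing phase is needed.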
Shotgun Sequencing II - assembly
Problems in the assembled consensus sequence (diagram labels):
• Single-stranded region
• Low base quality
• Mis-assembly (inverted)
• Gap
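Two of the problems labelled in this diagram, gaps and single-stranded regions, can be detected from read placements alone. The sketch below is a toy illustration, not how Phrap or Consed work internally. It assumes each read is given as a hypothetical (start, end, strand) tuple on a common coordinate system, with end exclusive.

```python
def classify_positions(length, reads):
    """Label each position as 'gap', 'single-stranded' or 'ok'.

    reads: list of (start, end, strand), end exclusive, strand '+' or '-'.
    """
    fwd = [0] * length
    rev = [0] * length
    for start, end, strand in reads:
        target = fwd if strand == "+" else rev
        for pos in range(start, end):
            target[pos] += 1
    labels = []
    for f, r in zip(fwd, rev):
        if f == 0 and r == 0:
            labels.append("gap")              # no read covers this position
        elif f == 0 or r == 0:
            labels.append("single-stranded")  # covered on one strand only
        else:
            labels.append("ok")               # covered on both strands
    return labels

def runs(labels):
    """Collapse per-position labels into (label, start, end) runs, end exclusive."""
    out = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            out.append((labels[start], start, i))
            start = i
    return out

# Hypothetical read placements on a 30-base target.
reads = [(0, 10, "+"), (5, 15, "-"), (20, 30, "+")]
regions = runs(classify_positions(30, reads))
for label, start, end in regions:
    print(f"{start:>3}-{end:<3} {label}")
```

Low base quality and mis-assemblies need additional information (quality values and read-pair or alignment consistency) and are not covered by this sketch.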
Shotgun Sequencing III - finishing
Problems in the consensus sequence are resolved one at a time (diagram sequence):
• Low base quality
• Single-stranded region
• Mis-assembly (inverted)
• Gap
Result: high-accuracy consensus sequence, < 1 error/10,000 bases
How to deal with the enormous number of reads generated by high-throughput DNA sequencers?
Sanger Centre
Phred/Phrap/Consed Package
Phred/Phrap/Consed is a package distributed worldwide for:
a. Trace file (chromatogram) reading;
b. Quality (confidence) assignment to each individual base;
c. Identification and masking of vector and repeat sequences;
d. Sequence assembly and error probability assignment to the consensus sequence;
e. Assembly viewing and editing;
f. Automatic finishing.
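Item c in the list above, vector masking, can be illustrated with a toy example. The sketch below masks only exact occurrences of a known vector fragment. A local alignment program tolerates mismatches and partial matches, so this is a simplified stand-in rather than the package's method. The function name, the sequences, and the choice of X as the mask character are assumptions made for illustration.

```python
def mask_vector(read, vector_fragment, mask_char="X"):
    """Replace every exact occurrence of vector_fragment in read with mask_char."""
    if not vector_fragment:
        raise ValueError("vector_fragment must not be empty")
    # Same-length replacement keeps base positions aligned with quality values.
    return read.replace(vector_fragment, mask_char * len(vector_fragment))

read = "GGATCCACGTACGTTTGAC"   # hypothetical read starting with vector sequence
vector_fragment = "GGATCC"     # hypothetical vector fragment
masked = mask_vector(read, vector_fragment)
print(masked)
```

Masking rather than deleting the vector bases keeps the read length unchanged, so each base still lines up with its quality value.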
Phred/Phrap/Consed Pipeline
Input: chromatogram files
Phred - quality (confidence) values assignment
  phd files - *.phd
phd2fasta.pl - conversion, phd to fasta
  nucleotide sequences - seq.fasta
  quality values - seq.fasta.qual
Cross_Match (local alignment program) x vector.seq - vector screening and masking
  screened/masked file - seq.fasta.screen
  quality values - seq.fasta.screen.qual
Phrap - assembly
  assembled contigs - seq.fasta.screen.contigs
  assembly file - seq.fasta.screen.ace#
Consed - assembly viewing/editing
Finishing - Autofinish + manual finishing
Directories: Chromat_dir, Phd_dir, Edit_dir
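The file names in this pipeline follow a simple suffix convention: each stage appends an extension to the output of the previous one. The sketch below only derives the names listed on the slide from a base name. It does not run any of the programs, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineFiles:
    fasta: str        # nucleotide sequences (phd2fasta output)
    qual: str         # quality values for the sequences
    screen: str       # vector-screened/masked sequences
    screen_qual: str  # quality values for the screened file
    contigs: str      # assembled contigs
    ace_prefix: str   # assembly file name, before the trailing suffix on the slide

def pipeline_files(base="seq.fasta"):
    """Derive the per-stage file names from the base FASTA name."""
    screen = base + ".screen"
    return PipelineFiles(
        fasta=base,
        qual=base + ".qual",
        screen=screen,
        screen_qual=screen + ".qual",
        contigs=screen + ".contigs",
        ace_prefix=screen + ".ace",
    )

files = pipeline_files()
for name, path in vars(files).items():
    print(f"{name:12s} {path}")
```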
Phred
Genome Research 8 : 175-185, 1998
Genome Research 8 : 186-194, 1998
Phred
Phred is a program that performs several tasks:
a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases – assigns a base to each identified peak, with a lower error rate than standard base-calling programs.
c. Assigns quality values to the bases – a "Phred value" based on an error probability estimated for each individual base.
d. Creates output files – base calls and quality values are written to output files.
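The quality values in item c are conventionally expressed on a logarithmic scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. A minimal Python sketch of the conversion follows. The function names are illustrative and are not part of Phred.

```python
import math

def error_prob_to_quality(p):
    """Convert an error probability (0 < p <= 1) to a Phred-scaled quality."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -10.0 * math.log10(p)

def quality_to_error_prob(q):
    """Convert a Phred-scaled quality to the corresponding error probability."""
    return 10.0 ** (-q / 10.0)

for q in (10, 20, 30, 40):
    p = quality_to_error_prob(q)
    print(f"Q{q}: 1 error in {round(1 / p)} bases")
```

On this scale, the finishing target of less than 1 error per 10,000 bases corresponds to a quality of 40.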
Trace File
High quality read:
- no ambiguities (Ns)
- no noise
- peaks very well spaced
Trace File
Good quality read:
- no ambiguities (Ns)
- some noise (notice baseline)
- peaks very well spaced
Trace File
Poor quality read:
- some ambiguities (Ns)
- bad noise (notice baseline)
- overlapping peaks
- can be caused by a poor-quality template, a bad matrix, or a low signal-to-noise ratio
Trace File
Poor quality read:
- many ambiguities (Ns)
- noise
- caused by a homopolymeric region / polymerase slippage
Trace File
Sudden drop artifact:
- good quality region is followed by a sudden drop of signal
- caused by secondary structure
Trace File
High quality region:
- no ambiguities (Ns)
- no noise
- peaks very well spaced
Trace File
Medium quality region:
- some ambiguities (Ns)
- no noise
- peaks very well spaced
- some homopolymeric stretches are not well resolved
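Homopolymeric stretches like those mentioned above are easy to locate in a called sequence, which helps when deciding where to inspect a trace by eye. The sketch below is illustrative only. The minimum run length of 5 and the example sequence are arbitrary assumptions.

```python
from itertools import groupby

def homopolymer_runs(seq, min_len=5):
    """Return (base, start, length) for each run of identical bases >= min_len."""
    found = []
    pos = 0
    for base, group in groupby(seq):
        n = len(list(group))
        if n >= min_len:
            found.append((base, pos, n))
        pos += n
    return found

# Hypothetical sequence with a 7-base T run and a 5-base A run.
seq = "ACG" + "T" * 7 + "GC" + "A" * 5 + "C"
hp_runs = homopolymer_runs(seq)
print(hp_runs)
```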
Trace File
Poor quality region - caused by diffusion effects and a decrease in the relative mass difference between the sequencing products:
- overlapping peaks, peaks not evenly spaced
- low resolution
- low confidence in base assignment
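Because quality varies along a read, as the trace examples show, a common practical step is to keep only the high-quality part. The sketch below finds the longest contiguous run of bases at or above a quality threshold. This is a simple illustrative rule, not Phred's own trimming procedure, and the threshold of 20 and the quality values are assumptions.

```python
def longest_high_quality_run(quals, threshold=20):
    """Return (start, end) of the longest run with quality >= threshold, end exclusive."""
    best_start, best_end = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= threshold:
            if run_start is None:
                run_start = i  # a new high-quality run begins here
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None   # run broken by a low-quality base
    return best_start, best_end

# Hypothetical per-base quality values: noisy start, good middle, poor end.
quals = [8, 12, 25, 30, 40, 38, 35, 15, 22, 9, 6]
start, end = longest_high_quality_run(quals)
print(f"keep bases {start}..{end - 1} ({end - start} bases)")
```

An empty result, (0, 0), means that no base reached the threshold.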
