short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014 1
Genomathica Assembler • Mathematica notebook for genome assembly simulation • Assembler can be found at: http://cs.brown.edu/courses/csci1820/software/minimal_asse mbler.nb • Sample FASTA genome phix174.fasta can be found in HW5 Biology: http://cs.brown.edu/courses/csci1820/software/phix174.fasta • Remember to – Change the input genome to your FASTA file ’ s location – Evaluate each cell initially, then you only need to evaluate the last two cells to re-run the assembly, and display the results respectively – Mathematica can be downloaded here: http://www.brown.edu/information-technology/software/ 2
coverage = 1 • Sequence reads are in black • Contiguous strings of assembled DNA ( contigs ) are in red
coverage = 2 • Sequence reads are in black • Contiguous strings of assembled DNA (contigs) are in red
coverage = 3 • Sequence reads are in black • Contiguous strings of assembled DNA (contigs) are in red
coverage = 4 • Sequence reads are in black • Contiguous strings of assembled DNA (contigs) are in red
coverage = 5 • Sequence reads are in black • Contiguous strings of assembled DNA (contigs) are in red
coverage = 2, paired ends
Sample prep Sequence data Raw Sequence Reads • wet-lab experimental methods to isolate, prepare, and sequence the DNA • results in a number of large FASTQ files • FASTQC can be used to check basic statistics of the files – http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ • many tools available for QC – e.g. http://hannonlab.cshl.edu/fastx_toolkit/
Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcosts. Accessed April 2013.
http://www.ncbi.nlm.nih.gov/Traces/sra/
Genome Assembly Software • Overlap-layout-consensus • Celera: http://wgs-assembler.sourceforge.net/ • K-mer based • Velvet: http://www.ebi.ac.uk/~zerbino/velvet/ • SOAP-denovo: http://soap.genomics.org.cn/soapdenovo.html • ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/ • IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/
Two graph models • A first graph model – Nodes (vertices) are contiguous sequences of k characters (k-mer) – Directed edge from v i to v j if v i [2..k]=v j [1..k-1] A C G T T C ACG CGT GTT TTC
Two graph models • De-bruijn graph – Nodes (vertices) are contiguous sequences of k-1 characters (k-1-mer) – Directed edge from v i to v j if v i [1..k-1]+v j [k-1] are a valid k-mer A C G T T C ACG CGT GTT TTC AC CG GT TT TC
Note edges that are not reflected in the input! Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly
Genome Assembly • Building the k-mer graph – nodes as k-mers, edges (k-1) overlap 17
Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT k=4 1 1 ACGT CGTA GACG k=3 1 1 1 ACG CGT GTA GAC
Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT k=4 1 1 1 1 TACG GTAC ACGT CGTA GACG k=3 1 1 2 1 ACG CGT GTA TAC GAC 1
Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT CGTT k=4 1 1 1 1 1 TACG GTAC ACGT CGTA GACG GTT k=3 1 1 2 2 1 ACG CGT GTA TAC GAC 2
Genome Assembly • Building the k-mer graph – nodes as k-mers, edges (k-1) overlap – nodes as (k-1)-mers, edges form k-mers 21
Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT k=4 1 1 1 ACG CGT GTA GAC k=3 1 1 1 1 AC CG GT TA GA
Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT k=4 1 1 2 1 ACG CGT GTA TAC GAC 1 k=3 1 3 2 2 AC CG GT TA GA 1
Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT GTT k=4 1 1 2 2 1 ACG CGT GTA TAC GAC 2 1 TT GT k=3 1 1 4 2 2 AC CG GT TA GA 2
Genome Assembly • Building the k-mer graph – G(k): nodes as k-mers, edges (k-1) overlap – H(k): nodes as (k-1)-mers, edges form k-mers • H(k)=G(k-1) – So it really does not matter which you choose to implement • Where does the complexity come from? – Sequencing errors, repeats, uneven coverage, contamination from other organisms, ploidy, unsequenced regions 25
Popping bubbles Error occurs in the middle of a read and is propagated to many k-mers.
Trimming tips Error creates an erroneous ending k-mer
Chimeric extensions Errors connect two nodes in the graph which do not correspond to a valid extension in the genome sequence Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly
Repetitive regions • Satellites, SINEs, LINEs • Homologous Genes – Ortholog: descended from the same ancestral sequence and separated by speciation – Paralog: genes created by a duplication event 29
30 Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly
Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly
Velvet assembler • Four stages – Hashing reads into k-mers – Constructing the de Bruijn graph (not all 4^k k- mers, only those that exist in input) – Correct errors – Resolve repeats • But what after? – Paper gives very little information on this... 32
The Chinese postman problem (CPP) • Compute a closed tour of minimum length that visits each edge at least once – Similar to what we want except we may want to visit edges more than once due to repeats • How do we deal with repeats? – Also, the starting and ending vertices are distinct in genome assembly • How can we convert the closed tour to an open one? 33
Your homework • You are not required to implement section 4 of http://web.eecs.umich.edu/~pettie/matching /Edmonds-Johnson-chinese-postman.pdf • You are not even required to model genome assembly as CPP • But you do have to build the k-mer graph, correct errors, resolve repeats, and compute a CPP or Eulerian-like tour. 34
Evaluating assembly • The Assemblathon2 study lists 102 measures for evaluating assembly quality. • Bradnam et al. ( 2013) Assemblathon 2: evaluating de novo methods fo genome assembly in three vertebrate species 1. NG50 scaffold length: a length x where all scaffolds of length x or longer consists of at least 50% of the genome size 2. NG50 contig length: a length x where all contigs of length x or longer consists of at least 50% of the genome size 3. Amount of gene-sized scaffolds (>25 kbp). Useful for gene finding. 4. CEGMA: Number of 458 core genes mapped
Evaluating assembly 5. Fosmid coverage: How many validated fosmid regions were captured in assembly 6. Fosmid validity: Percentage of assembly validated by validated fosmid regions 7. Validated fosmid region tag scaffold summary score: number of validated fosmid region tag pairs that match the same scaffold multiplied by the percentage of uniquely mapping tag pairs that map with correct distance. Rewards short-range accuracy. 8. and 9. Using local and global alignments of optimal map data, how well the assembly is ordered. 10. REAPR summary score: a tool that evalutes accuracy of assembly using paired reads
Recommend
More recommend