short read genome assembly
play

short read genome assembly Sorin Istrail CSCI1820 Short-read genome - PowerPoint PPT Presentation

short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014 1 Genomathica Assembler Mathematica notebook for genome assembly simulation Assembler can be found at:


  1. short read genome assembly Sorin Istrail CSCI1820 Short-read genome assembly algorithms 3/6/2014 1

  2. Genomathica Assembler • Mathematica notebook for genome assembly simulation • Assembler can be found at: http://cs.brown.edu/courses/csci1820/software/minimal_asse mbler.nb • Sample FASTA genome phix174.fasta can be found in HW5 Biology: http://cs.brown.edu/courses/csci1820/software/phix174.fasta • Remember to – Change the input genome to your FASTA file ’ s location – Evaluate each cell initially, then you only need to evaluate the last two cells to re-run the assembly, and display the results respectively – Mathematica can be downloaded here: http://www.brown.edu/information-technology/software/ 2

  3. coverage = 1 • Sequence reads are in black • Contiguous strings of assembled DNA ( contigs ) are in red

  4. coverage = 2 • Sequence reads are in black • Contiguous strings of assembled DNA (contigs) are in red

  5. coverage = 3 • Sequence reads are in black • Contiguous strings of assembled DNA (contigs) are in red

  6. coverage = 4 • Sequence reads are in black • Contiguous strings of assembled DNA (contigs) are in red

  7. coverage = 5 • Sequence reads are in black • Contiguous strings of assembled DNA (contigs) are in red

  8. coverage = 2, paired ends

  9. Sample prep Sequence data Raw Sequence Reads • wet-lab experimental methods to isolate, prepare, and sequence the DNA • results in a number of large FASTQ files • FASTQC can be used to check basic statistics of the files – http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ • many tools available for QC – e.g. http://hannonlab.cshl.edu/fastx_toolkit/

  10. Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcosts. Accessed April 2013.

  11. http://www.ncbi.nlm.nih.gov/Traces/sra/

  12. Genome Assembly Software • Overlap-layout-consensus • Celera: http://wgs-assembler.sourceforge.net/ • K-mer based • Velvet: http://www.ebi.ac.uk/~zerbino/velvet/ • SOAP-denovo: http://soap.genomics.org.cn/soapdenovo.html • ALLPATHS-LG: http://www.broadinstitute.org/software/allpaths-lg/blog/ • IDBA-UD: http://i.cs.hku.hk/~alse/hkubrg/projects/idba_ud/

  13. Two graph models • A first graph model – Nodes (vertices) are contiguous sequences of k characters (k-mer) – Directed edge from v i to v j if v i [2..k]=v j [1..k-1] A C G T T C ACG CGT GTT TTC

  14. Two graph models • De-bruijn graph – Nodes (vertices) are contiguous sequences of k-1 characters (k-1-mer) – Directed edge from v i to v j if v i [1..k-1]+v j [k-1] are a valid k-mer A C G T T C ACG CGT GTT TTC AC CG GT TT TC

  15. Note edges that are not reflected in the input! Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

  16. Genome Assembly • Building the k-mer graph – nodes as k-mers, edges (k-1) overlap 17

  17. Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT k=4 1 1 ACGT CGTA GACG k=3 1 1 1 ACG CGT GTA GAC

  18. Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT k=4 1 1 1 1 TACG GTAC ACGT CGTA GACG k=3 1 1 2 1 ACG CGT GTA TAC GAC 1

  19. Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT CGTT k=4 1 1 1 1 1 TACG GTAC ACGT CGTA GACG GTT k=3 1 1 2 2 1 ACG CGT GTA TAC GAC 2

  20. Genome Assembly • Building the k-mer graph – nodes as k-mers, edges (k-1) overlap – nodes as (k-1)-mers, edges form k-mers 21

  21. Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT k=4 1 1 1 ACG CGT GTA GAC k=3 1 1 1 1 AC CG GT TA GA

  22. Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT k=4 1 1 2 1 ACG CGT GTA TAC GAC 1 k=3 1 3 2 2 AC CG GT TA GA 1

  23. Reads Genome assembly Genome GACGTA CGTACG GACGTACGTT TACGTT GTT k=4 1 1 2 2 1 ACG CGT GTA TAC GAC 2 1 TT GT k=3 1 1 4 2 2 AC CG GT TA GA 2

  24. Genome Assembly • Building the k-mer graph – G(k): nodes as k-mers, edges (k-1) overlap – H(k): nodes as (k-1)-mers, edges form k-mers • H(k)=G(k-1) – So it really does not matter which you choose to implement • Where does the complexity come from? – Sequencing errors, repeats, uneven coverage, contamination from other organisms, ploidy, unsequenced regions 25

  25. Popping bubbles Error occurs in the middle of a read and is propagated to many k-mers.

  26. Trimming tips Error creates an erroneous ending k-mer

  27. Chimeric extensions Errors connect two nodes in the graph which do not correspond to a valid extension in the genome sequence Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

  28. Repetitive regions • Satellites, SINEs, LINEs • Homologous Genes – Ortholog: descended from the same ancestral sequence and separated by speciation – Paralog: genes created by a duplication event 29

  29. 30 Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

  30. Compeau et al. (2011) How to apply de Bruijn graphs to genome assembly

  31. Velvet assembler • Four stages – Hashing reads into k-mers – Constructing the de Bruijn graph (not all 4^k k- mers, only those that exist in input) – Correct errors – Resolve repeats • But what after? – Paper gives very little information on this... 32

  32. The Chinese postman problem (CPP) • Compute a closed tour of minimum length that visits each edge at least once – Similar to what we want except we may want to visit edges more than once due to repeats • How do we deal with repeats? – Also, the starting and ending vertices are distinct in genome assembly • How can we convert the closed tour to an open one? 33

  33. Your homework • You are not required to implement section 4 of http://web.eecs.umich.edu/~pettie/matching /Edmonds-Johnson-chinese-postman.pdf • You are not even required to model genome assembly as CPP • But you do have to build the k-mer graph, correct errors, resolve repeats, and compute a CPP or Eulerian-like tour. 34

  34. Evaluating assembly • The Assemblathon2 study lists 102 measures for evaluating assembly quality. • Bradnam et al. ( 2013) Assemblathon 2: evaluating de novo methods fo genome assembly in three vertebrate species 1. NG50 scaffold length: a length x where all scaffolds of length x or longer consists of at least 50% of the genome size 2. NG50 contig length: a length x where all contigs of length x or longer consists of at least 50% of the genome size 3. Amount of gene-sized scaffolds (>25 kbp). Useful for gene finding. 4. CEGMA: Number of 458 core genes mapped

  35. Evaluating assembly 5. Fosmid coverage: How many validated fosmid regions were captured in assembly 6. Fosmid validity: Percentage of assembly validated by validated fosmid regions 7. Validated fosmid region tag scaffold summary score: number of validated fosmid region tag pairs that match the same scaffold multiplied by the percentage of uniquely mapping tag pairs that map with correct distance. Rewards short-range accuracy. 8. and 9. Using local and global alignments of optimal map data, how well the assembly is ordered. 10. REAPR summary score: a tool that evalutes accuracy of assembly using paired reads

Recommend


More recommend